CN112199994B - Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time - Google Patents
- Publication number: CN112199994B
- Application number: CN202010916742.4A
- Authority: CN (China)
- Prior art keywords: hand, neural network, video, gesture, convolutional neural
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V40/20—Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
- G06F18/241—Pattern recognition; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/044—Neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/045—Neural networks; combinations of networks
- G06N3/08—Neural networks; learning methods
Abstract
The invention relates to a method and a device for detecting interaction between a 3D hand and an unknown object in RGB video in real time. The method comprises the following steps: training a convolutional neural network with video frames as input, the convolutional neural network predicting the 3D hand gesture, the 6D object gesture, the hand action and the object type of each frame of image; training an interactive recurrent neural network with the 3D hand gesture and the 6D object gesture detected by the convolutional neural network as input, the recurrent neural network using the temporal information in the video to obtain the interaction category of the hand and the object in the video; and inputting the video to be detected into the trained convolutional neural network and interactive recurrent neural network to obtain the 3D hand gesture, the 6D object gesture, the hand action, the object type and the interaction action of the hand and the object in the video. The invention needs no depth images or ground-truth object pose coordinates as input, improves the accuracy of hand action recognition, greatly broadens the recognition range, and is easier to apply in daily life.
Description
Technical Field
The invention relates to hand-object interaction recognition, and aims to detect, in real time, the motion trajectories and interaction categories of a hand and an unknown object in RGB video.
Background
In recent years, with the development of computer vision and virtual reality technology and the growing demand for smart homes, human-centered action recognition and behavior understanding have become research hotspots in the field of computer vision. Within behavior understanding, recognition of hand-object interaction is of paramount importance; it includes recognition of the hand action category and the object category, so that with the semantic information of hand-object interaction we can better understand user intent and predict the next action. Meanwhile, real-time hand shape detection and motion tracking have always been core components of sign language recognition and gesture control systems, and play an important role in many augmented reality experiences.
Currently, hand recognition can be broadly categorized into non-contact vision-based methods and contact methods based on sensor information. Sensor-based methods require the operator to wear equipment such as data gloves, and parameters must be readjusted whenever the operator changes; although the three-dimensional pose of the gesture in space can be obtained directly in real time, the inconvenience of operation makes such methods difficult to popularize in practice. In contrast, vision-based gesture recognition lets operators interact with machines in a more natural manner. In future human-machine interaction and monitoring, it is therefore highly desirable to rely on vision systems to let the machine perceive human intention, and vision-based action recognition and behavior understanding are particularly important to this end.
However, although it is critical to the semantically meaningful interpretation of visual scenes, the problem of jointly understanding hands and objects has received little attention. Much research is focused on visual understanding in which humans and objects are treated in isolation. Conventional hand-motion recognition methods segment the hand from a first-person view to recognize its gesture (G. Rogez, J. Supancic, and D. Ramanan. First-Person Pose Recognition Using Egocentric Workspaces. In CVPR, 2015.), or recognize hand poses in RGB images from first- and third-person views (U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz. Hand Pose Estimation via Latent 2.5D Heatmap Regression. In ECCV, 2018.), but these do not jointly model the objects interacting with the hands. Some methods use object interaction as an additional constraint when estimating hand motion (C. Choi, S. H. Yoon, C. Chen, and K. Ramani. Robust Hand Pose Estimation during the Interaction with an Unknown Object. In ICCV, 2017.), which improves the accuracy of hand motion recognition but relies on depth images as input. There are methods for pose reconstruction of the hand and object (Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid. Learning Joint Reconstruction of Hands and Manipulated Objects. In CVPR, 2019.), but they learn no semantic information. There are also methods that identify hand-object interactions (B. Tekin, F. Bogo, and M. Pollefeys. H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. In CVPR, 2019.), but they can only identify objects known in the dataset and lack generalization.
Although existing methods can analyze the semantic information of hand-object interaction, the recognizable object types are limited by the hand datasets: the object categories interacting with hands in existing hand-motion datasets are very limited, and labeling new data consumes a great deal of manpower and material resources. It is therefore of practical value to propose a method that can recognize the interaction of a hand with an unknown object from RGB video.
Disclosure of Invention
The invention aims to provide a method and a device capable of detecting the spatial gesture and interaction type of a 3D hand and an unknown object in real time according to RGB video.
The inventor finds that many methods in the prior art solve for the pose of the hand or the object in isolation: gesture recognition methods can only recognize the shape of the hand and some simple gestures (such as a thumbs-up or a victory sign) and cannot recognize the interaction relationship with an object; some methods that reconstruct hand and object poses restore object edges well but do not analyze the semantic information of the scene; some action-recognition methods must rely on depth images as input, otherwise their accuracy is very low; and some object-pose estimation methods do not regress the 6D pose directly but first generate a 2D box and then compute the 6D pose with a PnP algorithm, losing part of the information. The invention solves these problems and completes multiple tasks at once: it is an end-to-end method that, from RGB video input, simultaneously predicts the 3D hand and object poses, actions and category estimates, needs no depth image or ground-truth object pose coordinates as input, and improves the accuracy of hand action recognition.
As shown in fig. 1, the invention mainly comprises a convolutional neural network (CNN) and an interactive recurrent neural network (interactive RNN). The convolutional neural network identifies, for each frame of image, the 3D hand gesture, the 6D object gesture (the 3D position and 3D orientation of the object), the hand action (pour, open, close, etc.) and the object type (milk, detergent, juice box, etc.); the recurrent neural network extracts and integrates the temporal features in the video to obtain the interaction category of the hand and object over the whole video (pour milk, open juice box, etc.). The method of the present invention is divided into a training process and a use process. In the training stage, training proceeds in two steps: first, with video frames as input, the convolutional neural network is trained to predict the 3D hand gesture, 6D object gesture, hand action and object type of each frame of image, and its parameters are fixed after training; then the recurrent neural network is trained, taking the detected hand and object key-point coordinates as input and outputting the interaction-category estimate of the hand and object over the whole video. In the use stage, the complete model takes a series of video frames as input and, after the two neural networks, outputs the 3D hand gesture and object gesture prediction of each frame together with the object and action category estimate for the whole frame sequence.
The technical scheme adopted by the invention mainly comprises the following steps (if no special description exists, the following steps are executed by software and hardware of a computer and electronic equipment):
(1) Model building and training. When the model is used for the first time, a user firstly needs to train the convolutional neural network and the interactive cyclic neural network, and then can use the trained model to conduct action recognition.
(2) Video input. The model can detect the 3D position (namely 3D gesture) of the hand of each frame of image in the video, the 6D gesture of the object, the action of the hand, the object type estimation and the interaction action of the hand and the object in the whole video in real time.
Further, in the detailed design of the model, as shown in fig. 2, 21 key points are specified for each of the hand and the object (the key points of the hand are the four joints of each finger plus the wrist node; the key points of the object are the eight vertices, the center point, and the midpoints of the 12 sides of the object bounding box), and the pose (i.e., the 3D hand pose and the 6D object pose) is determined by predicting the coordinates of the key points.
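The 21 object key points described above (eight box vertices, the center, and the 12 edge midpoints) can be derived from the bounding box alone. A minimal sketch, assuming for illustration that the eight corners are ordered by the binary pattern of their (x, y, z) axis bits (this ordering convention is ours, not the patent's):

```python
import numpy as np

def object_keypoints_from_box(corners):
    """Given the 8 corners of a 3D bounding box (8x3 array), return the
    21 object key points: 8 vertices, the center point, and the midpoints
    of the 12 edges."""
    corners = np.asarray(corners, dtype=float)
    assert corners.shape == (8, 3)
    center = corners.mean(axis=0)
    # With corners indexed by the binary pattern (x-bit, y-bit, z-bit),
    # two corners share an edge exactly when their indices differ in one bit.
    edges = [(i, j) for i in range(8) for j in range(i + 1, 8)
             if bin(i ^ j).count("1") == 1]
    midpoints = np.array([(corners[i] + corners[j]) / 2 for i, j in edges])
    return np.vstack([corners, center[None, :], midpoints])  # shape (21, 3)
```

The same 21-point count for hand and object lets one network head share the key-point layout for both.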
Further, the method for predicting the coordinates of the key points and predicting the hand actions and the object categories by the convolutional neural network comprises the following steps:
as shown in fig. 3 and 4, each picture frame is divided into h×w grids, and D grids (H, W, D respectively representing height, width, and depth) are extended to depth, each grid having a size of C in units of pixels (pixels) in a plane and meters (meters) in a depth direction u ×C v Pixel×C z And (5) rice. In this grid coordinate system, the upper left corner of the grid is taken as the origin of the coordinate system, and a grid is taken as a unit.
To enable simultaneous joint prediction of hand and object pose and category, as in FIG. 4, two vectors are stored in each cell (i.e., grid): one predicting the features of the hand and one predicting those of the object. Each vector contains the coordinates of the key points of the hand or object ($N_c$ is the number of key points of the hand or object), a confidence value, the action-category probabilities ($N_a$ is the number of action categories), and the object-category probabilities ($N_o$ is the number of object categories plus a background class: if the object is unknown, it is classified as background and then passed to the zero-shot learning classifier to identify the unknown object). The cells containing the wrist node and the object center point are used to predict the action and object categories. The two vectors stored in each cell are produced by the convolutional neural network. The method first determines the coordinates $(u, v, z)$ of the cell in which a key point lies, then predicts the offsets $\Delta u, \Delta v, \Delta z$ of the key point relative to the upper-left corner of that cell in the three dimensions, so that the coordinates of the key point in the grid coordinate system are:

$$\hat{u} = u + \Delta u,\qquad \hat{v} = v + \Delta v,\qquad \hat{z} = z + \Delta z$$
Since the cells containing the wrist node and the object center point are responsible for predicting the action and object categories, $g(x)$ is used to constrain the offsets of these two points to $[0, 1]$, so that the cell responsible for predicting the action and object categories is uniquely determined:

$$g(x) = \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

where $x$ denotes an offset $\Delta u$, $\Delta v$ or $\Delta z$ of the key point relative to the upper-left corner of its cell, and the sigmoid activation maps any real number into the interval $(0, 1)$; this ensures that the wrist node and the object center point remain inside their cells after the offset is applied, so those cells can predict the action and object categories.
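The offset decoding just described can be sketched as follows (function and argument names are ours, introduced for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_keypoint(cell_uvz, offsets, constrain=False):
    """Decode a key point's grid coordinates from its cell index (u, v, z)
    and the predicted offsets (du, dv, dz).  When `constrain` is True
    (wrist node and object center point), the offsets are squashed into
    (0, 1) with a sigmoid so the point stays inside its cell."""
    cell = np.asarray(cell_uvz, dtype=float)
    off = np.asarray(offsets, dtype=float)
    if constrain:
        off = sigmoid(off)
    return cell + off
```

For example, a key point in cell (3, 4, 1) with offsets (0.2, 0.5, 0.1) decodes to grid coordinates (3.2, 4.5, 1.1); with `constrain=True`, even very large raw offsets cannot move the wrist or object-center point out of its cell.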
In addition, given the three-dimensional position in the grid coordinate system and the camera intrinsic matrix $K$, the three-dimensional coordinates of a key point in the camera coordinate system can be computed. With pixel coordinates $(u_{pix}, v_{pix}) = (\hat{u}\,C_u,\ \hat{v}\,C_v)$ and depth $z = \hat{z}\,C_z$:

$$(x,\ y,\ z)^T = z\,K^{-1}\,(u_{pix},\ v_{pix},\ 1)^T$$
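The back-projection from grid coordinates to camera coordinates can be sketched as below; the cell sizes and intrinsic matrix here are illustrative placeholders, not values from the patent:

```python
import numpy as np

def grid_to_camera(grid_uvz, cell_px=(32.0, 32.0), cell_m=0.15,
                   K=np.array([[600., 0., 320.],
                               [0., 600., 240.],
                               [0., 0., 1.]])):
    """Back-project a key point from grid coordinates (u, v, z) to 3D
    camera coordinates: scale to pixel coordinates and metric depth,
    then apply the inverse intrinsics."""
    u, v, z = grid_uvz
    pix = np.array([u * cell_px[0], v * cell_px[1], 1.0])  # homogeneous pixel coords
    depth = z * cell_m                                     # depth in meters
    return depth * np.linalg.inv(K) @ pix
```

A point whose pixel coordinates coincide with the principal point maps onto the optical axis, as expected for a pinhole model.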
further, a higher confidence is set for the grid in which the adversary or object is present, and a confidence function is set as:
wherein D is T (x) Is the Euclidean distance between the predicted point and the real point, alpha represents the super-parameter, d th Represents a set threshold value, D when the predicted value is closer to the true value T (x) The smaller c (x) is, the larger the confidence is, and conversely, the smaller the confidence is. The total confidence is:
wherein:
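A sketch of a confidence computation consistent with the description above; the exponential form and the hyperparameter values used here are assumptions for illustration:

```python
import numpy as np

def keypoint_confidence(dist, alpha=2.0, d_th=75.0):
    """Confidence for one key point: large when the predicted point is
    close to the ground truth, zero at or beyond the threshold d_th.
    alpha (sharpness) and d_th are hyperparameters; values are placeholders."""
    dist = np.asarray(dist, dtype=float)
    conf = np.exp(alpha * (1.0 - dist / d_th))
    return np.where(dist < d_th, conf, 0.0)

def total_confidence(pred, gt, alpha=2.0, d_th=75.0):
    """Average the per-key-point confidences over all key points."""
    dists = np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=-1)
    return keypoint_confidence(dists, alpha, d_th).mean()
```

The confidence decays monotonically with the prediction error and is cut off at the threshold, so cells with no nearby ground-truth key point contribute nothing.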
further, when the probability of the background class of the object is maximum, it is determined that the object belongs to an unknown class. As shown in fig. 6, unknown object categories are identified by introducing semantic information using a zero-order learning classifier module. The zero-order learning classifier module multiplies the probabilities of other prediction classes except the background by vectors in a semantic space respectively, adds the obtained semantic vectors to be used as final prediction semantic vectors, calculates the class and the similarity thereof in the semantic space, and considers that the unknown object belongs to the class with the highest similarity when the highest similarity value is not lower than a threshold value.
Further, the total loss function of the convolutional neural network of the invention is a weighted sum of four terms:

$$\mathcal{L} = \lambda_{pose}\,\mathcal{L}_{pose} + \lambda_{conf}\,\mathcal{L}_{conf} + \lambda_{actcls}\,\mathcal{L}_{actcls} + \lambda_{objcls}\,\mathcal{L}_{objcls}$$

where $\lambda_{pose}$ weights the loss on the predicted hand and object positions, $\lambda_{conf}$ the confidence loss, $\lambda_{actcls}$ the loss on the predicted action class, and $\lambda_{objcls}$ the loss on the predicted object class. The losses are accumulated over $G_t$, the regular fixed grid into which each picture is divided, and involve the predicted hand coordinates, the predicted object coordinates, the predicted confidences of the hand-action and object classes, and the predicted object- and action-class probabilities.
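A minimal sketch of combining the four loss terms, assuming squared error for positions and confidences and cross-entropy for the class terms (the specific per-term losses and the weight values are our assumptions):

```python
import numpy as np

def total_loss(pred, target, lam_pose=1.0, lam_conf=5.0,
               lam_act=1.0, lam_obj=1.0):
    """Weighted sum of the four loss terms.  `pred`/`target` are dicts with
    keys 'kp' (key-point coordinates), 'conf' (confidences), 'act' and
    'obj' (predicted class probabilities / true class index).
    Weights are illustrative placeholders."""
    mse = lambda a, b: np.mean((np.asarray(a) - np.asarray(b)) ** 2)
    xent = lambda p, y: -np.log(np.asarray(p)[int(y)] + 1e-12)
    return (lam_pose * mse(pred['kp'], target['kp'])
            + lam_conf * mse(pred['conf'], target['conf'])
            + lam_act * xent(pred['act'], target['act'])
            + lam_obj * xent(pred['obj'], target['obj']))
```

With perfect position and confidence predictions, only the classification terms remain.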
Further, since the convolutional network only learns per-frame information and does not exploit the temporal information in the video, the invention adds an interactive recurrent neural network, as shown in figure 5. The key-point coordinate vector $x_t$ of the hand and object computed by the convolutional network for frame $t$ is fed into a multi-layer perceptron to model their relationship, and the result is then used as the input of the recurrent neural network:

$$h_t = f_\phi\left(g_\theta(x_t),\ h_{t-1}\right)$$

where $f_\phi$ is the recurrent neural network model and $g_\theta$ is the multi-layer perceptron model; the final output is the interaction category of the hand and object in the video.
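The MLP-plus-recurrent-network composition can be sketched in numpy as below. A vanilla recurrent cell stands in for the network used in the patent, and all shapes, the random initialization, and the softmax readout are our assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    # g_theta: one-hidden-layer perceptron applied to a frame's key points
    return np.tanh(np.tanh(x @ W1 + b1) @ W2 + b2)

def interactive_rnn(frames, params):
    """f_phi over g_theta(x_t): run a recurrent cell over the per-frame
    features and read interaction-class probabilities from the final
    hidden state."""
    W1, b1, W2, b2, Wx, Wh, bh, Wy, by = params
    h = np.zeros(Wh.shape[0])
    for x in frames:                       # x: key-point vector of one frame
        z = mlp(x, W1, b1, W2, b2)         # per-frame hand-object feature
        h = np.tanh(z @ Wx + h @ Wh + bh)  # recurrent state update
    logits = h @ Wy + by
    e = np.exp(logits - logits.max())
    return e / e.sum()                     # interaction-class probabilities

def init_params(d_in=126, d_mlp=64, d_rnn=32, n_classes=45):
    # d_in = 2 (hand + object) x 21 key points x 3 coordinates = 126
    s = lambda *shape: 0.1 * rng.standard_normal(shape)
    return (s(d_in, d_mlp), s(d_mlp), s(d_mlp, d_mlp), s(d_mlp),
            s(d_mlp, d_rnn), s(d_rnn, d_rnn), s(d_rnn),
            s(d_rnn, n_classes), s(n_classes))
```

Because the recurrent state is carried across frames, the classifier can use the whole trajectory rather than any single pose.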
Based on the same inventive concept, the invention also provides a device that uses the above method to detect interaction between a 3D hand and an unknown object in RGB video in real time, comprising:
the model training module is used for taking the video frames as input and training a convolutional neural network, and the convolutional neural network predicts the 3D hand gesture, the 6D object gesture, the hand actions and the object types of each frame of image; the 3D hand gesture and the 6D object gesture detected by the convolutional neural network are used as input to train an interactive cyclic neural network, and the cyclic neural network utilizes time sequence information in the video to obtain the interactive category of the hand and the object in the video;
the real-time detection module is used for inputting the video to be detected into the convolutional neural network and the interactive cyclic neural network after training is completed, and obtaining the 3D hand gesture, the 6D object gesture, the hand action, the object type and the interaction action of the hand and the object in the video of each frame of image in the video.
The method for identifying the interaction between the 3D hand and the object in the RGB video greatly improves the practicability, and specifically comprises the following steps:
(1) The method does not need to rely on the depth image shot by the RGB-D camera, and can detect the hand-object interaction in the RGB video by inputting a series of frames, so that the application range in life is greatly increased.
(2) The method can detect the position track of the hand and the object and estimate the action category and the object category at the same time at real-time speed, and can be applied to abnormal behavior detection.
(3) The method can detect unknown object types not present in the training set, which greatly broadens the recognition range, improves generalization, and makes the method more convenient to apply in daily life.
Drawings
FIG. 1 is a flow chart of the method for recognizing 3D hand-object interaction based on RGB video; wherein I_1 ~ I_N represent N video frames, CNN is the convolutional neural network, and RNN is the recurrent neural network.
FIG. 2 is a schematic diagram of the hand and object key points; (a) illustrates the 21 key points of the hand and (b) the 21 key points of the object. In (a), P, R, M, I, T denote the 5 fingers (pinky, ring, middle, index, thumb), TIP denotes the fingertips, DIP the distal interphalangeal joints, PIP the proximal interphalangeal joints, MCP the metacarpophalangeal joints, and Wrist the wrist.
FIG. 3 is a diagram of a grid coordinate system of an input image;
FIG. 4 is a schematic diagram of hand and object positions and their cell stored vectors in a grid coordinate system;
FIG. 5 is a schematic diagram of the interactive recurrent network in the model, where x_1 ~ x_N represent the inputs of the interactive recurrent neural network.
Fig. 6 is a schematic diagram of the zero-shot learning classifier module in the model.
Detailed Description
The process according to the invention is further described below with reference to the accompanying drawings and specific examples.
The hand motion recognition method does not rely on an external detection algorithm and only needs end-to-end training on single images. After a single RGB image is input and one forward pass through the neural network is performed, the poses of the 3D hand and the object can be estimated jointly, their three-dimensional interaction is modeled, and the object and action categories are identified; when the object is identified as the background class, the unknown category of the object is predicted by the zero-shot learning classifier module, which computes and searches for the closest class in the semantic space. The pose information of the hand and object is then further combined and propagated in the time domain to infer the interaction between the hand and object trajectories and to identify the action. The method takes a series of frames as input and outputs the 3D hand and object pose prediction of each frame together with the object- and action-category estimate of the whole sequence.
Fig. 1 is a schematic flow chart of a method for identifying interaction between a 3D hand and an object based on RGB video, the method mainly comprises the following steps:
(1) Model training. Training is divided into two parts: first the convolutional neural network is trained, then it is fixed and the interactive recurrent neural network is trained. The convolutional neural network is a YOLO-based architecture with 31 layers in total; except for the last layer, which is the prediction layer, the layers are convolutional or pooling layers, and the network outputs a tensor of size $H \times W \times D \times 2\,(3 N_c + 1 + N_a + N_o)$, corresponding to the hand and object vectors contained in each cell of the grid. In the method of the present embodiment, H = W = 13 and D = 5, and the size of the input picture is 416×416. After the convolutional network is trained, the key-point vectors of the hand and object obtained by passing each frame of image through it are learned by a multi-layer perceptron with one hidden layer, the interaction relationship is then fed into a recurrent neural network with two hidden layers, and finally the interaction-category estimate is output. The training dataset of this embodiment is the First-Person Hand Action (FPHA) dataset, a publicly available 3D hand-object interaction recognition dataset that contains labels for 3D hand poses, 6D object poses and action categories. FPHA contains videos of 6 actors in 45 different activity categories, with subjects performing complex motions corresponding to daily human activities. A subset of the dataset contains annotations of the 6D pose of the object, with corresponding mesh models of 4 objects involved in 10 different action categories. The dataset is split by the object category interacting with the hand into a training set and a test set, where the test set contains object categories not present in the training set (unknown categories).
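The output-tensor layout described in this embodiment can be sketched as follows; the values of N_a and N_o are illustrative placeholders, and the per-cell split into a hand vector and an object vector follows the description of fig. 4:

```python
import numpy as np

# Illustrative dimensions: H = W = 13, D = 5 as in this embodiment;
# N_c = 21 key points; N_a and N_o are placeholder class counts.
H, W, D, N_c, N_a, N_o = 13, 13, 5, 21, 45, 5
vec = 3 * N_c + 1 + N_a + N_o           # channels per stored vector

out = np.zeros((H, W, D, 2 * vec))      # raw predictor output
out = out.reshape(H, W, D, 2, vec)      # split into the two stored vectors
hand, obj = out[..., 0, :], out[..., 1, :]

kp   = hand[..., :3 * N_c]              # du, dv, dz for the 21 key points
conf = hand[..., 3 * N_c]               # confidence value
cls  = hand[..., 3 * N_c + 1:]          # class probabilities
```

Slicing the object vector works identically, so decoding a cell is one reshape plus fixed index ranges.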
(2) Detection stage. A series of video frames is input into the model, which estimates the 3D poses of the hand and object in each frame of image and the interaction category of the hand and object over the whole sequence. When the predicted object class is the background class, the unknown class of the object is predicted by the zero-shot learning classifier.
Fig. 2 is a schematic diagram of the key points of the hand and object; 21 key points are taken for each so that the calculation is unified. The key points of the hand are the four joints of each finger plus the wrist node. The key points of the object are the eight vertices, the center point, and the midpoints of the 12 sides of its bounding box. The cells containing the wrist node and the object center point are used to predict the object and action categories.
Fig. 3 is a diagram of a grid coordinate system of an input image, assuming that the upper left corner of the grid is the origin of coordinates, each grid is a unit, and the grid coordinates are the number of grids offset from the upper left corner.
Fig. 4 is a schematic diagram of vectors stored in a grid coordinate system for hand and object positions and cells where the hand and object positions are located, and whether an object exists in a cell is determined by whether a key point falls into the cell or not.
FIG. 5 is a schematic diagram of the interactive recurrent network in the model: each frame of image is first passed through the convolutional network to obtain the key-point vectors of the hand and object, a multi-layer perceptron models their relationship, the resulting vectors pass through a recurrent neural network with two hidden layers to learn the temporal information in the video, and finally the interaction-category estimate is output.
FIG. 6 is a schematic diagram of the zero-shot learning classifier module in the model. When the probability of the background class is the largest, the object is determined to belong to an unknown class. The probabilities of the predicted classes other than the background are multiplied by their vectors in a semantic space, the resulting semantic vectors are summed into a final predicted semantic vector, the similarities to the candidate classes in the semantic space are computed, and when the highest similarity is not below a threshold, the unknown object is assigned to the class with the highest similarity.
Even in complex real scenes, the method can effectively identify, in real time from RGB video, the trajectories, categories and interaction actions of hands and unknown objects, capturing the semantic and temporal information of the sequence. It greatly improves the efficiency of action recognition and solves the problem that traditional gesture recognition cannot identify the semantic information of interaction with objects. It can also identify unknown objects interacting with the hand without depth images or ground-truth object coordinates as input, providing a good theoretical basis for wide application.
The modules of the above embodiments (the convolutional neural network, the zero-shot learning classifier and the recurrent neural network) may be combined arbitrarily. For brevity of description, not all possible combinations of the modules are described; however, as long as a combination of modules contains no contradiction, it should be considered within the scope of this specification.
Based on the same inventive concept, another embodiment of the present invention provides an apparatus for detecting the interaction of a 3D hand with an unknown object in RGB video in real time using the above method, comprising:
a model training module for training a convolutional neural network with video frames as input, the convolutional neural network predicting the 3D hand pose, 6D object pose, hand action, and object category of each frame image, and for training an interactive recurrent neural network with the 3D hand pose and 6D object pose detected by the convolutional neural network as input, the recurrent neural network using temporal information in the video to obtain the interaction category of the hand and object in the video;
a real-time detection module for inputting the video to be detected into the trained convolutional neural network and interactive recurrent neural network, obtaining the 3D hand pose, 6D object pose, hand action, and object category of each frame image and the interaction action of the hand and object in the video.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the inventive method.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.
Claims (8)
1. A method for detecting the interaction of a 3D hand with an unknown object in RGB video in real time, comprising the steps of:
training a convolutional neural network with video frames as input, wherein the convolutional neural network predicts the 3D hand pose, 6D object pose, hand action, and object category of each frame image;
training an interactive recurrent neural network with the 3D hand pose and 6D object pose detected by the convolutional neural network as input, wherein the recurrent neural network uses temporal information in the video to obtain the interaction category of the hand and object in the video;
inputting the video to be detected into the trained convolutional neural network and interactive recurrent neural network to obtain the 3D hand pose, 6D object pose, hand action, and object category of each frame image and the interaction action of the hand and object in the video;
wherein the convolutional neural network predicts keypoint coordinates and predicts hand actions and object categories by the following steps:
dividing each picture frame into H × W grids and extending D grids in depth, wherein the grid is in pixel units in the image plane and in meters in the depth direction, i.e., each cell has size C_u × C_v pixels × C_z meters; in the grid coordinate system, the upper left corner of the grid is the origin of the coordinate system, and one cell is one unit;
storing two vectors in each cell to predict hand and object features respectively: the hand vector holds the coordinates of the N_c hand keypoints, the action-category probabilities over the N_a action categories, and a confidence; the object vector holds the coordinates of the N_c object keypoints, the object-category probabilities over the N_o object categories, and a confidence; the cells in which the wrist node and the object center point lie are responsible for predicting the action and object categories; both vectors stored in each cell are produced by the convolutional neural network;
determining the coordinates (u, v, z) of the cell in which a keypoint lies, and predicting the offsets Δu, Δv, Δz of the keypoint relative to the upper left corner of that cell in the three dimensions; the coordinates of the keypoint in the grid coordinate system are then obtained as (u + g(Δu), v + g(Δv), z + g(Δz));
wherein, because the cells in which the wrist node and the object center point lie are responsible for predicting the action and object categories, g(x) is used to constrain the offsets of these two points to between 0 and 1, thereby determining the cell responsible for predicting the action and object category; the expression of g(x) is as follows:
wherein g(x) denotes the function constraining the offsets of the wrist node and the object center point, taking g(x) = sigmoid(x) = 1/(1 + e^(−x)); x denotes an offset Δu, Δv, or Δz of the keypoint relative to the upper left corner of its cell, and the sigmoid activation function has value range (0, 1), mapping any real number into the interval (0, 1);
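The offset-to-coordinate recovery described above can be sketched as follows. The function names are illustrative assumptions; the identity mapping for keypoints other than the wrist node and object center point follows from the claim's statement that only those two points are constrained by the sigmoid:

```python
import math

def sigmoid(x):
    # Maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def keypoint_coords(cell, raw_offsets, constrain=False):
    """Recover grid-frame keypoint coordinates from the cell index (u, v, z)
    and the network's raw offsets (du, dv, dz).

    If `constrain` is True (wrist node / object center point), each offset is
    squashed into (0, 1) by the sigmoid g(x), so the prediction stays inside
    the cell responsible for action and object-category classification."""
    g = sigmoid if constrain else (lambda t: t)
    return tuple(c + g(d) for c, d in zip(cell, raw_offsets))
```

With zero raw offsets and `constrain=True`, the predicted point lands at the center of the cell, since sigmoid(0) = 0.5.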
the interactive recurrent neural network takes the hand and object keypoint coordinate vectors produced by the convolutional neural network as input, models their interaction relationship with a multi-layer perceptron whose output is fed to the recurrent neural network, and finally outputs the interaction category of the hand and object in the video.
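A minimal sketch of this interaction network (a per-frame multi-layer perceptron followed by a recurrent network with two hidden layers, per the claim). All layer sizes, the tanh nonlinearity, and the plain Elman-style recurrence are illustrative assumptions; the patent does not fix the architecture details:

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    # Two-layer perceptron modeling the hand-object relation for one frame
    h = np.tanh(x @ W1 + b1)
    return np.tanh(h @ W2 + b2)

def rnn_classify(frames, params):
    """frames: array of shape (T, F), the concatenated hand/object keypoint
    vectors per frame. Runs the relation MLP per frame, accumulates temporal
    information through two recurrent hidden layers, and returns softmax
    interaction-class probabilities from the final step."""
    W1, b1, W2, b2, Wx1, Wh1, Wx2, Wh2, Wo = params
    h1 = np.zeros(Wh1.shape[0])
    h2 = np.zeros(Wh2.shape[0])
    for x in frames:
        r = mlp(x, W1, b1, W2, b2)        # per-frame relation features
        h1 = np.tanh(r @ Wx1 + h1 @ Wh1)  # first recurrent hidden layer
        h2 = np.tanh(h1 @ Wx2 + h2 @ Wh2) # second recurrent hidden layer
    logits = h2 @ Wo
    e = np.exp(logits - logits.max())     # numerically stable softmax
    return e / e.sum()
```

The output is a probability distribution over interaction categories for the whole clip, matching the claim's "finally outputs the interaction category".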
2. The method of claim 1, wherein 21 keypoints are specified for the hand and for the object, and the convolutional neural network determines the 3D hand pose and 6D object pose by predicting the coordinates of the keypoints, wherein the keypoints of the hand are the four joints of each finger and the wrist node, and the keypoints of the object are the eight vertices, the center point, and the midpoints of the 12 edges of the object bounding box.
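The 21 object keypoints of claim 2 (8 vertices + 1 center + 12 edge midpoints) can be derived from a bounding box as follows; a sketch assuming an axis-aligned min/max-corner box representation (the patent itself allows arbitrarily oriented boxes):

```python
import itertools
import numpy as np

def box_keypoints(bmin, bmax):
    """Return the 21 keypoints of an axis-aligned 3D bounding box:
    its 8 vertices, its center, and the midpoints of its 12 edges."""
    # The 8 vertices: every combination of min/max per axis
    corners = np.array(list(itertools.product(*zip(bmin, bmax))), dtype=float)
    center = (np.asarray(bmin, float) + np.asarray(bmax, float)) / 2.0
    # An edge joins two vertices that differ in exactly one coordinate
    mids = [(a + b) / 2.0
            for i, a in enumerate(corners) for b in corners[i + 1:]
            if np.sum(a != b) == 1]
    return np.vstack([corners, center[None, :], np.array(mids)])
```

Counting 8 vertices, 1 center, and 12 edge midpoints gives exactly the 21 keypoints specified in the claim.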
3. The method of claim 1, wherein the cell in which a hand or object lies is given a higher confidence, the confidence function being set as:
wherein D_T(x) is the Euclidean distance between the predicted point and the ground-truth point, α denotes a hyperparameter, and d_th denotes a set threshold; the total confidence is:
wherein:
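The confidence equations of claim 3 are rendered as images in the original publication and are not reproduced here. One plausible form consistent with the stated properties (maximal when the predicted point coincides with the ground truth, decaying with D_T(x) under the hyperparameter α, and zero beyond the threshold d_th) is sketched below; the exponential form and the default values are assumptions, not taken from the patent text:

```python
import math

def confidence(dist, alpha=2.0, d_th=75.0):
    """Distance-decayed confidence: 1 when the predicted point matches the
    ground-truth point, decaying exponentially with distance, and exactly 0
    at or beyond the threshold d_th. Form and defaults are illustrative."""
    if dist >= d_th:
        return 0.0
    return (math.exp(alpha * (1.0 - dist / d_th)) - 1.0) / (math.exp(alpha) - 1.0)
```

This normalization keeps the confidence within [0, 1] for any positive α.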
4. The method of claim 1, wherein, when the probability of the background class of an object is largest, the object is determined to belong to an unknown class, and the unknown class is identified by a zero-shot learning classifier that introduces semantic information; the zero-shot learning classifier multiplies the probabilities of the predicted classes other than the background by their vectors in a semantic space, sums the resulting semantic vectors as the final predicted semantic vector, computes the similarity to each class in the semantic space, and, when the highest similarity is not below a threshold, considers the unknown object to belong to the class with the highest similarity.
5. The method of claim 1, wherein the convolutional neural network has a total loss function of:
wherein λ_pose denotes the loss-weight parameter for predicting the hand and object positions, λ_conf the loss-weight parameter for the confidence, λ_act the loss-weight parameter for the predicted action class, λ_obj the loss-weight parameter for the predicted object class, and G_t the regular fixed grid into which the picture is divided; the remaining symbols denote the predicted hand coordinates, the predicted object coordinates, the confidence of the predicted hand action class, the confidence of the predicted object class, the predicted object-class probability, and the predicted action-class probability, respectively.
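The total loss of claim 5 is a weighted sum of the four terms named above; a minimal sketch (the per-term loss computations and any weight values are outside the claim and therefore not shown):

```python
def total_loss(losses, lambdas):
    """Weighted sum of the four loss terms of claim 5: pose (hand and object
    coordinate regression), confidence, action classification, and object
    classification. The lambda weights are hyperparameters; their values are
    not fixed by the patent."""
    return sum(lambdas[k] * losses[k] for k in ("pose", "conf", "act", "obj"))
```

During training this scalar would be minimized jointly over all cells of the grid G_t.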
6. An apparatus for real-time detection of the interaction of a 3D hand with an unknown object in RGB video using the method of any one of claims 1-5, comprising:
a model training module for training a convolutional neural network with video frames as input, the convolutional neural network predicting the 3D hand pose, 6D object pose, hand action, and object category of each frame image, and for training an interactive recurrent neural network with the 3D hand pose and 6D object pose detected by the convolutional neural network as input, the recurrent neural network using temporal information in the video to obtain the interaction category of the hand and object in the video;
a real-time detection module for inputting the video to be detected into the trained convolutional neural network and interactive recurrent neural network, obtaining the 3D hand pose, 6D object pose, hand action, and object category of each frame image and the interaction action of the hand and object in the video.
7. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-5.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010916742.4A CN112199994B (en) | 2020-09-03 | 2020-09-03 | Method and device for detecting interaction of3D hand and unknown object in RGB video in real time |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112199994A CN112199994A (en) | 2021-01-08 |
CN112199994B true CN112199994B (en) | 2023-05-12 |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112720504B (en) * | 2021-01-20 | 2023-03-28 | 清华大学 | Method and device for controlling learning of hand and object interactive motion from RGBD video |
CN112949501B (en) * | 2021-03-03 | 2023-12-08 | 安徽省科亿信息科技有限公司 | Method for learning availability of object from teaching video |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107168527B (en) * | 2017-04-25 | 2019-10-18 | 华南理工大学 | The first visual angle gesture identification and exchange method based on region convolutional neural networks |
CN107590432A (en) * | 2017-07-27 | 2018-01-16 | 北京联合大学 | A kind of gesture identification method based on circulating three-dimensional convolutional neural networks |
EP3467707B1 (en) * | 2017-10-07 | 2024-03-13 | Tata Consultancy Services Limited | System and method for deep learning based hand gesture recognition in first person view |
CN111104820A (en) * | 2018-10-25 | 2020-05-05 | 中车株洲电力机车研究所有限公司 | Gesture recognition method based on deep learning |
CN109919078B (en) * | 2019-03-05 | 2024-08-09 | 腾讯科技(深圳)有限公司 | Video sequence selection method, model training method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20230512 ||