CN112199994B - Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time


Info

Publication number
CN112199994B
Authority
CN
China
Prior art keywords
hand
neural network
video
gesture
convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202010916742.4A
Other languages
Chinese (zh)
Other versions
CN112199994A (en)
Inventor
薛聪
吴彦坤
向继
查达仁
王雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202010916742.4A priority Critical patent/CN112199994B/en
Publication of CN112199994A publication Critical patent/CN112199994A/en
Application granted granted Critical
Publication of CN112199994B publication Critical patent/CN112199994B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a method and a device for detecting the interaction between a 3D hand and an unknown object in RGB video in real time. The method comprises the following steps: training a convolutional neural network that takes video frames as input and predicts the 3D hand pose, 6D object pose, hand action and object category of each frame; training an interaction recurrent neural network that takes the 3D hand poses and 6D object poses detected by the convolutional neural network as input and uses the temporal information in the video to obtain the interaction category of the hand and the object; and feeding the video to be detected into the trained convolutional neural network and interaction recurrent neural network to obtain the 3D hand pose, 6D object pose, hand action and object category of each frame, as well as the interaction of the hand and the object in the video. The invention requires neither depth images nor ground-truth object pose coordinates as input, improves the accuracy of hand action recognition, greatly broadens the recognition range, and is more convenient to apply in daily life.

Description

Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time
Technical Field
The invention relates to hand-object interaction recognition, and aims to detect the motion trajectories and interaction categories of a hand and an unknown object in RGB video in real time.
Background
In recent years, with the development of computer vision and virtual reality technology and the growing demand for smart homes, human-centered action recognition and behavior understanding have become research hotspots in computer vision. In behavior understanding, recognizing hand-object interaction is of particular importance; it includes recognizing both the hand action category and the object category, and the resulting semantic information allows us to better understand user intent and predict the next action. Meanwhile, real-time hand shape detection and motion tracking have always been core components of sign language recognition and gesture control systems, and play an important role in some augmented reality experiences.
Currently, hand recognition methods can be broadly divided into non-contact, vision-based approaches and contact, sensor-based approaches. Sensor-based methods require the operator to wear devices such as data gloves, and their parameters must be recalibrated whenever the operator changes; although they can directly obtain the three-dimensional pose of the hand in space in real time, their inconvenience makes them difficult to popularize in practice. In contrast, vision-based gesture recognition allows operators to interact with machines in a more natural manner. In future human-machine interaction and monitoring it is therefore highly desirable to rely on vision systems to let machines perceive human intention, for which vision-based action recognition and behavior understanding are particularly important.
However, although it is critical to a semantically meaningful interpretation of visual scenes, the problem of jointly understanding hands and objects has received little attention. Much current research treats humans and objects in isolation. Conventional hand motion recognition methods segment the hand from a first-person view to recognize its pose (G. Rogez, J. Supancic, and D. Ramanan. First-Person Pose Recognition Using Egocentric Workspaces. In CVPR, 2015) or estimate hand pose in RGB images from first- and third-person views (U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz. Hand Pose Estimation via Latent 2.5D Heatmap Regression. In ECCV, 2018), but they do not jointly model the objects the hand interacts with. Some methods use object interaction as an additional constraint when estimating hand pose (C. Choi, S. H. Yoon, C. Chen, and K. Ramani. Robust Hand Pose Estimation during the Interaction with an Unknown Object. In ICCV, 2017), which improves the accuracy of hand pose estimation but relies on depth images as input. Other methods reconstruct the poses of the hand and the object (Y. Hasson, G. Varol, D. Tzionas, I. Kalevatykh, M. J. Black, I. Laptev, and C. Schmid. Learning Joint Reconstruction of Hands and Manipulated Objects. In CVPR, 2019), but learn no semantic information. There are also methods that can recognize hand-object interactions (B. Tekin, F. Bogo, and M. Pollefeys. H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions. In CVPR, 2019), but they can only recognize objects known from the dataset and lack generalization.
Although existing methods can analyze the semantic information of hand-object interaction, the recognizable object categories are limited by hand datasets: the object categories involved in existing hand-motion datasets are very limited, and annotating new data costs considerable manpower and material resources. It is therefore of practical value to propose a method that can recognize the interaction of a hand with an unknown object from RGB video.
Disclosure of Invention
The invention aims to provide a method and a device capable of detecting, in real time and from RGB video, the spatial poses of a 3D hand and an unknown object and their interaction category.
The inventors found that many prior-art methods estimate the pose of the hand or of the object in isolation; such gesture recognition methods can only recognize the shape of the hand and some simple gestures (for example a thumbs-up) and cannot recognize the interaction with an object. Some methods that reconstruct hand and object poses recover the object contours well but do not analyze the semantic information of the scene. Some action recognition methods rely on depth images as input, without which their accuracy is very low. Some object pose estimation methods do not compute the 6D pose directly, but first generate a 2D bounding box and then compute the 6D pose with a PnP algorithm, losing part of the information. The invention solves these problems: it is an end-to-end method that completes several tasks at once and, given RGB video, simultaneously predicts 3D hand and object poses, actions and category estimates, without requiring depth images or ground-truth object pose coordinates as input, thereby improving the accuracy of hand action recognition.
As shown in Fig. 1, the invention mainly comprises a convolutional neural network (CNN) and an interaction recurrent neural network (interaction RNN). The convolutional neural network recognizes, for each frame, the 3D hand pose, the 6D object pose (the 3D position and 3D orientation of the object), the hand action (pour, open, close, etc.) and the object category (milk, detergent, juice box, etc.); the recurrent neural network extracts and integrates the temporal features in the video to obtain the interaction category of the hand and the object over the whole video (pouring milk, opening a juice box, etc.). The method of the invention is divided into a training stage and a use stage. The training stage consists of two steps: first, the convolutional neural network is trained with video frames as input to predict the 3D hand pose, 6D object pose, hand action and object category of each frame, and its parameters are fixed after training; then the recurrent neural network is trained, taking the detected hand and object keypoint coordinates as input and outputting the estimated interaction category of the hand and the object in the whole video. In the use stage, the complete model takes a sequence of video frames as input and, after passing through the two neural networks, outputs the 3D hand and object pose predictions of each frame and the object and action category estimates of the whole sequence.
The technical scheme adopted by the invention mainly comprises the following steps (unless otherwise specified, the steps are executed by the software and hardware of a computer or electronic device):
(1) Model building and training. When the model is used for the first time, the user first trains the convolutional neural network and the interaction recurrent neural network; the trained model can then be used for action recognition.
(2) Video input. The model detects in real time the 3D hand pose (i.e., 3D hand position), the 6D object pose, the hand action and the object category estimate of each frame in the video, as well as the interaction of the hand and the object over the whole video.
Further, in the detailed design of the model, as shown in Fig. 2, 21 keypoints are specified for each of the hand and the object (the hand keypoints are the four joints of each finger plus the wrist node; the object keypoints are the eight vertices, the center point, and the midpoints of the 12 edges of the object bounding box), and the pose (i.e., the 3D hand pose and the 6D object pose) is determined by predicting the coordinates of these keypoints.
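For illustration, the 21 object keypoints described above can be derived from an axis-aligned 3D bounding box as in the following sketch; the helper name and the corner ordering are assumptions made here for clarity, not details specified by the invention.

```python
import itertools
import numpy as np

def object_keypoints(box_min, box_max):
    """21 object keypoints of an axis-aligned 3D bounding box:
    its 8 vertices, its center point, and the midpoints of its 12 edges."""
    box_min = np.asarray(box_min, dtype=float)
    box_max = np.asarray(box_max, dtype=float)
    # 8 vertices: all combinations of (min, max) along x, y and z.
    corners = np.array(list(itertools.product([box_min[0], box_max[0]],
                                              [box_min[1], box_max[1]],
                                              [box_min[2], box_max[2]])))
    center = (box_min + box_max) / 2.0
    # 12 edges: corner pairs that differ in exactly one coordinate.
    edges = [(i, j) for i in range(8) for j in range(i + 1, 8)
             if np.count_nonzero(corners[i] != corners[j]) == 1]
    midpoints = np.array([(corners[i] + corners[j]) / 2.0 for i, j in edges])
    return np.vstack([corners, center[None, :], midpoints])  # shape (21, 3)

# Example: a unit cube gives 8 + 1 + 12 = 21 keypoints.
print(object_keypoints([0, 0, 0], [1, 1, 1]).shape)  # (21, 3)
```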
Further, the method for predicting the coordinates of the key points and predicting the hand actions and the object categories by the convolutional neural network comprises the following steps:
as shown in fig. 3 and 4, each picture frame is divided into h×w grids, and D grids (H, W, D respectively representing height, width, and depth) are extended to depth, each grid having a size of C in units of pixels (pixels) in a plane and meters (meters) in a depth direction u ×C v Pixel×C z And (5) rice. In this grid coordinate system, the upper left corner of the grid is taken as the origin of the coordinate system, and a grid is taken as a unit.
To jointly predict the hand and object pose and category, as in Fig. 4, two vectors are stored in each cell (i.e., grid unit), one predicting the hand and one predicting the object, each of the form

v = (p_1, …, p_{N_c}, P_a, P_o, c)

where p_1, …, p_{N_c} ∈ R^3 are the coordinates of the hand or object keypoints and N_c is the number of hand or object keypoints; P_a is the action category probability vector and N_a the number of action categories; P_o is the object category probability vector and N_o the number of object categories, to which a background class is added (if the object is unknown, it is classified into the background class and then passed to the zero-shot learning classifier for identification); and c is the confidence. The cells containing the wrist node and the object center point are used to predict the action and object categories. The two vectors stored in each cell are produced by the convolutional neural network. The method first determines the coordinates (u, v, z) of the cell containing a keypoint and then predicts the offsets Δu, Δv, Δz of the keypoint relative to the upper-left corner of that cell in the three dimensions, from which the coordinates of the keypoint in the grid coordinate system are obtained:
u' = u + g(Δu)
v' = v + g(Δv)
z' = z + g(Δz)
Since the cells containing the wrist node and the object center point are responsible for predicting the action and object categories, g(x) is used to constrain the offsets of these two points to [0, 1], thereby keeping them within the cell responsible for predicting the action and object categories. The expression of g(x) is:
g(x) = sigmoid(x), if the keypoint is the wrist node or the object center point;
g(x) = x, otherwise.
where g(x) is the function constraining the offsets of the wrist node and the object center point, x is the offset Δu, Δv or Δz of the keypoint relative to the upper-left corner of its cell in the corresponding dimension, and sigmoid is an activation function with value range (0, 1) that maps a real number into the interval (0, 1); it ensures that the wrist node and the object center point, after the offset is applied, still lie within the cell that predicts the action and object categories.
In addition, from the three-dimensional position in the grid coordinate system and the camera intrinsic matrix K, the three-dimensional coordinates of the keypoint in the camera coordinate system can be computed as:
(x, y, z)^T = (z'·C_z) · K^{-1} · (u'·C_u, v'·C_v, 1)^T
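A minimal numerical sketch of this decoding step, assuming a pinhole intrinsic matrix K, cell sizes (C_u, C_v, C_z) and the sigmoid constraint g applied only to the wrist node and object center point; the function names and the numeric values in the example are illustrative, not taken from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_keypoint(cell_uvz, offset, K, cell_size, is_root=False):
    """Map a predicted keypoint from cell index + offset to grid coordinates,
    then to 3D camera coordinates.

    cell_uvz : (u, v, z) integer index of the cell containing the keypoint
    offset   : raw predicted offsets (du, dv, dz) w.r.t. the cell's upper-left corner
    K        : 3x3 camera intrinsic matrix
    cell_size: (C_u, C_v, C_z) -- pixels, pixels, meters per cell
    is_root  : True for the wrist node or object center point, whose offsets
               are squashed to [0, 1] by the sigmoid (the g(x) above)
    """
    offset = np.asarray(offset, dtype=float)
    offset = sigmoid(offset) if is_root else offset
    u, v, z = np.asarray(cell_uvz, dtype=float) + offset     # grid coordinates
    C_u, C_v, C_z = cell_size
    pixel_hom = np.array([u * C_u, v * C_v, 1.0])            # homogeneous pixel coords
    return (z * C_z) * np.linalg.inv(K) @ pixel_hom          # (x, y, z) in camera frame

# Example with an assumed intrinsic matrix for a 416x416 input and 13x13x5 grid.
K = np.array([[600.0, 0.0, 208.0],
              [0.0, 600.0, 208.0],
              [0.0,   0.0,   1.0]])
print(decode_keypoint((6, 7, 2), (0.3, -0.1, 0.4), K, (32.0, 32.0, 0.15), is_root=True))
```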
further, a higher confidence is set for the grid in which the adversary or object is present, and a confidence function is set as:
c(x) = exp(α(1 − D_T(x)/d_th)), if D_T(x) < d_th;
c(x) = 0, otherwise.
where D_T(x) is the Euclidean distance between the predicted point and the ground-truth point, α is a hyperparameter and d_th is a set threshold. The closer the prediction is to the ground truth, the smaller D_T(x) and the larger c(x), i.e., the higher the confidence; conversely, the confidence is lower. The total confidence is:
C = (1/N_c) · Σ_{i=1…N_c} c(x_i)

where x_i denotes the i-th predicted keypoint and D_T(x_i) its Euclidean distance to the corresponding ground-truth keypoint.
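The confidence computation can be sketched as follows, under the assumption that the exponential form above is used per keypoint and that the total confidence is the average over the N_c keypoints; the numeric values of α and d_th are placeholders.

```python
import numpy as np

def keypoint_confidence(dist, alpha=2.0, d_th=75.0):
    """Confidence of a keypoint given its Euclidean distance to the ground truth.
    alpha and d_th are the hyperparameter and threshold named in the text;
    the values used here are placeholders."""
    dist = np.asarray(dist, dtype=float)
    conf = np.exp(alpha * (1.0 - dist / d_th))
    return np.where(dist < d_th, conf, 0.0)   # zero beyond the distance threshold

def total_confidence(pred_kpts, gt_kpts, alpha=2.0, d_th=75.0):
    """Average confidence over all N_c keypoints of a hand or object."""
    dists = np.linalg.norm(np.asarray(pred_kpts) - np.asarray(gt_kpts), axis=-1)
    return float(keypoint_confidence(dists, alpha, d_th).mean())
```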
further, when the probability of the background class of the object is maximum, it is determined that the object belongs to an unknown class. As shown in fig. 6, unknown object categories are identified by introducing semantic information using a zero-order learning classifier module. The zero-order learning classifier module multiplies the probabilities of other prediction classes except the background by vectors in a semantic space respectively, adds the obtained semantic vectors to be used as final prediction semantic vectors, calculates the class and the similarity thereof in the semantic space, and considers that the unknown object belongs to the class with the highest similarity when the highest similarity value is not lower than a threshold value.
Further, the total loss function of the convolutional neural network of the present invention is:
L = λ_pose·L_pose + λ_conf·L_conf + λ_actcls·L_act + λ_objcls·L_obj

where λ_pose is the loss weight for predicting the hand and object positions, λ_conf the loss weight for the confidence, λ_actcls the loss weight for predicting the action category, λ_objcls the loss weight for predicting the object category, and G_t the regular fixed grid into which the picture is divided, over which the loss terms are accumulated. The quantities entering these terms are the predicted hand coordinates ĥ, the predicted object coordinates ô, the predicted confidence of the hand ĉ_h, the predicted confidence of the object ĉ_o, the predicted object category probabilities p̂_o and the predicted action category probabilities p̂_a.
Further, since the convolutional network only learns per-frame information and does not exploit the temporal information in the video, the invention adds an interaction recurrent neural network part, as shown in Fig. 5. The hand and object keypoint coordinate vectors x_1, …, x_N computed by the convolutional network are each fed into a multi-layer perceptron that models their relationship, and the results are then used as the input of the recurrent neural network. The model of the recurrent network is:

ŷ = f_φ(g_θ(x_1), g_θ(x_2), …, g_θ(x_N))
where f_φ is the recurrent neural network model and g_θ is the multi-layer perceptron model; the final output is the interaction category of the hand and the object in the video.
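A minimal PyTorch-style sketch of this interaction network is given below. The LSTM choice for f_φ, the layer widths and the class count are assumptions; the two recurrent hidden layers and the 2 × 21 × 3 keypoint input follow the embodiment described later.

```python
import torch
import torch.nn as nn

class InteractionRNN(nn.Module):
    """g_theta (MLP over per-frame hand + object keypoints) followed by f_phi (LSTM)."""
    def __init__(self, in_dim=2 * 21 * 3, hidden=256, num_classes=45):
        super().__init__()
        self.g_theta = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden), nn.ReLU())
        self.f_phi = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):                    # x: (batch, frames, 2*21*3 keypoint coords)
        b, t, d = x.shape
        feats = self.g_theta(x.reshape(b * t, d)).reshape(b, t, -1)
        out, _ = self.f_phi(feats)
        return self.classifier(out[:, -1])   # interaction-class logits for the video

# Example: 8 videos of 16 frames, 2 x 21 keypoints x 3 coordinates per frame.
logits = InteractionRNN()(torch.randn(8, 16, 2 * 21 * 3))
print(logits.shape)  # torch.Size([8, 45])
```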
Based on the same inventive concept, the invention also provides a device for detecting the interaction between a 3D hand and an unknown object in RGB video in real time using the above method, comprising:
the model training module is used for taking the video frames as input and training a convolutional neural network, and the convolutional neural network predicts the 3D hand gesture, the 6D object gesture, the hand actions and the object types of each frame of image; the 3D hand gesture and the 6D object gesture detected by the convolutional neural network are used as input to train an interactive cyclic neural network, and the cyclic neural network utilizes time sequence information in the video to obtain the interactive category of the hand and the object in the video;
the real-time detection module is used for inputting the video to be detected into the convolutional neural network and the interactive cyclic neural network after training is completed, and obtaining the 3D hand gesture, the 6D object gesture, the hand action, the object type and the interaction action of the hand and the object in the video of each frame of image in the video.
The method for recognizing the interaction between a 3D hand and an object in RGB video greatly improves practicability, specifically:
(1) The method does not need to rely on depth images captured by an RGB-D camera; it can detect hand-object interaction in RGB video from a sequence of input frames, which greatly widens its range of application in daily life.
(2) The method can, at real-time speed, detect the position trajectories of the hand and the object while estimating the action category and the object category, and can be applied to abnormal behavior detection.
(3) The method can detect object categories that are not in the training set, which greatly widens the recognition range, improves generalization, and makes the method more convenient to apply in daily life.
Drawings
FIG. 1 is a flow chart of the method for recognizing 3D hand-object interaction based on RGB video, where I_1 to I_N denote N video frames, CNN is the convolutional neural network, and RNN is the recurrent neural network.
FIG. 2 is a schematic diagram of the hand and object keypoints; (a) shows the 21 hand keypoints and (b) the 21 object keypoints. In (a), P, R, M, I, T denote the 5 fingers, TIP the fingertips, DIP the distal interphalangeal joints, PIP the proximal interphalangeal joints, MCP the metacarpophalangeal joints, and Wrist the wrist.
FIG. 3 is a diagram of a grid coordinate system of an input image;
FIG. 4 is a schematic diagram of the hand and object positions in the grid coordinate system and the vectors stored in their cells;
FIG. 5 is a schematic diagram of the interaction recurrent network in the model, where x_1 to x_N denote the inputs of the interaction recurrent neural network.
Fig. 6 is a schematic diagram of the zero-shot learning classifier module in the model.
Detailed Description
The process according to the invention is further described below with reference to the accompanying drawings and specific examples.
The hand motion recognition method does not need to rely on an external detection algorithm and only requires end-to-end training on single images. After a single RGB image is input and one feed-forward pass through the neural network is performed, the poses of the 3D hand and the object are estimated jointly, their interaction is modeled in three dimensions, and the object and action categories are recognized; when the object is recognized as the background category, the unknown category of the object is predicted by the zero-shot learning classifier module, which finds the closest category in the semantic space. The pose information of the hand and the object is then further combined and propagated in the time domain to infer the interaction between the hand and object trajectories and to recognize the action. The method takes a sequence of frames as input and outputs the 3D hand and object pose predictions of each frame and the object and action category estimates of the whole sequence.
Fig. 1 is a schematic flow chart of the method for recognizing 3D hand-object interaction based on RGB video. The method mainly comprises the following steps:
(1) Model training. The model training is divided into two parts: first the convolutional neural network is trained, then its parameters are fixed and the interaction recurrent neural network is trained. The convolutional neural network is a YOLO-based architecture with 31 layers in total; except for the last layer, which is the predictor, the remaining layers are convolutional or pooling layers, and the output is an H × W × D × 2 × (3·N_c + 1 + N_a + N_o) tensor, corresponding to the two vectors (hand and object) stored in each cell of the grid. In the method of this embodiment, H = W = 13 and D = 5, and the input picture size is 416 × 416. After the convolutional network is trained, the hand and object keypoint vectors obtained from each frame by the convolutional network are processed by a multi-layer perceptron with one hidden layer, then fed into a recurrent neural network with two hidden layers, and finally the interaction category estimate is output. The training dataset of this embodiment is the First-Person Hand Action (FPHA) dataset, a publicly available 3D hand-object interaction recognition dataset containing labels for 3D hand poses, 6D object poses and action categories. FPHA contains videos of 6 actors belonging to 45 different activity categories, in which the subjects perform complex actions corresponding to daily human activities. One subset of the dataset contains annotations of the 6D object pose, with corresponding mesh models for 4 objects involved in 10 different action categories. The data are divided according to the object categories interacting with the hand into a training set and a test set, where the test set contains object categories (unknown categories) not present in the training set.
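For concreteness, the size of the predictor output described above can be checked as follows; N_a = 45 and N_o = 5 (4 object categories plus a background class) are assumptions inferred from the dataset description rather than values stated explicitly.

```python
def predictor_output_size(H=13, W=13, D=5, N_c=21, N_a=45, N_o=5):
    """Number of values predicted per image: H*W*D cells, each storing two vectors
    (hand and object) of length 3*N_c + 1 + N_a + N_o
    (keypoint coordinates + confidence + action probs + object probs)."""
    per_vector = 3 * N_c + 1 + N_a + N_o
    return H * W * D * 2 * per_vector

print(predictor_output_size())  # 192660 values for the assumed class counts
```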
(2) Detection stage. A sequence of video frames is input into the model, and the 3D poses of the hand and the object in each frame and the interaction category of the hand and the object over the whole sequence are estimated. When the predicted object category is the background class, the unknown category of the object is predicted by the zero-shot learning classifier.
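A high-level sketch of this detection stage is shown below, with the per-frame network, the interaction network and the zero-shot classifier passed in as components; their interfaces (dictionary keys, callable signature) are assumptions made for illustration.

```python
import torch

def detect_interaction(frames, cnn, interaction_rnn, zero_shot_classify, background_idx=0):
    """frames: tensor of shape (num_frames, 3, 416, 416).
    Returns the per-frame predictions and the interaction category of the sequence."""
    per_frame = []
    with torch.no_grad():
        for frame in frames:
            out = cnn(frame.unsqueeze(0))     # assumed dict of keypoints, probs, confidences
            if out["object_probs"].argmax().item() == background_idx:
                # Unknown object: resolve its category in the semantic space.
                out["object_label"] = zero_shot_classify(out["object_probs"])
            per_frame.append(out)
        # Stack the hand + object keypoint vectors of all frames for the RNN.
        keypoints = torch.stack([torch.cat([o["hand_kpts"].flatten(),
                                            o["obj_kpts"].flatten()])
                                 for o in per_frame]).unsqueeze(0)
        interaction = interaction_rnn(keypoints).argmax(dim=-1).item()
    return per_frame, interaction
```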
Fig. 2 is a schematic diagram of the hand and object keypoints; 21 keypoints are used for each to simplify unified computation. The hand keypoints are the four joints of each finger plus the wrist node. The object keypoints are the eight vertices, the center point, and the midpoints of the 12 edges of its bounding box. The cells containing the wrist node and the object center point are used to predict the object and action categories.
Fig. 3 is a diagram of the grid coordinate system of an input image; the upper-left corner of the grid is taken as the origin of coordinates, one cell is one unit, and the grid coordinates are the number of cells offset from the upper-left corner.
Fig. 4 is a schematic diagram of the hand and object positions and the vectors stored in their cells in the grid coordinate system; whether an object is present in a cell is determined by whether a keypoint falls into that cell.
FIG. 5 is a schematic diagram of the interaction recurrent network in the model. Each frame is first passed through the convolutional network to obtain the hand and object keypoint vectors; their relationship is then modeled by a multi-layer perceptron, the resulting vectors learn the temporal information of the video through a recurrent neural network with two hidden layers, and the interaction category estimate is finally output.
FIG. 6 is a schematic diagram of the zero-shot learning classifier module in the model. When the probability of the background class is the largest, the object is judged to belong to an unknown class. The probabilities of the prediction classes other than background are multiplied by the corresponding vectors in the semantic space, the resulting semantic vectors are summed to obtain the final predicted semantic vector, the similarity with each class in the semantic space is computed, and, when the highest similarity is not below a threshold, the unknown object is regarded as belonging to the class with the highest similarity.
Even in complex real scenes, the method can effectively recognize, in real time and from RGB video, the trajectories, categories and interactions of a hand and an unknown object, capturing the semantic and temporal information of the sequence. It greatly improves action recognition efficiency and solves the problem that conventional gesture recognition cannot recover the semantic information of interaction with objects; moreover, it can recognize unknown objects interacting with the hand without requiring depth images or ground-truth object coordinates as input, providing a good basis for wide application.
The convolutional neural network, zero-shot learning classifier and recurrent neural network modules of the above embodiments may be combined arbitrarily; for brevity, not all possible combinations of the modules are described, but as long as the combination of modules is not contradictory, it should be considered within the scope of this specification.
Based on the same inventive concept, another embodiment of the present invention provides an apparatus for detecting 3D hand interaction with an unknown object in RGB video in real time using the above method, comprising:
the model training module is used for taking the video frames as input and training a convolutional neural network, and the convolutional neural network predicts the 3D hand gesture, the 6D object gesture, the hand actions and the object types of each frame of image; the 3D hand gesture and the 6D object gesture detected by the convolutional neural network are used as input to train an interactive cyclic neural network, and the cyclic neural network utilizes time sequence information in the video to obtain the interactive category of the hand and the object in the video;
the real-time detection module is used for inputting the video to be detected into the convolutional neural network and the interactive cyclic neural network after training is completed, and obtaining the 3D hand gesture, the 6D object gesture, the hand action, the object type and the interaction action of the hand and the object in the video of each frame of image in the video.
Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smart phone, etc.) comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of the invention.
Based on the same inventive concept, another embodiment of the present invention provides a computer readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program which, when executed by a computer, implements the steps of the inventive method.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art may modify or substitute the technical solution of the present invention without departing from the spirit and scope of the present invention, and the protection scope of the present invention shall be defined by the claims.

Claims (8)

1. A method for detecting the interaction of a 3D hand with an unknown object in RGB video in real time, comprising the following steps:
training a convolutional neural network with video frames as input, wherein the convolutional neural network predicts the 3D hand pose, 6D object pose, hand action and object category of each frame of image;
training an interaction recurrent neural network with the 3D hand poses and 6D object poses detected by the convolutional neural network as input, wherein the recurrent neural network uses the temporal information in the video to obtain the interaction category of the hand and the object in the video;
inputting the video to be detected into the trained convolutional neural network and interaction recurrent neural network, and obtaining the 3D hand pose, 6D object pose, hand action and object category of each frame of image in the video, as well as the interaction of the hand and the object in the video;
wherein the convolutional neural network predicts the keypoint coordinates and predicts the hand action and object category by the following steps:
dividing each picture frame into H×W grid cells and extending D cells along the depth, where each cell has a size of C_u × C_v pixels in the image plane and C_z meters in the depth direction; in this grid coordinate system, the upper-left corner of the grid is taken as the origin and one cell as the unit;
two vectors are stored in each cell
Figure FDA0004052160370000011
To predict hand and object features, respectively, wherein
Figure FDA0004052160370000012
Figure FDA0004052160370000013
Coordinates of key points of hand and object, respectively, < ->
Figure FDA0004052160370000014
N c For the number of key points of hands or objects, +.>
Figure FDA0004052160370000015
For action category probability, ++>
Figure FDA0004052160370000016
N a For action category number->
Figure FDA0004052160370000017
For object class probability ++>
Figure FDA0004052160370000018
N o The number of object categories; the grids where the wrist nodes and the center points of the objects are positioned are used for predicting actions and categories of the objects; />
Figure FDA0004052160370000019
For confidence level->
Figure FDA00040521603700000110
The two vectors stored in each cell are obtained by a convolutional neural network;
the coordinates (u, v, z) of the cell where the key point is located are determined, and then the offsets deltau, deltav, deltaz of the key point relative to the upper left corner of the cell where the key point is located in three dimensions are predicted, so that the coordinates of the key point in a grid coordinate system can be obtained:
u' = u + g(Δu)
v' = v + g(Δv)
z' = z + g(Δz)
wherein, since the cells containing the wrist node and the object center point are responsible for predicting the action and object categories, g(x) is used to constrain the offsets of these two points to between 0 and 1, thereby determining the cell responsible for predicting the action and object categories; the expression of g(x) is as follows:
g(x) = sigmoid(x), if the keypoint is the wrist node or the object center point;
g(x) = x, otherwise;
wherein g(x) represents the function constraining the offsets of the wrist node and the object center point, x represents the offset Δu, Δv or Δz of the keypoint relative to the upper-left corner of its cell in the corresponding dimension, and sigmoid represents an activation function with value range (0, 1) that maps a real number into the interval (0, 1);
the interaction recurrent neural network takes the hand and object keypoint coordinate vectors obtained by the convolutional neural network as input, models their interaction relationship with a multi-layer perceptron whose output serves as the input of the recurrent neural network, and finally outputs the interaction category of the hand and the object in the video.
2. The method of claim 1, wherein 21 keypoints are specified for each of the hand and the object, and the convolutional neural network determines the 3D hand pose and the 6D object pose by predicting the coordinates of the keypoints, wherein the hand keypoints are the four joints of each finger plus the wrist node, and the object keypoints are the eight vertices, the center point and the midpoints of the 12 edges of the object bounding box.
3. The method of claim 1, wherein a higher confidence is set for the grid cells in which a hand or an object is present, and the confidence function is set as:
c(x) = exp(α(1 − D_T(x)/d_th)), if D_T(x) < d_th;
c(x) = 0, otherwise;
wherein D_T(x) is the Euclidean distance between the predicted point and the ground-truth point, α represents a hyperparameter, and d_th represents a set threshold; the total confidence is:
C = (1/N_c) · Σ_{i=1…N_c} c(x_i)

wherein x_i denotes the i-th predicted keypoint and D_T(x_i) its Euclidean distance to the corresponding ground-truth keypoint.
4. The method according to claim 1, wherein, when the probability of the background class of the object is the largest, the object is judged to belong to an unknown class, and the unknown class of the object is identified by introducing semantic information with a zero-shot learning classifier; the zero-shot learning classifier multiplies the probabilities of the prediction classes other than background by the corresponding vectors in a semantic space, sums the resulting semantic vectors to obtain the final predicted semantic vector, computes the similarity with each class in the semantic space, and, when the highest similarity is not below a threshold, regards the unknown object as belonging to the class with the highest similarity.
5. The method of claim 1, wherein the convolutional neural network has a total loss function of:
L = λ_pose·L_pose + λ_conf·L_conf + λ_act·L_act + λ_obj·L_obj

wherein λ_pose is the loss weight for predicting the hand and object positions, λ_conf the loss weight for the confidence, λ_act the loss weight for predicting the action category, λ_obj the loss weight for predicting the object category, and G_t the regular fixed grid into which the picture is divided, over which the loss terms are accumulated; the quantities entering these terms are the predicted hand coordinates ĥ, the predicted object coordinates ô, the predicted confidence of the hand ĉ_h, the predicted confidence of the object ĉ_o, the predicted object category probabilities p̂_o and the predicted action category probabilities p̂_a.
6. An apparatus for detecting 3D hand interaction with an unknown object in RGB video in real time using the method of any one of claims 1-5, comprising:
a model training module, configured to train a convolutional neural network with video frames as input, the convolutional neural network predicting the 3D hand pose, 6D object pose, hand action and object category of each frame of image, and to train an interaction recurrent neural network with the 3D hand poses and 6D object poses detected by the convolutional neural network as input, the recurrent neural network using the temporal information in the video to obtain the interaction category of the hand and the object in the video;
a real-time detection module, configured to input the video to be detected into the trained convolutional neural network and interaction recurrent neural network, and to obtain the 3D hand pose, 6D object pose, hand action and object category of each frame of image in the video, as well as the interaction of the hand and the object in the video.
7. An electronic device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-5.
8. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any of claims 1-5.
CN202010916742.4A 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time Expired - Fee Related CN112199994B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010916742.4A CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010916742.4A CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Publications (2)

Publication Number Publication Date
CN112199994A CN112199994A (en) 2021-01-08
CN112199994B true CN112199994B (en) 2023-05-12

Family

ID=74005883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010916742.4A Expired - Fee Related CN112199994B (en) 2020-09-03 2020-09-03 Method and device for detecting interaction of 3D hand and unknown object in RGB video in real time

Country Status (1)

Country Link
CN (1) CN112199994B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112720504B (en) * 2021-01-20 2023-03-28 清华大学 Method and device for controlling learning of hand and object interactive motion from RGBD video
CN112949501B (en) * 2021-03-03 2023-12-08 安徽省科亿信息科技有限公司 Method for learning availability of object from teaching video

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107168527B (en) * 2017-04-25 2019-10-18 华南理工大学 The first visual angle gesture identification and exchange method based on region convolutional neural networks
CN107590432A (en) * 2017-07-27 2018-01-16 北京联合大学 A kind of gesture identification method based on circulating three-dimensional convolutional neural networks
EP3467707B1 (en) * 2017-10-07 2024-03-13 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
CN111104820A (en) * 2018-10-25 2020-05-05 中车株洲电力机车研究所有限公司 Gesture recognition method based on deep learning
CN109919078B (en) * 2019-03-05 2024-08-09 腾讯科技(深圳)有限公司 Video sequence selection method, model training method and device

Also Published As

Publication number Publication date
CN112199994A (en) 2021-01-08


Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant
CF01  Termination of patent right due to non-payment of annual fee (granted publication date: 20230512)