CN113312966B - Action recognition method and device based on first person viewing angle - Google Patents

Action recognition method and device based on first person viewing angle

Info

Publication number
CN113312966B
CN113312966B (application CN202110430314.5A)
Authority
CN
China
Prior art keywords
position information
video frames
rgb video
processed
video frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110430314.5A
Other languages
Chinese (zh)
Other versions
CN113312966A (en)
Inventor
刘文印
田文浩
陈俊洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110430314.5A priority Critical patent/CN113312966B/en
Publication of CN113312966A publication Critical patent/CN113312966A/en
Application granted granted Critical
Publication of CN113312966B publication Critical patent/CN113312966B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an action recognition method and device based on a first-person perspective. The method comprises the following steps: acquiring RGB video frames to be processed, where the RGB video frames to be processed contain hand motion image information captured from a first-person perspective; inputting all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information; selecting a preset number of target RGB video frames from all the RGB video frames to be processed and inputting them into an I3D model for recognition to obtain corresponding video frame features; inputting the hand joint point position information into an AGCN model to obtain corresponding position information features; and fusing the video frame features and the position information features in a one-to-one correspondence to obtain the probabilities of the recognized action instructions. The video frames sequentially undergo hand skeleton joint extraction, RGB and skeleton motion feature extraction, and finally feature fusion to obtain the action instruction probabilities, which eliminates the dependence on external hardware devices and provides strong robustness to illumination and scene changes.

Description

Action recognition method and device based on first person viewing angle
Technical Field
The present invention relates to the field of action recognition technologies, and in particular, to an action recognition method and apparatus based on a first-person perspective.
Background
Robots can learn actions from human demonstration videos in order to understand human behavioral intent and learn human behaviors autonomously. In practical applications, however, learning human behavior requires a careful understanding process, and learning behaviors derived from daily activities is particularly challenging for robots. For example, in first-person videos captured by a wearable camera, the robot can observe the operator's hand actions only from a single angle; fast hand motion and frequent occlusion during manipulation then introduce a great deal of unpredictability. Therefore, recognizing the subtle differences between human actions and then learning and executing them remains a major difficulty in robotics, and action recognition from the first-person perspective in particular is one of the current research hotspots.
Existing first-person action recognition approaches mainly fall into three categories: (1) methods relying on dedicated hardware, which require hardware support and require the operator to demonstrate actions in a specific environment; (2) methods that represent motion features in the demonstration video with dense trajectories and obtain gesture features with HOG, which are easily disturbed by the background and by camera motion and are computationally expensive; (3) methods that segment the operator's hands in the demonstration video and feed them into a deep neural network for recognition, which effectively reduce background interference but discard most of the original information. Clearly, the existing first-person action recognition methods all have certain shortcomings.
In summary, a first-person action recognition scheme that eliminates the dependence on external hardware devices and is robust to illumination and scene changes is of great significance.
Disclosure of Invention
The invention provides an action recognition method and device based on a first-person perspective, which eliminate the dependence on external hardware devices and are robust to illumination and scene changes.
In a first aspect, the present invention provides a motion recognition method based on a first person perspective, including:
acquiring an RGB video frame to be processed; the RGB video frame to be processed comprises hand motion image information based on a first person viewing angle;
inputting all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;
selecting a preset number of target RGB video frames from all the RGB video frames to be processed, inputting the target RGB video frames into an I3D model for recognition, and obtaining corresponding video frame characteristics;
inputting the position information of the hand joint point into an AGCN model to obtain corresponding position information characteristics;
and fusing the video frame features and the position information features in a one-to-one correspondence manner to obtain the probability of identifying the action instruction.
Optionally, the HOPE-Net deep neural network comprises: a ResNet10 network and an adaptive graph U-Net network; inputting all the RGB video frames into the pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information comprises the following steps:
encoding and predicting all the RGB video frames through the ResNet network to obtain a plurality of corresponding target plane rectangular coordinate points;
and inputting the plurality of target plane rectangular coordinate points into the adaptive graph U-Net network to obtain hand joint point position information corresponding to the RGB video frames.
Optionally, encoding and predicting all the RGB video frames through the ResNet network to obtain a plurality of corresponding target plane rectangular coordinate points includes:
coding all the RGB video frames to obtain coded video frames;
predicting all the encoded video frames to obtain corresponding initial plane rectangular coordinate points;
and convolving all the initial plane rectangular coordinate points with the corresponding RGB video frames to obtain target plane rectangular coordinate points.
Optionally, acquiring the RGB video frame to be processed includes:
acquiring a video to be processed; the video to be processed comprises hand motion image information based on a first person viewing angle;
and converting the hand motion image information into the RGB video frame to be processed through OpenCV.
Optionally, fusing the video frame features and the position information features in a one-to-one correspondence manner to obtain the probability of identifying the action instruction, including:
analyzing the distance relation between the video frame features and the position information features through a pre-established relational graph convolutional network, and creating a connection between each video frame feature and each position information feature based on the distance relation;
respectively inputting the video frame features and the position information features into a convolutional network to obtain convolved video frame features and convolved position information features;
and fusing the convolved video frame features and the convolved position information features that share the same connection to obtain fused information features, and inputting the fused information features into a fully connected layer network to obtain the probability of the recognized action instruction.
In a second aspect, the present invention provides an action recognition device based on a first person perspective, including:
the acquisition module is used for acquiring RGB video frames to be processed; the RGB video frame to be processed comprises hand motion image information based on a first person viewing angle;
the first input module is used for inputting all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;
the selecting module is used for selecting a preset number of target RGB video frames from all the RGB video frames to be processed, inputting the target RGB video frames into an I3D model for recognition, and obtaining corresponding video frame characteristics;
the second input module is used for inputting the hand joint point position information into an AGCN model to obtain corresponding position information characteristics;
and the fusion module is used for fusing the video frame features and the position information features in a one-to-one correspondence manner to obtain the probability of identifying the action instruction.
Optionally, the HOPE-Net deep neural network comprises: resNet10 network and adaptive graph U-Net network; the first input module includes:
the coding submodule is used for coding and predicting all the RGB video frames through a ResNet network to obtain a plurality of corresponding target plane rectangular coordinate points;
and the first input sub-module is used for inputting a plurality of target plane rectangular coordinate points into the adaptive graph U-Net network to obtain hand joint point position information corresponding to the RGB video frame.
Optionally, the encoding submodule includes:
the coding unit is used for coding all the RGB video frames to obtain coded video frames;
the prediction unit is used for predicting all the encoded video frames to obtain corresponding initial plane rectangular coordinate points;
and the convolution unit is used for convolving all the initial plane rectangular coordinate points with the corresponding RGB video frames to obtain target plane rectangular coordinate points.
Optionally, the acquiring module includes:
the acquisition sub-module is used for acquiring the video to be processed; the video to be processed comprises hand motion image information based on a first person viewing angle;
and the conversion sub-module is used for converting the hand motion image information into the RGB video frame to be processed through OpenCV.
Optionally, the fusion module includes:
the connection sub-module is used for analyzing the distance relation between the video frame features and the position information features through a pre-established relational graph convolutional network and creating a connection between each video frame feature and each position information feature based on the distance relation;
the second input sub-module is used for respectively inputting the video frame features and the position information features into a convolutional network to obtain convolved video frame features and convolved position information features;
and the fusion sub-module is used for fusing the convolved video frame features and the convolved position information features that share the same connection to obtain fused information features, and inputting the fused information features into the fully connected layer network to obtain the probability of the recognized action instruction.
From the above technical scheme, the invention has the following advantages:
the method comprises the steps of obtaining RGB video frames to be processed; the RGB video frame to be processed comprises hand motion image information based on a first person viewing angle; inputting all the RGB video frames to be processed into a pre-trained HOPE-Net depth neural network to obtain corresponding hand joint point position information; selecting a preset number of target RGB video frames from all the RGB video frames to be processed, inputting the target RGB video frames into an I3D model for recognition, and obtaining corresponding video frame characteristics; inputting the position information of the hand joint point into an AGCN model to obtain corresponding position information characteristics; and fusing the video frame features and the position information features in a one-to-one correspondence manner to obtain the probability of identifying the action instruction. The video frames are sequentially subjected to hand skeleton joint extraction, RGB and skeleton motion feature extraction, and finally feature fusion is performed to obtain motion instruction probability, so that dependence on external hardware equipment is eliminated, and strong robustness is provided for illumination and scene change.
Drawings
For a clearer description of the embodiments of the invention or of the solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort;
FIG. 1 is a flowchart illustrating a first-person perspective-based motion recognition method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a second example of a motion recognition method based on a first person perspective according to the present invention;
FIG. 3 is a schematic diagram of the processing of the present invention from video to be processed to hand joint position information;
FIG. 4 is a schematic structural diagram of an adaptive graph convolution module according to the present invention;
FIG. 5 is a schematic diagram of the use of a relationship graph convolutional network of the present invention;
fig. 6 is a block diagram illustrating an embodiment of a motion recognition apparatus based on a first person perspective according to the present invention.
Detailed Description
The embodiment of the invention provides a motion recognition method and device based on a first person perspective, which can get rid of dependence on external hardware equipment and has strong robustness on illumination and scene change.
In order to make the objects, features and advantages of the present invention more comprehensible, the technical solutions in the embodiments of the present invention are described in detail below with reference to the accompanying drawings. Obviously, the embodiments described below are only some embodiments of the present invention, not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart illustrating a first-person perspective-based motion recognition method according to an embodiment of the present invention, including:
s101, acquiring RGB video frames to be processed; the RGB video frame to be processed comprises hand motion image information based on a first person viewing angle;
s102, inputting all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;
s103, selecting a preset number of target RGB video frames from all the RGB video frames to be processed, inputting the target RGB video frames into an I3D model for recognition, and obtaining corresponding video frame characteristics;
s104, inputting the hand joint point position information into an AGCN model to obtain corresponding position information characteristics;
s105, fusing the video frame features and the position information features in a one-to-one correspondence manner, and obtaining the probability of identifying the action instruction.
In the embodiment of the invention, RGB video frames to be processed are acquired, where the RGB video frames to be processed contain hand motion image information captured from a first-person perspective; all the RGB video frames to be processed are input into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information; a preset number of target RGB video frames are selected from all the RGB video frames to be processed and input into an I3D model for recognition to obtain corresponding video frame features; the hand joint point position information is input into an AGCN model to obtain corresponding position information features; and the video frame features and the position information features are fused in a one-to-one correspondence to obtain the probabilities of the recognized action instructions. The video frames sequentially undergo hand skeleton joint extraction, RGB and skeleton motion feature extraction, and finally feature fusion to obtain the action instruction probabilities, which eliminates the dependence on external hardware devices and provides strong robustness to illumination and scene changes.
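The patent describes this pipeline only in prose. As a purely illustrative aid (not part of the original disclosure), the following Python sketch strings steps S101-S105 together; the objects hope_net, i3d, agcn and fusion_head are hypothetical, pre-trained stand-ins whose exact interfaces the patent does not specify, and the input frames are assumed to already be image tensors.

```python
# Minimal end-to-end sketch of steps S101-S105 (PyTorch-style pseudocode).
# hope_net, i3d, agcn and fusion_head are hypothetical pre-trained modules;
# the patent itself does not specify their implementations or signatures.
import torch

def recognize_action(rgb_frames, hope_net, i3d, agcn, fusion_head, num_target=32):
    # S102: per-frame 3D hand joint positions, shape (T, 21, 3)
    joints = torch.stack([hope_net(f) for f in rgb_frames])

    # S103: pick a fixed number of target frames and extract RGB clip features
    idx = torch.linspace(0, len(rgb_frames) - 1, num_target).long()
    clip = torch.stack([rgb_frames[i] for i in idx])   # (num_target, C, H, W)
    rgb_feat = i3d(clip.unsqueeze(0))                  # (1, D_rgb)

    # S104: skeleton features from the joint sequence
    skel_feat = agcn(joints.unsqueeze(0))              # (1, 256)

    # S105: fuse both feature streams and output action-instruction probabilities
    return fusion_head(rgb_feat, skel_feat)            # (1, num_classes)
```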
Referring to fig. 2, fig. 2 is a flowchart illustrating steps of a second embodiment of an action recognition method based on a first person perspective according to the present invention, which specifically includes:
step S201, obtaining a video to be processed; the video to be processed comprises hand motion image information based on a first person viewing angle;
step S202, converting the hand motion image information into the RGB video frame to be processed through OpenCV;
in the embodiment of the invention, firstly, the video to be processed is converted into a plurality of RGB video frames to be processed by using OpenCV.
Note that OpenCV is a cross-platform computer vision and machine learning software library that runs on Linux, Windows, Android and Mac OS. It is lightweight and efficient, consisting of a series of C functions and a small number of C++ classes, and it also provides interfaces for languages such as Python, Ruby and MATLAB, implementing many general-purpose algorithms for image processing and computer vision.
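For illustration only, a minimal Python sketch of the conversion in steps S201-S202 using OpenCV is given below; the function name is arbitrary, and an explicit BGR-to-RGB conversion is included because OpenCV decodes frames in BGR order.

```python
# A minimal sketch of steps S201-S202: splitting the demonstration video into
# RGB frames with OpenCV.
import cv2

def video_to_rgb_frames(video_path):
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, frame_bgr = cap.read()
        if not ok:                       # end of video
            break
        # OpenCV returns BGR frames, so convert to RGB explicitly
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames                        # list of H x W x 3 RGB arrays
```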
Step S203, coding and predicting all the RGB video frames through a ResNet network to obtain a plurality of corresponding target plane rectangular coordinate points;
in an alternative embodiment, encoding and predicting all the RGB video frames through the ResNet network to obtain a plurality of corresponding target plane rectangular coordinate points includes:
coding all the RGB video frames to obtain coded video frames;
predicting all the encoded video frames to obtain corresponding initial plane rectangular coordinate points;
and convolving all the initial plane rectangular coordinate points with the corresponding RGB video frames to obtain target plane rectangular coordinate points.
In the embodiment of the invention, the ResNet10 network is used to perform feature encoding on all RGB video frames to obtain encoded video frames, initial plane rectangular coordinate points are predicted from the encoded video frames, and the initial plane rectangular coordinate points are then convolved with the RGB video frames to obtain more accurate target plane rectangular coordinate points.
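As an illustrative sketch only, the module below shows one plausible reading of step S203: a ResNet-style encoder produces a frame embedding, an initial set of 21 two-dimensional keypoints is regressed from it, and the initial points are refined together with the image feature. All names, layer sizes and the 1-D convolutional refinement layer are assumptions; the patent states only that the initial coordinate points are convolved with the corresponding frames.

```python
# Hedged sketch of step S203: encode the frame, regress initial 2D keypoints,
# then refine them conditioned on the pooled image feature.
import torch
import torch.nn as nn

class Keypoint2DHead(nn.Module):
    def __init__(self, backbone, feat_dim=512, num_joints=21):
        super().__init__()
        self.backbone = backbone                        # assumed ResNet10 encoder -> (B, feat_dim)
        self.init_fc = nn.Linear(feat_dim, num_joints * 2)
        # 1x1 conv over the joint dimension refines the initial coordinates
        self.refine = nn.Conv1d(2 + feat_dim, 2, kernel_size=1)
        self.num_joints = num_joints

    def forward(self, frame):                           # frame: (B, 3, H, W)
        feat = self.backbone(frame)                     # (B, feat_dim)
        init_xy = self.init_fc(feat).view(-1, self.num_joints, 2)
        cond = feat.unsqueeze(1).expand(-1, self.num_joints, -1)
        refined = self.refine(torch.cat([init_xy, cond], dim=-1).transpose(1, 2))
        return refined.transpose(1, 2)                  # (B, 21, 2) target coordinate points
```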
Step S204, inputting the plurality of target plane rectangular coordinate points into an adaptive graph U-Net network to obtain hand joint point position information corresponding to the RGB video frames;
In the embodiment of the present invention, after the target plane rectangular coordinate points mentioned in step S203 are obtained, they are input into the adaptive graph U-Net network, which computes the depth value of each hand joint, i.e., the corresponding hand joint point position information. The hand joint point position information is the final three-dimensional rectangular coordinate information of the 21 hand joints, so that the hand joints are transformed from target plane rectangular coordinates to three-dimensional rectangular coordinates.
Referring to fig. 3, fig. 3 is a schematic diagram of the processing from the video to be processed to the hand joint position information, wherein 1 is the HOPE-Net network, which comprises the ResNet10 network 2 and the U-Net network 3. The initial plane rectangular coordinate points are obtained with the assistance of the ResNet10 network 2, the target plane rectangular coordinate points are then obtained with the assistance of the ResNet10 network 2 again, and finally the hand joint position information corresponding to the RGB video frames is obtained with the assistance of the U-Net network 3.
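Again for illustration only, the block below is a heavily simplified stand-in for step S204: two graph-convolution layers over the 21-joint hand-skeleton adjacency map the refined 2-D keypoints to 3-D coordinates. The real component is an adaptive graph U-Net with graph pooling and unpooling; this flat version, with assumed layer sizes, only conveys the 2-D to 3-D lifting idea.

```python
# Simplified 2D -> 3D lifting over the hand-skeleton graph (not the full
# adaptive graph U-Net of the patent).
import torch
import torch.nn as nn

class GraphLift2Dto3D(nn.Module):
    def __init__(self, adjacency, hidden=128):          # adjacency: (21, 21) tensor
        super().__init__()
        a_hat = adjacency + torch.eye(adjacency.size(0))        # add self-loops
        d_inv = torch.diag(a_hat.sum(dim=1).rsqrt())            # D^-1/2
        self.register_buffer("a_norm", d_inv @ a_hat @ d_inv)   # normalized adjacency
        self.fc1 = nn.Linear(2, hidden)
        self.fc2 = nn.Linear(hidden, 3)

    def forward(self, joints_2d):                        # (B, 21, 2)
        h = torch.relu(self.a_norm @ self.fc1(joints_2d))
        return self.a_norm @ self.fc2(h)                 # (B, 21, 3) joint coordinates
```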
Step S205, selecting a preset number of target RGB video frames from all the RGB video frames to be processed, inputting the target RGB video frames into an I3D model for recognition, and obtaining corresponding video frame characteristics;
in the embodiment of the invention, in order to extract more feature details from the RGB video frames to be processed, an I3D model is used to recognize the target RGB video frames selected from the RGB video frames to be processed. The model expands two-dimensional convolution into three-dimensional convolution, i.e., a time dimension is added to the convolution kernels and the pooling layers, and the three-dimensional convolution is used to extract the video frame features corresponding to the target RGB video frames.
It should be noted that each three-dimensional filter is obtained by inflating an N×N two-dimensional filter: the N×N filter weights are repeated N times along the time dimension and normalized by dividing by N. A BN layer and a ReLU activation function are added after every convolution layer except the last one.
In a specific implementation, 32 frames are selected from the RGB video frames to be processed as a group and input into the I3D model, which generates the corresponding video frame features.
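The following snippet, provided purely as an illustration, expresses the filter inflation described above in PyTorch terms: a pre-trained 2-D convolution kernel is repeated along a new time dimension and rescaled by the number of repetitions, so that the resulting 3-D filter initially responds to a static clip exactly as its 2-D counterpart does. The function name and the way the weight would be copied into a Conv3d are assumptions.

```python
# Sketch of I3D-style filter inflation: 2D weights -> 3D weights.
import torch

def inflate_conv2d_weight(w2d, time_kernel):
    # w2d: (out_ch, in_ch, N, N)  ->  w3d: (out_ch, in_ch, time_kernel, N, N)
    w3d = w2d.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
    return w3d / time_kernel        # rescale so responses match the 2D filter

# usage (assumed):
# conv3d.weight.data.copy_(inflate_conv2d_weight(conv2d.weight.data, conv3d.kernel_size[0]))
```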
Step S206, inputting the hand joint point position information into an AGCN model to obtain corresponding position information characteristics;
it should be noted that the AGCN model comprises 9 stacked adaptive graph convolution modules, which automatically generate different topologies for different GCN units and different samples. Referring to fig. 4, fig. 4 is a schematic structural diagram of an adaptive graph convolution module according to the present invention, comprising a spatial graph convolution 4, a temporal graph convolution 5 and an additional dropout layer 6, where each graph convolution layer is followed by a BN layer 7 and a ReLU layer 8, and the corresponding topology is generated by combining 5 different types of graph layers. To obtain a more stable effect, each adaptive graph convolution module of the AGCN model is connected by a residual connection. Finally, an N×256 feature is obtained through the AGCN network model, where N is the number of samples.
In the embodiment of the invention, the AGCN model represents the natural skeleton structure of the hand, built from the hand joint point position information, as a topological graph. The model is constructed on a sequence of hand skeleton graphs, i.e., the hand joint point position information, where each node of a hand skeleton graph represents one hand joint at one moment and is described by its three-dimensional coordinates. The edges of the graph are of two types: spatial edges between naturally connected hand joints at a given moment, and temporal edges connecting the same joint across consecutive time steps. On this basis, a multi-layer spatio-temporal graph convolution is constructed to aggregate information in both the spatial and temporal dimensions.
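A hedged sketch of a single adaptive graph-convolution block along the lines described above is given below: a spatial graph convolution over the hand-joint graph, a temporal convolution, BN and ReLU after each convolution, dropout, and a residual connection, with a learnable adjacency term standing in for the adaptive topology. Layer sizes, the temporal kernel and the dropout rate are assumptions rather than values disclosed in the patent.

```python
# Sketch of one adaptive graph-convolution block (spatial + temporal).
import torch
import torch.nn as nn

class AGCNBlock(nn.Module):
    def __init__(self, in_ch, out_ch, adjacency, t_kernel=9, dropout=0.5):
        super().__init__()
        self.register_buffer("a_fixed", adjacency)                 # (V, V) skeleton graph
        self.a_learn = nn.Parameter(torch.zeros_like(adjacency))   # adaptive topology term
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.temporal = nn.Conv2d(out_ch, out_ch, (t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.drop = nn.Dropout(dropout)
        self.residual = (nn.Identity() if in_ch == out_ch
                         else nn.Conv2d(in_ch, out_ch, kernel_size=1))

    def forward(self, x):                        # x: (B, C, T, V) joint sequence
        a = self.a_fixed + self.a_learn          # data-driven topology adjustment
        y = torch.einsum("bctv,vw->bctw", self.spatial(x), a)   # spatial graph conv
        y = torch.relu(self.bn1(y))
        y = self.drop(torch.relu(self.bn2(self.temporal(y))))   # temporal conv
        return y + self.residual(x)              # residual connection
```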
Step S207, analyzing the distance relation between the video frame features and the position information features through a pre-established relational graph convolutional network, and creating a connection between each video frame feature and each position information feature based on the distance relation;
Step S208, respectively inputting the video frame features and the position information features into a convolutional network to obtain convolved video frame features and convolved position information features;
Step S209, fusing the convolved video frame features and the convolved position information features that share the same connection to obtain fused information features, and inputting the fused information features into a fully connected layer network to obtain the probability of the recognized action instruction.
Referring to fig. 5, fig. 5 is a schematic diagram of the use of the relational graph convolutional network, in which 9 is a video frame feature, 10 is a position information feature, 11 is a GCN unit, 12 is a fused information feature, and 13 is the probability of the recognized action instruction. The video frame features 9 and the position information features 10 are respectively input into several GCN units for convolution to obtain the convolved video frame features and the convolved position information features, which are then fused into the fused information features 12, from which the probabilities 13 of the recognized action instructions are obtained. Action recognition on the operation video is thus achieved without placing restrictions on the demonstration video or the demonstration environment and without relying on additional auxiliary sensors.
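To make steps S207-S209 concrete, a simplified, purely illustrative sketch follows: the RGB clip feature and the skeleton feature are treated as connected nodes, each is passed through its own GCN-style projection, the connected pair is fused, and a fully connected layer produces the action-class probabilities. The one-to-one pairing rule, the additive fusion and all layer sizes are assumptions.

```python
# Simplified fusion head for steps S207-S209.
import torch
import torch.nn as nn

class RelationFusionHead(nn.Module):
    def __init__(self, rgb_dim, skel_dim, hidden=256, num_classes=10):
        super().__init__()
        self.rgb_gcn = nn.Linear(rgb_dim, hidden)    # GCN-style unit on RGB feature nodes
        self.skel_gcn = nn.Linear(skel_dim, hidden)  # GCN-style unit on skeleton feature nodes
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, rgb_feat, skel_feat):          # (B, rgb_dim), (B, skel_dim)
        # one-to-one connection between corresponding RGB / skeleton features
        r = torch.relu(self.rgb_gcn(rgb_feat))
        s = torch.relu(self.skel_gcn(skel_feat))
        fused = r + s                                # fuse the connected node pair
        return torch.softmax(self.classifier(fused), dim=-1)  # action probabilities
```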
According to the action recognition method based on the first-person perspective provided by the embodiment of the invention, RGB video frames to be processed are acquired, where the RGB video frames to be processed contain hand motion image information captured from a first-person perspective; all the RGB video frames to be processed are input into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information; a preset number of target RGB video frames are selected from all the RGB video frames to be processed and input into an I3D model for recognition to obtain corresponding video frame features; the hand joint point position information is input into an AGCN model to obtain corresponding position information features; and the video frame features and the position information features are fused in a one-to-one correspondence to obtain the probabilities of the recognized action instructions. The video frames sequentially undergo hand skeleton joint extraction, RGB and skeleton motion feature extraction, and finally feature fusion to obtain the action instruction probabilities, which eliminates the dependence on external hardware devices and provides strong robustness to illumination and scene changes.
Referring to fig. 6, there is shown a block diagram of an embodiment of an action recognition device based on a first person perspective, the device comprising:
an obtaining module 101, configured to obtain an RGB video frame to be processed; the RGB video frame to be processed comprises hand motion image information based on a first person viewing angle;
the first input module 102 is configured to input all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;
a selecting module 103, configured to select a predetermined number of target RGB video frames from all the RGB video frames to be processed, and input the target RGB video frames to an I3D model for identification, so as to obtain corresponding video frame features;
the second input module 104 is configured to input the hand joint point position information into an AGCN model, to obtain a corresponding position information feature;
and the fusion module 105 is used for fusing the video frame features and the position information features in a one-to-one correspondence manner to obtain the probability of identifying the action instruction.
In an alternative embodiment, the HOPE-Net deep neural network comprises: resNet10 network and adaptive graph U-Net network; the first input module 102 includes:
the coding submodule is used for coding and predicting all the RGB video frames through a ResNet network to obtain a plurality of corresponding target plane rectangular coordinate points;
and the first input sub-module is used for inputting a plurality of target plane rectangular coordinate points into the adaptive graph U-Net network to obtain hand joint point position information corresponding to the RGB video frame.
In an alternative embodiment, the encoding submodule includes:
the coding unit is used for coding all the RGB video frames to obtain coded video frames;
the prediction unit is used for predicting all the encoded video frames to obtain corresponding initial plane rectangular coordinate points;
and the convolution unit is used for convolving all the initial plane rectangular coordinate points with the corresponding RGB video frames to obtain target plane rectangular coordinate points.
In an alternative embodiment, the obtaining module 101 includes:
the acquisition sub-module is used for acquiring the video to be processed; the video to be processed comprises hand motion image information based on a first person viewing angle;
and the conversion sub-module is used for converting the hand motion image information into the RGB video frame to be processed through OpenCV.
In an alternative embodiment, the fusion module 105 includes:
the connection sub-module is used for analyzing the distance relation between the video frame features and the position information features through a pre-established relational graph convolutional network and creating a connection between each video frame feature and each position information feature based on the distance relation;
the second input sub-module is used for respectively inputting the video frame features and the position information features into a convolutional network to obtain convolved video frame features and convolved position information features;
and the fusion sub-module is used for fusing the convolved video frame features and the convolved position information features that share the same connection to obtain fused information features, and inputting the fused information features into the fully connected layer network to obtain the probability of the recognized action instruction.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (4)

1. A method of motion recognition based on a first person perspective, comprising:
acquiring an RGB video frame to be processed; the RGB video frame to be processed comprises hand motion image information based on a first person viewing angle;
inputting all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;
selecting a preset number of target RGB video frames from all the RGB video frames to be processed, inputting the target RGB video frames into an I3D model for recognition, and obtaining corresponding video frame characteristics;
inputting the position information of the hand joint point into an AGCN model to obtain corresponding position information characteristics;
fusing the video frame features and the position information features in a one-to-one correspondence manner to obtain the probability of identifying the action instruction;
the HOPE-Net deep neural network comprises: resNet10 network and adaptive graph U-Net network; inputting all the RGB video frames into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information, wherein the method comprises the following steps of:
coding and predicting all the RGB video frames through a ResNet network to obtain a plurality of corresponding target plane rectangular coordinate points;
inputting a plurality of target plane rectangular coordinate points into the adaptive graph U-Net network to obtain hand joint point position information corresponding to the RGB video frame;
encoding and predicting all the RGB video frames through a ResNet network to obtain a plurality of corresponding target plane rectangular coordinate points, wherein the method comprises the following steps:
coding all the RGB video frames to obtain coded video frames;
predicting all the encoded video frames to obtain corresponding initial plane rectangular coordinate points;
convolving all the initial plane rectangular coordinate points with the corresponding RGB video frames to obtain target plane rectangular coordinate points;
fusing the video frame features and the position information features in a one-to-one correspondence manner to obtain the probability of identifying the action instruction, wherein the method comprises the following steps:
analyzing the distance relation between the video frame features and the position information features through a pre-established relational graph convolutional network, and creating a connection between each video frame feature and each position information feature based on the distance relation;
respectively inputting the video frame features and the position information features into a convolutional network to obtain convolved video frame features and convolved position information features;
and fusing the convolved video frame features and the convolved position information features that share the same connection to obtain fused information features, and inputting the fused information features into a fully connected layer network to obtain the probability of the recognized action instruction.
2. The method for motion recognition based on first person perspective as recited in claim 1, wherein obtaining RGB video frames to be processed comprises:
acquiring a video to be processed; the video to be processed comprises hand motion image information based on a first person viewing angle;
and converting the hand motion image information into the RGB video frame to be processed through OpenCV.
3. An action recognition device based on a first person perspective, comprising:
the acquisition module is used for acquiring RGB video frames to be processed; the RGB video frame to be processed comprises hand motion image information based on a first person viewing angle;
the first input module is used for inputting all the RGB video frames to be processed into a pre-trained HOPE-Net deep neural network to obtain corresponding hand joint point position information;
the selecting module is used for selecting a preset number of target RGB video frames from all the RGB video frames to be processed, inputting the target RGB video frames into an I3D model for recognition, and obtaining corresponding video frame characteristics;
the second input module is used for inputting the hand joint point position information into an AGCN model to obtain corresponding position information characteristics;
the fusion module is used for fusing the video frame features and the position information features in a one-to-one correspondence manner to obtain the probability of identifying the action instruction;
the HOPE-Net deep neural network comprises: resNet10 network and adaptive graph U-Net network; the first input module includes:
the coding submodule is used for coding and predicting all the RGB video frames through a ResNet network to obtain a plurality of corresponding target plane rectangular coordinate points;
the first input sub-module is used for inputting a plurality of target plane rectangular coordinate points into the adaptive graph U-Net network to obtain hand joint point position information corresponding to the RGB video frame;
the encoding submodule includes:
the coding unit is used for coding all the RGB video frames to obtain coded video frames;
the prediction unit is used for predicting all the encoded video frames to obtain corresponding initial plane rectangular coordinate points;
the convolution unit is used for convoluting all the initial plane rectangular coordinate points with the corresponding RGB video frames to obtain target plane rectangular coordinate points;
the fusion module comprises:
the connection sub-module is used for analyzing the distance relation between the video frame features and the position information features through a pre-established relational graph convolutional network and creating a connection between each video frame feature and each position information feature based on the distance relation;
the second input sub-module is used for respectively inputting the video frame features and the position information features into a convolutional network to obtain convolved video frame features and convolved position information features;
and the fusion sub-module is used for fusing the convolved video frame features and the convolved position information features that share the same connection to obtain fused information features, and inputting the fused information features into the fully connected layer network to obtain the probability of the recognized action instruction.
4. The first person perspective based motion recognition apparatus of claim 3, wherein the acquisition module comprises:
the acquisition sub-module is used for acquiring the video to be processed; the video to be processed comprises hand motion image information based on a first person viewing angle;
and the conversion sub-module is used for converting the hand motion image information into the RGB video frame to be processed through OpenCV.
CN202110430314.5A 2021-04-21 2021-04-21 Action recognition method and device based on first person viewing angle Active CN113312966B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110430314.5A CN113312966B (en) 2021-04-21 2021-04-21 Action recognition method and device based on first person viewing angle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110430314.5A CN113312966B (en) 2021-04-21 2021-04-21 Action recognition method and device based on first person viewing angle

Publications (2)

Publication Number Publication Date
CN113312966A CN113312966A (en) 2021-08-27
CN113312966B (en) 2023-08-08

Family

ID=77372648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110430314.5A Active CN113312966B (en) 2021-04-21 2021-04-21 Action recognition method and device based on first person viewing angle

Country Status (1)

Country Link
CN (1) CN113312966B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113902995B (en) * 2021-11-10 2024-04-02 中国科学技术大学 Multi-mode human behavior recognition method and related equipment
CN114973424A (en) * 2022-08-01 2022-08-30 深圳市海清视讯科技有限公司 Feature extraction model training method, hand action recognition method, device and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame
WO2019237708A1 (en) * 2018-06-15 2019-12-19 山东大学 Interpersonal interaction body language automatic generation method and system based on deep learning
CN112163480A (en) * 2020-09-16 2021-01-01 北京邮电大学 Behavior identification method and device
WO2021056516A1 (en) * 2019-09-29 2021-04-01 深圳市大疆创新科技有限公司 Method and device for target detection, and movable platform

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019237708A1 (en) * 2018-06-15 2019-12-19 山东大学 Interpersonal interaction body language automatic generation method and system based on deep learning
CN110096950A (en) * 2019-03-20 2019-08-06 西北大学 A kind of multiple features fusion Activity recognition method based on key frame
WO2021056516A1 (en) * 2019-09-29 2021-04-01 深圳市大疆创新科技有限公司 Method and device for target detection, and movable platform
CN112163480A (en) * 2020-09-16 2021-01-01 北京邮电大学 Behavior identification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-dimensional adaptive 3D convolutional neural network for atomic action recognition; Gao Dapeng; Zhu Jiangang; Computer Engineering and Applications, (04), pp. 179-183 *

Also Published As

Publication number Publication date
CN113312966A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
JP7151016B2 (en) A Deep Machine Learning System for Cuboid Detection
CN111190981B (en) Method and device for constructing three-dimensional semantic map, electronic equipment and storage medium
CN113706699B (en) Data processing method and device, electronic equipment and computer readable storage medium
EP3710978A1 (en) Pose estimation and model retrieval for objects in images
Kraft et al. Birth of the object: Detection of objectness and extraction of object shape through object–action complexes
US11625646B2 (en) Method, system, and medium for identifying human behavior in a digital video using convolutional neural networks
CN109176512A (en) A kind of method, robot and the control device of motion sensing control robot
KR101347840B1 (en) Body gesture recognition method and apparatus
CN111402290A (en) Action restoration method and device based on skeleton key points
CN113312966B (en) Action recognition method and device based on first person viewing angle
CN113034652A (en) Virtual image driving method, device, equipment and storage medium
CN110648397A (en) Scene map generation method and device, storage medium and electronic equipment
CN113420719A (en) Method and device for generating motion capture data, electronic equipment and storage medium
US10970849B2 (en) Pose estimation and body tracking using an artificial neural network
CN111539983A (en) Moving object segmentation method and system based on depth image
CN110555383A (en) Gesture recognition method based on convolutional neural network and 3D estimation
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
EP3309713B1 (en) Method and device for interacting with virtual objects
CN110717384B (en) Video interactive behavior recognition method and device
CN110008873B (en) Facial expression capturing method, system and equipment
Kiyokawa et al. Efficient collection and automatic annotation of real-world object images by taking advantage of post-diminished multiple visual markers
CN117218713A (en) Action resolving method, device, equipment and storage medium
CN107025433B (en) Video event human concept learning method and device
CN114647361A (en) Touch screen object positioning method and device based on artificial intelligence
CN113780215A (en) Information processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant