CN117373116A - Human body action detection method based on lightweight characteristic reservation of graph neural network - Google Patents

Human body action detection method based on lightweight characteristic reservation of graph neural network

Info

Publication number
CN117373116A
Authority
CN
China
Prior art keywords
lightweight
neural network
human
model
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311302738.9A
Other languages
Chinese (zh)
Inventor
卫星
蒋文豪
翟琰
周浩伟
钟浩然
刘敏睿
夏炅
杨帆
赵冲
陆阳
毕翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202311302738.9A priority Critical patent/CN117373116A/en
Publication of CN117373116A publication Critical patent/CN117373116A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G06V 40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G06N 3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V 10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V 10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V 10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a human body action detection method based on lightweight feature preservation of a graph neural network, belonging to the technical fields of computer vision and artificial intelligence, and comprising the following steps: acquiring a human behavior video data set, and processing the human behavior video data set to obtain a skeleton diagram data set; constructing a backbone space-time diagram convolutional neural network, inputting the skeleton diagram data set into the backbone space-time diagram convolutional neural network for training, and then optimizing to obtain a lightweight motion recognition model; constructing a cyclic generation network, inputting the skeleton diagram data set into the cyclic generation network, training and optimizing to obtain an action recognition model that preserves important features; and fusing the lightweight motion recognition model and the action recognition model of the important features to obtain a human motion detection model, and detecting human motion based on the human motion detection model. Compared with current mainstream motion detection methods, the model provided by the invention achieves better performance.

Description

Human body action detection method based on lightweight characteristic reservation of graph neural network
Technical Field
The invention belongs to the field of computer vision and artificial intelligence, and particularly relates to a human body action detection method based on lightweight characteristic preservation of a graph neural network.
Background
In recent years, human motion and gesture recognition has become a popular research topic and remains a very challenging task in the field of computer vision. Human motion recognition mainly refers to recognizing the motion and action category of individual humans from visual information, such as the posture and movement of human bodies, through computer vision and machine learning algorithms, and classifying or describing them. Human motion recognition technology can be applied to many fields, such as intelligent monitoring, video content understanding, human-computer interaction, and virtual reality.
Human behavior is very complex and diverse, with extremely high dimensionality and variability. Human behavior in everyday life encompasses many aspects with many varying details, and various kinds of background noise and interference may occur in different environments and situations. Early studies focused on extracting handcrafted spatial and temporal features from skeleton sequences for human motion recognition; handcrafted methods can be broadly divided into joint-based and body-part-based methods, depending on the feature extraction technique used. These methods merely represent skeletal data as a sequence of vectors processed by an RNN and do not fully model the complex spatio-temporal configuration and correlation of body joints.
Most of these methods use only the feature information of the skeleton motion itself, and they suffer from sensitivity to ambient light, susceptibility to errors when recognizing actual actions, and loss of local information during feature extraction. Meanwhile, existing GCN-based network models have large parameter counts and computation costs and cannot be deployed on embedded devices and edge computing devices.
Disclosure of Invention
The invention aims to provide a human body action detection method based on lightweight characteristic preservation of a graph neural network, so as to solve the problems in the prior art.
In order to achieve the above object, the present invention provides a human motion detection method based on lightweight feature preservation of a graph neural network, comprising:
acquiring a human behavior video data set, and processing the human behavior video data set to acquire a skeleton diagram data set;
constructing a backbone space-time diagram convolutional neural network, inputting the skeleton diagram data set into the backbone space-time diagram convolutional neural network for training and then optimizing to obtain a lightweight motion recognition model;
constructing a cyclic generation network, inputting the skeleton diagram data set into the cyclic generation network for training and then optimizing to obtain an action recognition model of important characteristics;
and fusing the lightweight motion recognition model and the motion recognition model of the important features to obtain a human motion detection model, and detecting human motion based on the human motion detection model.
Preferably, the process of obtaining the skeleton map dataset includes:
processing the human behavior data set into a serialized picture set;
carrying out gesture estimation on the serialized picture set to obtain a human body joint point set and a joint point edge set;
and constructing a non-directional space-time diagram of the human body joint point set and the joint point edge set to obtain the skeleton diagram data set.
Preferably, the process of obtaining the lightweight motion recognition model includes:
constructing the backbone space-time diagram convolutional neural network, and inputting the skeleton diagram data set to a time domain and a space domain of the backbone space-time diagram convolutional neural network for training;
clipping and compressing the trained backbone space-time diagram convolutional neural network based on a singular value decomposition method to obtain a lightweight model;
and performing cyclic training on the lightweight model to obtain the lightweight motion recognition model.
Preferably, the expression of the total objective function for clipping and compressing the trained backbone space-time diagram convolutional neural network based on the singular value decomposition method is as follows:

$$\mathcal{L} = \mathcal{L}_T + \sum_{b=1}^{B}\left(\lambda_0\,\mathcal{L}_s\!\left(s^{(b)}\right) + \lambda_h\,\mathcal{L}_o\!\left(U^{(b)},V^{(b)}\right)\right)$$

wherein $\mathcal{L}_T$ is the training loss of the decomposed network hierarchy, $\mathcal{L}_s(s)$ is the sparsity loss function of the singular value vector $s$, and $\mathcal{L}_o(U,V)$ represents the orthogonal regularization applied to the matrix to be decomposed. $B$ is the total number of network hierarchies, and $\lambda_0$ and $\lambda_h$ are attenuation parameters. The decomposition variables $U$, $s$ and $V$, with $W = U\,\mathrm{diag}(s)\,V^{\top}$, replace the original convolution kernel or weight matrix $W \in \mathbb{R}^{c_{out}\times c_{in}k_1k_2}$ of a convolution layer whose number of input channels is $c_{in}$, number of output channels is $c_{out}$, and convolution kernel size is $k_1 \times k_2$; $j$ represents the rank of the matrices $U$ and $V$.
Preferably, the process of obtaining the motion recognition model of the important feature includes:
constructing a gating-based cyclic generation network, adding an attention mechanism to the cyclic generation network, and obtaining a GRU network with an attention mechanism;
and inputting the skeleton diagram data set into the GRU network with the added attention mechanism for training, and obtaining the action recognition model of the important features.
Preferably, the expression of the attention mechanism is:
$$e_{ij} = w_i \tanh\!\left(W_i h_{i-1} + V_i x_j + b_i\right)$$

$$a_{ij} = \frac{\exp\!\left(e_{ij}\right)}{\sum_{k=1}^{N} \exp\!\left(e_{ik}\right)}, \qquad S_i = \sum_{j=1}^{N} a_{ij}\, x_j$$

wherein $e_{ij}$ is the alignment score, i.e. the value of the attention probability distribution over each node determined by the hidden layer state vector at the $i$-th moment; $j$ represents the node sequence number, and $x_j$ represents the attention value of the $j$-th node. $a_{ij}$ is an intermediate quantity that normalizes the node values $x_j$ in a softmax-like manner, where $\sum_{k=1}^{N} \exp(e_{ik})$ represents the sum of all node distribution values at the $i$-th moment. $h_{i-1}$ represents the hidden state at the $(i-1)$-th moment; $w_i$, $W_i$, $V_i$ respectively represent the total weight coefficient matrix at the $i$-th moment and the weight coefficient matrices of the hidden state $h_{i-1}$ and the node value $x_j$; and $b_i$ indicates the offset at the corresponding moment. The important feature vector $S_i$ containing the node information at the $i$-th moment is calculated by the above formulas.
Preferably, the process of obtaining the human motion detection model includes:
the weight calculation is carried out on the feature vectors in the lightweight motion recognition model and the motion recognition model of the important features based on a gradient descent algorithm, so that lightweight vectors and important feature vectors are obtained;
calculating the lightweight vector and the important feature vector to obtain a fusion vector;
and calculating the fusion vector to obtain the human motion detection model.
Preferably, the process of obtaining the fusion vector includes:
splicing the lightweight vector and the important feature vector to obtain a spliced feature vector;
convolving the lightweight vector and the important feature vector based on a preset convolution kernel to obtain two new feature vectors, and adding the two new feature vectors to obtain a convolution feature vector;
and adding the spliced characteristic vector and the convolution characteristic vector, and then calculating to obtain the human motion detection model.
The invention has the technical effects that:
the application provides a human body action detection method based on lightweight characteristic preservation of a graph neural network. The model trained by the method is provided with a feature retaining module, and the important features in human body movement are retained through the gating unit and the attention mechanism, so that the problem of possible feature loss in general training is solved.
In the training of the neural network, the model in training is decomposed by using an SVD compression method, so that the complexity of the model is greatly reduced on the premise of ensuring the accuracy, and the model can be deployed on edge equipment for application.
The model is subjected to experiments on a data set disclosed in the current action detection field, and compared with the current mainstream action detection method, the proposed model has better performance, can be deployed into embedded equipment for application, and has strong robustness and generalization capability.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application, illustrate and explain the application and are not to be construed as limiting the application. In the drawings:
FIG. 1 is a block flow diagram of a human motion detection method based on lightweight feature preservation of a graph neural network in an embodiment of the invention;
FIG. 2 is a block diagram of the feature fusion module of step S4 according to an embodiment of the present invention;
FIG. 3 is a constructed spatiotemporal skeleton diagram in an embodiment of the invention;
FIG. 4 is an overall block diagram of the present method in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a singular value-based model compression module in an embodiment of the present invention;
fig. 6 is a schematic diagram of a feature preserving module based on a gating unit and an attention mechanism according to an embodiment of the present invention.
Detailed Description
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Example 1
The human body motion detection method disclosed by this embodiment will be described with reference to FIG. 1 and FIG. 4. The human body motion detection method based on the lightweight feature preservation of the graph neural network comprises the following steps:
step S1, acquiring a human behavior video data set with a category label, and preprocessing the video data set to obtain a skeleton diagram.
And S2, inputting skeleton information into a backbone space-time diagram convolutional neural network for training, extracting general features of human body movement, decomposing the vectors of the convolution kernels by using a singular value decomposition-based method, and compressing the model.
And S3, inputting the human skeleton information into a feature preservation module based on the gating unit for training.
And S4, fusing the obtained two feature vectors with different channel numbers by a novel feature fusion method, and distinguishing the importance of different features by using weight parameters to obtain a model with stronger expression capability.
In step S1, the data set required for training should include RGB images and depth images, basic information such as the height and age of each subject, and key point labeling information, and the data set should cover daily human activities.
It should be noted that the PP-TinyPose human body pose estimation algorithm is used to obtain a dynamic skeleton diagram from the input video, extracting 18 joint points of the human body. A spatio-temporal graph forms a hierarchical representation of the skeleton sequence: an undirected space-time graph G = (V, E) is constructed on a skeleton sequence with N joint points and T frames, where V represents the set of joint nodes and E represents the set of edges connecting the nodes. Each node is represented as a triplet with three dimensions (x, y, t), where x and y are the position of the joint in space and t is the frame index of the joint on the time axis. The space-time graph models the spatial and temporal relationships between joint points in the skeleton sequence and provides the basis for the subsequent action recognition and pose estimation tasks. In this undirected space-time graph, the node set V = {v_ti | t = 1, ..., T; i = 1, ..., N} contains all nodes in the skeleton sequence, and node v_ti comprises the coordinate vector and the estimated confidence of the i-th joint point in the t-th frame. The edge set E consists of two parts: one part, E_S, represents the physical spatial connections between joint points within a single frame, so that (v_ti, v_tj) ∈ E_S when joints i and j are connected in the same frame; the other part, E_T, represents connections in the time dimension between the same joint point in adjacent frames, so that (v_ti, v_(t+1)i) ∈ E_T. As shown in FIG. 3, the constructed space-time skeleton diagram can then be used for training of the graph neural network.
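To make the construction concrete, the following is a minimal Python sketch of assembling the node set V and the edge sets E_S and E_T from a pose-estimated keypoint sequence. The array shapes, the 18-joint bone list, and all names are illustrative assumptions of this sketch, not details specified by the patent.

import numpy as np

# Assumed 18-joint topology in OpenPose-style order (illustrative only).
BONES = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7), (1, 8),
         (8, 9), (9, 10), (1, 11), (11, 12), (12, 13), (0, 14), (14, 16),
         (0, 15), (15, 17)]

def build_spatiotemporal_graph(keypoints):
    """keypoints: array of shape (T, N, 3) holding (x, y, confidence) per joint.

    Returns the node triplets (x, y, t) and the undirected edge list
    E = E_S (intra-frame bones) + E_T (same joint across adjacent frames).
    """
    T, N, _ = keypoints.shape

    def node_id(t, i):
        return t * N + i

    # Node v_ti: spatial position of joint i in frame t plus the frame index t.
    V = [(keypoints[t, i, 0], keypoints[t, i, 1], t)
         for t in range(T) for i in range(N)]
    # E_S: physical/spatial connections between joints within a single frame.
    E_S = [(node_id(t, i), node_id(t, j)) for t in range(T) for i, j in BONES]
    # E_T: the same joint connected between adjacent frames.
    E_T = [(node_id(t, i), node_id(t + 1, i))
           for t in range(T - 1) for i in range(N)]
    return V, E_S + E_T

# Example: a random 30-frame sequence of 18 joints.
V, E = build_spatiotemporal_graph(np.random.rand(30, 18, 3))
print(len(V), len(E))  # 540 nodes, 1032 edges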
After all the motion skeleton diagrams in the video data set are processed, executing step 2 to train the backbone network so as to obtain a lightweight motion detection model, wherein the step specifically comprises the following steps:
Step S201, the obtained human skeleton information is sent to the multi-modal graph neural network ST-GCN for training, and learning is performed on the time domain and the space domain respectively;
step S202, after each convolutional network training is completed, a model is cut by a singular value decomposition-based method, and parameters and calculated amount of the model are reduced;
Referring to FIG. 5, in the compression process of the model, a convolution layer whose number of input channels is $c_{in}$, number of output channels is $c_{out}$, and convolution kernel size is $k_1 \times k_2$ can be interpreted as a linear layer $W \in \mathbb{R}^{c_{out} \times c_{in} k_1 k_2}$. The corresponding rank-$j$ approximation replaces it with two successive linear layers shaped $\mathbb{R}^{j \times c_{in} k_1 k_2}$ and $\mathbb{R}^{c_{out} \times j}$. Mapped back to convolutions, this corresponds to a $k_1 \times k_2$ convolution kernel with $j$ output channels followed by a $1 \times 1$ convolution kernel with $c_{out}$ output channels.
To obtain accuracy similar to that of the original model, a full-rank SVD is applied to $W$; adding orthogonal regularization to the training process enables the decomposition to be performed more easily. Decomposing $W$ yields $U$, $s$ and $V$ with $W = U\,\mathrm{diag}(s)\,V^{\top}$, and the weight matrix can be directly reconstructed from $U$, $s$ and $V$. The regularization loss in the training process is shown in the following formula:

$$\mathcal{L}_o(U, V) = \left\| U^{\top} U - I \right\|_F + \left\| V^{\top} V - I \right\|_F$$

wherein $\|\cdot\|_F$ represents the F-norm, $j$ represents the rank of the matrices $U$ and $V$, and $I$ represents the $j \times j$ identity matrix.
It should be noted that in the SVD training process, each hierarchy uses the decomposition variables $U$, $s$ and $V$ instead of the original convolution kernel or weight matrix. Forward transfer is accomplished by converting $U\,\mathrm{diag}(s)\,V^{\top}$ into two successive network layers, while reverse transfer and optimization are performed directly on $U$, $s$ and $V$. When the singular vector matrices $U$ and $V$ are orthogonal, reducing the rank of the decomposition network amounts to making the singular value vector $s$ of each network layer as sparse as possible. The sparsity loss during training is shown in the following formula:

$$\mathcal{L}_s(s) = \|s\|_1$$
based on the above analysis we propose a total objective function for the decomposition training:
wherein the method comprises the steps ofIs the training loss to resolve the network hierarchy, and B is the total number of network hierarchies. Lambda (lambda) 0 And lambda (lambda) h Is a decay parameter, a trade-off can be made between accuracy and parameter quantity,to obtain a low rank model.
Wherein,is to decompose training loss of network hierarchy, +.>Is the sparsity loss function of the vector s,representing an orthogonal regularization process on the matrix to be decomposed. B is the total number of network hierarchies, lambda 0 And lambda (lambda) h Is an attenuation parameter->s is a decomposition variable representing the original convolution kernel or weight matrix, where +.> The number of input channels of the convolution layer is +.>The number of output channels is +.>Convolution kernel size k 1 ×k 2 J represents the matrix +.>Andis a rank of (c).
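As an illustration of how this decomposition training could look in code, the following PyTorch sketch keeps one convolution layer in the factored form W = U diag(s) V^T, runs the forward pass as two successive convolutions, and combines the task loss with the orthogonality and sparsity terms above. The class, the L1 form of the sparsity loss, and all dimensions are assumptions of this sketch rather than details fixed by the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SVDConv2d(nn.Module):
    """Conv layer kept in decomposed form W = U diag(s) V^T during training."""
    def __init__(self, c_in, c_out, k, rank):
        super().__init__()
        d = c_in * k * k
        self.k, self.c_in = k, c_in
        self.U = nn.Parameter(torch.randn(c_out, rank) * 0.1)  # c_out x j
        self.s = nn.Parameter(torch.ones(rank))                # singular values
        self.V = nn.Parameter(torch.randn(d, rank) * 0.1)      # (c_in*k*k) x j

    def forward(self, x):
        # Two successive layers: a k x k conv (rows of diag(s) V^T), then a 1 x 1 conv (U).
        Vt = (self.V * self.s).t().reshape(-1, self.c_in, self.k, self.k)
        h = F.conv2d(x, Vt, padding=self.k // 2)          # j channels
        return F.conv2d(h, self.U[:, :, None, None])      # c_out channels

    def orth_loss(self):
        # ||U^T U - I||_F + ||V^T V - I||_F
        I = torch.eye(self.U.shape[1], device=self.U.device)
        return (torch.norm(self.U.t() @ self.U - I) +
                torch.norm(self.V.t() @ self.V - I))

layer = SVDConv2d(c_in=3, c_out=64, k=3, rank=16)
x = torch.randn(2, 3, 32, 32)
task_loss = layer(x).mean()             # stand-in for the training loss L_T
lam0, lamh = 1e-4, 1e-3                 # attenuation parameters (illustrative)
loss = task_loss + lam0 * layer.s.abs().sum() + lamh * layer.orth_loss()
loss.backward()                         # optimize U, s, V directly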
Repeating the steps S201 and S202, circularly training until the model learns general characteristics of skeleton actions, and obtaining a lightweight action recognition model by a singular value decomposition method.
Step S3 is performed to preserve some features that may be ignored by the neural network by a gating cell based approach. Wherein, step S3 includes:
step S301: the input skeleton graph is processed using a gating-based loop generation network GRU.
Step S302: and adding an attention mechanism in the GRU network, and distributing different weights to the input at each moment when the model processes the input sequence, so that the model pays attention to a specific part in the skeleton sequence, and an action recognition model with the important characteristics reserved is obtained.
It should be noted that, referring to fig. 6, in step S301, the GRU has two gating units: an update gate and a reset gate. The update gate controls the degree of update between the input features at the current time and the hidden states at the previous time, thereby helping the network to better capture the long-term dependencies of the nodes in the skeleton graph. The reset gate then controls whether the hidden state should be reset to the original state, thereby helping the network to better handle different skeleton sequences. The procedure of feature preservation obeys the following calculation formula:
$$z_t = \sigma\!\left(W_{xz} f + W_{hz} h_{t-1}\right)$$

$$r_t = \sigma\!\left(W_{xr} f + W_{hr} h_{t-1}\right)$$

$$h_t' = \tanh\!\left(W_{hx} f + r_t \odot W_{hh} h_{t-1}\right)$$

wherein the $W$ terms represent weight matrices, $z_t$ represents the update gate, $r_t$ represents the reset gate, and $\sigma$ represents the sigmoid function, by which data is transformed into a value in the range 0-1 that acts as a gating signal. $f$ is the input feature, and $h_{t-1}$ denotes the hidden layer state at time $t-1$, which includes the past information. $h_t'$ is the candidate hidden layer state: when $r_t$ approaches zero, the model discards the past hidden information and keeps only the currently entered information; when $r_t$ approaches 1, the past information is considered useful and is added to the current information.
Note that, in step S302, the calculation formula of the attention mechanism added for the GRU is as follows:
$$e_{ij} = w_i \tanh\!\left(W_i h_{i-1} + V_i x_j + b_i\right)$$

$$a_{ij} = \frac{\exp\!\left(e_{ij}\right)}{\sum_{k=1}^{N} \exp\!\left(e_{ik}\right)}, \qquad S_i = \sum_{j=1}^{N} a_{ij}\, x_j$$

wherein $e_{ij}$ is the alignment score, i.e. the value of the attention probability distribution over each node determined by the hidden layer state vector at the $i$-th moment; $j$ represents the node sequence number, and $x_j$ represents the attention value of the $j$-th node. $a_{ij}$ is an intermediate quantity that normalizes the node values $x_j$ in a softmax-like manner, where $\sum_{k=1}^{N} \exp(e_{ik})$ represents the sum of all node distribution values at the $i$-th moment. $h_{i-1}$ represents the hidden state at the $(i-1)$-th moment; $w_i$, $W_i$, $V_i$ respectively represent the total weight coefficient matrix at the $i$-th moment and the weight coefficient matrices of the hidden state $h_{i-1}$ and the node value $x_j$; and $b_i$ indicates the offset at the corresponding moment. The important feature vector $S_i$ containing the node information at the $i$-th moment is calculated by the above formulas.
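The following PyTorch sketch illustrates the feature preservation module: additive attention over the skeleton nodes of one frame produces the important feature vector S_i, which then drives one GRU step. The dimensions and class names are illustrative assumptions, and the final update h_t = (1 - z_t) * h_{t-1} + z_t * h_t' is the standard GRU blend implied by the update gate rather than a formula given explicitly in the text.

import torch
import torch.nn as nn

class GRUAttention(nn.Module):
    """One GRU step plus additive attention over skeleton nodes (illustrative)."""
    def __init__(self, feat_dim, hidden_dim):
        super().__init__()
        self.zx, self.zh = nn.Linear(feat_dim, hidden_dim), nn.Linear(hidden_dim, hidden_dim)
        self.rx, self.rh = nn.Linear(feat_dim, hidden_dim), nn.Linear(hidden_dim, hidden_dim)
        self.hx, self.hh = nn.Linear(feat_dim, hidden_dim), nn.Linear(hidden_dim, hidden_dim)
        # Additive attention: e_ij = w^T tanh(W h_{i-1} + V x_j + b)
        self.W = nn.Linear(hidden_dim, hidden_dim)
        self.V = nn.Linear(feat_dim, hidden_dim)
        self.w = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, x, h_prev):
        # x: (N, feat_dim) node features of one frame; h_prev: (hidden_dim,)
        e = self.w(torch.tanh(self.W(h_prev) + self.V(x))).squeeze(-1)  # scores e_ij
        a = torch.softmax(e, dim=0)                  # attention weights a_ij
        S = (a.unsqueeze(-1) * x).sum(dim=0)         # important feature vector S_i
        z = torch.sigmoid(self.zx(S) + self.zh(h_prev))   # update gate z_t
        r = torch.sigmoid(self.rx(S) + self.rh(h_prev))   # reset gate r_t
        h_cand = torch.tanh(self.hx(S) + r * self.hh(h_prev))
        h = (1 - z) * h_prev + z * h_cand            # standard GRU blend
        return h, S

cell = GRUAttention(feat_dim=3, hidden_dim=64)
h = torch.zeros(64)
for frame in torch.randn(30, 18, 3):                 # T=30 frames, N=18 joints
    h, S = cell(frame, h)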
Finally, after the models learn the general features and the key retention features of the human actions in step S2 and step S3, the two models are fused in step S4, please refer to fig. 2, and step S4 includes the following steps:
in step S401, two feature vectors are respectively assigned with different weights lambda and 1-lambda by means of channel weight fusion, wherein lambda is a parameter automatically updated by a gradient descent algorithm.
Step S402, splicing the two feature vectors obtained in the step S401 in the channel dimension to obtain a new feature vector.
Step S403, the two input feature vectors are each convolved with a convolution kernel of preset dimensions to generate two new feature vectors, and the two new feature vectors are added.
And step S404, adding the feature vectors obtained in the step S402 and the step S403 to finally obtain a lightweight human motion detection model with the feature reserved.
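A minimal PyTorch sketch of this fusion is given below: the two feature maps are weighted by a learnable lambda and 1 - lambda, spliced along the channel dimension, convolved with kernels of preset size (assumed 1x1 here), and summed. The 1x1 projection after the splice is an added assumption of this sketch so that the two branches have matching shapes for the final addition.

import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuses the lightweight features with the preserved important features."""
    def __init__(self, c_light, c_keep, c_out):
        super().__init__()
        # lambda is updated by gradient descent like any other parameter (S401)
        self.lam = nn.Parameter(torch.tensor(0.5))
        # Preset convolution kernels (assumed 1x1) projecting to a common width
        self.conv_light = nn.Conv2d(c_light, c_out, kernel_size=1)
        self.conv_keep = nn.Conv2d(c_keep, c_out, kernel_size=1)
        self.conv_cat = nn.Conv2d(c_light + c_keep, c_out, kernel_size=1)

    def forward(self, f_light, f_keep):
        f_light = self.lam * f_light           # weight lambda
        f_keep = (1 - self.lam) * f_keep       # weight 1 - lambda
        cat = self.conv_cat(torch.cat([f_light, f_keep], dim=1))       # S402: splice
        conv_sum = self.conv_light(f_light) + self.conv_keep(f_keep)   # S403
        return cat + conv_sum                  # S404: final fused features

fuse = FeatureFusion(c_light=256, c_keep=64, c_out=128)
out = fuse(torch.randn(1, 256, 18, 30), torch.randn(1, 64, 18, 30))
print(out.shape)  # torch.Size([1, 128, 18, 30])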
In summary, the invention discloses a human motion detection method with lightweight feature preservation, comprising: preprocessing a human motion video data set to obtain a space-time skeleton diagram of the motion; clipping the model trained by the backbone network using a singular value decomposition-based method, obtaining a lightweight model on the premise of ensuring accuracy; preventing the loss of important features during training by using the feature preservation module with a gating unit and an attention mechanism; and splicing models with the same dimension but different channel numbers by a channel weight fusion method to fuse the models. The model is tested on several mainstream public human body action detection data sets and on a human body action recognition data set in a locomotive driving scenario, and obtains better results than current mainstream models; it can also be deployed on edge embedded devices and obtain good detection results, which demonstrates that the model has high performance and good applicability.
Example two
This embodiment provides a human body action detection method based on lightweight feature preservation of a graph neural network, comprising the following steps:
step S1, using a human behavior video data set with a category label, wherein the data set contains RGB images and depth images, basic information such as the height and age of a subject and key point labeling information, and the data set covers human daily activity related actions. Preprocessing the video data set to obtain a skeleton diagram.
And S2, inputting human skeleton information into a backbone space-time diagram convolutional neural network for training, extracting general characteristics of human motion, decomposing the vectors of the convolution kernels using a singular value decomposition-based method and compressing them, effectively reducing the complexity of the model on the premise of ensuring accuracy.
And S3, inputting the human skeleton information into a feature preservation module based on a gating unit for training. The feature preservation module automatically selects the feature vectors to be retained and the feature vectors to be discarded through the reset gate and the update gate in the gating unit, and preserves the important features in the skeleton data through continuous iteration.
And S4, fusing the obtained two feature vectors with different channel numbers by a novel feature fusion method, and distinguishing the importance of different features by using weight parameters to obtain a model with stronger expression capability.
Further optimizing the scheme, the step S1 comprises the following steps:
step S101, data preprocessing: processing human body action videos into a serialized picture set by taking frames as units, carrying out gesture estimation on each frame of picture by using a PP-TinyPose algorithm, extracting 18 joints of a human body, and constructing a non-directional space-time diagram G= (V, E) of a skeleton sequence, wherein V represents a joint set and V= { V ti T=1, …, T, i=1, …, N }. E denotes the set of edges connecting these nodes, each node being represented as a triplet having three dimensions (x, y, t), where x and y are the positions of the joint in space and t is the frame index of the joint on the time axis.
Further optimizing the scheme, the step S2 comprises the following steps:
step S201: and sending the obtained human skeleton information to a multi-modal-diagram neural network ST-GCN for training, and respectively learning in a time domain and a space domain.
Step S202: after each convolutional network training is completed, a singular value decomposition-based method is used for cutting the model, and parameters and calculated amount of the model are reduced.
Step S203: and repeating the steps S201 and S202, and circularly training until the model learns general characteristics of skeleton actions, thereby obtaining the lightweight action recognition model.
Further optimizing the scheme, the step S3 comprises the following steps:
step S301: the input skeleton graph is processed using a gating-based loop generation network GRU.
Step S302: and adding an attention mechanism in the GRU network, and distributing different weights to the input at each moment when the model processes the input sequence, so that the model pays attention to a specific part in the skeleton sequence, and an action recognition model with the important characteristics reserved is obtained.
Further optimizing the scheme, the step S4 comprises the following steps:
step S401: and respectively giving different weights lambda and 1-lambda to the two feature vectors in a channel weight fusion mode, wherein lambda is a parameter automatically updated by a gradient descent algorithm.
Step S402: and (3) splicing the two feature vectors obtained in the step (S401) in the channel dimension to obtain a new feature vector.
Step S403: the size of the used dimension isIs convolved with respect to the two eigenvectors of the input to generate two new eigenvectors, and adds the two eigenvectors.
Step S404: the feature vectors obtained in step S402 and step S403 are added.
The model compression method based on singular value decomposition is specifically as follows:
The weight matrix of the convolution kernel is decomposed into the product of three matrices, and gradient updates are carried out directly on these three matrices during fine-tuning. For a convolution layer whose number of input channels is $c_{in}$, number of output channels is $c_{out}$, and convolution kernel size is $k_1 \times k_2$, the layer can be interpreted as a linear layer $W \in \mathbb{R}^{c_{out} \times c_{in} k_1 k_2}$; the corresponding rank-$j$ approximation takes the shape of two linear layers $\mathbb{R}^{j \times c_{in} k_1 k_2}$ and $\mathbb{R}^{c_{out} \times j}$. Mapping back to convolutions, this corresponds to a $k_1 \times k_2$ convolution followed by a $1 \times 1$ convolution. To avoid repeating the SVD operation at every step, a full-rank SVD is carried out on $W$ to obtain $U$, $s$ and $V$ with $W = U\,\mathrm{diag}(s)\,V^{\top}$; at the same time, orthogonal regularization is introduced to address the problem that matrix orthogonality is not easily maintained in the decomposition, as shown in the formula:

$$\mathcal{L}_o(U, V) = \left\| U^{\top} U - I \right\|_F + \left\| V^{\top} V - I \right\|_F$$

wherein $\|\cdot\|_F$ represents the F-norm, $j$ represents the rank of the matrices $U$ and $V$, and $I$ represents the $j \times j$ identity matrix.
When the singular vector matrices $U$ and $V$ are orthogonal, reducing the rank of the decomposition network is equivalent to making the singular value vector $s$ of each network layer as sparse as possible, so the sparsity is represented by the following formula:

$$\mathcal{L}_s(s) = \|s\|_1$$
the resulting overall objective function of the decomposition training is:
wherein the method comprises the steps ofIs the training loss to resolve the network hierarchy, and B is the total number of network hierarchies. Lambda (lambda) 0 And lambda (lambda) h Is a fading parameter, and can be weighted between accuracy and parameter quantity to obtain low rankAnd (5) a model. Finally, a lightweight graph rolling network action recognition model is obtained after multiple times of cyclic training.
Further optimizing the scheme, the skeleton information processing network based on the gating unit includes:
the GRU has two gating units: an update gate (update gate) and a reset gate (reset gate). The update gate controls the degree of update between the input features at the current time and the hidden states at the previous time, thereby helping the network to better capture the long-term dependencies of the nodes in the skeleton graph. The reset gate then controls whether the hidden state should be reset to the original state, thereby helping the network to better handle different skeleton sequences. The specific calculation rules follow the following formula:
$$z_t = \sigma\!\left(W_{xz} f + W_{hz} h_{t-1}\right)$$

$$r_t = \sigma\!\left(W_{xr} f + W_{hr} h_{t-1}\right)$$

$$h_t' = \tanh\!\left(W_{hx} f + r_t \odot W_{hh} h_{t-1}\right)$$

wherein the $W$ terms represent weight matrices, $z_t$ represents the update gate, $r_t$ represents the reset gate, and $\sigma$ represents the sigmoid function, by which data is transformed into a value in the range 0-1 that acts as a gating signal. $f$ is the input feature, and $h_{t-1}$ denotes the hidden layer state at time $t-1$, which includes the past information. $h_t'$ is the candidate hidden layer state: when $r_t$ approaches zero, the model discards the past hidden information and keeps only the currently entered information; when $r_t$ approaches 1, the past information is considered useful and is added to the current information. By using the update and reset gates, the GRU can effectively retain the important features of the input, thereby improving the performance and applicability of the model.
Further optimizing the scheme, the method for introducing the attention mechanism into the GRU network is as follows:
the model assigns a different weight to each input at each time instant when processing the input sequence, which tells the model which parts of the input skeleton sequence should be focused on, thereby extracting useful feature information more efficiently. Wherein the calculation formula of the attention mechanism is as follows:
$$e_{ij} = w_i \tanh\!\left(W_i h_{i-1} + V_i x_j + b_i\right)$$

$$a_{ij} = \frac{\exp\!\left(e_{ij}\right)}{\sum_{k=1}^{N} \exp\!\left(e_{ik}\right)}, \qquad S_i = \sum_{j=1}^{N} a_{ij}\, x_j$$

wherein $e_{ij}$ is the alignment score, i.e. the value of the attention probability distribution over each node determined by the hidden layer state vector at the $i$-th moment; $j$ represents the node sequence number, and $x_j$ represents the attention value of the $j$-th node. $a_{ij}$ is an intermediate quantity that normalizes the node values $x_j$ in a softmax-like manner, where $\sum_{k=1}^{N} \exp(e_{ik})$ represents the sum of all node distribution values at the $i$-th moment. $h_{i-1}$ represents the hidden state at the $(i-1)$-th moment; $w_i$, $W_i$, $V_i$ respectively represent the total weight coefficient matrix at the $i$-th moment and the weight coefficient matrices of the hidden state $h_{i-1}$ and the node value $x_j$; and $b_i$ indicates the offset at the corresponding moment. The important feature vector $S_i$ containing the node information at the $i$-th moment is calculated by the above formulas.
The foregoing is merely a preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. The human body action detection method based on the lightweight feature preservation of the graph neural network is characterized by comprising the following steps of:
acquiring a human behavior video data set, and processing the human behavior video data set to acquire a skeleton diagram data set;
constructing a backbone space-time diagram convolutional neural network, inputting the skeleton diagram data set into the backbone space-time diagram convolutional neural network for training and then optimizing to obtain a lightweight motion recognition model;
constructing a cyclic generation network, inputting the skeleton diagram data set into the cyclic generation network for training and then optimizing to obtain an action recognition model of important characteristics;
and fusing the lightweight motion recognition model and the motion recognition model of the important features to obtain a human motion detection model, and detecting human motion based on the human motion detection model.
2. The human action detection method based on lightweight feature preservation of a graph neural network of claim 1, wherein the process of obtaining a skeleton graph dataset comprises:
processing the human behavior data set into a serialized picture set;
carrying out gesture estimation on the serialized picture set to obtain a human body joint point set and a joint point edge set;
and constructing a non-directional space-time diagram of the human body joint point set and the joint point edge set to obtain the skeleton diagram data set.
3. The human motion detection method based on lightweight feature preservation of a graph neural network of claim 1, wherein the process of obtaining a lightweight motion recognition model comprises:
constructing the backbone space-time diagram convolutional neural network, and inputting the skeleton diagram data set to a time domain and a space domain of the backbone space-time diagram convolutional neural network for training;
clipping and compressing the trained backbone space-time diagram convolutional neural network based on a singular value decomposition method to obtain a lightweight model;
and performing cyclic training on the lightweight model to obtain the lightweight motion recognition model.
4. The human motion detection method based on the lightweight feature preservation of the graph neural network according to claim 3, wherein the expression of the total objective function for clipping and compressing the trained backbone space-time diagram convolutional neural network based on the singular value decomposition method is:

$$\mathcal{L} = \mathcal{L}_T + \sum_{b=1}^{B}\left(\lambda_0\,\mathcal{L}_s\!\left(s^{(b)}\right) + \lambda_h\,\mathcal{L}_o\!\left(U^{(b)},V^{(b)}\right)\right)$$

wherein $\mathcal{L}$ represents the overall objective function of clipping and compression; $\mathcal{L}_T$ is the training loss of the decomposed network hierarchy; $\mathcal{L}_s(s)$ is the sparsity loss function of the vector $s$; $\mathcal{L}_o(U,V)$ represents the orthogonal regularization applied to the matrix to be decomposed; $B$ is the total number of network hierarchies; $\lambda_0$ and $\lambda_h$ are attenuation parameters; $s$ is the decomposition variable replacing the original convolution kernel or weight matrix $W = U\,\mathrm{diag}(s)\,V^{\top}$, where $W \in \mathbb{R}^{c_{out}\times c_{in}k_1k_2}$ for a convolution layer whose number of input channels is $c_{in}$, number of output channels is $c_{out}$, and convolution kernel size is $k_1 \times k_2$; and $j$ represents the rank of the matrices $U$ and $V$.
5. The human motion detection method based on lightweight feature preservation of a graph neural network according to claim 1, wherein the process of obtaining the motion recognition model of important features comprises:
constructing a gating-based cyclic generation network, adding an attention mechanism to the cyclic generation network, and obtaining a GRU network with an attention mechanism;
and inputting the skeleton diagram data set into the GRU network with the added attention mechanism for training, and obtaining the action recognition model of the important features.
6. The human motion detection method based on lightweight feature preservation of a graph neural network of claim 5, wherein the expression of the attention mechanism is:
$$e_{ij} = w_i \tanh\!\left(W_i h_{i-1} + V_i x_j + b_i\right)$$

$$a_{ij} = \frac{\exp\!\left(e_{ij}\right)}{\sum_{k=1}^{N} \exp\!\left(e_{ik}\right)}, \qquad S_i = \sum_{j=1}^{N} a_{ij}\, x_j$$

wherein $e_{ij}$ is the alignment score, i.e. the value of the attention probability distribution over each node determined by the hidden layer state vector at the $i$-th moment; $j$ represents the node sequence number; $x_j$ represents the attention value of the $j$-th node; $a_{ij}$ is an intermediate quantity that normalizes the node values $x_j$ in a softmax-like manner, wherein $\sum_{k=1}^{N} \exp(e_{ik})$ represents the sum of all node distribution values at the $i$-th moment; $h_{i-1}$ represents the hidden state at the $(i-1)$-th moment; $w_i$, $W_i$, $V_i$ respectively represent the total weight coefficient matrix at the $i$-th moment and the weight coefficient matrices of the hidden state $h_{i-1}$ and the node value $x_j$; $b_i$ represents the offset of the corresponding moment; and the important feature vector $S_i$ containing the node information at the $i$-th moment is calculated by the above formulas.
7. The human motion detection method based on lightweight feature preservation of a graph neural network of claim 1, wherein the process of obtaining a human motion detection model comprises:
the weight calculation is carried out on the feature vectors in the lightweight motion recognition model and the motion recognition model of the important features based on a gradient descent algorithm, so that lightweight vectors and important feature vectors are obtained;
calculating the lightweight vector and the important feature vector to obtain a fusion vector;
and calculating the fusion vector to obtain the human motion detection model.
8. The human motion detection method based on lightweight feature preservation of a graph neural network of claim 7, wherein the process of obtaining a fusion vector comprises:
splicing the lightweight vector and the important feature vector to obtain a spliced feature vector;
convolving the lightweight vector and the important feature vector based on a preset convolution kernel to obtain two new feature vectors, and adding the two new feature vectors to obtain a convolution feature vector;
and adding the spliced characteristic vector and the convolution characteristic vector, and then calculating to obtain the human motion detection model.
CN202311302738.9A 2023-10-10 2023-10-10 Human body action detection method based on lightweight characteristic reservation of graph neural network Pending CN117373116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311302738.9A CN117373116A (en) 2023-10-10 2023-10-10 Human body action detection method based on lightweight characteristic reservation of graph neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311302738.9A CN117373116A (en) 2023-10-10 2023-10-10 Human body action detection method based on lightweight characteristic reservation of graph neural network

Publications (1)

Publication Number Publication Date
CN117373116A true CN117373116A (en) 2024-01-09

Family

ID=89397477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311302738.9A Pending CN117373116A (en) 2023-10-10 2023-10-10 Human body action detection method based on lightweight characteristic reservation of graph neural network

Country Status (1)

Country Link
CN (1) CN117373116A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894115A (en) * 2023-06-12 2023-10-17 国网湖北省电力有限公司经济技术研究院 Automatic archiving method for power grid infrastructure files
CN116894115B (en) * 2023-06-12 2024-05-24 国网湖北省电力有限公司经济技术研究院 Automatic archiving method for power grid infrastructure files

Similar Documents

Publication Publication Date Title
Zheng et al. Unsupervised representation learning with long-term dynamics for skeleton based action recognition
CN107609460B (en) Human body behavior recognition method integrating space-time dual network flow and attention mechanism
CN107492121B (en) Two-dimensional human body bone point positioning method of monocular depth video
CN112597883B (en) Human skeleton action recognition method based on generalized graph convolution and reinforcement learning
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN112395945A (en) Graph volume behavior identification method and device based on skeletal joint points
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN113283298B (en) Real-time behavior identification method based on time attention mechanism and double-current network
CN111461063B (en) Behavior identification method based on graph convolution and capsule neural network
CN114529984A (en) Bone action recognition method based on learnable PL-GCN and ECLSTM
CN111881802B (en) Traffic police gesture recognition method based on double-branch space-time graph convolutional network
CN112036276A (en) Artificial intelligent video question-answering method
CN114821640A (en) Skeleton action identification method based on multi-stream multi-scale expansion space-time diagram convolution network
CN117373116A (en) Human body action detection method based on lightweight characteristic reservation of graph neural network
CN113688765B (en) Action recognition method of self-adaptive graph rolling network based on attention mechanism
CN114220154A (en) Micro-expression feature extraction and identification method based on deep learning
CN111723667A (en) Human body joint point coordinate-based intelligent lamp pole crowd behavior identification method and device
CN107341471B (en) A kind of Human bodys' response method based on Bilayer condition random field
Yin et al. Msa-gcn: Multiscale adaptive graph convolution network for gait emotion recognition
CN116246338B (en) Behavior recognition method based on graph convolution and transducer composite neural network
CN117115911A (en) Hypergraph learning action recognition system based on attention mechanism
CN116453025A (en) Volleyball match group behavior identification method integrating space-time information in frame-missing environment
CN116148864A (en) Radar echo extrapolation method based on DyConvGRU and Unet prediction refinement structure
CN115798055A (en) Violent behavior detection method based on corersort tracking algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination