CN111401106B - Behavior identification method, device and equipment

Behavior identification method, device and equipment

Info

Publication number: CN111401106B
Application number: CN201910000953.0A
Authority: CN (China)
Prior art keywords: network, identified, behavior, training, frame
Legal status: Active (the status listed is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN111401106A
Inventor: 丁晓璐
Assignees: China Mobile Communications Group Co Ltd; China Mobile Communications Ltd Research Institute
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Application granted; published as CN111401106A (application) and CN111401106B (grant)

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
            • G06V40/20: Movements or behaviour, e.g. gesture recognition
          • G06V20/00: Scenes; Scene-specific elements
            • G06V20/40: Scenes; Scene-specific elements in video content
              • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
              • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00: Computing arrangements based on biological models
            • G06N3/02: Neural networks
              • G06N3/04: Architecture, e.g. interconnection topology
                • G06N3/045: Combinations of networks
              • G06N3/08: Learning methods


Abstract

The invention provides a behavior identification method, a behavior identification device and behavior identification equipment, wherein the behavior identification method comprises the following steps: acquiring key frame sequence data to be identified from skeleton sequence data to be identified; and identifying, by using a space-time graph convolutional network, the behavior action corresponding to the key frame sequence data to be identified. By acquiring the key frame sequence data from the skeleton sequence data, the scheme reduces interference from redundant information and the workload of behavior identification; by mining the time-domain and space-domain characteristics of the skeleton sequence simultaneously, it reduces model design complexity and improves identification efficiency and accuracy.

Description

Behavior identification method, device and equipment
Technical Field
The invention relates to the technical field of human behavior recognition, in particular to a behavior recognition method, a behavior recognition device and behavior recognition equipment.
Background
The following behavior recognition schemes mainly exist in the prior art:
In the first scheme, video behavior recognition: video image data are taken directly as input, and behavior recognition is performed with a deep learning method. This can be roughly regarded as classifying each frame of a multi-frame image sequence and deriving a behavior recognition result from the classification results of all frames.
The second scheme, skeleton sequence behavior recognition: human skeleton nodes are robust to illumination and viewing-angle changes, involve less data and consume fewer computing resources. As device precision improves, more accurate skeleton-node coordinates can be obtained through a depth camera or a motion capture system, and using the skeleton sequence as the input of a deep network can improve the recognition effect.
The third scheme, behavior recognition based on an RNN (Recurrent Neural Network): owing to the properties of the RNN itself, behavior recognition using an RNN does not require separately modeling the time dimension. Existing schemes are mainly based on the LSTM (Long Short-Term Memory) network, and recognition accuracy has been continuously improved by introducing trust gate mechanisms, adding attention models, and the like.
The fourth scheme, behavior recognition based on a CNN (Convolutional Neural Network): the input of a CNN is generally Euclidean-structured data arranged in matrix form. In the existing literature, the 2D CNN used for image classification has been extended to a 3D CNN for video recognition, and methods such as video segmentation and multitask parallel computation are adopted to achieve good effects.
However, the above four behavior recognition schemes have the following disadvantages:
for the first scheme, video behavior recognition: the calculation amount is huge, about 5M of an image classification (101-type) network is obtained, and the number of network parameters reaches about 33M after the image classification (101-type) network is expanded to video classification; the global long-distance context information is difficult to extract, the video classification result not only depends on the identification result of a single picture but also depends on the dynamic change among picture sequences, but the limited storage capacity and computing capacity are difficult to capture the global context dynamic information at long distance; sensitive to illumination and visual angle change.
For the second scheme, skeleton sequence behavior recognition: existing skeleton behavior recognition methods fall mainly into two types. One captures joint dynamics through manually extracted features, which requires ingenious feature design and a large amount of manpower; the other uses deep learning, but the existing literature mostly starts from dividing the body into parts, which limits the extraction of deeper-level feature associations to a certain extent.
For the third scheme, behavior recognition based on RNN networks: an RNN can acquire time-dimension information well but has limited ability to extract data features; existing methods are mainly based on multilayer RNN stacking (stacked RNNs), which are difficult to train in practical applications.
For the fourth scheme, behavior recognition based on CNN networks: existing algorithms generally process every input frame indiscriminately, which wastes computing capacity and introduces noise interference; the CNN must model the time dimension separately, which increases the difficulty of model design; and the input of a CNN must have Euclidean structure, so skeleton-based recognition loses the natural connectivity properties between skeleton nodes.
Therefore, the existing human behavior recognition scheme has the disadvantages of large calculation amount, low recognition efficiency, complex network and a plurality of difficulties in practical application.
Disclosure of Invention
The invention aims to provide a behavior recognition method, a behavior recognition device and behavior recognition equipment, and solves the problems of large calculation amount, low recognition efficiency and complex network of a human behavior recognition scheme in the prior art.
In order to solve the above technical problem, an embodiment of the present invention provides a behavior identification method, including:
acquiring key frame sequence data to be identified from bone sequence data to be identified;
and identifying the behavior action corresponding to the key frame sequence data to be identified by utilizing a time-space graph convolutional network.
Optionally, the identifying, by using a space-time graph convolutional network, a behavior action corresponding to the sequence data of the key frames to be identified includes:
constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified;
extracting a spatiotemporal feature to be identified from the bone sequence spatiotemporal image to be identified by utilizing a spatiotemporal image convolution network;
and identifying the behavior action corresponding to the space-time characteristic to be identified.
Optionally, the identifying a behavior action corresponding to the to-be-identified spatiotemporal feature includes:
and obtaining the behavior action corresponding to the space-time characteristic to be identified by utilizing the normalized exponential function.
Optionally, the acquiring the sequence data of the key frame to be identified from the bone sequence data to be identified includes:
and acquiring the sequence data of the key frame to be identified from the bone sequence data to be identified by utilizing a frame rectification network.
Optionally, before acquiring the sequence data of the key frame to be identified from the bone sequence data to be identified by using the frame rectification network, the method further includes:
and training the frame rectification network and the space-time graph convolution network.
Optionally, the training the frame rectification network and the space-time graph convolution network includes:
acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior and action by using the frame rectification network;
training the space-time graph convolutional network by using the key frame sequence training data, and identifying training behavior actions corresponding to the key frame sequence training data by using the trained space-time graph convolutional network;
adjusting the frame rectification network reversely according to the training behavior action;
acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior action again by using the adjusted frame rectification network;
and training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
Optionally, after the space-time graph convolutional network is trained again by using the key frame sequence training data acquired by the adjusted frame rectification network, the method further includes:
identifying, by using the retrained space-time graph convolutional network, the training behavior action corresponding to the key frame sequence training data obtained by the adjusted frame rectification network;
if the number of consecutive times that the obtained training behavior action is consistent with the preset behavior action is greater than a first threshold value, storing parameter information of the frame rectification network and the space-time graph convolutional network; otherwise, returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
Optionally, the reversely adjusting the frame rectification network according to the training behavior action includes:
acquiring a return function value according to the training behavior action;
obtaining a loss function according to the return function value;
and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
Optionally, the loss function is specifically:
l = α(r_t + γ × max Q(s_{t+1}, ω) - Q(s_t, ω));
where l denotes the loss function, α denotes a preset learning speed, r_t denotes the return function value when the frame rectification network is trained for the t-th time, γ denotes a preset attenuation value, Q(s_{t+1}, ω) and Q(s_t, ω) each denote an action value function, s_{t+1} denotes the skeleton sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the (t+1)-th time, s_t denotes the skeleton sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the t-th time, and ω denotes the network weight of the frame rectification network.
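For illustration only, a minimal Python (PyTorch) sketch of this update step is given below; the network wrapper q_net, its optimizer, and the tensor shapes are assumptions, not the patent's implementation. The patent writes the loss directly as the α-scaled temporal-difference term; the sketch descends on a squared surrogate of that same term, which yields the classic Q-learning weight update.

```python
import torch

def fdnet_update(q_net, optimizer, s_t, a_t, r_t, s_next, alpha=0.5, gamma=0.9):
    """One reverse-adjustment step of the frame rectification network.

    q_net is assumed to map a state tensor to a vector of action values
    Q(s, omega); a_t is the index of the action taken, r_t the return value.
    """
    q_sa = q_net(s_t)[a_t]                # Q(s_t, omega) for the action taken
    with torch.no_grad():
        q_next = q_net(s_next).max()      # max Q(s_{t+1}, omega), held fixed as the target
    delta = r_t + gamma * q_next - q_sa   # the term inside the patent's loss l = alpha * delta
    loss = 0.5 * alpha * delta.pow(2)     # squared surrogate; its gradient step matches Q-learning
    optimizer.zero_grad()
    loss.backward()                       # gradient descent on the loss ...
    optimizer.step()                      # ... updates the network weight omega
    return (alpha * delta).item()         # the patent's loss value l
```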
An embodiment of the present invention further provides a behavior recognition apparatus, including:
the first acquisition module is used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified;
and the first identification module is used for identifying the behavior action corresponding to the key frame sequence data to be identified by utilizing a space-time graph convolutional network.
Optionally, the first identification module includes:
the first construction submodule is used for constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified;
the first extraction submodule is used for extracting the spatiotemporal features to be identified from the bone sequence spatiotemporal image to be identified by utilizing a spatiotemporal image convolution network;
and the first identification submodule is used for identifying the behavior action corresponding to the space-time characteristic to be identified.
Optionally, the first identification submodule includes:
and the first processing unit is used for obtaining the behavior action corresponding to the space-time feature to be identified by utilizing the normalized exponential function.
Optionally, the first obtaining module includes:
and the first acquisition sub-module is used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network.
Optionally, the method further includes:
the first training module is used for training the frame rectification network and the space-time graph convolution network before acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network.
Optionally, the first training module includes:
the second acquisition submodule is used for acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior action by using the frame rectification network;
the first processing submodule is used for training the space-time graph convolutional network by utilizing the key frame sequence training data and identifying a training behavior action corresponding to the key frame sequence training data by utilizing the trained space-time graph convolutional network;
the first adjusting submodule is used for adjusting the frame rectification network reversely according to the training behavior action;
the third acquisition sub-module is used for acquiring the key frame sequence training data from the bone sequence data corresponding to the preset behavior action again by using the adjusted frame rectification network;
and the first training submodule is used for training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
Optionally, the method further includes:
the second identification module is used for identifying, by using the retrained space-time graph convolutional network, the training behavior action corresponding to the key frame sequence training data obtained by the adjusted frame rectification network, after the space-time graph convolutional network has been retrained with that data;
the first processing module is used for storing the parameter information of the frame rectification network and the space-time graph convolutional network if the number of consecutive times that the obtained training behavior action is consistent with the preset behavior action is greater than a first threshold value, and otherwise for returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
Optionally, the reversely adjusting the frame rectification network according to the training behavior action includes:
acquiring a return function value according to the training behavior action;
obtaining a loss function according to the return function value;
and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
Optionally, the loss function is specifically:
l = α(r_t + γ × max Q(s_{t+1}, ω) - Q(s_t, ω));
where l denotes the loss function, α denotes a preset learning speed, r_t denotes the return function value when the frame rectification network is trained for the t-th time, γ denotes a preset attenuation value, Q(s_{t+1}, ω) and Q(s_t, ω) each denote an action value function, s_{t+1} denotes the skeleton sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the (t+1)-th time, s_t denotes the skeleton sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the t-th time, and ω denotes the network weight of the frame rectification network.
The embodiment of the invention also provides behavior recognition equipment, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; the processor implements the above-described behavior recognition method when executing the program.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the behavior recognition method.
The technical scheme of the invention has the following beneficial effects:
in the scheme, the behavior recognition method obtains the key frame sequence data to be recognized from the bone sequence data to be recognized; identifying behavior actions corresponding to the key frame sequence data to be identified by utilizing a time-space graph convolutional network; redundant information interference can be reduced, and the workload of behavior identification is reduced; the time domain and space domain characteristics of the skeleton sequence are simultaneously excavated, the model design complexity is reduced, and the identification efficiency and accuracy are improved.
Drawings
FIG. 1 is a flow chart of a behavior recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a modeling method of a time-space domain diagram of a bone sequence according to an embodiment of the present invention;
fig. 3 is a first schematic diagram illustrating a specific implementation of the behavior recognition method according to the embodiment of the present invention;
fig. 4 is a second schematic diagram illustrating a specific implementation of the behavior recognition method according to the embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a behavior recognition device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a behavior recognition device according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The invention provides a behavior recognition method aiming at the problems of large calculation amount, low recognition efficiency and complex network of a human behavior recognition scheme in the prior art, as shown in figure 1, the method comprises the following steps:
step 11: acquiring key frame sequence data to be identified from bone sequence data to be identified;
step 12: and identifying the behavior action corresponding to the key frame sequence data to be identified by utilizing a time-space graph convolutional network.
The behavior recognition method provided by the embodiment of the invention obtains the key frame sequence data to be recognized from the bone sequence data to be recognized; identifying behavior actions corresponding to the key frame sequence data to be identified by utilizing a time-space graph convolutional network; redundant information interference can be reduced, and the workload of behavior identification is reduced; the time domain and space domain characteristics of the skeleton sequence are simultaneously excavated, the model design complexity is reduced, and the identification efficiency and accuracy are improved.
The method for identifying the behavior action corresponding to the to-be-identified key frame sequence data by utilizing the space-time graph convolutional network comprises the following steps of: constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified; extracting a spatiotemporal feature to be identified from the bone sequence spatiotemporal image to be identified by utilizing a spatiotemporal image convolution network; and identifying the behavior action corresponding to the space-time characteristic to be identified.
Specifically, the identifying a behavior corresponding to the to-be-identified spatiotemporal feature includes: and obtaining the behavior action corresponding to the space-time characteristic to be identified by utilizing the normalized exponential function.
In order to extract frames that are information-rich, more discriminative and representative, in an embodiment of the present invention, the acquiring the key frame sequence data to be identified from the bone sequence data to be identified includes: acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using a frame rectification network.
Further, before obtaining the sequence data of the key frames to be identified from the bone sequence data to be identified by using the frame rectification network, the method further comprises the following steps: and training the frame rectification network and the time-space diagram convolution network.
Wherein training the frame rectification network and the space-time graph convolution network comprises: acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior and action by using the frame rectification network; training the space-time graph convolutional network by using the key frame sequence training data, and identifying training behavior actions corresponding to the key frame sequence training data by using the trained space-time graph convolutional network; adjusting the frame rectification network reversely according to the training behavior action;
acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior action again by using the adjusted frame rectification network; and training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
Further, after the space-time graph convolutional network is trained again by using the key frame sequence training data acquired by the adjusted frame rectification network, the method further includes: identifying, by using the retrained space-time graph convolutional network, the training behavior action corresponding to the key frame sequence training data obtained by the adjusted frame rectification network; if the number of consecutive times that the obtained training behavior action is consistent with the preset behavior action is greater than a first threshold value, storing parameter information of the frame rectification network and the space-time graph convolutional network; otherwise, returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
Wherein the adjusting the frame rectification network in a reverse direction according to the training behavior comprises: acquiring a return function value according to the training behavior action; obtaining a loss function according to the return function value; and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
Specifically, the loss function is: l = α(r_t + γ × max Q(s_{t+1}, ω) - Q(s_t, ω)); where l denotes the loss function, α denotes a preset learning speed, r_t denotes the return function value when the frame rectification network is trained for the t-th time, γ denotes a preset attenuation value, Q(s_{t+1}, ω) and Q(s_t, ω) both denote action value functions, s_{t+1} denotes the skeleton sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the (t+1)-th time, s_t denotes the skeleton sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the t-th time, and ω denotes the network weight of the frame rectification network.
The behavior recognition method provided by the embodiment of the invention is further described below.
In view of the above technical problems, embodiments of the present invention provide a behavior recognition method, which may specifically be a human skeleton sequence behavior recognition method based on a space-time graph convolutional network. For skeleton sequence data, the method performs space-time graph network modeling by using the natural spatial connections of nodes within a single frame (spatial skeleton-node connections) and the temporal connections of the same node across multiple frames; proposes a frame distillation algorithm to extract frames that are information-rich, highly discriminative and strongly correlated with the overall video behavior; and designs a graph convolution kernel and a space-time graph convolutional network algorithm, which together complete human behavior recognition.
Human behavior recognition is regarded as a basic technology in many application fields such as human-computer interaction, intelligent monitoring and robotics. Taking the monitoring of elderly people living alone as an example, an intelligent behavior recognition system detects their daily activities to judge whether they eat normally, sleep on time, take medicine as prescribed, and whether abnormal situations such as falls, myocardial infarction or coma occur, so that family members can be informed in time and medical help delivered promptly, allowing the elderly to enjoy a higher quality of life while living alone. As another example, a fitness evaluation and medical rehabilitation system gives motion-improvement suggestions by recognizing motions and comparing them with correct postures, improving fitness efficiency and rehabilitation effect. However, the traditional video-based human behavior recognition method suffers from a large amount of computation, a complex network and high image-quality requirements, and faces many difficulties in practical application. For behavior recognition based on skeleton data, one may choose an RNN, a CNN or a graph convolutional network: the RNN has limited data feature extraction capability, the matrix-form data structure required by the CNN loses some good attributes of skeleton data, while the graph convolutional network makes full use of the natural connections among skeleton nodes to effectively extract richer features. Therefore, the embodiment of the invention proposes skeleton sequence behavior recognition based on space-time graph convolution, adding a frame distillation algorithm to improve recognition efficiency and accuracy. The scheme provided by the embodiment of the invention mainly involves the following three parts:
the first part, a skeleton sequence time-space domain graph modeling method;
The modeling method of the skeleton sequence time-space domain graph in the embodiment of the invention is shown in fig. 2. The time-space domain structure of the skeleton sequence is divided into an intra-frame structure and an inter-frame structure: the intra-frame structure mainly describes the spatial-domain structure of the skeleton, and the inter-frame structure mainly describes its time-domain structure. Specifically, the skeleton nodes form the node set V of the graph; the edges connecting different nodes within the same frame and the edges connecting the same node across different frames together form the edge set E; the edge set and the node set form the space-time graph G = (V, E) of the skeleton. The node set is V = {v_ti | t = 1..T, i = 1..N}, where T is the number of sequence frames and N is the number of skeleton nodes. The edge set is E = E_S ∪ E_T, where E_S = {v_ti v_tj | (i, j) ∈ S}, S being the body joints naturally connected within a frame, and E_T = {v_ti v_pi | (t, p) ∈ T′}, T′ being the set of extracted key frames.
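As a concrete illustration of this graph model, the following Python (NumPy) sketch builds the adjacency matrix of G = (V, E) from a key-frame sequence; the intra-frame bone list skeleton_edges and the toy sizes are assumptions, since each real skeleton format defines its own joint connections.

```python
import numpy as np

def build_st_graph(T, N, skeleton_edges):
    """Adjacency matrix of the skeleton space-time graph G = (V, E).

    V = {v_ti | t = 1..T, i = 1..N} is flattened to T*N node indices;
    E_S links naturally connected joints within each frame, and
    E_T links the same joint across consecutive key frames.
    """
    A = np.zeros((T * N, T * N), dtype=np.float32)
    idx = lambda t, i: t * N + i
    for t in range(T):
        for i, j in skeleton_edges:                      # intra-frame edges E_S
            A[idx(t, i), idx(t, j)] = A[idx(t, j), idx(t, i)] = 1.0
    for t in range(T - 1):
        for i in range(N):                               # inter-frame edges E_T
            A[idx(t, i), idx(t + 1, i)] = A[idx(t + 1, i), idx(t, i)] = 1.0
    return A

# toy example: a 3-joint chain 0-1-2 over 4 key frames
adjacency = build_st_graph(T=4, N=3, skeleton_edges=[(0, 1), (1, 2)])
```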
A second part, a key frame preprocessing network of the space-time graph convolution network;
the Frame Distillation Network (FDNet) in fig. 3 is a preprocessing Network for extracting a key Frame in the space-time convolutional Network according to an embodiment of the present invention.
The key frame extraction process is a Markov Decision Process (MDP) defined by a triplet M = (S, A, R), where S = {s_i} is the state set (whose data are frame sequence data); A = {a_i} is the action set, comprising keeping the current frame, selecting the previous frame and selecting the next frame; and R is the set of return function values obtained after (s, a) transitions to the next state. The MDP is initialized by uniformly sampling the input skeleton sequence to obtain the initial state s_1; a random state-transition action a_1 converts it to state s_2, and the return function value r_1 is calculated. Calculating the return function value requires the ST-GCN (space-time graph convolutional network): the skeleton frame sequence corresponding to state s_i is input into the pre-trained ST-GCN to obtain a recognition result, which is compared with the behavior label (such as walking or sitting); if the result is correct, the return function value r_i is positive, and otherwise it is negative.
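A minimal sketch of this decision process follows; the state encoding (the indices of the currently selected key frames), the per-slot action decoding, and the ±1 reward values are illustrative assumptions, and st_gcn stands for the pre-trained recognizer, assumed to return a class label.

```python
import numpy as np

ACTIONS = ("keep", "prev", "next")   # A = {a_i}: keep the frame, take the previous, take the next

def initial_state(seq_len, num_key):
    """s_1: uniform sampling of the input skeleton sequence."""
    return np.linspace(0, seq_len - 1, num_key).astype(int)

def step(state, slot, action, seq_len):
    """(s, a) -> next state: shift one selected key-frame index by one frame."""
    s = state.copy()
    if action == "prev":
        s[slot] = max(0, s[slot] - 1)
    elif action == "next":
        s[slot] = min(seq_len - 1, s[slot] + 1)
    return s

def reward(state, skeleton_seq, label, st_gcn):
    """r_i is positive iff the pre-trained ST-GCN recognizes the behavior label."""
    pred = st_gcn(skeleton_seq[state])   # classify the frames selected by the state
    return 1.0 if pred == label else -1.0
```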
The training procedure for the frame rectification network may be specifically as follows:
initialize FDNet, randomly generating the network weight ω;
initialize the skeleton sequence state s_1;
loop over t:
select the action a_t = max Q(s_t, ω);
execute a_t, generate the new state s_{t+1}, and calculate the reward r_t using the ST-GCN;
if (r_t > 0) && (r_i > 0 for i = (t-N)..t): end;
otherwise: calculate the loss function l = α(r_t + γ × max Q(s_{t+1}, ω) - Q(s_t, ω)); perform gradient descent on the loss function and update ω;
return to the loop over t.
Here t denotes the number of times the sample has been trained (i.e., the number of times the skeleton sequence has been trained, which equals the number of times the ST-GCN has been trained); r_t denotes the return function value during the t-th training, and r_i the return function value during the i-th training; (r_t > 0) && (r_i > 0 for i = (t-N)..t) means that the return function value is positive and has been positive for N consecutive times (that is, the behavior recognition result is consistent with the behavior label corresponding to this skeleton sequence, and the consistent count has reached N consecutive times); N denotes the system threshold (customizable); Q(s_t, ω) denotes the action value function; α in the loss function l denotes the learning speed (learning rate); γ denotes the attenuation value; 0 < α < 1 and 0 ≤ γ ≤ 1.
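The sketch below assembles the procedure above, reusing initial_state, step, reward and ACTIONS from the earlier MDP sketch and fdnet_update from the loss sketch; it assumes q_net outputs one value per (key-frame slot, action) pair, a flattening chosen here for illustration rather than specified by the patent.

```python
import torch

def train_fdnet(q_net, optimizer, skeleton_seq, label, st_gcn,
                num_key=16, N=5, max_iters=1000):
    """FDNet training loop: random omega (q_net's initialization), s_1 by
    uniform sampling, then a_t = max Q(s_t, omega) -> s_{t+1} -> reward,
    stopping once the reward has been positive N consecutive times."""
    encode = lambda s: torch.tensor(s, dtype=torch.float32)   # state -> network input
    s = initial_state(len(skeleton_seq), num_key)             # s_1
    streak = 0
    for t in range(max_iters):
        with torch.no_grad():
            a = int(q_net(encode(s)).argmax())                # a_t = max Q(s_t, omega)
        slot, act = divmod(a, len(ACTIONS))                   # decode the flat action index
        s_next = step(s, slot, ACTIONS[act], len(skeleton_seq))
        r = reward(s_next, skeleton_seq, label, st_gcn)       # r_t from the ST-GCN
        streak = streak + 1 if r > 0 else 0
        if streak >= N:                                       # r_i > 0 for i = (t-N)..t: end
            break
        fdnet_update(q_net, optimizer, encode(s), a, r, encode(s_next))
        s = s_next
    return s                                                  # distilled key-frame indices
```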
According to the embodiment of the invention, the key frame can be acquired by utilizing the FDNet network after determining omega.
The FDNet, used as a preprocessing network for the ST-GCN, plays the role of distilling the most information-rich and representative frames, reduces the amount of calculation of the ST-GCN network, and effectively reduces the noise interference brought by informationally redundant frames.
A third part, a time-space diagram convolution network based on the skeleton sequence;
Fig. 4 shows the graph convolutional network constructed on the time-space domain graph of the skeleton sequence, which is called the space-time graph convolutional network (ST-GCN) in the embodiment of the present invention.
Similar to the 2D CNN network, the essence of the skeleton graph convolution network is to use a convolution kernel sharing parameters to realize weighted summation between the central node and the neighboring nodes to achieve the purpose of extracting features, so the skeleton graph convolution network focuses on the design of the sampling function and convolution kernel function of the neighboring nodes.
The picture sampling function is defined as collecting the other pixels within a certain range around the center pixel; similarly, the sampling function of the skeleton graph can be defined as collecting the other nodes connected to the center node within a certain distance, namely the neighbor node set of v_ti: B(v_ti) = {v_qj | d(v_ti, v_tj) ≤ K, |q - t| < D}, where d(v_ti, v_tj) denotes the length of the shortest path from v_ti to v_tj within the same frame, K is a distance selection criterion (predefinable), |q - t| denotes the frame time span before and after the center node, and D denotes a time selection criterion (predefinable). For example, with K = 1 and D = 40, the sampling function selects for weighting the nodes v_tj at most one unit length from the center node v_ti and the nodes v_qi within 40 frames before and after the current frame. Expressed as a formula, the sampling function of the skeleton graph is p(v_ti) = v_qj. The design of the neighbor node set fully embodies the space-time characteristics of the skeleton graph.
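To make the sampling function concrete, the following sketch computes B(v_ti) from one frame's intra-frame adjacency matrix for chosen K and D; the helper names are assumptions.

```python
import numpy as np
from collections import deque

def hop_distances(adj_intra, src):
    """BFS shortest-path lengths d(v_ti, v_tj) within one frame's skeleton."""
    dist = np.full(len(adj_intra), np.inf)
    dist[src] = 0
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in range(len(adj_intra)):
            if adj_intra[u, v] and np.isinf(dist[v]):
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def neighbor_set(t, i, adj_intra, T, K=1, D=40):
    """B(v_ti) = {v_qj : d(v_ti, v_tj) <= K, |q - t| < D} as (frame, joint) pairs."""
    dist = hop_distances(adj_intra, i)
    joints = [j for j in range(len(adj_intra)) if dist[j] <= K]
    frames = [q for q in range(T) if abs(q - t) < D]
    return [(q, j) for q in frames for j in joints]
```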
The convolution kernel function mainly involves the convolution kernel size (equal to K × K) and the convolution kernel weight function w(v_ti). A convolution kernel on a picture is generally a square of fixed size with weights to be optimized; the convolution kernel of the skeleton graph is instead designed as a mapping from the neighbor node set B(v_ti) to K labels, l_ti: B(v_ti) → {0, ..., K-1}, meaning that the neighbor nodes of the center node are divided into K labels (subsets) according to a preset rule (e.g., the distance of the center node and its neighbors relative to the center of gravity of the skeleton), and the weight values corresponding to the labels are combined to form the weight function w(v_ti) to be optimized. The weight function w(v_ti) can be optimized by back propagation or similar means. This design of the convolution kernel solves the problem that the input of graph convolution is non-Euclidean structured data (not in matrix form).
The preset rule may be specifically any one of the following rules, but is not limited to the following rules:
unified label partition rule: the central node and its neighbor nodes all belong to one subset (label);
distance-based partition rule: the central node forms one subset, and the other neighbor nodes form another subset;
spatial-position partition rule: based on the distance from a node to the center of gravity of the whole skeleton, taking the central node's distance as the reference, nodes are divided into three subsets: greater than, equal to, and less than the reference (a sketch of this rule follows the list).
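As an illustration, here is a sketch of the third (spatial-position) rule under the assumption of 2D or 3D joint coordinates; taking the central node's own distance to the center of gravity as the reference is the reading adopted here.

```python
import numpy as np

def spatial_partition(joint_coords, neighbor_ids, center_id):
    """Label map l_ti: B(v_ti) -> {0, 1, 2} under the spatial-position rule.

    Each neighbor's distance to the skeleton's center of gravity is compared
    with the center node's distance (the reference): equal -> 0, smaller -> 1,
    greater -> 2. Each label later receives its own weight parameter.
    """
    cog = joint_coords.mean(axis=0)                        # center of gravity of the skeleton
    ref = np.linalg.norm(joint_coords[center_id] - cog)    # the reference distance
    labels = {}
    for j in neighbor_ids:
        d = np.linalg.norm(joint_coords[j] - cog)
        labels[j] = 0 if np.isclose(d, ref) else (1 if d < ref else 2)
    return labels
```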
The following illustrates a skeleton sequence behavior identification method based on a space-time graph convolutional network according to an embodiment of the present invention, as shown in fig. 4, the flow includes:
(1) The skeleton sequence is uniformly sampled as the initialization input key frame sequence of FDNet (i.e., the initialization data).
(2) Determine the network weight ω of FDNet, and obtain the states s from the actions a through the Markov decision process.
At initialization, the initial state s_1 is determined by the uniform sampling of the skeleton sequence (i.e., the initial key frames are determined), the initial value of the network weight ω is determined randomly, and the initial action a_1 is determined according to a_t = max Q(s_t, ω).
Subsequently, the current state is determined from the previous action (e.g., the state s_2 is determined from the initial action a_1), and the action corresponding to the current moment is determined from the current state (e.g., a_2 = max Q(s_2, ω)), corresponding to the formula a_t = max Q(s_t, ω).
In the FDNet training process, ω in the FDNet training process is updated after each use (after an action is determined according to the updated ω, a state is obtained correspondingly, and a key frame sequence is obtained), which is specifically referred to the content in the second section above.
(3) And constructing a corresponding key frame in the state s into a bone space-time diagram according to the content (a bone sequence space-time diagram modeling mode) of the first part.
(4) Train the ST-GCN (through operations such as calculating the loss and/or error and back propagation) using the skeleton space-time graph in (3) as input, where the training content includes the weight function w(v_ti); extract the space-time features using the algorithm of the third part (the space-time graph convolutional network based on the skeleton sequence), and then obtain the behavior recognition result using a SoftMax function (normalized exponential function).
(5) Reversely adjust the FDNet (including updating the network weight ω) according to the behavior recognition result of (4), using the algorithm of the second part (the key frame preprocessing network of the space-time graph convolutional network), and optimize the key frame sequence selection result.
(6) Cyclically execute (2) to (5), cross-adjusting the parameters of the two networks (FDNet and ST-GCN), until the behavior recognition result no longer changes significantly (specifically, the recognition result no longer changes) and is consistent with the skeleton sequence label (of the sequence in (1)), thus obtaining the final FDNet and ST-GCN.
Specifically, the obtained behavior recognition result may be compared with the label of the skeleton sequence in (1) before the reverse adjustment of FDNet is performed; if the comparison results are consistent and the number of consecutive consistent comparisons reaches the preset threshold N, the reverse adjustment is no longer carried out; otherwise, the adjustment continues.
The timing of this comparison operation may be after FDNet has been reversely adjusted a preset number of times during training, or after (4) is executed for the first time during training; this is not limited herein.
(7) Carry out human behavior recognition using the final FDNet and ST-GCN; the sketch below ties the above steps into one alternating loop.
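The following sketch reuses initial_state and train_fdnet from the earlier sketches; st_gcn_trainer (one ST-GCN training pass on the selected key frames) and the stability test are assumptions consistent with the description above, not the patent's exact procedure.

```python
def train_pipeline(skeleton_seq, label, q_net, optimizer, st_gcn, st_gcn_trainer,
                   num_key=16, N=5, max_rounds=50):
    """Alternating optimization of FDNet and ST-GCN, steps (1)-(6)."""
    keys = initial_state(len(skeleton_seq), num_key)       # (1) uniform sampling
    prev_pred, stable = None, 0
    for _ in range(max_rounds):
        st_gcn_trainer(skeleton_seq[keys], label)          # (3)-(4) train ST-GCN on the graph
        pred = st_gcn(skeleton_seq[keys])                  # (4) SoftMax recognition result
        if pred == label and pred == prev_pred:
            stable += 1
            if stable >= N:                                # (6) stable and consistent: stop
                break
        else:
            stable = 0
        keys = train_fdnet(q_net, optimizer, skeleton_seq, # (2)+(5) reversely adjust FDNet
                           label, st_gcn, num_key, N)
        prev_pred = pred
    return q_net, st_gcn                                   # (7) final networks for recognition
```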
In the embodiment of the invention, the two networks (FDNet and ST-GCN) mutually optimize and promote each other: the preprocessing network FDNet provides key frame training data for the space-time graph convolutional network ST-GCN, and the more representative and information-rich the extracted key frames, the more accurate the trained ST-GCN parameters; likewise, the more accurate the ST-GCN recognition result, the more accurate the data used for reversely adjusting FDNet, further optimizing the network to obtain a higher-quality key frame sequence.
In the embodiment of the present invention, the depth camera may be used to directly acquire the bone sequence, but not limited thereto.
Therefore, the bone sequence behavior identification method based on the space-time graph convolutional network in the embodiment of the invention comprises the following steps: aiming at the problems that the video behavior identification computation amount is huge, redundant frame information containing noise is contained, and the modeling difficulty is high when a skeleton sequence is used for deep learning behavior identification, the method provides graph modeling of a time-space domain aiming at the skeleton sequence, and provides a time-space graph convolution network containing a key frame preprocessing network, wherein the two networks (the key frame preprocessing network and the time-space graph convolution network) are mutually matched and optimized to identify human behaviors;
the key frame preprocessing network of the time-space graph convolution network comprises the following steps:
the preprocessing network FDNet for extracting the key frames adopts a Markov decision process, performs state conversion by executing different actions to obtain a return function, guides the execution of the next action according to the return function, and performs cyclic operation to determine the key frame sequence with most representativeness and most abundant information content.
The time-space diagram convolution network based on the skeleton sequence comprises the following steps:
and constructing a space-time-space graph convolution network based on a space-time domain graph model of the skeleton sequence. The sampling function of the graph convolution network is essentially the structure of a neighbor node set of a central node and is divided into an intra-frame subset and an inter-frame subset, wherein the intra-frame subset mainly comprises other naturally connected nodes which are within a specified range of the distance from the central node, and the inter-frame subset mainly comprises other nodes which are within a certain range before and after the frame where the central node is located and correspond to the same position. The convolution kernel function designs a weight function in a key mode, nodes in the neighbor nodes are divided into different subsets according to a certain rule, and each subset corresponds to different weight parameters. And finally, carrying out multilayer weighted summation on the time-space domain graph model of the bone sequence according to the sampling function and the convolution kernel function, and extracting time domain and space domain characteristics.
In summary, the embodiment of the invention provides a human body bone sequence behavior identification method based on a space-time diagram convolutional network. The method fully utilizes the nature and the characteristics of natural connection of skeleton nodes to establish a time-space domain graph model, so that the model has stronger generalization capability without artificially defining body parts; the original skeleton sequence is processed by adopting the preprocessing network for extracting the key frames, so that frames with rich information content, more identification degree and representativeness are extracted, the calculated amount of the graph convolution network is reduced, the interference of redundant information is reduced, and the model training efficiency is improved; a time-space graph convolution network based on a skeleton key frame sequence is adopted, and time domain and space domain characteristics of the skeleton sequence are excavated simultaneously, so that the model design complexity is reduced; by adopting an organization method that the pre-training network and the graph convolution network are mutually matched and optimized, the overall efficiency and accuracy of behavior recognition are improved.
The scheme adopts a deep learning method, and solves the problem of model training of large-scale data; the skeleton sequence is described by adopting a time-space domain graph model, the natural connection characteristic of the skeleton sequence is reserved, and the characteristics of richer expressive force can be extracted;
the skeleton sequence is directly obtained by using the depth camera for behavior recognition, so that the calculated amount of deep network skeleton extraction is reduced; the graph convolution network and the SoftMax function are used for identification and classification, the result is more accurate, and the model is more generalized;
the method does not need to manually extract various features, but uses a graph modeling method to describe the whole sequence, and uses a graph convolution method to extract time domain and space domain features, so that the method has fewer links needing manual participation and is more intelligent in model; and a preprocessing network for extracting key frames is added, so that the workload of behavior identification is further reduced.
An embodiment of the present invention further provides a behavior recognition apparatus, as shown in fig. 5, including:
a first obtaining module 51, configured to obtain the key frame sequence data to be identified from the bone sequence data to be identified;
and the first identification module 52 is configured to identify a behavior action corresponding to the to-be-identified key frame sequence data by using a space-time graph convolutional network.
The behavior recognition device provided by the embodiment of the invention acquires the sequence data of the key frames to be recognized from the bone sequence data to be recognized; identifying behavior actions corresponding to the key frame sequence data to be identified by utilizing a time-space graph convolution network; redundant information interference can be reduced, and the workload of behavior identification is reduced; the time domain and space domain characteristics of the skeleton sequence are simultaneously excavated, the model design complexity is reduced, and the identification efficiency and accuracy are improved.
Wherein the first identification module comprises: the first construction submodule is used for constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified; the first extraction submodule is used for extracting the spatiotemporal features to be identified from the bone sequence spatiotemporal image to be identified by utilizing a spatiotemporal image convolution network; and the first identification submodule is used for identifying the behavior action corresponding to the space-time characteristic to be identified.
Specifically, the first identification submodule includes: and the first processing unit is used for obtaining the behavior action corresponding to the space-time characteristic to be identified by utilizing the normalized exponential function.
In order to extract frames that are information-rich, more discriminative and representative, in an embodiment of the present invention, the first obtaining module includes: the first acquisition sub-module, which is used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network.
Further, the behavior recognition device further includes: the first training module is used for training the frame rectification network and the space-time graph convolution network before acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network.
Wherein the first training module comprises: the second acquisition submodule is used for acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior action by using the frame rectification network; the first processing submodule is used for training the space-time graph convolutional network by utilizing the key frame sequence training data and identifying a training behavior action corresponding to the key frame sequence training data by utilizing the trained space-time graph convolutional network; the first adjusting submodule is used for adjusting the frame rectification network reversely according to the training behavior action;
the third acquisition submodule is used for acquiring the key frame sequence training data from the bone sequence data corresponding to the preset behavior action again by using the adjusted frame rectification network; and the first training submodule is used for training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
Further, the behavior recognition device further includes: the second identification module, used for identifying, by means of the retrained space-time graph convolutional network, the training behavior action corresponding to the key frame sequence training data obtained by the adjusted frame rectification network, after the space-time graph convolutional network has been retrained with that data; and the first processing module, used for storing the parameter information of the frame rectification network and the space-time graph convolutional network if the number of consecutive times that the obtained training behavior action is consistent with the preset behavior action is greater than a first threshold value, and otherwise for returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
Wherein the adjusting the frame rectification network in a reverse direction according to the training behavior comprises: acquiring a return function value according to the training behavior action; obtaining a loss function according to the return function value; and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
Specifically, the loss function is: l = α(r_t + γ × max Q(s_{t+1}, ω) - Q(s_t, ω)); where l denotes the loss function, α denotes a preset learning speed, r_t denotes the return function value when the frame rectification network is trained for the t-th time, γ denotes a preset attenuation value, Q(s_{t+1}, ω) and Q(s_t, ω) both denote action value functions, s_{t+1} denotes the skeleton sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the (t+1)-th time, s_t denotes the skeleton sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the t-th time, and ω denotes the network weight of the frame rectification network.
The implementation embodiments of the behavior recognition method are all suitable for the embodiment of the behavior recognition device, and the same technical effect can be achieved.
An embodiment of the present invention further provides a behavior recognition device, as shown in fig. 6, including a memory 61, a processor 62, and a computer program 63 stored in the memory 61 and capable of running on the processor; the processor 62, when executing the program, implements the behavior recognition method described above.
The implementation embodiments of the behavior recognition method are all suitable for the embodiment of the behavior recognition device, and the same technical effect can be achieved.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the behavior recognition method.
The implementation embodiments of the behavior recognition method are all applicable to the embodiment of the computer-readable storage medium, and the same technical effect can be achieved.
It should be noted that many of the functional components described in this specification are referred to as modules/sub-modules/units in order to more particularly emphasize their implementation independence.
In embodiments of the present invention, the modules/sub-modules/units may be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be constructed as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different bits which, when joined logically together, comprise the module and achieve the stated purpose for the module.
Indeed, a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Likewise, operational data may be identified within the modules and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, merely as electronic signals on a system or network.
When a module can be implemented by software, considering the level of existing hardware technology, a module implemented by software may build a corresponding hardware circuit to implement a corresponding function, without considering cost, and the hardware circuit may include a conventional Very Large Scale Integration (VLSI) circuit or a gate array and an existing semiconductor such as a logic chip, a transistor, or other discrete components. A module may also be implemented in programmable hardware devices such as field programmable gate arrays, programmable array logic, programmable logic devices or the like.
While the foregoing is directed to the preferred embodiment of the present invention, it will be appreciated by those skilled in the art that various changes and modifications may be made therein without departing from the principles of the invention as set forth in the appended claims.

Claims (12)

1. A method of behavior recognition, comprising:
acquiring key frame sequence data to be identified from the bone sequence data to be identified;
identifying a behavior action corresponding to the key frame sequence data to be identified by utilizing a space-time graph convolutional network;
wherein the identifying the behavior action corresponding to the key frame sequence data to be identified by utilizing the space-time graph convolutional network comprises:
constructing the key frame sequence data to be identified into a bone sequence space-time graph to be identified;
extracting a spatiotemporal feature to be identified from the bone sequence space-time graph to be identified by utilizing the space-time graph convolutional network;
identifying the behavior action corresponding to the spatiotemporal feature to be identified;
wherein the acquiring of the sequence data of the key frame to be identified from the bone sequence data to be identified comprises:
acquiring key frame sequence data to be identified from the bone sequence data to be identified by using a frame rectification network;
before acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network, the method further comprises:
training the frame rectification network and the space-time graph convolutional network;
wherein the training of the frame rectification network and the space-time graph convolutional network comprises:
acquiring key frame sequence training data from bone sequence data corresponding to a preset behavior action by using the frame rectification network;
training the space-time graph convolutional network by using the key frame sequence training data, and identifying a training behavior action corresponding to the key frame sequence training data by using the trained space-time graph convolutional network;
reversely adjusting the frame rectification network according to the training behavior action;
acquiring key frame sequence training data again from the bone sequence data corresponding to the preset behavior action by using the adjusted frame rectification network;
and training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
2. The behavior recognition method according to claim 1, wherein the identifying the behavior action corresponding to the spatiotemporal feature to be identified comprises:
obtaining the behavior action corresponding to the spatiotemporal feature to be identified by utilizing a normalized exponential function.
3. The behavior recognition method according to claim 1, further comprising, after training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network:
identifying the training behavior action corresponding to the key frame sequence training data acquired by the adjusted frame rectification network by using the retrained space-time graph convolutional network;
if the number of consecutive times that the identified training behavior action is consistent with the preset behavior action is larger than a first threshold, storing parameter information of the frame rectification network and the space-time graph convolutional network; otherwise, returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
4. The behavior recognition method according to claim 1 or 3, wherein the reversely adjusting the frame rectification network according to the training behavior action comprises:
acquiring a return function value according to the training behavior action;
obtaining a loss function according to the return function value;
and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
5. The behavior recognition method according to claim 4, wherein the loss function is specifically:
l = α(r_t + γ × max Q(s_{t+1}, ω) − Q(s_t, ω));
where l denotes the loss function, α denotes a preset learning rate, r_t denotes the return function value at the t-th training of the frame rectification network, γ denotes a preset attenuation value, Q(s_{t+1}, ω) and Q(s_t, ω) both denote action-value functions, s_{t+1} denotes the bone sequence state corresponding to the key frame sequence training data at the (t+1)-th training of the frame rectification network, s_t denotes the bone sequence state corresponding to the key frame sequence training data at the t-th training of the frame rectification network, and ω denotes the network weight of the frame rectification network.
6. A behavior recognition apparatus, comprising:
the first acquisition module is used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified;
the first identification module is used for identifying a behavior action corresponding to the key frame sequence data to be identified by utilizing a space-time graph convolutional network;
wherein the first identification module comprises:
the first construction submodule is used for constructing the key frame sequence data to be identified into a bone sequence space-time graph to be identified;
the first extraction submodule is used for extracting a spatiotemporal feature to be identified from the bone sequence space-time graph to be identified by utilizing the space-time graph convolutional network;
the first identification submodule is used for identifying the behavior action corresponding to the spatiotemporal feature to be identified;
wherein, the first obtaining module comprises:
the first acquisition submodule is used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified by utilizing the frame rectification network;
the behavior recognition device further comprises:
the first training module is used for training the frame rectification network and the space-time graph convolution network before acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network;
the first training module comprising:
the second acquisition submodule is used for acquiring key frame sequence training data from bone sequence data corresponding to a preset behavior action by using the frame rectification network;
the first processing submodule is used for training the space-time graph convolutional network by utilizing the key frame sequence training data and identifying a training behavior action corresponding to the key frame sequence training data by utilizing the trained space-time graph convolutional network;
the first adjusting submodule is used for reversely adjusting the frame rectification network according to the training behavior action;
the third acquisition submodule is used for acquiring the key frame sequence training data from the bone sequence data corresponding to the preset behavior action again by using the adjusted frame rectification network;
and the first training submodule is used for training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
7. The behavior recognition device according to claim 6, wherein the first recognition submodule includes:
and the first processing unit is used for obtaining the behavior action corresponding to the space-time feature to be identified by utilizing the normalized exponential function.
8. The behavior recognition device according to claim 6, further comprising:
the second identification module is used for, after the space-time graph convolutional network is trained again by using the key frame sequence training data acquired by the adjusted frame rectification network, identifying the training behavior action corresponding to the key frame sequence training data acquired by the adjusted frame rectification network by using the retrained space-time graph convolutional network;
the first processing module is used for storing parameter information of the frame rectification network and the space-time graph convolutional network if the number of consecutive times that the identified training behavior action is consistent with the preset behavior action is larger than a first threshold; otherwise, returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
9. The behavior recognition device according to claim 7 or 8, wherein the reversely adjusting the frame rectification network according to the training behavior action comprises:
acquiring a return function value according to the training behavior action;
obtaining a loss function according to the return function value;
and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
10. The behavior recognition device according to claim 9, wherein the loss function is specifically:
l = α(r_t + γ × max Q(s_{t+1}, ω) − Q(s_t, ω));
where l denotes the loss function, α denotes a preset learning rate, r_t denotes the return function value at the t-th training of the frame rectification network, γ denotes a preset attenuation value, Q(s_{t+1}, ω) and Q(s_t, ω) both denote action-value functions, s_{t+1} denotes the bone sequence state corresponding to the key frame sequence training data at the (t+1)-th training of the frame rectification network, s_t denotes the bone sequence state corresponding to the key frame sequence training data at the t-th training of the frame rectification network, and ω denotes the network weight of the frame rectification network.
11. A behavior recognition device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; characterized in that the processor, when executing the program, implements the behavior recognition method according to any one of claims 1 to 5.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of a method for behavior recognition according to any one of claims 1 to 5.
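Read together, claims 1 through 5 describe an alternating loop: the frame rectification network selects key frames, the space-time graph convolutional network is trained on them and classifies them, and the recognition outcome is fed back to reversely adjust the frame selection until the recognized action matches the preset one for more than a first threshold of consecutive rounds. The following is a toy, self-contained sketch of that control flow only; every name in it (select_key_frames, recognize, train_loop) is a hypothetical illustration, and both networks are reduced to trivial numeric stand-ins.

```python
import random

# Toy stand-ins: the patent uses a frame rectification network and a
# space-time graph convolutional network; here both collapse to a few
# numbers so the alternating control flow of claims 1 and 3 can run.

def select_key_frames(weights, seq, k=4):
    """Frame rectification step: keep the k highest-scoring frame indices."""
    ranked = sorted(range(len(seq)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k])

def recognize(bias, key_idx, seq):
    """Stand-in 'space-time graph convolutional network' classifier."""
    return int(sum(seq[i] for i in key_idx) + bias > 0)

def train_loop(sequences, labels, first_threshold=3, max_rounds=1000):
    n = len(sequences[0])
    weights = [random.random() for _ in range(n)]    # rectification weights ω
    bias, consecutive = 0.0, 0
    for _ in range(max_rounds):                      # safety cap for the toy
        hits = 0
        for seq, y in zip(sequences, labels):
            key_idx = select_key_frames(weights, seq)
            bias += 0.05 * (y - recognize(bias, key_idx, seq))  # "train" the classifier
            if recognize(bias, key_idx, seq) == y:
                hits += 1
            else:                                    # reverse adjustment step
                for i in key_idx:
                    weights[i] -= 0.1 * random.random()
        consecutive = consecutive + 1 if hits == len(labels) else 0
        if consecutive > first_threshold:            # stop condition of claim 3
            break
    return weights, bias                             # parameters to be stored

w, b = train_loop([[0.9, -1.0, 0.8, 0.2, -0.4, 1.1, 0.0, -2.0],
                   [-0.5, -0.9, -1.2, 0.3, -0.1, -0.8, -1.5, 0.4]], [1, 0])
```

In the patent itself, the reverse adjustment is the gradient step on the loss of claim 5; the heuristic decrement above merely stands in for it so the sketch stays self-contained.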
CN201910000953.0A 2019-01-02 2019-01-02 Behavior identification method, device and equipment Active CN111401106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910000953.0A CN111401106B (en) 2019-01-02 2019-01-02 Behavior identification method, device and equipment

Publications (2)

Publication Number Publication Date
CN111401106A CN111401106A (en) 2020-07-10
CN111401106B 2023-03-31

Family

ID=71430152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910000953.0A Active CN111401106B (en) 2019-01-02 2019-01-02 Behavior identification method, device and equipment

Country Status (1)

Country Link
CN (1) CN111401106B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069979B (en) * 2020-09-03 2024-02-02 浙江大学 Real-time action recognition man-machine interaction system
CN112070027B (en) * 2020-09-09 2022-08-26 腾讯科技(深圳)有限公司 Network training and action recognition method, device, equipment and storage medium
CN112380955B (en) * 2020-11-10 2023-06-16 浙江大华技术股份有限公司 Action recognition method and device
CN112528811A (en) * 2020-12-02 2021-03-19 建信金融科技有限责任公司 Behavior recognition method and device
CN113128424B (en) * 2021-04-23 2024-05-03 浙江理工大学 Method for identifying action of graph convolution neural network based on attention mechanism
CN113139469B (en) * 2021-04-25 2022-04-29 武汉理工大学 Driver road stress adjusting method and system based on micro-expression recognition
CN113378656B (en) * 2021-05-24 2023-07-25 南京信息工程大学 Action recognition method and device based on self-adaptive graph convolution neural network
CN113989927B (en) * 2021-10-27 2024-04-26 东北大学 Method and system for identifying violent behaviors of video group based on bone data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
CN107169415A (en) * 2017-04-13 2017-09-15 西安电子科技大学 Human motion recognition method based on convolutional neural networks feature coding
CN108229355A (en) * 2017-12-22 2018-06-29 北京市商汤科技开发有限公司 Activity recognition method and apparatus, electronic equipment, computer storage media, program
WO2018119807A1 (en) * 2016-12-29 2018-07-05 浙江工商大学 Depth image sequence generation method based on convolutional neural network and spatiotemporal coherence
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A CNNs action recognition method fusing global spatiotemporal features; Wang Ke et al.; Journal of Huazhong University of Science and Technology (Natural Science Edition); 2018-12-20 (No. 12); full text *
View-invariant skeleton behavior recognition based on a spatiotemporal attention deep network; Feng Yan et al.; Journal of Computer-Aided Design & Computer Graphics; 2018-12-15 (No. 12); full text *

Also Published As

Publication number Publication date
CN111401106A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111401106B (en) Behavior identification method, device and equipment
CN111310707B (en) Bone-based graph annotation meaning network action recognition method and system
CN107492121B (en) Two-dimensional human body bone point positioning method of monocular depth video
CN111291809B (en) Processing device, method and storage medium
Yang et al. SiamAtt: Siamese attention network for visual tracking
CN109685037B (en) Real-time action recognition method and device and electronic equipment
CN109902565B (en) Multi-feature fusion human behavior recognition method
Filtjens et al. Skeleton-based action segmentation with multi-stage spatial-temporal graph convolutional neural networks
CN110477907B (en) Modeling method for intelligently assisting in recognizing epileptic seizures
Liu et al. Joint dynamic pose image and space time reversal for human action recognition from videos
CN113963304B (en) Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
CN114613013A (en) End-to-end human behavior recognition method and model based on skeleton nodes
CN106203255A (en) A kind of pedestrian based on time unifying heavily recognition methods and system
CN115527269B (en) Intelligent human body posture image recognition method and system
Jin et al. Cvt-assd: convolutional vision-transformer based attentive single shot multibox detector
CN109858351B (en) Gait recognition method based on hierarchy real-time memory
CN110852214A (en) Light-weight face recognition method facing edge calculation
Liu et al. Key algorithm for human motion recognition in virtual reality video sequences based on hidden markov model
CN112199994B (en) Method and device for detecting interaction of3D hand and unknown object in RGB video in real time
Zhu et al. Dance Action Recognition and Pose Estimation Based on Deep Convolutional Neural Network.
CN114882595A (en) Armed personnel behavior identification method and armed personnel behavior identification system
Benhamida et al. Theater Aid System for the Visually Impaired Through Transfer Learning of Spatio-Temporal Graph Convolution Networks
Jia et al. Lightweight CNN-Based Image Recognition with Ecological IoT Framework for Management of Marine Fishes
Hedegaard et al. Human activity recognition
Tyagi et al. Hybrid classifier model with tuned weights for human activity recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant