CN111401106A - Behavior identification method, device and equipment - Google Patents

Behavior identification method, device and equipment

Info

Publication number
CN111401106A
CN111401106A (application number CN201910000953.0A)
Authority
CN
China
Prior art keywords
network
behavior
training
identified
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910000953.0A
Other languages
Chinese (zh)
Other versions
CN111401106B (en)
Inventor
丁晓璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Communications Ltd Research Institute
Priority to CN201910000953.0A
Publication of CN111401106A
Application granted
Publication of CN111401106B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a behavior identification method, device, and equipment, wherein the behavior identification method comprises the following steps: acquiring key frame sequence data to be identified from bone sequence data to be identified; and identifying the behavior action corresponding to the key frame sequence data to be identified using a space-time graph convolutional network. By acquiring the key frame sequence data from the bone sequence data and identifying the corresponding behavior action with a space-time graph convolutional network, the scheme reduces redundant-information interference and the workload of behavior identification; the time-domain and space-domain features of the skeleton sequence are mined simultaneously, which reduces model design complexity and improves identification efficiency and accuracy.

Description

Behavior identification method, device and equipment
Technical Field
The invention relates to the technical field of human behavior recognition, in particular to a behavior recognition method, a behavior recognition device and behavior recognition equipment.
Background
The following behavior recognition schemes mainly exist in the prior art:
In the first scheme, video behavior recognition: video image data are taken directly as input, and behavior recognition is performed with a deep learning method. The approach can roughly be regarded as image classification applied to each frame of a multi-frame sequence, with a behavior recognition result given according to the classification results of all the images.
Second scheme, bone sequence behavior recognition: human skeleton nodes are highly robust to illumination and viewing-angle changes, the data volume is small, and the computing-resource consumption is low. As equipment precision improves, skeleton nodes can be located more accurately with a depth camera or a motion capture system, and using a skeleton sequence as the input of a deep network can improve the recognition effect.
Third scheme, behavior recognition based on RNNs (Recurrent Neural Networks): owing to the favorable properties of RNNs, using an RNN for behavior recognition removes the need to model the time dimension separately. Existing schemes are mostly based on LSTM (Long Short-Term Memory) networks, and recognition accuracy has been improved continuously by introducing trust-gate mechanisms, adding attention models, and other methods.
Fourth scheme, behavior recognition based on CNNs (Convolutional Neural Networks): the input of a CNN is generally Euclidean-structured data arranged regularly in matrix form. Existing literature mostly extends the 2D CNNs of image classification into 3D CNNs for video recognition, adding video segmentation, multi-task parallel computation, and the like to achieve good results.
However, the above four behavior recognition schemes have the following disadvantages, respectively:
for the first scheme, video behavior recognition: the calculation amount is huge, about 5M of an image classification (101-type) network is obtained, and the number of network parameters reaches about 33M after the image classification (101-type) network is expanded to video classification; the global long-distance context information is difficult to extract, the video classification result not only depends on the identification result of a single picture but also depends on the dynamic change among picture sequences, but the long-distance global context dynamic information is difficult to capture due to limited storage capacity and computing capacity; sensitive to illumination and visual angle change.
For the second scheme, bone sequence behavior recognition: existing bone behavior identification methods fall mainly into two types. One captures joint dynamics by manually extracting features, which requires ingenious feature design and a large amount of manual effort; the other uses deep learning, but the existing literature mostly starts from dividing the body into parts, which limits the extraction of deeper-level feature associations to some extent.
For the third scheme, behavior recognition based on the RNN network: an RNN can capture time-dimension information well but has limited capacity for extracting data features; existing methods are mainly based on multilayer RNN stacking (stacked RNNs), which is difficult to train in practical applications.
For the fourth scheme, behavior recognition based on the CNN network: existing algorithms generally feed every input frame to the CNN indiscriminately, wasting computing capacity and introducing noise interference; the CNN network needs to model the time dimension separately, increasing the difficulty of model design; and the CNN requires Euclidean-structured input, so skeleton-based recognition loses the natural connectivity between skeleton nodes.
Therefore, existing human behavior recognition schemes suffer from heavy computation, low recognition efficiency, and complex networks, and face numerous difficulties in practical application.
Disclosure of Invention
The invention aims to provide a behavior recognition method, device, and equipment, so as to solve the problems of heavy computation, low recognition efficiency, and complex networks in prior-art human behavior recognition schemes.
In order to solve the above technical problem, an embodiment of the present invention provides a behavior identification method, including:
acquiring key frame sequence data to be identified from bone sequence data to be identified;
and identifying the behavior action corresponding to the key frame sequence data to be identified by utilizing a time-space graph convolutional network.
Optionally, the identifying, by using a space-time graph convolutional network, a behavior action corresponding to the sequence data of the key frames to be identified includes:
constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified;
extracting a spatiotemporal feature to be identified from the bone sequence spatiotemporal image to be identified by utilizing a spatiotemporal image convolution network;
and identifying the behavior action corresponding to the space-time characteristic to be identified.
Optionally, the identifying the behavior action corresponding to the to-be-identified spatio-temporal feature includes:
and obtaining the behavior action corresponding to the space-time characteristic to be identified by utilizing the normalized exponential function.
Optionally, the acquiring the sequence data of the key frame to be identified from the bone sequence data to be identified includes:
and acquiring the key frame sequence data to be identified from the bone sequence data to be identified by utilizing a frame rectification network.
Optionally, before acquiring the sequence data of the key frame to be identified from the bone sequence data to be identified by using the frame rectification network, the method further includes:
and training the frame rectification network and the space-time graph convolution network.
Optionally, the training the frame rectification network and the space-time graph convolution network includes:
acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior and action by using the frame rectification network;
training the space-time graph convolutional network by using the key frame sequence training data, and identifying training behavior actions corresponding to the key frame sequence training data by using the trained space-time graph convolutional network;
adjusting the frame rectification network reversely according to the training behavior action;
acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior action again by using the adjusted frame rectification network;
and training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
Optionally, after the space-time graph convolutional network is trained again using the key frame sequence training data acquired by the adjusted frame rectification network, the method further includes:
identifying the training behavior action corresponding to the key frame sequence training data obtained by the adjusted frame rectification network, using the retrained space-time graph convolutional network;
if the number of consecutive times that the obtained training behavior action is consistent with the preset behavior action is greater than a first threshold, storing the parameter information of the frame rectification network and the space-time graph convolutional network; otherwise, returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
Optionally, the reversely adjusting the frame rectification network according to the training behavior action includes:
acquiring a return function value according to the training behavior action;
obtaining a loss function according to the return function value;
and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
Optionally, the loss function is specifically:
l = α(r_t + γ × max Q(s_{t+1}, ω) − Q(s_t, ω));
where l denotes the loss function, α denotes a preset learning rate, r_t denotes the reward-function value for the t-th training of the frame rectification network, γ denotes a preset decay value,
Q(s_{t+1}, ω) and Q(s_t, ω) each denote an action-value function, s_{t+1} denotes the bone sequence state corresponding to the key frame sequence training data at the (t+1)-th training of the frame rectification network, s_t denotes the bone sequence state corresponding to the key frame sequence training data at the t-th training of the frame rectification network, and ω denotes the network weight of the frame rectification network.
An embodiment of the present invention further provides a behavior recognition apparatus, including:
the first acquisition module is used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified;
and the first identification module is used for identifying the behavior action corresponding to the key frame sequence data to be identified by utilizing a space-time graph convolutional network.
Optionally, the first identification module includes:
the first construction submodule is used for constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified;
the first extraction submodule is used for extracting the spatiotemporal features to be identified from the bone sequence spatiotemporal image to be identified by utilizing a spatiotemporal image convolution network;
and the first identification submodule is used for identifying the behavior action corresponding to the space-time characteristic to be identified.
Optionally, the first identification submodule includes:
and the first processing unit is used for obtaining the behavior action corresponding to the space-time feature to be identified by utilizing the normalized exponential function.
Optionally, the first obtaining module includes:
and the first acquisition sub-module is used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network.
Optionally, the method further includes:
the first training module is used for training the frame rectification network and the space-time graph convolution network before acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network.
Optionally, the first training module includes:
the second acquisition submodule is used for acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior action by using the frame rectification network;
the first processing submodule is used for training the space-time graph convolutional network by utilizing the key frame sequence training data and identifying a training behavior action corresponding to the key frame sequence training data by utilizing the trained space-time graph convolutional network;
the first adjusting submodule is used for adjusting the frame rectification network reversely according to the training behavior action;
the third acquisition submodule is used for acquiring the key frame sequence training data from the bone sequence data corresponding to the preset behavior action again by using the adjusted frame rectification network;
and the first training submodule is used for training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
Optionally, the method further includes:
the second identification module is used for identifying, by using the retrained space-time graph convolutional network, the training behavior action corresponding to the key frame sequence training data obtained by the adjusted frame rectification network, after the space-time graph convolutional network has been retrained with the key frame sequence training data obtained by the adjusted frame rectification network;
the first processing module is used for storing the parameter information of the frame rectification network and the space-time graph convolutional network if the number of consecutive times that the obtained training behavior action is consistent with the preset behavior action is greater than a first threshold; otherwise, returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
Optionally, the reversely adjusting the frame rectification network according to the training behavior action includes:
acquiring a return function value according to the training behavior action;
obtaining a loss function according to the return function value;
and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
Optionally, the loss function is specifically:
l = α(r_t + γ × max Q(s_{t+1}, ω) − Q(s_t, ω));
where l denotes the loss function, α denotes a preset learning rate, r_t denotes the reward-function value for the t-th training of the frame rectification network, γ denotes a preset decay value,
Q(s_{t+1}, ω) and Q(s_t, ω) each denote an action-value function, s_{t+1} denotes the bone sequence state corresponding to the key frame sequence training data at the (t+1)-th training of the frame rectification network, s_t denotes the bone sequence state corresponding to the key frame sequence training data at the t-th training of the frame rectification network, and ω denotes the network weight of the frame rectification network.
The embodiment of the invention also provides behavior recognition equipment, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor; the processor implements the above-described behavior recognition method when executing the program.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the behavior recognition method.
The technical scheme of the invention has the following beneficial effects:
In the scheme, the behavior recognition method acquires the key frame sequence data to be identified from the bone sequence data to be identified and identifies the behavior action corresponding to the key frame sequence data using a space-time graph convolutional network; this reduces redundant-information interference and the workload of behavior identification; the time-domain and space-domain features of the skeleton sequence are mined simultaneously, which reduces model design complexity and improves identification efficiency and accuracy.
Drawings
FIG. 1 is a flow chart of a behavior recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a modeling method of a time-space domain diagram of a bone sequence according to an embodiment of the present invention;
fig. 3 is a first diagram illustrating a specific implementation of the behavior recognition method according to the embodiment of the present invention;
fig. 4 is a second diagram illustrating a specific implementation of the behavior recognition method according to the embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a behavior recognition device according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of behavior recognition equipment according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The invention provides a behavior recognition method aiming at the problems of large calculation amount, low recognition efficiency and complex network of a human behavior recognition scheme in the prior art, as shown in figure 1, the method comprises the following steps:
step 11: acquiring key frame sequence data to be identified from bone sequence data to be identified;
step 12: and identifying the behavior action corresponding to the key frame sequence data to be identified by utilizing a time-space graph convolutional network.
The behavior recognition method provided by the embodiment of the invention acquires the key frame sequence data to be identified from the bone sequence data to be identified and identifies the behavior action corresponding to the key frame sequence data using a space-time graph convolutional network; this reduces redundant-information interference and the workload of behavior recognition; the time-domain and space-domain features of the skeleton sequence are mined simultaneously, which reduces model design complexity and improves recognition efficiency and accuracy.
The method for identifying the behavior action corresponding to the to-be-identified key frame sequence data by utilizing the space-time graph convolutional network comprises the following steps of: constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified; extracting a spatiotemporal feature to be identified from the bone sequence spatiotemporal image to be identified by utilizing a spatiotemporal image convolution network; and identifying the behavior action corresponding to the space-time characteristic to be identified.
Specifically, the identifying the behavior action corresponding to the to-be-identified spatio-temporal feature includes: and obtaining the behavior action corresponding to the space-time characteristic to be identified by utilizing the normalized exponential function.
In order to extract frames that are information-rich, more discriminative, and representative, in an embodiment of the present invention the acquiring of the key frame sequence data to be identified from the bone sequence data to be identified includes: acquiring the key frame sequence data to be identified from the bone sequence data to be identified using a frame rectification network.
Further, before obtaining the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network, the method further comprises the following steps: and training the frame rectification network and the space-time graph convolution network.
Wherein training the frame rectification network and the space-time graph convolution network comprises: acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior and action by using the frame rectification network; training the space-time graph convolutional network by using the key frame sequence training data, and identifying training behavior actions corresponding to the key frame sequence training data by using the trained space-time graph convolutional network; adjusting the frame rectification network reversely according to the training behavior action;
acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior action again by using the adjusted frame rectification network; and training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
Further, after the space-time graph convolutional network is trained again using the key frame sequence training data acquired by the adjusted frame rectification network, the method further includes: identifying the training behavior action corresponding to the key frame sequence training data obtained by the adjusted frame rectification network, using the retrained space-time graph convolutional network; if the number of consecutive times that the obtained training behavior action is consistent with the preset behavior action is greater than a first threshold, storing the parameter information of the frame rectification network and the space-time graph convolutional network; otherwise, returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
Wherein the adjusting the frame rectification network in reverse direction according to the training behavior action comprises: acquiring a return function value according to the training behavior action; obtaining a loss function according to the return function value; and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
In particular, the loss function is l = α(r_t + γ × max Q(s_{t+1}, ω) − Q(s_t, ω)), where l denotes the loss function, α denotes a preset learning rate, r_t denotes the reward-function value for the t-th training of the frame rectification network, γ denotes a preset decay value, Q(s_{t+1}, ω) and Q(s_t, ω) each denote an action-value function, s_{t+1} denotes the bone sequence state corresponding to the key frame sequence training data at the (t+1)-th training of the frame rectification network, s_t denotes the bone sequence state corresponding to the key frame sequence training data at the t-th training of the frame rectification network, and ω denotes the network weight of the frame rectification network.
The behavior recognition method provided by the embodiment of the invention is further described below.
In view of the above technical problems, embodiments of the present invention provide a behavior recognition method, which may specifically be a human skeleton sequence behavior recognition method based on a space-time graph convolutional network. For skeleton sequence data, the method performs space-time graph network modeling using the natural spatial connections of nodes within a single frame (spatial skeleton node connections) and the temporal connections of the same node across multiple frames, proposes a frame rectification algorithm to extract frames that are information-rich, highly discriminative, and strongly correlated with the overall video behavior, and designs a graph convolution kernel and a space-time graph convolutional network algorithm, which together complete human behavior recognition.
Human behavior recognition is regarded as a fundamental technology in application fields such as human-computer interaction, intelligent monitoring, and robotics. Taking the monitoring of an elderly person living alone as an example, an intelligent behavior recognition system detects the person's daily activities to judge whether he or she eats normally, sleeps on time, and takes medication as prescribed, and whether abnormal situations such as falls, myocardial infarction, or coma occur, so that family members can be notified and medical assistance delivered in time, allowing a higher quality of life while living alone. In fitness evaluation and medical rehabilitation systems, recognizing movements and comparing them with correct postures yields improvement suggestions, raising fitness efficiency and rehabilitation outcomes. However, the traditional video-based human behavior recognition method involves heavy computation, complex networks, and high image-quality requirements, and faces many difficulties in practical application. For behavior recognition based on bone data, one may choose an RNN, a CNN, or a graph convolutional network: the RNN has limited data feature extraction capability, the matrix-form data structure required by the CNN loses some of the good properties of bone data, and the graph convolutional network makes full use of the natural connections between bone nodes to extract richer features effectively. The embodiment of the invention therefore provides skeleton sequence behavior recognition based on space-time graph convolution, with a frame rectification algorithm added to improve recognition efficiency and accuracy. The scheme provided by the embodiment of the invention mainly involves the following three parts:
the first part, a skeleton sequence time-space domain graph modeling method;
The modeling method of the skeleton sequence time-space domain graph in the embodiment of the invention is shown in Fig. 2. The time-space domain structure of the skeleton sequence is divided into an intra-frame structure and an inter-frame structure: the intra-frame structure mainly describes the space-domain structure of the skeleton, and the inter-frame structure mainly describes its time-domain structure. Specifically, the skeleton nodes form the node set V of the graph; the edges connecting different nodes within the same frame and the edges connecting the same node across different frames together form the edge set E; the edge set and node set form the space-time graph G of the skeleton. The node set is V = {v_ti | t = 1, ..., T; i = 1, ..., N}, where T is the number of frames in the sequence and N is the number of bone nodes. The edge set is E = E_S ∪ E_T, where E_S = {v_ti v_tj | (i, j) ∈ S}, S being the set of body joints naturally connected within a frame, and E_T = {v_ti v_pi | t, p ∈ T′}, T′ being the set of extracted key frames.
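For concreteness, the following is a minimal Python sketch of how the node set V and edge set E = E_S ∪ E_T defined above could be assembled; the toy joint pairs in SKELETON_EDGES and the key frame indices are illustrative assumptions, and E_T here links consecutive key frames, one common reading of the inter-frame connection.

```python
# Hypothetical intra-frame joint connections S for a 5-joint toy skeleton.
SKELETON_EDGES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def build_spacetime_graph(key_frames, num_joints):
    """Assemble V and E = E_S (intra-frame) + E_T (inter-frame) over the
    extracted key frames T'."""
    V = [(t, i) for t in key_frames for i in range(num_joints)]
    # E_S: edges between naturally connected joints within the same frame.
    E_S = [((t, i), (t, j)) for t in key_frames for (i, j) in SKELETON_EDGES]
    # E_T: edges connecting the same joint across consecutive key frames.
    E_T = [((t, i), (p, i))
           for t, p in zip(key_frames, key_frames[1:])
           for i in range(num_joints)]
    return V, E_S + E_T

V, E = build_spacetime_graph(key_frames=[0, 4, 8], num_joints=5)
```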
A second part, a key frame preprocessing network of the space-time graph convolution network;
the Frame Distillation Network (FDNet) in fig. 3 is a preprocessing Network for extracting a key Frame in the space-time convolutional Network according to an embodiment of the present invention.
The key frame extraction process is a Markov Decision Process (MDP) defined by the triple M = (S, A, R), where S = {s_i} is the state set (each state's data is a frame sequence); A = {a_i} is the action set, comprising keeping the current frame unchanged, selecting the previous frame, and selecting the next frame; and R is the set of reward-function values obtained after (s, a) transitions to the next state. The MDP is initialized by uniformly sampling the input skeleton sequence to obtain the initial state s_1; a random state-transition action a_1 transitions it to state s_2, and the reward-function value r_1 is computed. Computing the reward-function value requires ST-GCN (the space-time graph convolutional network): the skeleton frame sequence corresponding to state s_i is input into a pre-trained ST-GCN to obtain a recognition result, which is compared with the behavior label (such as walking or sitting); the reward value r_i is positive if they match and negative otherwise.
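As a sketch of the reward computation described above, the function below assumes a pre-trained classifier callable st_gcn that returns per-class scores and an integer behavior label; both names and the ±1 reward values are placeholders, since the patent only fixes the sign convention.

```python
def reward(state_frames, label, st_gcn):
    """Positive reward if the pre-trained ST-GCN recognizes the key frame
    sequence of this state as the labeled behavior, negative otherwise."""
    scores = st_gcn(state_frames)                         # per-class scores
    predicted = max(range(len(scores)), key=lambda c: scores[c])
    return 1.0 if predicted == label else -1.0
```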
The training procedure for the frame rectification network may be specifically as follows:
initialize FDNet and randomly generate the network weight ω;
initialize the bone sequence state s_1;
loop over t:
    select the action a_t = max Q(s_t, ω);
    execute a_t to generate a new state s_{t+1}, and compute the reward r_t using ST-GCN;
    if (r_t > 0) && (r_i > 0 for i = (t − N) ... t):
        finish;
    otherwise:
        compute the loss function l = α(r_t + γ × max Q(s_{t+1}, ω) − Q(s_t, ω));
        perform gradient descent on the loss function and update ω;
    return to the loop over t.
where t denotes the number of sample training iterations (i.e., the number of training passes over the bone sequence, equal to the number of ST-GCN training passes); r_t denotes the reward-function value during the t-th training pass and r_i that during the i-th; (r_t > 0) && (r_i > 0 for i = (t − N) ... t) means the reward-function value is positive and has remained positive for N consecutive passes (i.e., the behavior recognition result is consistent with the behavior label corresponding to this skeleton sequence N consecutive times); N denotes a customizable system threshold; Q(s_t, ω) denotes the action-value function; and in the loss function l, α denotes the learning rate and γ the decay value, with 0 < α < 1 and 0 ≤ γ ≤ 1.
In the embodiment of the invention, once ω has been determined, the key frames can be acquired using the FDNet network.
Serving as the preprocessing network for ST-GCN, FDNet distills the most information-rich and representative frames, which reduces the computation of the ST-GCN network and effectively reduces the noise interference caused by redundant frames.
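Read as code, the training procedure above is a Q-learning-style loop. The sketch below is one hedged reading of it: fdnet, step_fn, and reward_fn are assumed interfaces, the patent's learning speed α is folded into the optimizer's learning rate, and the squared temporal-difference error is minimized (the conventional surrogate) so that gradient descent reproduces the usual Q-learning update rather than the patent's literal signed expression.

```python
import torch

def train_fdnet(fdnet, optimizer, init_state, step_fn, reward_fn,
                gamma=0.9, N=10):
    """Train the frame rectification network until the reward stays
    positive for N consecutive iterations."""
    s_t, streak = init_state, 0
    while streak < N:
        q_t = fdnet(s_t)                      # Q(s_t, w) over the three actions
        a_t = int(torch.argmax(q_t))          # greedy action a_t = max Q(s_t, w)
        s_next = step_fn(s_t, a_t)            # execute a_t, obtain s_{t+1}
        r_t = reward_fn(s_next)               # reward computed via pre-trained ST-GCN
        streak = streak + 1 if r_t > 0 else 0
        # TD target r_t + gamma * max Q(s_{t+1}, w); detached as in standard Q-learning.
        target = r_t + gamma * fdnet(s_next).max().detach()
        loss = (target - q_t[a_t]) ** 2       # squared TD error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                      # gradient descent, update w
        s_t = s_next
    return fdnet
```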
A third part, a time-space diagram convolution network based on the skeleton sequence;
Fig. 4 shows a graph convolutional network constructed on the time-space domain graph of a skeleton sequence, referred to in the embodiment of the present invention as the space-time graph convolutional network (ST-GCN).
Similar to a 2D CNN, the essence of the skeleton graph convolutional network is to use a parameter-sharing convolution kernel to perform weighted summation over a central node and its neighboring nodes so as to extract features; the design therefore focuses on the sampling function over neighboring nodes and the convolution kernel function.
An image sampling function collects the other pixels within a certain range around a central pixel; analogously, the sampling function of the skeleton graph can be defined to collect the other nodes connected to a central node within a certain distance, i.e., the neighbor node set of v_ti is B(v_ti) = {v_qj | d(v_ti, v_tj) ≤ K, |q − t| ≤ D}, where d(v_ti, v_tj) denotes the length of the shortest path from v_ti to v_tj within the same frame, K is a distance selection criterion (predefinable), |q − t| denotes the time span of frames before and after the central node, and D is a time selection criterion (predefinable). For example, with K = 1 and D = 40, the sampling function selects the nodes v_tj at most one unit of length from the central node v_ti, together with the nodes v_qi at the same position in the 40 frames before and after the current frame, for the weighted computation. The sampling function of the bone graph is formulated as p(v_ti) = v_qj. This design of the neighbor node set fully embodies the spatio-temporal characteristics of the skeleton graph.
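A minimal sketch of the neighbor set B(v_ti) defined above, assuming the intra-frame joint graph is given as an adjacency list and shortest path lengths are obtained by breadth-first search; the helper names are assumptions.

```python
from collections import deque

def intra_frame_hops(adj, src):
    """Shortest path length d(., .) in hops from joint src to every joint,
    over the natural intra-frame connections."""
    dist, queue = {src: 0}, deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def neighbor_set(t, i, adj, frames, K=1, D=40):
    """B(v_ti) = {v_qj | d(v_ti, v_tj) <= K, |q - t| <= D}, with the patent's
    example criteria K = 1 (spatial) and D = 40 (temporal)."""
    hops = intra_frame_hops(adj, i)
    spatial = [(t, j) for j, h in hops.items() if h <= K]        # same frame
    temporal = [(q, i) for q in frames if 0 < abs(q - t) <= D]   # same joint
    return spatial + temporal
```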
The convolution kernel function mainly comprises the size of the convolution kernel and the weight function w(v_ti) of the convolution kernel. A convolution kernel on an image is generally a square of fixed size whose weights are to be optimized; the convolution kernel of the skeleton graph is instead designed as a mapping from the neighbor node set B(v_ti) of v_ti to K labels, l_ti: B(v_ti) → {0, ..., K − 1}. That is, the neighbor nodes of the central node are divided into K labels (subsets) according to a preset rule (e.g., whether a neighbor is nearer to or farther from the skeleton's center of gravity than the central node), and the weight values corresponding to the labels together form the weight function w(v_ti) to be optimized. w(v_ti) can be optimized by back propagation and the like. This design of the convolution kernel solves the problem that the input of graph convolution is non-Euclidean structured data (not in matrix form).
The preset rule may specifically be any one of the following rules, but is not limited to them (a sketch of the spatial-position rule follows the list):
unified label partition rule: the central node and its neighbor nodes all belong to one subset (label);
distance partition rule: the central node forms one subset, and the other neighbor nodes form another subset;
spatial-position partition rule: taking the central node's distance to the center of gravity of the whole skeleton as the reference, the nodes are divided into three subsets whose distances are greater than, equal to, or less than the reference.
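The sketch below implements the spatial-position rule as one plausible reading: each neighbor is assigned one of three labels by comparing its distance to the skeleton's center of gravity against the central node's distance; the coordinate layout (one row per joint) is an assumption.

```python
import numpy as np

def spatial_position_labels(coords, center_idx, neighbor_idx):
    """Assign each neighbor to one of 3 subsets (labels 0/1/2): distance to
    the skeleton's center of gravity equal to, less than, or greater than
    that of the central node."""
    gravity = coords.mean(axis=0)                # center of gravity of all joints
    ref = np.linalg.norm(coords[center_idx] - gravity)
    labels = {}
    for j in neighbor_idx:
        d = np.linalg.norm(coords[j] - gravity)
        labels[j] = 0 if np.isclose(d, ref) else (1 if d < ref else 2)
    return labels
```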
The following illustrates a skeleton sequence behavior identification method based on a space-time graph convolutional network according to an embodiment of the present invention, as shown in fig. 4, the flow includes:
(1) the bone sequence is uniformly sampled as the FDNet initialization input key frame sequence (i.e., initialization data).
(2) Determine the network weight ω of FDNet, and obtain the states s from the actions a through the Markov decision process.
At initialization, the initial state s_1 is determined by uniform sampling of the bone sequence (i.e., the initial key frames are determined), the initial value of the network weight ω is determined randomly, and the initial action a_1 is determined according to a_t = max Q(s_t, ω).
Subsequently, the current state is determined from the previous action (e.g., the state s_2 is determined from the initial action a_1), and the action for the current step is determined from the current state (e.g., a_2 = max Q(s_2, ω)), corresponding to the formula a_t = max Q(s_t, ω).
In the FDNet training process, ω is updated after each use (after an action is determined according to the updated ω, a state is obtained accordingly and a key frame sequence is obtained); see the second part above for details.
(3) Construct the key frames corresponding to the state s into a bone space-time graph according to the first part above (the bone sequence space-time graph modeling method).
(4) Train ST-GCN (through operations such as computing the loss and/or error and back propagation) using the skeleton space-time graph from (3) as input; the training content includes the weight function w(v_ti). Extract the space-time features using the algorithm of the third part above (the space-time graph convolutional network based on the skeleton sequence), and then obtain the behavior recognition result using a SoftMax function (normalized exponential function); a sketch of this convolution-and-SoftMax step is given after this list.
(5) Reversely adjust FDNet (including updating the network weight ω) according to the behavior recognition result from (4), using the algorithm of the second part above (the key frame preprocessing network of the space-time graph convolutional network), and optimize the key frame sequence selection result.
(6) Execute (2) through (5) cyclically, cross-adjusting the parameters of the two networks (FDNet and ST-GCN), until the behavior recognition result no longer changes appreciably (specifically, until the recognition result no longer changes) and is consistent with the skeleton sequence label (for the skeleton sequence in (1)), obtaining the final FDNet and ST-GCN.
Specifically, before reversely adjusting FDNet, the obtained behavior recognition result may be compared with the label of the bone sequence in (1); if they are consistent and the number of consecutive matches reaches a preset threshold N, no reverse adjustment is performed; otherwise, adjustment continues.
The comparison may be performed after FDNet has been reversely adjusted a preset number of times during training, or starting from the first execution of (4) during training; this is not limited herein.
(7) Perform human behavior recognition using the final FDNet and ST-GCN.
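As referenced in step (4) above, the following sketch shows how the labeled subsets, the weight function w(v_ti) (represented here by one matrix per label), and the SoftMax classifier fit together; the dictionary-based interfaces and array shapes are assumptions for illustration.

```python
import numpy as np

def graph_conv(features, neighbors, labels, W):
    """One weighted summation at a central node: each neighbor j contributes
    W[labels[j]] @ features[j], realizing the shared convolution kernel."""
    out = np.zeros(W[0].shape[0])
    for j in neighbors:
        out = out + W[labels[j]] @ features[j]
    return out

def softmax(z):
    """SoftMax (normalized exponential function) over class scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()
```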
In the embodiment of the invention, the two networks (FDNet and ST-GCN) optimize and promote each other: the preprocessing network FDNet provides key frame training data for the space-time graph convolutional network ST-GCN, and the more representative and information-rich the extracted key frames, the more accurate the trained ST-GCN parameters; likewise, the more accurate the ST-GCN recognition result, the more accurate the data used to reversely adjust FDNet, further optimizing the network to obtain higher-quality key frame sequences.
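The mutual optimization of the two networks can be pictured as the alternating loop below; select_key_frames, fit, predict, reverse_adjust, and the convergence test are hypothetical stand-ins for steps (2) through (6), not APIs defined by the patent.

```python
def co_train(fdnet, st_gcn, skeleton_data, labels, max_rounds=100):
    """Alternate until the recognition result stops changing and matches the
    labels: FDNet distills key frames, ST-GCN trains on them, and the
    recognition result reversely adjusts FDNet."""
    prev = None
    for _ in range(max_rounds):
        key_frames = fdnet.select_key_frames(skeleton_data)   # assumed API
        st_gcn.fit(key_frames, labels)                        # assumed API
        result = st_gcn.predict(key_frames)                   # assumed API
        if result == prev and result == labels:
            break
        fdnet.reverse_adjust(result, labels)                  # assumed API
        prev = result
    return fdnet, st_gcn
```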
In the embodiment of the present invention, the depth camera may be used to directly acquire the bone sequence, but not limited thereto.
Thus, the skeleton sequence behavior identification method based on the space-time graph convolutional network in the embodiment of the invention proceeds as follows: aiming at the problems that video behavior recognition involves huge computation and redundant, noisy frame information, and that modeling is difficult when a skeleton sequence is used for deep-learning behavior recognition, the method proposes time-space domain graph modeling of the skeleton sequence and a space-time graph convolutional network with a key frame preprocessing network, where the two networks (the key frame preprocessing network and the space-time graph convolutional network) cooperate and are mutually optimized to recognize human behavior;
the key frame preprocessing network of the time-space graph convolution network comprises the following steps:
the preprocessing network FDNet for extracting the key frames adopts a Markov decision process, performs state conversion by executing different actions to obtain a return function, guides the execution of the next action according to the return function, and performs cyclic operation to determine the key frame sequence with most representativeness and most abundant information content.
The time-space diagram convolution network based on the skeleton sequence comprises the following steps:
and constructing a space-time-space graph convolution network based on a space-time domain graph model of the skeleton sequence. The sampling function of the graph convolution network is essentially the structure of a neighbor node set of a central node and is divided into an intra-frame subset and an inter-frame subset, wherein the intra-frame subset mainly comprises other naturally connected nodes which are within a specified range of the distance from the central node, and the inter-frame subset mainly comprises other nodes which are within a certain range before and after the frame where the central node is located and correspond to the same position. The convolution kernel function designs a weight function in a key mode, nodes in the neighbor nodes are divided into different subsets according to a certain rule, and each subset corresponds to different weight parameters. And finally, carrying out multilayer weighted summation on the time-space domain graph model of the bone sequence according to the sampling function and the convolution kernel function, and extracting time domain and space domain characteristics.
In summary, the embodiment of the invention provides a human bone sequence behavior identification method based on a space-time graph convolutional network. The method makes full use of the natural connections between skeleton nodes to establish a time-space domain graph model, giving the model stronger generalization ability without artificially defining body parts; the original skeleton sequence is processed by the preprocessing network that extracts key frames, so that information-rich, discriminative, and representative frames are extracted, reducing the computation of the graph convolutional network, reducing the interference of redundant information, and improving model training efficiency; a space-time graph convolutional network based on the skeleton key frame sequence mines the time-domain and space-domain features of the skeleton sequence simultaneously, reducing model design complexity; and the organization method in which the preprocessing network and the graph convolutional network cooperate and are mutually optimized improves the overall efficiency and accuracy of behavior recognition.
The scheme adopts a deep learning method, and solves the problem of model training of large-scale data; the skeleton sequence is described by adopting a time-space domain graph model, the natural connection characteristic of the skeleton sequence is reserved, and the characteristics of richer expressive force can be extracted;
The skeleton sequence is obtained directly with a depth camera for behavior recognition, reducing the computation of deep-network skeleton extraction; recognition and classification with the graph convolutional network and the SoftMax function make the result more accurate and the model more generalizable;
the method does not require manual extraction of various features: it describes the whole sequence with graph modeling and extracts time-domain and space-domain features with graph convolution, so fewer steps require manual participation and the model is more intelligent; the added preprocessing network for extracting key frames further reduces the workload of behavior identification.
An embodiment of the present invention further provides a behavior recognition apparatus, as shown in fig. 5, including:
a first obtaining module 51, configured to obtain the key frame sequence data to be identified from the bone sequence data to be identified;
and the first identification module 52 is configured to identify a behavior action corresponding to the to-be-identified key frame sequence data by using a space-time graph convolutional network.
The behavior recognition device provided by the embodiment of the invention acquires the key frame sequence data to be identified from the bone sequence data to be identified and identifies the behavior action corresponding to the key frame sequence data using a space-time graph convolutional network; this reduces redundant-information interference and the workload of behavior recognition; the time-domain and space-domain features of the skeleton sequence are mined simultaneously, which reduces model design complexity and improves recognition efficiency and accuracy.
Wherein the first identification module comprises: the first construction submodule is used for constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified; the first extraction submodule is used for extracting the spatiotemporal features to be identified from the bone sequence spatiotemporal image to be identified by utilizing a spatiotemporal image convolution network; and the first identification submodule is used for identifying the behavior action corresponding to the space-time characteristic to be identified.
Specifically, the first identification submodule includes: and the first processing unit is used for obtaining the behavior action corresponding to the space-time feature to be identified by utilizing the normalized exponential function.
In order to extract frames that are information-rich, more discriminative, and representative, in an embodiment of the present invention the first obtaining module includes: the first acquisition sub-module, used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network.
Further, the behavior recognition device further includes: the first training module is used for training the frame rectification network and the space-time graph convolution network before acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network.
Wherein the first training module comprises: the second acquisition submodule is used for acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior action by using the frame rectification network; the first processing submodule is used for training the space-time graph convolutional network by utilizing the key frame sequence training data and identifying a training behavior action corresponding to the key frame sequence training data by utilizing the trained space-time graph convolutional network; the first adjusting submodule is used for adjusting the frame rectification network reversely according to the training behavior action;
the third acquisition submodule is used for acquiring the key frame sequence training data from the bone sequence data corresponding to the preset behavior action again by using the adjusted frame rectification network; and the first training submodule is used for training the space-time graph convolutional network again by using the key frame sequence training data acquired by the adjusted frame rectification network.
Further, the behavior recognition device further includes: the second identification module, used for identifying, by using the retrained space-time graph convolutional network, the training behavior action corresponding to the key frame sequence training data obtained by the adjusted frame rectification network, after the space-time graph convolutional network has been retrained with the key frame sequence training data obtained by the adjusted frame rectification network; and the first processing module, used for storing the parameter information of the frame rectification network and the space-time graph convolutional network if the number of consecutive times that the obtained training behavior action is consistent with the preset behavior action is greater than a first threshold; otherwise, returning to the operation of reversely adjusting the frame rectification network according to the training behavior action.
Wherein the adjusting the frame rectification network in reverse direction according to the training behavior action comprises: acquiring a return function value according to the training behavior action; obtaining a loss function according to the return function value; and carrying out gradient descent on the loss function, and updating the network weight of the frame rectification network.
In particular, the loss function is l = α(r_t + γ × max Q(s_{t+1}, ω) − Q(s_t, ω)), where l denotes the loss function, α denotes a preset learning rate, r_t denotes the reward-function value for the t-th training of the frame rectification network, γ denotes a preset decay value, Q(s_{t+1}, ω) and Q(s_t, ω) each denote an action-value function, s_{t+1} denotes the bone sequence state corresponding to the key frame sequence training data at the (t+1)-th training of the frame rectification network, s_t denotes the bone sequence state corresponding to the key frame sequence training data at the t-th training of the frame rectification network, and ω denotes the network weight of the frame rectification network.
The implementation embodiments of the behavior recognition method are all suitable for the embodiment of the behavior recognition device, and the same technical effects can be achieved.
An embodiment of the present invention further provides a behavior recognition apparatus, as shown in fig. 6, including a memory 61, a processor 62, and a computer program 63 stored on the memory 61 and executable on the processor; the processor 62, when executing the program, implements the behavior recognition method described above.
The implementation embodiments of the behavior recognition method are all suitable for the embodiment of the behavior recognition device, and the same technical effect can be achieved.
Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in the behavior recognition method.
The implementation embodiments of the behavior recognition method are all applicable to the embodiment of the computer-readable storage medium, and the same technical effects can be achieved.
It should be noted that many of the functional components described in this specification are referred to as modules/sub-modules/units in order to more particularly emphasize their implementation independence.
In embodiments of the present invention, the modules/sub-modules/units may be implemented in software for execution by various types of processors. An identified module of executable code may, for instance, comprise one or more physical or logical blocks of computer instructions which may, for instance, be constructed as an object, procedure, or function. Nevertheless, the executables of an identified module need not be physically located together, but may comprise disparate instructions stored in different bits which, when joined logically together, comprise the module and achieve the stated purpose for the module.
Indeed, a module of executable code may be a single instruction, or many instructions, and may even be distributed over several different code segments, among different programs, and across several memory devices. Likewise, operational data may be identified within the modules and may be embodied in any suitable form and organized within any suitable type of data structure. The operational data may be collected as a single data set, or may be distributed over different locations including over different storage devices, and may exist, at least partially, merely as electronic signals on a system or network.
When a module can be implemented by software, then, considering the level of existing hardware technology and setting cost aside, it can equally be implemented by building corresponding hardware circuits to realize the same functions, including conventional very-large-scale integration (VLSI) circuits or gate arrays and existing semiconductors such as logic chips, transistors, or other discrete components.
While the preferred embodiments of the present invention have been described, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (20)

1. A method of behavior recognition, comprising:
acquiring key frame sequence data to be identified from bone sequence data to be identified;
and identifying the behavior action corresponding to the key frame sequence data to be identified by using a space-time graph convolutional network.
2. The behavior recognition method according to claim 1, wherein the identifying the behavior action corresponding to the key frame sequence data to be identified by using the space-time graph convolutional network comprises:
constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified;
extracting a space-time feature to be identified from the bone sequence space-time diagram to be identified by using the space-time graph convolutional network;
and identifying the behavior action corresponding to the space-time feature to be identified.
3. The behavior recognition method according to claim 2, wherein the identifying the behavior action corresponding to the space-time feature to be identified comprises:
obtaining the behavior action corresponding to the space-time feature to be identified by using a normalized exponential (softmax) function.
4. The behavior recognition method according to claim 1, wherein the acquiring key frame sequence data to be identified from the bone sequence data to be identified comprises:
acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using a frame rectification network.
5. The behavior recognition method according to claim 4, wherein before the acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network, the method further comprises:
training the frame rectification network and the space-time graph convolutional network.
6. The behavior recognition method of claim 5, wherein the training the frame rectification network and the space-time graph convolutional network comprises:
acquiring key frame sequence training data from bone sequence data corresponding to a preset behavior action by using the frame rectification network;
training the space-time graph convolutional network by using the key frame sequence training data, and identifying a training behavior action corresponding to the key frame sequence training data by using the trained space-time graph convolutional network;
adjusting the frame rectification network in reverse according to the training behavior action;
acquiring the key frame sequence training data again from the bone sequence data corresponding to the preset behavior action by using the adjusted frame rectification network;
and retraining the space-time graph convolutional network by using the key frame sequence training data acquired by the adjusted frame rectification network.
7. The behavior recognition method according to claim 6, further comprising, after the retraining of the space-time graph convolutional network by using the key frame sequence training data acquired by the adjusted frame rectification network:
identifying, by using the retrained space-time graph convolutional network, a training behavior action corresponding to the key frame sequence training data acquired by the adjusted frame rectification network;
if the number of consecutive times that the identified training behavior action is consistent with the preset behavior action is greater than a first threshold, storing parameter information of the frame rectification network and the space-time graph convolutional network; otherwise, returning to the operation of adjusting the frame rectification network in reverse according to the training behavior action.
8. The behavior recognition method according to claim 6 or 7, wherein the adjusting the frame rectification network in reverse according to the training behavior action comprises:
acquiring a reward function value according to the training behavior action;
obtaining a loss function according to the reward function value;
and performing gradient descent on the loss function to update the network weights of the frame rectification network.
9. The behavior recognition method according to claim 8, wherein the loss function is specifically:
l = α(r_t + γ × max Q(s_{t+1}, ω) - Q(s_t, ω));
where l denotes the loss function, α denotes a preset learning rate, r_t denotes the reward function value when the frame rectification network is trained for the t-th time, γ denotes a preset decay value, Q(s_{t+1}, ω) and Q(s_t, ω) each denote an action-value function, s_{t+1} denotes the bone sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the (t+1)-th time, s_t denotes the bone sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the t-th time, and ω denotes the network weights of the frame rectification network. An illustrative code sketch of the overall training loop appears after the claims.
10. A behavior recognition apparatus, comprising:
the first acquisition module is used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified;
and the first identification module is used for identifying the behavior action corresponding to the key frame sequence data to be identified by utilizing a space-time graph convolutional network.
11. The behavior recognition device according to claim 10, wherein the first identification module comprises:
the first construction submodule is used for constructing the key frame sequence data to be identified into a bone sequence space-time diagram to be identified;
the first extraction submodule is used for extracting the space-time feature to be identified from the bone sequence space-time diagram to be identified by using the space-time graph convolutional network;
and the first identification submodule is used for identifying the behavior action corresponding to the space-time feature to be identified.
12. The behavior recognition device according to claim 11, wherein the first identification submodule comprises:
the first processing unit is used for obtaining the behavior action corresponding to the space-time feature to be identified by using a normalized exponential (softmax) function.
13. The behavior recognition device according to claim 10, wherein the first acquisition module comprises:
the first acquisition submodule is used for acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using a frame rectification network.
14. The behavior recognition device according to claim 13, further comprising:
the first training module is used for training the frame rectification network and the space-time graph convolution network before acquiring the key frame sequence data to be identified from the bone sequence data to be identified by using the frame rectification network.
15. The behavior recognition device of claim 14, wherein the first training module comprises:
the second acquisition submodule is used for acquiring key frame sequence training data from the bone sequence data corresponding to the preset behavior action by using the frame rectification network;
the first processing submodule is used for training the space-time graph convolutional network by utilizing the key frame sequence training data and identifying a training behavior action corresponding to the key frame sequence training data by utilizing the trained space-time graph convolutional network;
the first adjusting submodule is used for adjusting the frame rectification network in reverse according to the training behavior action;
the third acquisition submodule is used for acquiring the key frame sequence training data again from the bone sequence data corresponding to the preset behavior action by using the adjusted frame rectification network;
and the first training submodule is used for retraining the space-time graph convolutional network by using the key frame sequence training data acquired by the adjusted frame rectification network.
16. The behavior recognition device according to claim 15, further comprising:
the second identification module is used for, after the space-time graph convolutional network is retrained by using the key frame sequence training data acquired by the adjusted frame rectification network, identifying, by using the retrained space-time graph convolutional network, a training behavior action corresponding to the key frame sequence training data acquired by the adjusted frame rectification network;
the first processing module is used for storing parameter information of the frame rectification network and the space-time graph convolutional network if the number of consecutive times that the identified training behavior action is consistent with the preset behavior action is greater than a first threshold; otherwise, returning to the operation of adjusting the frame rectification network in reverse according to the training behavior action.
17. The behavior recognition device according to claim 15 or 16, wherein the adjusting the frame rectification network in reverse according to the training behavior action comprises:
acquiring a reward function value according to the training behavior action;
obtaining a loss function according to the reward function value;
and performing gradient descent on the loss function to update the network weights of the frame rectification network.
18. The behavior recognition device according to claim 17, wherein the loss function is specifically:
l = α(r_t + γ × max Q(s_{t+1}, ω) - Q(s_t, ω));
where l denotes the loss function, α denotes a preset learning rate, r_t denotes the reward function value when the frame rectification network is trained for the t-th time, γ denotes a preset decay value, Q(s_{t+1}, ω) and Q(s_t, ω) each denote an action-value function, s_{t+1} denotes the bone sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the (t+1)-th time, s_t denotes the bone sequence state corresponding to the key frame sequence training data when the frame rectification network is trained for the t-th time, and ω denotes the network weights of the frame rectification network.
19. A behavior recognition device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor; characterized in that the processor implements the behavior recognition method according to any one of claims 1 to 9 when executing the program.
20. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the behavior recognition method according to any one of claims 1 to 9.
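As a reading aid for claims 1 to 9, the following minimal sketch traces the claimed control flow: a frame rectification network selects key frames from bone sequence data (claims 4 to 6), a space-time graph convolutional network classifies them and a normalized exponential (softmax) function yields the behavior action (claims 1 to 3), and the rectification network is adjusted in reverse from a reward signal until the identified action matches the preset action more than a first-threshold number of consecutive times (claims 6 to 9). Both networks are reduced to toy stand-ins, and every name, shape, and update rule here is a hypothetical illustration rather than the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def frame_rectification_net(bone_seq, omega):
    # Hypothetical stand-in: score each frame with a learned per-frame bias
    # and keep the top half as key frames, preserving temporal order.
    scores = bone_seq.mean(axis=(1, 2)) + omega
    keep = np.sort(np.argsort(scores)[-len(scores) // 2:])
    return keep, bone_seq[keep]

def st_gcn(key_frames, theta):
    # Hypothetical stand-in for the space-time graph convolutional network:
    # pool over frames and joints, then map pooled coordinates to class logits.
    feat = key_frames.mean(axis=(0, 1))  # shape: (coords,)
    return feat @ theta                  # shape: (num_classes,)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

bone_seq = rng.normal(size=(32, 18, 3))  # frames x joints x coordinates
theta = rng.normal(size=(3, 4))          # toy ST-GCN weights, 4 behavior actions
omega = np.zeros(32)                     # per-frame selection bias (rectification net)
alpha, preset_action, first_threshold = 0.1, 2, 5
streak = 0

for step in range(200):                  # iteration cap keeps the sketch terminating
    keep, key_frames = frame_rectification_net(bone_seq, omega)
    action = int(np.argmax(softmax(st_gcn(key_frames, theta))))
    if action == preset_action:
        streak += 1
        if streak > first_threshold:     # claim 7: enough consecutive matches
            print(f"step {step}: saving parameter information of both networks")
            break
    else:
        streak = 0
    # claims 8-9 in miniature: reward -> loss -> update of the rectification net
    r_t = 1.0 if action == preset_action else -1.0
    omega[keep] += alpha * r_t           # reinforce or penalize the chosen frames
else:
    print("no stable match within the iteration cap (toy data)")
```

In the claimed loop, the reverse adjustment would instead perform gradient descent on the loss of claim 9 over the network weights ω, and the ST-GCN itself would be retrained between selections; both are collapsed into a single bias update here so the sketch stays self-contained.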
CN201910000953.0A 2019-01-02 2019-01-02 Behavior identification method, device and equipment Active CN111401106B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910000953.0A CN111401106B (en) 2019-01-02 2019-01-02 Behavior identification method, device and equipment

Publications (2)

Publication Number Publication Date
CN111401106A 2020-07-10
CN111401106B CN111401106B (en) 2023-03-31

Family

ID=71430152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910000953.0A Active CN111401106B (en) 2019-01-02 2019-01-02 Behavior identification method, device and equipment

Country Status (1)

Country Link
CN (1) CN111401106B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017133009A1 (en) * 2016-02-04 2017-08-10 广州新节奏智能科技有限公司 Method for positioning human joint using depth image of convolutional neural network
WO2018119807A1 (en) * 2016-12-29 2018-07-05 浙江工商大学 Depth image sequence generation method based on convolutional neural network and spatiotemporal coherence
CN107169415A (en) * 2017-04-13 2017-09-15 西安电子科技大学 Human motion recognition method based on convolutional neural networks feature coding
CN108229355A (en) * 2017-12-22 2018-06-29 北京市商汤科技开发有限公司 Activity recognition method and apparatus, electronic equipment, computer storage media, program
CN108304795A (en) * 2018-01-29 2018-07-20 清华大学 Human skeleton Activity recognition method and device based on deeply study

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
丰艳 et al., "View-independent skeleton behavior recognition based on a spatio-temporal attention deep network", Journal of Computer-Aided Design & Computer Graphics *
王珂 et al., "A CNNs action recognition method fusing global spatio-temporal features", Journal of Huazhong University of Science and Technology (Natural Science Edition) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112069979A (en) * 2020-09-03 2020-12-11 浙江大学 Real-time action recognition man-machine interaction system
CN112069979B (en) * 2020-09-03 2024-02-02 浙江大学 Real-time action recognition man-machine interaction system
CN112070027B (en) * 2020-09-09 2022-08-26 腾讯科技(深圳)有限公司 Network training and action recognition method, device, equipment and storage medium
CN112070027A (en) * 2020-09-09 2020-12-11 腾讯科技(深圳)有限公司 Network training and action recognition method, device, equipment and storage medium
CN112380955A (en) * 2020-11-10 2021-02-19 浙江大华技术股份有限公司 Action recognition method and device
CN112380955B (en) * 2020-11-10 2023-06-16 浙江大华技术股份有限公司 Action recognition method and device
CN112528811A (en) * 2020-12-02 2021-03-19 建信金融科技有限责任公司 Behavior recognition method and device
CN113128424A (en) * 2021-04-23 2021-07-16 浙江理工大学 Attention mechanism-based graph convolution neural network action identification method
CN113128424B (en) * 2021-04-23 2024-05-03 浙江理工大学 Method for identifying action of graph convolution neural network based on attention mechanism
CN113139469B (en) * 2021-04-25 2022-04-29 武汉理工大学 Driver road stress adjusting method and system based on micro-expression recognition
CN113139469A (en) * 2021-04-25 2021-07-20 武汉理工大学 Driver road stress adjusting method and system based on micro-expression recognition
CN113378656A (en) * 2021-05-24 2021-09-10 南京信息工程大学 Action identification method and device based on self-adaptive graph convolution neural network
CN113378656B (en) * 2021-05-24 2023-07-25 南京信息工程大学 Action recognition method and device based on self-adaptive graph convolution neural network
CN113989927A (en) * 2021-10-27 2022-01-28 东北大学 Video group violent behavior identification method and system based on skeleton data
CN113989927B (en) * 2021-10-27 2024-04-26 东北大学 Method and system for identifying violent behaviors of video group based on bone data

Also Published As

Publication number Publication date
CN111401106B (en) 2023-03-31

Similar Documents

Publication Publication Date Title
CN111401106B (en) Behavior identification method, device and equipment
Liu et al. Recognizing human actions as the evolution of pose estimation maps
CN109685037B (en) Real-time action recognition method and device and electronic equipment
CN111291809B (en) Processing device, method and storage medium
CN109948741A (en) A kind of transfer learning method and device
CN108399435B (en) Video classification method based on dynamic and static characteristics
CN109902565B (en) Multi-feature fusion human behavior recognition method
CN111539941B (en) Parkinson&#39;s disease leg flexibility task evaluation method and system, storage medium and terminal
CN111199207B (en) Two-dimensional multi-human body posture estimation method based on depth residual error neural network
CN110477907B (en) Modeling method for intelligently assisting in recognizing epileptic seizures
CN111105439A (en) Synchronous positioning and mapping method using residual attention mechanism network
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
CN114241270A (en) Intelligent monitoring method, system and device for home care
CN114743273A (en) Human skeleton behavior identification method and system based on multi-scale residual error map convolutional network
CN113205043B (en) Video sequence two-dimensional attitude estimation method based on reinforcement learning
CN118135660A (en) Cross-view gait recognition method for joint multi-view information bottleneck under view-angle deficiency condition
CN112199994B (en) Method and device for detecting interaction of3D hand and unknown object in RGB video in real time
CN117935362A (en) Human behavior recognition method and system based on heterogeneous skeleton diagram
Zhu et al. Dance Action Recognition and Pose Estimation Based on Deep Convolutional Neural Network.
CN116665300A (en) Skeleton action recognition method based on space-time self-adaptive feature fusion graph convolution network
CN116824689A (en) Bone sequence behavior recognition method, device, equipment and storage medium
CN113158870B (en) Antagonistic training method, system and medium of 2D multi-person gesture estimation network
CN114882595A (en) Armed personnel behavior identification method and armed personnel behavior identification system
Tsai et al. Temporal-variation skeleton point correction algorithm for improved accuracy of human action recognition
Benhamida et al. Theater Aid System for the Visually Impaired Through Transfer Learning of Spatio-Temporal Graph Convolution Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant