CN109753897B - Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning - Google Patents

Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning Download PDF

Info

Publication number
CN109753897B
Authority
CN
China
Prior art keywords
video
memory unit
neural network
time sequence
recurrent neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811569882.8A
Other languages
Chinese (zh)
Other versions
CN109753897A (en
Inventor
袁媛
王琦
王栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201811569882.8A priority Critical patent/CN109753897B/en
Publication of CN109753897A publication Critical patent/CN109753897A/en
Application granted granted Critical
Publication of CN109753897B publication Critical patent/CN109753897B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Abstract

The invention discloses a behavior recognition method based on memory cell reinforcement-time sequence dynamic learning, which addresses the poor practicability of existing behavior recognition methods. In the technical scheme, a recurrent neural network fused with a memory unit models the time sequence structure information of a long-term video sequence; a discretized memory unit read-write controller module classifies each video frame as either a relevant frame or a noise frame, writes the information of relevant frames into the memory unit, and ignores noise frame information. The method can therefore filter out the large amount of noise information in un-clipped video. By combining the recurrent neural network with the memory unit, it links time sequence structures across large time spans and, through data-driven self-training, models the long-term time sequence structure patterns of complex person behaviors. It thereby addresses the complex motion patterns and frequent background changes of long, un-clipped videos described in the background art, improves the robustness of person behavior recognition, and achieves an average recognition accuracy of 94.8%.

Description

Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
Technical Field
The present invention relates to a behavior recognition method, and more particularly, to a behavior recognition method based on memory cell reinforcement-time sequence dynamic learning.
Background
The document "L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Proceedings of the European Conference on Computer Vision, pp. 20-36, 2016" discloses a person behavior recognition method based on a two-stream convolutional neural network and a temporal segment network. The method uses two independent convolutional neural networks to solve the behavior recognition task: the spatial-stream network extracts the appearance features of the target from the video frames, the temporal-stream network extracts the motion features of the target from the corresponding optical flow field data, and the behavior recognition result is obtained by fusing the outputs of the two networks. The method further proposes a temporal segment network to model the long-term time sequence structure information of the video sequence; through a sparse temporal sampling strategy and sequence-level supervised learning, the whole neural network is trained efficiently and effectively, and good results are obtained on large-scale public data sets. However, the method models the time sequence within the video only coarsely, so the network often ignores the temporal correlation of the features during learning. When the video sequence is long and un-clipped, irrelevant noise information is blended into the final recognition result, the accuracy of person behavior recognition decreases, and the added noise also makes training of the whole neural network difficult.
Disclosure of Invention
In order to overcome the poor practicability of existing behavior recognition methods, the invention provides a behavior recognition method based on memory unit reinforcement-time sequence dynamic learning. The method uses a recurrent neural network fused with a memory unit to model the time sequence structure information of a long-term video sequence; a discretized memory unit read-write controller module classifies each video frame as either a relevant frame or a noise frame, writes the information of relevant frames into the memory unit, and ignores noise frame information, so that a large amount of noise in un-clipped video can be filtered out and the accuracy of subsequent behavior recognition improves. In addition, the recurrent neural network fused with the memory unit can link time sequence structures across large time spans and, through data-driven self-training, model the long-term time sequence structure patterns of complex person behaviors. This addresses the complex motion patterns and frequent background changes of long, un-clipped videos, improves the robustness of person behavior recognition, and achieves average recognition accuracies of 94.8% and 71.8% on the UCF101 and HMDB51 data sets, respectively.
The technical scheme adopted by the invention for solving the technical problems is as follows: a behavior identification method based on memory cell reinforcement-time sequence dynamic learning, characterized by comprising the following steps:
Step one, computing the optical flow of a video frame I_a, wherein the optical flow of each pixel is represented by a two-dimensional vector (Δx, Δy) and stored as an optical flow map I_m. Extracting the respective high-dimensional semantic features by using two independent convolutional neural networks:
x_a = CNN_a(I_a; w_a)    (1)
x_m = CNN_m(I_m; w_m)    (2)
wherein CNN_a and CNN_m respectively denote the appearance convolutional neural network and the motion convolutional neural network, which extract the high-dimensional features of the video frame I_a and the optical flow map I_m; x_a and x_m are 2048-dimensional vectors respectively representing the appearance and motion features extracted by the convolutional neural networks; w_a and w_m denote the internal trainable parameters of the two convolutional neural networks; and x is used to denote the high-dimensional feature extracted by a convolutional neural network.
Step two, initializing the memory unit M to be empty, denoted M_0. Assume that at the t-th video frame the memory unit M_t is not empty and contains N_t > 0 elements, denoted m_1, m_2, ..., m_{N_t}. The memory read operation at the corresponding time is then:
mh_t = (1/N_t) · Σ_{i=1..N_t} m_i    (3)
wherein the read-out mh_t represents the historical information of the video before time t.
Step three, extracting the short-time context features of the video content by using the segmented recurrent neural network. Taking the high-dimensional semantic feature x calculated in step one as input, the feature corresponding to the t-th video frame is denoted x_t. With the hidden states h_0, c_0 of the long short-term memory recurrent neural network (LSTM) initialized to zero, the short-time context feature at time t is calculated as follows:
(h̃_t, c̃_t) = LSTM(x_t, h_{t-1}, c_{t-1})    (4)
wherein LSTM(·) denotes the long short-term memory recurrent neural network, and h_{t-1}, c_{t-1} denote the hidden state of the recurrent neural network at the previous moment. The output h̃_t serves as the short-time context feature of the video content for subsequent calculation.
Step four, for each video frame, the high-dimensional semantic feature x_t, the memory unit history information mh_t and the short-time context feature h̃_t obtained in steps one, two and three are input into the memory unit controller, which computes a binary memory unit write command s_t ∈ {0, 1}, specifically as follows:
q_t = v^T · tanh(W_f·x_t + W_c·h̃_t + W_m·mh_t) + b_s    (5)
a_t = σ(q_t)    (6)
s_t = τ(a_t)    (7)
τ(a_t) = 1 if a_t > 0.5, and τ(a_t) = 0 otherwise    (8)
wherein v^T is a learnable row-vector parameter, W_f, W_c and W_m are learnable weight parameters, and b_s is a bias parameter. The sigmoid function σ(·) normalizes the linearly weighted result q_t to between 0 and 1, i.e. a_t ∈ (0, 1). Feeding a_t into the threshold-limited binarization function τ(·) yields the binary memory unit write command s_t.
Step five, updating the memory unit and the segmented recurrent neural network based on the binary memory unit write command s_t. For each video frame, the update strategy of the memory unit M_t is as follows:
M_t = M_{t-1} ∪ {m̃_t} if s_t = 1, and M_t = M_{t-1} if s_t = 0, where m̃_t = W_w·x_t    (9)
wherein W_w is a learnable weight matrix that converts the high-dimensional semantic feature x_t into the memory unit element m̃_t, and m̃_t is written into the memory unit M_{t-1} to form the new memory unit M_t. In addition, the hidden state h_t, c_t of the segmented recurrent neural network is updated as follows:
(h_t, c_t) = (0, 0) if s_t = 1, and (h_t, c_t) = (h̃_t, c̃_t) if s_t = 0    (10)
wherein h̃_t, c̃_t are the results calculated by equation (4).
Step six, performing behavior classification using the memory unit. Assuming the total video length is T, the memory unit at the end of processing the entire video is M_T, which contains N_T elements; the feature representation f of the entire video is then:
f = (1/N_T) · Σ_{i=1..N_T} m_i    (11)
wherein f is a D-dimensional vector representing the behavior-category information in the video. This feature is input into a fully connected classification layer to obtain the behavior classification score y, as follows:
y=softmax(W·f) (12)
wherein W ∈ R^{C×D}, and C represents the total number of recognizable behavior categories. The computed y gives the classification score of the system for each category; a higher score means the behavior more probably belongs to that category. Let y_a and y_m respectively denote the scores obtained by the appearance and motion neural networks; the final score y_f is then:
y_f = y_a + y_m    (13)
wherein y_f represents the final person behavior recognition result.
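Read together, steps one to six amount to a per-frame loop. The following compact PyTorch sketch runs one stream of that loop on stub features; the random stand-in features, the hidden and memory sizes, the tanh scoring of the controller, the 0.5 threshold, the mean aggregation used for the memory read and for the video-level feature, and the reset of the recurrent state after a write are all assumptions of this illustration rather than requirements of the invention.

import torch
import torch.nn as nn

T, feat_dim, hid, mem_dim, C = 16, 2048, 512, 512, 101
lstm = nn.LSTMCell(feat_dim, hid)                         # segmented recurrent network core
W_w = nn.Linear(feat_dim, mem_dim, bias=False)            # maps x_t to a memory element
score = nn.Sequential(nn.Linear(feat_dim + hid + mem_dim, 256), nn.Tanh(), nn.Linear(256, 1))
classifier = nn.Linear(mem_dim, C)                        # W in equation (12)

memory = []                                               # M_0 is empty
h = torch.zeros(1, hid)
c = torch.zeros(1, hid)
for t in range(T):
    x_t = torch.randn(1, feat_dim)                        # stand-in for the CNN feature of frame t
    mh_t = torch.stack(memory).mean(0) if memory else torch.zeros(1, mem_dim)   # read, eq. (3)
    h_tilde, c_tilde = lstm(x_t, (h, c))                  # short-time context, eq. (4)
    a_t = torch.sigmoid(score(torch.cat([x_t, h_tilde, mh_t], dim=1)))          # eqs. (5)-(6)
    s_t = (a_t > 0.5).float()                             # binary write command, eqs. (7)-(8)
    if s_t.item() == 1.0:
        memory.append(W_w(x_t))                           # write the relevant frame, eq. (9)
        h, c = torch.zeros_like(h), torch.zeros_like(c)   # assumed segment reset, eq. (10)
    else:
        h, c = h_tilde, c_tilde                           # noise frame: nothing is written

f = torch.stack(memory).mean(0) if memory else torch.zeros(1, mem_dim)           # eq. (11)
y = torch.softmax(classifier(f), dim=-1)                  # class scores, eq. (12)

For the full two-stream system, the same loop is run once on appearance features and once on motion features, and the two score vectors are summed as in equation (13).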
The invention has the beneficial effects that: the method uses a recurrent neural network fused with a memory unit to model the time sequence structure information of a long-term video sequence; a discretized memory unit read-write controller module classifies each video frame as either a relevant frame or a noise frame, writes the information of relevant frames into the memory unit, and ignores noise frame information, so that a large amount of noise in un-clipped video can be filtered out and the accuracy of subsequent behavior recognition improves. In addition, the recurrent neural network fused with the memory unit can link time sequence structures across large time spans and, through data-driven self-training, model the long-term time sequence structure patterns of complex person behaviors. This addresses the complex motion patterns and frequent background changes of long, un-clipped videos, improves the robustness of person behavior recognition, and achieves average recognition accuracies of 94.8% and 71.8% on the UCF101 and HMDB51 data sets, respectively.
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
Drawings
FIG. 1 is a flowchart of the behavior recognition method based on memory cell reinforcement-time sequence dynamic learning according to the present invention.
Detailed Description
Refer to FIG. 1. The behavior identification method based on memory cell reinforcement-time sequence dynamic learning specifically comprises the following steps:
Step one, extracting high-dimensional appearance and motion features containing semantic information. First, the optical flow of a video frame I_a is computed, wherein the optical flow of each pixel is represented by a two-dimensional vector (Δx, Δy) and stored as an optical flow map I_m. Then, the respective high-dimensional semantic features are extracted with two independent convolutional neural networks:
x_a = CNN_a(I_a; w_a)    (1)
x_m = CNN_m(I_m; w_m)    (2)
wherein CNN_a and CNN_m respectively denote the appearance convolutional neural network and the motion convolutional neural network, which extract the high-dimensional features of the video frame I_a and the optical flow map I_m; x_a and x_m are 2048-dimensional vectors respectively representing the appearance and motion features extracted by the convolutional neural networks; and w_a and w_m denote the internal trainable parameters of the two convolutional neural networks. Because the subsequent operations of the appearance network and the motion network are identical, for simplicity of notation the high-dimensional feature extracted by a convolutional neural network is denoted x.
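As an illustration of step one, the following minimal PyTorch sketch builds the two independent streams. The ResNet-50 backbone (whose pooled output is 2048-dimensional), the 224×224 input size and the 2-channel optical-flow input are assumptions of this sketch rather than choices prescribed by the invention.

import torch
import torch.nn as nn
from torchvision import models

def build_backbone(in_channels: int) -> nn.Module:
    # ResNet-50's pooled feature is 2048-dimensional, matching x_a and x_m in equations (1)-(2)
    net = models.resnet50(weights=None)
    if in_channels != 3:
        # the motion stream takes a 2-channel optical-flow map (Δx, Δy) instead of RGB
        net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Identity()                # drop the classifier head, keep the feature vector
    return net

cnn_a = build_backbone(3)                 # appearance network CNN_a
cnn_m = build_backbone(2)                 # motion network CNN_m

I_a = torch.randn(1, 3, 224, 224)         # one RGB video frame
I_m = torch.randn(1, 2, 224, 224)         # its optical-flow map
x_a = cnn_a(I_a)                          # equation (1), shape (1, 2048)
x_m = cnn_m(I_m)                          # equation (2), shape (1, 2048)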
Step two, initializing the memory unit M to be empty, denoted M_0. Assume that at the t-th video frame the memory unit M_t is not empty and contains N_t > 0 elements, denoted m_1, m_2, ..., m_{N_t}. The memory read operation at the corresponding time is then:
mh_t = (1/N_t) · Σ_{i=1..N_t} m_i    (3)
wherein the read-out mh_t represents the historical information of the video before time t; this historical information influences the analysis and understanding of the video content at the current moment.
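One possible realization of the memory unit of step two is sketched below; keeping the memory as a growing list of element vectors follows the text, while the mean aggregation in the read operation is an assumption of this sketch.

import torch

class MemoryUnit:
    def __init__(self, dim: int):
        self.dim = dim
        self.elements = []                          # M_0 is initialized to be empty

    def read(self) -> torch.Tensor:
        # mh_t: aggregated history of the video before time t (assumed mean read, eq. (3))
        if not self.elements:
            return torch.zeros(self.dim)
        return torch.stack(self.elements).mean(dim=0)

    def write(self, m: torch.Tensor) -> None:
        # append one element m_i; invoked by the write command of step five
        self.elements.append(m)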
Step three, extracting the short-time context features of the video content using the segmented recurrent neural network. Taking the high-dimensional semantic feature x calculated in step one as input, the feature corresponding to the t-th video frame is denoted x_t. First, the hidden states h_0, c_0 of the long short-term memory recurrent neural network (LSTM) are initialized to zero; the short-time context feature at time t is then calculated as follows:
(h̃_t, c̃_t) = LSTM(x_t, h_{t-1}, c_{t-1})    (4)
wherein LSTM(·) denotes the long short-term memory recurrent neural network, and h_{t-1}, c_{t-1} denote the hidden state of the recurrent neural network at the previous moment. The output h̃_t serves as the short-time context feature of the video content for subsequent calculation.
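Step three can be realized with a standard LSTM cell, as sketched below; the hidden size of 512 is an assumed hyper-parameter, not a value fixed by the invention.

import torch
import torch.nn as nn

feat_dim, hidden_dim = 2048, 512
lstm_cell = nn.LSTMCell(feat_dim, hidden_dim)       # core of the segmented recurrent network

h_prev = torch.zeros(1, hidden_dim)                 # h_0 initialized to zero
c_prev = torch.zeros(1, hidden_dim)                 # c_0 initialized to zero
x_t = torch.randn(1, feat_dim)                      # high-dimensional feature of frame t

h_tilde, c_tilde = lstm_cell(x_t, (h_prev, c_prev)) # equation (4): short-time context feature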
Step four, the discretized memory unit write controller. For each video frame, the high-dimensional semantic feature x_t, the memory unit history information mh_t and the short-time context feature h̃_t calculated in steps one, two and three are input into the memory unit controller, which computes a binary memory unit write command s_t ∈ {0, 1}, specifically as follows:
q_t = v^T · tanh(W_f·x_t + W_c·h̃_t + W_m·mh_t) + b_s    (5)
a_t = σ(q_t)    (6)
s_t = τ(a_t)    (7)
τ(a_t) = 1 if a_t > 0.5, and τ(a_t) = 0 otherwise    (8)
wherein v^T is a learnable row-vector parameter, W_f, W_c and W_m are learnable weight parameters, and b_s is a bias parameter. The sigmoid function σ(·) normalizes the linearly weighted result q_t to between 0 and 1, i.e. a_t ∈ (0, 1). Then, a_t is fed into the threshold-limited binarization function τ(·) to obtain the binary memory unit write command s_t.
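The discretized write controller of step four can be sketched as follows. The learnable parameters v, W_f, W_c, W_m and b_s correspond to those named above, while the tanh nonlinearity in the score, the projection size of 256 and the 0.5 threshold inside τ(·) are assumptions of this sketch.

import torch
import torch.nn as nn

class WriteController(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, mem_dim=512, proj_dim=256):
        super().__init__()
        self.W_f = nn.Linear(feat_dim, proj_dim, bias=False)    # acts on x_t
        self.W_c = nn.Linear(hidden_dim, proj_dim, bias=False)  # acts on the context feature
        self.W_m = nn.Linear(mem_dim, proj_dim, bias=False)     # acts on mh_t
        self.v = nn.Linear(proj_dim, 1, bias=True)              # v^T(.) + b_s

    def forward(self, x_t, h_tilde, mh_t):
        q_t = self.v(torch.tanh(self.W_f(x_t) + self.W_c(h_tilde) + self.W_m(mh_t)))  # eq. (5)
        a_t = torch.sigmoid(q_t)                                 # eq. (6): a_t in (0, 1)
        s_t = (a_t > 0.5).float()                                # eqs. (7)-(8): binary write command
        return s_t, a_t

Note that the hard threshold is not differentiable, so training the discretized controller end to end requires a suitable strategy (for example a straight-through estimator or a reinforcement-style reward), which this sketch does not cover.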
Step five, updating the memory unit and the segmented recurrent neural network based on the binary memory unit write command s_t. For each video frame, the update strategy of the memory unit M_t is as follows:
M_t = M_{t-1} ∪ {m̃_t} if s_t = 1, and M_t = M_{t-1} if s_t = 0, where m̃_t = W_w·x_t    (9)
wherein W_w is a learnable weight matrix that converts the high-dimensional semantic feature x_t into the memory unit element m̃_t, and m̃_t is written into the memory unit M_{t-1} to form the new memory unit M_t. In addition, the hidden state h_t, c_t of the segmented recurrent neural network is updated as follows:
(h_t, c_t) = (0, 0) if s_t = 1, and (h_t, c_t) = (h̃_t, c̃_t) if s_t = 0    (10)
wherein h̃_t, c̃_t are the results calculated by equation (4).
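The update rules of step five can then be sketched as follows. Writing W_w·x_t into the memory when s_t = 1 follows the text; resetting the recurrent state after a write, so that a new short-time segment begins, matches the reading of equation (10) given above but remains an assumption of this sketch.

import torch
import torch.nn as nn

mem_dim, feat_dim = 512, 2048
W_w = nn.Linear(feat_dim, mem_dim, bias=False)       # maps x_t to a memory element, eq. (9)

def update_step(memory, x_t, s_t, h_tilde, c_tilde):
    if s_t.item() == 1.0:
        memory.write(W_w(x_t).squeeze(0))            # relevant frame: append m̃_t to M_{t-1}
        h_t = torch.zeros_like(h_tilde)              # assumed segment reset, eq. (10)
        c_t = torch.zeros_like(c_tilde)
    else:
        h_t, c_t = h_tilde, c_tilde                  # noise frame: nothing written, context carries on
    return h_t, c_t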
Step six, performing behavior classification using the memory unit. Assuming the total video length is T, the memory unit at the end of processing the entire video is M_T, which contains N_T elements; the feature representation f of the entire video is then:
f = (1/N_T) · Σ_{i=1..N_T} m_i    (11)
wherein f is a D-dimensional vector representing the behavior-category information in the video. This feature is then input into a fully connected classification layer to obtain the behavior classification score y, as follows:
y=softmax(W·f) (12)
wherein W ∈ R^{C×D}, and C represents the total number of recognizable behavior categories. The computed y gives the classification score of the system for each category; a higher score means the behavior more probably belongs to that category. Let y_a and y_m respectively denote the scores obtained by the appearance and motion neural networks; the final score y_f is then:
y_f = y_a + y_m    (13)
wherein y_f represents the final person behavior recognition result.
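Step six can be sketched as below; the mean over the N_T memory elements as the video-level feature f and the class count C = 101 (the UCF101 setting used in the experiments) are assumptions of this sketch, while the fully connected softmax classifier and the score fusion follow equations (12)-(13).

import torch
import torch.nn as nn

D, C = 512, 101                                      # memory element size and number of classes
classifier = nn.Linear(D, C)                         # the weight W in equation (12)

def classify(memory) -> torch.Tensor:
    f = torch.stack(memory.elements).mean(dim=0)     # eq. (11): video-level feature (assumed mean)
    return torch.softmax(classifier(f), dim=-1)      # eq. (12): behavior classification scores y

# Equation (13): fuse the appearance-stream and motion-stream scores
# y_f = classify(memory_appearance) + classify(memory_motion)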
The effects of the present invention are further illustrated by the following simulation experiments.
1. Simulation conditions.
The simulation is carried out on a machine with an Intel Xeon E5-2697A 2.6 GHz CPU, an NVIDIA K80 graphics card, 16 GB of memory and the CentOS 7 operating system, using the PyTorch software framework.
The data used in the simulation come from two public test data sets, UCF101 and HMDB51, in which camera motion varies significantly and the backgrounds are complex. The experimental data comprise 13320 and 6766 video segments respectively, which are divided into 101 and 51 behavior categories. Most of the video data in the HMDB51 data set is un-clipped and contains much noise.
2. Simulation contents.
To demonstrate the effectiveness of the invention, the simulation experiment compares the memory cell reinforcement and time sequence dynamic learning method proposed by the invention against existing methods. Specifically, the comparison algorithms are the highly accurate two-stream temporal segment network (TSN) and the Lattice long short-term memory network (Lattice-LSTM) of L. Sun et al. in the document "L. Sun, K. Jia, K. Chen, D. Yeung, B. Shi and S. Savarese. Lattice Long Short-Term Memory for Human Action Recognition, In Proceedings of the IEEE International Conference on Computer Vision, pp. 2166-2175, 2017". The three algorithms use the same parameter settings, and their average AUC values are computed on the UCF101/HMDB51 data sets. The comparative results are shown in Table 1.
TABLE 1
Method TSN Lattice-LSTM OUR
AUC(UCF101) 93.6% 94.0% 94.8%
AUC(HMDB51) 66.2% 68.5% 71.8%
As can be seen from Table 1, the recognition accuracy of the present invention is significantly higher than that of the existing behavior recognition methods. Specifically, the accuracy of the TSN algorithm is lower than that of Lattice-LSTM and OUR because TSN does not consider the time sequence change pattern of the video content, whereas both Lattice-LSTM and OUR use a recurrent neural network to model the time sequence change pattern of the video; this supports the effectiveness of the time sequence dynamic learning method based on a recurrent neural network proposed by the invention. In addition, on the HMDB51 data set, the OUR algorithm is clearly superior to Lattice-LSTM, because the memory unit proposed by the invention effectively strengthens the ability of the recurrent neural network to handle long-term and un-clipped video. To further examine the effectiveness of the memory unit in strengthening the recurrent neural network, the simulation experiment also compares the algorithm of the present invention with various recurrent neural networks (LSTM, ALSTM and VideoLSTM) on the UCF101 data set; the results are shown in Table 2.
TABLE 2
Method LSTM ALSTM VideoLSTM Ours
AUC 88.3% 77.0% 89.2% 91.03%
As can be seen from Table 2, the proposed method achieves higher accuracy than the various recurrent neural network variants, because the memory unit reinforcement method effectively extracts the informative content of the video and can therefore model its time sequence change pattern. In contrast, plain recurrent neural network methods are susceptible to noise, which in turn reduces their accuracy. The effectiveness of the present invention is thus verified by the above simulation experiments.

Claims (1)

1. A behavior recognition method based on memory cell reinforcement-time sequence dynamic learning is characterized by comprising the following steps:
step one, computing the optical flow of a video frame I_a, wherein the optical flow of each pixel is represented by a two-dimensional vector (Δx, Δy) and stored as an optical flow map I_m; extracting the respective high-dimensional semantic features by using two independent convolutional neural networks:
x_a = CNN_a(I_a; w_a)    (1)
x_m = CNN_m(I_m; w_m)    (2)
wherein CNN_a and CNN_m respectively denote the appearance convolutional neural network and the motion convolutional neural network, which extract the high-dimensional features of the video frame I_a and the optical flow map I_m; x_a and x_m are 2048-dimensional vectors respectively representing the appearance and motion features extracted by the convolutional neural networks; w_a and w_m denote the internal trainable parameters of the two convolutional neural networks; and the high-dimensional feature extracted by a convolutional neural network is denoted x;
step two, initializing the memory unit M to be empty, denoted M_0; assuming that at the t-th video frame the memory unit M_t is not empty and contains N_t > 0 elements, denoted m_1, m_2, ..., m_{N_t}, the memory read operation at the corresponding time is as follows:
mh_t = (1/N_t) · Σ_{i=1..N_t} m_i    (3)
wherein the read-out mh_t represents the historical information of the video before time t;
step three, extracting the short-time context features of the video content by using a segmented recurrent neural network; taking the high-dimensional semantic feature x calculated in step one as input, the feature corresponding to the t-th video frame is denoted x_t; with the hidden states h_0, c_0 of the long short-term memory recurrent neural network (LSTM) initialized to zero, the short-time context feature at time t is calculated as follows:
(h̃_t, c̃_t) = LSTM(x_t, h_{t-1}, c_{t-1})    (4)
wherein LSTM(·) denotes the long short-term memory recurrent neural network, and h_{t-1}, c_{t-1} denote the hidden state of the recurrent neural network at the previous moment; the output h̃_t serves as the short-time context feature of the video content for subsequent calculation;
step four, for each video frame, inputting the high-dimensional semantic feature x_t, the memory unit history information mh_t and the short-time context feature h̃_t into the memory unit controller, and calculating a binary memory unit write command s_t ∈ {0, 1}, specifically as follows:
q_t = v^T · tanh(W_f·x_t + W_c·h̃_t + W_m·mh_t) + b_s    (5)
a_t = σ(q_t)    (6)
s_t = τ(a_t)    (7)
τ(a_t) = 1 if a_t > 0.5, and τ(a_t) = 0 otherwise    (8)
wherein v^T is a learnable row-vector parameter, W_f, W_c and W_m are learnable weight parameters, and b_s is a bias parameter; the sigmoid function σ(·) normalizes the linearly weighted result q_t to between 0 and 1, i.e. a_t ∈ (0, 1); a_t is fed into the threshold-limited binarization function τ(·) to obtain the binary memory unit write command s_t;
step five, updating the memory unit and the segmented recurrent neural network based on the binary memory unit write command s_t; for each video frame, the update strategy of the memory unit M_t is as follows:
M_t = M_{t-1} ∪ {m̃_t} if s_t = 1, and M_t = M_{t-1} if s_t = 0, where m̃_t = W_w·x_t    (9)
wherein W_w is a learnable weight matrix that converts the high-dimensional semantic feature x_t into the memory unit element m̃_t, and m̃_t is written into the memory unit M_{t-1} to form the new memory unit M_t; in addition, the hidden state h_t, c_t of the segmented recurrent neural network is updated as follows:
(h_t, c_t) = (0, 0) if s_t = 1, and (h_t, c_t) = (h̃_t, c̃_t) if s_t = 0    (10)
wherein h̃_t, c̃_t are the results calculated by formula (4);
step six, performing behavior classification using the memory unit; assuming the total video length is T, the memory unit at the end of processing the entire video is M_T, which contains N_T elements; the feature representation f of the entire video is then:
f = (1/N_T) · Σ_{i=1..N_T} m_i    (11)
wherein f is a D-dimensional vector representing the behavior-category information in the video; the feature is input into a fully connected classification layer to obtain the behavior classification score y, as follows:
y=softmax(W·f) (12)
wherein W ∈ R^{C×D}, and C represents the total number of recognizable behavior categories; the computed y gives the classification score of the system for each category, and a higher score means the behavior more probably belongs to that category; let y_a and y_m respectively denote the scores obtained by the appearance and motion neural networks; the final score y_f is then:
y_f = y_a + y_m    (13)
wherein y_f represents the final person behavior recognition result.
CN201811569882.8A 2018-12-21 2018-12-21 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning Active CN109753897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811569882.8A CN109753897B (en) 2018-12-21 2018-12-21 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811569882.8A CN109753897B (en) 2018-12-21 2018-12-21 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning

Publications (2)

Publication Number Publication Date
CN109753897A CN109753897A (en) 2019-05-14
CN109753897B true CN109753897B (en) 2022-05-27

Family

ID=66403877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811569882.8A Active CN109753897B (en) 2018-12-21 2018-12-21 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning

Country Status (1)

Country Link
CN (1) CN109753897B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135345A (en) * 2019-05-15 2019-08-16 武汉纵横智慧城市股份有限公司 Activity recognition method, apparatus, equipment and storage medium based on deep learning
CN110348567B (en) * 2019-07-15 2022-10-25 北京大学深圳研究生院 Memory network method based on automatic addressing and recursive information integration
CN110852273B (en) * 2019-11-12 2023-05-16 重庆大学 Behavior recognition method based on reinforcement learning attention mechanism
CN111401149B (en) * 2020-02-27 2022-05-13 西北工业大学 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm
CN111639548A (en) * 2020-05-11 2020-09-08 华南理工大学 Door-based video context multi-modal perceptual feature optimization method
CN112926453B (en) * 2021-02-26 2022-08-05 电子科技大学 Examination room cheating behavior analysis method based on motion feature enhancement and long-term time sequence modeling
CN112633260B (en) * 2021-03-08 2021-06-22 北京世纪好未来教育科技有限公司 Video motion classification method and device, readable storage medium and equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407889A (en) * 2016-08-26 2017-02-15 上海交通大学 Video human body interaction motion identification method based on optical flow graph depth learning model
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 A kind of video classification methods based on space-time notice
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
CN108805080A (en) * 2018-06-12 2018-11-13 上海交通大学 Multi-level depth Recursive Networks group behavior recognition methods based on context

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10242266B2 (en) * 2016-03-02 2019-03-26 Mitsubishi Electric Research Laboratories, Inc. Method and system for detecting actions in videos

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN106407889A (en) * 2016-08-26 2017-02-15 上海交通大学 Video human body interaction motion identification method based on optical flow graph depth learning model
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 A kind of video classification methods based on space-time notice
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
CN108805080A (en) * 2018-06-12 2018-11-13 上海交通大学 Multi-level depth Recursive Networks group behavior recognition methods based on context

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Bidirectional Multirate Reconstruction for Temporal Modeling in Videos; Linchao Zhu; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-11-09; pp. 1339-1348 *
Lattice Long Short-Term Memory for Human Action Recognition; Lin Sun et al.; 2017 IEEE International Conference on Computer Vision; 2017-12-25; pp. 2166-2175 *
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition; Limin Wang et al.; arXiv; 2016-08-02; pp. 1-16 *
Human behavior recognition fusing dual spatio-temporal network streams and an attention mechanism; 谯庆伟; China Master's Theses Full-text Database, Information Science and Technology; 2018-02-15; Vol. 2018, No. 2; pp. I138-2110 *

Also Published As

Publication number Publication date
CN109753897A (en) 2019-05-14

Similar Documents

Publication Publication Date Title
CN109753897B (en) Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning
US20210012198A1 (en) Method for training deep neural network and apparatus
CN112560432B (en) Text emotion analysis method based on graph attention network
CN108734210B (en) Object detection method based on cross-modal multi-scale feature fusion
CN110046671A (en) A kind of file classification method based on capsule network
CN112507898A (en) Multi-modal dynamic gesture recognition method based on lightweight 3D residual error network and TCN
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN111898432B (en) Pedestrian detection system and method based on improved YOLOv3 algorithm
CN111476315A (en) Image multi-label identification method based on statistical correlation and graph convolution technology
CN111291556A (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN109508686B (en) Human behavior recognition method based on hierarchical feature subspace learning
CN114821271B (en) Model training method, image description generation device and storage medium
CN111881731A (en) Behavior recognition method, system, device and medium based on human skeleton
CN103065158A (en) Action identification method of independent subspace analysis (ISA) model based on relative gradient
CN112561064A (en) Knowledge base completion method based on OWKBC model
CN114444600A (en) Small sample image classification method based on memory enhanced prototype network
CN113868448A (en) Fine-grained scene level sketch-based image retrieval method and system
CN114299362A (en) Small sample image classification method based on k-means clustering
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
CN112183464A (en) Video pedestrian identification method based on deep neural network and graph convolution network
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
Saqib et al. Intelligent dynamic gesture recognition using CNN empowered by edit distance
Yan Computational Methods for Deep Learning: Theory, Algorithms, and Implementations
CN112668543B (en) Isolated word sign language recognition method based on hand model perception
CN111985333A (en) Behavior detection method based on graph structure information interaction enhancement and electronic device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant