CN109753897A - Behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning - Google Patents

Behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning (Download PDF)

Info

Publication number
CN109753897A
CN109753897A (application number CN201811569882.8A)
Authority
CN
China
Prior art keywords
memory unit
video
feature
follows
recurrent neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811569882.8A
Other languages
Chinese (zh)
Other versions
CN109753897B (en)
Inventor
袁媛
王琦
王栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN201811569882.8A priority Critical patent/CN109753897B/en
Publication of CN109753897A publication Critical patent/CN109753897A/en
Application granted granted Critical
Publication of CN109753897B publication Critical patent/CN109753897B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

The invention discloses a behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning, addressing the poor practicality of existing behavior recognition methods. The technical solution models the long-term temporal structure of a video sequence with a recurrent neural network fused with a memory unit: a discretized memory-unit read/write controller classifies each video frame as either a relevant frame or a noise frame, writes the information of relevant frames into the memory unit, and ignores noise frames. The method filters out the large amount of noise in untrimmed video, and the memory-augmented recurrent network establishes long-span temporal connections. Through data-driven training it models the long-term temporal structure of complex human behavior, overcoming the difficulty prior methods have with long, untrimmed video whose motion patterns are complex and whose background changes frequently. It improves the robustness of human behavior recognition and reaches an average recognition accuracy of 94.8%.

Description

Behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning
Technical field
The present invention relates to a behavior recognition method, and in particular to a behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning.
Background technique
The document "L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition. In Proceedings of the European Conference on Computer Vision, pp. 20-36, 2016." discloses a human behavior recognition method based on two-stream convolutional neural networks and a temporal segment network. The method solves the behavior recognition task with two independent convolutional neural networks: a spatial stream extracts appearance features of the target from video frames, while a temporal stream extracts motion features of the target from the corresponding optical-flow fields; the outputs of the two networks are fused to produce the recognition result. The method further proposes a temporal segment network to model the long-term temporal structure of the video sequence. Through a sparse temporal sampling strategy and video-level supervised learning, the whole network can be trained efficiently and effectively, and it achieves good results on large public datasets. However, the method models the temporal structure of video only coarsely, so the network tends to ignore temporal correlations between features during learning. When the video sequence is long and untrimmed, irrelevant noise is incorporated into the final recognition result, reducing the accuracy of human behavior recognition, and the added noise also makes training the whole network difficult.
Summary of the invention
To overcome the poor practicality of existing behavior recognition methods, the present invention provides a behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning. The method models the long-term temporal structure of a video sequence with a recurrent neural network fused with a memory unit: a discretized memory-unit read/write controller classifies each video frame as either a relevant frame or a noise frame, writes the information of relevant frames into the memory unit, and ignores noise frames. This filters out the large amount of noise in untrimmed video and improves the accuracy of subsequent behavior recognition. In addition, the memory-augmented recurrent network establishes long-span temporal connections; through data-driven training it models the long-term temporal structure of complex human behavior, overcoming the difficulty existing behavior recognition methods have with long, untrimmed video whose motion patterns are complex and whose background changes frequently. It improves the robustness of human behavior recognition and reaches average recognition accuracies of 94.8% and 71.8%.
The technical solution adopted by the present invention to solve the technical problem is a behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning, characterized by comprising the following steps:
Step 1: Compute the optical flow of video frame I_a, where the flow at each pixel is a two-dimensional vector (Δx, Δy), and save it as an optical-flow map I_m. Extract the corresponding high-dimensional semantic features with two independent convolutional neural networks:

x_a = CNN_a(I_a; w_a)   (1)
x_m = CNN_m(I_m; w_m)   (2)

where CNN_a and CNN_m respectively denote the appearance and motion convolutional neural networks used to extract high-dimensional features from the video frame I_a and the optical-flow map I_m; x_a and x_m are 2048-dimensional vectors representing the extracted appearance and motion features; and w_a, w_m denote the trainable internal parameters of the two networks. In the following, x denotes a high-dimensional feature extracted by a convolutional neural network.
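The two-stream extraction in Step 1 can be sketched as follows. The actual CNN_a and CNN_m are deep convolutional networks producing 2048-dimensional features; as an assumption for illustration only, random linear maps followed by a tanh stand in for them here, and the tiny frame size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn_appearance(frame, w):
    # Placeholder for CNN_a: maps an RGB frame to a 2048-d appearance feature.
    return np.tanh(frame.reshape(-1) @ w)

def cnn_motion(flow, w):
    # Placeholder for CNN_m: maps an optical-flow map to a 2048-d motion feature.
    return np.tanh(flow.reshape(-1) @ w)

H, W, D = 8, 8, 2048                         # tiny frame size for illustration
frame = rng.standard_normal((H, W, 3))       # video frame I_a
flow = rng.standard_normal((H, W, 2))        # optical-flow map I_m: (dx, dy) per pixel
w_a = rng.standard_normal((H * W * 3, D)) * 0.01
w_m = rng.standard_normal((H * W * 2, D)) * 0.01

x_a = cnn_appearance(frame, w_a)  # appearance feature, eq. (1)
x_m = cnn_motion(flow, w_m)       # motion feature, eq. (2)
print(x_a.shape, x_m.shape)       # both are 2048-d vectors
```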
Step 2: Initialize the memory unit M as empty, denoted M_0. Suppose that at video frame t the memory unit M_t is non-empty and contains N_t > 0 elements, denoted m_1, m_2, ..., m_{N_t}. The memory read operation at the corresponding moment is then as follows:

where the read-out mh_t represents the historical information of the video before time t.
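The read formula itself (eq. 3) is not reproduced in this text; a minimal sketch, assuming the read-out mh_t is simply the mean of the stored elements, looks like:

```python
import numpy as np

def memory_read(M):
    """Read mh_t from the memory unit M_t = [m_1, ..., m_N].

    The patent's eq. (3) is not reproduced here; the mean of the stored
    elements is assumed as a simple stand-in for the read operation.
    """
    if len(M) == 0:
        return None  # M_0 is empty: no history yet
    return np.mean(np.stack(M), axis=0)

M = [np.ones(4), 3 * np.ones(4)]  # two stored memory elements
mh = memory_read(M)
print(mh)  # [2. 2. 2. 2.]
```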
Step 3: Extract the short-term contextual features of the video content with a segmented recurrent neural network. Taking the high-dimensional semantic feature x from Step 1 as input, denote the feature at video frame t as x_t. Initialize the hidden states h_0, c_0 of the long short-term memory network (LSTM) to zero; the short-term contextual feature at time t is then computed as follows:

where LSTM(·) denotes the long short-term memory network, and h_{t-1}, c_{t-1} denote the hidden states of the recurrent network at the previous moment. The result serves as the short-term contextual feature of the video content for subsequent computation.
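Step 3's short-term context comes from a standard LSTM recurrence; a minimal NumPy implementation of one step is sketched below. The weights are random placeholders, not trained parameters, and the gate layout follows the conventional LSTM cell rather than any formula reproduced in this text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell: (x_t, h_{t-1}, c_{t-1}) -> (h_t, c_t)."""
    z = W @ np.concatenate([x, h_prev]) + b
    d = h_prev.size
    i = sigmoid(z[:d])          # input gate
    f = sigmoid(z[d:2 * d])     # forget gate
    o = sigmoid(z[2 * d:3 * d])  # output gate
    g = np.tanh(z[3 * d:])      # candidate cell state
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, Hd = 8, 4
x_t = rng.standard_normal(D)
h0, c0 = np.zeros(Hd), np.zeros(Hd)          # hidden states initialised to zero
W = rng.standard_normal((4 * Hd, D + Hd)) * 0.1
b = np.zeros(4 * Hd)
h1, c1 = lstm_step(x_t, h0, c0, W, b)        # short-term context at t = 1
print(h1.shape)  # (4,)
```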
Step 4: For each video frame, input the high-dimensional semantic feature x_t, the memory-unit history mh_t, and the short-term contextual feature computed in Steps 1-3 into the memory-unit controller, and compute the binarized memory write instruction s_t ∈ {0, 1} as follows:

a_t = σ(q_t)   (6)
s_t = τ(a_t)   (7)

where v^T is a learnable row-vector parameter, W_f, W_c, W_m are learnable weight parameters, and b_s is a bias parameter. The sigmoid function σ(·) normalizes the linearly weighted score q_t into (0, 1), i.e. a_t ∈ (0, 1); a_t is then fed to the thresholded binarization function τ(·) to obtain the binarized memory write instruction s_t.
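Equation (5) defining q_t is not reproduced in this text, so the sketch below assumes a simple linear score over the three controller inputs using the named parameters v, W_f, W_c, W_m, b_s; the sigmoid (eq. 6) and threshold binarization (eq. 7) follow the text directly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def write_gate(x_t, h_ctx, mh_t, v, W_f, W_c, W_m, b_s, thresh=0.5):
    """Discretized write controller: a_t = sigma(q_t), s_t = tau(a_t).

    q_t is assumed to be a linear score over the frame feature x_t, the
    short-term context h_ctx, and the memory history mh_t (eq. 5 is not
    reproduced in the source text).
    """
    q_t = v @ (W_f @ x_t + W_c @ h_ctx + W_m @ mh_t) + b_s
    a_t = sigmoid(q_t)               # eq. (6): normalise to (0, 1)
    s_t = 1 if a_t > thresh else 0   # eq. (7): threshold binarisation
    return a_t, s_t

rng = np.random.default_rng(0)
D, Hd = 6, 4
v = rng.standard_normal(Hd)
W_f = rng.standard_normal((Hd, D))
W_c = rng.standard_normal((Hd, Hd))
W_m = rng.standard_normal((Hd, D))
a, s = write_gate(rng.standard_normal(D), rng.standard_normal(Hd),
                  rng.standard_normal(D), v, W_f, W_c, W_m, 0.0)
print(s in (0, 1), 0.0 < a < 1.0)
```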
Step 5: Based on the binarized memory write instruction s_t, update the memory unit and the segmented recurrent neural network. For each video frame, the update strategy of the memory unit M_t is as follows:

where W_w is a learnable weight matrix that converts the high-dimensional semantic feature x_t into a memory element by multiplication; writing this element into M_{t-1} forms the new memory unit M_t. In addition, the hidden states h_t, c_t of the segmented recurrent neural network are updated as follows:

where the input is the result computed by formula (4).
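The write policy of Step 5 can be sketched directly: the frame feature is projected by W_w and appended only when the controller marked the frame as relevant (s_t = 1); noise frames (s_t = 0) leave the memory untouched. W_w here is a random placeholder.

```python
import numpy as np

def memory_update(M, x_t, W_w, s_t):
    """If s_t == 1 (relevant frame), project x_t into the memory space with
    W_w and append it to M; if s_t == 0 (noise frame), M is left unchanged."""
    if s_t == 1:
        M = M + [W_w @ x_t]  # write W_w x_t into the memory unit
    return M

rng = np.random.default_rng(0)
W_w = rng.standard_normal((4, 6))  # learnable projection (random placeholder)
M = []                             # M_0 is empty
M = memory_update(M, rng.standard_normal(6), W_w, 1)  # relevant frame: written
M = memory_update(M, rng.standard_normal(6), W_w, 0)  # noise frame: ignored
print(len(M))  # 1
```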
Step 6: Perform behavior classification with the memory unit. Suppose the video length is T, and the memory unit after the whole video has been processed is M_T, containing N_T elements. The feature representation f of the whole video is then as follows:

where f is a D-dimensional vector representing the behavior-class information of the video. This feature is fed to a fully connected classification layer to obtain the behavior class scores y:

y = softmax(Wf)   (12)

where W ∈ R^{C×D} and C is the number of recognizable behavior classes. The computed y gives the system's score for each class; a higher score indicates that the behavior more likely belongs to that class. Let y_a and y_m denote the scores obtained from the appearance and motion networks, respectively; the final score y_f is then

y_f = y_a + y_m   (13)

where y_f gives the final human behavior recognition result.
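The classification step can be sketched end to end. The pooling of M_T into f (eq. 11) is not reproduced in this text, so a mean over the stored elements is assumed; the softmax scoring (eq. 12) and two-stream late fusion (eq. 13) follow the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def classify(M, W):
    """Pool the final memory M_T into a video feature f (mean over the N_T
    stored elements is assumed; eq. 11 is not reproduced) and score it."""
    f = np.mean(np.stack(M), axis=0)  # D-dimensional video feature
    return softmax(W @ f)             # eq. (12): class scores y

rng = np.random.default_rng(0)
D, C = 8, 5
M = [rng.standard_normal(D) for _ in range(3)]  # final memory M_T, N_T = 3
W = rng.standard_normal((C, D))                 # classification layer, W in R^{C x D}
y_a = classify(M, W)                            # appearance-stream scores
y_m = classify([m + 0.1 for m in M], W)         # motion-stream scores (toy inputs)
y_f = y_a + y_m                                 # eq. (13): late fusion
print(int(np.argmax(y_f)))                      # predicted behaviour class
```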
The beneficial effects of the present invention are as follows. The method models the long-term temporal structure of a video sequence with a recurrent neural network fused with a memory unit: a discretized memory-unit read/write controller classifies each video frame as either a relevant frame or a noise frame, writes the information of relevant frames into the memory unit, and ignores noise frames. This filters out the large amount of noise in untrimmed video and improves the accuracy of subsequent behavior recognition. In addition, the memory-augmented recurrent network establishes long-span temporal connections; through data-driven training it models the long-term temporal structure of complex human behavior, overcoming the difficulty existing behavior recognition methods have with long, untrimmed video whose motion patterns are complex and whose background changes frequently. It improves the robustness of human behavior recognition and reaches average recognition accuracies of 94.8% and 71.8%.
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
Detailed description of the invention
Fig. 1 is a flow chart of the behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning according to the present invention.
Specific embodiment
Referring to Fig. 1, the specific steps of the behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning are as follows:
Step 1: Extract high-dimensional appearance and motion features containing semantic information. First, compute the optical flow of video frame I_a, where the flow at each pixel is a two-dimensional vector (Δx, Δy), and save it as an optical-flow map I_m. Then extract the corresponding high-dimensional semantic features with two independent convolutional neural networks:

x_a = CNN_a(I_a; w_a)   (1)
x_m = CNN_m(I_m; w_m)   (2)

where CNN_a and CNN_m respectively denote the appearance and motion convolutional neural networks used to extract high-dimensional features from the video frame I_a and the optical-flow map I_m; x_a and x_m are 2048-dimensional vectors representing the extracted appearance and motion features; and w_a, w_m denote the trainable internal parameters of the two networks. Since the subsequent operations of the appearance network and the motion network are identical, for notational clarity x denotes a high-dimensional feature extracted by a convolutional neural network.
Step 2: Initialize the memory unit M as empty, denoted M_0. Suppose that at video frame t the memory unit M_t is non-empty and contains N_t > 0 elements. The memory read operation at the corresponding moment is then as follows:

where the read-out mh_t represents the historical information of the video before time t; this historical information influences the analysis and understanding of the video content at the current moment.
Step 3: Extract the short-term contextual features of the video content with a segmented recurrent neural network. Taking the high-dimensional semantic feature x from Step 1 as input, denote the feature at video frame t as x_t. First, initialize the hidden states h_0, c_0 of the long short-term memory network (LSTM) to zero; the short-term contextual feature at time t is then computed as follows:

where LSTM(·) denotes the long short-term memory network, and h_{t-1}, c_{t-1} denote the hidden states of the recurrent network at the previous moment. The result serves as the short-term contextual feature of the video content for subsequent computation.
Step 4: Discretized memory-unit write controller. For each video frame, input the high-dimensional semantic feature x_t, the memory-unit history mh_t, and the short-term contextual feature computed in Steps 1-3 into the memory-unit controller, and compute the binarized memory write instruction s_t ∈ {0, 1} as follows:

a_t = σ(q_t)   (6)
s_t = τ(a_t)   (7)

where v^T is a learnable row-vector parameter, W_f, W_c, W_m are learnable weight parameters, and b_s is a bias parameter. As the formulas show, the sigmoid function σ(·) normalizes the linearly weighted score q_t into (0, 1), i.e. a_t ∈ (0, 1). Then a_t is fed to the thresholded binarization function τ(·) to obtain the binarized memory write instruction s_t.
Step 5: Based on the binarized memory write instruction s_t, update the memory unit and the segmented recurrent neural network. For each video frame, the update strategy of the memory unit M_t is as follows:

where W_w is a learnable weight matrix that converts the high-dimensional semantic feature x_t into a memory element by multiplication; writing this element into M_{t-1} forms the new memory unit M_t. In addition, the hidden states h_t, c_t of the segmented recurrent neural network are updated as follows:

where the input is the result computed by formula (4).
Step 6: Perform behavior classification with the memory unit. Suppose the video length is T, and the memory unit after the whole video has been processed is M_T, containing N_T elements. The feature representation f of the whole video is then as follows:

where f is a D-dimensional vector representing the behavior-class information of the video. This feature is then fed to a fully connected classification layer to obtain the behavior class scores y:

y = softmax(Wf)   (12)

where W ∈ R^{C×D} and C is the number of recognizable behavior classes. The computed y gives the system's score for each class; a higher score indicates that the behavior more likely belongs to that class. Let y_a and y_m denote the scores obtained from the appearance and motion networks, respectively; the final score y_f is then

y_f = y_a + y_m   (13)

where y_f gives the final human behavior recognition result.
The effect of the present invention is further illustrated by the following simulation experiments.
1. Simulation conditions.
The simulation was carried out with the PyTorch software on a machine with an Intel Xeon E5-2697A 2.6 GHz CPU, an NVIDIA K80 GPU, 16 GB of memory, and the CentOS 7 operating system.
The data used in the simulation come from two public benchmark datasets, UCF101 and HMDB51, in which camera motion varies greatly and the backgrounds are complex. The experimental data comprise 13320/6766 videos in total, divided into 101/51 behavior classes, respectively. Most of the videos in the HMDB51 dataset are untrimmed and contain more noise.
2. Simulation content.
To demonstrate the effectiveness of the invention, comparative experiments were carried out on the proposed memory-unit reinforcement and temporal-dynamics learning method. Specifically, as comparison algorithms the experiments selected the two-stream network architecture with the highest accuracy (TSN) and the lattice long short-term memory method (Lattice-LSTM) proposed by L. Sun et al. in "L. Sun, K. Jia, K. Chen, D. Yeung, B. Shi and S. Savarese. Lattice Long Short-Term Memory for Human Action Recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2166-2175, 2017." The three algorithms were run with the same parameters, and their average AUC values on the UCF101/HMDB51 datasets were computed. The comparison results are shown in Table 1.
Table 1
Method TSN Lattice-LSTM OUR
AUC(UCF101) 93.6% 94.0% 94.8%
AUC(HMDB51) 66.2% 68.5% 71.8%
As Table 1 shows, the recognition accuracy of the invention is significantly higher than that of existing behavior recognition methods. Specifically, the accuracy of TSN is lower than that of Lattice-LSTM and Ours, because TSN does not take the temporal variation pattern of the video content into account, whereas Lattice-LSTM and Ours model it with recurrent neural networks; this verifies the effectiveness of the proposed temporal-dynamics learning method based on recurrent neural networks. Moreover, on the HMDB51 dataset the algorithm Ours is substantially better than Lattice-LSTM, because the memory unit proposed by the invention effectively strengthens the recurrent network's ability to handle long, untrimmed video. To further verify the effectiveness of the memory unit in reinforcing recurrent neural networks, comparative experiments were carried out on the UCF101 dataset between various recurrent networks (LSTM, ALSTM and VideoLSTM) and the proposed algorithm; the results are shown in Table 2.
Table 2
Method LSTM ALSTM VideoLSTM Ours
AUC 88.3% 77.0% 89.2% 91.03%
As Table 2 shows, the fused result of the invention is more accurate than all the recurrent-network baselines, because the proposed memory-unit reinforcement effectively extracts the useful information in the video and thereby models its temporal variation pattern. By contrast, plain recurrent-network methods are susceptible to noise, which instead lowers their accuracy. The above simulation experiments therefore verify the effectiveness of the invention.

Claims (1)

1. A behavior recognition method based on memory-unit reinforcement and temporal-dynamics learning, characterized by comprising the following steps:
Step 1: compute the optical flow of video frame I_a, where the flow at each pixel is a two-dimensional vector (Δx, Δy), and save it as an optical-flow map I_m; extract the corresponding high-dimensional semantic features with two independent convolutional neural networks:
x_a = CNN_a(I_a; w_a)   (1)
x_m = CNN_m(I_m; w_m)   (2)
where CNN_a and CNN_m respectively denote the appearance and motion convolutional neural networks used to extract high-dimensional features from the video frame I_a and the optical-flow map I_m; x_a and x_m are 2048-dimensional vectors representing the extracted appearance and motion features; w_a, w_m denote the trainable internal parameters of the two networks; and x denotes a high-dimensional feature extracted by a convolutional neural network;
Step 2: initialize the memory unit M as empty, denoted M_0; suppose that at video frame t the memory unit M_t is non-empty and contains N_t > 0 elements, denoted m_1, m_2, ..., m_{N_t}; the memory read operation at the corresponding moment is then as follows:
where the read-out mh_t represents the historical information of the video before time t;
Step 3: extract the short-term contextual features of the video content with a segmented recurrent neural network; taking the high-dimensional semantic feature x from Step 1 as input, denote the feature at video frame t as x_t; initialize the hidden states h_0, c_0 of the long short-term memory network (LSTM) to zero; the short-term contextual feature at time t is then computed as follows:
where LSTM(·) denotes the long short-term memory network, and h_{t-1}, c_{t-1} denote the hidden states of the recurrent network at the previous moment; the result serves as the short-term contextual feature of the video content for subsequent computation;
Step 4: for each video frame, input the high-dimensional semantic feature x_t, the memory-unit history mh_t, and the short-term contextual feature computed in Steps 1-3 into the memory-unit controller, and compute the binarized memory write instruction s_t ∈ {0, 1} as follows:
a_t = σ(q_t)   (6)
s_t = τ(a_t)   (7)
where v^T is a learnable row-vector parameter, W_f, W_c, W_m are learnable weight parameters, and b_s is a bias parameter; the sigmoid function σ(·) normalizes the linearly weighted score q_t into (0, 1), i.e. a_t ∈ (0, 1); a_t is fed to the thresholded binarization function τ(·) to obtain the binarized memory write instruction s_t;
Step 5: based on the binarized memory write instruction s_t, update the memory unit and the segmented recurrent neural network; for each video frame, the update strategy of the memory unit M_t is as follows:
where W_w is a learnable weight matrix that converts the high-dimensional semantic feature x_t into a memory element by multiplication; writing this element into M_{t-1} forms the new memory unit M_t; in addition, the hidden states h_t, c_t of the segmented recurrent neural network are updated as follows:
where the input is the result computed by formula (4);
Step 6: perform behavior classification with the memory unit; suppose the video length is T, and the memory unit after the whole video has been processed is M_T, containing N_T elements; the feature representation f of the whole video is then as follows:
where f is a D-dimensional vector representing the behavior-class information of the video; this feature is fed to a fully connected classification layer to obtain the behavior class scores y:
y = softmax(Wf)   (12)
where W ∈ R^{C×D} and C is the number of recognizable behavior classes; the computed y gives the system's score for each class, a higher score indicating that the behavior more likely belongs to that class; let y_a and y_m denote the scores obtained from the appearance and motion networks, respectively; the final score y_f is then
y_f = y_a + y_m   (13)
where y_f gives the final human behavior recognition result.
CN201811569882.8A 2018-12-21 2018-12-21 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning Active CN109753897B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811569882.8A CN109753897B (en) 2018-12-21 2018-12-21 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811569882.8A CN109753897B (en) 2018-12-21 2018-12-21 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning

Publications (2)

Publication Number Publication Date
CN109753897A true CN109753897A (en) 2019-05-14
CN109753897B CN109753897B (en) 2022-05-27

Family

ID=66403877

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811569882.8A Active CN109753897B (en) 2018-12-21 2018-12-21 Behavior recognition method based on memory cell reinforcement-time sequence dynamic learning

Country Status (1)

Country Link
CN (1) CN109753897B (en)


Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407889A (en) * 2016-08-26 2017-02-15 上海交通大学 Video human body interaction motion identification method based on optical flow graph depth learning model
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN106934352A (en) * 2017-02-28 2017-07-07 华南理工大学 A kind of video presentation method based on two-way fractal net work and LSTM
US20170255832A1 (en) * 2016-03-02 2017-09-07 Mitsubishi Electric Research Laboratories, Inc. Method and System for Detecting Actions in Videos
CN107330362A (en) * 2017-05-25 2017-11-07 北京大学 A kind of video classification methods based on space-time notice
CN108681712A (en) * 2018-05-17 2018-10-19 北京工业大学 A kind of Basketball Match Context event recognition methods of fusion domain knowledge and multistage depth characteristic
CN108805080A (en) * 2018-06-12 2018-11-13 上海交通大学 Multi-level depth Recursive Networks group behavior recognition methods based on context


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIMIN WANG ET AL.: "Temporal Segment Networks: Towards Good Practices for Deep Action Recognition", 《ARXIV》 *
LIN SUN ET AL.: "Lattice Long Short-Term Memory for Human Action Recognition", 《2017 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION》 *
LINCHAO ZHU: "Bidirectional Multirate Reconstruction for Temporal Modeling in Videos", 《2017 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
QIAO Qingwei: "Human Action Recognition Fusing Dual Spatio-Temporal Network Streams and an Attention Mechanism", China Masters' Theses Full-text Database, Information Science and Technology Series *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135345A (en) * 2019-05-15 2019-08-16 武汉纵横智慧城市股份有限公司 Activity recognition method, apparatus, equipment and storage medium based on deep learning
CN110348567A (en) * 2019-07-15 2019-10-18 北京大学深圳研究生院 A kind of memory network method integrated based on automatic addressing and recurrence information
CN110348567B (en) * 2019-07-15 2022-10-25 北京大学深圳研究生院 Memory network method based on automatic addressing and recursive information integration
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
CN111401149A (en) * 2020-02-27 2020-07-10 西北工业大学 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm
CN111401149B (en) * 2020-02-27 2022-05-13 西北工业大学 Lightweight video behavior identification method based on long-short-term time domain modeling algorithm
CN111639548A (en) * 2020-05-11 2020-09-08 华南理工大学 Door-based video context multi-modal perceptual feature optimization method
CN112926453A (en) * 2021-02-26 2021-06-08 电子科技大学 Examination room cheating behavior analysis method based on motion feature enhancement and long-term time sequence modeling
CN112926453B (en) * 2021-02-26 2022-08-05 电子科技大学 Examination room cheating behavior analysis method based on motion feature enhancement and long-term time sequence modeling
CN112633260A (en) * 2021-03-08 2021-04-09 北京世纪好未来教育科技有限公司 Video motion classification method and device, readable storage medium and equipment
CN112633260B (en) * 2021-03-08 2021-06-22 北京世纪好未来教育科技有限公司 Video motion classification method and device, readable storage medium and equipment

Also Published As

Publication number Publication date
CN109753897B (en) 2022-05-27

Similar Documents

Publication Publication Date Title
CN109753897A (en) Based on memory unit reinforcing-time-series dynamics study Activity recognition method
CN107679526B (en) Human face micro-expression recognition method
Qi et al. StagNet: An attentive semantic RNN for group activity and individual action recognition
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
CN107506712B (en) Human behavior identification method based on 3D deep convolutional network
Hasani et al. Spatio-temporal facial expression recognition using convolutional neural networks and conditional random fields
Cui et al. Efficient human motion prediction using temporal convolutional generative adversarial network
CN106845499A (en) A kind of image object detection method semantic based on natural language
Hu et al. Learning activity patterns using fuzzy self-organizing neural network
CN110414498B (en) Natural scene text recognition method based on cross attention mechanism
Cui Applying gradient descent in convolutional neural networks
CN112784763B (en) Expression recognition method and system based on local and overall feature adaptive fusion
CN107480726A (en) A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon
CN111125358B (en) Text classification method based on hypergraph
CN108133188A (en) A kind of Activity recognition method based on motion history image and convolutional neural networks
CN107330362A (en) A kind of video classification methods based on space-time notice
CN108509839A (en) One kind being based on the efficient gestures detection recognition methods of region convolutional neural networks
CN110334589B (en) High-time-sequence 3D neural network action identification method based on hole convolution
CN106599941A (en) Method for identifying handwritten numbers based on convolutional neural network and support vector machine
CN110399821B (en) Customer satisfaction acquisition method based on facial expression recognition
CN110378208B (en) Behavior identification method based on deep residual error network
CN113469356A (en) Improved VGG16 network pig identity recognition method based on transfer learning
CN107657233A (en) Static sign language real-time identification method based on modified single multi-target detection device
CN112464865A (en) Facial expression recognition method based on pixel and geometric mixed features
CN110490136A (en) A kind of human body behavior prediction method of knowledge based distillation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant