CN106845351A

CN106845351A - A behavior recognition method for video based on bidirectional long short-term memory units

Info

Publication number: CN106845351A
Application number: CN201611193290.1A
Authority: CN
Inventors: 刘纯平; 葛瑞; 季怡; 刘海宾; 龚声蓉
Original assignee: Suzhou University
Current assignee: Suzhou University
Priority date: 2016-05-13
Filing date: 2016-12-21
Publication date: 2017-06-13

Abstract

The invention discloses a behavior recognition method for video based on bidirectional long short-term memory units, comprising: (1) inputting a video sequence and extracting the RGB frame sequence and the optical flow images from it; (2) separately training a deep convolutional network on the RGB images and a deep convolutional network on the optical flow images; (3) extracting multi-layer features from the networks, including at least the features of the third convolutional layer, the fifth convolutional layer, and the seventh fully connected layer, and applying sum pooling to the convolutional layer features; (4) training a recurrent neural network built from bidirectional long short-term memory units to obtain a probability matrix for each frame of the video; (5) averaging the probability matrices, fusing the probability matrices of the optical flow frames and the RGB frames, and taking the class with the highest probability as the final classification result, thereby achieving behavior recognition. The invention replaces traditional hand-crafted features with multi-layer deep learning features; deep features from different layers represent different information, and combining features from multiple layers can improve classification accuracy. By using bidirectional long short-term memory to capture temporal information, the method obtains more structural information and improves behavior recognition performance.

Description

A behavior recognition method for video based on bidirectional long short-term memory units

Technical field

The present invention relates to a video processing method, and in particular to a method for automatically recognizing the behavior of persons in video.

Background art

Behavior recognition refers to analyzing the behavior of a target by extracting feature information from a video or image sequence, in order to identify the behavior patterns of persons in the video.

Behavior recognition is an important and difficult problem in computer vision and pattern recognition. It has a wide range of applications, such as intelligent surveillance, human-computer interaction, virtual reality, and intelligent security. With rapid economic development, people pay increasing attention to safety, and more and more places are equipped with video surveillance cameras that produce large volumes of surveillance video every day. At present this video is usually monitored manually, which requires substantial manpower, material, and financial resources. In addition, attention declines during prolonged monitoring, so some emergencies are not detected in time, which delays rescue and similar work and causes heavy losses. If video could be monitored automatically by computer, these costs would be reduced effectively and timely warnings could be obtained so that measures can be taken.

The usual approach to behavior recognition is to first extract features from video frames and then classify them with a classification model. Extracting an effective feature representation from video is critical, because it directly affects the performance of the whole system.

Traditional behavior recognition methods use various hand-designed features. Deep learning methods instead use features learned by a neural network. Because deep networks are nonlinear, the learned features can capture more of the underlying information and can greatly improve the generalization ability of the feature representation.

Traditional behavior feature extraction methods mainly include:

1) Methods based on low-level features, such as moving-object detection and tracking. These methods analyze low-level image features and are relatively simple and intuitive. Usable low-level features mainly include the speed and direction of motion of the foreground target, optical flow, the target's shape and contour, and its motion trajectory. Such features ignore static, irrelevant information and focus on the moving target, which reduces interference from irrelevant information in the video frame. The features are also relatively easy to extract, but there are serious problems. For example, they depend heavily on the target tracking algorithm: if tracking is poor, the final result suffers greatly. Real-world video also often contains various kinds of interference, such as cluttered backgrounds and other moving targets. As a result, these methods perform poorly on everyday video, and the features are not robust enough to be applied in real scenes.

2) Methods based on spatio-temporal feature descriptors. These features describe behavior patterns with hand-crafted descriptors. Spatio-temporal interest point methods, for example, describe human actions using a set of unconnected points. Other examples include dense trajectories (Dense Trajectories), improved trajectories (Improved Trajectories), the scale-invariant feature transform (SIFT), histograms of oriented gradients (HOG), and histograms of optical flow (HOF). These hand-designed features are more robust than simple motion features, but local descriptors are computationally expensive and easily disturbed by noise. These descriptors currently achieve good results, but there is little room left for improvement, so more powerful and effective features are needed.

3) Features based on mid-level semantic understanding. These methods usually use a unified human body model that divides the body into parts such as the head, shoulders, arms, and legs. This representation achieves higher accuracy and is relatively robust, but building the model is relatively complex and requires a great deal of work. To improve accuracy, many papers use RGB-D depth maps for detection.

The three kinds above are the main features used in traditional behavior recognition. To improve behavior recognition performance, Lin Xianming effectively fused multiple features and achieved some success. Lai et al. treated each frame as an instance and used multiple-instance learning, so that the system can infer frame labels and video labels at the same time. Haoi et al. used a structured SVM to model the relations between video frames and to predict the whole event from a partially observed event.

Traditional learning algorithms have achieved good results to some extent, but with the emergence of deep learning, performance in many fields has improved greatly. Deep learning imitates the working mechanism of the brain. It uses a multi-layer network structure and is a nonlinear model with strong data-fitting and learning ability. By abstracting visual objects, it can learn internal information from data without supervision. Features obtained by deep learning resemble human perception and usually contain some semantic information, which makes them better suited to analyzing and recognizing visual objects.

Convolutional neural networks have achieved great success in still-image classification. Because of the complex structure of video and noise interference, however, behavior recognition has received less attention. One option is to treat each video frame as a still image and average the results over all frames to obtain the class of the video. Such a method relies only on static frames and loses too much temporal information, which degrades performance.

Behavior recognition faces two main problems.

1. The gap between extracted features and video semantics. Hand-designed features achieved good results in the past, but hand-crafted feature representations have now reached a bottleneck and large further gains are difficult, particularly for a problem as complex as video. Background clutter, frame-rate changes, illumination changes, viewpoint changes, and camera motion all seriously affect system performance, so a better feature description is needed.

2. The other problem worth attention is how to model the temporal information of video. Spatial features alone cannot describe a video accurately, so temporal information must be added. Different actions unfold as different sequences, and adding time-domain information effectively improves recognition.

Summary of the invention

The object of the invention is to provide a behavior recognition method for video based on bidirectional long short-term memory units. The method replaces traditional hand-crafted features with multi-layer deep learning features to improve classification accuracy, and uses bidirectional long short-term memory (Bi-directional Long Short-Term Memory) to capture temporal information and obtain more time-domain structural information, thereby improving behavior recognition performance.

To achieve the above object, the technical solution adopted by the invention is a behavior recognition method for video based on bidirectional long short-term memory units, comprising the following steps:

(1) inputting a video sequence and extracting the RGB frame sequence and the optical flow images from the video sequence;

(2) training deep convolutional networks: an RGB image deep convolutional network and an optical flow image deep convolutional network are trained separately;

(3) extracting multi-layer features from the networks, including at least the features of the third convolutional layer, the fifth convolutional layer, and the seventh fully connected layer; a fixed-size vector is obtained from the fully connected layer; sum pooling is applied to the convolutional layer features to add temporal information;

(4) using the feature vectors of each layer obtained from the convolutional neural networks, training a recurrent neural network built from bidirectional long short-term memory units; the test set is then input, and Softmax is used to obtain a probability matrix for each optical flow frame and a probability matrix for each RGB frame of the video;

(5) averaging the probability matrices of all optical flow frames in the video and, separately, the probability matrices of all RGB frames in the video; finally fusing the probability matrices of the optical flow frames and the RGB frames and taking the class with the highest probability as the final classification result, thereby achieving behavior recognition.
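Step (5) can be illustrated with a minimal sketch. The text states only that the averaged probability matrices of the two streams are fused, so the weighted average used here, the `flow_weight` parameter, and the function name `fuse_two_streams` are assumptions made for illustration, not the method as claimed.

```python
import numpy as np

def fuse_two_streams(rgb_probs, flow_probs, flow_weight=0.5):
    """Average per-frame class probabilities per stream, fuse, pick a class.

    rgb_probs, flow_probs: arrays of shape (num_frames, num_classes),
    for example per-frame Softmax outputs of each stream.
    flow_weight is a hypothetical fusion weight (assumption).
    """
    rgb_mean = np.asarray(rgb_probs, dtype=float).mean(axis=0)    # average over RGB frames
    flow_mean = np.asarray(flow_probs, dtype=float).mean(axis=0)  # average over flow frames
    fused = (1.0 - flow_weight) * rgb_mean + flow_weight * flow_mean
    return int(np.argmax(fused)), fused

# Two frames and three classes per stream (made-up numbers).
rgb = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
flow = [[0.2, 0.7, 0.1], [0.3, 0.6, 0.1]]
label, fused = fuse_two_streams(rgb, flow)
```

With equal weights, the class that is strongest on average across both streams is selected, even if one stream alone would prefer another class.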

In a preferred technical solution, in step (2), the third convolutional layer, the fifth convolutional layer, and the seventh fully connected layer are selected as the feature representation.

The sum pooling is as follows:

Given the feature map of layer m at time t, pooling is performed with the following formula:

[formula missing from the source text]

where j indexes the pooling window and N is the temporal pooling range. The dimension of the final descriptor is determined by the height of the feature map, the width of the feature map, the number of feature map channels C, and the total number of video frames T.

In a further technical solution, after the sum pooling and before step (4), spatial pyramid max pooling with a two-level pyramid is applied to obtain a fixed-length feature vector.

Preferably, N is 15.
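The spatial pyramid max pooling step mentioned above can be sketched as follows. The text specifies only a two-level pyramid, so the grid sizes (1 × 1 and 2 × 2), the function name, and the bin-edge rule are assumptions for illustration.

```python
import numpy as np

def spatial_pyramid_max_pool(feature_map, levels=(1, 2)):
    """Max-pool an (H, W, C) feature map over an n x n grid for each n in levels.

    Returns a vector of length C * sum(n * n for n in levels), which does
    not depend on H or W. Grid sizes are an assumption; H and W must be at
    least max(levels).
    """
    feature_map = np.asarray(feature_map, dtype=float)
    h, w, _ = feature_map.shape
    pooled = []
    for n in levels:
        # Bin edges cover the whole map even when H or W is not divisible by n.
        rows = np.linspace(0, h, n + 1).astype(int)
        cols = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[rows[i]:rows[i + 1], cols[j]:cols[j + 1]]
                pooled.append(cell.max(axis=(0, 1)))  # one value per channel
    return np.concatenate(pooled)

# Feature maps of different spatial sizes give vectors of the same length.
vec_a = spatial_pyramid_max_pool(np.random.default_rng(0).random((13, 13, 4)))
vec_b = spatial_pyramid_max_pool(np.random.default_rng(1).random((7, 9, 4)))
```

Under these assumptions each channel contributes 1 + 4 = 5 values, so both calls return a vector of length 20.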

By adopting the above technical solution, the invention has the following advantages over the prior art:

1. The invention replaces traditional hand-crafted features with multi-layer deep learning features. Deep features from different layers represent different information, and combining features from multiple layers can improve classification accuracy.

2. The invention uses bidirectional long short-term memory (Bi-directional Long Short-Term Memory) to capture temporal information. Bidirectional long short-term memory is commonly used for sequence problems. A unidirectional long short-term memory unit can model time in only one direction, but the current frame sometimes needs to refer not only to the preceding frame but also to the following frame. With bidirectional long short-term memory units, both directions can be modeled, which safeguards the accuracy of behavior recognition.

Brief description of the drawings

Fig. 1 is a framework diagram of the method of the embodiment of the invention;

Fig. 2 is a visualization of the feature maps of different layers for the same input image;

Fig. 3 is a schematic diagram of the LSTM architecture;

Fig. 4 is a schematic diagram of the bidirectional LSTM architecture in the embodiment of the invention;

Fig. 5 is a schematic diagram of the UCF101 dataset;

Fig. 6 is a schematic visualization of different layers.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and embodiments:

Embodiment one: referring to Fig. 1, a behavior recognition method for video based on bidirectional long short-term memory units performs behavior recognition by combining deep learning features with bidirectional long short-term memory units. To obtain a strong feature representation, multi-layer deep learning features replace traditional hand-designed features, improving behavior recognition performance. To fully exploit temporal information, bidirectional long short-term memory units (Bi-LSTM) are used for modeling. They capture changes in the time series in both directions and provide better information than unidirectional long short-term memory units.

1. Convolutional neural networks and their multi-layer features

To extract expressive features, a deep convolutional network must be trained. This embodiment builds the system with the simple and effective Caffe framework. The model is pre-trained on the ImageNet dataset, which contains a large number of images and helps ensure the generalization ability of the model. The pre-trained model is then transferred and fine-tuned on the target dataset. A network architecture similar to the two-stream architecture is used. The RGB frames and optical flow of each video sequence are extracted in advance and saved to local disk. This embodiment uses frames of size 224 × 224 as input.

Different layers contain different information. The first few layers of the network carry more low-level features, such as edge information. The last few layers are more abstract and encode more of the semantic information in the video. Fig. 2 visualizes the feature maps of different layers for the same input image. The figure shows that different convolutional layers contain different information and respond differently to the same image. Based on experiments, the third convolutional layer (conv3), the fifth convolutional layer (conv5), and the seventh fully connected layer (fc7) were finally selected as the feature representation. The first convolutional layer (conv1) and the second convolutional layer (conv2) generalize too poorly and contribute little to the final recognition result.

The parameters of each layer of the convolutional neural network are as follows:

[table missing from the source text]

After the feature maps are extracted, sum pooling (Sum Pooling) is applied to them. Given the feature map of layer m at time t, pooling is performed with the following formula:

[formula missing from the source text]

where j indexes the pooling window and N is the temporal pooling range; 15 frames are used here. The dimension of the final descriptor is determined by the height of the feature map, the width of the feature map, the number of feature map channels C, and the total number of video frames T. For the fully connected layer, the 4096-dimensional feature vector output by the network is used directly.
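Because the pooling formula itself is not reproduced in this text, the following is only one plausible reading of temporal sum pooling: per-frame feature maps are summed over consecutive, non-overlapping windows of N frames. The non-overlapping windows, the shorter final window, and the function name are assumptions.

```python
import numpy as np

def temporal_sum_pool(feature_maps, window=15):
    """Sum feature maps over non-overlapping windows of `window` frames.

    feature_maps: array of shape (T, H, W, C), one conv feature map per frame.
    Returns an array of shape (ceil(T / window), H, W, C).
    Non-overlapping windows and a shorter last window are assumptions.
    """
    feature_maps = np.asarray(feature_maps, dtype=float)
    starts = range(0, feature_maps.shape[0], window)
    return np.stack([feature_maps[s:s + window].sum(axis=0) for s in starts])

# 40 frames of small illustrative feature maps, pooled with N = 15.
maps = np.ones((40, 4, 4, 8))
pooled = temporal_sum_pool(maps, window=15)
```

In this sketch a 40-frame video yields three pooled maps, covering 15, 15, and 10 frames, so the number of descriptors shrinks roughly by a factor of N while each descriptor aggregates motion over its window.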

2. Recurrent neural networks

There are two main types of neural network in the prior art: feedforward neural networks and recurrent neural networks. Feedforward networks have been successful in many applications but do not handle sequence problems well. Recurrent neural networks, by contrast, have properties that make modeling the time domain straightforward. A recurrent neural network takes a sequence as input; for a video sequence, the output depends on the current frame and the preceding frame. If the input sequence is written as $x = (x_1, \dots, x_T)$, then:

$$h_t = f(W_{ih} x_t + W_{hh} h_{t-1} + b_h)$$

where $h_t$ is the activation of the hidden layer at time t, $W_{ih}$ is the weight matrix from the input layer to the hidden layer, $W_{hh}$ is the weight matrix between hidden layers, $b_h$ is the bias, and f is the activation function. Finally, the output is given by:

$$y_t = W_{ho} h_t + b_o$$

where $W_{ho}$ is the weight matrix from the hidden layer to the output layer and $b_o$ is the output bias.
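The recurrence described above can be sketched in a few lines. The symbol names (`W_ih`, `W_hh`, `b_h`, `W_ho`, `b_o`), the zero initial hidden state, and the choice of tanh as the activation function are assumptions for illustration; the text names only the roles of these quantities.

```python
import numpy as np

def rnn_forward(xs, W_ih, W_hh, b_h, W_ho, b_o, f=np.tanh):
    """Run a simple recurrent network over a sequence.

    xs: array of shape (T, input_dim). Returns outputs of shape (T, output_dim).
    The initial hidden state is assumed to be zero.
    """
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x_t in xs:
        h = f(W_ih @ x_t + W_hh @ h + b_h)  # current input plus previous hidden state
        outputs.append(W_ho @ h + b_o)      # linear read-out from the hidden state
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, D, H, K = 5, 4, 3, 2  # frames, input size, hidden size, outputs (illustrative)
xs = rng.normal(size=(T, D))
ys = rnn_forward(xs, rng.normal(size=(H, D)), rng.normal(size=(H, H)),
                 np.zeros(H), rng.normal(size=(K, H)), np.zeros(K))
```

Each output depends on the current frame directly and on earlier frames only through the hidden state, which is why long-range influence fades in a plain recurrent network.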

The main problem with RNNs is that they can model only short sequences effectively. As the network depth increases, gradients vanish and past frames have too little effect on the current action. To address this, the long short-term memory unit (LSTM) introduces three gates to maintain the network state: an input gate, a forget gate, and an output gate. With these three gates, gradients change more smoothly when sequence problems are processed. LSTM uses a more complex architecture than a conventional recurrent neural network to improve performance on long time series. Fig. 3 shows the LSTM architecture.

3. Bidirectional long short-term memory units

Although an LSTM can capture long-range sequence information, it is unidirectional: in an LSTM the current frame is influenced only by preceding frames. In the invention, to strengthen this relation, the LSTM is extended to be bidirectional, so that the influence of later frames is also considered when the current frame is processed. This is implemented by adding a layer that processes the sequence from back to front. The bidirectional LSTM model of this embodiment is shown in Fig. 4. The first layer processes the sequence from front to back and the second layer processes it from back to front. The results of the two layers are then jointly input to the Softmax classifier. A unidirectional LSTM captures time-evolution information in one direction only, whereas a bidirectional LSTM models the sequence structure in both directions and can therefore capture more structural information.
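The two-direction processing described above can be sketched with a standard LSTM cell. This uses the common textbook LSTM formulation, not necessarily the exact variant of the embodiment; concatenating the two hidden states before the Softmax, the gate ordering, and all names and sizes are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(xs, W, U, b):
    """Run one LSTM layer over xs (T, D) and return hidden states (T, H).

    W: (4H, D), U: (4H, H), b: (4H,), stacked in the order input gate,
    forget gate, output gate, candidate cell values (ordering is an assumption).
    """
    H = U.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    states = []
    for x_t in xs:
        z = W @ x_t + U @ h + b
        i, f, o = sigmoid(z[:H]), sigmoid(z[H:2 * H]), sigmoid(z[2 * H:3 * H])
        g = np.tanh(z[3 * H:])
        c = f * c + i * g    # forget part of the old memory, write the new candidate
        h = o * np.tanh(c)   # expose the gated memory as the hidden state
        states.append(h)
    return np.stack(states)

def bilstm_frame_probs(xs, fwd, bwd, W_out, b_out):
    """Per-frame class probabilities from a forward and a backward LSTM pass."""
    h_fwd = lstm_pass(xs, *fwd)
    h_bwd = lstm_pass(xs[::-1], *bwd)[::-1]  # back-to-front pass, realigned to frame order
    logits = np.concatenate([h_fwd, h_bwd], axis=1) @ W_out.T + b_out
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability for Softmax
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
T, D, H, K = 6, 8, 5, 3  # frames, feature size, hidden size, classes (illustrative)

def random_lstm_params():
    return rng.normal(size=(4 * H, D)), rng.normal(size=(4 * H, H)), np.zeros(4 * H)

fwd, bwd = random_lstm_params(), random_lstm_params()
W_out, b_out = rng.normal(size=(K, 2 * H)), np.zeros(K)
xs = rng.normal(size=(T, D))
probs = bilstm_frame_probs(xs, fwd, bwd, W_out, b_out)  # shape (T, K), rows sum to 1
```

The forward states at a given frame summarize only earlier frames, and the backward states summarize only later frames, so the classifier for each frame sees context from both sides.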

The model was tested on the UCF101 and HMDB51 datasets. UCF101 contains 13,320 video sequences in 101 behavior classes. As shown in Fig. 5, these videos are shot by professionals. The dataset is divided into three splits, each with its own training set and test set. The method of this embodiment is tested on each split and the results are averaged. HMDB51 comes from many sources, such as films and Internet video. It consists of 6,766 video sequences in 51 behavior classes. It is more challenging than other datasets because its backgrounds are more complex. This dataset is also divided into three splits, and the mean of the three results is taken as the final result.

To choose useful CNN layers, the experimental results of different layers were compared on the UCF101 dataset. Fig. 6 shows that, for a single input image, different layers have different activations. The first few layers carry more detail and the last few layers are more abstract. However, the first and second convolutional layers were also found to contain too much interfering information.

The next step is a quantitative analysis. For a simple and fast comparison, only 10 frames are sampled from each video. On the one hand the features of a short clip are representative enough, and on the other hand this reduces experiment time. The results in Table 1 are produced with RGB frames as input to the spatial network. Table 1 also shows that the earliest layers do not perform best, which affects the final result.
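The text does not say how the 10 frames are chosen. A minimal sketch, assuming evenly spaced sampling at the centre of equal-length segments (the function name and the sampling rule are hypothetical):

```python
def sample_frame_indices(num_frames, num_samples=10):
    """Return up to num_samples evenly spaced frame indices in [0, num_frames).

    Evenly spaced sampling is an assumption; the source only states that
    10 frames are sampled per video.
    """
    if num_frames <= num_samples:
        return list(range(num_frames))
    step = num_frames / num_samples
    # Take the frame at the centre of each of the num_samples segments.
    return [int(step * k + step / 2) for k in range(num_samples)]

indices = sample_frame_indices(100)
```

For a 100-frame video this picks one frame from the middle of every 10-frame segment; videos with 10 frames or fewer are used in full.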

Table 1. Performance of different single layers on UCF101

Layer	conv1	conv2	conv3	conv4	conv5	fc6	fc7
Accuracy	33.5%	44.7%	68.2%	68.9%	69.1%	60.3%	61.4%

Next are the results for combinations of layers. Combining multiple layers gives better results than any single layer, which shows that multi-layer CNN features are more powerful. As shown in Table 2, the experiments found that using the third convolutional layer (conv3), the fifth convolutional layer (conv5), and the seventh fully connected layer (fc7) gives the best result.

Table 2. Performance of different layer combinations on UCF101

Layers	Accuracy
conv3 + conv4	70.1%
conv4 + conv5	69.2%
conv3 + conv5	70.5%
conv3 + conv5 + fc6	70.3%
conv3 + conv5 + fc7	70.8%

The bidirectional long short-term memory unit (Bi-LSTM) of this embodiment was then compared experimentally with an averaging model and a unidirectional long short-term memory unit. The averaging model obtains the final result by directly averaging the Softmax scores. The unidirectional LSTM model uses a single LSTM layer. All three models take fc7 as input. The results in Table 3 show that the bidirectional LSTM outperforms both the averaging model and the unidirectional LSTM.

Table 3. Performance of different models on UCF101

Model	Accuracy
Averaging model	61.4%
Unidirectional LSTM	62.1%
Bidirectional LSTM	63.5%

Finally, Table 4 lists the accuracy of various methods. It shows that the method combining multi-layer CNN features with a bidirectional LSTM effectively improves behavior recognition performance.

Table 4. Results of different methods on the UCF101 and HMDB51 datasets

Model	UCF101	HMDB51
STIP+BovW (2011, 2012)	43.9%	23.0%
Motionlets (2013)	-	42.1%
DT+MVSV (2014)	83.5%	55.9%
iDT+FV (2013)	85.9%	57.2%
iDT+HSV (2014)	87.9%	61.1%
Two-Stream (2014)	88.0%	59.4%
LRCN (2015)	82.9%	-
BSS (2015)	88.6%	-
Composite LSTM (2015)	84.3%	-
This embodiment	88.9%	62.3%

Claims

1. A behavior recognition method for video based on bidirectional long short-term memory units, comprising the following steps:

2. The behavior recognition method for video based on bidirectional long short-term memory units according to claim 1, characterized in that: in step (2), the third convolutional layer, the fifth convolutional layer, and the seventh fully connected layer are selected as the feature representation.

3. The behavior recognition method for video based on bidirectional long short-term memory units according to claim 1, characterized in that the sum pooling is:

4. The behavior recognition method for video based on bidirectional long short-term memory units according to claim 3, characterized in that: after the sum pooling and before step (4), spatial pyramid max pooling with a two-level pyramid is applied to obtain a fixed-length feature vector.

5. The behavior recognition method for video based on bidirectional long short-term memory units according to claim 3, characterized in that: N is 15.