Infrared action recognition method based on the adaptive fusion of hand-crafted features and deep learning features
Technical field
The invention belongs to the technical fields of image processing and computer vision, and relates to an infrared action recognition method based on the adaptive fusion of hand-crafted features and deep learning features.
Background art
In recent years, action recognition in images and video has become a vital task in the field of computer vision, and is of great significance for work such as video surveillance, video information retrieval, and human-computer interaction. As various recognition algorithms continually refresh the accuracy records on public datasets, the task of action recognition in video has made great progress. However, most current datasets are based on visible-light video, and relatively little work addresses action recognition in infrared video.
Mainstream action recognition algorithms mainly involve two kinds of descriptors: hand-crafted feature descriptors and descriptors obtained by deep learning.
Hand-crafted features are typically local descriptors, such as spatio-temporal interest points (Spatial-Temporal Interest Point, STIP), histograms of oriented gradients (Histogram of Oriented Gradient, HOG), histograms of optical flow (Histogram of Optical Flow, HOF), and dense trajectory features (Dense Trajectory, DT). They classify and identify different actions based on image texture, visual shape, and motion information between frames. Because the dense trajectory feature combines rich sub-descriptors such as HOG, HOF, and MBH (Motion Boundary Histogram), it has become the hand-crafted feature with the highest recognition accuracy at present. With the growth of computing power, CNN features extracted by convolutional neural networks have become a popular research direction for action recognition in images and video in recent years. Convolutional neural networks mine image information deeply and effectively extract discriminative information. Among the models proposed so far, such as 3D convolutional neural networks, deep convolutional neural networks, and two-stream convolutional neural networks, the best-performing is a two-stream convolutional neural network composed of a temporal information channel and a spatial information channel, which has achieved good experimental results on several challenging datasets.
However, relatively little current research addresses action recognition in infrared video. In video surveillance, visible-light monitoring loses its value under low-visibility weather such as rain or fog, or at night. Infrared video action recognition therefore has very important practical value, and an effective infrared video action recognition algorithm is urgently needed.
Summary of the invention
In view of this, an object of the invention is to provide an infrared action recognition method based on the adaptive fusion of hand-crafted features and deep learning features. The method makes full use of the characteristics and advantages of infrared images, improves the hand-crafted feature, and, through two classifier networks and a weight learning function, fuses the probability outputs of the hand-crafted feature classifier network and the deep learning feature classifier network with learned weights, effectively improving the accuracy of action recognition in infrared video.
To achieve the above object, the present invention provides the following technical solution:
An infrared action recognition method based on the adaptive fusion of hand-crafted features and deep learning features, comprising the following steps:
S1: extracting an improved dense trajectory feature from the original video by a hand-crafted feature module;
S2: encoding the hand-crafted feature extracted in step S1;
S3: extracting optical flow information from the original video image sequence by a CNN feature module using a variational optical flow algorithm, obtaining the corresponding optical flow image sequence as the input of a convolutional neural network;
S4: extracting CNN features from the optical flow image sequence obtained in step S3 using the convolutional neural network;
S5: dividing the dataset into a training set and a test set; for the training set, learning weights through a weight optimization network, using the learned weights to fuse the probability outputs of the CNN feature classifier network and the hand-crafted feature classifier network, determining the optimal weight by comparing recognition results, and applying it to the classification of the test set.
Further, in step S1, extracting the improved hand-crafted feature from the original video specifically includes: first densely sampling interest points in each frame of the infrared image, then tracking the interest points between consecutive frames, adding a grayscale weight for the corresponding image location to the resulting trajectory descriptors, and taking the weighted trajectory descriptors as the improved dense trajectory feature.
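The grayscale weighting of step S1 can be sketched on toy data as follows; the descriptor layout and the normalization by the maximum gray value are illustrative assumptions, not the exact scheme of the invention.

```python
# Sketch only: grayscale weighting of a trajectory descriptor (step S1).
# The descriptor layout and normalization here are assumptions for illustration.

def weight_trajectory(descriptor, gray_values, max_gray=255.0):
    """Scale a trajectory descriptor by the mean infrared intensity along the track.

    Brighter (hotter) regions get a weight near 1 and dark background a weight
    near 0, so foreground motion dominates the encoded feature.
    """
    weight = sum(gray_values) / (len(gray_values) * max_gray)
    return [weight * x for x in descriptor]

# Toy trajectory: a 4-dim descriptor tracked over 3 frames with strong infrared response.
desc = [0.2, 0.4, 0.1, 0.3]
weighted = weight_trajectory(desc, gray_values=[204, 229, 178])
```

In this sketch a trajectory crossing a hot target keeps most of its descriptor magnitude, while a background trajectory with low gray values is suppressed.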
Further, in step S2, the improved dense trajectory feature extracted is encoded in one of the following ways:
1): Bag-of-Words (BOW)
This feature encoding method includes two steps:
11) generating a visual dictionary: extracting features from the training set and clustering the extracted features with a clustering algorithm; each cluster center can be regarded as a visual word in the dictionary, and all visual words form a visual dictionary;
12) representing an image with the words in the dictionary: each feature in the image is mapped to a word of the visual dictionary, then the number of occurrences of each visual word in the image is counted, and the image can be represented as a histogram vector of fixed dimension;
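Step 12) can be sketched as follows; for brevity the dictionary is given directly rather than learned by clustering as in step 11).

```python
# Minimal Bag-of-Words encoding sketch (step 12): map each feature to its
# nearest visual word and count occurrences. The toy dictionary is fixed, not
# learned by clustering, so the example stays self-contained.

def bow_encode(features, dictionary):
    hist = [0] * len(dictionary)
    for f in features:
        # nearest visual word by squared Euclidean distance
        dists = [sum((a - b) ** 2 for a, b in zip(f, w)) for w in dictionary]
        hist[dists.index(min(dists))] += 1
    return hist

# Two-word dictionary in 2-D; three toy features.
dictionary = [(0.0, 0.0), (1.0, 1.0)]
features = [(0.1, 0.2), (0.9, 1.1), (0.8, 0.7)]
histogram = bow_encode(features, dictionary)  # → [1, 2]
```

The resulting histogram has a fixed dimension equal to the dictionary size, regardless of how many features the image contains.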
2): Fisher Vector (FV)
Fisher Vector encoding is based on the Fisher kernel principle. A Gaussian mixture model (GMM) is first trained on the training samples by maximum likelihood estimation; the GMM is then used to model the raw features (e.g. Dense-Traj) extracted from a sample, and the generated model parameters are used to encode the sample's raw features into a Fisher vector that is convenient for learning and measurement;
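A minimal first-order sketch of this encoding, assuming a fixed one-dimensional two-component diagonal GMM; a full Fisher Vector also includes second-order (variance) terms and learns the GMM by maximum likelihood, both omitted here for brevity.

```python
# Simplified first-order Fisher Vector sketch with a fixed 1-D two-component GMM.
import math

def fv_encode(xs, means, sigmas, weights):
    """Per component: posterior-weighted average of the normalized residual."""
    K, N = len(means), len(xs)
    fv = [0.0] * K
    for x in xs:
        # posterior (soft assignment) of x to each Gaussian; the 1/sqrt(2*pi)
        # constant cancels in the normalization, so it is dropped
        p = [w * math.exp(-0.5 * ((x - m) / s) ** 2) / s
             for w, m, s in zip(weights, means, sigmas)]
        z = sum(p)
        for k in range(K):
            fv[k] += (p[k] / z) * (x - means[k]) / sigmas[k]
    return [v / (N * math.sqrt(w)) for v, w in zip(fv, weights)]

# Features near the first mean pull its component positive; the feature just
# below the second mean pulls that component negative.
code = fv_encode([0.1, 0.2, 1.9], means=[0.0, 2.0], sigmas=[1.0, 1.0],
                 weights=[0.5, 0.5])
```

Unlike a BOW histogram, the Fisher vector records not only which Gaussian a feature falls near but also on which side and how far, which is what makes it "convenient for learning and measurement".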
3): Vector of Locally Aggregated Descriptors (VLAD)
VLAD encoding first clusters the features extracted from the training set to obtain a codebook, then computes the residual between each raw feature and its assigned codebook word, accumulates these residuals per word, and finally concatenates the accumulated residuals of all words to form a new vector representing the image.
Further, in step S3, extracting optical flow information with a variational optical flow algorithm to obtain the corresponding optical flow image sequence specifically includes:
S31: proposing an improved energy functional under the brightness constancy assumption, the gradient constancy assumption, and a continuous spatio-temporal smoothness constraint;
S32: deriving the corresponding Euler-Lagrange equations from the energy functional, then solving for the optical flow vectors with the Gauss-Seidel or SOR method to obtain the corresponding optical flow images.
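The solution step S32 can be illustrated with one Horn-Schunck-style fixed-point update for the flow (u, v) at a single pixel; the invention's improved functional additionally includes gradient constancy and spatio-temporal smoothness terms, so the classical brightness-constancy form below is a simplification.

```python
# One Gauss-Seidel-style iteration of the Euler-Lagrange equations for the
# classical Horn-Schunck model (simplified stand-in for the improved functional):
#   (alpha^2 + Ix^2 + Iy^2) * (u - u_avg) = -Ix * (Ix*u_avg + Iy*v_avg + It)

def hs_update(Ix, Iy, It, u_avg, v_avg, alpha=1.0):
    denom = alpha ** 2 + Ix ** 2 + Iy ** 2
    residual = Ix * u_avg + Iy * v_avg + It  # brightness-constancy violation
    u = u_avg - Ix * residual / denom
    v = v_avg - Iy * residual / denom
    return u, v

# Spatial gradients and temporal difference at one pixel; zero neighbor flow.
# Brightness moved right (It < 0 along a positive x-gradient), so u > 0.
u, v = hs_update(Ix=1.0, Iy=0.0, It=-0.5, u_avg=0.0, v_avg=0.0)
# residual = -0.5, denom = 2.0 → u = 0.25, v = 0.0
```

Sweeping this update over all pixels (Gauss-Seidel), or with an over-relaxation factor (SOR), iteratively drives the flow field toward the minimizer of the energy functional.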
Further, in step S4, the convolutional neural network extracts, from the obtained optical flow image sequence, the output of a fully connected layer as the CNN feature; the CNN adopts a multi-layer structure in which convolutional layers and pooling layers alternate, and the network output layer is fully connected in the feedforward manner.
Further, in step S5, dividing the dataset into a training set and a test set, learning the optimal weight from the training set through the adaptive fusion module, and applying the optimal weight to the classification of the test set specifically includes:
S51: learning the optimal weight from the training set:
The adaptive fusion module includes two classifier networks, each comprising a fully connected layer and a softmax classifier layer, and a single-node logistic function; the inputs of the two classifier networks are the hand-crafted features and the CNN features of the training set, respectively. After the two kinds of features are fed into their corresponding networks, the respective probability outputs P1 and P2 are obtained, while the single-node logistic function computes the corresponding weight Q; the final probability outputs of the two classifier networks are fused with the specified weight, and the error backpropagation algorithm is used to evaluate the recognition error, update the gradients, and determine and output the optimal weight;
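The weighted fusion in S51 can be sketched on toy probabilities as follows; the single scalar input z to the logistic node and the complementary (1 − Q) weighting of the second network are assumptions about the exact form of the invention.

```python
# Sketch of the S51 fusion: a logistic (sigmoid) node produces the weight Q,
# and the two classifiers' probability outputs are combined as Q*P1 + (1-Q)*P2.
import math

def fuse(p1, p2, z):
    q = 1.0 / (1.0 + math.exp(-z))  # single-node logistic function
    return [q * a + (1.0 - q) * b for a, b in zip(p1, p2)]

# Class probabilities from the hand-crafted-feature and CNN-feature classifiers.
p1 = [0.7, 0.2, 0.1]
p2 = [0.4, 0.5, 0.1]
fused = fuse(p1, p2, z=0.0)  # q = 0.5: equal trust in both classifiers
```

Because both inputs are probability distributions and the weights sum to one, the fused output is again a valid distribution over the action classes.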
S52: applying the optimal weight to the classification of the test set:
The hand-crafted features and CNN features extracted from the test set are fed into the corresponding classifier networks to obtain the corresponding probability outputs; the probability outputs of the two classifier networks are fused with the optimal weight obtained in S51, yielding the recognition results for the test set.
The beneficial effects of the present invention are: the described method highlights the advantages of infrared video over visible-light video, combines traditional hand-crafted features with the rapidly developing deep learning features in the action recognition task, innovates the feature fusion scheme, and improves the reliability of action recognition in infrared video, which is of great significance for subsequent video analysis work.
Description of the drawings
To make the purpose, technical solution, and beneficial effects of the present invention clearer, the invention provides the following drawings for explanation:
Fig. 1 is a schematic diagram of the two-stream infrared action recognition algorithm of the invention;
Fig. 2 is a schematic diagram of the improved dense trajectory feature extraction;
Fig. 3 is a schematic diagram of the optical flow convolutional neural network feature extraction;
Fig. 4 is a schematic diagram of the adaptive feature fusion model.
Detailed description of the invention
The preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
In the present invention, the hand-crafted feature module extracts the improved dense trajectory feature from the original video and encodes the extracted feature. The improved dense trajectory feature adds grayscale weight information to the descriptor of the original dense trajectory; it mainly embodies the spatio-temporal information of the video image sequence and highlights the foreground motion information of the image sequence. The CNN feature module extracts optical flow information from the original infrared video image sequence with a variational optical flow algorithm to form an optical flow image sequence, takes the extracted optical flow images as the input of a convolutional neural network, and takes the features of the fully connected layer of the convolutional neural network as the video's CNN feature descriptor. The adaptive fusion module first divides the dataset into a training set and a test set; for the training set, it learns weights through a weight learning network, uses the learned weights to fuse the probability outputs of the CNN feature classifier network and the hand-crafted feature classifier network, determines the optimal weight by comparing recognition results, and applies the optimal weight to the classification of the test set.
Fig. 1 is a schematic diagram of the two-stream infrared action recognition algorithm of the invention. As shown, the method specifically includes the following steps:
Step S1: extracting the improved dense trajectory feature from the original video; the specific flow is shown in Fig. 2:
S101: densely sampling interest points in each frame of the original infrared video;
S102: tracking the interest points between consecutive frames; pixels whose displacement between two consecutive frames is either negligibly small or excessively large are rejected, finally yielding interest point trajectories that can characterize the action;
S103: adding a grayscale weight for the corresponding image location to the resulting trajectory descriptors, and taking the weighted trajectory descriptors as the improved dense trajectory feature.
The original video is an infrared dataset with a resolution of 293 × 256, containing 12 action classes with 50 video samples per class. The dense trajectory feature is the hand-crafted feature with higher accuracy in current image classification and recognition tasks, but the datasets mostly used for testing are all visible-light datasets. Because an infrared image reflects the infrared radiation of each target in the scene, the stronger the heat radiation, the larger the gray value, producing obvious contrast between target and background. Based on this characteristic of infrared images, we add a grayscale weight on the basis of the original dense trajectory feature, so that trajectories with high gray values receive larger weights and trajectories with lower gray values receive smaller weights, improving the original dense trajectory feature and highlighting the advantage of infrared images for action recognition.
Step S2: performing Fisher Vector encoding on the extracted hand-crafted features.
Step S3: extracting optical flow information from the original video image sequence with a variational optical flow algorithm to form an optical flow image sequence. The data term of this algorithm's energy functional consists of the brightness constancy assumption and the gradient constancy assumption, with a discontinuity-preserving spatio-temporal smoothness constraint added; the algorithm has good continuity and rotational invariance, computes quickly, and achieves high precision.
Step S4: preprocessing the images of the optical flow image sequence obtained in step S3; the image size is adjusted to a resolution of 227 × 227 and used as the input of the convolutional neural network. This convolutional neural network consists of five convolutional layers and three fully connected layers; the features of the second fully connected layer are finally taken as the feature representation. The flow is shown in Fig. 3.
A convolutional neural network (Convolutional Neural Network, CNN) is a kind of multi-layer feedforward network: convolutional layers and pooling layers alternate as the intermediate layers, the network output is fully connected in the feedforward manner, and the dimension of the output layer equals the number of classes in the classification task. A convolutional neural network can automatically learn relevant features directly from the raw input data, eliminating the feature design process required by general algorithms, and can learn and discover more effective features.
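The alternating convolution / pooling structure just described can be sketched in one dimension for brevity; real CNNs use 2-D convolutions, many channels, and learned kernels.

```python
# Toy sketch of one convolution layer followed by one max-pooling layer, in 1-D.

def conv1d(signal, kernel):
    n = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(n))
            for i in range(len(signal) - n + 1)]

def max_pool(signal, size=2):
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

x = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
feature_map = conv1d(x, kernel=[1.0, -1.0])  # simple edge (difference) detector
pooled = max_pool(feature_map)               # downsampling (pooling) layer
```

Stacking such pairs, then flattening into fully connected layers, yields the five-convolution, three-fully-connected architecture used in step S4.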
Step S5: learning the optimal weight from the training set with the adaptive fusion module and applying it to the classification of the test set. The schematic diagram of the adaptive fusion model is shown in Fig. 4:
S501: dividing the dataset into a training set and a test set;
S502: learning the optimal weight from the training set samples: the adaptive fusion model includes two classifier networks, each comprising a fully connected layer and a softmax classifier layer, and a single-node logistic function; the inputs of the two classifier networks are the hand-crafted features and the CNN features extracted from the training set, respectively. After the two kinds of features are fed into their corresponding networks, the respective probability outputs P1 and P2 are obtained, while the single-node logistic function computes the corresponding weight Q; the final probability outputs of the two classifier networks are fused with the specified weight, and the error backpropagation algorithm is used to evaluate the recognition error, update the gradients, and determine and output the optimal weight;
S503: applying the optimal weight to the classification of the test set: the hand-crafted features and CNN features extracted from the test set samples are fed into the corresponding classifier networks to obtain the corresponding probability outputs; the probability outputs of the two classifier networks are fused with the optimal weight, yielding the recognition results for the test set.
The main purpose of the error backpropagation (Error Backpropagation, BP) algorithm is to propagate the output error backwards and distribute it to all units of each layer, thereby obtaining the error signal of each layer's units and then revising the weights of each unit. The learning process of the BP algorithm consists of two phases: the forward propagation of the signal and the back propagation of the error. During forward propagation, an input sample enters at the input layer, is processed layer by layer through the hidden layers, and is passed to the output layer. If the actual output of the output layer does not match the desired output, the process enters the error back-propagation phase. Error back-propagation passes the output error backwards, layer by layer, through the hidden layers to the input layer, distributing the error to all units of each layer; the error signal of each layer's units obtained in this way serves as the basis for correcting the unit weights. In this module, this repeated cycle of weight adjustment is what finally yields the optimal weight.
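The weight-learning cycle described above can be sketched as gradient descent on the logistic node's single parameter z, so that the fused output favors whichever classifier is more accurate on the training data; the squared-error loss against one-hot labels and the learning rate are assumptions for illustration, not the exact training objective of the invention.

```python
# Sketch: learn the fusion weight Q = sigmoid(z) by gradient descent on a
# squared error between fused probabilities and one-hot labels.
import math

def train_fusion_weight(p1s, p2s, labels, steps=200, lr=1.0):
    z = 0.0
    for _ in range(steps):
        q = 1.0 / (1.0 + math.exp(-z))
        grad = 0.0
        for p1, p2, y in zip(p1s, p2s, labels):
            for c in range(len(p1)):
                fused = q * p1[c] + (1 - q) * p2[c]
                target = 1.0 if c == y else 0.0
                # chain rule: d(loss)/dz = d(loss)/d(fused) * d(fused)/dq * dq/dz
                grad += 2 * (fused - target) * (p1[c] - p2[c]) * q * (1 - q)
        z -= lr * grad  # error back-propagated to the single weight node
    return 1.0 / (1.0 + math.exp(-z))

# Classifier 1 is right on both training samples, classifier 2 on neither,
# so the learned weight Q should move well above 0.5 toward classifier 1.
p1s = [[0.8, 0.2], [0.3, 0.7]]
p2s = [[0.4, 0.6], [0.6, 0.4]]
q = train_fusion_weight(p1s, p2s, labels=[0, 1])
```

In the full model the same back-propagated error also updates the fully connected layers of the two classifier networks; here only the scalar fusion weight is trained, which is the part that produces the "optimal weight" exported to step S503.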
Finally, it should be noted that the above preferred embodiments are intended only to illustrate the technical solution of the present invention and not to restrict it. Although the present invention has been described in detail through the above preferred embodiments, those skilled in the art will appreciate that various changes in form and detail may be made without departing from the scope defined by the claims of the present invention.