CN106599907A - Multi-feature fusion-based dynamic scene classification method and apparatus - Google Patents
Multi-feature fusion-based dynamic scene classification method and apparatus
- Publication number
- CN106599907A CN106599907A CN201611073666.5A CN201611073666A CN106599907A CN 106599907 A CN106599907 A CN 106599907A CN 201611073666 A CN201611073666 A CN 201611073666A CN 106599907 A CN106599907 A CN 106599907A
- Authority
- CN
- China
- Prior art keywords
- feature
- video
- feature information
- sorted
- information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a multi-feature fusion-based dynamic scene classification method and apparatus. The method comprises the following steps: a video to be classified is obtained; a C3D feature extractor performs feature extraction on the video to be classified to obtain first feature information; an iDT feature extractor performs feature extraction on the video to be classified to obtain second feature information; a VGG feature extractor performs feature extraction on the video to be classified to obtain third feature information; the first feature information, second feature information, and third feature information are fused to obtain a fusion feature; and the video to be classified is classified according to the fusion feature to obtain its classification result. Because the three kinds of feature extractors extract different features of the video to be classified, the method takes into account not only the short-term dynamic features of the video but also its long-term dynamic features and static features, enabling accurate dynamic scene classification.
Description
Technical field
The present invention relates to aviation surveillance technology, and more particularly to a multi-feature fusion-based dynamic scene classification method and apparatus.
Background technology
With the development of unmanned aerial vehicle (UAV) technology and the continuing opening of the country's low-altitude airspace, UAVs are widely used in tasks such as disaster inspection, mountain rescue, goods delivery, and sample collection. A camera-equipped UAV shoots video during flight and returns the footage to a server; the server can automatically perform target detection and tracking according to the image content, and can automatically identify weather, environment, disaster conditions, and the like.
To improve the accuracy of target detection and tracking, those skilled in the art have, in addition to conducting extensive research on and improvement of the algorithms themselves, also recognized that differences in the dynamic scene in which the target is located can severely affect tracking accuracy. It has therefore been proposed to classify the dynamic scene before performing target detection and tracking. However, existing dynamic scene classification methods are generally based only on still images, resulting in poor classification accuracy.
Summary of the invention
The present invention provides a multi-feature fusion-based dynamic scene classification method and apparatus, to solve the problem that existing dynamic scene classification methods, being generally based only on still images, yield poor classification accuracy.
In one aspect, the present invention provides a multi-feature fusion-based dynamic scene classification method, including:
obtaining a video to be classified;
performing feature extraction on the video to be classified using a three-dimensional convolutional neural network feature extractor to obtain first feature information; performing feature extraction on the video to be classified using an improved dense trajectory feature extractor to obtain second feature information; performing feature extraction on the video to be classified using a visual geometry neural network feature extractor to obtain third feature information;
fusing the first feature information, the second feature information, and the third feature information to obtain a fusion feature;
classifying the video to be classified according to the fusion feature, to obtain a classification result of the video to be classified.
In the multi-feature fusion-based dynamic scene classification method as described above, fusing the first feature information, the second feature information, and the third feature information to obtain the fusion feature includes:
obtaining first feature data corresponding to first preset dimensions of the first feature information, second feature data corresponding to second preset dimensions of the second feature information, and third feature data corresponding to third preset dimensions of the third feature information;
obtaining the fusion feature from the first feature data, the second feature data, and the third feature data.
In the multi-feature fusion-based dynamic scene classification method as described above, before fusing the first, second, and third feature information to obtain the fusion feature, the method further includes:
obtaining the first feature information, second feature information, and third feature information of every training video in a training video library;
obtaining, from the first, second, and third feature information of all training videos, the Fisher discriminant ratio of every dimension of the first, second, and third feature information;
determining the first preset dimensions of the first feature information from the Fisher discriminant ratios of all of its dimensions, the second preset dimensions of the second feature information from the Fisher discriminant ratios of all of its dimensions, and the third preset dimensions of the third feature information from the Fisher discriminant ratios of all of its dimensions;
wherein the training video library includes at least two training videos belonging to different categories.
In the multi-feature fusion-based dynamic scene classification method as described above, the Fisher discriminant ratio of the i-th dimension of any feature information is obtained by the following formula:

k = S_b / S_i;

where S_i is the within-class variance of the i-th dimension, S_b is the between-class variance of the i-th dimension, J is the total number of categories to which the training videos belong, x_ij is the feature data of the i-th dimension of all training videos of the j-th category, m_ij is the mean of the feature data of the i-th dimension of all training videos of the j-th category, and m_ih is the mean of the feature data of the i-th dimension of all training videos of the h-th category. The value of i is a positive integer from 1 to I, where I is the total number of dimensions of the feature information to which the i-th dimension belongs; the value of j is a positive integer from 1 to J; and the value of h is a positive integer from 1 to J other than j.

In the multi-feature fusion-based dynamic scene classification method as described above, performing feature extraction on the video to be classified using the three-dimensional convolutional neural network feature extractor to obtain the first feature information includes:
segmenting the video to be classified to obtain at least one video segment containing N frames of images;
performing feature extraction on all of the video segments using the three-dimensional convolutional neural network feature extractor to obtain the first feature information;
wherein N is a positive integer greater than 1.
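The segmentation step above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation; the function name and the handling of leftover frames (clips shorter than N are dropped) are assumptions:

```python
def split_into_clips(frames, n=16):
    """Split a list of frames into consecutive clips of n frames each.

    Frames left over at the end (fewer than n) are dropped, so every
    clip fed to the C3D extractor has exactly n frames.
    """
    if n <= 1:
        raise ValueError("n must be a positive integer greater than 1")
    return [frames[i:i + n] for i in range(0, len(frames) - n + 1, n)]

clips = split_into_clips(list(range(40)), n=16)
# 40 frames yield two full 16-frame clips; the last 8 frames are dropped
```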
In the multi-feature fusion-based dynamic scene classification method as described above, performing feature extraction on the video to be classified using the improved dense trajectory feature extractor to obtain the second feature information includes:
obtaining the dense trajectory features and the homography matrix of the video to be classified;
correcting the dense trajectory features using the homography matrix to obtain the second feature information.
In the multi-feature fusion-based dynamic scene classification method as described above, performing feature extraction on the video to be classified using the visual geometry neural network feature extractor to obtain the third feature information includes:
extracting at least one key frame from the video to be classified, and performing feature extraction on the at least one key frame using the VGG feature extractor to obtain the third feature information.
In the multi-feature fusion-based dynamic scene classification method as described above, classifying the video to be classified according to the fusion feature to obtain the classification result of the video to be classified includes:
classifying the video to be classified according to the fusion feature using a support vector machine classifier, to obtain the classification result of the video to be classified.
The multi-feature fusion-based dynamic scene classification apparatus provided by the embodiments of the present invention is described below. The apparatus corresponds one-to-one with the method, implements the multi-feature fusion-based dynamic scene classification method of the above embodiments, and has the same technical features and technical effects, which are not repeated here.
In another aspect, the present invention provides a multi-feature fusion-based dynamic scene classification apparatus, including:
a video obtaining module, configured to obtain a video to be classified;
a feature extraction module, configured to perform feature extraction on the video to be classified using a three-dimensional convolutional neural network feature extractor to obtain first feature information, perform feature extraction on the video to be classified using an improved dense trajectory feature extractor to obtain second feature information, and perform feature extraction on the video to be classified using a visual geometry neural network feature extractor to obtain third feature information;
a fusion module, configured to fuse the first feature information, the second feature information, and the third feature information to obtain a fusion feature;
a classification module, configured to classify the video to be classified according to the fusion feature to obtain a classification result of the video to be classified.
The multi-feature fusion-based dynamic scene classification method and apparatus provided by the present invention use three kinds of feature extractors, namely a C3D feature extractor, an iDT feature extractor, and a VGG feature extractor, to extract different features of the video to be classified, and perform dynamic scene classification after fusing the different features. The method considers not only the short-term dynamic features of the video to be classified but also its long-term dynamic features, and further fuses in its static information, making the dynamic scene classification more accurate.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of Embodiment 1 of the multi-feature fusion-based dynamic scene classification method provided by the present invention;
Fig. 2 is a flow diagram of Embodiment 2 of the multi-feature fusion-based dynamic scene classification method provided by the present invention;
Fig. 3 is a structural diagram of Embodiment 1 of the multi-feature fusion-based dynamic scene classification apparatus provided by the present invention.
Specific embodiments
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Before target detection and tracking, determining the type of dynamic scene helps improve the speed and precision of target detection and tracking. The difficulty of the dynamic scene classification problem is that, owing to factors such as illumination and viewpoint changes, the intra-class differences within a single category of dynamic scene are large, while the inter-class differences between different categories are small. For example, videos of the same forest-fire scene may differ greatly depending on the intensity of the fire, the shooting angle, and the density of the smoke; conversely, different combinations of the same objects can constitute different dynamic scenes, shrinking the gap between classes. The inter-class difference between a waterfall scene and a river scene, for instance, is small. When performing dynamic scene classification, existing methods based on recognizing the objects in the scene cannot overcome this technical problem of large intra-class differences and small inter-class differences, resulting in slow and inaccurate classification.
To solve the above problems, an embodiment of the present invention provides a multi-feature fusion-based dynamic scene classification method. Fig. 1 is a flow diagram of Embodiment 1 of the multi-feature fusion-based dynamic scene classification method provided by the present invention. The method is executed by a multi-feature fusion-based dynamic scene classification apparatus, which may be implemented in software or hardware; for example, the apparatus may be a server or a computer. As shown in Fig. 1, the method includes:
S101, obtaining a video to be classified;
S102, performing feature extraction on the video to be classified using a three-dimensional convolutional neural network feature extractor to obtain first feature information;
S103, performing feature extraction on the video to be classified using an improved dense trajectory feature extractor to obtain second feature information;
S104, performing feature extraction on the video to be classified using a visual geometry neural network feature extractor to obtain third feature information;
S105, fusing the first feature information, the second feature information, and the third feature information to obtain a fusion feature;
S106, classifying the video to be classified according to the fusion feature to obtain a classification result of the video to be classified.
S102, S103, and S104 may be performed simultaneously or in sequence; the present invention places no restriction on this.
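The S101-S106 flow can be sketched as a single function. The extractor, fusion, and classifier callables below are toy stand-ins chosen only to show the data flow, not the patent's actual models:

```python
def classify_video(video, c3d_extract, idt_extract, vgg_extract, fuse, classify):
    """Run the S101-S106 pipeline: extract three feature vectors,
    fuse them, and classify the fused vector."""
    f1 = c3d_extract(video)   # S102: short-term dynamic + some static features
    f2 = idt_extract(video)   # S103: long-term trajectory features
    f3 = vgg_extract(video)   # S104: static scene features
    fused = fuse(f1, f2, f3)  # S105: feature fusion
    return classify(fused)    # S106: final classification

# Toy stand-ins illustrating the data flow only:
result = classify_video(
    "video.mp4",
    c3d_extract=lambda v: [1.0, 2.0],
    idt_extract=lambda v: [3.0],
    vgg_extract=lambda v: [4.0],
    fuse=lambda *fs: [x for f in fs for x in f],      # simple concatenation
    classify=lambda fused: "forest-fire" if sum(fused) > 5 else "river",
)
```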
Specifically, in S101, the video to be classified may, for example, be video shot by a UAV during an inspection flight; video transmitted to the server in real time may be used as the video to be classified.
Specifically, in S102, the first feature information of the video to be classified is extracted using a three-dimensional convolution (Convolution 3D, C3D) neural network feature extractor. The C3D feature extractor is a convolutional neural network (CNN) architecture whose internal convolution kernels are 3 × 3 × 3 three-dimensional kernels. This feature extractor divides the video to be classified into multiple segments for processing while making use of the information in all frames of the video, so it can extract the short-term dynamic information and some of the static information of the input video.
Before the C3D feature extractor is used, it is first trained on the videos in a training video library, which generally contains a large number of labelled short-term motion videos rich in motion information. For example, the videos used in the training stage may come from the Sports-1M database, which consists of one million short motion videos, such as playing basketball and playing football. The first feature information obtained by the C3D feature extractor can therefore characterize the static information and short-term dynamic information latent in the video to be classified.
Optionally, performing feature extraction on the video to be classified using the C3D feature extractor in S102 to obtain the first feature information specifically includes:
S1021, segmenting the video to be classified to obtain at least one video segment containing N frames of images;
S1022, performing feature extraction on all of the video segments using the C3D feature extractor to obtain the first feature information;
wherein N is a positive integer greater than 1.
For example, considering that the C3D network is a feature extractor that processes video segments, the video to be classified can be divided into at least one segment of N frames each, where N is a positive integer greater than 1; for example, N may be 16. Optionally, N is also smaller than the total number of frames T of the video to be classified.
When N is 16, the basic configuration of the C3D feature extractor may, for example, be: five convolutional layers and five pooling layers, with one pooling layer following each convolutional layer, plus two fully connected layers and one classification layer for predicting the classification result. The numbers of neurons in the five convolutional layers are 64, 128, 256, 256, and 256, respectively. All convolutional layers have kernels of the same size, 3 × 3 × 3, and all pooling layers use max pooling with a 2 × 2 × 2 kernel. Each fully connected layer has 4096 neurons. When features are extracted with the C3D convolutional neural network, the features of the second fully connected layer are output as the result.
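As an illustration of the 2 × 2 × 2 max pooling used between the C3D convolutional layers, here is a minimal numpy sketch. The function name and the (T, H, W) single-channel layout are assumptions for illustration; a real C3D layer also carries a channel axis:

```python
import numpy as np

def max_pool_3d(x, k=2):
    """Non-overlapping k x k x k max pooling over a (T, H, W) volume,
    halving the temporal and both spatial dimensions when k = 2."""
    t, h, w = (d // k for d in x.shape)
    x = x[:t * k, :h * k, :w * k]          # drop ragged edges
    x = x.reshape(t, k, h, k, w, k)        # expose k-sized blocks
    return x.max(axis=(1, 3, 5))           # max within each block

vol = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
pooled = max_pool_3d(vol)                  # shape (2, 2, 2)
```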
Specifically, in S103, feature extraction is performed on the video to be classified using an improved dense trajectory (Improved Dense Trajectory, iDT) feature extractor to obtain the second feature information. The iDT feature extractor extracts the trajectory information in the video to be classified.
Optionally, performing feature extraction on the video to be classified using the improved dense trajectory feature extractor in S103 to obtain the second feature information includes:
S1031, obtaining the dense trajectory features and the homography matrix of the video to be classified;
S1032, correcting the dense trajectory features using the homography matrix to obtain the second feature information.
Specifically, an existing dense trajectory extraction algorithm based on the optical flow field can be used to obtain the dense trajectory features of the video to be classified. After the dense trajectories are obtained, all of them are filtered to remove trajectories that are completely stationary and trajectories with abrupt position changes.
Further, after the dense trajectories are obtained, it should be considered that the camera is likely carried in flight by a UAV, so the camera itself moves. The trajectory information caused by camera movement is mixed into the dense trajectories and may affect the dynamic scene classification, so the trajectories produced by camera movement must be filtered out. To eliminate this camera-induced trajectory information, a homography matrix can be generated by modelling and used to remove such trajectories.
To obtain the homography matrix, consecutive frames of the video to be classified are first registered, for example by a method combining speeded-up robust features (SURF) and optical flow; the homography matrix is then obtained with the random sample consensus (RANSAC) algorithm.
After the homography matrix is acquired, the dense trajectories can be corrected with it, removing the erroneous trajectory information caused by camera movement and yielding the second feature information. Because the iDT feature extractor extracts all trajectory information of the video to be classified from beginning to end, the second feature information can characterize the long-term trajectory features of the video, that is, its long-term dynamic information.
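The correction step can be sketched with numpy. The homography H is assumed to have already been estimated (e.g. from SURF matches with RANSAC, which is not shown here); the sketch only illustrates warping the previous frame's trajectory points with H and keeping the residual motion, and the function names are hypothetical:

```python
import numpy as np

def warp_points(points, H):
    """Apply a 3x3 homography to an (N, 2) array of (x, y) points."""
    pts = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    warped = pts @ H.T
    return warped[:, :2] / warped[:, 2:3]                 # back to Cartesian

def camera_compensated_displacement(p_prev, p_curr, H):
    """Displacement of tracked points after removing camera motion:
    warp the previous frame's points into the current frame with H,
    then take the residual as the object's own motion."""
    return p_curr - warp_points(p_prev, H)

# With an identity homography (static camera), nothing is compensated:
p0 = np.array([[10.0, 20.0], [30.0, 40.0]])
p1 = p0 + np.array([1.0, 0.0])            # points moved 1 px to the right
residual = camera_compensated_displacement(p0, p1, np.eye(3))
```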
Specifically, in S104, the visual geometry group (Visual Geometry Group, VGG) neural network feature extractor, proposed by the Visual Geometry Group of the Department of Engineering Science of the University of Oxford, extracts the static information of the video to be classified from some of its frames. The VGG feature extractor is also a CNN architecture; before it is used, it is first trained on a training image library. Unlike the C3D feature extractor, the training image library used by the VGG feature extractor contains a large number of labelled static scene images. For example, the images used in the training stage may come from the Places365 database, which consists of static scene pictures of 365 classes, each class a specific scene. This feature extractor is therefore biased towards extracting the static information describing the scene in the video to be classified, compensating for the loss of static scene information when the C3D feature extractor extracts features.
Optionally, performing feature extraction on the video to be classified using the VGG feature extractor in S104 to obtain the third feature information specifically includes:
extracting at least one key frame from the video to be classified, and performing feature extraction on the at least one key frame using the VGG feature extractor to obtain the third feature information.
Specifically, key frames are first extracted from the video to be classified. A video often consists of hundreds of frames, and especially when the UAV flies slowly, the content of consecutive frames differs little. Extracting features from every frame would be slow and consume considerable resources, so to extract the latent static information of the video more efficiently, key frames can be chosen at the beginning, middle, and end of the video to be classified to represent its static information.
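The beginning/middle/end key-frame choice is a one-liner. A minimal sketch, with the function name assumed for illustration:

```python
def key_frame_indices(num_frames):
    """Pick the first, middle, and last frames of a video as key frames,
    as suggested for the VGG static-feature branch.  Duplicates collapse
    for very short videos (e.g. a 1-frame video yields a single index)."""
    if num_frames < 1:
        raise ValueError("video must contain at least one frame")
    return sorted({0, num_frames // 2, num_frames - 1})

idx = key_frame_indices(300)
```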
The VGG network feature extractor includes 16 convolutional layers and 16 pooling layers, with one pooling layer following each convolutional layer, plus three fully connected layers and one classification layer for outputting the classification result. The convolution kernels of the convolutional layers are 3 × 3 in size. When features are extracted with the VGG convolutional neural network, the features of the second fully connected layer are output as the result.
Specifically, in S105, the first feature information, second feature information, and third feature information acquired by the different feature extractors are fused to obtain the fusion feature, which can characterize the long-term and short-term dynamic information of the video to be classified as well as the different static information extracted by the different feature extractors.
Specifically, in S106, classification is performed according to the fusion feature acquired in S105 using a conventional support vector machine (SVM) linear classifier, yielding the classification information of the video to be classified.
For example, the SVM classifier parameter C is set to 100 and a linear kernel is used. Before the SVM classifier is used, the classifier model must be trained with the training set data and tested with the test set data.
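A minimal scikit-learn sketch with the parameters mentioned above (linear kernel, C = 100). The toy 2-D features stand in for the much higher-dimensional fused feature vectors; the data and labels are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Toy fused features: two well-separated classes, so a linear boundary exists.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=100)   # linear kernel, C = 100 as in the text
clf.fit(X_train, y_train)
pred = clf.predict(np.array([[0.1, 0.1], [5.1, 5.0]]))
```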
The multi-feature fusion-based dynamic scene classification method provided by the present invention uses three kinds of feature extractors, namely a C3D feature extractor, an iDT feature extractor, and a VGG feature extractor, to extract different features of the video to be classified, and performs dynamic scene classification after fusing the different features. The method considers not only the short-term dynamic information of the video to be classified but also its long-term dynamic information, and further fuses in its static information, making the dynamic scene classification results more accurate.
Optionally, on the basis of the above embodiment, fusing the first, second, and third feature information in S105 to obtain the fusion feature specifically includes:
S1051, obtaining the first feature data corresponding to the first preset dimensions of the first feature information, the second feature data corresponding to the second preset dimensions of the second feature information, and the third feature data corresponding to the third preset dimensions of the third feature information;
S1052, obtaining the fusion feature from the first feature data, the second feature data, and the third feature data.
Specifically, each piece of feature information is a two-dimensional matrix. For example, when the first feature information is a 1 × 4096 matrix, with 1 row and 4096 columns, the first feature information can be considered to contain 4096 dimensions, the feature data of the first dimension being the first column of the first feature information. The data of different dimensions of a piece of feature information affect the classification of the video to be classified differently; selecting, at fusion time, the data corresponding to the dimensions with larger influence on classification improves the accuracy of dynamic scene classification. Illustratively, the preset dimensions corresponding to different feature data differ, and the numbers of preset dimensions may also differ.
Further, on the basis of any of the above embodiments, the determination of the preset dimensions for feature fusion in S1051 is described in detail with reference to a specific embodiment. Fig. 2 is a flow diagram of Embodiment 2 of the multi-feature fusion-based dynamic scene classification method provided by the present invention. As shown in Fig. 2, before the feature fusion, the method further includes:
S201, obtaining the first feature information, second feature information, and third feature information of every training video in a training video library;
S202, obtaining, from the first, second, and third feature information of all training videos, the Fisher discriminant ratio of every dimension of the first, second, and third feature information;
S203, determining the first preset dimensions of the first feature information from the Fisher discriminant ratios of all of its dimensions, the second preset dimensions of the second feature information from the Fisher discriminant ratios of all of its dimensions, and the third preset dimensions of the third feature information from the Fisher discriminant ratios of all of its dimensions;
wherein the training video library includes at least two training videos belonging to different categories.
Specifically, after the Fisher discriminant ratios of all dimensions of a piece of feature information are obtained, the dimensions can, for example, be ranked by Fisher discriminant ratio and the dimensions whose ratio exceeds a preset value chosen as the preset dimensions. Alternatively, the several dimensions with the highest Fisher discriminant ratios can be chosen as the preset dimensions; illustratively, the number of preset dimensions may differ across feature information. To obtain the dimensions of each piece of feature information with the greatest influence on video classification, the training videos can be analysed to determine the dimensions with higher Fisher discriminant ratios.
Specifically, on the basis of any of the above embodiments, the computation of the Fisher discriminant ratio of any dimension of any feature information is described in detail with a specific embodiment. The Fisher discriminant ratio of the i-th dimension of any feature information is obtained by the following formula:

k = S_b / S_i;

where S_i is the within-class variance of the i-th dimension, S_i = Σ_{j=1…J} (x_ij − m_ij)², and S_b is the between-class variance of the i-th dimension, S_b = Σ_{j=1…J} Σ_{h=1…J, h≠j} (m_ij − m_ih)². J is the total number of categories to which the training videos belong, x_ij is the feature data of the i-th dimension of all training videos of the j-th category, m_ij is the mean of the feature data of the i-th dimension of all training videos of the j-th category, and m_ih is the mean of the feature data of the i-th dimension of all training videos of the h-th category. The value of i is a positive integer from 1 to I, where I is the total number of dimensions of the feature information to which the i-th dimension belongs; the value of j is a positive integer from 1 to J; and the value of h is a positive integer from 1 to J other than j.
A smaller S_i indicates that the dimension is more similar within videos of the same class, and a larger S_b indicates that the dimension is less similar to the videos of other classes, so the larger the value of k, the better, and the more it helps video classification. A larger value of k indicates that the i-th dimension of the feature information has a greater influence on dynamic scene classification. After the k values of all dimensions are obtained, the dimensions with larger k values in each piece of feature information can be combined to obtain the fusion feature.
For example, suppose the training video library contains 9 training videos belonging to 3 categories, with 3 training videos per category. For the first feature information (the C3D feature), the Fisher discriminant ratio of the 1st dimension of this feature information is calculated as follows.

First, each video is divided into 5 short videos of 16 frames each, and the C3D feature extractor extracts a feature from each short video, taking the output of the fc7 layer. This yields five 1×4096-dimensional feature matrices; after averaging them, each video is represented by a single 1×4096-dimensional feature matrix. Thus, after feature extraction by the C3D feature extractor, the 9 training videos are represented by one 9×4096 feature matrix.

To calculate the Fisher discriminant ratio of the 1st of these 4096 dimensions, first take the first column of the 9×4096 feature matrix, which gives a 9×1 matrix. The top 3×1 block contains the features extracted from the videos of the first category, the middle 3×1 block the features of the second category, and the bottom 3×1 block the features of the third category. For example, suppose this 9×1 matrix is $[1, 2, 1, 2, 3, 2, 3, 1, 3]^T$.

The mean matrix of [1, 2, 1] is [1.3, 1.3, 1.3]; the mean matrix of [2, 3, 2] is [2.3, 2.3, 2.3]; and the mean matrix of [3, 1, 3] is [2.3, 2.3, 2.3] (means rounded to one decimal place).

Then the within-class scatter is calculated:

$S_i = (1{-}1.3)^2 + (2{-}1.3)^2 + (1{-}1.3)^2 + (2{-}2.3)^2 + (3{-}2.3)^2 + (2{-}2.3)^2 + (3{-}2.3)^2 + (1{-}2.3)^2 + (3{-}2.3)^2 = 4.01$

and the between-class scatter is calculated:

$S_b = \lVert[1,2,1]{-}[2.3,2.3,2.3]\rVert^2 + \lVert[1,2,1]{-}[2.3,2.3,2.3]\rVert^2 + \lVert[2,3,2]{-}[1.3,1.3,1.3]\rVert^2 + \lVert[2,3,2]{-}[2.3,2.3,2.3]\rVert^2 + \lVert[3,1,3]{-}[1.3,1.3,1.3]\rVert^2 + \lVert[3,1,3]{-}[2.3,2.3,2.3]\rVert^2 = 3.47 + 3.47 + 3.87 + 0.67 + 5.87 + 2.67 = 20.02$

Finally, the Fisher discriminant ratio of the 1st dimension of the first feature information is obtained from $S_i$ and $S_b$:

$k = S_b / S_i = 20.02 / 4.01 = 4.99$
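The worked example above can be checked with a short script. Note that the example rounds the class means to one decimal place before computing the scatters, which yields k = 20.02 / 4.01 ≈ 4.99; with exact means the same data gives exactly k = 5.0. The function below (its name is ours, not from the patent) uses exact means:

```python
def fisher_ratio(column, labels):
    """Fisher discriminant ratio of one feature dimension.

    column: feature values of this dimension, one per training video.
    labels: category label of each training video.
    Computes k = S_b / S_i, where S_i sums squared deviations of each class's
    samples from that class's own mean, and S_b sums squared deviations of
    each class's samples from the means of all *other* classes.
    """
    classes = sorted(set(labels))
    groups = {c: [x for x, l in zip(column, labels) if l == c] for c in classes}
    means = {c: sum(g) / len(g) for c, g in groups.items()}
    s_i = sum((x - means[c]) ** 2 for c in classes for x in groups[c])
    s_b = sum((x - means[h]) ** 2
              for c in classes for x in groups[c]
              for h in classes if h != c)
    return s_b / s_i

# the 9x1 example column from the text: 3 categories, 3 videos per category
column = [1, 2, 1, 2, 3, 2, 3, 1, 3]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
print(fisher_ratio(column, labels))  # 5.0 (the text gets 4.99 from rounded means)
```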
On the other hand, an embodiment of the present invention provides a multi-feature-fusion dynamic scene classification apparatus for performing the method embodiments described above, with the same technical features and technical effects, which are not repeated here.

Fig. 3 is a schematic structural diagram of embodiment one of the multi-feature-fusion dynamic scene classification apparatus provided by the present invention. As shown in Fig. 3, the apparatus includes:

a to-be-classified video acquisition module 301, configured to obtain a to-be-classified video;
a feature extraction module 302, configured to perform feature extraction on the to-be-classified video by using a three-dimensional convolutional neural network (C3D) feature extractor to obtain first feature information, perform feature extraction on the to-be-classified video by using an improved dense trajectory feature extractor to obtain second feature information, and perform feature extraction on the to-be-classified video by using a visual geometry neural network (VGG) feature extractor to obtain third feature information;
a fusion module 303, configured to fuse the first feature information, the second feature information and the third feature information to obtain a fusion feature; and
a classification module 304, configured to classify the to-be-classified video according to the fusion feature to obtain a classification result of the to-be-classified video.
Further, the fusion module 303 is specifically configured to: obtain first feature data corresponding to the first preset dimensions in the first feature information, obtain second feature data corresponding to the second preset dimensions in the second feature information, and obtain third feature data corresponding to the third preset dimensions in the third feature information; and obtain the fusion feature according to the first feature data, the second feature data and the third feature data.
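A fusion of this kind can be sketched as selecting the preset dimensions from each feature vector and concatenating the results. The dimension indices and vector sizes below are toy values, and plain concatenation is our assumption for how the three groups of feature data are combined into one fusion feature:

```python
import numpy as np

def fuse(first, second, third, dims1, dims2, dims3):
    """Build a fusion feature from three feature vectors by keeping only the
    preset dimensions of each and concatenating the selected values."""
    return np.concatenate([np.asarray(first)[dims1],
                           np.asarray(second)[dims2],
                           np.asarray(third)[dims3]])

# toy vectors standing in for the C3D / improved dense trajectory / VGG features
c3d = np.arange(10.0)         # pretend first feature information
idt = np.arange(10.0, 18.0)   # pretend second feature information
vgg = np.arange(18.0, 24.0)   # pretend third feature information

fused = fuse(c3d, idt, vgg, dims1=[0, 3], dims2=[1], dims3=[2, 4, 5])
print(fused)        # [ 0.  3. 11. 20. 22. 23.]
print(fused.shape)  # (6,)
```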
Further, the apparatus also includes a preset dimension acquisition module, which is specifically configured to:
obtain the first feature information, second feature information and third feature information of each of the training videos in a training video library;
obtain, according to the first feature information, second feature information and third feature information of all of the training videos, the Fisher discriminant ratio of every dimension of the first feature information, the second feature information and the third feature information; and
determine the first preset dimensions of the first feature information according to the Fisher discriminant ratios of all dimensions of the first feature information, determine the second preset dimensions of the second feature information according to the Fisher discriminant ratios of all dimensions of the second feature information, and determine the third preset dimensions of the third feature information according to the Fisher discriminant ratios of all dimensions of the third feature information;
wherein the training video library includes at least two training videos belonging to different categories.
Further, the Fisher discriminant ratio of the i-th dimension of any piece of feature information is obtained as follows:

$k = S_b / S_i$

where $S_i$ is the within-class scatter of the i-th dimension, $S_i = \sum_{j=1}^{J} \lVert x_{ij} - m_{ij} \rVert^2$; $S_b$ is the between-class scatter of the i-th dimension, $S_b = \sum_{j=1}^{J} \sum_{h=1,\, h \neq j}^{J} \lVert x_{ij} - m_{ih} \rVert^2$; $J$ is the total number of categories to which the training videos belong; $x_{ij}$ is the feature data matrix of the i-th dimension of all training videos of the j-th category; $m_{ij}$ is the mean matrix of the feature data matrix of the i-th dimension of all training videos of the j-th category; and $m_{ih}$ is the mean matrix of the feature data matrix of the i-th dimension of all training videos of the h-th category. The value of $i$ is a positive integer from 1 to $I$, where $I$ is the total number of dimensions of the feature information to which the i-th dimension belongs; the value of $j$ is a positive integer from 1 to $J$; and the value of $h$ is a positive integer from 1 to $J$ other than $j$.
Further, the feature extraction module 302 is specifically configured to:
divide the to-be-classified video to obtain at least one video segment containing N frames of images; and
perform feature extraction on all of the video segments by using the three-dimensional convolutional neural network feature extractor to obtain the first feature information;
wherein N is a positive integer greater than 1.
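The division into segments of N frames (N = 16 in the worked example) can be sketched as below. Discarding a trailing segment shorter than N is our assumption, since the C3D extractor expects fixed-length clips:

```python
def split_into_segments(frames, n=16):
    """Divide a video (a list of frames) into segments of exactly n frames,
    discarding any incomplete trailing segment."""
    if n <= 1:
        raise ValueError("N must be a positive integer greater than 1")
    return [frames[i:i + n] for i in range(0, len(frames) - n + 1, n)]

# a video of 80 frames yields the 5 segments of 16 frames used in the example
frames = list(range(80))
segments = split_into_segments(frames, n=16)
print(len(segments))     # 5
print(len(segments[0]))  # 16
```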
Further, the feature extraction module 302 is specifically configured to:
obtain the dense trajectory features and a homography matrix of the to-be-classified video; and
correct the dense trajectory features by using the homography matrix to obtain the second feature information.
Further, the feature extraction module 302 is specifically configured to:
extract at least one key frame from the to-be-classified video, and perform feature extraction on the at least one key frame by using the visual geometry neural network feature extractor to obtain the third feature information.
Further, the classification module 304 is specifically configured to classify the to-be-classified video by using a support vector machine classifier according to the fusion feature, to obtain the classification result of the to-be-classified video.
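The final classification step can be sketched with scikit-learn's `SVC`, a standard support vector machine implementation, used here purely as an illustration; the patent does not specify an implementation or kernel, and the fused feature vectors and scene labels below are toy data:

```python
import numpy as np
from sklearn.svm import SVC

# toy fusion features for two scene categories (e.g. label 0 vs label 1)
train_features = np.array([[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]])
train_labels = np.array([0, 0, 1, 1])

# train a support vector machine classifier on the fused training features
classifier = SVC(kernel="linear")
classifier.fit(train_features, train_labels)

# classify a to-be-classified video by its fusion feature
test_feature = np.array([[0.95, 0.95]])
print(classifier.predict(test_feature))  # [1]
```

In practice the training features would be the fusion features of the training video library, and the predicted label is the classification result of the to-be-classified video.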
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be completed by program instructions on related hardware. The aforementioned program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a ROM, a RAM, a magnetic disk or an optical disc.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements of some or all of the technical features therein; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A multi-feature-fusion dynamic scene classification method, characterized by comprising:
obtaining a to-be-classified video;
performing feature extraction on the to-be-classified video by using a three-dimensional convolutional neural network feature extractor to obtain first feature information; performing feature extraction on the to-be-classified video by using an improved dense trajectory feature extractor to obtain second feature information; and performing feature extraction on the to-be-classified video by using a visual geometry neural network feature extractor to obtain third feature information;
fusing the first feature information, the second feature information and the third feature information to obtain a fusion feature; and
classifying the to-be-classified video according to the fusion feature to obtain a classification result of the to-be-classified video.
2. The method according to claim 1, characterized in that fusing the first feature information, the second feature information and the third feature information to obtain the fusion feature comprises:
obtaining first feature data corresponding to first preset dimensions in the first feature information, obtaining second feature data corresponding to second preset dimensions in the second feature information, and obtaining third feature data corresponding to third preset dimensions in the third feature information; and
obtaining the fusion feature according to the first feature data, the second feature data and the third feature data.
3. The method according to claim 2, characterized in that before fusing the first feature information, the second feature information and the third feature information to obtain the fusion feature, the method further comprises:
obtaining the first feature information, second feature information and third feature information of each of the training videos in a training video library;
obtaining, according to the first feature information, second feature information and third feature information of all of the training videos, the Fisher discriminant ratio of every dimension of the first feature information, the second feature information and the third feature information; and
determining the first preset dimensions of the first feature information according to the Fisher discriminant ratios of all dimensions of the first feature information, determining the second preset dimensions of the second feature information according to the Fisher discriminant ratios of all dimensions of the second feature information, and determining the third preset dimensions of the third feature information according to the Fisher discriminant ratios of all dimensions of the third feature information;
wherein the training video library includes at least two training videos belonging to different categories.
4. The method according to claim 3, characterized in that the Fisher discriminant ratio of the i-th dimension of any piece of feature information is obtained as follows:

$k = S_b / S_i$

wherein $S_i$ is the within-class scatter of the i-th dimension, $S_i = \sum_{j=1}^{J} \lVert x_{ij} - m_{ij} \rVert^2$; $S_b$ is the between-class scatter of the i-th dimension, $S_b = \sum_{j=1}^{J} \sum_{h=1,\, h \neq j}^{J} \lVert x_{ij} - m_{ih} \rVert^2$; $J$ is the total number of categories to which all of the training videos belong; $x_{ij}$ is the feature data matrix of the i-th dimension of all training videos of the j-th category; $m_{ij}$ is the mean matrix of the feature data matrix of the i-th dimension of all training videos of the j-th category; $m_{ih}$ is the mean matrix of the feature data matrix of the i-th dimension of all training videos of the h-th category; the value of $i$ is a positive integer from 1 to $I$, where $I$ is the total number of dimensions of the feature information to which the i-th dimension belongs; the value of $j$ is a positive integer from 1 to $J$; and the value of $h$ is a positive integer from 1 to $J$ other than $j$.
5. The method according to claim 1, characterized in that performing feature extraction on the to-be-classified video by using the three-dimensional convolutional neural network feature extractor to obtain the first feature information comprises:
dividing the to-be-classified video to obtain at least one video segment containing N frames of images; and
performing feature extraction on all of the video segments by using the three-dimensional convolutional neural network feature extractor to obtain the first feature information;
wherein N is a positive integer greater than 1.
6. The method according to claim 1, characterized in that performing feature extraction on the to-be-classified video by using the improved dense trajectory feature extractor to obtain the second feature information comprises:
obtaining the dense trajectory features and a homography matrix of the to-be-classified video; and
correcting the dense trajectory features by using the homography matrix to obtain the second feature information.
7. The method according to claim 1, characterized in that performing feature extraction on the to-be-classified video by using the visual geometry neural network feature extractor to obtain the third feature information comprises:
extracting at least one key frame from the to-be-classified video, and performing feature extraction on the at least one key frame by using the visual geometry neural network feature extractor to obtain the third feature information.
8. The method according to any one of claims 1 to 7, characterized in that classifying the to-be-classified video according to the fusion feature to obtain the classification result of the to-be-classified video comprises:
classifying the to-be-classified video by using a support vector machine classifier according to the fusion feature, to obtain the classification result of the to-be-classified video.
9. A multi-feature-fusion dynamic scene classification apparatus, characterized by comprising:
a to-be-classified video acquisition module, configured to obtain a to-be-classified video;
a feature extraction module, configured to perform feature extraction on the to-be-classified video by using a three-dimensional convolutional neural network feature extractor to obtain first feature information, perform feature extraction on the to-be-classified video by using an improved dense trajectory feature extractor to obtain second feature information, and perform feature extraction on the to-be-classified video by using a visual geometry neural network feature extractor to obtain third feature information;
a fusion module, configured to fuse the first feature information, the second feature information and the third feature information to obtain a fusion feature; and
a classification module, configured to classify the to-be-classified video according to the fusion feature to obtain a classification result of the to-be-classified video.
10. The apparatus according to claim 9, characterized in that the fusion module is specifically configured to:
obtain first feature data corresponding to first preset dimensions in the first feature information, obtain second feature data corresponding to second preset dimensions in the second feature information, and obtain third feature data corresponding to third preset dimensions in the third feature information; and
obtain the fusion feature according to the first feature data, the second feature data and the third feature data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611073666.5A CN106599907B (en) | 2016-11-29 | 2016-11-29 | The dynamic scene classification method and device of multiple features fusion |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599907A true CN106599907A (en) | 2017-04-26 |
CN106599907B CN106599907B (en) | 2019-11-29 |
Family
ID=58594055
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611073666.5A Active CN106599907B (en) | 2016-11-29 | 2016-11-29 | The dynamic scene classification method and device of multiple features fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599907B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102682302A (en) * | 2012-03-12 | 2012-09-19 | 浙江工业大学 | Human body posture identification method based on multi-characteristic fusion of key frame |
CN102902981A (en) * | 2012-09-13 | 2013-01-30 | 中国科学院自动化研究所 | Violent video detection method based on slow characteristic analysis |
CN103077318A (en) * | 2013-01-17 | 2013-05-01 | 电子科技大学 | Classifying method based on sparse measurement |
CN103366181A (en) * | 2013-06-28 | 2013-10-23 | 安科智慧城市技术(中国)有限公司 | Method and device for identifying scene integrated by multi-feature vision codebook |
CN104881655A (en) * | 2015-06-03 | 2015-09-02 | 东南大学 | Human behavior recognition method based on multi-feature time-space relationship fusion |
CN105956572A (en) * | 2016-05-15 | 2016-09-21 | 北京工业大学 | In vivo face detection method based on convolutional neural network |
Cited By (43)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107393554A (en) * | 2017-06-20 | 2017-11-24 | 武汉大学 | In a kind of sound scene classification merge class between standard deviation feature extracting method |
CN107689035B (en) * | 2017-08-30 | 2021-12-21 | 广州方硅信息技术有限公司 | Homography matrix determination method and device based on convolutional neural network |
CN107689035A (en) * | 2017-08-30 | 2018-02-13 | 广州华多网络科技有限公司 | A kind of homography matrix based on convolutional neural networks determines method and device |
CN107909095A (en) * | 2017-11-07 | 2018-04-13 | 江苏大学 | A kind of image-recognizing method based on deep learning |
CN107909070A (en) * | 2017-11-24 | 2018-04-13 | 天津英田视讯科技有限公司 | A kind of method of road water detection |
US10909380B2 (en) | 2017-12-13 | 2021-02-02 | Beijing Sensetime Technology Development Co., Ltd | Methods and apparatuses for recognizing video and training, electronic device and medium |
CN108229336A (en) * | 2017-12-13 | 2018-06-29 | 北京市商汤科技开发有限公司 | Video identification and training method and device, electronic equipment, program and medium |
CN108229336B (en) * | 2017-12-13 | 2021-06-04 | 北京市商汤科技开发有限公司 | Video recognition and training method and apparatus, electronic device, program, and medium |
CN108090203A (en) * | 2017-12-25 | 2018-05-29 | 上海七牛信息技术有限公司 | Video classification methods, device, storage medium and electronic equipment |
CN108090497A (en) * | 2017-12-28 | 2018-05-29 | 广东欧珀移动通信有限公司 | Video classification methods, device, storage medium and electronic equipment |
CN108090497B (en) * | 2017-12-28 | 2020-07-07 | Oppo广东移动通信有限公司 | Video classification method and device, storage medium and electronic equipment |
CN108491856A (en) * | 2018-02-08 | 2018-09-04 | 西安电子科技大学 | A kind of image scene classification method based on Analysis On Multi-scale Features convolutional neural networks |
US11393206B2 (en) * | 2018-03-13 | 2022-07-19 | Tencent Technology (Shenzhen) Company Limited | Image recognition method and apparatus, terminal, and storage medium |
CN110569795B (en) * | 2018-03-13 | 2022-10-14 | 腾讯科技(深圳)有限公司 | Image identification method and device and related equipment |
CN110569795A (en) * | 2018-03-13 | 2019-12-13 | 腾讯科技(深圳)有限公司 | Image identification method and device and related equipment |
WO2019174439A1 (en) * | 2018-03-13 | 2019-09-19 | 腾讯科技(深圳)有限公司 | Image recognition method and apparatus, and terminal and storage medium |
CN108647599A (en) * | 2018-04-27 | 2018-10-12 | 南京航空航天大学 | In conjunction with the Human bodys' response method of 3D spring layers connection and Recognition with Recurrent Neural Network |
CN108510012A (en) * | 2018-05-04 | 2018-09-07 | 四川大学 | A kind of target rapid detection method based on Analysis On Multi-scale Features figure |
CN108510012B (en) * | 2018-05-04 | 2022-04-01 | 四川大学 | Target rapid detection method based on multi-scale feature map |
CN109002766B (en) * | 2018-06-22 | 2021-07-09 | 北京邮电大学 | Expression recognition method and device |
CN109002766A (en) * | 2018-06-22 | 2018-12-14 | 北京邮电大学 | A kind of expression recognition method and device |
CN109115501A (en) * | 2018-07-12 | 2019-01-01 | 哈尔滨工业大学(威海) | A kind of Civil Aviation Engine Gas path fault diagnosis method based on CNN and SVM |
CN109165682A (en) * | 2018-08-10 | 2019-01-08 | 中国地质大学(武汉) | A kind of remote sensing images scene classification method merging depth characteristic and significant characteristics |
CN109165682B (en) * | 2018-08-10 | 2020-06-16 | 中国地质大学(武汉) | Remote sensing image scene classification method integrating depth features and saliency features |
CN109145840A (en) * | 2018-08-29 | 2019-01-04 | 北京字节跳动网络技术有限公司 | video scene classification method, device, equipment and storage medium |
CN109145840B (en) * | 2018-08-29 | 2022-06-24 | 北京字节跳动网络技术有限公司 | Video scene classification method, device, equipment and storage medium |
CN109697453A (en) * | 2018-09-30 | 2019-04-30 | 中科劲点(北京)科技有限公司 | Semi-supervised scene classification recognition methods, system and device based on multimodality fusion |
CN109257622A (en) * | 2018-11-01 | 2019-01-22 | 广州市百果园信息技术有限公司 | A kind of audio/video processing method, device, equipment and medium |
CN109376696B (en) * | 2018-11-28 | 2020-10-23 | 北京达佳互联信息技术有限公司 | Video motion classification method and device, computer equipment and storage medium |
CN109376696A (en) * | 2018-11-28 | 2019-02-22 | 北京达佳互联信息技术有限公司 | Method, apparatus, computer equipment and the storage medium of video actions classification |
CN110033505A (en) * | 2019-04-16 | 2019-07-19 | 西安电子科技大学 | A kind of human action capture based on deep learning and virtual animation producing method |
CN110220585A (en) * | 2019-06-20 | 2019-09-10 | 广东工业大学 | A kind of bridge vibration test method and relevant apparatus |
WO2021031523A1 (en) * | 2019-08-21 | 2021-02-25 | 创新先进技术有限公司 | Document recognition method and device |
CN110516737A (en) * | 2019-08-26 | 2019-11-29 | 南京人工智能高等研究院有限公司 | Method and apparatus for generating image recognition model |
CN110516737B (en) * | 2019-08-26 | 2023-05-26 | 南京人工智能高等研究院有限公司 | Method and device for generating image recognition model |
WO2021093468A1 (en) * | 2019-11-15 | 2021-05-20 | 腾讯科技(深圳)有限公司 | Video classification method and apparatus, model training method and apparatus, device and storage medium |
US11967151B2 (en) | 2019-11-15 | 2024-04-23 | Tencent Technology (Shenzhen) Company Limited | Video classification method and apparatus, model training method and apparatus, device, and storage medium |
CN111145222A (en) * | 2019-12-30 | 2020-05-12 | 浙江中创天成科技有限公司 | Fire detection method combining smoke movement trend and textural features |
WO2021248432A1 (en) * | 2020-06-12 | 2021-12-16 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for performing motion transfer using a learning model |
US20210390713A1 (en) * | 2020-06-12 | 2021-12-16 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for performing motion transfer using a learning model |
US11830204B2 (en) * | 2020-06-12 | 2023-11-28 | Beijing Didi Infinity Technology And Development Co., Ltd. | Systems and methods for performing motion transfer using a learning model |
CN111563488A (en) * | 2020-07-14 | 2020-08-21 | 成都市映潮科技股份有限公司 | Video subject content identification method, system and storage medium |
CN112687022A (en) * | 2020-12-18 | 2021-04-20 | 山东盛帆蓝海电气有限公司 | Intelligent building inspection method and system based on video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||