CN104244113A

CN104244113A - Method for generating video abstract on basis of deep learning technology

Info

Publication number: CN104244113A
Application number: CN201410525704.0A
Authority: CN
Inventors: 袁飞; 唐矗
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Beijing Casd Technology Co ltd
Priority date: 2014-10-08
Filing date: 2014-10-08
Publication date: 2014-12-24
Anticipated expiration: 2034-10-08
Also published as: CN104244113B

Abstract

The invention discloses a method for generating video abstract on the basis of a deep learning technology. The method includes modeling backgrounds of video stream frame by frame and acquiring moving foregrounds to be used as candidate moving objects; tracking the candidate moving objects of each frame by the aid of a multi-object tracking algorithm and updating candidate objects which form movement tracks; training object classifiers by the aid of convolutional neural networks, confirming the candidate objects and determining categories of the objects by the aid of the classifiers after real moving objects are confirmed; fitting all the real moving objects and relevant information on a small quantity of images, forming video snapshots and displaying the video snapshots to users. The method has the advantages that the real objects and noise can be accurately differentiated from one another by the aid of the deep learning technology; the objects do not need to be confirmed frame by frame owing to an accurate multi-object tracking technology, accordingly, the computational complexity can be greatly reduced, an omission factor of the objects and a false alarm rate of the noise can be effectively reduced, the video processing speeds can be increased, and the method can be applied to various complicated scenes.

Description

A kind of video abstraction generating method based on degree of deep learning art

Technical field

The present invention relates to technical field of image processing, more specifically, relate to a kind of video abstraction generating method based on degree of deep learning art.

Background technology

In modern society, video monitoring system all plays important role in all trades and professions, is maintaining public order, and reinforcement social management and safety guarantee aspect play an important role; But along with the growth at full speed of camera number, the event that records in these videos of the storage of the monitor video data of magnanimity and understanding can the human and material resources of at substantial.According to ReportLinker corporate statistics, in 2011, the whole world has more than 1.65 hundred million CCTV cameras, produce the monitor data of 1.4 trillion hours, if there are the important monitor video data of 20% to need artificial viewing, then need to employ the labour more than 100,000,000 (every day works 8 hours, annual work 300 days).Therefore, a large amount of videos is concentrated, help user to understand event in video fast, lock searching object rapidly, effectively can improve the utilization ratio of magnanimity monitor video.

In image processing field, in order to improve the browse efficiency of video, can video summarization technique be adopted, by interested for user in video contents extraction out, then they being rearranged in a compact fashion, with the form of video snap-shot by the content displaying of video out.In order to the interested content of user in video can be extracted automatically, the simplest method extracts the key frame in original video, form video frequency abstract (such as list of references: the triumphant nurse of Chadwick etc., " a kind of video frequency abstract overall plan of based target ", " the 8th ACM's multimedia international conference transactions ", 2000, 303-311 page (Kim, C., Hwang, J.N.:An integrated scheme for object-based video abstraction.In:Proceedings of the eighth ACM international conference on Multimedia. (2000) 303-311)), but the description that key frame cannot be complete whole section of video, the loss of important information in video can be caused, and due to video content of a great variety, suitable key frame how is selected to be a difficult problem.Another kind method first analyzes video content, extract the relevant information of moving target in original video, then the movable information extracted is arranged compactly, generating video summary (such as list of references: Ya Aierpuruiqi etc., " non-sequential video frequency abstract and index ", " IEEE pattern analysis and machine intelligence transactions ", 2008, 1971-1984 page (Pritch, Y., Rav-Acha, A., Peleg, S.:Nonchronological video synopsis and indexing.IEEE Trans.Pattern Anal.Mach.Intell.30 (2008) 1971-1984)), this method can retain the dynamic content of video preferably.For this method, the key of problem is to extract the interested all events of user how exactly.

For monitor video, the photographed scene of monitor video is very complicated: the scene vehicle had is many, and movement velocity is fast, as highway; In some scenes, moving target shared elemental area on picture is very little; In some scenes, the uninterested object such as trees, flag produces motion etc. equally due to wind; The complexity of scene is that the accurate detection of moving target brings very large challenge.Current video summarization technique can not solve the test problems of moving target in complex scene well, usually make the loss of moving target very high, simultaneously larger by noise jamming, accurately cannot extract the critical event in video, thus cause the video frequency abstract of generation to miss important information in original video.

Summary of the invention

In view of this, the object of the invention is to propose a kind of video abstraction generating method based on degree of deep learning art, to facilitate user to carry out fast browsing to long monitor video, reduce loss and the fallout ratio of moving target in complex scene.

To achieve these goals, the invention provides a kind of video abstraction generating method based on degree of deep learning art, comprise the following steps:

Step 1, carries out background modeling to the image sequence of the original video of input, extracts the foreground area that moving target is corresponding;

Step 2, using the described foreground area of acquisition as motion candidates target, uses Multitarget Tracking to follow the tracks of described motion candidates target, calculates the movement locus of motion candidates target described in each frame;

Step 3, to determining that described movement locus is that the motion candidates target enlivening track uses the object classifiers based on degree of deep learning art to confirm further, judge whether described motion candidates target is real target, and after confirmation target, re-use the classification that grader judges described motion candidates target;

Step 4, is fitted in same piece image by multiple described moving target detected, generating video snapshot, shows the described moving target detected in video with described video snap-shot.

Wherein, in step 1 the step of background modeling is carried out to the image sequence of original video of input before also comprise the step image sequence of the described original video of input being zoomed to formed objects.

Wherein, described in step 1, extract in the step of foreground area corresponding to moving target the step also comprising and the described sport foreground obtained is carried out to reprocessing, specifically comprise:

Step 11, with morphological structuring elements, carries out morphology opening operation and closing operation of mathematical morphology to described foreground area, obtains the foreground area of contour smoothing, and eliminates the less noise block of area;

Step 12, carries out areal calculation to described foreground area, if described foreground area pixel number is less than T ₁when=5, then foreground area described in filtering, otherwise, retain described foreground area, determine that described foreground area is candidate target.

Wherein, Multitarget Tracking described in step 2 builds based on Hungary Algorithm, specifically comprises:

Step 21, calculates the color histogram feature of described motion candidates target of present frame, and the similitude of motion candidates target in described color histogram feature and previous frame;

Step 22, utilizes the described motion candidates target in Kalman prediction previous frame in the positional information of present frame, calculates the Euclidean distance between motion candidates target location described in the predicted position of described motion candidates target and present frame;

Step 23, according to above-mentioned result of calculation, use Hungary Algorithm, the described motion candidates target in present frame is mated with the track of the target of motion candidates described in previous frame, obtain matching result, and upgrade the track of described motion candidates target according to described matching result.

Wherein, object classifiers described in step 3 uses the convolutional neural networks in degree of deep learning art to carry out off-line training in advance to obtain, for judging whether described motion candidates target is real goal, and the type of described motion candidates target.

Wherein, the step of carrying out off-line training to described object classifiers comprises and adopts sample sets to carry out object classifiers described in off-line training, and the image background regions that the sample set of training described object classifiers to use comprises the five type games targets that occur in monitor video or image corresponding to object and removes outside this five class: 1. pedestrian; 2. bicycle; 3. the compact car such as car; 4. the large car such as truck; 5. the non-interesting target such as trees, flag still understands the local of moving object; 6. the image-region except above-mentioned five type games objects in monitoring scene; The object classifiers of one six classification is obtained, for confirming whether described motion candidates target is real goal by above-mentioned training.

Wherein, by in above-mentioned sample 1., 2. class sample and 3., 4. sample be combined into two large classifications respectively: people, motor vehicle, use two classification graders of this two classes sample training people/motor vehicle, after being interesting target in the described motion candidates target of confirmation, the classification of described motion candidates target is judged.

Wherein, step 3 specifically comprises the following steps:

Step 31, for the described motion candidates target not forming track, use described six classification graders to classify, only have when described motion candidates target be judged as above-mentioned the 5. or the 6. class time just think that this candidate target is noise, otherwise think real goal; And for forming the described motion candidates target of track, select the image comprising described motion candidates target that three positions in its track are corresponding, described six classification graders are used to classify respectively, judge whether described motion candidates target is real goal, 5. or the 6. class if in these three positions, described motion candidates target is all judged as, just thinks that described motion candidates target is noise, delete this track, otherwise think that described motion candidates target is real motion target;

Step 32, for the situation being judged to be real motion target, if in three subseries when described motion candidates goal verification, to described motion candidates target be people or motor vehicle it is determined that the presence of difference, then to it, type decision is carried out to the grader of described motion candidates target end user/motor vehicle.

Wherein, in step 4, in described target trajectory after validation, select the position that area is maximum, make to be fitted in by image corresponding for described position on a width snapshot, multiple described motion candidates target is sticked to form a snapshot, uses described snapshot to show the moving target occurred in video.

Wherein, in step 4, in the described snapshot of generation, all described motion candidates targets do not have overlap, and the sequencing that described motion candidates target occurs on described snapshot is arrangement actual time occurred according to described motion candidates target on the whole.

Known based on technique scheme, the present invention is directed to the monitor video under complex scene, by the Video content analysis technique of novelty, extract the Candidate Motion target in original video, and pass through multiple target tracking, candidate target is tentatively distinguished, for the Candidate Motion target not forming track and formation track, undertaken confirming and classifying by degree of deep learning method, and be shown to user compactly in the form of images, the picture that user records each moving target event by viewing just can reach the object of watching original video, shorten user widely and watch time spent by video.Method of the present invention fully takes into account the complexity of scene, the technical scheme adopted can ensure the reliability of result of calculation, the loss of moving target event and the interference of noise are controlled in extremely low level, thus make the present invention can be widely used in the actual combat of many departments, such as public security investigation etc.

Accompanying drawing explanation

Fig. 1 is the flow chart of the video abstraction generating method based on degree of deep learning art of the present invention;

Fig. 2 is the flow chart based on multi-object tracking method in the video abstraction generating method of degree of deep learning art of the present invention;

Fig. 3 is the flow chart confirmed based on candidate target in the video abstraction generating method of degree of deep learning art of the present invention.

Embodiment

For making the object, technical solutions and advantages of the present invention clearly understand, below in conjunction with specific embodiment, and with reference to accompanying drawing, the present invention is described in further detail.

The present invention proposes a kind of video abstraction generating method based on degree of deep learning art, the method comprises the following steps:

First, background modeling is carried out to the image sequence of original video, obtain sport foreground block, and prospect reprocessing etc. is carried out to it; Secondly, the moving region extracted as Candidate Motion target, utilize the Multitarget Tracking based on Hungary Algorithm to follow the tracks of these Candidate Motion targets, candidate target is divided into and forms track and do not form track two class; Again, convolutional neural networks grader is used further to confirm for Candidate Motion target and classify; Finally, be fitted in by the moving target of multiple confirmation on same piece image, the image after laminating is called " video snap-shot " by the present invention.It is worthy of note, the inventive method is first followed the tracks of the moving region of extracting as potential moving target, Candidate Motion target is tentatively distinguished, and the convolutional neural networks (CNN) in using the degree of depth to learn further confirms and type decision candidate target, thus significantly reduce probability noise being mistaken for moving target, and ensure that the verification and measurement ratio of moving target, and three subseries judgements are only carried out for the Candidate Motion target forming track, decreases amount of calculation.The original video that can carry out video frequency abstract process includes but not limited to: the video file that the live video stream that video monitoring system gathers, video monitoring system store, conventional multimedia video frequency file, TV programme, film etc.

In order to understand technical scheme of the present invention better, below in conjunction with accompanying drawing, embodiments of the present invention are further described.

The frame diagram of the video abstraction generating method based on degree of deep learning art of the present invention as shown in Figure 1, the present invention proposes a kind of video abstraction generating method based on degree of deep learning art, may be used for complex scene and carry out reliably working, its concrete implementation step is as follows:

Step S101, gathers the video data of video frequency abstract to be generated;

Step S102, stores the original video gathered, and forms original video data storehouse; Original video can be the video of monitoring camera Real-time Collection, also can be the playback video of monitoring video;

Step S103, to the original video of different resolution, zooms to formed objects by each frame of video, carries out background modeling, extracts the foreground area of motion, and carries out reprocessing, alternatively moving target;

The original video frame of different resolution is carried out unifying convergent-divergent, instead of directly high-resolution original image is processed, effectively can improve the arithmetic speed that background modeling extracts moving region.In an embodiment of the present invention, background modeling can adopt multiple related algorithm, and the present embodiment does not enumerate.The object of background modeling is to the background in frame of video and moving target be distinguished.Background in scene refers to that in video, the long period remains unchanged or has the region of minor variations, and corresponding, the prospect in scene refers to the region of significant change.Such as in one section of monitor video, the automobile travelled in scene and the pedestrian walked, only exist in video scene at short notice, so be considered to sport foreground, and the trees of road, traffic lights and both sides, road, exist in video scene for a long time, can by as movement background.By carrying out background modeling to original video, re-using present frame and mating with background model, distinguish sport foreground and background.

But, sport foreground for video under complicated monitoring scene is extracted often exists some noise spots, such as trees etc. belong to the part of background, due to wind disturbance, and be mistaken for prospect, in order to effectively reduce noise spot, in the preferred embodiments of the present invention, two background models are used to same section of video, two background models differ 300 frames and upgrade respectively, but when extracting sport foreground, present frame is used to contrast with these two background models respectively, obtain two width prospect binary map, indicate the moving region on present frame respectively, AND-operation is carried out to this two width prospect binary map, the binary map obtained is as prospect binary map corresponding to present frame, in addition, carry out prospect reprocessing to obtained sport foreground, prospect reprocessing adopts morphology to calculate, and specifically comprises:

First, use morphological structuring elements, morphology opening operation and closing operation of mathematical morphology are carried out to foreground target, the prospect of contour smoothing can be obtained, and eliminate the less noise spot of area, reduce the noise spot that area is larger;

Then, areal calculation is carried out to foreground target, if pixel number is less than threshold value T in the area of foreground target ₁when=5, then think that this foreground target belongs to noise, answer filtering, otherwise, then retain this foreground target.By above method, eliminate the noise jamming in sport foreground, and the edge of prospect can be made to become level and smooth.

Step S104, assigns the sport foreground that each frame in step S103 extracts as the moving target of candidate, utilizes and follow the tracks of these Candidate Motion targets based on Hungary Algorithm Multitarget Tracking.Wherein, enliven track and represent track that following the tracks of, that show in real-time result; Historical track, represent current do not have tracked, but the track enlivening track may be transformed into; Dead track, expression thoroughly terminates, no longer tracked track.

This method adopts the movement locus obtaining moving target based on the multiple target tracking mode of Hungary Algorithm, and wherein Hungary Algorithm is used for calculating the optimum correspondence problem of multiple moving target.Wherein, the description of moving target similarity is colouring information based on moving target and positional information.Colouring information adopts color histogram to quantize, a kind of statistical value of distribution of color in color histogram presentation video, represents the ratio that different color is shared in the picture, calculates simple, and has yardstick, translation and rotational invariance.Positional information calculates in conjunction with Kalman filter, Kalman filtering is the linear system optimal estimation method under minimum mean square error criterion, its basic thought makes variance of estimaion error be minimum, and estimate without inclined, can promote target following effect.

As shown in Figure 2, the movement locus obtaining moving target based on the multiple target tracking mode of Hungary Algorithm in the present invention specifically can be divided into following step:

Step S1041,8 × 8 × 8 color histogram features of all Candidate Motion targets in calculation procedure S103, then calculate the similitude of the color histogram feature of moving target and the color histogram feature of previous frame moving target obtained in present frame.Preferably, the present invention adopts RGB color space to calculate the color histogram of each moving target: first quantize the color component of three in color space RGB, each color space is divided into 8 sub spaces, one dimension (bin) in the corresponding histogram of every sub spaces, add up the number of pixels dropped in subspace corresponding to the every one dimension of histogram, thus obtain color histogram, then calculating previous frame enlivens the similarity between moving target corresponding to track and the color histogram feature of current frame motion target.Preferably, the present invention adopts Hellinger distance to measure the similarity of two histogram distribution:

d (h_{1}, h_{2}) = \sqrt{1 - \frac{1}{\sqrt{\overset{&OverBar;}{h_{1}} \overset{&OverBar;}{h_{2}} N^{2}}} Σ_{q = 1}^{N} \sqrt{h_{1} (q) h_{2} (q)}}

Wherein, h ₁(q) and h ₂q () represents two color histogram vectors, N is 8 × 8 × 8,

\overset{&OverBar;}{h_{k}} = \frac{1}{N} Σ_{j = 1}^{N} h_{k} (j) .

If the color histogram of two targets is more similar, the Hellinger distance namely between color histogram vector is less, then the possibility of two object matchings is higher, and its probability distribution meets Gaussian Profile.Such as, in the monitor video picture of highway, there is a white car W in left side, and there is a black car B on right side, and this method needs to follow the tracks of these two moving targets, thus obtains their movement locus.If in previous frame, color histogram is calculated to two moving object W and B detected in picture and obtains h ₁and h ₂, color histogram is calculated to moving object W and B of two in present frame picture and obtains h ₃and h ₄, by calculating h ₁and h ₃, h ₁and h ₄, h ₂and h ₃, h ₂and h ₄between Hellinger distance, can h be found ₁and h ₃, h ₂and h ₄hellinger distance much smaller than h ₁and h ₄, h ₂and h ₃between Hellinger distance, so can obtain h ₁and h ₃the color histogram of W corresponding to two continuous frames, h ₂and h ₄be the color histogram of B corresponding to two continuous frames, the target that this information can help two continuous frames to occur is mated.

Step S1042, enlivens trace information according to moving target in previous frame image, utilizes the position of Kalman filter prediction moving target.Enliven trace information according to every article in t-1 two field picture, utilize the position that in Kalman filter prediction t frame, moving target occurs.The Candidate Motion target of t frame is obtained in step S103, and successively in the predicted position of t frame and the object detection results of t frame detection module, Euclidean distance calculating is carried out to moving target in this step in S1042, Euclidean distance is less, then predicted position and accurate location more close, so the possibility of two object matchings is higher, and its probability distribution meets Gaussian Profile.Such as, the left side vehicle W in monitored picture mentioned above and right side vehicle B, if in t-1 frame, utilizes Kalman filter to carry out position prediction to moving object W and B of detect two in picture, obtains the predicted position l in t frame ₁' and l ₂', in step S104 after t frame detects two moving object W and B, obtain the physical location l of target ₁and l ₂.Because in continuous print two frame, can not there is huge variation in the position of vehicle, so l ₁' and l ₁, l ₂' and l ₂euclidean distance will be far smaller than l ₁' and l ₂, l ₁' and l ₂euclidean distance, the target that this information can help two continuous frames to occur is mated.

Step S1043, adopts Hungary Algorithm, utilizes colouring information and positional information to carry out multiobject coupling, and Hungary Algorithm is the classic algorithm solving bipartite graph maximum matching problem.Such as, if have m to enliven track in t-1 frame, step S103 obtains n Candidate Motion target in t frame, and the similarity between the moving target color histogram feature enlivening track and t frame being calculated t-1 frame by Hellinger, and obtain the matrix M of m × n ₁; And calculate the Euclidean distance enlivened between the predicted position of track in t frame and the accurate location of t frame moving target of t-1 frame, the matrix M of m × n can be obtained ₂.By matrix M ₁and M ₂the element multiplication of correspondence position, obtains the matrix M of m × n, and using the input value of this matrix M as Hungary Algorithm, Hungary Algorithm can provide the individual matching result enlivening track and t frame n moving target of m in t-1 frame, if similarity is less than threshold value T in matching result ₂when=0.5, then not think and mate, otherwise then the match is successful.

Step S1044, according to the matching result of target in previous step, generates the movement locus of moving target in present frame; Target of prediction positional information etc. in the next frame simultaneously.

If the moving target n enlivening track mi and t frame of t-1 frame _jthe match is successful, then think target n _jmovement locus in front t-1 frame is mi, upgrades and enlivens track mi.Now, for target n _jterminate in the tracing process of t frame.

If the moving target of t frame does not match the track that enlivens of t-1 frame, illustrating that this target does not have movement locus, is fresh target; If t-1 frame enliven the moving target that track does not match t frame, illustrate that target disappears, then this enlivened track and mate with historical track, if can match, then this enlivens track and historical track and is integrated into and new enlivens track, otherwise this enlivens track and changes historical track into.

The present invention is at t frame target n _jafter renewal enlivens track, utilize Kalman filter prediction target n _jin the position of t+1 frame, and preserve target n _jthe information such as type, position, area, the ratio of width to height, to use when t+1 frame target detection.

Step S105, utilizes object classifiers, confirms and classify to Candidate Motion target.

In step S103, in S104, by the method for background modeling, the moving region in original video is extracted and followed the tracks of, but because noise is (as trees, flag etc.) can direct interference to the extraction of background modeling for moving region, therefore at S103, be easy to be mixed with a large amount of noises in the Candidate Motion target extracted in S104, if directly assign these candidate targets as real moving target, in order to generating video snapshot, snapshot number can be caused too much, false-alarm is too much, thus affect the efficiency that user searches interesting target, therefore, need further screen these Candidate Motion targets and judge, distinguish real moving target and noise.In view of degree of deep learning art shows superior performance in the application of more and more field of image recognition, the invention degree of deep learning art is used in video summarization method, make use of the unsurpassed performance of degree of deep learning art in image recognition fully.In the present invention, we used and distinguish real motion target and noise based on the convolutional neural networks (CNN) in degree of deep learning art as object classifiers.

In step S104, extract to each frame in step S103 the Candidate Motion target obtained respectively to follow the tracks of, for the candidate target not forming track, direct use object classifiers judges it, and for forming the Candidate Motion target of track, after its track becomes dead track, object classifiers is used further to confirm it, judge whether this candidate target is real motion target, if it is determined that be real goal, then kind judging is carried out to this target.Do like this, the superior function of CNN grader can be utilized on the one hand, candidate target is accurately judged, distinguish noise and real goal; On the other hand, only a small amount of sort operation is carried out for the target forming track, instead of classifies frame by frame, reduce amount of calculation; Finally, classify to target, the snapshot after convenient generates and target retrieval.

In the preferred embodiment of the present invention, adopt the object classifiers that off-line training is good, the candidate target obtained is confirmed and classify in step S104.The off-line training specific implementation method of object classifiers is as follows:

First, training sample is collected.Sample set can need to classify according to various concrete scene, such as, for traffic monitoring, can be divided into: (1) pedestrian; (2) bicycle; (3) compact car such as car; (4) large car such as truck; (5) the non-interesting target such as trees, flag still understands the local of moving object; (6) image-region except above-mentioned five type games objects in monitoring scene; These samples need to be obtained by the cutting of artificial mark from real monitor video according to concrete scene.Also various different classification can be had for other scenes.In a preferred embodiment of the invention, for traffic monitoring, trained one six classification grader and be used for respectively judging whether candidate target is real goal, and two classification graders of a people/motor vehicle carry out type decision to real target.Why when confirming target, target classification is carried out careful division, because fully taken into account the difference in interesting target class like this, real interesting target and noise can be distinguished accurately more in detail, such as, if car and truck classify as motor vehicle jointly, large class is trained, due to both property of there are differences in appearance, the sorter model that training can be made to obtain is easier is judged to be the large class of motor vehicle by noise, if and be split as two groups, so the discrimination of these two classifications and noise can become large, thus interesting target and noise can be distinguished more accurately.

Secondly, the structure of convolutional neural networks.The present invention uses the convolutional neural networks in degree of deep learning art (CNN) to classify to target image.In a preferred embodiment of the invention, we construct one and comprise three convolutional layers, three down-sampling layers, three nonlinear propagation function layers, a full articulamentum, the neural network structure of a recurrence layer, the sample of collection, after convergent-divergent normalization, inputs network together with its class label, with the different classes of sample of maximized differentiation input for target, use stochastic gradient descent algorithm, be optimized network, study obtains the parameter of each layer in network configuration.This learning process is that off-line carries out, and in order to carry out Fast Learning training for great amount of samples, the present invention proposes a kind of for Image Segmentation Using, the method for parallel computation image convolution:

For convolution kernel size for n × n (n is for odd number):

1, the training sample image of input is divided into some pieces of m × m, if image can not be divided into an integer fritter, then segmentation after edge mends 0;

2, for each small images, get centered by this small images center (m+n-1) × image of (m+n-1) size is as the subgraph of training sample image, by same sample image subgraph carry out convolutional calculation concurrently, after each subgraph convolution obtained like this, characteristic pattern size is m × m;

3, arranged according to its position on former figure by same width subgraph m × m characteristic pattern, namely what be easy to prove to obtain is the characteristic pattern that this image obtains through convolutional calculation with the characteristic pattern of the size such as former figure;

Use such method can realize the parallelization calculated for same width image convolution, thus drastically increase the speed of model training, in addition, in order to make the training of model meticulousr, in the method, Study rate parameter for each layer network dynamically arranges, and Study rate parameter can carry out automatic fine tuning according to the degree of convergence of model, thus makes the robustness of models applying in actual scene higher.

After study obtains the parameter of all optimums, obtain corresponding model.When classifying to image, use this model, calculate this image characteristic of correspondence figure by three convolutional layers, down-sampling layer and nonlinear propagation function layer, the computational methods of characteristic pattern are as follows:

1, image is decomposed into the sized images such as three width according to RGB triple channel, as the input of whole convolutional network;

2, inputting convolutional layer, by training the N number of convolution kernel obtained to carry out convolution to input picture, obtaining N width characteristic pattern;

3, down-sampling is carried out to N width characteristic pattern, obtain new characteristic pattern;

4, a nonlinear propagation function layer is passed through for the characteristic pattern after sampling, each characteristic value is amplified;

5, using the input of the output of nonlinear propagation function layer as next convolutional layer, repeat the step of 2-4, altogether by cubic convolution layer, down-sampling layer and nonlinear propagation function layer, using the characteristic pattern of the output of last nonlinear propagation function layer as input picture;

Using the input of the characteristic pattern obtained as full articulamentum and logistic regression layer, to the characteristic pattern obtained be trained respectively by full articulamentum network, each convolution kernel of full articulamentum carries out convolutional calculation to all characteristic patterns, and convolution results is lined up in a certain order the characteristic vector group of a composition N dimension, obtained the probability matrix of a 1 × M by N × M parameter matrix of logistic regression layer after this characteristic vector transposition, M element of this matrix represents that this image belongs to the probability of M classification respectively, thus reaches the object of classifying to input picture.

As shown in Figure 3, candidate target confirmation and classification specifically can be divided into following step:

Step S1051, judges whether candidate target defines track;

Step S1053, S1054, for the candidate target not forming track, use six classification graders to judge, if this candidate target type belongs to (5) or (6) class, then thinks that this candidate target is noise really;

Step S1052, judge whether the track of candidate target is dead track, if not dead track, target being described still tracked, now not carrying out goal verification, is to only confirm once each track like this, improves arithmetic speed; In the method, when certain historical track, through the matching operation of N frame, still cannot match with sport foreground, then be considered as this historical track and stop, N=50 in this algorithm.

Step S1055, candidate target is present in the diverse location in frame of video in different frame, according to the positional information of its track record, the correspondence image of this target is obtained from corresponding frame of video, confirm, in a preferred embodiment of the invention, for ensureing the accuracy of goal verification, respectively select a width to comprise the image of candidate target from the initial, middle of a target trajectory and termination, totally three width confirm.

Step S1056, S1057 and S1058, six classification graders are used to classify to three width candidate target images respectively, if the classification of three width images is all judged as belong to (5) or (6) class, then thinks that this candidate target is noise, delete the information of this target; Otherwise, then think that this target is real goal; Record three type decisions of six classification graders for this target simultaneously, if these three type decisions judge to there is difference for the large classification (man/machine motor-car) of this target, such as, once be judged as bicycle, be judged as the compact cars such as car other twice, so kind judging carried out for this target end user/motor vehicle grader.

Step S1059, after the classification determining a real goal, records the type of this target, trace information in order at generating video snapshot afterwards.

Step S106, shows a small amount of snapshot of the moving target of all records.Use the average snapshot background image as generating of some frames in original video, the real motion target of record is fitted on snapshot background image according to the position that it occurs in primitive frame; Because each moving target is except type information, also record the trace information of this target, this target is clearly represented in order to use a small amount of snapshot, the maximum position of target area is selected from each position this target trajectory, extract this target, and be fitted on snapshot background image according to the position that it occurs in the frame; Meanwhile, in order to only use a small amount of snapshot to show all targets, and each target clearly can both represent on snapshot, method proposes a kind of snapshot generating algorithm of local optimum:

1, the target that instant recording detects in algorithm process process, preserves in queue;

2, when queue length is greater than certain threshold value T, carry out a snapshot and generate: first aim O1 in queue is fitted;

3, judge that in queue, whether remaining target is overlapping with O1, find first to be fitted on snapshot with the nonoverlapping target O2 of O1;

4, be initial with O2, search backward, until find first with the non-overlapping target of O2, be fitted on snapshot;

The like, until queue traversal terminates.

In the snapshot that the method generates, all targets do not have overlap, and the sequencing of the appearance of target on snapshot is arrangement actual time occurred according to target on the whole, thus ensure that the clear displaying of target on snapshot, target can be improved neatly in snapshot upper density by improving threshold value T simultaneously, reducing snapshot number; Each is fitted in the target on snapshot, it occurs that the time in video all can be demarcated in target, and convenient, user locates interested target fast in original video.

Through actual verification, the preferred embodiments of the present invention can reach the 12-20 of video normal playback speed doubly to the processing speed of the monitor video (more than 1280 × 720) of high definition on the PC of configuration Intel i7-3770 CPU, and the missing rate of target is less than 2%, false alarm rate is less than 5%.

The present invention pays close attention to the reliability of video summarization system under complex scene, the creationary target classification technology employed based on degree of depth study, greatly reduce the loss of moving target, meanwhile, reduce noise and be mistaken for moving target thus the probability of interference video summary quality; In addition, to in the judgement of moving target and testing process, employ multi-target tracking technology, thus avoid and classification is carried out one by one for the Candidate Motion target in each frame judge, significantly decrease amount of calculation, thus improve the speed of video summarization system process video.Compared with traditional video summarization method, the present invention accurately, fast, intactly can extract foreground moving object in complex scene, clearly show all moving targets in long section video with the form of a small amount of snapshot picture, under complex scene, reliable video frequency abstract can be generated.

Above-described specific embodiment; object of the present invention, technical scheme and beneficial effect are further described; be understood that; the foregoing is only specific embodiments of the invention; be not limited to the present invention; within the spirit and principles in the present invention all, any amendment made, equivalent replacement, improvement etc., all should be included within protection scope of the present invention.

Claims

1., based on a video abstraction generating method for degree of deep learning art, comprise the following steps:

2., as claimed in claim 1 based on the video abstraction generating method of degree of deep learning art, before wherein in step 1 the step of background modeling being carried out to the image sequence of the original video of input, also comprise the step image sequence of the described original video of input being zoomed to formed objects.

3., as claimed in claim 1 based on the video abstraction generating method of degree of deep learning art, described in step 1, wherein extract in the step of foreground area corresponding to moving target the step also comprising and the described sport foreground obtained is carried out to reprocessing, specifically comprise:

4., as claimed in claim 1 based on the video abstraction generating method of degree of deep learning art, wherein described in step 2 Multitarget Tracking builds based on Hungary Algorithm, specifically comprises:

5. as claimed in claim 1 based on the video abstraction generating method of degree of deep learning art, wherein described in step 3 object classifiers uses the convolutional neural networks in degree of deep learning art to carry out off-line training in advance to obtain, for judging whether described motion candidates target is real goal, and the type of described motion candidates target.

6. as claimed in claim 5 based on the video abstraction generating method of degree of deep learning art, the step of wherein carrying out off-line training to described object classifiers comprises and adopts sample sets to carry out object classifiers described in off-line training, and the image background regions that the sample set of training described object classifiers to use comprises the five type games targets that occur in monitor video or image corresponding to object and removes outside this five class: 1. pedestrian; 2. bicycle; 3. the compact car such as car; 4. the large car such as truck; 5. the non-interesting target such as trees, flag still understands the local of moving object; 6. the image-region except above-mentioned five type games objects in monitoring scene; The object classifiers of one six classification is obtained, for confirming whether described motion candidates target is real goal by above-mentioned training.

7. as claimed in claim 5 based on the video abstraction generating method of degree of deep learning art, wherein by above-mentioned sample 1., 2. class sample and 3., 4. sample be combined into two large classifications respectively: people, motor vehicle, use two classification graders of this two classes sample training people/motor vehicle, after being interesting target in the described motion candidates target of confirmation, the classification of described motion candidates target is judged.

8., as claimed in claim 6 based on the video abstraction generating method of degree of deep learning art, wherein step 3 specifically comprises the following steps:

9. as claimed in claim 1 based on the video abstraction generating method of degree of deep learning art, wherein in step 4, in described target trajectory after validation, select the position that area is maximum, make image corresponding for described position to be fitted on a width snapshot, multiple described motion candidates target is sticked to form a snapshot, uses described snapshot to show the moving target occurred in video.

10. as claimed in claim 9 based on the video abstraction generating method of degree of deep learning art, wherein in step 4, in the described snapshot generated, all described motion candidates targets do not have overlap, and the sequencing that described motion candidates target occurs on described snapshot is arrangement actual time occurred according to described motion candidates target on the whole.