CN105303571A - Time-space saliency detection method for video processing


Info

Publication number
CN105303571A
CN105303571A (application CN201510692276.5A)
Authority
CN
China
Prior art keywords
pixel
frame
priori
video
time
Prior art date
Legal status
Pending
Application number
CN201510692276.5A
Other languages
Chinese (zh)
Inventor
刘纯平
朱桂墘
季怡
徐鑫
秦利斌
龚声蓉
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date: 2015-10-23
Filing date: 2015-10-23
Publication date: 2016-02-03
Application filed by Suzhou University
Priority to CN201510692276.5A
Publication of CN105303571A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; Image sequence
    • G06T2207/10024: Color image
    • G06T2207/30: Subject of image; Context of image processing
    • G06T2207/30232: Surveillance

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a time-space (spatio-temporal) saliency detection method for video processing. Color channel histograms, an optical flow magnitude histogram, and an optical flow direction histogram are used to obtain the feature contrast. For the first frame, the distance to the video frame center serves as the location prior; for every other frame, the position of the salient object in the previous frame serves as the location prior, yielding a location prior saliency map. The optical flow magnitude and direction of each pixel are used to obtain a velocity prior saliency map and an acceleration prior saliency map. A sliding window along the time dimension is used for per-pixel mean filtering, yielding a background prior saliency map. The feature contrast is fused with each prior saliency map to obtain the spatio-temporal saliency map and detect the target. The method is better suited to detection in surveillance video where the position at which a moving target appears is unknown, and it captures moving salient objects more reliably.

Description

Time-space saliency detection method for video processing
Technical field
The present invention relates to a video processing method, and in particular to a spatio-temporal saliency detection method used as a preprocessing step.
Background technology
Through its attention selection mechanism, the human eye can quickly concentrate on the regions of interest in images and video without deliberate effort. Saliency detection simulates this visual attention mechanism of the human eye to rapidly detect interesting targets in an image or video, and spatio-temporal saliency detection is saliency detection in video that additionally fuses motion features.
In image and video processing and analysis, the computation of visual saliency often serves as a preprocessing step. It is an important foundation for tasks such as image and video segmentation, object detection and object tracking, and offers a new way of approaching these problems. Moreover, saliency has important applications in fields such as object recognition, adaptive image/video compression, and image/video retrieval. Extracting the salient regions of images and video efficiently and accurately has a significant positive impact on subsequent processing and applications.
In recent years, saliency research for video has attracted increasing attention. Itti et al. added motion and inter-frame flicker features to the Itti98 image saliency computation model, extending Itti98 to video. Guo et al. extended the spectral residual method to video: motion, red-green contrast, blue-yellow contrast and intensity are extracted from each video frame, the phase spectrum of each feature is obtained by a quaternion Fourier transform, and the channels are finally fused into the saliency map. Lu et al. used low-level features such as color, texture and motion together with some cognitive features in their saliency model. Cheng et al. also added motion information to their still-image saliency model, computing the saliency map by analyzing pixel motion in the horizontal and vertical directions. Boiman et al. proposed a method for detecting irregularities in the spatio-temporal dimensions of video, comparing 2-D and 3-D textures of video patches with training data to detect irregular motion in the video. Le Meur et al. proposed a spatio-temporal model based on the visual attention mechanism, whose temporal saliency map is obtained from an analysis of affine motion. Kienzle et al., by learning from eye-movement data, constructed detectors based on spatio-temporal interest points that filter the input signal in the temporal and spatial domains respectively, thereby detecting salient objects.
Current saliency computation models are still immature, and every model has limitations: inevitably, non-salient regions are sometimes computed as salient, or salient regions are computed as non-salient. The first case is usually related to the features selected for the saliency computation. For example, the quaternion Fourier (PQFT) saliency model of Guo retains the irregular spectral components in the frequency domain, and these components are mostly edge information of objects; this method highlights small salient targets well, but for targets of larger scale it tends to leave hollow interiors. Saliency models based on a single feature are especially prone to this first case. The second case is usually caused by scene complexity: in complex scenes, color features often cannot separate foreground from background, and the background itself often shows marked color variation, so parts of the background easily receive high saliency values.
In video surveillance applications the following problems arise: the camera is usually fixed and static, the background is essentially static as well, the background texture may be rather complex, and where a moving target will appear is unknowable. Moreover, although the background dynamics in surveillance video are generally weak, the scene itself is complex, so background priors computed along the spatial dimension often fail to give robust results.
In particular, when video surveillance is applied to complicated natural scenes with complex backgrounds and dynamically changing imaging conditions, saliency detection becomes harder still. Such complicated natural scenes include scenes where illumination, viewpoint, scale changes, occlusion and the like cause both foreground and background motion. In complex dynamic natural scenes composed of multiple phenomena, scene objects are not isolated: they are closely tied to other targets in their surroundings, to the scene, and across scenes, with varying combinations of themes, intra-scene changes, visually similar appearances across scenes, and variation over time, as with waving leaves, dense crowds, flocks of birds, flowing water, waves, snow, rain and smoke. Relying on motion energy alone to detect targets in such scenes performs poorly, so detecting and extracting salient objects in dynamic scenes is difficult.
Hence, how to obtain a saliency detection method adapted to the needs of video surveillance applications is a problem this field needs to solve.
Summary of the invention
The object of the present invention is to provide a spatio-temporal saliency detection method for video processing, so as to improve the effect of saliency detection in object detection for surveillance video.
To achieve the above object of the invention, the technical solution adopted by the present invention is a spatio-temporal saliency detection method for video processing, comprising:
(1) Segment the input video with a hierarchical video segmentation method: the input video sequence is first divided into m non-overlapping subsequences, where m is a positive integer smaller than the number of frames of the video sequence, and the video sequence is then hierarchically segmented.
The hierarchical segmentation may use the prior art. For example, Xu C, Xiong C, Corso J J. Streaming Hierarchical Video Segmentation. Lecture Notes in Computer Science, 2012, 7577(1): 626-639, proposes a streaming hierarchical video segmentation framework that, borrowing the idea of data streams, imposes a Markov assumption on the video stream to segment the whole video.
Based on the graph-based hierarchical segmentation algorithm, the video V is divided into m segments, V = {V1, V2, ..., Vm}; processing each video segment gives the result S = {S1, S2, ..., Sm}.
(2) For each region obtained by the segmentation, compute histograms of the L, a, b color channels, where the L channel is divided into 8 groups (bins) and the a and b channels into 16 groups each, composing a 40-dimensional feature vector.
(3) Compute the optical flow magnitude and direction of each pixel from two adjacent frames; for each segmented region, compute a 16-group optical flow magnitude histogram and a 9-group optical flow direction histogram.
(4) For each segmented region, compute the feature contrast, which is measured by summing the distances between this region's color channel histograms, optical flow magnitude histogram and optical flow direction histogram and those of the other regions.
(5) Obtain the prior saliency maps of the video, comprising:
1. Location prior: for the first frame, the distance to the video frame center is adopted as the location prior; for the other frames, the position of the salient object in the previous frame is adopted as the location prior, yielding the location prior saliency map.
2. Velocity prior and acceleration prior: from the optical flow magnitude and direction of each pixel obtained in step (3), the motion velocity and the acceleration of each pixel are obtained, yielding the velocity prior saliency map and the acceleration prior saliency map.
3. Background prior: a sliding window along the time dimension, more than 10 frames long, is provided; each pixel within the window is mean-filtered, and the filtered result serves as the background corresponding to the middle frame of the window, yielding the background prior saliency map.
(6) Saliency fusion: the feature contrast obtained in step (4) is fused with each prior saliency map obtained in step (5) by merging all segmentation levels; within a given frame, the saliency of each pixel is a linear combination of feature contrast and priors, yielding the spatio-temporal saliency map and achieving the detection of the target.
In the above technical solution, the position of the target detected in each frame in step (6) serves as the position of the salient object in the previous frame when the location prior of the next frame is computed in step (5).
In the above technical solution, in step (2) the composed feature vector is normalized to obtain the color feature vector used for subsequent processing.
In step (3), the optical flow magnitude of each pixel is expressed as fm(x, y) = √(u² + v²), where fm is the optical flow magnitude, (x, y) is the position of the pixel in the video frame, u is the velocity in the x direction and v is the velocity in the y direction; the optical flow direction of each pixel is expressed as fo(x, y) = arctan(v/u), where fo is the optical flow direction. When the optical flow direction histogram is computed, each group spans 40°, and all histograms are normalized to vectors of length 1.
In step (4), for each segmented region R(t, c) centered at frame t, namely the region containing pixel c at frame t, the feature contrast is obtained by the following formula:
C(R(t, c)) = Σ over R' ≠ R(t, c) of A(R') · exp(−d(R(t, c), R')² / σ²) · D(R(t, c), R')
where C(R(t, c)) denotes the feature contrast of the region centered at frame t; D sums the distances between the histograms of the two regions, including the color feature vector col obtained in step (2); R(t, i) denotes the segmented region containing the i-th pixel of frame t; A(R') is the region size; d(R(t, c), R') is the distance between the centers of gravity of the two regions, a measure of their closeness; and σ² is an empirical normalization value.
σ² may, for example, be set to 0.04; the aim is to normalize the result into [0, 1].
In sub-step 1. of step (5), for each pixel i in the current video frame, let its normalized coordinates be (x_i, y_i) and let the center of the salient object of the previous frame be c; the location prior of pixel i is then defined to decrease with the distance between (x_i, y_i) and c.
In sub-step 2. of step (5), the motion velocity of pixel (x, y) is the optical flow magnitude fm(x, y) = √(u² + v²), where fm is the optical flow magnitude, (x, y) is the position of the pixel in the video frame, u is the velocity in the x direction and v is the velocity in the y direction;
the acceleration of pixel (x, y) is obtained by differencing the optical flow over the neighboring pixels (x−1, y), (x+1, y), (x, y−1), (x, y+1) surrounding pixel (x, y).
In sub-step 3. of step (5), the background prior corresponding to the middle frame of the window is given by the difference between I(x, y) and B(x, y), normalized by a preset parameter σ, where (x, y) is the position of the pixel in the video frame, I(x, y) is the pixel value at (x, y) and B(x, y) is the background value at (x, y);
Given a segmented region R, the prior of the region is defined as the mean of its pixels' priors:
P_k(R) = (1 / A(R)) · Σ over i in R of p_k(i)
where A(R) is the area of the segmented region and p_k(i) is the prior of the i-th pixel of the region, k ranging over the four kinds of pixel priors: location, velocity, acceleration and background.
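A minimal sketch of this region averaging, assuming an integer region-label map and NumPy; the function name region_prior is illustrative, not from the patent:

```python
import numpy as np

def region_prior(prior_map, labels):
    """Region-level prior P_k(R) = (1/A(R)) * sum over i in R of p_k(i):
    the mean of a pixel-level prior map over each segmented region.
    `labels` is an integer region-label map of the same shape; the same
    routine is applied in turn to the location, velocity, acceleration
    and background prior maps."""
    n = labels.max() + 1
    sums = np.bincount(labels.ravel(), weights=prior_map.ravel(), minlength=n)
    areas = np.bincount(labels.ravel(), minlength=n)   # A(R) for each region
    return sums / np.maximum(areas, 1)                 # one prior value per region
```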
In the above technical solution, the saliency fusion of step (6) computes, for each pixel, the linear combination of the feature contrast and the priors of the regions containing it, accumulated over all segmentation levels, where in the subscripts i and t denote the i-th pixel of frame t, j denotes the segmentation of the j-th level, and R(j, t, i) denotes the region at level j containing the i-th pixel of frame t.
With the above technical solution, the present invention has the following advantages over the prior art:
1. Instead of the traditional computation of the location prior from the video frame center, the present invention adopts a detection-based location prior, and is therefore better suited to surveillance video in which the position where a moving target appears is unknowable.
2. When fusing the background prior, the present invention computes the background prior along the time dimension, obtaining the background by temporal filtering. This addresses the problem in video surveillance that the background dynamics are weak but the scene itself is complex, so that background priors computed along the spatial dimension often fail to give robust results.
3. In video surveillance, different targets have different velocities and accelerations; the present invention fuses velocity and acceleration priors to better capture moving salient objects, and these priors make full use of the already computed pixel-level optical flow features, reducing the amount of computation.
4. To handle the spatio-temporal objects that attract human attention in surveillance scenes, the present invention uses Lab color as the appearance descriptor in the spatial dimension and optical flow magnitude and direction histograms in the temporal dimension, and fuses the two kinds of features to obtain spatio-temporally salient objects.
Brief description of the drawings
Fig. 1 is a schematic diagram of the framework of the method of embodiment one of the present invention;
Fig. 2 shows the contrast saliency maps of different features in embodiment one;
Fig. 3 shows the prior saliency maps in embodiment one;
Fig. 4 compares the saliency maps of different methods in embodiment two;
Fig. 5 shows the precision-recall curves on the Weizmann dataset in embodiment two.
Detailed description of the embodiments
The invention is further described below in conjunction with the drawings and embodiments:
Embodiment one: a spatio-temporal saliency detection method fusing prior information. As shown in Fig. 1, following feature integration theory and the principle of hybrid saliency models, the saliency computation is divided into two parts, feature contrast and priors, and multiple spatio-temporal features and priors are fused to generate a robust spatio-temporal saliency map.
1. Feature contrast
(1) Spatial features
The input video is segmented with a hierarchical video segmentation method. The input video sequence V is first divided into m non-overlapping subsequences V1, ..., Vm, and hierarchical segmentation of the whole sequence V yields S = {S1, ..., Sm}, where Si corresponds to the hierarchical segmentation of the i-th subsequence. The segmentation of each level is composed of a series of small regions that are pairwise disjoint: for any two different region indices j and k, the regions do not overlap. The segmentation of the i-th video segment is determined by the segmentation result of segment i−1 and the video content of segment i.
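The streaming scheme can be sketched as follows; the graph-based hierarchical segmentation itself (Xu et al., 2012) is assumed to be available as a callable `segment_hierarchically(subsequence, previous_result)`, whose name and signature are illustrative assumptions:

```python
def stream_segment(frames, m, segment_hierarchically):
    """Split `frames` into m non-overlapping subsequences V1..Vm and segment
    each subsequence conditioned only on the previous segment's result
    (the Markov assumption of the streaming framework)."""
    n = len(frames)
    bounds = [round(i * n / m) for i in range(m + 1)]   # subsequence boundaries
    results, prev = [], None
    for i in range(m):
        subsequence = frames[bounds[i]:bounds[i + 1]]
        prev = segment_hierarchically(subsequence, prev)
        results.append(prev)                            # S = {S1, ..., Sm}
    return results
```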
For each segmented region R centered at frame t, histograms of the L, a, b color channels are computed, the L channel with 8 bins and the a and b channels with 16 bins each, composing a 40-dimensional feature vector; the final appearance feature vector is obtained by L2-normalizing this vector.
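A minimal sketch of this per-region color feature, assuming OpenCV's 8-bit Lab conversion and a boolean region mask; names are illustrative:

```python
import numpy as np
import cv2

def region_color_histogram(frame_bgr, region_mask):
    """8-bin L + 16-bin a + 16-bin b histograms of one segmented region,
    concatenated into the 40-dimensional vector and L2-normalized."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)    # 8-bit Lab, values 0..255
    hist = np.concatenate([
        np.histogram(lab[..., 0][region_mask], bins=8,  range=(0, 256))[0],
        np.histogram(lab[..., 1][region_mask], bins=16, range=(0, 256))[0],
        np.histogram(lab[..., 2][region_mask], bins=16, range=(0, 256))[0],
    ]).astype(np.float64)
    return hist / (np.linalg.norm(hist) + 1e-12)        # L2 normalization
```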
(2) Temporal features
Temporal features comprise the magnitude and direction of optical flow. The optical flow magnitude and direction of each pixel are computed from two adjacent frames, fm(x, y) = √(u² + v²) and fo(x, y) = arctan(v/u); for each segmented region, a 16-bin optical flow magnitude histogram and a 9-bin optical flow direction histogram are computed, with 40° per direction bin. All histograms are normalized to vectors of length 1.
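A minimal sketch of the per-region temporal features; Farneback optical flow is used here as a stand-in (the patent does not fix a specific flow algorithm), and the magnitude range used for binning is an assumption:

```python
import numpy as np
import cv2

def region_flow_histograms(prev_gray, curr_gray, region_mask, max_mag=None):
    """16-bin flow-magnitude and 9-bin flow-direction histograms (40 degrees
    per direction bin) of one region, each L2-normalized."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    u, v = flow[..., 0], flow[..., 1]
    fm = np.sqrt(u ** 2 + v ** 2)                 # fm(x, y) = sqrt(u^2 + v^2)
    fo = np.degrees(np.arctan2(v, u)) % 360.0     # fo(x, y), folded into [0, 360)
    if max_mag is None:
        max_mag = float(fm.max()) + 1e-12         # assumed binning range
    h_mag = np.histogram(fm[region_mask], bins=16,
                         range=(0, max_mag))[0].astype(float)
    h_dir = np.histogram(fo[region_mask], bins=9,
                         range=(0, 360))[0].astype(float)
    return (h_mag / (np.linalg.norm(h_mag) + 1e-12),
            h_dir / (np.linalg.norm(h_dir) + 1e-12))
```

In practice the flow field would be computed once per frame pair and shared across all regions rather than recomputed per call; it is kept inside the function here only to keep the sketch self-contained.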
For each segmented region R, the feature contrast is measured by summing the distances between this region's color histogram, optical flow magnitude histogram and optical flow direction histogram and those of all other regions, as follows:
C(R) = Σ over R' ≠ R of A(R') · exp(−d(R, R')² / σ²) · D(R, R')
where A(R') is the region size and d(R, R') is the distance between the centers of gravity of the two regions, a measure of their closeness. Pairs of large, nearby regions contribute more to the saliency, whereas small, distant regions contribute less. The contrast maps of the different features are shown in Fig. 2.
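A minimal sketch of the region contrast under the stated weighting, with a Euclidean histogram distance assumed:

```python
import numpy as np

def feature_contrast(feats, areas, centroids, sigma2=0.04):
    """Contrast of each region against all others: histogram distances
    weighted by the other region's area and by the closeness of the two
    centers of gravity. `feats` is (n, d) stacked color+flow histograms,
    `areas` is (n,), `centroids` is (n, 2) in normalized coordinates."""
    d_feat = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    d_pos = np.sum((centroids[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)
    w = areas[None, :] * np.exp(-d_pos / sigma2)   # large, nearby regions weigh more
    np.fill_diagonal(w, 0.0)                       # exclude the region itself
    contrast = (w * d_feat).sum(axis=1)
    return contrast / (contrast.max() + 1e-12)     # normalize into [0, 1]
```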
2. Saliency priors
Saliency priors are usually derived from the biases people show toward certain regions when watching videos and images. For video, people tend to attend to foreground targets that move fast or change direction, so motion information and background information are often used as priors to assist the generation of the saliency map. In addition, because the videographer usually frames the most important content near the center of the picture, the center position of the video frame is also commonly used as a prior for generating saliency maps.
For ordinarily shot video, the distance between a pixel and the picture center is a suitable location prior. In video surveillance the situation differs: the camera is usually fixed at a certain position and static, so the salient target may appear at any position in the frame, and using the distance of a pixel from the picture center as a prior is inappropriate. Because target motion in video is continuous, the salient object of the next frame should be near the position of the salient object in the current frame; a suitable location prior is therefore the position of the salient object in the previous frame. For the first frame of a video, which has no previous-frame prior, the distance to the frame center is likewise adopted as the location prior. For each pixel i in the current video frame, let its normalized coordinates be (x_i, y_i) and the center of the salient object of the previous frame be c; the location prior of pixel i is then defined to decay with the distance between (x_i, y_i) and c:
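A minimal sketch of the location prior; the Gaussian falloff and its width are assumptions, as the patent fixes only the choice of center:

```python
import numpy as np

def location_prior(h, w, center=None, sigma=0.2):
    """Per-pixel location prior: high near `center`, the previous frame's
    salient-object center in normalized coordinates (frame center for the
    first frame), decaying with distance."""
    if center is None:
        center = (0.5, 0.5)                        # first frame: frame center
    ys, xs = np.mgrid[0:h, 0:w]
    nx, ny = xs / (w - 1), ys / (h - 1)            # normalized coordinates (x_i, y_i)
    d2 = (nx - center[0]) ** 2 + (ny - center[1]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))
```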
The target motion prior makes full use of the previously computed pixel-level optical flow and comprises two parts: velocity and acceleration. The velocity magnitude of each pixel is obtained directly from the computed optical flow, and the acceleration of each pixel is obtained by taking differences in the horizontal and vertical directions:
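A minimal sketch of the two motion priors; central differences of the flow magnitude over the four neighbours are assumed here, as the exact differencing is not spelled out:

```python
import numpy as np

def motion_priors(fm):
    """Velocity prior: the optical-flow magnitude fm itself. Acceleration
    prior: horizontal plus vertical differences over the neighbours
    (x-1, y), (x+1, y), (x, y-1), (x, y+1)."""
    vel = fm / (fm.max() + 1e-12)
    acc = np.zeros_like(fm, dtype=np.float64)
    acc[1:-1, :] += np.abs(fm[2:, :] - fm[:-2, :])   # difference along one axis
    acc[:, 1:-1] += np.abs(fm[:, 2:] - fm[:, :-2])   # difference along the other
    return vel, acc / (acc.max() + 1e-12)
```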
The background prior can suppress the background regions of the saliency map and improve the saliency computation; Zhu et al., for example, added background information to their model, in which the background prior consists of two parts, boundary and connectivity, and achieved good results on still images. In video surveillance, however, the focus is usually on moving pedestrians, vehicles and the like, so moving objects should receive higher saliency. The background prior for still images is considered from the spatial dimension; this method instead considers the background prior from the time dimension: a sliding window is taken along the time dimension, each pixel within the window is mean-filtered, and the filtered result serves as the background corresponding to the middle frame of the window. The background prior corresponding to this frame grows with the difference between the pixel value I(x, y) and the background value B(x, y), normalized by a preset parameter:
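A minimal sketch of the temporal background prior; the absolute-difference form and the role of the normalization parameter are inferred from the description above:

```python
import numpy as np

def background_prior(gray_window, sigma=10.0):
    """Background prior for the middle frame of a temporal sliding window
    (length > 10): B is the per-pixel mean over the window, and the prior
    grows with |I - B|, which stays near zero on static background."""
    window = np.asarray(gray_window, dtype=np.float64)   # shape (T, H, W), T > 10
    B = window.mean(axis=0)                              # per-pixel mean filter
    I = window[len(window) // 2]                         # the middle frame
    return np.abs(I - B) / sigma
```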
In the experiments the normalization parameter is set to 10. In video surveillance the camera is static, so the gray level of a background pixel is essentially constant over a short time; the difference between I(x, y) and B(x, y) will therefore be very small, which achieves the effect of background suppression. Given a segmented region R, the prior of the region is defined as the mean of its pixels' priors, P_k(R) = (1 / A(R)) · Σ over i in R of p_k(i), where A(R) is the area of the region and p_k(i) is the prior of the i-th pixel of the region. The different priors are shown in Fig. 3.
3. Saliency fusion
In the final step of the saliency computation, spatio-temporal saliency models combine the temporal saliency map and the spatial saliency map in some manner into one final spatio-temporal saliency map. In this method, the previously obtained feature contrast is fused with the priors into the spatio-temporal saliency map; temporal and spatial features were already merged while computing the feature contrast and the prior terms. The preceding saliency computations are carried out independently on the segmented regions of each level; when the saliency is fused, all segmentation levels are merged, and within a given frame the saliency of each pixel is the linear combination of the feature contrast and the priors, accumulated over the levels. In the subscripts of this computation, i and t denote the i-th pixel of frame t, j denotes the segmentation of the j-th level, and R(j, t, i) denotes the region at level j containing the i-th pixel of frame t.
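A minimal sketch of the fusion across levels; an unweighted sum of contrast and prior is assumed for the linear combination:

```python
import numpy as np

def fuse_levels(label_maps, contrast_per_level, prior_per_level):
    """Pixel saliency S(i, t): for every segmentation level j, look up the
    feature contrast and the prior of the region containing the pixel and
    accumulate their linear combination over all levels. `label_maps` holds
    one integer label map per level; the contrast and prior arrays are
    indexed by region id."""
    sal = np.zeros(label_maps[0].shape, dtype=np.float64)
    for labels, contrast, prior in zip(label_maps, contrast_per_level,
                                       prior_per_level):
        sal += contrast[labels] + prior[labels]    # per-pixel region lookups
    return sal / (sal.max() + 1e-12)               # final spatio-temporal map
```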
Embodiment two:
The method of embodiment one is verified experimentally on the UCSD dynamic scene database.
The database contains videos of 18 different dynamic scenes: bird, boat, bottle, chopper, cyclist, flock, freeway, hockey, jump, landing, ocean, peds, rain, skiing, surf, surfers, traffic and zodiac; 17 of these videos are used here. The obtained saliency maps are binarized and compared with the ground truth provided by the database, and the equal error rate (EER) is computed. The equal error rate is a measure of the accuracy of a recognition system: the error rate at which the false acceptance rate (FAR) equals the false rejection rate (FRR). Experimental hardware environment: Windows 7, Core i7 processor, 3.4 GHz, 8 GB RAM. Code running environment: Matlab 2013a. The saliency maps of the different methods are shown in Fig. 4.
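A minimal sketch of this EER evaluation protocol; the threshold sweep granularity is an assumption:

```python
import numpy as np

def equal_error_rate(saliency, gt):
    """Sweep the binarization threshold and return the error rate where the
    false acceptance rate (FAR, background marked salient) and the false
    rejection rate (FRR, salient marked background) are closest."""
    s, g = saliency.ravel(), gt.ravel().astype(bool)
    best_gap, eer = 2.0, 1.0
    for thr in np.linspace(s.min(), s.max(), 256):
        pred = s >= thr
        far = pred[~g].mean() if (~g).any() else 0.0
        frr = (~pred[g]).mean() if g.any() else 0.0
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```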
1. Qualitative analysis
As can be seen from Fig. 4, compared with the other methods, this method suppresses the background regions more accurately and obtains relatively complete salient regions, with clearer boundaries than PQFT and GBVS.
2. Quantitative evaluation
The equal error rates of the different methods on the UCSD database are summarized below:
Table 1. Comparison of the EER of different methods
As Table 1 shows, the equal error rate of this method is on average 7% lower than those of the two saliency methods Rahtu and GBMR, and about 2% lower than those of the background segmentation methods LRMR, KDE and GMM. The equal error rates of GBVS and PQFT are lower than that of this method; however, PQFT discards much detail information in the frequency domain, so the saliency maps it obtains show hollow interiors and blurred edges, which is unfavorable for segmenting the salient object, and the GBVS method likewise suffers from blurred boundaries.
To compare the performance of the different methods on color video quantitatively, experiments were finally also carried out on the Weizmann dataset, a widely used action recognition dataset containing 93 videos of 10 subjects divided into 10 different action classes, shot in color with a static camera. This method uses saliency to perform background segmentation of the videos. The dataset provides foreground ground truth for all videos, and the obtained saliency maps are thresholded to segment the background. The experimental results are measured by precision (P) and recall (R). Fig. 5 shows the precision-recall curves of the different saliency algorithms on the Weizmann dataset.
Fig. 5 shows that this method compares favorably with the other methods on this static color database, and taking the results on the two datasets together, this method is the more robust.

Claims (9)

1. A time-space saliency detection method for video processing, characterized in that it comprises:
(1) segmenting the input video with a hierarchical video segmentation method: the input video sequence is first divided into m non-overlapping subsequences, where m is a positive integer smaller than the number of frames of the video sequence, and the video sequence is then hierarchically segmented;
(2) for each region obtained by the segmentation, computing histograms of the L, a, b color channels, wherein the L channel is divided into 8 groups and the a and b channels into 16 groups each, composing a 40-dimensional feature vector;
(3) computing the optical flow magnitude and direction of each pixel from two adjacent frames, and, for each segmented region, computing a 16-group optical flow magnitude histogram and a 9-group optical flow direction histogram;
(4) for each segmented region, computing the feature contrast, which is measured by summing the distances between this region's color channel histograms, optical flow magnitude histogram and optical flow direction histogram and those of the other regions;
(5) obtaining the prior saliency maps of the video, comprising:
1. location prior: for the first frame, the distance to the video frame center is adopted as the location prior; for the other frames, the position of the salient object in the previous frame is adopted as the location prior, yielding the location prior saliency map;
2. velocity prior and acceleration prior: from the optical flow magnitude and direction of each pixel obtained in step (3), the motion velocity and the acceleration of each pixel are obtained, yielding the velocity prior saliency map and the acceleration prior saliency map;
3. background prior: a sliding window along the time dimension, more than 10 frames long, is provided; each pixel within the window is mean-filtered, and the filtered result serves as the background corresponding to the middle frame of the window, yielding the background prior saliency map;
(6) saliency fusion: the feature contrast obtained in step (4) is fused with each prior saliency map obtained in step (5) by merging all segmentation levels; within a given frame, the saliency of each pixel is a linear combination of feature contrast and priors, yielding the spatio-temporal saliency map and achieving the detection of the target.
2. The time-space saliency detection method for video processing according to claim 1, characterized in that: the position of the target detected in each frame in step (6) serves as the position of the salient object in the previous frame when the location prior of the next frame is computed in step (5).
3. The time-space saliency detection method for video processing according to claim 1, characterized in that: in step (2), the composed feature vector is normalized to obtain the color feature vector used for subsequent processing.
4. The time-space saliency detection method for video processing according to claim 1, characterized in that: in step (3), the optical flow magnitude of each pixel is expressed as fm(x, y) = √(u² + v²), where fm is the optical flow magnitude, (x, y) is the position of the pixel in the video frame, u is the velocity in the x direction and v is the velocity in the y direction; the optical flow direction of each pixel is expressed as fo(x, y) = arctan(v/u), where fo is the optical flow direction; when the optical flow direction histogram is computed, each group spans 40°, and all histograms are normalized to vectors of length 1.
5. The time-space saliency detection method for video processing according to claim 1, characterized in that: in step (4), for each segmented region R(t, c) centered at frame t, namely the region containing pixel c at frame t, the feature contrast is obtained by the following formula:
C(R(t, c)) = Σ over R' ≠ R(t, c) of A(R') · exp(−d(R(t, c), R')² / σ²) · D(R(t, c), R')
where C(R(t, c)) denotes the feature contrast of the region centered at frame t; D sums the distances between the histograms of the two regions, including the color feature vector col obtained in step (2); R(t, i) denotes the segmented region containing the i-th pixel of frame t; A(R') is the region size; d(R(t, c), R') is the distance between the centers of gravity of the two regions, a measure of their closeness; and σ² is an empirical normalization value.
6. The time-space saliency detection method for video processing according to claim 1, characterized in that: in sub-step 1. of step (5), for each pixel i in the current video frame, its normalized coordinates are (x_i, y_i) and the center of the salient object of the previous frame is c; the location prior of pixel i is then a function that decreases with the distance between (x_i, y_i) and c.
7. The time-space saliency detection method for video processing according to claim 1, characterized in that: in sub-step 2. of step (5), the motion velocity of pixel (x, y) is the optical flow magnitude fm(x, y) = √(u² + v²), where fm is the optical flow magnitude, (x, y) is the position of the pixel in the video frame, u is the velocity in the x direction and v is the velocity in the y direction;
the acceleration of pixel (x, y) is obtained by differencing the optical flow over the neighboring pixels (x−1, y), (x+1, y), (x, y−1), (x, y+1) surrounding pixel (x, y).
8. The time-space saliency detection method for video processing according to claim 1, characterized in that: in sub-step 3. of step (5), the background prior corresponding to the middle frame of the window is given by the difference between I(x, y) and B(x, y), normalized by a preset parameter, where (x, y) is the position of the pixel in the video frame, I(x, y) is the pixel value at (x, y) and B(x, y) is the background value at (x, y);
given a segmented region R, the prior of the region is defined as P_k(R) = (1 / A(R)) · Σ over i in R of p_k(i), where A(R) is the area of the segmented region and p_k(i) is the prior of the i-th pixel of the region, k ranging over the four kinds of pixel priors.
9. The time-space saliency detection method for video processing according to claim 1, characterized in that: the saliency fusion of step (6) computes, for each pixel, the linear combination of the feature contrast and the priors of the regions containing it, accumulated over all segmentation levels, where in the subscripts i and t denote the i-th pixel of frame t, j denotes the segmentation of the j-th level, and R(j, t, i) denotes the region at level j containing the i-th pixel of frame t.
CN201510692276.5A 2015-10-23 2015-10-23 Time-space saliency detection method for video processing Pending CN105303571A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510692276.5A CN105303571A (en) 2015-10-23 2015-10-23 Time-space saliency detection method for video processing


Publications (1)

Publication Number Publication Date
CN105303571A true CN105303571A (en) 2016-02-03

Family

ID=55200793

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510692276.5A Pending CN105303571A (en) 2015-10-23 2015-10-23 Time-space saliency detection method for video processing

Country Status (1)

Country Link
CN (1) CN105303571A (en)


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
FENG ZHOU et al.: "Time-Mapping Using Space-Time Saliency", 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *
QIN Libin (秦利斌) et al.: "An improved video salient object detection method based on spatio-temporal cues" (一种改进的时空线索的视频显著目标检测方法), http://www.cnki.net/kcms/doi/10.3778/j.issn.1002-8331.1308-0270.html *
ZHENG Yang (郑阳) et al.: "Salient object detection under suppression of non-clear regions" (非清晰区域抑制下的显著对象检测方法), Microelectronics & Computer (《微电子学与计算机》) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407978B (en) * 2016-09-24 2020-10-30 上海大学 Method for detecting salient object in unconstrained video by combining similarity degree
CN106407978A (en) * 2016-09-24 2017-02-15 上海大学 Unconstrained in-video salient object detection method combined with objectness degree
CN108052947A (en) * 2017-11-08 2018-05-18 北京航空航天大学 A kind of dynamic background suppressing method based on multiple dimensioned space-time consistency
CN108052947B (en) * 2017-11-08 2019-12-27 北京航空航天大学 Dynamic background suppression method based on multi-scale space-time consistency
US11704938B2 (en) 2018-05-29 2023-07-18 Huawei Technologies Co., Ltd. Action recognition method and apparatus
US11392801B2 (en) 2018-05-29 2022-07-19 Huawei Technologies Co., Ltd. Action recognition method and apparatus
WO2019228316A1 (en) * 2018-05-29 2019-12-05 华为技术有限公司 Action recognition method and apparatus
CN109064444B (en) * 2018-06-28 2021-09-28 东南大学 Track slab disease detection method based on significance analysis
CN109064444A (en) * 2018-06-28 2018-12-21 东南大学 Track plates Defect inspection method based on significance analysis
CN109767454B (en) * 2018-12-18 2022-05-10 西北工业大学 Unmanned aerial vehicle aerial video moving target detection method based on time-space-frequency significance
CN109767454A (en) * 2018-12-18 2019-05-17 西北工业大学 Based on Space Time-frequency conspicuousness unmanned plane video moving object detection method
CN109948424A (en) * 2019-01-22 2019-06-28 四川大学 A kind of group abnormality behavioral value method based on acceleration movement Feature Descriptor
CN112329857A (en) * 2020-11-06 2021-02-05 山西三友和智慧信息技术股份有限公司 Image classification method based on improved residual error network
CN114639171A (en) * 2022-05-18 2022-06-17 松立控股集团股份有限公司 Panoramic safety monitoring method for parking lot
CN114639171B (en) * 2022-05-18 2022-07-29 松立控股集团股份有限公司 Panoramic safety monitoring method for parking lot


Legal Events

C06 / PB01: Publication
C10 / SE01: Entry into substantive examination; entry into force of request for substantive examination
RJ01: Rejection of invention patent application after publication (application publication date: 2016-02-03)