CN102932605B - Method for selecting camera combination in visual perception network - Google Patents


Info

Publication number
CN102932605B
Authority
CN
China
Prior art keywords
video camera
histogram
video
image
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210488434.1A
Other languages
Chinese (zh)
Other versions
CN102932605A (en)
Inventor
孙正兴
李骞
陈松乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201210488434.1A
Publication of CN102932605A
Application granted
Publication of CN102932605B
Status: Expired - Fee Related
Anticipated expiration


Abstract

The invention discloses a method for selecting a camera combination in a visual perception network. The method comprises the following steps. Online generation of the target image visual histogram: when the fields of view of multiple cameras overlap, motion detection is performed on the online-acquired video data of the cameras observing the same target, the sub-region occupied by the target in the video frame image space is determined from the detection result to obtain the target image region, local features are extracted from the target image region, and the visual histogram of the target image region under each viewing angle is computed using a visual dictionary generated by pre-training. Sequential forward camera selection: the optimal view, i.e. the optimal camera, is first selected; then, within the set of unselected cameras, a sub-optimal camera is selected, added to the set of selected cameras, and removed from the set of candidate cameras; these steps are repeated until the number of selected cameras reaches the required number.

Description

A method for selecting a camera combination in a visual perception network
Technical field
The present invention relates to camera selection methods and belongs to the field of computer vision and video data processing, and specifically to a method for selecting a camera combination in a visual perception network.
Background art
In recent years, cameras have been widely used in fields such as security monitoring, human-computer interaction, navigation and positioning, and battlefield environment perception, and multi-camera systems have become one of the research hotspots of computer vision and its applications. Especially in video-based monitoring and human-computer interaction, a visual perception network (Visual Sensor Network, VSN) composed of multiple cameras can effectively solve problems such as self-occlusion that arise when a single camera observes a target, but it also produces a large amount of redundant information and increases the burden of storage, visual computation, and network transmission. How to choose and push the most informative video streams from multiple channels has therefore become one of the key issues of visual perception networks and their applications. Similar to the camera selection problem based on video data, the viewpoint selection problem for observing three-dimensional models has been studied extensively in computer graphics. For example, document 1 (Vazquez P, Sbert M. Fast adaptive selection of best views. Lecture Notes in Computer Science, 2003, 2669:295-305) computes the viewpoint entropy of a known geometric model under different viewing angles and selects the optimal viewpoint accordingly; unlike the camera selection problem, however, this approach requires an accurate model of the observed object in advance, and the model analysis is mostly performed in a controlled graphics environment that need not consider factors such as background and illumination. On the other hand, general sensor-network node selection, as in document 2 (Mo Y., Ambrosino R., Sinopoli B. Sensor Selection Strategies for State Estimation in Energy Constrained Wireless Sensor Networks. Automatica, 2011, 47(7):1330-1338) and document 3 (Huber M. F. Optimal pruning for multi-step sensor scheduling. IEEE Transactions on Automatic Control, 2012, 57(5):1338-1343), uses the positional relationship between the observed target and the sensors as the basis for node selection. A camera, however, perceives the environment directionally, so the optimal camera cannot be chosen simply from the positional relationship between the target and the camera nodes; in security monitoring, for example, a frontal image of a person is preferred over a close-range view of the back.
According to the overlap of camera fields of view in the visual perception network, existing camera selection methods can be divided into wide-area methods without view overlap and methods with partially or fully overlapping views. Methods without view overlap serve demands such as continuous target tracking over a large area, selecting nodes in a dispersed camera network according to predictions of the target's motion. The present invention, aimed at application demands such as security monitoring and human-computer interaction, mainly studies camera selection when the same target is observed within a partially or fully overlapping field of view. Such methods can be further divided into single-camera selection and camera-combination selection according to the number of cameras chosen. Single-camera selection picks only one optimal view as output at each selection time point according to a proposed criterion; the design of the selection criterion and of the visual-information measure then becomes the key issue, and the similarity between the information captured by different cameras need not be considered. Selection criteria can usually be divided into those based on video image content and those based on the spatial relationship between the target and the cameras in the objective world. For example, document 4 (Daniyal F., Taj M., Cavallaro A. Content and task-based view selection from multiple video streams. Multimedia Tools and Applications, 2010, 46:235-258) extracts video features such as the number, type, size, and position of moving objects and the occurrence of events in the video, and performs content-based camera selection according to the contextual information of these features; such methods only extract features from and score the video content obtained by each camera, and do not measure the similarity between the contents perceived by different nodes in the camera network. Methods based on the spatial relationship of the target, such as document 5 (Park J, Bhat C, Kak A C. A look-up table based approach for solving the camera selection problem in large camera networks. ACM Workshop on Distributed Smart Cameras, 2006), build a look-up table mapping the space within the camera coverage to the corresponding camera and, during selection, choose the camera node nearest to the target according to their spatial relationship. This requires accurate calibration of all cameras in the scene, otherwise the exact spatial position of the target cannot be recovered from the camera images, and it ignores the orientation of the target in the scene, so a frontal image cannot always be captured in applications such as security monitoring. All of the above methods output only a single optimal camera; they do not consider combining multiple cameras so that the selected views compensate for one another, and thus cannot account for the information similarity and redundancy between cameras.
When resources and processing capacity permit, selecting multiple cameras to form a camera combination can, compared with single-camera selection, effectively overcome problems such as self-occlusion and blind areas by increasing the number of information sources. A camera combination could be formed by repeatedly selecting the single best view, but the video images obtained by cameras at different angles share similar content and therefore contain varying degrees of data redundancy; in general, the combination formed from individually optimal cameras is not the optimal combination. For example, two cameras that both capture the target frontally each carry a large amount of information on their own, yet together they usually contain less global information than one frontal camera and one side camera. To date, relatively little research has addressed camera-combination selection.
Summary of the invention
Goal of the invention: the technical problem to be solved by the present invention is to address the deficiencies of the prior art by proposing a method for selecting a camera combination in a visual perception network.
Technical scheme: the method for selecting a camera combination in a visual perception network disclosed by the present invention comprises the following steps:
Step 1, online generation of the target image visual histogram: when the fields of view of multiple cameras overlap, perform motion detection on the online-acquired video data of the multiple cameras containing the target, determine from the detection result the sub-region occupied by the target in the video frame image space, i.e. obtain the target image region; extract local features from the target image region; and, using the visual dictionary generated by pre-training, compute the visual histogram of the target image region under this viewing angle;
Step 2, sequential forward camera selection: at each time point, first select one optimal view, i.e. the optimal camera; then, within the set of unselected cameras, compute from the visual histograms of step 1 the information gain of the target image in each candidate camera's video and the mutual information between the candidate camera and the set of selected cameras, select the sub-optimal camera whose information gain about the observed target is large and whose image content is least similar to the selected cameras, i.e. whose mutual information is small, add it to the selected camera set, and remove it from the candidate camera set; repeat these steps until the selected camera count reaches the preset value.
Visual dictionary: because the scale-invariant feature transform (Scale Invariant Feature Transform, SIFT) can largely overcome the illumination and scaling differences produced by different cameras, the present invention uses SIFT features as the words of the visual dictionary. For the video frame images of the multi-channel training data, the set of SIFT local feature descriptor vectors of each image is first extracted; k-means clustering is then performed on the SIFT descriptors extracted from all images; each cluster center is regarded as a visual word, and the resulting set of visual words forms the off-line-trained visual dictionary. This specifically comprises the following steps:
Extracting image SIFT descriptor vectors: for each input video frame image, filter the image with Gaussian templates to obtain the gradient components I_x and I_y in the x and y directions, and from these compute the pixel gradient magnitude mag(x, y) and direction θ(x, y) = arctan(I_y, I_x). Starting from the upper-left corner of the image, take a window of size 16 × 16 every 8 pixels in the x and y directions as a feature-extraction sampling window; divide each window into 4 × 4 square grid regions; for the sampling points in each region, compute the gradient direction relative to the window center and accumulate the distance-Gaussian-weighted gradient magnitudes into the region's 8-direction gradient orientation histogram; each sampling window thus generates one 4 × 4 × 8 = 128-dimensional feature vector, which is normalized to form the window's local feature descriptor vector. The descriptor vectors computed for each image are added to the descriptor set F = {f^(1), f^(2), f^(3), ..., f^(t)}, f^(i) ∈ R^128, 1 ≤ i ≤ t, where f^(i) is the i-th descriptor vector of this image's descriptor set, R^128 indicates that the vector has 128 dimensions, and t is the total number of descriptors extracted from this image.
Performing k-means clustering on the feature vectors: for the SIFT descriptor vector set F extracted from the frame images, randomly select k vectors in the set as initial cluster centers; assign all feature vectors to their nearest cluster center, then recompute the new cluster centers; iterate until the iteration limit is reached or the change in cluster-center distance falls below a threshold. The present invention stops iterating when the number of iterations reaches 50 to 200 or the cluster-center distance change is less than 0.02.
Forming the visual dictionary: each cluster center is regarded as a visual word; the set of visual words is obtained and stored, forming the off-line-trained visual dictionary.
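To make the off-line training stage concrete, the following Python sketch builds a visual dictionary from a set of training frames. It substitutes OpenCV's SIFT implementation and k-means routine for the hand-written gradient-window extraction and clustering described above; the function name, frame format, and parameter values are illustrative assumptions rather than part of the patent.

```python
import cv2
import numpy as np

def train_visual_dictionary(frames, k=200, max_iter=100, eps=0.02):
    """Cluster SIFT descriptors from training frames into k visual words."""
    sift = cv2.SIFT_create()
    descriptors = []
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        _, desc = sift.detectAndCompute(gray, None)     # desc: t x 128 matrix
        if desc is not None:
            descriptors.append(desc)
    F = np.vstack(descriptors).astype(np.float32)        # descriptor set F

    # k-means: stop after max_iter iterations or when centers move less than eps
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, max_iter, eps)
    _, _, centers = cv2.kmeans(F, k, None, criteria, attempts=3,
                               flags=cv2.KMEANS_RANDOM_CENTERS)
    return centers                                       # k x 128 visual words
```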
The online generation of the target image visual histogram in step 1 of the present invention specifically comprises the following steps:
Step 11, video motion detection: perform video motion detection on the video data input by each camera using a Gaussian mixture model, eliminate the shadows cast in the scene by the target from each frame's detection result using a texture-based method, and extract the region occupied by the moving target in the image space.
Step 12, extraction of region-image local feature descriptors: extract the set of SIFT descriptor vectors from the moving-target region image obtained in step 11;
Step 13, visual histogram generation: use the cluster centers of the pre-trained visual dictionary as histogram bins, assign the SIFT descriptor vectors extracted in step 12 from the moving-target region image to the corresponding histogram bins, count the number of descriptor vectors in each bin, and finally normalize the histogram, thereby generating the visual histogram of the moving target under each viewing angle.
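A minimal sketch of step 13, assuming the descriptors of the target region and the dictionary's cluster centers are already available as NumPy arrays; the nearest-word assignment and normalization follow the step above, while the helper name is hypothetical.

```python
import numpy as np

def visual_histogram(region_descriptors, centers):
    """Map each SIFT descriptor to its nearest visual word and normalize counts."""
    # squared Euclidean distance from every descriptor to every cluster center
    d2 = ((region_descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                          # bin index per descriptor
    hist = np.bincount(nearest, minlength=len(centers)).astype(np.float64)
    return hist / hist.sum()                             # normalized visual histogram
```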
Step 2 of the present invention specifically comprises the following steps:
Step 21, initialization: the scene contains a camera set C = {c_1, c_2, ..., c_m} simultaneously observing the moving target, where m is the total number of cameras; initialize the selected camera set C_s = ∅ and the candidate camera set C_u = C; merge the SIFT descriptor vector sets of all candidate cameras and generate the merged visual histogram H_merge by step 13;
Step 22, optimal camera selection: from the candidate camera set C_u, select one optimal camera c* according to criteria combining the face detection result, the information gain of the moving-target region image, and the image sharpness; add it to the selected camera set, i.e. C_s = {c*}, and remove it from the candidate camera set, i.e. C_u = C_u \ {c*}; initialize the selected camera count to 1. The specific steps are:
Step 221, face detection: use an AdaBoost (Adaptive Boosting) face detector to detect faces in the moving-target region image of candidate camera c'. AdaBoost is an iterative boosting algorithm whose core idea is to train different weak classifiers on the same training set and then combine these weak classifiers into a stronger final classifier; it adapts to the errors of the weak classifiers obtained by weak learning. The detection result is V_face ∈ {0, 1}, where 1 indicates that a face is detected and 0 otherwise;
Step 222, computation of the information gain of the moving-target region image: from the visual histogram H_c' of camera c' and the merged visual histogram H_merge, compute the information gain V_IG of the visual information between selecting and not selecting camera c', namely

$$V_{IG} = \sum_k p(b_k, c')\log\frac{p(b_k, c')}{p(b_k)\,p(c')} + \sum_k p(b_k, \overline{c'})\log\frac{p(b_k, \overline{c'})}{p(b_k)\,p(\overline{c'})},$$

where p(b_k, c') is the joint probability of selecting camera c' and the k-th bin of the visual histogram H_merge, p(b_k, \overline{c'}) is the joint probability of not selecting camera c' and the k-th bin of H_merge, p(b_k) is the probability of the k-th bin, and p(c') and p(\overline{c'}) are the probabilities of selecting and not selecting camera c', respectively, all computed from the histograms H_c' and H_merge;
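A numerical sketch of step 222 under explicit assumptions about how the probabilities are estimated: p(c') is taken as a uniform prior over the candidate cameras, p(b_k, c') as that prior times the camera's normalized histogram, and p(b_k) is read off the merged histogram. The patent only states that the probabilities are computed from H_c' and H_merge, so these estimation choices are illustrative.

```python
import numpy as np

def information_gain(hist_c, hist_merge, p_select, eps=1e-12):
    """V_IG between selecting and not selecting a camera (step 222 sketch).

    hist_c     : normalized visual histogram of candidate camera c'
    hist_merge : normalized merged histogram of all candidate cameras
    p_select   : assumed prior probability of selecting c'
    """
    p_b = hist_merge + eps                          # p(b_k)
    p_joint_sel = p_select * (hist_c + eps)         # p(b_k, c')
    # histogram of the remaining cameras, recovered from the merged histogram
    hist_rest = (hist_merge - p_select * hist_c) / (1.0 - p_select)
    p_joint_not = (1.0 - p_select) * (np.clip(hist_rest, 0, None) + eps)

    v_ig = (p_joint_sel * np.log(p_joint_sel / (p_b * p_select))).sum() \
         + (p_joint_not * np.log(p_joint_not / (p_b * (1.0 - p_select)))).sum()
    return v_ig
```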
Step 223, target image sharpness computation: compute the average gradient magnitude $\bar{V}_G$ of the target region image to characterize the sharpness of the target in the image.
Step 224, optimal camera selection: set weight coefficients α_1, α_2, α_3 with α_1 + α_2 + α_3 = 1, and select the camera c* that maximizes

$$c^* = \arg\max_{c}\left(\alpha_1 V_{face} + \alpha_2 V_{IG} + \alpha_3 \bar{V}_G\right)$$

as the optimal camera; set the selected camera count to 1;
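Putting steps 221-224 together, the sketch below scores each candidate camera and keeps the best one. It reuses the information_gain sketch above, stands in OpenCV's stock Haar-cascade (Viola-Jones, AdaBoost-trained) face detector for the detector of step 221, and uses Sobel gradients for the sharpness term; the camera identifiers and default weights are placeholders.

```python
import cv2
import numpy as np

def select_best_camera(candidates, hist_merge, a1=0.3, a2=0.4, a3=0.3):
    """Pick the optimal camera (steps 221-224 sketch).

    candidates: dict camera_id -> (target_region_gray, visual_histogram)
    """
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    p_select = 1.0 / len(candidates)
    best_id, best_score = None, -np.inf
    for cam_id, (region, hist) in candidates.items():
        v_face = 1.0 if len(face_cascade.detectMultiScale(region)) > 0 else 0.0
        v_ig = information_gain(hist, hist_merge, p_select)
        gx = cv2.Sobel(region, cv2.CV_64F, 1, 0)     # image gradients for sharpness
        gy = cv2.Sobel(region, cv2.CV_64F, 0, 1)
        v_g = np.mean(np.sqrt(gx ** 2 + gy ** 2))    # average gradient magnitude
        score = a1 * v_face + a2 * v_ig + a3 * v_g
        if score > best_score:
            best_id, best_score = cam_id, score
    return best_id
```

Note that V_face, V_IG, and the average gradient live on different numeric scales, so in practice the three terms would be normalized to comparable ranges before the weighted sum is formed.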
Step 23, sub-optimal camera selection: given the selected camera set C_s, at each iteration compute the information gain of each candidate camera and its visual-histogram mutual information with the selected cameras, select the sub-optimal camera c'*, add it to the selected camera set, remove it from the candidate camera set C_u, and increase the selected camera count, i.e. count = count + 1;
Step 231, compute the target-region image information gain IG_c' of candidate camera c' by the method of step 222;
Step 232, mutual information between a candidate camera and the selected cameras: compute the mutual information MI(c', c_j) between the visual histogram H_c' of the target region image of candidate camera c' and the visual histogram H_{c_j} of each camera c_j in the selected camera set, c_j ∈ C_s; MI(c', c_j) represents the similarity of the visual content of the target region images of the two cameras:

$$MI(c', c_j) = \sum_{x=1}^{n_{c'}} \sum_{y=1}^{n_{c_j}} p\!\left(H_{c'}^{x}, H_{c_j}^{y}\right)\log\frac{p\!\left(H_{c'}^{x}, H_{c_j}^{y}\right)}{p\!\left(H_{c'}^{x}\right)p\!\left(H_{c_j}^{y}\right)},$$

where H_{c'}^x is the x-th bin of histogram H_{c'}, H_{c_j}^y is the y-th bin of histogram H_{c_j}, n_{c'} is the number of bins of the visual histogram H_{c'} of candidate camera c', and n_{c_j} is the number of bins of the visual histogram H_{c_j} of selected camera c_j;
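A sketch of the mutual-information measure of step 232; it assumes an estimate of the joint bin distribution p(H_c'^x, H_cj^y) is supplied as a matrix, since the patent does not spell out how that joint distribution is obtained from the two histograms.

```python
import numpy as np

def histogram_mutual_information(p_joint, eps=1e-12):
    """MI(c', c_j) from an estimated joint bin distribution (step 232 sketch).

    p_joint: n_c' x n_cj matrix, p(H_c'^x, H_cj^y), summing to 1.
    """
    p_joint = p_joint + eps
    p_x = p_joint.sum(axis=1, keepdims=True)         # marginal p(H_c'^x)
    p_y = p_joint.sum(axis=0, keepdims=True)         # marginal p(H_cj^y)
    return float((p_joint * np.log(p_joint / (p_x * p_y))).sum())
```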
Step 233, sub-optimal camera selection: set a weight coefficient β, 0 ≤ β ≤ 1, and select the camera c'* that maximizes

$$c'^{*} = \arg\max_{c'}\left(IG_{c'} - \beta \sum_{c_j \in C_s} MI(c', c_j)\right)$$

as the sub-optimal camera;
Step 234, add the chosen sub-optimal camera c'* to the selected camera set, C_s = C_s ∪ {c'*}, remove it from the candidate camera set, C_u = C_u \ {c'*}, and increase the selected camera count, i.e. count = count + 1.
Step 24, repeat step 23 until the selected camera count reaches the preset camera number n, where n is a natural number set according to specific needs.
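The whole of step 2 then reduces to the following greedy loop over the camera set; the data structures mirror the sets C, C_s, and C_u of steps 21-24, the helper functions are the sketches given above, and joint_bins is an assumed external estimator of the joint bin distribution used by the mutual-information term.

```python
import numpy as np

def sequential_forward_selection(candidates, joint_bins, n, beta=0.5):
    """Greedy camera-combination selection (steps 21-24 sketch).

    candidates: dict camera_id -> (target_region_gray, visual_histogram)
    joint_bins: function (cam_a, cam_b) -> joint bin distribution p(H_a^x, H_b^y)
    n         : number of cameras to select, n < len(candidates)
    """
    hists = {cid: h for cid, (_, h) in candidates.items()}
    # step 21: merged histogram, assuming every camera contributes equally
    hist_merge = np.mean(list(hists.values()), axis=0)
    p_select = 1.0 / len(candidates)

    selected = [select_best_camera(candidates, hist_merge)]    # step 22
    remaining = set(candidates) - set(selected)

    while len(selected) < n and remaining:                     # steps 23-24
        def score(cid):
            ig = information_gain(hists[cid], hist_merge, p_select)
            mi = sum(histogram_mutual_information(joint_bins(cid, cj))
                     for cj in selected)
            return ig - beta * mi                              # step 233 criterion
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```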
Beneficial effects: to compensate for the information loss caused by self-occlusion of the target when a single camera is used in applications such as target monitoring and human-computer interaction, while avoiding the large amount of redundant information produced by using all cameras simultaneously, the present invention discloses a camera-combination selection method for these application demands: at each selection time point, camera nodes are selected incrementally so as to pick out the combination that is richest in information and lowest in redundancy, i.e. n cameras are chosen from the m candidate cameras observing the same target (n < m), satisfying the constraints on computation, storage, and network capacity.
Specifically, compared with existing methods, the present invention has the following advantages: 1) for a visual perception network with a common field of view, multiple cameras are selected to form a camera combination under constrained computation and storage capacity, effectively solving the self-occlusion problem of single-camera selection while reducing the information redundancy of using all cameras; 2) face detection, camera information gain, and image sharpness are combined for optimal camera selection, ensuring that a camera giving a frontal, more detailed, and sharper target image is selected; 3) sub-optimal cameras are selected from the candidates in a sequential forward manner, considering both a camera's information contribution to the observed target and, through mutual information, the information redundancy between cameras, thereby avoiding the tendency of a single criterion to select several optimal cameras at nearly the same angle; 4) the visual dictionary learned off-line allows visual histograms to be constructed under different viewing angles, relating the observations of the same target by multiple cameras; 5) SIFT local feature descriptors are chosen as visual words, effectively reducing the influence of scaling, illumination, and viewing-angle differences between cameras.
Brief description of the drawings
The present invention is further described below with reference to the drawings and specific embodiments, and the above-mentioned and/or other advantages of the present invention will become clearer.
Fig. 1 is a schematic diagram of the processing flow of the present invention.
Figs. 2a-2d are the 180th video frame images of the cameras in the first embodiment.
Figs. 3a-3d are the motion images obtained after applying the motion detection of step 11 to Figs. 2a-2d.
Figs. 4a-4d are the target images obtained from Figs. 2a-2d according to the motion detection results of step 11.
Figs. 5a-5d are the visual histograms obtained after applying step 13 to Figs. 4a-4d.
Figs. 6a-6h are the 62nd video frame images of the 8 cameras in the second embodiment.
Figs. 7a-7h are the moving-target images obtained after applying step 11 to Figs. 6a-6h.
Embodiments
The invention discloses a camera selection method based on camera information gain and inter-camera information redundancy, comprising the following steps:
Step 1, online generation of the target image visual histogram: assuming that the fields of view of the multiple cameras in the system overlap, the moving target in the scene can be observed simultaneously. First, motion detection is performed on the online-acquired multi-channel video data containing the target, and the sub-region of the target in the video frame image space is determined from the detection result; local features are then extracted from the target image region; and, using the visual dictionary generated by pre-training, the visual histogram of the target image region under this viewing angle is computed. This specifically comprises the following:
Step 11, video motion detection: perform video motion detection on the video data input by each camera using a Gaussian mixture model, eliminate the shadows cast in the scene by the target from each frame's detection result using a texture-based method, and extract the region occupied by the moving target in the image space.
Step 12, extraction of region-image local feature descriptors: extract the set of SIFT descriptor vectors from the moving-target region image obtained in step 11;
Step 13, visual histogram generation: use the cluster centers obtained by off-line computation as histogram bins (in the present invention the feature space is divided into several small cells, each cell being one histogram bin), assign the SIFT descriptor vectors extracted in step 12 from the moving-target region image to the corresponding bins, count the number of descriptor vectors in each bin, and finally normalize the histogram, thereby generating the visual histogram of the moving target under each viewing angle.
Step 2, sequential forward camera selection: select one optimal view, i.e. the optimal camera; then, within the set of unselected cameras, compute from the visual histograms of step 1 the information gain of the target image in each candidate camera's video and the mutual information between the candidate camera and the set of selected cameras, select the sub-optimal camera whose information gain about the observed target is large and whose image similarity with the selected cameras, i.e. mutual information, is small, add it to the selected camera set, and remove it from the candidate camera set; repeat these steps until the selected camera count reaches the preset value.
Step 21, initialization: the visual perception network scene contains a camera set C = {c_1, c_2, ..., c_m} simultaneously observing the moving target, where m is the total number of cameras; initialize the selected camera set C_s = ∅ and the candidate camera set C_u = C; merge the SIFT descriptor vector sets of all candidate cameras and generate the merged visual histogram H_merge;
Step 22, optimal camera selection: from the candidate camera set C_u, select one optimal camera c* according to criteria combining the face detection result, the information gain of the moving-target region image, and the image sharpness; add it to the selected camera set, i.e. C_s = {c*}, and remove it from the candidate camera set, i.e. C_u = C_u \ {c*}; initialize the selected camera count to 1. The specific steps are:
Step 221, face detection: use an AdaBoost (Adaptive Boosting, improved weak-classifier algorithm) face detector to detect faces in the moving-target region image of candidate camera c'; the detection result is V_face ∈ {0, 1}, where 1 indicates that a face is detected and 0 otherwise;
Step 222, computation of the information gain of the moving-target region image: from the visual histogram H_c' of camera c' and the merged visual histogram H_merge, compute the information gain V_IG of the visual information between selecting and not selecting camera c', namely

$$V_{IG} = \sum_k p(b_k, c')\log\frac{p(b_k, c')}{p(b_k)\,p(c')} + \sum_k p(b_k, \overline{c'})\log\frac{p(b_k, \overline{c'})}{p(b_k)\,p(\overline{c'})},$$

where p(b_k, c') is the joint probability of selecting camera c' and the k-th bin of the visual histogram H_merge, p(b_k, \overline{c'}) is the joint probability of not selecting camera c' and the k-th bin of H_merge, p(b_k) is the probability of the k-th bin, and p(c') and p(\overline{c'}) are the probabilities of selecting and not selecting camera c', respectively, all computed from the histograms H_c' and H_merge;
Step 223, target image sharpness computation: compute the average gradient magnitude $\bar{V}_G$ of the target region image to characterize the sharpness of the target in the image.
Step 224, optimal camera selection: set weight coefficients α_1, α_2, α_3 with α_1 + α_2 + α_3 = 1, and select the camera c* that maximizes

$$c^* = \arg\max_{c}\left(\alpha_1 V_{face} + \alpha_2 V_{IG} + \alpha_3 \bar{V}_G\right)$$

as the optimal camera; set the selected camera count to 1;
Step 23, sub-optimal camera selection: given the selected camera set C_s, at each iteration compute the information gain of each candidate camera and its visual-histogram mutual information with the selected cameras, select the sub-optimal camera c'*, add it to the selected camera set, remove it from the candidate camera set C_u, and increase the selected camera count, i.e. count = count + 1;
Step 231, compute the target-region image information gain IG_c' of candidate camera c' by the method of step 222;
Step 232, mutual information between a candidate camera and the selected cameras: compute the mutual information MI(c', c_j) between the visual histogram H_c' of the target region image of candidate camera c' and the visual histogram H_{c_j} of each camera c_j in the selected camera set, c_j ∈ C_s, representing the similarity of the visual content of the target region images of the two cameras:

$$MI(c', c_j) = \sum_{x=1}^{n_{c'}} \sum_{y=1}^{n_{c_j}} p\!\left(H_{c'}^{x}, H_{c_j}^{y}\right)\log\frac{p\!\left(H_{c'}^{x}, H_{c_j}^{y}\right)}{p\!\left(H_{c'}^{x}\right)p\!\left(H_{c_j}^{y}\right)},$$

where H_{c'}^x is the x-th bin of histogram H_{c'}, H_{c_j}^y is the y-th bin of histogram H_{c_j}, n_{c'} is the number of bins of the visual histogram H_{c'} of candidate camera c', and n_{c_j} is the number of bins of the visual histogram H_{c_j} of selected camera c_j;
Step 233, sub-optimal camera selection: set a weight coefficient β, 0 ≤ β ≤ 1, and select the camera c'* that maximizes

$$c'^{*} = \arg\max_{c'}\left(IG_{c'} - \beta \sum_{c_j \in C_s} MI(c', c_j)\right)$$

as the sub-optimal camera;
Step 234, add the chosen sub-optimal camera c'* to the selected camera set, C_s = C_s ∪ {c'*}, remove it from the candidate camera set, C_u = C_u \ {c'*}, and increase the selected camera count, i.e. count = count + 1.
Step 24, repeat step 23 until the selected camera count reaches the preset camera number n.
Off-line training of the visual dictionary: for the input multi-channel video frame images serving as training data, first extract the set of scale-invariant feature transform (Scale Invariant Feature Transform, SIFT) local feature descriptor vectors of each image, then perform k-means clustering on the SIFT descriptors extracted from all images; each cluster center is a descriptor vector and is regarded as a visual word, and the resulting set of visual words forms the off-line-trained visual dictionary. This specifically comprises the following steps:
Extracting image SIFT descriptor vectors: for each input video frame image, filter the image with Gaussian templates to obtain the gradient components I_x and I_y in the x and y directions, and from these compute the pixel gradient magnitude mag(x, y) and direction θ(x, y) = arctan(I_y, I_x). Starting from the upper-left corner of the image, take a window of size 16 × 16 every 8 pixels in the x and y directions as a feature-extraction sampling window; divide each window into 4 × 4 square grid regions; for the sampling points in each region, compute the gradient direction relative to the window center and accumulate the distance-Gaussian-weighted gradient magnitudes into the region's 8-direction gradient orientation histogram; each sampling window thus generates one 4 × 4 × 8 = 128-dimensional feature vector, which is normalized to form the window's local feature descriptor vector. The descriptor vectors computed for each image are added to the descriptor set F = {f^(1), f^(2), f^(3), ..., f^(t)}, f^(i) ∈ R^128, 1 ≤ i ≤ t, where f^(i) is the i-th descriptor vector of this image's descriptor set, R^128 indicates that the vector has 128 dimensions, and t is the total number of descriptor vectors extracted from this image;
Performing k-means clustering on the feature vectors: for the SIFT descriptor vector set F extracted from the frame images, randomly select k vectors in the set as initial cluster centers; assign all feature vectors to their nearest cluster center, then recompute the new cluster centers; iterate until the iteration limit is reached or the change of the cluster centers falls below a threshold. The present invention stops iterating when the number of iterations reaches 100 or the cluster-center distance change is less than 0.02.
Forming the visual dictionary: each cluster center is regarded as a visual word; the set of visual words is obtained and stored, forming the off-line-trained visual dictionary.
Embodiment 1
The present embodiment comprises off-line training to generate the visual dictionary, online generation of the target image visual histogram, and sequential forward camera selection. Its processing flow is shown in Fig. 1; the whole method is divided into two main steps, online target image visual histogram generation and camera selection, and the main flow of each part of the embodiment is introduced below.
1. Online generation of the target image visual histogram
To establish the information association between the cameras, the present embodiment first takes multi-channel video data of the same scene, extracts the local feature information in the video frames, clusters the local feature vectors, and uses the cluster centers as the off-line-trained visual dictionary, so that online video can generate the corresponding visual histograms according to the visual dictionary and the related information can be compared. Because SIFT local features can largely overcome the differences in appearance produced by illumination, scaling, and viewing angle when a target is seen from multiple views, the present embodiment extracts SIFT feature vectors from the input multi-view training video images, performs k-means clustering on them, and finally generates the visual dictionary. The specific steps are:
Extracting image SIFT descriptor vectors: for each input video frame image, use the Gaussian templates G_x and G_y to filter the image and obtain the gradient components I_x and I_y in the x and y directions, where

$$G_x = \begin{pmatrix} 0.0067 & 0.0085 & 0 & -0.0085 & -0.0067 \\ 0.0964 & 0.1224 & 0 & -0.1224 & -0.0964 \\ 0.2344 & 0.2977 & 0 & -0.2977 & -0.2344 \\ 0.0964 & 0.1224 & 0 & -0.1224 & -0.0964 \\ 0.0067 & 0.0085 & 0 & -0.0085 & -0.0067 \end{pmatrix},$$

$$G_y = \begin{pmatrix} 0.0067 & 0.0964 & 0.2344 & 0.0964 & 0.0067 \\ 0.0085 & 0.1224 & 0.2977 & 0.1224 & 0.0085 \\ 0 & 0 & 0 & 0 & 0 \\ -0.0085 & -0.1224 & -0.2977 & -0.1224 & -0.0085 \\ -0.0067 & -0.0964 & -0.2344 & -0.0964 & -0.0067 \end{pmatrix},$$

and from these compute the pixel gradient magnitude mag(x, y) and direction θ(x, y) = arctan(I_y, I_x). Starting from the upper-left corner of the image, take a window of size 16 × 16 every 8 pixels in the x and y directions as a feature-extraction sampling window; divide each window into 4 × 4 square grid regions; for the sampling points in each region, compute the gradient direction relative to the window center and accumulate the distance-Gaussian-weighted gradient magnitudes into the region's gradient orientation histogram over 8 directions covering 0 to 2π; each sampling window thus generates one 4 × 4 × 8 = 128-dimensional feature vector, which is normalized to form the window's local feature descriptor vector. The descriptor vectors computed for each image are added to the descriptor set F = {f^(1), f^(2), f^(3), ..., f^(t)}, f^(i) ∈ R^128, where f^(i) is the i-th descriptor vector of this image's descriptor set, each vector has 128 dimensions, and t is the total number of descriptor vectors extracted from this image.
Performing k-means clustering on the feature vectors: for the SIFT descriptor vector set F extracted from the frame images, let the number of cluster centers be k and perform k-means clustering as follows:
Cluster center selection: randomly select k local feature descriptor vectors {μ^(1), μ^(2), ..., μ^(k)} from the training-sample SIFT descriptor set as the centers of the k clusters;
Cluster assignment: for each remaining descriptor vector f^(i) in the descriptor set, compute its squared distance to each cluster center μ^(j),

$$d = \sum_{l=1}^{128}\left(f_l^{(i)} - \mu_l^{(j)}\right)^2,$$

where f_l^(i) is the l-th component of the descriptor vector f^(i), and assign the descriptor vector f^(i) to the cluster with the minimum distance d;
Recomputing the cluster centers: according to the assignment result, compute the mean of each dimension of all elements in each of the k clusters as the new cluster center;
Re-clustering: reassign all elements of the descriptor set F to the new cluster centers according to the minimum-distance criterion above;
Iterate the computation of the cluster centers and the reassignment of the descriptor set to the new centers until the preset number of iterations is reached or the distance between the new center and the old center before the iteration is less than the set threshold; the present invention stops iterating when the number of iterations reaches 100 or the cluster-center distance change is less than 0.02.
To establish the statistical representation model of the target image under each camera, the present invention extracts, from the multi-channel images input online, the region image of the moving target under each view, reducing the differences in appearance caused by different backgrounds; SIFT local features are extracted from the extracted target region images, and the visual histogram under each view is generated according to the off-line-trained visual dictionary. The specific steps are:
Step 11, video motion detection: perform video motion detection on the video data input by each camera and extract the region occupied by the moving target in the image space. The specific steps are as follows:
Step 111, moving-foreground extraction: use a Gaussian mixture model (GMM) to perform background modeling and foreground extraction on the sequential input images of each camera. The specific steps are as follows:
Step 1111, initialization: set the first input frame as the background, and set the number of Gaussians of the mixture model, the background threshold, and the window size.
Step 1112, input a new video frame into the model, update the background, and extract the foreground image of the current frame.
Step 112, shadow elimination: for each foreground image obtained in step 111, eliminate the target shadow from the foreground image. The specific steps are as follows:
Step 1121, use the Gaussian templates G_x and G_y to compute the gradients I_x and I_y of the original image I in the x and y directions.
Step 1122, use the same method as step 1121 to compute the gradients I_bx and I_by of the background image I_b in the x and y directions.
Step 1123, compute the cosine of the angle between the gradient vectors of image I and background image I_b,

$$\cos\theta = \frac{I_x I_{bx} + I_y I_{by}}{\sqrt{\left(I_x^2 + I_y^2\right)\left(I_{bx}^2 + I_{by}^2\right)}}.$$

Step 1124, for each point (x, y) of the image, compute the gradient texture over a 5 × 5 neighborhood,

$$S(x, y) = \frac{\sum_{u=x-2}^{x+2}\sum_{v=y-2}^{y+2} 2\sqrt{\left(I_x^2 + I_y^2\right)\left(I_{bx}^2 + I_{by}^2\right)}\cos\theta}{\sum_{u=x-2}^{x+2}\sum_{v=y-2}^{y+2}\left(I_x^2 + I_y^2 + I_{bx}^2 + I_{by}^2\right)}.$$

Step 1125, when S(x, y) is greater than a certain threshold and point (x, y) is a foreground point in the motion detection result, the point (x, y) is treated as shadow and removed from the foreground.
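The gradient-texture shadow test of steps 1121-1125 can be sketched as follows for a single frame; the Sobel operator stands in for the 5 × 5 Gaussian derivative templates of the embodiment, the box filter implements the 5 × 5 neighborhood sums, and the threshold value is an assumed placeholder.

```python
import cv2
import numpy as np

def remove_shadow(frame_gray, background_gray, foreground_mask, thresh=0.7):
    """Drop foreground pixels whose gradient texture matches the background."""
    ix = cv2.Sobel(frame_gray, cv2.CV_64F, 1, 0)
    iy = cv2.Sobel(frame_gray, cv2.CV_64F, 0, 1)
    ibx = cv2.Sobel(background_gray, cv2.CV_64F, 1, 0)
    iby = cv2.Sobel(background_gray, cv2.CV_64F, 0, 1)

    # numerator 2*sqrt(...)*cos(theta) simplifies to twice the gradient dot product
    num = cv2.boxFilter(2.0 * (ix * ibx + iy * iby), -1, (5, 5), normalize=False)
    den = cv2.boxFilter(ix**2 + iy**2 + ibx**2 + iby**2, -1, (5, 5), normalize=False)
    s = num / (den + 1e-12)

    shadow = (s > thresh) & (foreground_mask > 0)
    cleaned = foreground_mask.copy()
    cleaned[shadow] = 0                # shadow pixels removed from the foreground
    return cleaned
```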
Step 113, moving-target image region extraction: perform edge detection on the shadow-free foreground image, extract the target boundary, obtain the bounding rectangle of the target in the image space, and use this rectangle as a template to extract the region image of the moving target from the original video frame.
Step 12, extraction of region-image local feature descriptors: extract the set of SIFT descriptor vectors from the moving-target region image obtained in step 11.
Step 13, visual histogram generation: use the visual words of the visual dictionary obtained by off-line computation as histogram bins, assign the SIFT descriptor vectors extracted in step 12 from the moving-target region image to the corresponding bins, count the number of descriptor vectors in each bin, and finally normalize the histogram, thereby generating the visual histogram of the moving target under each viewing angle.
2. Camera selection
At each selection time point (the present embodiment sets a selection time point every 10 video frames), one optimal view, i.e. the optimal camera, is selected according to whether a face is detected in the view, the visual information gain of the target region, and the sharpness of the region image characterized by the average gradient, so as to ensure that a camera with a frontal, sharper, and more detailed target image is selected. Then, within the set of unselected cameras, the information gain of the target image in each candidate camera's video and the mutual information between the candidate camera and the selected camera set are computed from the online visual histograms, the sub-optimal camera whose information gain about the observed target is large and whose image similarity with the selected cameras, i.e. mutual information, is small is selected, added to the selected camera set, and removed from the candidate camera set; these steps are repeated until the selected camera count reaches the preset value. The specific steps are as follows:
Step 21, initialization: the scene contains a camera set C = {c_1, c_2, ..., c_m} simultaneously observing the moving target; initialize the selected camera set C_s = ∅ and the candidate camera set C_u = C; merge the SIFT descriptor vector sets of all candidate cameras and generate the merged visual histogram H_merge by step 13;
Step 22, optimal camera selection: from the candidate camera set C_u, select one optimal camera c* according to criteria combining the face detection result, the information gain of the moving-target region image, and the image sharpness; add it to the selected camera set, i.e. C_s = {c*}, and remove it from the candidate camera set, i.e. C_u = C_u \ {c*}; set the selected camera count to 1. The specific steps are:
Step 221, face detection: use an AdaBoost (Adaptive Boosting, improved weak-classifier algorithm) face detector to detect faces in the moving-target region image of candidate camera c'; the detection result is V_face ∈ {0, 1}, where 1 indicates that a face is detected and 0 otherwise;
Step 222, computation of the information gain of the moving-target region image: from the visual histogram H_c' of camera c' and the merged visual histogram H_merge, compute the information gain V_IG of the visual information between selecting and not selecting camera c', namely

$$V_{IG} = \sum_k p(b_k, c')\log\frac{p(b_k, c')}{p(b_k)\,p(c')} + \sum_k p(b_k, \overline{c'})\log\frac{p(b_k, \overline{c'})}{p(b_k)\,p(\overline{c'})},$$

where p(b_k, c') is the joint probability of selecting camera c' and the k-th bin of the visual histogram H_merge, p(b_k, \overline{c'}) is the joint probability of not selecting camera c' and the k-th bin of H_merge, p(b_k) is the probability of the k-th bin, and p(c') and p(\overline{c'}) are the probabilities of selecting and not selecting camera c', respectively, all computed from the histograms H_c' and H_merge;
Step 223, target image sharpness computation: compute the average gradient magnitude of the target region image,

$$\bar{V}_G = \frac{1}{N_x N_y}\sum_{x=1}^{N_x}\sum_{y=1}^{N_y} \mathrm{mag}(x, y),$$

to characterize the sharpness of the target in the video image observed by the camera, where N_x is the width and N_y is the height of the video image.
Step 224, optimal camera selection: set weight coefficients α_1, α_2, α_3 with α_1 + α_2 + α_3 = 1, and select the camera c* that maximizes

$$c^* = \arg\max_{c}\left(\alpha_1 V_{face} + \alpha_2 V_{IG} + \alpha_3 \bar{V}_G\right)$$

as the optimal camera; the present embodiment sets α_1 = 0.3, α_2 = 0.4, α_3 = 0.3;
Step 23, sub-optimal camera selection: given the selected camera set C_s, choose one sub-optimal camera from the candidate camera set C_u as follows:
Step 231, compute the target-region image information gain IG_c' of candidate camera c' by the method of step 222;
Step 232, mutual information between a candidate camera and the selected cameras: compute the mutual information MI(c', c_j) between the visual histogram H_c' of the target region image of candidate camera c' and the visual histogram H_{c_j} of each camera c_j in the selected camera set, c_j ∈ C_s, representing the similarity of the visual content of the target region images of the two cameras:

$$MI(c', c_j) = \sum_{x=1}^{n_{c'}} \sum_{y=1}^{n_{c_j}} p\!\left(H_{c'}^{x}, H_{c_j}^{y}\right)\log\frac{p\!\left(H_{c'}^{x}, H_{c_j}^{y}\right)}{p\!\left(H_{c'}^{x}\right)p\!\left(H_{c_j}^{y}\right)},$$

where H_{c'}^x is the x-th bin of histogram H_{c'}, H_{c_j}^y is the y-th bin of histogram H_{c_j}, n_{c'} is the number of bins of the visual histogram H_{c'} of candidate camera c', and n_{c_j} is the number of bins of the visual histogram H_{c_j} of selected camera c_j;
Step 233, sub-optimal camera selection: set a weight coefficient β, 0 ≤ β ≤ 1, and select the camera c'* that maximizes

$$c'^{*} = \arg\max_{c'}\left(IG_{c'} - \beta \sum_{c_j \in C_s} MI(c', c_j)\right)$$

as the sub-optimal camera; the present embodiment sets β = 0.5;
Step 234, add the chosen sub-optimal camera c'* to the selected camera set, C_s = C_s ∪ {c'*}, remove it from the candidate camera set, C_u = C_u \ {c'*}, and increase the selected camera count, i.e. count = count + 1.
Step 24, repeat step 23 until the selected camera count reaches the preset camera number n.
Embodiment 2
The camera selection system implemented according to this scheme was applied to the Terrace video sequences of the outdoor-scene POM data set (Fleuret F, Berclaz J, Lengagne R, Fua P. Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(2):267-282), which are captured by 4 cameras. The Terrace2 sequence was chosen as training data to generate the visual dictionary of this scene, and the online selection test was carried out on the Terrace1 sequence. The original images of the 180th frame are shown in Fig. 2, where Figs. 2a-2d are the target images obtained by cameras C0, C1, C2, and C3, respectively. Figs. 3a-3d are the foreground images obtained from Figs. 2a-2d after background modeling and foreground extraction with the Gaussian mixture model and shadow elimination based on texture information; Figs. 4a-4d are the target region images extracted from Figs. 2a-2d; and Figs. 5a-5d are the normalized visual histograms of Figs. 4a-4d, in which the target local feature descriptor vectors extracted under each view are assigned to the visual words of the visual dictionary (i.e. the histogram bins). The number of visual words of the dictionary in this embodiment is set to 200, i.e. there are 200 histogram bins (the x-axis in the figures is the bin index); the number of descriptor vectors falling into each bin is divided by the total number of feature vectors for normalization, and the bar charts show the probability distribution of the extracted local feature descriptor vectors over the visual dictionary (the y-axis is the probability). This per-bin probability distribution of the descriptor vectors is what each camera's visual histogram shows in this embodiment. For this target region the scheme detects a face in camera C2; combined with the information gain and image sharpness, the optimal camera selection result is therefore C2, indicating that the video image with the most frontal view of the target at this moment is obtained. On this basis, the system computes in turn the information gain of the other cameras and their mutual information with the selected cameras; the preference order is C2, C0, C3, C1, i.e. when two cameras are to be selected the camera-combination result is {C2, C0}, and when three cameras are to be selected the result is {C2, C0, C3}.
Embodiment 3
The camera selection system implemented according to this scheme was applied to video sequences of the indoor-scene i3DPost data set (N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas. The i3DPost multi-view and 3D human action/interaction database. In CVMP, 2009), which is captured by 8 cameras. The Walk video sequences D1-002 and D1-015 were chosen as training data to generate the visual dictionary of this scene, and the online selection test was carried out on the Run video sequence D1-016. The original images of the 62nd frame are shown in Figs. 6a-6h, which are the images obtained by cameras C0, C1, C2, C3, C4, C5, C6, and C7, respectively. Figs. 7a-7h show the target regions under each view after motion detection and shadow elimination. For this target region, in the optimal camera selection of this scheme the face detection value is 1 for C5 and C6 and 0 for the other cameras; combining the information gain and image sharpness, the system chooses C5 as the optimal camera. On this basis, the system computes in turn the information gain of the other cameras and their mutual information with the selected cameras; the preference order is C5, C6, C1, C3, C4, C7, C0, C2, i.e. when two cameras are to be selected the combination result is {C5, C6}, when three cameras are to be selected the result is {C5, C6, C1}, when four cameras are to be selected the result is {C5, C6, C1, C3}, and so on for the remaining combination sizes.

Claims (2)

1. A method for selecting a camera combination in a visual perception network, characterized by comprising the following steps:
Step 1, online generation of the target image visual histogram: when the fields of view of multiple cameras overlap, perform motion detection on the online-acquired video data of the multiple cameras observing the same target, determine from the detection result the sub-region occupied by the target in the video frame image space, i.e. obtain the target image region; extract local features from the target image region; and, according to the visual dictionary generated by pre-training, compute the visual histogram of the target image region under each viewing angle;
Step 2, sequential forward camera selection: select one optimal view, i.e. the optimal camera; then, within the set of unselected cameras, compute from the visual histograms of step 1 the information gain of the target image in each candidate camera's video data and the mutual information between the candidate camera and the set of selected cameras, select the sub-optimal camera, add it to the selected camera set, and remove it from the candidate camera set; repeat until the selected camera count reaches the required camera number;
wherein the visual dictionary generated by training is obtained as follows: for the input multi-channel video data serving as training data, first extract the set of scale-invariant feature transform local feature descriptor vectors of each image; perform k-means clustering on the scale-invariant feature transform descriptors extracted from all images; each cluster center is a descriptor vector and serves as a visual word, and the resulting set of visual words forms the off-line-trained visual dictionary;
training the visual dictionary specifically comprises the following steps:
extracting image scale-invariant feature transform descriptor vectors: for each video frame image, filter the image with Gaussian templates to obtain the gradient components I_x and I_y in the x and y directions, and compute the pixel gradient magnitude mag(x, y) and direction θ(x, y), where θ(x, y) = arctan(I_y, I_x); starting from the upper-left corner of the image, take a window of size 16 × 16 every 8 pixels in the x and y directions of the image as a feature-extraction sampling window; divide each sampling window into 4 × 4 square grid regions; for the sampling points in each region, compute the gradient direction relative to the window center and accumulate the distance-Gaussian-weighted gradient magnitudes into the region's 8-direction gradient orientation histogram; each sampling window generates one 128-dimensional feature vector, which is normalized to form the window's local feature descriptor vector; the descriptor vectors computed for each image are added to the descriptor set F = {f^(1), f^(2), f^(3), ..., f^(t)}, f^(i) ∈ R^128, 1 ≤ i ≤ t, where f^(i) is the i-th descriptor vector of this image's descriptor set, R^128 indicates that the vector has 128 dimensions, and t is the total number of descriptor vectors extracted from this image;
performing k-means clustering on the feature vectors: for the descriptor vector set F, randomly select k vectors as cluster centers, iteratively compute the distance of all vectors to the cluster centers and assign them to clusters, and recompute the cluster centers according to the assignment result, until the specified number of iterations is reached or the change of the cluster-center distance before and after an iteration is less than a set threshold;
forming the visual dictionary: taking each cluster center as a visual word, obtain and store the set of visual words, forming the visual dictionary;
the online generation of the target image visual histogram in step 1 specifically comprises the following steps:
Step 11, video moving-target detection: perform video motion detection on the video data input by each camera based on a Gaussian mixture model, eliminate the target shadow from the motion detection result based on texture information, and finally extract the region occupied by the moving target in the image space;
Step 12, extraction of region-image local feature descriptors: extract the set of scale-invariant feature transform descriptor vectors from the moving-target region image obtained in step 11;
Step 13, visual histogram generation: use each cluster center of the pre-trained visual dictionary as a histogram bin, assign the scale-invariant feature transform descriptor vectors of the moving-target region image extracted in step 12 to the corresponding histogram bins, count the number of descriptor vectors in each histogram bin, and finally normalize the histogram, generating the visual histogram of the moving target under each viewing angle;
The sequential forward direction video camera of described step 2 is selected specifically to comprise the following steps:
Step 21, initialization is selected: there is video camera set C={c in visually-perceptible network scenarios 1, c 2... c mobserving moving target simultaneously, m is the sum of video camera, has selected video camera set candidate's video camera set C u=C, merges the Scale invariant transform characteristics descriptor vector set of all candidate's video cameras, generates the vision histogram H after merging merge;
Step 22, optimal camera selection: select an optimal camera c* from the candidate camera set C_u and add it to the selected camera set C_s, i.e. C_s = {c*}, while removing camera c* from the candidate camera set, i.e. C_u = C_u \ {c*}; the selected-camera count value count is initially set to 1;
Step 23, sub-optimal camera selection: with respect to the selected camera set C_s, in each iteration compute each candidate camera's information gain and the mutual information between its visual histogram and those of the selected cameras, select the sub-optimal camera c*, add it to the selected camera set, remove it from the candidate camera set C_u, and increment the selected-camera count, i.e. count = count + 1;
Step 24, repeat step 23 until the selected-camera count count reaches the preset number of cameras n.
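The greedy loop of steps 21–24 can be written compactly as below, assuming per-camera visual histograms are available and that the scoring criteria of steps 22 and 23 are provided as callables; `score_optimal` and `score_suboptimal` are hypothetical placeholders for the weighted criteria defined in steps 224 and 233.

```python
def sequential_forward_selection(cameras, n, score_optimal, score_suboptimal):
    candidates = set(cameras)            # candidate camera set C_u
    selected = []                        # selected camera set C_s
    # Step 22: pick the single best camera by the weighted criterion of step 224
    best = max(candidates, key=score_optimal)
    selected.append(best)
    candidates.remove(best)
    # Steps 23-24: repeatedly add the camera maximizing gain minus redundancy
    while len(selected) < n and candidates:
        nxt = max(candidates, key=lambda c: score_suboptimal(c, selected))
        selected.append(nxt)
        candidates.remove(nxt)
    return selected
```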
The optimal camera selection of step 22 specifically comprises the following steps (a sketch combining these scores is given after step 224 below):
Step 221, face detection: apply a face detector to the moving target region image of candidate camera c'; the detection result is V_face ∈ {0, 1}, where 1 indicates a face was detected, otherwise 0;
Step 222, moving target region image information gain calculation: from the visual histogram H_c' of camera c' and the merged visual histogram H_merge, compute the information gain V_IG of the visual information between selecting camera c' and not selecting camera c', namely
$$V_{IG} = \sum_k p(b_k, c')\log\left(\frac{p(b_k, c')}{p(b_k)\,p(c')}\right) + \sum_k p(b_k, \bar{c}')\log\left(\frac{p(b_k, \bar{c}')}{p(b_k)\,p(\bar{c}')}\right),$$
where p(b_k, c') is the joint probability of selecting camera c' and the k-th histogram bucket of the visual histogram H_merge, p(b_k, \bar{c}') is the joint probability of not selecting camera c' and the k-th histogram bucket of H_merge, p(b_k) is the probability of the k-th histogram bucket, and p(c') and p(\bar{c}') are the probabilities of selecting and not selecting camera c', respectively;
Step 223, target image sharpness calculation: compute the average gradient magnitude V_Ḡ from the gradient of the target region image to characterize the clarity of the target in the image;
Step 224, optimal camera selection: set weight coefficients α_1, α_2, α_3 with α_1 + α_2 + α_3 = 1, and select camera c* as the optimal camera such that
$$c^* = \arg\max_c\left(\alpha_1 V_{face} + \alpha_2 V_{IG} + \alpha_3 V_{\bar{G}}\right).$$
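The sketch below illustrates the step-22 criterion. The claim does not spell out how the joint probabilities p(b_k, c') are estimated, so the information-gain helper makes an illustrative assumption (camera c''s histogram mass versus the remaining merged mass, with a prior p_select on selection); the weights and all function names are likewise illustrative, not the patented implementation.

```python
import numpy as np

def information_gain(hist_c, hist_merge, p_select=0.5):
    # V_IG over histogram buckets b_k, treating "select c'" as a binary variable.
    p_b = hist_merge / hist_merge.sum()
    p_b_given_c = hist_c / hist_c.sum()
    # p(b_k | not c'): merged mass with camera c' removed (illustrative assumption)
    rest = np.clip(hist_merge - hist_c, 1e-12, None)
    p_b_given_not_c = rest / rest.sum()
    joint_c = p_b_given_c * p_select
    joint_not_c = p_b_given_not_c * (1.0 - p_select)
    eps = 1e-12
    return (joint_c * np.log(joint_c / (p_b * p_select + eps) + eps)).sum() + \
           (joint_not_c * np.log(joint_not_c / (p_b * (1 - p_select) + eps) + eps)).sum()

def average_gradient(gray):
    # sharpness term V_G-bar: mean gradient magnitude of the target region
    gx = np.gradient(gray.astype(np.float64), axis=1)
    gy = np.gradient(gray.astype(np.float64), axis=0)
    return np.sqrt(gx ** 2 + gy ** 2).mean()

def optimal_score(v_face, v_ig, v_grad, a1=0.4, a2=0.4, a3=0.2):
    # alpha_1 + alpha_2 + alpha_3 = 1; weights here are placeholders
    return a1 * v_face + a2 * v_ig + a3 * v_grad
```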
2. The camera combination selection method in a visual perception network according to claim 1, characterized in that the sub-optimal camera selection of step 23 specifically comprises the following steps (a sketch of this scoring is given after step 234 below):
Step 231, compute the target region image information gain IG_c' of candidate camera c' by the method of step 222;
Step 232, mutual information between the candidate camera and the selected cameras: compute the mutual information MI(c', c_j) between the visual histogram H_c' of the target region image in candidate camera c' and the visual histogram of camera c_j in the selected camera set C_s, c_j ∈ C_s; MI(c', c_j) represents the degree of similarity of the visual content of the target region images between the two cameras:
$$MI(c', c_j) = \sum_{x=1}^{n_{c'}} \sum_{y=1}^{n_{c_j}} p(H_{c'}^x, H_{c_j}^y)\log\left(\frac{p(H_{c'}^x, H_{c_j}^y)}{p(H_{c'}^x)\,p(H_{c_j}^y)}\right),$$
where H_{c'}^x is the x-th bucket of histogram H_c', H_{c_j}^y is the y-th bucket of histogram H_{c_j}, n_c' is the total number of buckets in the visual histogram H_c' of candidate camera c', and n_{c_j} is the total number of buckets in the visual histogram H_{c_j} of selected camera c_j;
Step 233, sub-optimal camera selection: set a weight coefficient β, 0 ≤ β ≤ 1, and select camera c'* as the sub-optimal camera such that
$$c'^* = \arg\max_{c'}\left(IG_{c'} - \beta\sum_{c_j \in C_s} MI(c', c_j)\right);$$
Step 234, for the chosen sub-optimal camera c'*, add it to the selected camera set, C_s = C_s ∪ {c'*}, and simultaneously remove it from the candidate camera set, C_u = C_u \ {c'*}.
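A sketch of the step-23 criterion follows. The claim does not specify how the joint probability over histogram buckets of two views is estimated, so the mutual-information helper simply transcribes the formula given a joint table supplied by the caller; all names and the default β are illustrative.

```python
import numpy as np

def histogram_mutual_information(p_joint):
    # MI(c', c_j) = sum_x sum_y p(x,y) log( p(x,y) / (p(x) p(y)) ).
    # p_joint: (n_c' x n_cj) joint probability table over histogram buckets;
    # how it is estimated from the two views is not specified in the claim.
    p_joint = p_joint / p_joint.sum()
    px = p_joint.sum(axis=1, keepdims=True)
    py = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (px @ py)[mask])).sum())

def suboptimal_score(ig_c, mi_to_selected, beta=0.5):
    # IG_{c'} - beta * sum_{c_j in C_s} MI(c', c_j), with 0 <= beta <= 1
    return ig_c - beta * sum(mi_to_selected)
```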
CN201210488434.1A 2012-11-26 2012-11-26 Method for selecting camera combination in visual perception network Expired - Fee Related CN102932605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210488434.1A CN102932605B (en) 2012-11-26 2012-11-26 Method for selecting camera combination in visual perception network

Publications (2)

Publication Number Publication Date
CN102932605A CN102932605A (en) 2013-02-13
CN102932605B true CN102932605B (en) 2014-12-24

Family

ID=47647293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210488434.1A Expired - Fee Related CN102932605B (en) 2012-11-26 2012-11-26 Method for selecting camera combination in visual perception network

Country Status (1)

Country Link
CN (1) CN102932605B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294813A (en) * 2013-06-07 2013-09-11 北京捷成世纪科技股份有限公司 Sensitive image search method and device
CN104915677B (en) * 2015-05-25 2018-01-05 宁波大学 A kind of 3 D video method for tracking target
CN104918011B (en) * 2015-05-29 2018-04-27 华为技术有限公司 A kind of method and device for playing video
CN108449551B (en) * 2018-02-13 2020-11-03 深圳市瑞立视多媒体科技有限公司 Camera configuration method and device
CN108234900B (en) * 2018-02-13 2020-11-20 深圳市瑞立视多媒体科技有限公司 Camera configuration method and device
CN108495057B (en) * 2018-02-13 2020-12-08 深圳市瑞立视多媒体科技有限公司 Camera configuration method and device
EP3490245A4 (en) * 2016-08-09 2020-03-11 Shenzhen Realis Multimedia Technology Co., Ltd. Camera configuration method and device
CN108471496B (en) * 2018-02-13 2020-11-03 深圳市瑞立视多媒体科技有限公司 Camera configuration method and device
CN106778777B (en) * 2016-11-30 2021-07-06 成都通甲优博科技有限责任公司 Vehicle matching method and system
CN107888897B (en) * 2017-11-01 2019-11-26 南京师范大学 A kind of optimization method of video source modeling scene
CN108875507B (en) * 2017-11-22 2021-07-23 北京旷视科技有限公司 Pedestrian tracking method, apparatus, system, and computer-readable storage medium
CN108491857B (en) * 2018-02-11 2022-08-09 中国矿业大学 Multi-camera target matching method with overlapped vision fields
CN109639961B (en) * 2018-11-08 2021-05-18 联想(北京)有限公司 Acquisition method and electronic equipment
CN109493634A (en) * 2018-12-21 2019-03-19 深圳信路通智能技术有限公司 A kind of parking management system and method based on multiple-equipment team working
CN111447404B (en) * 2019-01-16 2022-02-01 杭州海康威视数字技术股份有限公司 Video camera
CN110505397B (en) * 2019-07-12 2021-08-31 北京旷视科技有限公司 Camera selection method, device and computer storage medium
CN111866468B (en) * 2020-07-29 2022-06-24 浙江大华技术股份有限公司 Object tracking distribution method, device, storage medium and electronic device
CN114900602B (en) * 2022-06-08 2023-10-17 北京爱笔科技有限公司 Method and device for determining video source camera
CN117750040A (en) * 2024-02-20 2024-03-22 浙江宇视科技有限公司 Video service balancing method, device, equipment and medium of intelligent server cluster

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7840059B2 (en) * 2006-09-21 2010-11-23 Microsoft Corporation Object recognition using textons and shape filters

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081666A (en) * 2011-01-21 2011-06-01 北京大学 Index construction method for distributed picture search and server
CN102208038A (en) * 2011-06-27 2011-10-05 清华大学 Image classification method based on visual dictionary
CN102509110A (en) * 2011-10-24 2012-06-20 中国科学院自动化研究所 Method for classifying images by performing pairwise-constraint-based online dictionary reweighting
CN102609732A (en) * 2012-01-31 2012-07-25 中国科学院自动化研究所 Object recognition method based on generalization visual dictionary diagram
CN102693311A (en) * 2012-05-28 2012-09-26 中国人民解放军信息工程大学 Target retrieval method based on group of randomized visual vocabularies and context semantic information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhao Chunhui, et al. An optimized image classification method based on the bag-of-words model. Journal of Electronics & Information Technology (电子与信息学报), 2012, Vol. 34, No. 9. *
Zhao Chunhui, et al. An improved K-means clustering method for visual dictionary construction. Chinese Journal of Scientific Instrument (仪器仪表学报), 2012, Vol. 33, No. 10. *

Also Published As

Publication number Publication date
CN102932605A (en) 2013-02-13

Similar Documents

Publication Publication Date Title
CN102932605B (en) Method for selecting camera combination in visual perception network
CN111832655B (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN101950426B (en) Vehicle relay tracking method in multi-camera scene
CN110555434B (en) Method for detecting visual saliency of three-dimensional image through local contrast and global guidance
CN111899172A (en) Vehicle target detection method oriented to remote sensing application scene
CN112084869B (en) Compact quadrilateral representation-based building target detection method
CN103824070B (en) A kind of rapid pedestrian detection method based on computer vision
CN109598290A (en) A kind of image small target detecting method combined based on hierarchical detection
CN106157307A (en) A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF
CN102256065B (en) Automatic video condensing method based on video monitoring network
CN107204010A (en) A kind of monocular image depth estimation method and system
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN103856727A (en) Multichannel real-time video splicing processing system
CN104517095B (en) A kind of number of people dividing method based on depth image
CN102243765A (en) Multi-camera-based multi-objective positioning tracking method and system
CN105869178A (en) Method for unsupervised segmentation of complex targets from dynamic scene based on multi-scale combination feature convex optimization
CN112288627B (en) Recognition-oriented low-resolution face image super-resolution method
Takahashi et al. Expandable YOLO: 3D object detection from RGB-D images
CN103020606A (en) Pedestrian detection method based on spatio-temporal context information
CN102509104A (en) Confidence map-based method for distinguishing and detecting virtual object of augmented reality scene
CN106503170B (en) It is a kind of based on the image base construction method for blocking dimension
CN111160291A (en) Human eye detection method based on depth information and CNN
CN113989758A (en) Anchor guide 3D target detection method and device for automatic driving
CN111914615A (en) Fire-fighting area passability analysis system based on stereoscopic vision
CN106529441A (en) Fuzzy boundary fragmentation-based depth motion map human body action recognition method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141224

Termination date: 20181126