CN102932605A - Method for selecting camera combination in visual perception network - Google Patents

Method for selecting camera combination in visual perception network Download PDF

Info

Publication number
CN102932605A
CN102932605A, CN102932605B, CN201210488434A
Authority
CN
China
Prior art keywords
video camera
histogram
video
image
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012104884341A
Other languages
Chinese (zh)
Other versions
CN102932605B (en)
Inventor
孙正兴
李骞
陈松乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN201210488434.1A priority Critical patent/CN102932605B/en
Publication of CN102932605A publication Critical patent/CN102932605A/en
Application granted granted Critical
Publication of CN102932605B publication Critical patent/CN102932605B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Image Analysis (AREA)
  • Studio Devices (AREA)

Abstract

The invention discloses a method for selecting a camera combination in a visual sensor network. The method comprises the following steps. Online generation of target-image visual histograms: when the fields of view of several cameras overlap, motion detection is performed on the online video data of the multiple cameras observing the same object, the sub-region occupied by the object in the video-frame image space is determined from the detection result to obtain the target image region, local features are extracted from the target image region, and the visual histogram of the target image region at each viewing angle is computed with a visual dictionary generated by prior training. Sequential forward camera selection: the optimal viewing angle, i.e. the optimal camera, is selected first; then, within the set of unselected cameras, a suboptimal camera is selected, added to the set of selected cameras and removed from the set of candidate cameras, and this step is repeated until the number of selected cameras reaches the required number.

Description

A method for selecting a camera combination in a visual sensor network
Technical field
The present invention relates to camera selection methods and belongs to the technical field of computer vision and video data processing; it specifically concerns a method for selecting a camera combination in a visual sensor network.
Background art
In recent years, as cameras have been widely used in fields such as security surveillance, human-computer interaction, navigation and positioning, and battlefield-environment perception, multi-camera systems have become one of the research hotspots of computer vision and its applications. In particular, in video-based surveillance and human-computer interaction, a visual sensor network (VSN) composed of multiple cameras can effectively solve problems such as self-occlusion that a single camera suffers from when observing a target, but it also produces a large amount of redundant information and increases the burden of system storage, visual computation and network transmission. How to choose and push the most informative video streams from the multiple channels has therefore become one of the key issues of visual sensor networks and their applications. A problem similar to video-based camera selection is viewpoint selection for observing three-dimensional models, which has been studied extensively in the graphics field. For example, document 1 (Vazquez P, Sbert M. Fast adaptive selection of best views. Lecture Notes in Computer Science, 2003, 2669:295-305) computes the viewpoint entropy of a known geometric model under different viewing angles and selects the optimal viewpoint according to its value. Unlike camera selection, however, this requires an accurate model definition of the observed object to be available in advance, and because the model is mostly built in a controlled graphics environment the analysis does not need to consider factors such as background and illumination. On the other hand, general sensor-network node selection, e.g. document 2 (Mo Y., Ambrosino R., Sinopoli B. Sensor Selection Strategies for State Estimation in Energy Constrained Wireless Sensor Networks. Automatica, 2011, 47(7):1330-1338) and document 3 (Huber M. F. Optimal pruning for multi-step sensor scheduling. IEEE Transactions on Automatic Control, 2012, 57(5):1338-1343), uses the position between the observed target and the sensor as the basis for node selection, whereas camera perception of the environment is directional: the optimal camera cannot be chosen simply from the positional relationship between the target and the camera nodes. In security surveillance, for example, a frontal image of a person is preferred over a close rear view.
Existing camera selection methods can be divided, according to how the fields of view of the camera nodes in the visual sensor network overlap, into wide-area methods without overlapping fields of view and methods with partially or fully overlapping fields of view. Methods without overlapping fields of view serve demands such as continuous tracking of targets over a large area, and select nodes in a dispersedly deployed camera network according to predictions of the target motion. The present invention, aimed at application demands such as security surveillance and human-computer interaction, mainly studies camera selection when the same target is observed within a partially or fully overlapping field of view. These methods can in turn be divided into single-camera selection and camera-combination selection according to the number of cameras switched on. Single-camera selection chooses, at a given selection time point, only one optimal viewing angle as output according to a proposed selection criterion; the design of the criterion, i.e. the evaluation of the amount of visual information, is then the key to camera selection, and the similarity between the information captured by the different cameras need not be considered. In terms of criterion design, such methods can usually be divided into selection based on video image content and selection based on the spatial position of the target in the objective world. For example, document 4 (Daniyal F., Taj M., Cavallaro A. Content and task-based view selection from multiple video streams. Multimedia Tools and Applications, 2010, 46:235-258) extracts video features such as the number, type, size and position of moving objects and whether events occur in the video, and realizes content-based camera selection according to the contextual information of these features; such methods only extract features from and score the video content captured by each camera, and do not measure the similarity of the content perceived by the different nodes of the camera network. Methods that select cameras according to the spatial position of the target, such as document 5 (Park J, Bhat C, Kak A C. A look-up table based approach for solving the camera selection problem in large camera networks. ACM Workshop on Distributed Smart Cameras, 2006), build a look-up table that maps the space within the cameras' fields of view to the corresponding cameras, and during selection choose the nearest camera node from the table according to the spatial relationship between the target and each camera. These methods presuppose that the cameras in the scene have been accurately calibrated, otherwise the accurate spatial position of the target cannot be obtained from the video images; moreover they do not consider the orientation of the target in the scene, so a frontal image cannot always be captured in applications such as security surveillance. All the above methods select only one optimal camera and do not consider compensating for the viewing-angle limitations of individual cameras by combining several of them, and therefore cannot account for the information similarity and redundancy between cameras.
When resources and processing capacity permit, selecting several cameras to form a camera combination can, compared with single-camera selection, effectively overcome the self-occlusions and blind areas of the latter by increasing the number of information sources. A camera combination could be formed by selecting the optimal viewing angle one camera at a time, but because cameras at different angles capture images with similar content there is data redundancy of varying degree; in general, the combination of individually optimal cameras is not the optimal camera combination. For example, although two cameras that both photograph the front of the target each carry a large amount of information on their own, their combined information is usually less than that of one frontal and one side camera. So far, relatively little research has been devoted to camera-combination selection.
Summary of the invention
Object of the invention: the technical problem to be solved by the present invention is to address the deficiencies of the prior art by proposing a method for selecting a camera combination in a visual sensor network.
Technical solution: the method for selecting a camera combination in a visual sensor network disclosed by the present invention comprises the following steps:
Step 1, online generation of target-image visual histograms: when the fields of view of several cameras overlap, motion detection is performed on the online video data of the multiple cameras that contain the target; the sub-region occupied by the target in the video-frame image space is determined from the detection result, giving the target image region; local features are extracted from the target image region; and, with the visual dictionary generated by prior training, the visual histogram of the target image region at this viewing angle is computed;
Step 2, sequential forward camera selection: at each time point, one optimal viewing angle, i.e. the optimal camera, is selected; within the set of unselected cameras, the information gain of the target image in each candidate camera's video and the mutual information between the candidate camera and the set of selected cameras are computed from the visual histograms of step 1; the suboptimal camera, namely the one with a large information gain for the observed target and a small mutual information with (i.e. low image-content similarity to) the selected cameras, is chosen, added to the set of selected cameras and removed from the set of candidate cameras; the above step is repeated until the selected-camera count reaches the preset value.
Visual dictionary: because the scale-invariant feature transform (SIFT) can well overcome effects such as the illumination differences and scaling produced by different cameras, the present invention uses SIFT descriptors as visual-dictionary words. For the video frame images of the multi-channel video used as training data, the set of SIFT local feature descriptor vectors of each image is first extracted; k-means clustering is then applied to the SIFT feature descriptors extracted from all the images; each cluster centre is regarded as a visual word, and the resulting set of visual words constitutes the off-line-trained visual dictionary. This specifically comprises the following steps:
Extraction of image SIFT feature descriptor vectors: for each input video frame, Gaussian templates are used to filter the image and obtain the gradient components I_x and I_y in the x and y directions, from which the gradient magnitude and direction of each pixel are computed as

mag(x, y) = sqrt( I_x(x, y)² + I_y(x, y)² ), θ(x, y) = arctan(I_y / I_x).

Starting from the upper-left corner of the image, a 16 × 16 window is taken every 8 pixels in the x and y directions as a feature-extraction sampling window; the window is divided into 4 × 4 square grid regions; for the sampled points in each region the gradient direction relative to the window centre is computed, and the gradient magnitudes of the sampled points, after Gaussian weighting by distance, are accumulated into a gradient orientation histogram over 8 directions within the region; each sampling window thus generates a 4 × 4 × 8 = 128-dimensional feature vector, which is normalized to form the window's local feature descriptor vector. The descriptor vectors computed for each image are added to the feature descriptor set F = {f^(1), f^(2), f^(3), ..., f^(t)}, f^(i) ∈ R^128, 1 ≤ i ≤ t, where f^(i) is the i-th descriptor vector of this image's descriptor set, R^128 indicates that the vectors are 128-dimensional, and t is the total number of feature descriptors extracted from this image;
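By way of illustration only, the dense extraction described above might be sketched as follows; this is an assumed NumPy rendering of the 16 × 16 window / 4 × 4 cell / 8 orientation-bin layout, and the Gaussian weighting width, the stride handling and all function names are the sketch's own, not taken from the patent.

```python
import numpy as np

def dense_descriptors(gray, step=8, win=16, cells=4, bins=8):
    """Extract 128-D gradient-orientation descriptors on a dense grid,
    following the 16x16 window / 4x4 cells / 8 orientation bins layout."""
    gy, gx = np.gradient(gray.astype(np.float64))        # I_y, I_x
    mag = np.sqrt(gx ** 2 + gy ** 2)                      # mag(x, y)
    ang = np.arctan2(gy, gx) % (2 * np.pi)                # theta(x, y)

    # Gaussian weighting of magnitudes by distance to the window centre
    c = (win - 1) / 2.0
    yy, xx = np.mgrid[0:win, 0:win]
    gauss = np.exp(-((xx - c) ** 2 + (yy - c) ** 2) / (2 * (win / 2.0) ** 2))

    cell = win // cells
    feats = []
    for y0 in range(0, gray.shape[0] - win + 1, step):
        for x0 in range(0, gray.shape[1] - win + 1, step):
            m = mag[y0:y0 + win, x0:x0 + win] * gauss
            a = ang[y0:y0 + win, x0:x0 + win]
            desc = np.zeros((cells, cells, bins))
            for cy in range(cells):
                for cx in range(cells):
                    mm = m[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
                    aa = a[cy * cell:(cy + 1) * cell, cx * cell:(cx + 1) * cell]
                    idx = (aa / (2 * np.pi) * bins).astype(int) % bins
                    for b in range(bins):                 # 8-direction orientation histogram
                        desc[cy, cx, b] = mm[idx == b].sum()
            v = desc.ravel()
            n = np.linalg.norm(v)
            feats.append(v / n if n > 0 else v)           # normalized 128-D descriptor
    return np.array(feats)
```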
k-means clustering of the feature vectors: for the set F of SIFT feature descriptor vectors extracted from the frame images, k vectors are chosen at random as initial cluster centres; after all feature vectors have been assigned to the nearest cluster centre, new cluster centres are recomputed; the iteration continues until the iteration limit is reached or the change of the cluster-centre distances falls below a threshold. In the present invention the iteration stops when the number of iterations reaches 50-200 or the cluster-centre distance change is less than 0.02.
Construction of the visual dictionary: each cluster centre is regarded as a visual word; the set of visual words is obtained and stored, and constitutes the off-line-trained visual dictionary.
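A minimal sketch of this off-line dictionary training, assuming the descriptors of all training frames have been stacked into one (N, 128) array and that scikit-learn's KMeans stands in for the clustering loop described above; k = 200 follows the later embodiment, while the remaining parameter choices are only illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_visual_dictionary(all_descriptors: np.ndarray, k: int = 200) -> np.ndarray:
    """Cluster the pooled 128-D descriptors; the k cluster centres are the
    visual words making up the off-line visual dictionary. max_iter and tol
    roughly mirror the stopping rule quoted in the text."""
    km = KMeans(n_clusters=k, init="random", n_init=1, max_iter=100,
                tol=0.02, random_state=0)
    km.fit(all_descriptors)              # all_descriptors: shape (N, 128)
    return km.cluster_centers_           # visual dictionary, shape (k, 128)
```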
The online generation of the target-image visual histogram in step 1 of the present invention specifically comprises the following steps:
Step 11, video motion detection: a mixed Gaussian model is applied to the video data input by each camera to perform motion detection; for every frame, the shadow cast by the target in the scene is removed from the detection result with a texture-based method, and the region occupied by the moving target in image space is extracted.
Step 12, extraction of local feature descriptors of the region image: the set of SIFT feature descriptor vectors is extracted from the moving-target region image obtained in step 11;
Step 13, visual histogram generation: with the cluster centres of the pre-trained visual dictionary as histogram buckets, the SIFT feature descriptor vectors extracted from the moving-target region image in step 12 are assigned to the corresponding histogram buckets; the number of descriptor vectors in each bucket is counted, and the histogram is finally normalized, thereby generating the visual histograms of the moving target under the different viewing angles.
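A possible sketch of step 13, assuming the dictionary returned above: each target-region descriptor is assigned to its nearest visual word and the bucket counts are normalized into that view's visual histogram (function and variable names are illustrative).

```python
import numpy as np

def visual_histogram(descriptors, dictionary):
    """Bin the target-region descriptors by nearest visual word and
    normalize the counts into a visual histogram."""
    k = dictionary.shape[0]
    hist = np.zeros(k)
    if len(descriptors) == 0:
        return hist
    # squared distances of every descriptor to every visual word
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    for b in nearest:
        hist[b] += 1
    return hist / hist.sum()           # normalized visual histogram
```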
Step 2 of the present invention specifically comprises the following steps:
Step 21, initialization of the selection: a camera set C = {c_1, c_2, ..., c_m} in the scene observes the moving target simultaneously, m being the total number of cameras; the selected camera set C_s = ∅ and the candidate camera set C_u = C; the SIFT feature descriptor vector sets of all candidate cameras are merged, and the merged visual histogram H_merge is generated by step 13.
Step 22, optimal camera selection: from the candidate camera set C_u, the optimal camera c* is selected by jointly considering the face detection result, the information gain of the moving-target region image and the image sharpness; it is added to the selected camera set, i.e. C_s = {c*}, and removed from the candidate camera set, i.e. C_u = C_u \ {c*}; the selected-camera count is initialized to 1. The concrete steps are:
Step 221, face detection: an AdaBoost face detector is applied to the moving-target region image of candidate camera c'. (AdaBoost, Adaptive Boosting, is an iterative algorithm whose core idea is to train different weak classifiers on the same training set and then combine them into a stronger final classifier; it is an improvement of the Boosting algorithm that adaptively adjusts for the errors of the weak classifiers obtained by weak learning.) The detection result is V_face ∈ {0, 1}, where 1 indicates that a face is detected and 0 otherwise;
Step 222, computation of the information gain of the moving-target region image: from the visual histogram H_c' of camera c' and the merged visual histogram H_merge, the information gain V_IG of the amount of visual information between selecting and not selecting camera c' is computed as

V_IG = Σ_k p(b_k, c') · log[ p(b_k, c') / (p(b_k) · p(c')) ] + Σ_k p(b_k, c̄') · log[ p(b_k, c̄') / (p(b_k) · p(c̄')) ],

where p(b_k, c') is the joint probability of selecting camera c' and the k-th bucket of the visual histogram H_merge, p(b_k, c̄') is the joint probability of not selecting camera c' and the k-th bucket of H_merge, p(b_k) is the probability of the k-th bucket, and p(c') and p(c̄') are the probabilities of selecting and not selecting camera c', respectively, all computed from the histograms H_c' and H_merge;
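The patent states that all probabilities in V_IG are computed from H_c' and H_merge but does not spell out the estimate; the sketch below adopts one plausible reading in which p(c') is the share of descriptors contributed by camera c', p(b_k | c') is its normalized histogram, and p(b_k) comes from the merged histogram. This is an assumption, not the authors' stated implementation.

```python
import numpy as np

def information_gain(hist_c, n_c, hist_merge, n_total, eps=1e-12):
    """V_IG between the bucket variable b_k and the event 'camera c' selected'.
    hist_c: normalized histogram of camera c' (built from n_c descriptors);
    hist_merge: normalized merged histogram (built from n_total descriptors)."""
    p_c = n_c / n_total                                   # p(c')
    p_nc = 1.0 - p_c                                      # p(not c')
    p_b = np.asarray(hist_merge, dtype=float)             # p(b_k)
    p_b_c = np.asarray(hist_c, dtype=float)               # p(b_k | c')
    # histogram of the descriptors of the remaining cameras: p(b_k | not c')
    p_b_nc = np.clip((p_b * n_total - p_b_c * n_c) / max(n_total - n_c, 1), 0, None)

    v_ig = 0.0
    for pb, pbc, pbn in zip(p_b, p_b_c, p_b_nc):
        j_c, j_n = pbc * p_c, pbn * p_nc                  # joints p(b_k, c'), p(b_k, not c')
        if j_c > 0:
            v_ig += j_c * np.log(j_c / (pb * p_c + eps))
        if j_n > 0:
            v_ig += j_n * np.log(j_n / (pb * p_nc + eps))
    return v_ig
```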
Step 223, target-image sharpness computation: from the gradient (I_x, I_y) of the target-region image, the average gradient magnitude

V_grad = (1 / (N_x · N_y)) · Σ_x Σ_y sqrt( I_x(x, y)² + I_y(x, y)² )

is computed to characterize the clarity of the target in the image, where N_x and N_y are the width and height of the image.
Step 224, optimal camera selection: with weighting coefficients α_1, α_2, α_3 satisfying α_1 + α_2 + α_3 = 1, the camera

c* = argmax_{c'} ( α_1 · V_face + α_2 · V_IG + α_3 · V_grad )

is selected as the optimal camera, and the selected-camera count value count is set to 1;
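A sketch of the weighted scoring in step 224, assuming V_face, V_IG and the target-region image are already available per candidate; the weights shown are the ones used later in embodiment 1, and the dictionary-based data layout is the sketch's own. In practice the three terms live on different scales, so some normalization of V_IG and the average gradient would typically be applied first.

```python
import numpy as np

def average_gradient(region):
    """V_grad: mean gradient magnitude of the target-region image (sharpness term)."""
    gy, gx = np.gradient(region.astype(np.float64))
    return float(np.sqrt(gx ** 2 + gy ** 2).mean())

def pick_optimal_camera(candidates, a1=0.3, a2=0.4, a3=0.3):
    """candidates: dict camera_id -> {'v_face': 0 or 1, 'v_ig': float, 'region': 2-D array}.
    Returns the id maximizing a1*V_face + a2*V_IG + a3*V_grad (step 224)."""
    def score(cid):
        c = candidates[cid]
        return a1 * c["v_face"] + a2 * c["v_ig"] + a3 * average_gradient(c["region"])
    return max(candidates, key=score)
```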
Step 23, suboptimal camera selection: with the selected camera set C_s no longer empty, each iteration computes the information gain of every candidate camera and its visual-histogram mutual information with the selected cameras, selects the suboptimal camera c*, adds it to the selected camera set, removes it from the candidate camera set C_u and increments the selected-camera count, i.e. count = count + 1;
Step 231, the target-region image information gain IG_c' of candidate camera c' is computed with the method of step 222;
Step 232, computation of the mutual information between a candidate camera and the selected cameras: the mutual information MI(c', c_j) between the visual histogram H_c' of the target-region image of candidate camera c' and the visual histogram H_cj of each camera c_j ∈ C_s in the selected camera set expresses the similarity of the visual content of the target-region images of the two cameras:

MI(c', c_j) = Σ_{x=1}^{n_c'} Σ_{y=1}^{n_cj} p(H_c'^x, H_cj^y) · log[ p(H_c'^x, H_cj^y) / (p(H_c'^x) · p(H_cj^y)) ],

where H_c'^x is the x-th bucket of histogram H_c', H_cj^y is the y-th bucket of histogram H_cj, n_c' is the total number of buckets of the visual histogram H_c' of candidate camera c', and n_cj is the total number of buckets of the visual histogram H_cj of selected camera c_j;
Step 233, suboptimal camera selection: with a weighting coefficient β, 0 ≤ β ≤ 1, the camera

c'* = argmax_{c'} ( IG_c' − β · Σ_{c_j ∈ C_s} MI(c', c_j) )

is selected as the suboptimal camera;
Step 234, the chosen suboptimal camera c'* is added to the selected camera set, C_s = C_s ∪ {c'*}, removed from the candidate camera set, C_u = C_u \ {c'*}, and the selected-camera count is incremented, i.e. count = count + 1.
Step 24, step 23 is repeated until the selected-camera count count reaches the preset camera count n, where n is a natural number set according to the concrete needs.
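The greedy loop of steps 23-24 can be sketched as follows. The patent does not fix how the joint distribution p(H_c'^x, H_cj^y) over bucket pairs is estimated, so the sketch simply takes a joint matrix supplied by a caller-provided joint_fn and computes MI from it; everything else (names, data layout) is likewise assumed.

```python
import numpy as np

def mutual_information(joint, eps=1e-12):
    """MI(c', c_j) from a joint distribution over bucket pairs (n_c' x n_cj matrix)."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / (joint.sum() + eps)
    px = joint.sum(axis=1, keepdims=True)      # marginal over H_c' buckets
    py = joint.sum(axis=0, keepdims=True)      # marginal over H_cj buckets
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / ((px @ py)[nz] + eps))).sum())

def sequential_forward_selection(cameras, first, n, joint_fn, beta=0.5):
    """cameras: dict id -> {'ig': information gain IG_c'}; first: id of the
    optimal camera chosen in step 22; joint_fn(ci, cj): joint bucket distribution.
    Implements c'* = argmax_c' ( IG_c' - beta * sum_j MI(c', c_j) )."""
    selected = [first]                          # C_s after step 22
    candidates = set(cameras) - {first}         # C_u
    while len(selected) < n and candidates:
        def objective(c):
            mi_sum = sum(mutual_information(joint_fn(c, s)) for s in selected)
            return cameras[c]["ig"] - beta * mi_sum
        best = max(candidates, key=objective)   # step 233
        selected.append(best)                   # step 234: C_s = C_s U {c'*}
        candidates.remove(best)                 # C_u = C_u \ {c'*}
    return selected
```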
Beneficial effect: to overcome the loss of information caused by self-occlusion of the target when a single camera is used in applications such as target surveillance and human-computer interaction, and the large amount of redundant information produced when all cameras are used simultaneously, the present invention discloses a camera-combination selection method tailored to these application demands. At each selection time point, camera nodes are selected progressively to pick out the camera combination with the richest information and the least redundancy, i.e. n cameras (n < m) are chosen from the m candidate cameras observing the same target, so as to satisfy the constraints of computation, storage and network capacity.
Specifically, compared with existing methods the present invention has the following advantages: 1. for a visual sensor network with a common field of view, several cameras are selected to form a camera combination under constrained computation and storage capacity, which effectively solves problems such as self-occlusion brought by single-camera selection while reducing the information redundancy of using all cameras; 2. the optimal camera is selected by combining face detection, camera information gain and image sharpness, ensuring that a camera with a frontal, detailed and clear target image is selected; 3. suboptimal cameras are selected progressively from the candidate cameras in a sequential forward manner, which considers both the information each camera contributes to the observed target and, through the mutual-information term, the information redundancy between different cameras, avoiding the same-angle problem that easily arises when several optimal cameras are selected under the same criterion; 4. the off-line-learned visual dictionary and the visual histograms constructed under the different viewing angles establish an association between the target images observed by the several cameras; 5. SIFT local feature descriptors are chosen as visual words, which effectively reduces the influence of factors such as scaling, illumination and viewing angle across different cameras.
Description of drawings
The present invention is further described below in conjunction with the drawings and specific embodiments, and the above and/or other advantages of the present invention will become more apparent.
Fig. 1 is a schematic diagram of the processing flow of the present invention.
Figs. 2a-2d are the 180th video frame images of the cameras in the first embodiment.
Figs. 3a-3d are the motion images obtained after applying the motion detection of step 11 to Figs. 2a-2d.
Figs. 4a-4d are the target images obtained from Figs. 2a-2d according to the motion detection result of step 11.
Figs. 5a-5d are the visual histograms obtained after applying step 13 to Figs. 4a-4d.
Figs. 6a-6h are the 62nd video frame images of the 8 cameras in the second embodiment.
Figs. 7a-7h are the moving-target images obtained after applying step 11 to Figs. 6a-6h.
Embodiment:
The invention discloses a camera selection method based on camera information gain and inter-camera information redundancy, comprising the following steps:
Step 1, online generation of target-image visual histograms: the fields of view of several cameras in the system are assumed to overlap, so the moving target in the scene can be observed simultaneously. First, motion detection is performed on the online multi-channel video data containing the target, and the sub-region occupied by the target in the video-frame image space is determined from the detection result; local features are extracted from the target image region; and, with the visual dictionary generated by prior training, the visual histogram of the target image region at this viewing angle is computed. This specifically comprises the following content:
Step 11, video motion detection: a mixed Gaussian model is applied to the video data input by each camera to perform motion detection; for every frame, the shadow cast by the target in the scene is removed from the detection result with a texture-based method, and the region occupied by the moving target in image space is extracted.
Step 12, extraction of local feature descriptors of the region image: the set of SIFT feature descriptor vectors is extracted from the moving-target region image obtained in step 11;
Step 13, visual histogram generation: with the cluster centres obtained by off-line computation as histogram buckets (in the present invention the feature space is divided into several small intervals, each interval being one histogram bucket), the SIFT feature descriptor vectors extracted from the moving-target region image in step 12 are assigned to the corresponding buckets; the number of descriptor vectors in each bucket is counted, and the histogram is finally normalized, thereby generating the visual histograms of the moving target under the different viewing angles.
Step 2, sequential forward camera selection: one optimal viewing angle, i.e. the optimal camera, is selected; within the set of unselected cameras, the information gain of the target image in each candidate camera's video and the mutual information between the candidate camera and the set of selected cameras are computed from the visual histograms of step 1; the suboptimal camera, with a large information gain for the observed target and a small mutual information with (i.e. low image similarity to) the selected cameras, is chosen, added to the set of selected cameras and removed from the set of candidate cameras; the above step is repeated until the selected-camera count reaches the preset value.
Step 21, initialization of the selection: a camera set C = {c_1, c_2, ..., c_m} in the visual-sensor-network scene observes the moving target simultaneously, m being the total number of cameras; the selected camera set C_s = ∅ and the candidate camera set C_u = C; the SIFT feature descriptor vector sets of all candidate cameras are merged, and the merged visual histogram H_merge is generated.
Step 22, optimal camera selection: from the candidate camera set C_u, the optimal camera c* is selected by jointly considering the face detection result, the information gain of the moving-target region image and the image sharpness; it is added to the selected camera set, i.e. C_s = {c*}, and removed from the candidate camera set, i.e. C_u = C_u \ {c*}; the selected-camera count is initialized to 1. The concrete steps are:
Step 221, face detection: an AdaBoost (Adaptive Boosting) face detector is applied to the moving-target region image of candidate camera c'; the detection result is V_face ∈ {0, 1}, where 1 indicates that a face is detected and 0 otherwise;
Step 222, computation of the information gain of the moving-target region image: from the visual histogram H_c' of camera c' and the merged visual histogram H_merge, the information gain V_IG of the amount of visual information between selecting and not selecting camera c' is computed as

V_IG = Σ_k p(b_k, c') · log[ p(b_k, c') / (p(b_k) · p(c')) ] + Σ_k p(b_k, c̄') · log[ p(b_k, c̄') / (p(b_k) · p(c̄')) ],

where p(b_k, c') is the joint probability of selecting camera c' and the k-th bucket of the visual histogram H_merge, p(b_k, c̄') is the joint probability of not selecting camera c' and the k-th bucket of H_merge, p(b_k) is the probability of the k-th bucket, and p(c') and p(c̄') are the probabilities of selecting and not selecting camera c', respectively, all computed from the histograms H_c' and H_merge;
Step 223, target-image sharpness computation: from the gradient (I_x, I_y) of the target-region image, the average gradient magnitude

V_grad = (1 / (N_x · N_y)) · Σ_x Σ_y sqrt( I_x(x, y)² + I_y(x, y)² )

is computed to characterize the clarity of the target in the image.
Step 224, optimal camera selection: with weighting coefficients α_1, α_2, α_3 satisfying α_1 + α_2 + α_3 = 1, the camera c* = argmax_{c'} ( α_1 · V_face + α_2 · V_IG + α_3 · V_grad ) is selected as the optimal camera, and the selected-camera count value count is set to 1;
Step 23, suboptimal camera selection: with the selected camera set C_s no longer empty, each iteration computes the information gain of every candidate camera and its visual-histogram mutual information with the selected cameras, selects the suboptimal camera c*, adds it to the selected camera set, removes it from the candidate camera set C_u and increments the selected-camera count, i.e. count = count + 1;
Step 231, the target-region image information gain IG_c' of candidate camera c' is computed with the method of step 222;
Step 232, computation of the mutual information between a candidate camera and the selected cameras: the mutual information MI(c', c_j) between the visual histogram H_c' of the target-region image of candidate camera c' and the visual histogram H_cj of each camera c_j ∈ C_s in the selected camera set expresses the similarity of the visual content of the target-region images of the two cameras:

MI(c', c_j) = Σ_{x=1}^{n_c'} Σ_{y=1}^{n_cj} p(H_c'^x, H_cj^y) · log[ p(H_c'^x, H_cj^y) / (p(H_c'^x) · p(H_cj^y)) ],

where H_c'^x is the x-th bucket of histogram H_c', H_cj^y is the y-th bucket of histogram H_cj, n_c' is the total number of buckets of the visual histogram H_c' of candidate camera c', and n_cj is the total number of buckets of the visual histogram H_cj of selected camera c_j;
Step 233, suboptimal camera selection: with a weighting coefficient β, 0 ≤ β ≤ 1, the camera c'* = argmax_{c'} ( IG_c' − β · Σ_{c_j ∈ C_s} MI(c', c_j) ) is selected as the suboptimal camera;
Step 234, the chosen suboptimal camera c'* is added to the selected camera set, C_s = C_s ∪ {c'*}, removed from the candidate camera set, C_u = C_u \ {c'*}, and the selected-camera count is incremented, i.e. count = count + 1.
Step 24, step 23 is repeated until the selected-camera count count reaches the preset camera count n.
Off-line training of the visual dictionary: for the multi-channel video frame images used as training data, the set of scale-invariant feature transform (SIFT) local feature descriptor vectors of each image is first extracted; k-means clustering is then applied to the SIFT feature descriptors extracted from all images; each cluster centre is a descriptor vector and is regarded as a visual word, and the resulting set of visual words constitutes the off-line-trained visual dictionary. This specifically comprises the following steps:
Extraction of image SIFT feature descriptor vectors: for each input video frame, Gaussian templates are used to filter the image and obtain the gradient components I_x and I_y in the x and y directions, from which the gradient magnitude and direction of each pixel are computed as mag(x, y) = sqrt( I_x(x, y)² + I_y(x, y)² ) and θ(x, y) = arctan(I_y / I_x). Starting from the upper-left corner of the image, a 16 × 16 window is taken every 8 pixels in the x and y directions as a feature-extraction sampling window; the window is divided into 4 × 4 square grid regions; for the sampled points in each region the gradient direction relative to the window centre is computed, and the gradient magnitudes of the sampled points, after Gaussian weighting by distance, are accumulated into a gradient orientation histogram over 8 directions within the region; each sampling window thus generates a 4 × 4 × 8 = 128-dimensional feature vector, which is normalized to form the window's local feature descriptor vector. The descriptor vectors computed for each image are added to the feature descriptor set F = {f^(1), f^(2), f^(3), ..., f^(t)}, f^(i) ∈ R^128, 1 ≤ i ≤ t, where f^(i) is the i-th descriptor vector of this image's descriptor set, R^128 indicates that the vectors are 128-dimensional, and t is the total number of feature descriptor vectors extracted from this image;
k-means clustering of the feature vectors: for the set F of SIFT feature descriptor vectors extracted from the frame images, k vectors are chosen at random as initial cluster centres; after all feature vectors have been assigned to the nearest cluster centre, new cluster centres are recomputed; the iteration continues until the iteration limit is reached or the change of the cluster centres falls below a threshold. In the present invention the iteration stops when the number of iterations reaches 100 or the cluster-centre distance change is less than 0.02.
Construction of the visual dictionary: each cluster centre is regarded as a visual word; the set of visual words is obtained and stored, and constitutes the off-line-trained visual dictionary.
Embodiment 1
The present embodiment comprises off-line training that generates the visual dictionary, online generation of the target-image visual histograms, and sequential forward camera selection; its processing flow is shown in Fig. 1. The whole method is divided into two main steps, online generation of the target-image visual histogram and camera selection, and the main flow of each part of the embodiment is introduced below.
1. Online generation of the target-image visual histogram
To establish the information association between the cameras, the present embodiment first takes multi-channel video data of the same scene, extracts the local feature information in the video frames, clusters the local feature vectors, and uses the cluster centres as the visual dictionary generated by off-line training, so that online video can generate the corresponding visual histograms according to the dictionary and use them for associated information comparison. Because SIFT local features can well overcome the differences in visual appearance produced by illumination, scaling and viewing angle across multiple views, the present embodiment extracts the SIFT feature vectors of the input multi-view training video images, applies k-means clustering to them, and finally generates the visual dictionary. The concrete steps are:
Extraction of image SIFT feature descriptor vectors: for each input video frame, Gaussian templates G_x and G_y are used to filter the image and obtain the gradient components I_x and I_y in the x and y directions, where

G_x =
[ 0.0067  0.0085  0  -0.0085  -0.0067
  0.0964  0.1224  0  -0.1224  -0.0964
  0.2344  0.2977  0  -0.2977  -0.2344
  0.0964  0.1224  0  -0.1224  -0.0964
  0.0067  0.0085  0  -0.0085  -0.0067 ],

G_y =
[  0.0067   0.0964   0.2344   0.0964   0.0067
   0.0085   0.1224   0.2977   0.1224   0.0085
   0        0        0        0        0
  -0.0085  -0.1224  -0.2977  -0.1224  -0.0085
  -0.0067  -0.0964  -0.2344  -0.0964  -0.0067 ],

and from these the gradient magnitude and direction of each pixel are computed as mag(x, y) = sqrt( I_x(x, y)² + I_y(x, y)² ) and θ(x, y) = arctan(I_y / I_x). Starting from the upper-left corner of the image, a 16 × 16 window is taken every 8 pixels in the x and y directions as a feature-extraction sampling window; the window is divided into 4 × 4 square grid regions; for the sampled points in each region the gradient direction relative to the window centre is computed, and the gradient magnitudes of the sampled points, after Gaussian weighting by distance, are accumulated into gradient orientation histograms over the 8 directions 0, π/4, π/2, 3π/4, π, 5π/4, 3π/2 and 7π/4 within the region; each sampling window thus generates a 4 × 4 × 8 = 128-dimensional feature vector, which is normalized to form the window's local feature descriptor vector. The descriptor vectors computed for each image are added to the feature descriptor set F = {f^(1), f^(2), f^(3), ..., f^(t)}, with every f^(i) ∈ R^128, where f^(i) is the i-th descriptor vector of this image's descriptor set, the vectors are 128-dimensional, and t is the total number of feature descriptor vectors extracted from this image.
k-means clustering of the feature vectors: for the set F of SIFT feature descriptor vectors extracted from the frame images, with the number of cluster centres set to k, k-means clustering is performed as follows:
Cluster-centre selection: k local feature descriptor vectors {μ^(1), μ^(2), ..., μ^(k)} are chosen at random from the training-sample SIFT descriptor set as the centres of the k clusters;
Cluster assignment: for each remaining descriptor vector f^(i) in the feature descriptor set, its squared distance to each cluster centre μ^(j),

d = Σ_{l=1}^{128} ( f_l^(i) − μ_l^(j) )²,

is computed, where f_l^(i) is the l-th component of descriptor vector f^(i), and f^(i) is assigned to the cluster with the minimum distance d;
Cluster-centre recomputation: according to the clustering result, the mean of every dimension of all elements in each of the k clusters is computed and taken as the new cluster centre;
Re-clustering: all elements of the feature descriptor set F are clustered again to the new cluster centres according to the minimum-distance criterion of the assignment step;
The cluster centres are computed iteratively and the feature descriptor set is re-clustered to the new centres until the number of iterations reaches the preset limit or the distance between the new centres and the centres before the iteration is less than a set threshold; in the present invention the iteration stops when the number of iterations reaches 100 or the cluster-centre distance change is less than 0.02.
Construction of the visual dictionary: each cluster centre is regarded as a visual word; the set of visual words is obtained and stored, and constitutes the off-line-trained visual dictionary.
To establish a statistical representation model of the target image under each camera, the present invention extracts, from the online multi-channel video images, the region image of the moving target under each viewing angle, so as to reduce the differing visual effects produced by different backgrounds; SIFT local features are extracted from the extracted target-region images, and the visual histogram under each viewing angle is generated according to the off-line-trained visual dictionary. The concrete steps are:
Step 11, video motion detection: motion detection is performed on the video data input by each camera, and the region occupied by the moving target in image space is extracted. The concrete steps are as follows:
Step 111, moving-foreground image extraction: a Gaussian mixture model (GMM) is used for background modelling and foreground extraction on the sequentially input camera images. The concrete steps are as follows:
Step 1111, initialization: the first input frame is taken as the background, and the number of Gaussians, the background threshold and the window size of the mixture model are set.
Step 1112, each new video frame is fed to the model to update the background, and the foreground image of the current frame is extracted.
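As an illustration of steps 1111-1112, OpenCV's MOG2 background subtractor can stand in for the Gaussian mixture model; the history length, variance threshold and mixture count below are assumed values, not the patent's settings.

```python
import cv2

# One subtractor per camera; MOG2 is OpenCV's per-pixel Gaussian mixture model.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16,
                                                detectShadows=False)
subtractor.setNMixtures(5)            # number of Gaussians per pixel

def foreground_mask(frame):
    """Update the background model with the new frame and return the
    foreground mask of the current frame (step 1112)."""
    return subtractor.apply(frame)
```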
Step 112, shadow elimination: for each foreground image obtained by step 111, the shadow of the target is removed from the foreground image. The concrete steps are as follows:
Step 1121, the Gaussian templates G_x and G_y are used to compute the gradients I_x and I_y of the original image I in the x and y directions;
Step 1122, the same procedure as step 1121 is used to compute the gradients I_bx and I_by of the background image I_b in the x and y directions;
Step 1123, the cosine of the angle between the gradient vectors of image I and background image I_b is computed:

cos θ = ( I_x·I_bx + I_y·I_by ) / sqrt( (I_x² + I_y²)(I_bx² + I_by²) );

Step 1124, the gradient texture is computed for each image point (x, y) over a 5 × 5 neighbourhood:

S(x, y) = [ Σ_{x−2}^{x+2} Σ_{y−2}^{y+2} 2·sqrt( (I_x² + I_y²)(I_bx² + I_by²) )·cos θ ] / [ Σ_{x−2}^{x+2} Σ_{y−2}^{y+2} ( I_x² + I_y² + I_bx² + I_by² ) ];

Step 1125, when S(x, y) is greater than a certain threshold and the motion detection result of point (x, y) is a foreground point, point (x, y) is treated as shadow and removed from the foreground.
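A sketch of the texture test in steps 1121-1125, using NumPy/SciPy; note that 2·sqrt((I_x²+I_y²)(I_bx²+I_by²))·cos θ simplifies to 2·(I_x·I_bx + I_y·I_by), which the code exploits. The threshold value is only an assumed example, since the patent speaks of "a certain threshold".

```python
import numpy as np
from scipy.ndimage import uniform_filter

def remove_shadow(frame_gray, background_gray, fg_mask, s_thresh=0.7):
    """Suppress shadow pixels: shadowed regions keep the background's gradient
    structure, so a high texture score S(x, y) inside the foreground mask is
    treated as shadow and removed (step 1125)."""
    gy, gx = np.gradient(frame_gray.astype(np.float64))
    gby, gbx = np.gradient(background_gray.astype(np.float64))

    # 5x5 neighbourhood means; the common 1/25 factor cancels in the ratio
    num = uniform_filter(2.0 * (gx * gbx + gy * gby), size=5)          # numerator of S
    den = uniform_filter(gx ** 2 + gy ** 2 + gbx ** 2 + gby ** 2, size=5) + 1e-12
    s = num / den                                                      # S(x, y)

    cleaned = fg_mask.copy()
    cleaned[(s > s_thresh) & (fg_mask > 0)] = 0    # drop shadow points from the foreground
    return cleaned
```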
Step 113, moving-target region image extraction: edge detection is applied to the foreground image after shadow elimination to extract the target boundary and obtain the bounding rectangle of the target in image space; with this rectangle as a template, the region image of the moving target is extracted from the original video frame.
Step 12, extraction of local feature descriptors of the region image: the set of SIFT feature descriptor vectors is extracted from the moving-target region image obtained in step 11.
Step 13, visual histogram generation: with the visual words of the off-line-computed visual dictionary as histogram buckets, the SIFT feature descriptor vectors extracted from the moving-target region image in step 12 are assigned to the corresponding buckets; the number of descriptor vectors in each bucket is counted, and the histogram is finally normalized, thereby generating the visual histograms of the moving target under the different viewing angles.
2. Camera selection
At each selection time point (the present embodiment sets a selection time point every 10 video frames), one optimal viewing angle, i.e. the optimal camera, is selected according to whether a face is detected under the viewing angle, the visual-information gain of the target region, and the image sharpness characterized by the region's average gradient, so as to ensure that a camera with a frontal, clearer and more detailed view is selected as far as possible. Within the set of unselected cameras, the information gain of the target image in each candidate camera's video and the mutual information between the candidate camera and the set of selected cameras are computed from the online visual histograms; the suboptimal camera, with a large information gain for the observed target and a small mutual information with (i.e. low image similarity to) the selected cameras, is chosen, added to the set of selected cameras and removed from the set of candidate cameras; this step is repeated until the selected-camera count reaches the preset value. The concrete steps are as follows:
Step 21, initialization of the selection: a camera set C = {c_1, c_2, ..., c_m} in the scene observes the moving target simultaneously; the selected camera set C_s = ∅ and the candidate camera set C_u = C; the SIFT feature descriptor vector sets of all candidate cameras are merged, and the merged visual histogram H_merge is generated by step 13.
Step 22, optimal camera selection: from the candidate camera set C_u, the optimal camera c* is selected by jointly considering the face detection result, the information gain of the moving-target region image and the image sharpness; it is added to the selected camera set, i.e. C_s = {c*}, and removed from the candidate camera set, i.e. C_u = C_u \ {c*}; the selected-camera count value count is set to 1. The concrete steps are:
Step 221, face detection: an AdaBoost (Adaptive Boosting) face detector is applied to the moving-target region image of candidate camera c'; the detection result is V_face ∈ {0, 1}, where 1 indicates that a face is detected and 0 otherwise;
Step 222, computation of the information gain of the moving-target region image: from the visual histogram H_c' of camera c' and the merged visual histogram H_merge, the information gain V_IG of the amount of visual information between selecting and not selecting camera c' is computed as

V_IG = Σ_k p(b_k, c') · log[ p(b_k, c') / (p(b_k) · p(c')) ] + Σ_k p(b_k, c̄') · log[ p(b_k, c̄') / (p(b_k) · p(c̄')) ],

where p(b_k, c') is the joint probability of selecting camera c' and the k-th bucket of the visual histogram H_merge, p(b_k, c̄') is the joint probability of not selecting camera c' and the k-th bucket of H_merge, p(b_k) is the probability of the k-th bucket, and p(c') and p(c̄') are the probabilities of selecting and not selecting camera c', respectively, all computed from the histograms H_c' and H_merge;
Step 223, target-image sharpness computation: from the gradient (I_x, I_y) of the target-region image, the average gradient magnitude

V_grad = (1 / (N_x · N_y)) · Σ_x Σ_y sqrt( I_x(x, y)² + I_y(x, y)² )

is computed to characterize the clarity of the target observed by the camera in the video image, where N_x is the width of the video image and N_y is its height.
Step 224, optimal camera selection: with weighting coefficients α_1, α_2, α_3 satisfying α_1 + α_2 + α_3 = 1, the camera c* = argmax_{c'} ( α_1 · V_face + α_2 · V_IG + α_3 · V_grad ) is selected as the optimal camera; the present embodiment sets α_1 = 0.3, α_2 = 0.4, α_3 = 0.3;
Step 23, suboptimal camera selection: with the selected camera set C_s no longer empty, a suboptimal camera is chosen from the candidate camera set C_u as follows:
Step 231, the target-region image information gain IG_c' of candidate camera c' is computed with the method of step 222;
Step 232, computation of the mutual information between a candidate camera and the selected cameras: the mutual information MI(c', c_j) between the visual histogram H_c' of the target-region image of candidate camera c' and the visual histogram H_cj of each camera c_j ∈ C_s in the selected camera set expresses the similarity of the visual content of the target-region images of the two cameras:

MI(c', c_j) = Σ_{x=1}^{n_c'} Σ_{y=1}^{n_cj} p(H_c'^x, H_cj^y) · log[ p(H_c'^x, H_cj^y) / (p(H_c'^x) · p(H_cj^y)) ],

where H_c'^x is the x-th bucket of histogram H_c', H_cj^y is the y-th bucket of histogram H_cj, n_c' is the total number of buckets of the visual histogram H_c' of candidate camera c', and n_cj is the total number of buckets of the visual histogram H_cj of selected camera c_j;
Step 233, suboptimal camera selection: with a weighting coefficient β, 0 ≤ β ≤ 1, the camera c'* = argmax_{c'} ( IG_c' − β · Σ_{c_j ∈ C_s} MI(c', c_j) ) is selected as the suboptimal camera; the present embodiment sets β = 0.5;
Step 234, the chosen suboptimal camera c'* is added to the selected camera set, C_s = C_s ∪ {c'*}, removed from the candidate camera set, C_u = C_u \ {c'*}, and the selected-camera count is incremented, i.e. count = count + 1.
Step 24, step 23 is repeated until the selected-camera count count reaches the preset camera count n.
Embodiment 2
The camera selection system realized by this scheme performs selection on the Terrace video sequences of the mainly outdoor POM data set (Fleuret F, Berclaz J, Lengagne R, Fua P. Multi-camera people tracking with a probabilistic occupancy map. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2008, 30(2):267-282). This data set is configured with 4 cameras in total; the Terrace2 video sequence is chosen as the scene training data to train and generate the visual dictionary for this scene, and the Terrace1 video sequence is used for the online selection test. The 180th original frames are shown in Fig. 2, where Figs. 2a-2d are the target images obtained by cameras C0, C1, C2 and C3 respectively. Figs. 3a-3d show the foreground images of Figs. 2a-2d detected after background modelling and foreground extraction with the Gaussian mixture model, followed by texture-based shadow elimination; Figs. 4a-4d are the target-region images extracted from Figs. 2a-2d; and Figs. 5a-5d are the normalized visual histograms of Figs. 4a-4d, in which the target local feature descriptor vectors extracted under the corresponding viewing angle are classified by the visual words of the visual dictionary (i.e. the histogram buckets). The number of visual words of the visual dictionary in the present embodiment is set to 200, i.e. there are 200 histogram buckets (the x coordinate in the figures denotes the bucket index); the number of local feature descriptor vectors falling in each bucket is divided by the total number of feature vectors for normalization, and the bar charts show the probability distribution with which the extracted local feature descriptor vectors are assigned to the visual dictionary (the y coordinate denotes the probability); this per-bucket probability distribution of the feature descriptor vectors is the content of each camera's visual histogram in the present embodiment. For the target region, this scheme detects a face in camera C2; combining factors such as information gain and image sharpness, the result of the optimal camera selection is therefore C2, and the selection result shows that a rather frontal video image of the target at this moment is obtained. On this basis the system successively computes the information gain of the other cameras and their mutual information with the selected cameras; the order of preference is C2, C0, C3, C1, that is, when the selected-camera count is m = 2 the camera-combination selection result is {C2, C0}, and when the selected-camera count is set to m = 3 the selection result is {C2, C0, C3}.
Embodiment 3
The camera selection system realized by this scheme performs selection on video sequences of the mainly indoor i3DPost data set (N. Gkalelis, H. Kim, A. Hilton, N. Nikolaidis, and I. Pitas. The i3dpost multi-view and 3d human action/interaction database. In CVMP, 2009). This video set is configured with 8 cameras in total; the Walk video sequences D1-002 and D1-015 are chosen as training data to train and generate the visual dictionary for this scene, and the Run video sequence D1-016 is used for the online selection test. The 62nd original frames are shown in Figs. 6a-6h, which are the images obtained by cameras C0, C1, C2, C3, C4, C5, C6 and C7 respectively. Figs. 7a-7h show the target regions under each viewing angle after motion detection and shadow elimination. For the target region, the face detection values of C5 and C6 are 1 in the optimal camera selection of this scheme and those of the remaining cameras are 0; combining information gain and image sharpness, the system chooses C5 as the optimal camera. On this basis the system successively computes the information gain of the other cameras and their mutual information with the selected cameras; the order of preference is C5, C6, C1, C3, C4, C7, C0, C2, that is, when the selected-camera count is m = 2 the camera-combination selection result is {C5, C6}, when the selected-camera count is set to m = 3 the selection result is {C5, C6, C1}, when the selected-camera count is set to m = 4 the selection result is {C5, C6, C1, C3}, and the selection results for the remaining combination sizes follow in the same way.

Claims (6)

1. A method for selecting a camera combination in a visual sensor network, characterized in that it comprises the following steps:
Step 1, online generation of target-image visual histograms: when the fields of view of several cameras overlap, motion detection is performed on the online video data of the multiple cameras observing the same target; the sub-region occupied by the target in the video-frame image space is determined from the detection result, giving the target image region; local features are extracted from the target image region; and, according to the visual dictionary generated by prior training, the visual histogram of the target image region at this viewing angle is computed;
Step 2, sequential forward camera selection: one optimal viewing angle, i.e. the optimal camera, is selected; within the set of unselected cameras, the information gain of the target image in each candidate camera's video data and the mutual information between the candidate camera and the set of selected cameras are computed from the visual histograms of step 1; the suboptimal camera is selected, added to the set of selected cameras and removed from the set of candidate cameras; this is repeated until the selected-camera count reaches the required camera count.
2. the combination system of selection of video camera in a kind of visually-perceptible network according to claim 1, it is characterized in that, the visual dictionary that described training generates is: to the multi-path video data as training data of input, at first extract the constant transform characteristics of the yardstick local feature description subvector set of every width of cloth image; The k-mean cluster is carried out in the constant transform characteristics descriptor set of yardstick that all images extract; Each cluster centre is a descriptor vector, and as a visual word, the set of the visual word that obtains consists of the visual dictionary of off-line training;
Training generates visual dictionary and specifically may further comprise the steps:
Extract the constant transform characteristics descriptor vector of graphical rule: to every frame video frame images, adopt respectively Gauss's template that image is carried out filtering and ask for x direction and y direction gradient component I xWith gradient component I y, and calculating pixel point gradient magnitude mag (x, y) and direction θ (x, y), wherein
Figure FDA00002468276800011
θ (x, y)=arctan (I y, I x); From the image upper left corner, get the window of 16 * 16 sizes as the feature extraction sampling window in 8 pixels of the x and the every interval of y direction of image, each sampling window is divided into 4 * 4 square nets zone, sampled point in each zone is calculated respectively gradient relative direction with the sampling window center, the gradient magnitude of sampled point is included into respectively gradient orientation histogram on interior 8 directions in zone after by distance Gauss weighting, each sampling window generates the characteristic vector of one 128 dimension, the gained characteristic vector is carried out normalization form window local feature description subvector; The descriptor vector that every width of cloth image calculation is obtained adds Feature Descriptor set F={f (1), f (2), f (3)... f (t), f (i)∈ R 128, 1≤i≤t, wherein f (i)For this width of cloth image feature descriptor is gathered i descriptor vector, R 128Represent that this vector dimension is 128 dimensions, the Feature Descriptor vector sum that t extracts for this width of cloth image;
Performing k-means clustering on the feature vectors: for the feature descriptor vector set F, randomly choose k vectors as the initial cluster centres; iteratively compute the distance from every vector to the cluster centres, assign each vector to a cluster, and recompute the cluster centres from the assignment, until the prescribed number of iterations is reached or the change of the cluster centres between successive iterations is smaller than a set threshold;
Constructing the visual dictionary: take each cluster centre as one visual word; the set of visual words is obtained and stored, and constitutes the visual dictionary.
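To make the dictionary training of claim 2 concrete, the following minimal Python sketch extracts 128-dimensional SIFT descriptors from training frames and clusters them with k-means, assuming OpenCV and scikit-learn; OpenCV's keypoint-based SIFT stands in for the dense 16 × 16 sampling grid described above, and the vocabulary size k, the iteration limits, and all function names are illustrative choices rather than part of the claim.

```python
# Hedged sketch of the claim-2 dictionary training: cluster 128-dimensional
# SIFT descriptors with k-means; each cluster centre is one visual word.
# OpenCV keypoint-based SIFT replaces the dense 16x16 sampling grid of the
# claim, purely for brevity (an assumption).
import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_descriptors(image_bgr):
    """Return an (n, 128) array of SIFT descriptors for one frame."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

def train_visual_dictionary(training_frames, k=256, max_iter=100):
    """Pool all training descriptors and cluster them; the k centres form the dictionary."""
    all_desc = np.vstack([extract_descriptors(f) for f in training_frames])
    km = KMeans(n_clusters=k, max_iter=max_iter, n_init=4, random_state=0).fit(all_desc)
    return km.cluster_centers_  # shape (k, 128): the visual dictionary
```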
3. The method for selecting a camera combination in a visual perception network according to claim 2, characterized in that the online generation of the target-image visual histogram in Step 1 specifically comprises the following steps:
Step 11, video moving-target detection: perform video motion detection based on a Gaussian mixture model on the video data input by each camera; remove target shadows from the motion detection result using texture information; finally extract the region occupied by the moving target in the image space;
Step 12, extraction of local feature descriptors of the region image: extract the set of SIFT descriptor vectors from the moving-target region image obtained in Step 11;
Step 13, visual histogram generation: take each cluster centre of the pre-trained visual dictionary as one histogram bin; assign each SIFT descriptor vector of the moving-target region image extracted in Step 12 to its corresponding histogram bin; count the number of descriptor vectors falling in each bin; finally normalize the histogram, producing the visual histograms of the moving target under the multiple viewing angles.
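A minimal sketch of Steps 11–13, assuming OpenCV's Gaussian-mixture background subtractor (its built-in shadow flag stands in for the texture-based shadow removal of Step 11) and a trained dictionary from the previous sketch; the single-blob target extraction and all helper names are illustrative assumptions.

```python
# Hedged sketch of Steps 11-13: mixture-of-Gaussians motion detection,
# SIFT extraction on the target region, and quantization into a normalized
# visual histogram against the trained dictionary.
import cv2
import numpy as np

bg_sub = cv2.createBackgroundSubtractorMOG2(detectShadows=True)  # shadows marked 127

def target_region(frame_bgr):
    """Step 11: foreground mask -> crop of the largest moving blob (or None)."""
    mask = bg_sub.apply(frame_bgr)
    mask = np.where(mask == 255, 255, 0).astype(np.uint8)  # drop shadow pixels
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return frame_bgr[y:y + h, x:x + w]

def visual_histogram(region_bgr, dictionary):
    """Steps 12-13: assign each descriptor to its nearest visual word, then normalize."""
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    _, desc = cv2.SIFT_create().detectAndCompute(gray, None)
    hist = np.zeros(len(dictionary), dtype=np.float64)
    if desc is None or len(desc) == 0:
        return hist
    dists = np.linalg.norm(desc[:, None, :] - dictionary[None, :, :], axis=2)
    for word in dists.argmin(axis=1):
        hist[word] += 1.0
    return hist / hist.sum()
```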
4. The method for selecting a camera combination in a visual perception network according to claim 3, characterized in that the sequential forward camera selection of Step 2 specifically comprises the following steps:
Step 21, initialization of the selection: the visual perception network scene contains a camera set C = { c_1, c_2, ..., c_m } simultaneously observing the moving target, where m is the total number of cameras; the selected camera set is initialized as the empty set, C_s = ∅, and the candidate camera set as C_u = C; merge the SIFT descriptor vector sets of all candidate cameras and generate the merged visual histogram H_merge;
Step 22, optimal camera selection: select an optimal camera c* from the candidate camera set C_u and add it to the selected camera set C_s, i.e. C_s = { c* }, while removing camera c* from the candidate camera set, i.e. C_u = C_u \ { c* }; initialize the selected-camera count to count = 1;
Step 23, suboptimal camera selection: with the selected camera set C_s non-empty, in each iteration compute, for every candidate camera, its information gain and the mutual information of its visual histogram with the selected cameras; select the suboptimal camera c*, add it to the selected camera set and remove it from the candidate camera set C_u, and increase the selected-camera count, i.e. count = count + 1;
Step 24, repeat Step 23 until the selected-camera count count reaches the preset number of cameras n.
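The claim-4 procedure reduces to a greedy loop; the following skeleton shows it with `score_optimal` and `score_suboptimal` left as placeholders for the criteria of claims 5 and 6, so their names and signatures are assumptions for illustration only.

```python
# Hedged skeleton of claim 4: greedy sequential forward camera selection.
def select_cameras(cameras, hists, n, score_optimal, score_suboptimal):
    """cameras: camera ids; hists[c]: visual histogram of camera c; n: cameras wanted."""
    candidates = list(cameras)   # C_u = C
    selected = []                # C_s = empty set
    # Step 22: pick the single best view with the first-camera criterion.
    best = max(candidates, key=lambda c: score_optimal(c, hists))
    selected.append(best)
    candidates.remove(best)
    # Steps 23-24: repeatedly add the camera contributing the most new information.
    while len(selected) < n and candidates:
        nxt = max(candidates, key=lambda c: score_suboptimal(c, hists, selected))
        selected.append(nxt)
        candidates.remove(nxt)
    return selected
```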
5. The method for selecting a camera combination in a visual perception network according to claim 4, characterized in that the optimal camera selection of Step 22 specifically comprises the following steps:
Step 221, face detection: apply a face detector to the moving-target region image of candidate camera c'; the detection result is V_face ∈ {0, 1}, where V_face = 1 indicates that a face is detected and V_face = 0 otherwise;
Step 222, computation of the information gain of the moving-target region image: from the visual histogram H_c' of camera c' and the merged visual histogram H_merge, compute the information gain V_IG of the visual information between selecting and not selecting camera c', namely

V_IG = Σ_k [ p(b_k, c') · log( p(b_k, c') / ( p(b_k) · p(c') ) ) + p(b_k, ¬c') · log( p(b_k, ¬c') / ( p(b_k) · p(¬c') ) ) ],

where p(b_k, c') is the joint probability of selecting camera c' and the k-th bin of histogram H_merge, p(b_k, ¬c') is the joint probability of not selecting camera c' and the k-th bin of H_merge, p(b_k) is the probability of the k-th histogram bin, and p(c') and p(¬c') are the probabilities of selecting and not selecting camera c', respectively;
Step 223, target image sharpness computation: from the gradients of the target region image compute the average gradient magnitude V_clarity = (1/|R|) · Σ_{(x, y) ∈ R} mag(x, y) over the target region R, which characterizes the sharpness of the target in the image;
Step 224, optimal camera selection: set weight coefficients α_1, α_2, α_3 with α_1 + α_2 + α_3 = 1, and select as the optimal camera the camera c* that maximizes the weighted score, i.e.

c* = argmax_{c' ∈ C_u} ( α_1 · V_face + α_2 · V_IG + α_3 · V_clarity ).
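A sketch of the claim-5 scoring under stated assumptions: a Haar-cascade face detector replaces the unspecified detector of Step 221, the information-gain term follows the reconstruction above with an assumed prior p(c') = 0.5, sharpness is the mean Sobel gradient magnitude, and the default weights are placeholders; in practice the three terms would be normalized to comparable ranges before weighting.

```python
# Hedged sketch of the claim-5 score: face presence, information gain of the
# camera histogram against the merged histogram, and average-gradient
# sharpness, combined with weights alpha1 + alpha2 + alpha3 = 1. The Haar
# cascade, the prior p(c') = 0.5 and the default weights are assumptions.
import cv2
import numpy as np

_face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_score(region_bgr):
    """Step 221: V_face = 1 if a face is detected, else 0."""
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY)
    return 1.0 if len(_face_cascade.detectMultiScale(gray)) > 0 else 0.0

def information_gain(h_cam, h_merge, p_select=0.5, eps=1e-12):
    """Step 222: mutual information between the bin variable and the
    select / not-select indicator, with p(b_k | c') = h_cam and
    p(b_k | not c') = h_merge (a modelling assumption)."""
    p_b = p_select * h_cam + (1.0 - p_select) * h_merge
    ig = 0.0
    for k in range(len(h_cam)):
        for p_joint, p_c in ((p_select * h_cam[k], p_select),
                             ((1.0 - p_select) * h_merge[k], 1.0 - p_select)):
            if p_joint > eps:
                ig += p_joint * np.log(p_joint / (p_b[k] * p_c + eps))
    return ig

def sharpness(region_bgr):
    """Step 223: mean gradient magnitude of the target region."""
    gray = cv2.cvtColor(region_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)
    return float(np.mean(np.hypot(gx, gy)))

def optimal_score(region_bgr, h_cam, h_merge, a1=0.3, a2=0.4, a3=0.3):
    """Step 224: weighted combination; each term would normally be rescaled
    to a comparable range first (not shown)."""
    return (a1 * face_score(region_bgr)
            + a2 * information_gain(h_cam, h_merge)
            + a3 * sharpness(region_bgr))
```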
6. The method for selecting a camera combination in a visual perception network according to claim 5, characterized in that the suboptimal camera selection of Step 23 specifically comprises the following steps:
Step 231, compute the information gain IG_c' of the target region image of candidate camera c' by the method of Step 222;
Step 232, computation of the mutual information between the candidate camera and the selected cameras: compute the mutual information MI(c', c_j) between the visual histogram H_c' of the target region image of candidate camera c' and the visual histogram H_{c_j} of each camera c_j in the selected camera set C_s, c_j ∈ C_s; MI(c', c_j) expresses the degree of similarity of the visual content of the target region images of the two cameras:

MI(c', c_j) = Σ_{x=1..n_c'} Σ_{y=1..n_{c_j}} p(b_x^{c'}, b_y^{c_j}) · log( p(b_x^{c'}, b_y^{c_j}) / ( p(b_x^{c'}) · p(b_y^{c_j}) ) ),

where b_x^{c'} is the x-th bin of histogram H_c', b_y^{c_j} is the y-th bin of histogram H_{c_j}, n_c' is the total number of bins of the visual histogram H_c' of candidate camera c', and n_{c_j} is the total number of bins of the visual histogram H_{c_j} of the selected camera c_j;
Step 233, suboptimal camera selection: set a weight coefficient β, 0 ≤ β ≤ 1, and select as the suboptimal camera the camera c* that maximizes the β-weighted trade-off between its information gain and its mutual information with the already-selected cameras, i.e.

c* = argmax_{c' ∈ C_u} ( β · IG_c' − (1 − β) · Σ_{c_j ∈ C_s} MI(c', c_j) );
Step 234, for the chosen suboptimal camera c*, add it to the selected camera set, C_s = C_s ∪ { c* }, and simultaneously remove it from the candidate camera set, C_u = C_u \ { c* }.
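To make the claim-6 redundancy term concrete, the sketch below computes the mutual information of two views from a joint bin-occurrence matrix and combines it with the information gain in a β-weighted trade-off; how the joint distribution p(b_x, b_y) is estimated, the summation over the selected set, and β's default value are assumptions not fixed by the claim.

```python
# Hedged sketch of claim 6: mutual information between two views from a joint
# bin-occurrence matrix, and the beta-weighted suboptimal-camera score.
import numpy as np

def mutual_information(joint, eps=1e-12):
    """joint[x, y] ~ p(b_x of camera c', b_y of camera c_j); normalized internally."""
    joint = joint / max(joint.sum(), eps)
    px = joint.sum(axis=1, keepdims=True)   # marginal over c_j's bins
    py = joint.sum(axis=0, keepdims=True)   # marginal over c''s bins
    ratio = joint / (px @ py + eps)
    return float(np.sum(np.where(joint > eps, joint * np.log(ratio + eps), 0.0)))

def suboptimal_score(ig_candidate, mi_with_selected, beta=0.7):
    """Step 233: favour high information gain, penalise redundancy with the
    already-selected cameras (summation and beta's default are assumptions)."""
    return beta * ig_candidate - (1.0 - beta) * sum(mi_with_selected)
```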
CN201210488434.1A 2012-11-26 2012-11-26 Method for selecting camera combination in visual perception network Expired - Fee Related CN102932605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210488434.1A CN102932605B (en) 2012-11-26 2012-11-26 Method for selecting camera combination in visual perception network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210488434.1A CN102932605B (en) 2012-11-26 2012-11-26 Method for selecting camera combination in visual perception network

Publications (2)

Publication Number Publication Date
CN102932605A true CN102932605A (en) 2013-02-13
CN102932605B CN102932605B (en) 2014-12-24

Family

ID=47647293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210488434.1A Expired - Fee Related CN102932605B (en) 2012-11-26 2012-11-26 Method for selecting camera combination in visual perception network

Country Status (1)

Country Link
CN (1) CN102932605B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294813A (en) * 2013-06-07 2013-09-11 北京捷成世纪科技股份有限公司 Sensitive image search method and device
CN104918011A (en) * 2015-05-29 2015-09-16 华为技术有限公司 Method and device for playing video
CN104915677A (en) * 2015-05-25 2015-09-16 宁波大学 Three-dimensional video object tracking method
CN106778777A (en) * 2016-11-30 2017-05-31 成都通甲优博科技有限责任公司 A kind of vehicle match method and system
CN107111664A (en) * 2016-08-09 2017-08-29 深圳市瑞立视多媒体科技有限公司 A kind of video camera collocation method and device
CN107888897A (en) * 2017-11-01 2018-04-06 南京师范大学 A kind of optimization method of video source modeling scene
CN108234900A (en) * 2018-02-13 2018-06-29 深圳市瑞立视多媒体科技有限公司 A kind of camera configuration method and apparatus
CN108449551A (en) * 2018-02-13 2018-08-24 深圳市瑞立视多媒体科技有限公司 A kind of camera configuration method and apparatus
CN108471496A (en) * 2018-02-13 2018-08-31 深圳市瑞立视多媒体科技有限公司 A kind of camera configuration method and apparatus
CN108491857A (en) * 2018-02-11 2018-09-04 中国矿业大学 A kind of multiple-camera target matching method of ken overlapping
CN108495057A (en) * 2018-02-13 2018-09-04 深圳市瑞立视多媒体科技有限公司 A kind of camera configuration method and apparatus
CN108875507A (en) * 2017-11-22 2018-11-23 北京旷视科技有限公司 Pedestrian tracting method, equipment, system and computer readable storage medium
CN109493634A (en) * 2018-12-21 2019-03-19 深圳信路通智能技术有限公司 A kind of parking management system and method based on multiple-equipment team working
CN109639961A (en) * 2018-11-08 2019-04-16 联想(北京)有限公司 Acquisition method and electronic equipment
CN110505397A (en) * 2019-07-12 2019-11-26 北京旷视科技有限公司 The method, apparatus and computer storage medium of camera selection
CN111447404A (en) * 2019-01-16 2020-07-24 杭州海康威视数字技术股份有限公司 Video camera
CN111866468A (en) * 2020-07-29 2020-10-30 浙江大华技术股份有限公司 Object tracking distribution method and device, storage medium and electronic device
CN114900602A (en) * 2022-06-08 2022-08-12 北京爱笔科技有限公司 Video source camera determining method and device
CN117750040A (en) * 2024-02-20 2024-03-22 浙江宇视科技有限公司 Video service balancing method, device, equipment and medium of intelligent server cluster

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080075361A1 (en) * 2006-09-21 2008-03-27 Microsoft Corporation Object Recognition Using Textons and Shape Filters
CN102081666A (en) * 2011-01-21 2011-06-01 北京大学 Index construction method for distributed picture search and server
CN102208038A (en) * 2011-06-27 2011-10-05 清华大学 Image classification method based on visual dictionary
CN102509110A (en) * 2011-10-24 2012-06-20 中国科学院自动化研究所 Method for classifying images by performing pairwise-constraint-based online dictionary reweighting
CN102609732A (en) * 2012-01-31 2012-07-25 中国科学院自动化研究所 Object recognition method based on generalization visual dictionary diagram
CN102693311A (en) * 2012-05-28 2012-09-26 中国人民解放军信息工程大学 Target retrieval method based on group of randomized visual vocabularies and context semantic information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHAO Chunhui, et al.: "An Optimized Image Classification Method Based on the Bag-of-Words Model", Journal of Electronics & Information Technology *
ZHAO Chunhui, et al.: "An Improved K-means Clustering Method for Visual Dictionary Construction", Chinese Journal of Scientific Instrument *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103294813A (en) * 2013-06-07 2013-09-11 北京捷成世纪科技股份有限公司 Sensitive image search method and device
CN104915677A (en) * 2015-05-25 2015-09-16 宁波大学 Three-dimensional video object tracking method
CN104915677B (en) * 2015-05-25 2018-01-05 宁波大学 A kind of 3 D video method for tracking target
CN104918011B (en) * 2015-05-29 2018-04-27 华为技术有限公司 A kind of method and device for playing video
CN104918011A (en) * 2015-05-29 2015-09-16 华为技术有限公司 Method and device for playing video
CN107111664A (en) * 2016-08-09 2017-08-29 深圳市瑞立视多媒体科技有限公司 A kind of video camera collocation method and device
CN107111664B (en) * 2016-08-09 2018-03-06 深圳市瑞立视多媒体科技有限公司 A kind of video camera collocation method and device
CN106778777A (en) * 2016-11-30 2017-05-31 成都通甲优博科技有限责任公司 A kind of vehicle match method and system
CN107888897A (en) * 2017-11-01 2018-04-06 南京师范大学 A kind of optimization method of video source modeling scene
CN107888897B (en) * 2017-11-01 2019-11-26 南京师范大学 A kind of optimization method of video source modeling scene
CN108875507A (en) * 2017-11-22 2018-11-23 北京旷视科技有限公司 Pedestrian tracting method, equipment, system and computer readable storage medium
CN108491857B (en) * 2018-02-11 2022-08-09 中国矿业大学 Multi-camera target matching method with overlapped vision fields
CN108491857A (en) * 2018-02-11 2018-09-04 中国矿业大学 A kind of multiple-camera target matching method of ken overlapping
CN108495057B (en) * 2018-02-13 2020-12-08 深圳市瑞立视多媒体科技有限公司 Camera configuration method and device
CN108234900B (en) * 2018-02-13 2020-11-20 深圳市瑞立视多媒体科技有限公司 Camera configuration method and device
CN108471496A (en) * 2018-02-13 2018-08-31 深圳市瑞立视多媒体科技有限公司 A kind of camera configuration method and apparatus
CN108234900A (en) * 2018-02-13 2018-06-29 深圳市瑞立视多媒体科技有限公司 A kind of camera configuration method and apparatus
CN108495057A (en) * 2018-02-13 2018-09-04 深圳市瑞立视多媒体科技有限公司 A kind of camera configuration method and apparatus
CN108449551A (en) * 2018-02-13 2018-08-24 深圳市瑞立视多媒体科技有限公司 A kind of camera configuration method and apparatus
CN108449551B (en) * 2018-02-13 2020-11-03 深圳市瑞立视多媒体科技有限公司 Camera configuration method and device
CN108471496B (en) * 2018-02-13 2020-11-03 深圳市瑞立视多媒体科技有限公司 Camera configuration method and device
CN109639961A (en) * 2018-11-08 2019-04-16 联想(北京)有限公司 Acquisition method and electronic equipment
CN109493634A (en) * 2018-12-21 2019-03-19 深圳信路通智能技术有限公司 A kind of parking management system and method based on multiple-equipment team working
CN111447404B (en) * 2019-01-16 2022-02-01 杭州海康威视数字技术股份有限公司 Video camera
CN111447404A (en) * 2019-01-16 2020-07-24 杭州海康威视数字技术股份有限公司 Video camera
CN110505397B (en) * 2019-07-12 2021-08-31 北京旷视科技有限公司 Camera selection method, device and computer storage medium
CN110505397A (en) * 2019-07-12 2019-11-26 北京旷视科技有限公司 The method, apparatus and computer storage medium of camera selection
CN111866468A (en) * 2020-07-29 2020-10-30 浙江大华技术股份有限公司 Object tracking distribution method and device, storage medium and electronic device
CN111866468B (en) * 2020-07-29 2022-06-24 浙江大华技术股份有限公司 Object tracking distribution method, device, storage medium and electronic device
CN114900602A (en) * 2022-06-08 2022-08-12 北京爱笔科技有限公司 Video source camera determining method and device
CN114900602B (en) * 2022-06-08 2023-10-17 北京爱笔科技有限公司 Method and device for determining video source camera
CN117750040A (en) * 2024-02-20 2024-03-22 浙江宇视科技有限公司 Video service balancing method, device, equipment and medium of intelligent server cluster
CN117750040B (en) * 2024-02-20 2024-06-07 浙江宇视科技有限公司 Video service balancing method, device, equipment and medium of intelligent server cluster

Also Published As

Publication number Publication date
CN102932605B (en) 2014-12-24

Similar Documents

Publication Publication Date Title
CN102932605B (en) Method for selecting camera combination in visual perception network
CN110363122B (en) Cross-domain target detection method based on multi-layer feature alignment
CN111899172A (en) Vehicle target detection method oriented to remote sensing application scene
CN103824070B (en) A kind of rapid pedestrian detection method based on computer vision
CN101950426B (en) Vehicle relay tracking method in multi-camera scene
CN110853032B (en) Unmanned aerial vehicle video tag acquisition method based on multi-mode deep learning
CN112084869B (en) Compact quadrilateral representation-based building target detection method
CN110263712B (en) Coarse and fine pedestrian detection method based on region candidates
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN108564598B (en) Improved online Boosting target tracking method
CN103856727A (en) Multichannel real-time video splicing processing system
CN104517095B (en) A kind of number of people dividing method based on depth image
CN102256065A (en) Automatic video condensing method based on video monitoring network
Cepni et al. Vehicle detection using different deep learning algorithms from image sequence
CN111368660A (en) Single-stage semi-supervised image human body target detection method
CN103955888A (en) High-definition video image mosaic method and device based on SIFT
CN104834894A (en) Gesture recognition method combining binary coding and Hausdorff-like distance
CN113255608A (en) Multi-camera face recognition positioning method based on CNN classification
CN106529441A (en) Fuzzy boundary fragmentation-based depth motion map human body action recognition method
CN116614705A (en) Coal face camera regulation and control system based on multi-mode video feature analysis
CN106127813B (en) The monitor video motion segments dividing method of view-based access control model energy sensing
CN114170526A (en) Remote sensing image multi-scale target detection and identification method based on lightweight network
CN114495170A (en) Pedestrian re-identification method and system based on local self-attention inhibition
CN110046601B (en) Pedestrian detection method for crossroad scene
CN107730535B (en) Visible light infrared cascade video tracking method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141224

Termination date: 20181126