Background Art
With the emergence of computer vision techniques and the rapid growth of computing power, new video surveillance systems have developed quickly. As surveillance applications diversify, a single camera can no longer meet the requirements of monitoring a large-scale scene because its field of view is limited, so monitoring a large-scale scene with multiple cameras has become an important trend in video surveillance. When the fields of view of the cameras overlap, targets can be associated by using camera calibration combined with the spatio-temporal information of the targets. When the fields of view do not overlap, however, a pedestrian moving between camera views passes through "temporal blind zones" and "spatial blind zones". To keep pedestrian tracking continuous, the identity consistency of the targets seen in different fields of view must be verified.
In a large-scale video surveillance system, verifying that a pedestrian seen at different times in different camera views, or at different times in the same camera view, is the same person becomes an important problem to be solved. We call this problem pedestrian re-identification in surveillance video. Current pedestrian re-identification techniques take the pedestrian's clothing appearance as the basis for judgment and assume that the pedestrian does not change clothes during monitoring; identity is then determined by matching appearance similarity.
At present, pedestrian re-identification falls mainly into the following classes of technical schemes:
Technical scheme (1)
This scheme tries to extract from the original image features that are both stable and distinctive. Stability means that the feature of the same person should be the same at different times; distinctiveness means that the features of different people (at different times or at the same time) should differ. In this technical scheme, designing features from the original image that meet these requirements is the key problem. A typical example is Symmetry-Driven Accumulation of Local Features [reference 1]. This method detects the pedestrian's body in the image, then divides the body into head, torso and legs in the vertical direction and into left and right halves in the horizontal direction. After the head is removed, the body is divided into four parts: left torso, right torso, left leg and right leg. For each part, the "HSV color histogram", "maximally stable color regions (MSCR)" and "recurrent image patches" are extracted as image features. Finally, these three kinds of image feature vectors are concatenated to form the feature vector of the whole body. This feature extraction method incorporates spatial information.
Shortcomings of technical scheme (1)
First, to ensure that the extracted features are stable and distinctive, designing the features requires human experience and repeated trial and error.
Second, in practical pedestrian re-identification, different cameras have different parameter configurations, the illumination conditions of their fields of view differ, the same person is captured from different angles in different fields of view, and the occlusion may also differ. Under such complex capture conditions it is difficult to design a feature with the stability and distinctiveness described above.
Technical scheme (2)
This technical scheme no longer focuses on designing the original image features. Instead, it uses metric learning to project the original image features so that the projected features satisfy stability and distinctiveness. Suppose the original features of two images are $x \in \mathbb{R}^{d}$ and $y \in \mathbb{R}^{d}$. The direct (Euclidean) distance between the two original features is
$$d(x, y) = \lVert x - y \rVert_{2} \qquad (1)$$
A metric learning method tries to find a projection matrix $L \in \mathbb{R}^{d \times r}$, uses this matrix to project the original features, and computes the Euclidean distance after projection:
$$d_{L}(x, y) = \lVert L^{T}x - L^{T}y \rVert_{2} \qquad (2)$$
How to obtain a good projection matrix $L$ is the key problem of metric learning. [Reference 2] maximizes the probability that the distance between images of the same person is smaller than the distance between images of different people. [Reference 3] builds on the classical metric learning method large margin nearest neighbor and adapts it to the specific problem of re-identification. Methods based on metric learning can achieve better performance than technical scheme (1).
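The difference between the direct distance and the distance after projection can be illustrated with a minimal sketch. The matrix `L` and the vectors below are hypothetical values chosen for illustration only; a real metric learning method would learn `L` from labeled image pairs.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def project(L, x):
    """Return L^T x, where L is a d x r matrix stored as d rows of length r."""
    r = len(L[0])
    return [sum(L[i][j] * x[i] for i in range(len(x))) for j in range(r)]


def projected_distance(L, x, y):
    """Euclidean distance between the projections of x and y."""
    return euclidean(project(L, x), project(L, y))


# Hypothetical 3 x 2 projection that keeps only the first two coordinates.
L = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
x = [1.0, 2.0, 10.0]
y = [4.0, 6.0, -3.0]
print(euclidean(x, y))              # distance in the original space
print(projected_distance(L, x, y))  # distance after projection
```

In this toy case the projection discards the third coordinate, which dominates the original distance, so the two vectors end up much closer after projection.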
Shortcomings of technical scheme (2)
This technical scheme tries to apply a linear projection to the original image features with a suitable projection matrix so that the projected features satisfy stability and distinctiveness. However, the two images to be matched are captured by different cameras, which introduces various differences between them (the camera parameter configurations differ, the illumination conditions of the fields of view differ, the same person is captured from different angles in different fields of view, and the occlusion may also differ). These differences mean that the two images can be regarded as lying in different modalities. In this case, a single metric matrix is not sufficient to project the features of the two modalities into the same space for distance computation.
Embodiment
The implementation of the technical scheme is described in further detail below with reference to the accompanying drawings.
Fig. 1 is a functional block diagram of a pedestrian re-identification system according to an embodiment of the invention. A pedestrian re-identification method according to an embodiment of the invention is described below with reference to the drawings. The method mainly comprises the following steps:
Step 1: Locate the pedestrian in the camera image
Pedestrian localization means determining the position of the pedestrian in the whole surveillance image. Whether the localization is accurate has an important effect on the performance of the whole system. The invention uses foreground-background separation to locate pedestrians. First, a Gaussian mixture model is used to model the background; background subtraction is then used to locate the moving foreground target (the pedestrian). The final pedestrian detection result is a rectangle that contains the pedestrian target, and the subsequent feature extraction is carried out inside this rectangle.
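The last part of this step, turning a foreground mask into a rectangle, can be sketched as follows. This is only an illustration under stated assumptions: the binary mask is assumed to have been produced already by some background-subtraction method (the Gaussian mixture modeling itself is not implemented here), and the function name `bounding_box` is hypothetical.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right) of the foreground pixels, inclusive.

    mask is a list of rows; a truthy entry marks a foreground pixel.
    Returns None when the mask contains no foreground pixel.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows), max(cols))


# Toy 4 x 5 mask with a small foreground blob.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],
]
print(bounding_box(mask))
```

A practical system would usually also filter small noise blobs and handle several moving targets; the sketch treats all foreground pixels as a single target.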
Step 2: Extract the original features of the pedestrian image
Feature extraction is carried out on the rectangular image containing the pedestrian target obtained by pedestrian localization, finally yielding a 5895-dimensional image feature. The specific feature extraction method is as follows.
An existing method can be used to extract the original features of the pedestrian image (the rectangle). For example, bilinear interpolation is used to normalize the rectangular image containing the pedestrian to 128*48 pixels, and the normalized image is divided into image blocks of 16*24 pixels, with an overlap of 12 pixels between horizontally adjacent blocks and an overlap of 8 pixels between vertically adjacent blocks. The original image is thus divided into 45 (15*3) image blocks in total.
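The block count can be checked with a short sketch. It assumes that the 128-pixel side is covered by 16-pixel blocks overlapping by 8 pixels and the 48-pixel side by 24-pixel blocks overlapping by 12 pixels, which is the reading consistent with the stated 15*3 = 45 blocks; the helper name `block_origins` is hypothetical.

```python
def block_origins(length, block, overlap):
    """Start offsets of blocks of size `block` along a side of `length` pixels."""
    stride = block - overlap
    return list(range(0, length - block + 1, stride))


rows = block_origins(128, 16, 8)   # offsets along the 128-pixel side
cols = block_origins(48, 24, 12)   # offsets along the 48-pixel side
blocks = [(r, c) for r in rows for c in cols]
print(len(rows), len(cols), len(blocks))
```

Each `(r, c)` pair is the top-left corner of one image block inside the normalized pedestrian image.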
For each image block, color features and texture features are extracted. The color features cover 9 channels in total from the RGB, HSV and YCbCr color spaces. Each channel is quantized into an 8-dimensional color histogram that represents the distribution of the image block in color space, giving 9 histograms in all. The texture features use local binary patterns, which form a 59-dimensional texture histogram. Local binary patterns offer notable advantages such as rotation invariance and gray-scale invariance and describe local image texture well. For the computation of local binary patterns, see [reference 4]. The feature dimension of each image block is therefore 8*9+59=131. Finally, the color and texture features of all image blocks are concatenated, so the feature dimension of the whole image is 131*45=5895. Fig. 2 is a schematic diagram of the color and texture feature extraction.
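The quantization of one color channel and the dimension arithmetic can be sketched as below. The sketch assumes 8-bit channel values (0 to 255) and normalizes each histogram by the number of pixels; both choices, and the function name `channel_histogram`, are assumptions made for illustration. The local binary pattern histogram is not implemented here.

```python
def channel_histogram(values, bins=8, levels=256):
    """Quantize integer channel values in [0, levels) into a normalized histogram."""
    width = levels // bins
    hist = [0.0] * bins
    for v in values:
        hist[min(v // width, bins - 1)] += 1.0
    n = len(values)
    return [h / n for h in hist] if n else hist


COLOR_CHANNELS = 9   # RGB, HSV and YCbCr, three channels each
COLOR_BINS = 8       # histogram bins per channel
LBP_BINS = 59        # local binary pattern histogram length
NUM_BLOCKS = 45      # image blocks per pedestrian image

block_dim = COLOR_CHANNELS * COLOR_BINS + LBP_BINS
image_dim = block_dim * NUM_BLOCKS
print(block_dim, image_dim)
print(channel_histogram([0, 31, 32, 255]))
```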
Step 3: Reduce the feature dimension by anchor-node projection
Because the dimension of the original image feature is very high (5895), using it directly would make the subsequent operations consume a large amount of computing time. Therefore, before the feature is fed into the pedestrian re-identification system, the original image feature is reduced in dimension, finally giving a 150-dimensional feature. The invention uses the anchor-node projection technique [reference 7] for feature dimensionality reduction. In the mathematical expression of the anchor-node projection, $x \in \mathbb{R}^{5895}$ denotes the 5895-dimensional original image feature, the projection is taken with respect to 150 anchor nodes, $D(\cdot)$ denotes the Euclidean distance, and $t$ denotes a normalization constant. $z(x) \in \mathbb{R}^{150}$ denotes the 150-dimensional feature obtained by projecting the original image feature $x$, and is called the anchor-node feature. Anchor-node projection thus maps the 5895-dimensional original feature to a 150-dimensional anchor-node feature; because 150 is much smaller than 5895, it achieves feature dimensionality reduction. Whether the anchor nodes are chosen well directly affects the quality of the dimensionality reduction. In the invention, K-means clustering is performed on all original features with the number of cluster centers set to 150, and the 150 K-means cluster centers are used as the anchor nodes. In this way the anchor nodes are distributed relatively evenly over the whole original feature space, which makes the dimensionality reduction robust.
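The choice of anchor nodes by K-means clustering can be sketched on toy data. This is a minimal plain-Python illustration, not the implementation used in the invention: the initialization from the first `k` points, the fixed iteration count and the function names are assumptions, and a real system would cluster 5895-dimensional features into 150 centers.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, k, iterations=20):
    """Return k cluster centers; centers are initialized from the first k points."""
    centers = [list(p) for p in points[:k]]  # assumption: naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:  # keep the old center if a cluster becomes empty
                dim = len(members[0])
                centers[j] = [sum(m[d] for m in members) / len(members)
                              for d in range(dim)]
    return centers


# Toy 2-D "features" forming two groups; the centers play the role of anchor nodes.
features = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [11, 10]]
anchors = kmeans(features, k=2)
print(anchors)
```

Once the anchors are fixed, each feature is described by its Euclidean distances to the anchors, which is the quantity the anchor-node projection operates on.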
Step 4: Measure the similarity between features
To judge whether two images belong to the same person, the low-dimensional features of the two images (obtained in step 3) are fed into the hash projection model (whose training process is described later) to obtain the similarity of the two images, and this similarity is used to judge whether the pedestrians in the two images have the same identity.
Specifically, the invention considers two images from different cameras to lie in different modal spaces. The two low-dimensional features are therefore first projected into a unified Hamming space, each forming a binary feature; the Hamming distance between the two binary features is then computed; and finally the similarity between the two features is computed on the basis of the Hamming distance. Fig. 3 is a structural block diagram of this method.
The similarity measurement process is described in detail below.
Hash projection is a technique that uses a hash function to project raw data into a Hamming space [reference 5]. Suppose the original data space is $X \subseteq \mathbb{R}^{d}$, $x \in X$ is a data point in $X$, $H = \{-1, +1\}$ is the Hamming space, and $h(x) \in H$ is the result of hash-projecting $x$. The hash function is defined as
$$h(x) = \operatorname{sgn}(p^{T}x + a) \in \{-1, +1\} \qquad (4)$$
where $\operatorname{sgn}(\cdot) \in \{-1, +1\}$ denotes the sign function, $p^{T}$ denotes the transpose of the projection vector $p$, and $a$ denotes the offset (a scalar).
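Formula (4) can be sketched in a few lines. The convention that the sign function returns +1 for an argument of exactly zero is an assumption (the text only states that the result lies in {-1, +1}), and the projection vector and offset below are hypothetical values.

```python
def sgn(value):
    """Sign function with values in {-1, +1}; zero maps to +1 by assumption."""
    return 1 if value >= 0 else -1


def hash_bit(p, a, x):
    """Hash projection h(x) = sgn(p^T x + a) for projection vector p and offset a."""
    return sgn(sum(pi * xi for pi, xi in zip(p, x)) + a)


p = [1.0, -1.0]   # hypothetical projection vector
a = 0.5           # hypothetical offset
print(hash_bit(p, a, [1.0, 2.0]))
print(hash_bit(p, a, [2.0, 1.0]))
```

Geometrically, `p` and `a` define a hyperplane, and the hash bit records on which side of that hyperplane the feature lies.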
Because the result of the hash projection is the binary value -1 or +1, the similarity function of data $x$ and $y$ under the hash function $h(\cdot)$ can be defined as
$$s(x, y) = h(x)\,h(y) \in \{-1, +1\} \qquad (5)$$
The above definition of the similarity function $s(x, y)$ assumes that the two images being compared lie in the same modal space. In pedestrian re-identification, however, images captured by different cameras lie in different modal spaces, so this similarity function cannot be used directly to measure similarity for re-identification.
To overcome this problem, the invention proposes a cross-modal hash projection and a corresponding similarity function. Suppose the image features $x$ and $y$ captured by two cameras lie in spaces $X$ and $Y$ respectively. Two different hash functions $h_{X}(x)$ and $h_{Y}(y)$ hash-project the features in these two spaces respectively:
$$h_{X}(x) = \operatorname{sgn}(p^{T}x + a) \in \{-1, +1\}$$
$$h_{Y}(y) = \operatorname{sgn}(q^{T}y + b) \in \{-1, +1\} \qquad (6)$$
where $p^{T}$ and $q^{T}$ denote the transposes of the projection vectors $p$ and $q$, and $a$ and $b$ denote the offsets (scalars).
Through hash projection, the image features $x$ and $y$ are projected into the same Hamming space, and the corresponding similarity function is rewritten as
$$s(x, y) = h_{X}(x)\,h_{Y}(y) \qquad (7)$$
A single pair of hash functions $h_{X}(\cdot)$ and $h_{Y}(\cdot)$ can express only two similarity values ($s(x, y) = +1$ means that $x$ and $y$ are similar, and $s(x, y) = -1$ means that they are dissimilar). To describe the degree of similarity between $x$ and $y$ more finely, the invention may, as an example, introduce 50 pairs of hash functions $\{h_{X}^{l}, h_{Y}^{l}\}_{l=1}^{50}$ (each pair with its own projection vectors $p_{l}$, $q_{l}$ and offsets $a_{l}$, $b_{l}$), and the similarity function of $x$ and $y$ is rewritten as:
$$s(x, y) = \sum_{l=1}^{50} h_{X}^{l}(x)\,h_{Y}^{l}(y) \qquad (8)$$
In addition, because different pairs of hash functions contribute differently to the similarity measure, each pair of hash functions is assigned a weight $\alpha_{l}$, and formula (8) is further rewritten as:
$$s(x, y) = \sum_{l=1}^{50} \alpha_{l}\,h_{X}^{l}(x)\,h_{Y}^{l}(y) \qquad (9)$$
The above process is summarized below.
Suppose camera A captures an image Q and camera B captures N images $\{G_{i}\}_{i=1}^{N}$. The task is to find the image G most similar to Q and thereby determine the identity of the pedestrian (or other target) appearing in Q, where each of the N images captured by camera B corresponds to a target class (for example, the identity of a certain pedestrian). Steps 1 to 3 are used to obtain the features $x$ and $\{y_{i}\}_{i=1}^{N}$ corresponding to these images, formula (9) is then used to compute the similarities $s(x, y_{i})$, and formula (10) is used to find the $y^{*}$ with the greatest similarity to $x$:
$$y^{*} = \arg\max_{y_{i}} s(x, y_{i}) \qquad (10)$$
The class information of the image (from camera B) corresponding to $y^{*}$ is then taken as the recognition result.
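The matching procedure summarized above can be sketched as follows, assuming that the similarity of formula (9) is a weighted sum over hash function pairs of the product of the two hash bits. All parameter values, the tuple layout `(p, a, q, b, alpha)` and the function names are hypothetical and only serve the illustration; two hash pairs are used instead of 50 to keep the example small.

```python
def sgn(value):
    """Sign function with values in {-1, +1}; zero maps to +1 by assumption."""
    return 1 if value >= 0 else -1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def similarity(x, y, pairs):
    """Weighted cross-modal similarity; pairs holds tuples (p, a, q, b, alpha)."""
    return sum(alpha * sgn(dot(p, x) + a) * sgn(dot(q, y) + b)
               for p, a, q, b, alpha in pairs)


def identify(x, gallery, pairs):
    """Return the label of the gallery feature most similar to x.

    gallery is a list of (feature, label) tuples from the second camera.
    """
    best = max(gallery, key=lambda item: similarity(x, item[0], pairs))
    return best[1]


pairs = [
    ([1.0, 0.0], 0.0, [1.0, 0.0], 0.0, 0.7),
    ([0.0, 1.0], 0.0, [0.0, 1.0], 0.0, 0.3),
]
query = [1.0, -1.0]
gallery = [([-1.0, 1.0], "A"), ([2.0, -3.0], "B"), ([1.0, 1.0], "C")]
print(identify(query, gallery, pairs))
```

Because the similarity only depends on the hash bits, the bits of all gallery features could be precomputed once and reused for every query.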
The parameter training process of the hash projection model is described below.
To ensure that formula (9) measures similarity reasonably, its parameters $\{p_{l}, q_{l}, a_{l}, b_{l}, \alpha_{l}\}_{l=1}^{50}$ must be set properly. Training images (in which the pedestrian identities are known) are therefore used to learn the parameters. Suppose there are 316 pairs of training samples $\{(x_{k}, y_{k})\}_{k=1}^{316}$, with labels $s(x_{k}, y_{k}) \in \{-1, +1\}$ indicating whether $x_{k}$ and $y_{k}$ belong to the same person ($s(x_{k}, y_{k}) = +1$) or to different people ($s(x_{k}, y_{k}) = -1$). A reasonable cross-modal hash projection function should have the following properties:
1) after projection, features belonging to different people (with different clothing appearance) are far apart;
2) after projection, features belonging to the same person (with the same clothing appearance) are close together.
An AdaBoost-based method is used to train the parameters of the 50 pairs of hash functions. The inputs to training are the 316 pairs of training samples with their labels and the 150 anchor nodes. Training runs for 50 iterations. Each iteration first determines the optimal projection vectors $p_{l}$, $q_{l}$ and offsets $a_{l}$, $b_{l}$, then computes the weight of the hash function pair, and finally updates the sample weights in preparation for the next iteration. The outputs of training are the projection vectors and offsets of the 50 pairs of hash functions and the corresponding pair weights. The $l$-th iteration is described below.
The objective function is given by formula (11); the optimal projection vectors $p_{l}$, $q_{l}$ and offsets $a_{l}$, $b_{l}$ are obtained by maximizing this objective function.
(1) Training $\{p_{l}, q_{l}\}$
In the optimization, formula (11) is simplified to overcome the difficulty introduced by the sign function. In the simplified formula, $\varepsilon_{lk} = s(x_{k}, y_{k})$, and the features used are the centered $z(x_{k})$ and the centered $z(y_{k})$. According to [reference 6], $p_{l}$ and $q_{l}$ should lie in the eigen-projection space of the matrix $\Sigma_{l}$. Taking the first 50 left eigenvectors and the first 50 right eigenvectors of $\Sigma_{l}$, $p_{l}$ and $q_{l}$ can be approximated by linear combinations of the left eigenvectors and of the right eigenvectors respectively (formula (14)), each combination having its own set of linear coefficients.
To reduce the computational complexity, 3000 pairs of 50-dimensional projection weights are selected at random, formula (14) is used to obtain the corresponding N pairs of projection vectors (one pair per sampled pair of weights), and the pair of projection vectors that maximizes the objective function is selected as the final optimal result.
(2) Training $\{a_{l}, b_{l}\}$
Once the pair of projection vectors has been obtained, the objective function remains to be maximized over the offsets.
The $(a, b)$ combination that maximizes the objective function is then sought as the optimal offset pair. Specifically, the two-dimensional $(a, b)$ space is divided uniformly into a 100 × 100 grid, producing 10000 $(a, b)$ combinations in total; the objective function is evaluated for each $(a, b)$ combination, and the combination that maximizes it is selected as the optimal offset pair.
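The grid search over offsets can be sketched as follows. The objective used here, a weighted sum of label times the product of the two hash bits, is an assumed form chosen for illustration because the exact objective formula is not reproduced in this text; the search range `lo` to `hi`, the function names and the toy data are likewise hypothetical. The inputs `u` and `v` stand for the already projected values $p^{T}x_{k}$ and $q^{T}y_{k}$.

```python
def sgn(value):
    """Sign function with values in {-1, +1}; zero maps to +1 by assumption."""
    return 1 if value >= 0 else -1


def agreement(u, v, labels, weights, a, b):
    """Assumed objective: weighted agreement between labels and hash-bit products."""
    return sum(w * s * sgn(uk + a) * sgn(vk + b)
               for uk, vk, s, w in zip(u, v, labels, weights))


def grid_search_offsets(u, v, labels, weights, lo, hi, steps=100):
    """Evaluate a steps x steps grid of (a, b) and return (best_score, a, b).

    steps must be at least 2; ties keep the first combination found.
    """
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    best = None
    for a in grid:
        for b in grid:
            score = agreement(u, v, labels, weights, a, b)
            if best is None or score > best[0]:
                best = (score, a, b)
    return best


# Toy projected values: two matching pairs (+1) and two non-matching pairs (-1).
u = [1.0, 3.0, 1.0, 3.0]
v = [1.0, 3.0, 3.0, 1.0]
labels = [1, 1, -1, -1]
weights = [0.25, 0.25, 0.25, 0.25]
score, a, b = grid_search_offsets(u, v, labels, weights, lo=-4.0, hi=4.0, steps=9)
print(score, a, b)
```

With `steps=100` the same function evaluates the 10000 combinations described in the text; the smaller grid above only keeps the example fast.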
The above describes the training of the $l$-th pair of hash projection functions. All hash functions (50 pairs in total) are trained jointly using the AdaBoost method. Throughout this process, sample weights are added to the objective function for every pair of hash projection functions, where $\omega_{l}(x_{k}, y_{k})$ is the weight of the $k$-th pair of samples.
(3) Training $\{\alpha_{l}\}$
The weight of each pair of hash functions is calculated by the following formula:
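The weight formula itself is not reproduced in this text. As a hedged illustration only, the sketch below uses the standard confidence-rated AdaBoost form, in which the pair weight is computed from the weighted correlation `r` between labels and hash-bit products, and the sample weights are then re-weighted and normalized. Whether the invention uses exactly this form is an assumption, and all function names are hypothetical.

```python
import math


def pair_weight(r):
    """Assumed AdaBoost weight: alpha = 0.5 * ln((1 + r) / (1 - r)), with -1 < r < 1."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def update_sample_weights(weights, labels, agreements, alpha):
    """Assumed AdaBoost update; agreements[k] is h_X(x_k) * h_Y(y_k) in {-1, +1}.

    Pairs that the current hash pair handles correctly lose weight,
    mishandled pairs gain weight, and the result is normalized to sum to 1.
    """
    raw = [w * math.exp(-alpha * s * g)
           for w, s, g in zip(weights, labels, agreements)]
    total = sum(raw)
    return [w / total for w in raw]


alpha = pair_weight(0.6)
new_weights = update_sample_weights(
    [0.25, 0.25, 0.25, 0.25],  # current sample weights
    [1, 1, -1, -1],            # labels s(x_k, y_k)
    [1, -1, -1, 1],            # hash-bit products for the current pair
    alpha,
)
print(alpha, new_weights)
```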
List of references
1. Michela Farenzena, Loris Bazzani, Alessandro Perina, Vittorio Murino, and Marco Cristani, "Person re-identification by symmetry-driven accumulation of local features," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2360–2367.
2. Wei-Shi Zheng, Shaogang Gong, and Tao Xiang, "Person reidentification by probabilistic relative distance comparison," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 649–656.
3. Mert Dikmen, Emre Akbas, Thomas S. Huang, and Narendra Ahuja, "Pedestrian recognition with a learned metric," in Computer Vision–ACCV 2010, pp. 501–512. Springer, 2011.
4. T. Ojala, M. Pietikäinen, and D. Harwood, "Performance evaluation of texture measures with classification based on Kullback discrimination of distributions," in Proceedings of the 12th IAPR International Conference on Pattern Recognition (ICPR 1994), vol. 1, pp. 582-585, 1994.
5. A. Torralba, R. Fergus, et al., "Small codes and large image databases for recognition," in Computer Vision and Pattern Recognition (CVPR), 2008 IEEE Conference on. IEEE, 2008.
6. A. M. Bronstein, M. M. Bronstein, et al., "The video genome," arXiv preprint arXiv:1003.5320, 2010.
7. W. Liu, J. Wang, R. Ji, et al., "Supervised hashing with kernels," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2074-2081.
To keep the description in this specification from becoming unduly lengthy, some technical details that are available in the above references or in other prior-art materials may have been omitted, simplified or adapted. A person skilled in the art will understand this, and it does not affect the sufficiency of disclosure of this specification. The above references are hereby incorporated by reference in their entirety.
In summary, those skilled in the art will appreciate that various modifications, variations and substitutions may be made to the above embodiments of the invention, all of which fall within the scope of protection of the invention as defined by the claims.