Summary of the invention
The present invention addresses the limitations and deficiencies of existing re-identification methods when applied to targets with similar appearance, and proposes a cross-camera re-identification fusion method and system for appearance-similar targets. The method and system learn target representations by fusing view-angle information and positional constraint relationships with appearance information: when learning a target's feature representation vector, not only is the target's appearance information used, but its position and view-angle information are incorporated alongside the appearance information and fed into the neural network together, so that the network learns not only the appearance similarity between targets but also the correlations of position and view angle; learning both similarity and correlation yields target representation vectors with greater discriminative power. The network is trained by stochastic gradient descent so that the distance between representation vectors of targets sharing the same identity is smaller than the distance between representation vectors of targets with different identities, i.e., the intra-class distance is smaller than the inter-class distance. During training, a combination of offline mining and online mining is used to generate and update the triplet data set, which improves training efficiency and convergence speed and prevents the model from falling into a local optimum. Once the target representation vectors are obtained, distance-based hierarchical clustering is applied to the representation vectors from multiple cameras, and targets clustered into one class are considered to share the same identity, thereby achieving cross-camera re-identification. Hierarchical clustering is based on the distances between vectors and merges vectors that are close to each other; to avoid clustering targets from the same camera into one identity, the present invention modifies the vector-distance computation in hierarchical clustering so that the distance between representation vectors from the same camera is infinite, guaranteeing that targets observed under the same camera view can never be merged into one class and improving re-identification accuracy.
The present invention is achieved by the following technical solutions.
According to one aspect of the present invention, there is provided a cross-camera re-identification fusion method for appearance-similar targets, comprising the following steps:
collecting scene images synchronously with multiple cameras from different fixed angles, so that observation information of each target is obtained at different positions;
detecting the targets in each image with an object detector based on a deep convolutional network, and outputting target detection results;
extracting a global feature map of each image with a deep convolutional neural network, extracting a local feature map at each target's position on the global feature map according to the detection results, and obtaining the target's appearance vector; encoding the cameras to generate view-angle vectors containing the observation view-angle information; generating each target's position vector in image coordinates from the position of its detection box;
fusing the appearance vector, view-angle vector and position vector, and generating the target representation vector after transformation;
training the deep convolutional neural network on a triplet data set to learn the target representation vectors used for re-identification, wherein during training the triplet data set is generated and updated by a combination of offline mining and online mining;
clustering the learned target representation vectors of the targets in all images with a constrained hierarchical clustering method, thereby achieving cross-camera target re-identification.
Preferably, the view-angle vector assigns a vector of fixed dimension to each observation view angle; the vector is generated by random initialization and continuously optimized during training by gradient descent.
Preferably, the position vector is generated by normalizing the horizontal and vertical coordinates of the top-left and bottom-right corners of the target detection box by the width and height of the image, respectively, and arranging them in order.
Preferably, the x- and y-coordinates of the top-left vertex (x1, y1) and bottom-right vertex (x2, y2) of the target detection box are divided by the width w and height h of the image, respectively, giving the normalized vertex coordinates (x'i, y'i):
x'i = xi / w
y'i = yi / h
where i = 1, 2, so that every coordinate value lies between 0 and 1. The normalized top-left and bottom-right vertex coordinates (x'1, y'1), (x'2, y'2) are then arranged in order to obtain the position vector b = [x'1, y'1, x'2, y'2].
Preferably, the vector fusion method is as follows: the view-angle vector, position vector and appearance vector are first concatenated, the concatenation order being arbitrary; the concatenated vector then passes through a fully connected network and is normalized to output the final target representation vector.
Preferably, the combination of offline mining and online mining is as follows: triplets are constructed from the targets collected by multiple cameras at the same moment. For a given target observed under one camera, a target observed by another camera with the same identity as the given target is a positive sample, and a target with a different identity is a negative sample. Offline mining first generates all triplets as the initial data set; after initial training of the deep convolutional neural network, online mining continuously evaluates and removes the easy samples from the data set, shrinking the training set and completing the generation and updating of the data set.
Preferably, an easy sample is a sample whose triplet loss function equals 0.
Preferably, the constrained hierarchical clustering method is based on the following computations of the distance between target representation vectors and the distance between clusters, in which:
the distance between target representation vectors is: let e_i^c be the representation vector of the i-th target observed by the camera numbered c; then the distance between representation vectors is
d(e_i^c, e_j^c') = ||e_i^c - e_j^c'||_2 if c' ≠ c, and +∞ if c' = c;
the distance between clusters is: the distance between the two farthest target representation vectors in the two clusters.
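The constrained distance above can be sketched as a small helper function (a minimal illustration only; the function name and the use of plain Python lists as vectors are my own, not taken from the patent):

```python
import math

def pairwise_distance(e_i, e_j, cam_i, cam_j):
    """Distance between two representation vectors: Euclidean when the
    observations come from different cameras, +infinity when they share
    a camera, so same-camera targets can never be merged into one class."""
    if cam_i == cam_j:
        return math.inf
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(e_i, e_j)))
```

Because the same-camera distance is infinite, any linkage rule that takes the maximum over member pairs inherits the constraint automatically.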
According to another aspect of the present invention, there is provided a cross-camera re-identification fusion system for appearance-similar targets, comprising:
an image acquisition module: comprising multiple cameras that synchronously capture scene images from different fixed angles, the fields of view of the cameras covering one another;
a target detection module: detecting the targets in the image captured by each camera with an object detector based on a deep convolutional network, and outputting target detection boxes;
a detection-box normalization module: dividing the horizontal and vertical coordinates of the top-left and bottom-right corners of each target detection box by the width and height of the image, respectively, to obtain a normalized detection box whose vertex coordinates are dimensionless decimals between 0 and 1;
an appearance vector generation module: obtaining the whole-image feature map of each camera's image with a deep convolutional network, then extracting the local feature map on the whole-image feature map according to the normalized detection-box coordinates;
a position vector generation module: generating each target's position vector in image coordinates from the position of its detection box;
a view-angle vector generation module: encoding the cameras to generate view-angle vectors containing the observation view-angle information;
a vector fusion module: concatenating the appearance vector, position vector and view-angle vector; the concatenated vector passes through a fully connected network and is normalized to output the final target representation vector;
a triplet generation module: generating and updating the triplet data set by a combination of offline mining and online mining;
a network training module: training the deep convolutional neural network on the triplet data set to learn the target representation vectors used for re-identification;
a cluster analysis module: clustering the learned target representation vectors of the targets in all images to achieve cross-camera target re-identification.
Preferably, the cross-camera re-identification fusion system for appearance-similar targets further includes any one or more of the following features:
the image acquisition module includes 4 cameras, arranged at the corner positions of the scene, so that a target appearing in the field of view can be captured by all 4 cameras simultaneously;
the target detection box is the minimum horizontal bounding rectangle of the target in image coordinates, with the vertex coordinates expressed in pixels; the detection box marks the target's RoI region in image coordinates, i.e., when the top-left vertex of the detection box is (x1, y1) and the bottom-right vertex is (x2, y2), the RoI region is (x1, y1)-(x2, y2).
Compared with the prior art, the present invention has the following beneficial effects:
in the cross-camera re-identification fusion method and system for appearance-similar targets provided by the present invention, the target's view-angle information and position information are fed into the deep network together with its appearance information for feature extraction, which on the one hand improves the robustness of the features and adds information sources for re-identification, and on the other hand reduces the dependence of the target representation vector on appearance information, so that the re-identification technique can be extended to applications with appearance-similar targets. During learning, the combination of offline triplet mining and online triplet mining improves training efficiency while preventing the model from falling into a local optimum. In addition, the present invention adds a constraint to ordinary hierarchical clustering, avoiding the error of merging observations under the same view angle into one class, improving clustering accuracy and thus re-identification performance.
Embodiment
As shown in Figure 1, this embodiment provides a cross-camera re-identification fusion method for appearance-similar targets, comprising the steps of image acquisition, target detection, detection-box normalization, appearance vector generation, position vector generation, view-angle vector generation, vector fusion, triplet generation, network training and cluster analysis.
The details are as follows:
(1) Image acquisition: multiple cameras synchronously capture scene images from different fixed angles, with the cameras' fields of view covering one another, so that a target can appear in the views of multiple cameras at the same time; relatively complete observation information of the target is thus obtained at different positions, while the target is kept as close as possible to the optimal position of each camera.
In this step, the layout shown in Fig. 2(a) can be used, in which 4 cameras are arranged at the corner positions of the scene; 8 targets appear in the field of view and can be captured by all 4 cameras simultaneously.
(2) Target detection: the targets in each camera image are detected with an object detector based on a deep convolutional network, and the detection box of each target is output. The detection box is the minimum horizontal bounding rectangle of the target in image coordinates, with vertex coordinates expressed in pixels. For example, in the scene of Fig. 2(a), a target captured by one camera yields the detection box shown in Fig. 2(b), with top-left vertex (x1, y1) and bottom-right vertex (x2, y2). This detection box marks out the target's RoI (Region of Interest) in image coordinates, which can be taken as (x1, y1)-(x2, y2).
(3) Detection-box normalization: the horizontal and vertical coordinates of the top-left and bottom-right corners of the target detection box are divided by the width and height of the image, respectively, producing a new normalized detection box whose vertex coordinates are dimensionless decimals between 0 and 1.
Let w and h be the width (x direction) and height (y direction) of the image; then the vertex coordinates of the normalized detection box are (x'1, y'1), (x'2, y'2), where:
x'i = xi / w
y'i = yi / h
with i = 1, 2, and each of the coordinate values x'1, x'2, y'1, y'2 lies between 0 and 1.
(4) Appearance vector generation: a deep convolutional network (e.g., VGGNet, ResNet) is first used as the feature extractor: a series of convolution and pooling operations is applied to each camera's image to obtain the deep feature map of a convolutional layer, which carries the image's global semantic information. Then, according to the detection box output by target detection, the local feature map at the target's position is extracted from the deep feature map; its scale varies with the size of the detection box. To obtain target appearance vectors of uniform fixed length, convenient for subsequent inter-vector operations and comparisons, a RoIPooling operation is applied to the local feature map, outputting a normalized feature vector of uniform fixed length; this feature vector is then mapped through the subsequent fully connected layers to produce the final appearance vector. In this step, when extracting the local feature map at a target's position, the feature map of the whole image is computed first, and the local feature map is then extracted on it according to the normalized detection-box coordinates.
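A minimal sketch of extracting a fixed-length local feature from a global feature map using a normalized detection box (my own simplified stand-in for RoIPooling, using nearest-neighbour sampling rather than a learned framework's pooling op; the function name and shapes are assumptions):

```python
import numpy as np

def extract_local_feature(feature_map, norm_box, out_size=4):
    """Crop the region of a global feature map selected by a normalized
    detection box and pool it to a fixed out_size x out_size grid, so
    every target yields an appearance feature of identical length.
    feature_map: (H, W, C) array; norm_box: (x1', y1', x2', y2') in [0, 1]."""
    h, w, c = feature_map.shape
    x1, y1, x2, y2 = norm_box
    # Sample an out_size x out_size grid of points evenly inside the box.
    ys = np.clip((np.linspace(y1, y2, out_size) * (h - 1)).round().astype(int), 0, h - 1)
    xs = np.clip((np.linspace(x1, x2, out_size) * (w - 1)).round().astype(int), 0, w - 1)
    pooled = feature_map[np.ix_(ys, xs)]   # (out_size, out_size, C)
    return pooled.reshape(-1)              # flattened fixed-length feature
```

In a real pipeline this fixed-length feature would then pass through the fully connected layers to yield the appearance vector.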
(5) Position vector generation: with the cameras fixed and the targets moving on the ground, the positions at which the same target appears in different camera images have a determined relationship, and this relationship can serve as one basis for judging whether two observed targets share the same identity. The present invention introduces the new concept of a position vector, using the target's position in the image as one of its features for re-identification. The target's position vector is defined as the arrangement of the top-left and bottom-right vertex coordinates of the target's normalized detection box in image coordinates.
The horizontal and vertical coordinates of the top-left and bottom-right corners of the normalized detection box are arranged in order to generate the position vector b = [x'1, y'1, x'2, y'2].
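The normalization and arrangement can be sketched in a few lines (an illustration only; the function name is my own):

```python
def position_vector(box, img_w, img_h):
    """Normalize a detection box (pixel coordinates of the top-left and
    bottom-right vertices) by the image width and height, then arrange
    the four values into the position vector b = [x1', y1', x2', y2']."""
    x1, y1, x2, y2 = box
    return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h]
```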
(6) View-angle vector generation: when determining whether the targets in two images share the same identity, the angle from which a target is observed (the view angle) is also a very important factor. For example, the same target often looks different from the front and from the back, but once the view angle is fixed, the features observed from each angle have a determined relationship, and this relationship is naturally related to the view angle. This embodiment introduces the new concept of a view-angle vector: a two-dimensional matrix is first designed whose number of columns equals the number of cameras and whose number of rows is typically 4, 8, 16, etc.; each element of the matrix is filled with a random number, which may follow a Gaussian distribution or be uniformly distributed between 0 and 1. Each column of the matrix is the encoding of the corresponding camera and is defined as that camera's view-angle vector; the view-angle vector is continuously optimized during training by gradient descent.
For example, if the system has M cameras numbered 0, 1, ..., M-1, the one-hot code c_i of the i-th camera is an M-dimensional vector whose i-th component is 1 and whose other components are 0. Define a camera matrix V and randomly initialize it (e.g., from the standard normal distribution); the i-th column v_i of V is the encoding of the i-th camera, so v_i = V c_i, where v_i is the generated view-angle vector, which is optimized during network training.
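The one-hot selection v_i = V c_i can be sketched as follows (a minimal illustration assuming 4 cameras and an 8-dimensional encoding; in training the columns of V would be updated by gradient descent like any other weight):

```python
import numpy as np

rng = np.random.default_rng(0)

M = 4           # number of cameras in the system (assumed)
D = 8           # dimensionality of each view-angle vector (assumed)

# Camera matrix V: one randomly initialized column per camera.
V = rng.standard_normal((D, M))

def view_vector(cam_id):
    """Select camera cam_id's column via its one-hot code: v = V @ c."""
    c = np.zeros(M)
    c[cam_id] = 1.0
    return V @ c
```

Multiplying by the one-hot code is equivalent to simply indexing the column, which is how embedding lookups are usually implemented.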
(7) Vector fusion: the appearance vector, position vector and view-angle vector are fused to generate the final target representation vector. The vectors are first concatenated, as shown in Figure 4; the concatenated vector then passes through a fully connected network and is normalized to output the final target representation vector.
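The fusion step can be sketched as concatenation, one fully connected layer and L2 normalization (a minimal sketch; the single-layer weight/bias stand-ins and the choice of L2 normalization are my assumptions, since the patent only specifies "fully connected network" and "normalized output"):

```python
import numpy as np

def fuse(appearance, position, view, weight, bias):
    """Concatenate the three vectors, pass them through one fully
    connected layer, and L2-normalize the result into the final
    target representation vector."""
    x = np.concatenate([appearance, position, view])  # splice order is arbitrary
    y = weight @ x + bias                             # stand-in for the FC network
    return y / np.linalg.norm(y)
```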
(8) Triplet generation: to learn target representation vectors with stronger characterization ability, the feature extraction network must be trained. For a given target observed under one camera, a target observed by another camera with the same identity is a positive sample, and a target with a different identity is a negative sample; the triplet data set is generated accordingly. This embodiment generates the training triplets by a combination of offline mining and online mining; the composition of the triplets is vital to the convergence of network training. Since the network in this embodiment must learn the relative positional relationships of targets between cameras, triplets are constructed from the images captured by multiple cameras at the same moment.
Specifically, as shown in Figure 5, let O_{t,c,l} denote the target with identity l observed by camera c at time t, and choose it as the reference sample. For another observed target O_{t',c',l'}: when t' = t, c' ≠ c and l' = l, O_{t',c',l'} is a positive sample; when t' = t, c' ≠ c and l' ≠ l, O_{t',c',l'} is a negative sample. All triplets at each moment are first generated by this rule, and the triplets of different moments are combined into the training set. At the beginning of training, to prevent the model from falling into a locally optimal solution, the model is trained on all samples in the training set; since the training set is large, convergence is slow at this stage. After several epochs (an epoch is one traversal of all samples in the training set), online triplet mining is adopted: after each mini-batch finishes training, the easy samples in the batch (those with loss 0) are removed from the training set, guaranteeing that they will not be drawn in subsequent sampling. In this way, as training proceeds, the proportion of hard and semi-hard samples in the training set keeps increasing, improving training efficiency and convergence speed, while the training set still contains some easy samples, which prevents model collapse.
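The offline enumeration rule for one moment can be sketched as follows (an illustration under my own representation: each synchronized observation is a (camera, identity) pair and the function returns index triples; the patent does not prescribe this data layout):

```python
def triplets_at_moment(observations):
    """Enumerate all (anchor, positive, negative) index triples for the
    synchronized observations of one moment. Positives share the anchor's
    identity but come from a different camera; negatives come from a
    different camera and carry a different identity."""
    trips = []
    for a, (ca, la) in enumerate(observations):
        for p, (cp, lp) in enumerate(observations):
            if cp == ca or lp != la:
                continue
            for n, (cn, ln) in enumerate(observations):
                if cn == ca or ln == la:
                    continue
                trips.append((a, p, n))
    return trips
```

The triplets of all moments are then pooled into the initial training set, which online mining later prunes.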
(9) Network training: to learn target representation vectors with strong characterization ability, the feature extraction network is trained on the triplet data set generated above.
For example, with the sample labeling software shown in Fig. 6, given an image I, the detection box b of a target in the image, and the camera number c of the image, the network predicts the target representation vector e = f(I, b, c) ∈ R^d, where f(·) is the transformation implemented by the feature extraction network. Given the reference sample e_a^i observed on a specific image, an observed target with the same identity (positive sample) e_p^i, and an observed target with a different identity (negative sample) e_n^i, the triplet loss function is
L = Σ_i [ ||e_a^i - e_p^i||^2 - ||e_a^i - e_n^i||^2 + m ]_+
where i is the triplet index, [·]_+ = max(·, 0), the subscripts a, p, n denote anchor, positive and negative respectively, and m is the margin between positive and negative samples. The triplet loss drives the distance between representation vectors of targets with different identities to exceed the distance between vectors of targets with the same identity by at least the margin.
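A per-triplet version of this loss can be sketched directly from the formula (a minimal illustration; the margin value 0.2 is an arbitrary example, not from the patent):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss [ d(a,p)^2 - d(a,n)^2 + m ]_+ : pushes the squared
    anchor-negative distance to exceed the squared anchor-positive
    distance by at least the margin m. A sample whose loss is 0 is an
    "easy" sample, which online mining removes from the training set."""
    d_ap = np.sum((anchor - positive) ** 2)
    d_an = np.sum((anchor - negative) ** 2)
    return max(0.0, d_ap - d_an + margin)
```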
(10) Cluster analysis: hierarchical clustering repeatedly merges and splits clusters according to the distances between them. The distance between the vectors of different targets seen by the same camera is sometimes small, and without a constraint they might erroneously be merged into one class. To solve this problem, this embodiment clusters the learned target representation vectors with a constrained hierarchical clustering method, thereby achieving cross-camera target identity recognition. The present invention defines the distance between vectors as follows: when two target representation vectors come from different cameras, their distance is the Euclidean distance between them; when they come from the same camera, their distance is positive infinity. In addition, the present invention defines the distance between clusters as the distance between the two farthest points of the clusters, so that whenever vectors from the same camera would be merged into one class the distance is positive infinity; this constraint reduces the occurrence of clustering errors to a certain extent.
For example, let e_i^c be the representation vector of the i-th target observed by the camera numbered c; the distance between vectors is defined as
d(e_i^c, e_j^c') = ||e_i^c - e_j^c'||_2 if c' ≠ c, and +∞ if c' = c.
Hierarchical clustering: after training is complete, the representation vectors of all targets detected by all cameras are computed with the trained network model. The representation vectors are clustered by agglomerative hierarchical clustering, with the distance between vectors computed by the formula above and the inter-cluster distance taken as the distance between the two farthest points of the clusters. Targets clustered into the same class are considered to share the same identity, completing re-identification.
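A compact sketch of the constrained agglomerative clustering (an illustration only, assuming farthest-pair linkage and a stopping threshold as described above; the function name, list-based data layout and the O(n^3) loop are my own simplifications, not an optimized implementation):

```python
import math

def constrained_hierarchical_cluster(vectors, cameras, threshold):
    """Farthest-pair (complete-linkage) agglomerative clustering with
    the camera constraint: the distance between two observations from
    the same camera is +infinity, so one camera's targets never share
    a cluster. Returns clusters as lists of observation indices."""
    def dist(i, j):
        if cameras[i] == cameras[j]:
            return math.inf
        return math.sqrt(sum((a - b) ** 2
                             for a, b in zip(vectors[i], vectors[j])))

    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        # Inter-cluster distance: farthest pair of member vectors.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = max(dist(i, j) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break                      # no mergeable pair remains
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

With two cameras each seeing two targets, the two views of each identity merge while the same-camera pairs stay apart.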
In the present embodiment:
The main idea of the provided cross-camera re-identification fusion method for appearance-similar targets is to combine the target's observation view angle and positional relationship with its appearance information to learn target representation vectors; to generate and update the data set during training by a combination of offline and online triplet mining; and to cluster the target representation vectors from different cameras with constrained hierarchical clustering, thereby achieving cross-camera target re-identification.
The method of learning target representation vectors introduces the view-angle vector that encodes the view-angle information and the position vector that encodes the target's position in the image, and fuses the view-angle vector and position vector with the appearance vector that encodes the target's appearance information to generate the target representation vector.
The view-angle vector assigns a vector of fixed dimension to each observation view angle; the vector is generated by random initialization and continuously optimized during training by gradient descent.
The position vector is generated by normalizing the horizontal and vertical coordinates of the top-left and bottom-right corners of the detection box by the width and height of the image, respectively, and arranging them in order. Specifically, the x- and y-coordinates of the detection box's top-left vertex (x1, y1) and bottom-right vertex (x2, y2) are divided by the image width w and height h:
x'i = xi / w
y'i = yi / h
where i = 1, 2, so that every coordinate value lies between 0 and 1, giving the normalized vertex coordinates (x'1, y'1), (x'2, y'2); these are arranged in order, so the position vector is b = [x'1, y'1, x'2, y'2].
In the vector fusion method, the view-angle vector, position vector and appearance vector are first concatenated, as shown in Figure 4; the concatenation order is arbitrary. The concatenated vector passes through a fully connected network and is normalized to output the final target representation vector.
The triplet mining method constructs triplets from the targets collected by multiple cameras at the same moment and continuously updates the data set during training. For a given target observed under one camera, a target observed by another camera with the same identity is a positive sample, and a target with a different identity is a negative sample. The offline method first generates all triplets as the initial data set; after initial training of the model, the online mining method continuously evaluates and removes the easy samples (those with loss 0) from the training set, shrinking the training set.
The constrained hierarchical clustering method is based on the modified computations of the distance between target vectors and the distance between clusters. The distance between target vectors is computed as follows: let e_i^c be the representation vector of the i-th target observed by the camera numbered c; then the distance between vectors is
d(e_i^c, e_j^c') = ||e_i^c - e_j^c'||_2 if c' ≠ c, and +∞ if c' = c.
The distance between clusters is the distance between the two farthest vectors in the two clusters.
The cross-camera re-identification fusion method for appearance-similar targets provided by the above embodiment of the present invention, and the technical solution involved, are diverse and versatile. The above embodiment fuses the target's observation view angle and positional relationship with its appearance information to extract feature vectors: the global feature map of the image is extracted with a deep convolutional neural network, and the target's appearance vector is extracted on the global feature map according to the target detection result; the cameras are encoded to generate view-angle vectors containing the observation view-angle information; the target's position vector in image coordinates is generated from the position of its detection box. These three vectors are fused and transformed to generate the target representation vector. The network is trained by optimizing the triplet loss function to learn the representation vectors used for re-identification, with the triplet data set generated and updated during training by a combination of offline mining and online mining. Finally, the representation vectors of the targets in different cameras are clustered with the constrained hierarchical clustering algorithm, achieving cross-camera target re-identification.
Specific embodiments of the present invention have been described above. It should be understood that the invention is not limited to the above particular implementations; those skilled in the art may make various variations or modifications within the scope of the claims, and these do not affect the substantive content of the invention.