Technical Field
With the continuous development of the economy and technology and the steady rise in technological level, people's quality of life improves day by day. With the development of machine vision technology in recent years, machine vision has been widely applied in intelligent transportation systems, for example to monitor traffic events and traffic flow, detect road surface defects, and automatically navigate intelligent vehicles. As an important component of intelligent transportation systems, the intelligent safety vehicle is a current research hotspot: an intelligent vehicle uses intelligent algorithms to understand its immediate surroundings and thereby safeguards safe driving.
Vehicle tracking is a newer technical field that follows on from vehicle detection; the two are closely connected yet distinct. According to the positional relationship between the camera and the target, tracking methods can be divided into target tracking under a static background and target tracking under a dynamic background.
⑴ Method for tracking targets in a static background:
Target tracking in a static background means that the camera is fixed in a certain orientation and the field of view it observes is static. Background subtraction and Gaussian background modeling are commonly employed. The background subtraction method first takes the difference between the current (foreground) image and the background image, which yields the target objects that have entered the field of view. To describe the target, its size is usually expressed by the number of pixels in its connected region, by the aspect ratio of the target region, and so on; the target's position can be located by projection. Gaussian background modeling models the background: generally, K Gaussian models (typically 3 to 5) represent the features of each pixel in the image, and the Gaussian mixture model is updated after each new frame is obtained. An image (called the foreground image) is then read from the video stream, and each pixel of the current image is matched against the Gaussian mixture model; if the match succeeds, the pixel is judged to be a background point, otherwise it is judged to be a foreground point.
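For illustration only (OpenCV's mixture-of-Gaussians background model is an assumption here, not part of the described methods), a minimal sketch of Gaussian background modeling as just described might look like this:

```cpp
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap("traffic.avi");  // hypothetical input video path
    // Mixture-of-Gaussians background model: each pixel is described by up to
    // K Gaussians, matching the K (typically 3 to 5) models described above.
    cv::Ptr<cv::BackgroundSubtractorMOG2> mog2 =
        cv::createBackgroundSubtractorMOG2(/*history=*/500,
                                           /*varThreshold=*/16.0,
                                           /*detectShadows=*/false);
    cv::Mat frame, foreground;
    while (cap.read(frame)) {
        // A pixel matching one of the background Gaussians becomes 0 (background);
        // an unmatched pixel becomes 255 (foreground), i.e. a moving target.
        mog2->apply(frame, foreground);
        cv::imshow("foreground", foreground);
        if (cv::waitKey(30) == 27) break;  // stop on ESC
    }
    return 0;
}
```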
⑵ Method for tracking targets in a dynamic background:
Target tracking in a dynamic background means that the camera is not fixed in one position and its observed field of view is not static, i.e., the background also moves relative to the camera. For this situation, a detection-based tracking method is usually adopted: the target is detected in each single frame, achieving the effect of continuous tracking.
Common tracking algorithms include Kalman filtering, particle filtering, TLD, CT, and MOSSE:
⑴ The Kalman filter is an algorithm that uses the state equation of a linear system to perform optimal estimation of the system state from input and output observation data.
⑵ The idea of the particle filter (PF) is based on Monte Carlo methods: particle sets are used to represent probabilities, and the approach can be applied to any form of state-space model. Its core idea is to express the distribution by random state particles drawn from the posterior probability; it is a sequential importance sampling (SIS) method.
⑶ TLD (Tracking-Learning-Detection) is designed to cope with the shape changes, illumination changes, scale changes, occlusion, and similar difficulties of the tracked target during long-term tracking; its detector localizes the detected features (which represent the target object) and continuously corrects the tracker as needed.
⑷ CT (Compressive Tracking) uses, according to compressed sensing theory, a very sparse measurement matrix satisfying the RIP condition to project the feature space of the original image into a low-dimensional compressed subspace, which retains the information of the high-dimensional image feature space well.
⑸ MOSSE (Minimum Output Sum of Squared Error filter) is a good real-time tracking method that uses correlation filtering to minimize the output sum of squared error, and it has a certain robustness to illumination, scale, pose, and non-rigid deformation.
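For reference, in the standard MOSSE formulation (Bolme et al., summarized here as background rather than quoted from this document), the filter $H$ minimizes the sum of squared errors between the actual and desired correlation outputs, which has a closed-form solution in the frequency domain:

$\min_{H^{*}} \sum_i \left| F_i \odot H^{*} - G_i \right|^{2} \;\Rightarrow\; H^{*} = \dfrac{\sum_i G_i \odot F_i^{*}}{\sum_i F_i \odot F_i^{*}}$

where $F_i$ and $G_i$ are the Fourier transforms of the training patches and of their desired Gaussian responses, $\odot$ denotes element-wise multiplication, and $*$ the complex conjugate. The filters used in the steps below are of this form.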
Disclosure of Invention
The invention aims to provide a rapid and accurate vehicle tracking method.
The technical scheme adopted by the invention to solve the above technical problem is a vehicle multi-scale tracking method based on a fast feature pyramid. Given that a target was detected in the previous frame, with $(x_0, y_0)$ the center point of the target region detected in the previous frame, the target is tracked in the current frame by the following specific method:
step 1, calculating the aggregate channel features (ACF) of the current frame image at the original scale to obtain the feature map at the original scale:
1-1, converting the current frame image into the LUV color space to obtain the L, U, V three-channel features;
1-2, solving the gradient map of the LUV image to obtain the HOG (histogram of oriented gradients) features;
1-3, concatenating the L, U, V three-channel features and the HOG channel features in each orientation to obtain the aggregate channel features of the current frame at the original scale;
step 2, calculating the feature pyramid:
according to the various scales of the feature pyramid, resampling the original-scale feature map according to the energy statistics relationship among the channels of the aggregate channel features to obtain the feature map at each corresponding scale; the ratios of the scale of each resampled feature map to the scale of the original-scale feature map are $s_1, s_2, \ldots, s_n$, where $n$ is the total number of distinct scales; the feature pyramid sampling includes both upsampling and downsampling;
step 3, calculating the tracking results at the different scales:
3-1, the width and height of the feature channels of the original-scale feature map are $W_0$ and $H_0$ respectively; let the center point of the target region of the original-scale feature map be $(x_0, y_0)$; the center positions of the corresponding target regions after each feature pyramid sampling are $(x_0 s_1, y_0 s_1), (x_0 s_2, y_0 s_2), \ldots, (x_0 s_n, y_0 s_n)$;
3-2 calculate convolution filter:
for the case where the previous frame was the first time the target was detected:
under each feature pyramid sampling, a Gaussian distribution whose peak lies at the center position of the target region is generated as the desired output response $G_1, G_2, \ldots, G_n$ of the target region, and the filter coefficients of the convolution filter are initialized as

$H_i^{*} = \dfrac{G_i \odot F_i^{*}}{F_i \odot F_i^{*}}, \quad i = 1, 2, \ldots, n$

where $F_i$ is the Fourier transform of the target region, $G_i$ is the Gaussian distribution centered on the target region, $\odot$ denotes element-wise multiplication, and the superscript $*$ denotes the complex conjugate;
for the case where the previous frame was not the first time the target was detected:
the convolution filter is updated by adopting the following strategy:
$H_i^{t} = \eta\,\dfrac{G_i \odot F_i^{*}}{F_i \odot F_i^{*}} + (1 - \eta)\,H_i^{t-1}$

where $\eta$ is the weight coefficient and $H_i^{t-1}$ is the filter coefficient of the convolution filter of the previous frame;
3-3, inputting the resampled feature maps of all scales into the convolution filters, whose output responses are $g_1, g_2, \ldots, g_n$ respectively, and obtaining the tracking value of each scale separately as

$\mathrm{PSR}_i = \dfrac{g_{\max} - \mu_s}{\sigma_s}, \quad i = 1, 2, \ldots, n$

where $g_{\max}$ is the peak of the output response at the current scale, and $\mu_s$ and $\sigma_s$ are respectively the mean and standard deviation of the output response at the current scale;
3-4, taking the maximum of the tracking values $\mathrm{PSR}_1, \mathrm{PSR}_2, \ldots, \mathrm{PSR}_n$ as the tracking result PSR; when the PSR is larger than the set threshold, tracking is considered successful, tracking continues on the next frame, and the tracking result is mapped back to the original scale according to the corresponding scale ratio to determine the target region; when the PSR is smaller than the set threshold, tracking is considered to have failed and the current tracking target is cancelled.
The invention replaces the grayscale feature in the MOSSE algorithm with the more discriminative ACF feature, and uses a fast feature pyramid to approximate the features at adjacent scales, so that tracking results at different scales can be computed quickly. For the ACF features, the captured RGB vehicle image is first converted to Luv space to obtain the L, U, V channels; a gradient image is then computed on this basis to obtain the gradient channels; concatenating these channel features forms the ACF feature. The fast pyramid approximates the scaled version of the original feature image through statistics. Unlike traditional image scaling, which must compute the scaled result with means such as nearest-neighbor or linear interpolation at a far higher computational cost, the fast pyramid computes the image at the scaled scale directly from the original-scale image according to the energy statistics relationship among the channel scalings.
The method has the advantage that the vehicle tracking algorithm based on aggregate channel features exploits not only the global information of multiple channels but also the local information of the vehicle in each channel, improving the robustness of vehicle tracking; moreover, the fast feature pyramid enables multi-scale tracking while guaranteeing real-time performance.
Detailed description of the preferred embodiments
For convenience in describing the present disclosure, some terms will be described first.
Luv channel features. The LUV color space, in full the CIE 1976 (L*, u*, v*) color space (also called CIELUV), where L denotes luminance and u and v denote chromaticity, was adopted by the International Commission on Illumination (CIE) in 1976 for its perceptual uniformity; it is obtained by a simple transformation of the CIE XYZ space. A similar color space is CIELAB. For typical images, u and v range from -100 to +100, and the luminance L from 0 to 100.
Gradient channel feature. The gradient channel feature is the gradient map of an image and reflects the edge information of the target. Various operators can compute the gradient, such as the Prewitt and Sobel operators; however, the simple operator [-1 0 1] performs better. The gradient is used to describe the edges of the vehicle image. Since the Luv and RGB channels differ only by a linear change, for convenience the gradient map of the image can be computed on the Luv channels once they are obtained.
Histogram of gradients. For the histogram of oriented gradients, HOG, the image is first divided into small connected regions called cell units. The gradient or edge-orientation histogram of the pixels within each cell unit is then accumulated, and finally these histograms are combined to form the feature descriptor. To improve performance, the local histograms can also be contrast-normalized over a larger region of the image, called a block, by computing the density of the histograms within the block and normalizing each cell unit in the block according to this density. This normalization gives better invariance to illumination changes and shadows.
The method of the invention is implemented on a C++ platform; the specific steps are as follows:
step 1, calculating the aggregate channel features (ACF) of the current frame image at the original scale to obtain the feature map at the original scale:
1-1, converting the current frame image into the LUV color space to obtain the L, U, V three-channel features;
The image captured by the camera is generally an RGB image, which is not conducive to color clustering. To describe the grayscale and chromaticity information of the vehicle well, the RGB image must be converted to an LUV image. The specific method is as follows:
First, the RGB image is converted to CIE XYZ:

$\begin{bmatrix} X \\ Y \\ Z \end{bmatrix} = \begin{bmatrix} 0.412453 & 0.357580 & 0.180423 \\ 0.212671 & 0.715160 & 0.072169 \\ 0.019334 & 0.119193 & 0.950227 \end{bmatrix} \begin{bmatrix} R \\ G \\ B \end{bmatrix}$ (1)

Then CIE XYZ is converted to Luv:

$L = \begin{cases} 116\,(Y/Y_n)^{1/3} - 16, & Y/Y_n > (6/29)^3 \\ (29/3)^3\,Y/Y_n, & Y/Y_n \le (6/29)^3 \end{cases}$ (2)

$u = 13L\,(u' - u_n')$ (3)

$v = 13L\,(v' - v_n')$ (4)

where $u' = \dfrac{4X}{X + 15Y + 3Z}$, $v' = \dfrac{9Y}{X + 15Y + 3Z}$, $Y_n$ is the luminance of the reference white point, and $(u_n', v_n')$ are its chromaticity coordinates.
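A minimal sketch of this conversion; using OpenCV's built-in converter is an assumption for illustration (any implementation of equations (1)-(4) would serve). Note that OpenCV stores camera frames in BGR channel order:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Convert a camera frame to the Luv color space and split it into the
// L, u, v channel images used as the first three ACF channels.
void rgbToLuvChannels(const cv::Mat& bgrFrame, std::vector<cv::Mat>& luv) {
    cv::Mat luvImage;
    // cvtColor applies the RGB -> CIE XYZ -> Luv transform of eqs. (1)-(4).
    cv::cvtColor(bgrFrame, luvImage, cv::COLOR_BGR2Luv);
    cv::split(luvImage, luv);  // luv[0] = L, luv[1] = u, luv[2] = v
}
```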
1-2, solving the gradient map of the LUV image to obtain the HOG (histogram of oriented gradients) features;
There are many ways to compute the gradient, for example the Prewitt operators

$\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$ and $\begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$

and the Sobel operators

$\begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$ and $\begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$.

However, filtering with the simplest operators $[-1\ 0\ 1]$ and $[-1\ 0\ 1]^{T}$, as adopted here, gives better results.
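A sketch of this gradient computation, again assuming OpenCV; the [-1 0 1] operator and its transpose are applied by linear filtering:

```cpp
#include <opencv2/opencv.hpp>

// Gradient magnitude and orientation of one channel using the simple
// [-1 0 1] operator (horizontal) and its transpose (vertical).
void computeGradient(const cv::Mat& channel, cv::Mat& magnitude, cv::Mat& angle) {
    cv::Mat dx, dy;
    cv::Mat kx = (cv::Mat_<float>(1, 3) << -1.f, 0.f, 1.f);  // [-1 0 1]
    cv::Mat ky = kx.t();                                     // its transpose
    cv::filter2D(channel, dx, CV_32F, kx);
    cv::filter2D(channel, dy, CV_32F, ky);
    // Per-pixel magnitude and orientation; angle returned in degrees (0-360).
    cv::cartToPolar(dx, dy, magnitude, angle, /*angleInDegrees=*/true);
}
```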
1-3 sampling and normalization
Since the gradient histogram assigns 4 × 4 cells to 6 orientations, the side length of the gradient histogram is 1/4 that of the original image. To keep the dimensions of all channels consistent, the Luv channel images and the gradient image must therefore be downsampled; this sampling does not affect the detection result. Bilinear interpolation is used in the sampling process to obtain a better result.
To suppress the influence of noise in the gradient computation, a normalization operation must be applied to the gradient map. The normalization operations are L1-norm, L2-norm, and L1-sqrt:
L1-norm: $v \rightarrow v / (\|v\|_1 + \varepsilon)$ (5)

L2-norm: $v \rightarrow v / \sqrt{\|v\|_2^2 + \varepsilon^2}$ (6)

L1-sqrt: $v \rightarrow \sqrt{v / (\|v\|_1 + \varepsilon)}$ (7)
where $\varepsilon$ is a very small number (e.g., 0.01), $v$ is the gradient vector, $\|\cdot\|_1$ denotes the one-norm, and $\|\cdot\|_2$ denotes the two-norm. In this example, the L2-norm is used.
After the gradient map is obtained, the orientation of each pixel within a 4 × 4 cell casts a vote as a gradient element in the orientation histogram, forming the histogram of oriented gradients. The orientation bins are evenly divided over 0-180 degrees or 0-360 degrees, and to reduce aliasing the gradient votes are bilinearly interpolated, in both orientation and position, between the centers of neighboring bins. The weight of a vote is computed from the gradient magnitude and can be taken as the magnitude itself, its square, or its square root; practice has shown that using the gradient magnitude itself as the voting weight works best.
Because of local illumination changes and changes in foreground-background contrast, the gradient intensity varies over a very large range, so local contrast normalization of the gradient is needed. Specifically, the cell units are grouped into larger spatial blocks, and contrast normalization is performed on each block, using the same normalization as in equations (5)-(7). The final descriptor is the vector formed by the histograms of the cell units of all blocks in the detection window. In fact the blocks overlap, i.e., the histogram of each cell unit is used several times in the final descriptor computation. This approach appears redundant but improves performance significantly.
1-4, concatenating the L, U, V three-channel features, the gradient magnitude channel, and the HOG channel features in each orientation yields the aggregate channel features (ACF) of the current frame at the original scale. If the gradient histogram has six orientations, a total of 10 channels are obtained; these 10 channels are the aggregate channel features. A sketch of how they could be assembled follows below.
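This sketch makes the same OpenCV assumption; rgbToLuvChannels and computeGradient are the hypothetical helpers from the sketches above. For brevity it votes each pixel into a single orientation bin with hard assignment, omitting the bilinear interpolation described in step 1-3:

```cpp
#include <opencv2/opencv.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// Build the 10 ACF channels: 3 Luv + 1 gradient magnitude + 6 orientations.
std::vector<cv::Mat> computeACF(const cv::Mat& bgrFrame) {
    std::vector<cv::Mat> luv;
    rgbToLuvChannels(bgrFrame, luv);
    cv::Mat gray, mag, angle;
    luv[0].convertTo(gray, CV_32F);     // gradient of the L channel, for brevity
    computeGradient(gray, mag, angle);

    // Six orientation channels over 0-180 degrees (30 degrees per bin);
    // each pixel contributes its gradient magnitude to one bin.
    const int bins = 6;
    std::vector<cv::Mat> hist(bins);
    for (int b = 0; b < bins; ++b) hist[b] = cv::Mat::zeros(mag.size(), CV_32F);
    for (int y = 0; y < mag.rows; ++y)
        for (int x = 0; x < mag.cols; ++x) {
            float a = std::fmod(angle.at<float>(y, x), 180.f);
            int b = std::min(bins - 1, static_cast<int>(a / (180.f / bins)));
            hist[b].at<float>(y, x) = mag.at<float>(y, x);
        }

    std::vector<cv::Mat> channels;
    for (cv::Mat& c : luv) { cv::Mat f; c.convertTo(f, CV_32F); channels.push_back(f); }
    channels.push_back(mag);
    for (cv::Mat& h : hist) channels.push_back(h);

    // Downsample every channel by 4 so all channels match the 4x4-cell grid.
    for (cv::Mat& c : channels)
        cv::resize(c, c, cv::Size(), 0.25, 0.25, cv::INTER_LINEAR);
    return channels;
}
```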
step 2, calculating the feature pyramid:
according to the various scales of the feature pyramid, resampling the original-scale feature map according to the energy statistics relationship among the channels of the aggregate channel features to obtain the feature map at each corresponding scale; the ratios of the scale of each resampled feature map to the scale of the original-scale feature map are $s_1, s_2, \ldots, s_n$, where $n$ is the total number of distinct scales; the feature pyramid sampling includes both upsampling and downsampling;
Upsampling

Let the original image be $I(x, y)$, and let the image obtained by upsampling with factor $k$ be $I_k(x, y) = I(x/k, y/k)$. By the definition of the gradient,

$\nabla I_k(x, y) = \frac{1}{k}\,\nabla I(x/k, y/k)$ (8)

i.e., the gradient at a point of the upsampled image is $1/k$ times that at the corresponding point of the original-scale image. Let $h_q'$ be the $q$-th bin of the gradient magnitude histogram of the upsampled image and $h_q$ that of the original image. The upsampled image contains $k^2$ times as many pixels, each contributing $1/k$ times the gradient magnitude, so the sums of gradients of the original and upsampled images are related through the upsampling factor $k$. In the same way, the gradient orientation is unchanged:

$O_k(x, y) \approx O(x/k, y/k)$ (9)

Thus, from the above definitions, the gradient histogram of the upsampled image satisfies $h_q' \approx k\,h_q$. (10)
Downsampling

Downsampling cannot be derived in the same way as upsampling, because the downsampled image loses high-frequency content and hence energy. Let $I_k$ be the image obtained from the original image $I$ by downsampling with factor $k$; the corresponding histogram relationship is then $h_q' \lesssim h_q / k$, with the loss depending on the image content. However, the ratio between the energy of a channel image and that of its downsampled version depends only on the relative scale and is independent of the particular original image. If $f(I, s_i)$ denotes the energy of the channel after downsampling by scale $s_i$, then

$E[f(I, s_1)/f(I, s_2)] = E[f(I, s_1)]/E[f(I, s_2)] = r(s_1 - s_2)$ (11)

holds. Consequently $E[f(I, s_1)/f(I, s_2)]$ must take the form

$E[f(I, s + s_0)/f(I, s_0)] = a e^{-\lambda s}$ (12)

where $a$ and $\lambda$ can be obtained by statistics over a large number of target images and natural images. Therefore the relationship between the downsampled image and the original image is $f(I, s) = a e^{-\lambda s} f(I, 0)$.
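A sketch of the fast-pyramid resampling implied by $f(I, s) = a e^{-\lambda s} f(I, 0)$: a channel at a new scale is approximated by one resize of the original-scale channel plus an energy correction. The values of $a$ and $\lambda$ below are hypothetical placeholders for the statistics described above:

```cpp
#include <opencv2/opencv.hpp>
#include <cmath>

// Approximate a feature channel at scale ratio r (r < 1: downsampling) without
// recomputing the features: resize once, then correct the channel energy with
// f(I, s) = a * exp(-lambda * s) * f(I, 0), taking s = -log2(r) in octaves.
cv::Mat scaleChannelFast(const cv::Mat& channel, double r,
                         double a = 1.0, double lambda = 0.1) {  // hypothetical fit
    cv::Mat scaled;
    cv::resize(channel, scaled, cv::Size(), r, r, cv::INTER_LINEAR);
    double s = -std::log2(r);               // scale offset in octaves
    scaled *= a * std::exp(-lambda * s);    // energy correction factor
    return scaled;
}
```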
step 3, multi-scale tracking:

The grayscale feature in the original MOSSE tracking algorithm is replaced with the ACF feature: the features of each channel are vectorized to obtain several feature vectors, these are combined into a matrix, and its Fourier transform is taken as $F_i$, which makes the tracking more robust. The tracking results are then computed at the different scales:
3-1, the width and height of the feature channels of the original-scale feature map are $W_0$ and $H_0$ respectively; let the center point of the target region of the original-scale feature map be $(x_0, y_0)$; the center positions of the corresponding target regions after each feature pyramid sampling are $(x_0 s_1, y_0 s_1), (x_0 s_2, y_0 s_2), \ldots, (x_0 s_n, y_0 s_n)$;
3-2 calculate convolution filter:
for the case where the previous frame was the first time the target was detected:
under each feature pyramid sampling, a Gaussian distribution whose peak lies at the center position of the target region is generated as the desired output response $G_1, G_2, \ldots, G_n$ of the target region, and the filter coefficients of the convolution filter are initialized as:

$H_i^{*} = \dfrac{G_i \odot F_i^{*}}{F_i \odot F_i^{*}}, \quad i = 1, 2, \ldots, n$

where $F_i$ is the Fourier transform of the target region, $G_i$ is the Gaussian distribution centered on the target region, $\odot$ denotes element-wise multiplication, and the superscript $*$ denotes the complex conjugate;
for the case where the previous frame was not the first time the target was detected:
the convolution filter is updated by adopting the following strategy:
$H_i^{t} = \eta\,\dfrac{G_i \odot F_i^{*}}{F_i \odot F_i^{*}} + (1 - \eta)\,H_i^{t-1}$

where $\eta$ is the weight coefficient and $H_i^{t-1}$ is the filter coefficient of the convolution filter of the previous frame;
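A frequency-domain sketch of step 3-2 (initialization and update), assuming OpenCV and single-channel CV_32F inputs; MOSSE's usual preprocessing (log transform, cosine window) is omitted for brevity:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Initialize the filter H* = (G . conj(F)) / (F . conj(F)) for one scale.
// 'patch' is the target-region feature plane, 'gauss' the desired Gaussian
// response whose peak sits at the target center; both CV_32F, same size.
cv::Mat initFilter(const cv::Mat& patch, const cv::Mat& gauss) {
    cv::Mat F, G, num, den;
    cv::dft(patch, F, cv::DFT_COMPLEX_OUTPUT);
    cv::dft(gauss, G, cv::DFT_COMPLEX_OUTPUT);
    cv::mulSpectrums(G, F, num, 0, /*conjB=*/true);  // G . conj(F)
    cv::mulSpectrums(F, F, den, 0, /*conjB=*/true);  // F . conj(F), real-valued
    std::vector<cv::Mat> n, d;
    cv::split(num, n);
    cv::split(den, d);
    cv::Mat denom = d[0] + 1e-5f;                    // regularize to avoid /0
    std::vector<cv::Mat> h = { n[0] / denom, n[1] / denom };
    cv::Mat H;
    cv::merge(h, H);
    return H;                                        // complex filter H*
}

// Update with learning rate eta: H_t = eta * H_new + (1 - eta) * H_{t-1}.
void updateFilter(cv::Mat& H, const cv::Mat& patch, const cv::Mat& gauss, float eta) {
    cv::Mat Hnew = initFilter(patch, gauss);
    H = eta * Hnew + (1.0f - eta) * H;
}
```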
3-3, inputting the resampled feature maps of all scales into the convolution filters, whose output responses are $g_1, g_2, \ldots, g_n$ respectively, and obtaining the tracking value of each scale separately as

$\mathrm{PSR}_i = \dfrac{g_{\max} - \mu_s}{\sigma_s}, \quad i = 1, 2, \ldots, n$

where $g_{\max}$ is the peak of the output response at the current scale, and $\mu_s$ and $\sigma_s$ are respectively the mean and standard deviation of the output response at the current scale;
3-4, taking the maximum of the tracking values $\mathrm{PSR}_1, \mathrm{PSR}_2, \ldots, \mathrm{PSR}_n$ as the tracking result PSR; when the PSR is larger than the set threshold, tracking is considered successful, tracking continues on the next frame, and the tracking result is mapped back to the original scale according to the corresponding scale ratio to determine the target region; when the PSR is smaller than the set threshold, tracking is considered to have failed and the current tracking target is cancelled.
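Finally, a sketch of steps 3-3 and 3-4: computing the PSR of each scale's response and selecting the best scale. The patent does not specify the threshold; 7.0 is a value reported for MOSSE-style trackers and is used here purely as a placeholder:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Tracking value of one correlation response (step 3-3):
// PSR = (g_max - mu_s) / sigma_s over the response map.
float psr(const cv::Mat& response) {
    double gmax;
    cv::minMaxLoc(response, nullptr, &gmax);
    cv::Scalar mean, stddev;
    cv::meanStdDev(response, mean, stddev);
    return static_cast<float>((gmax - mean[0]) / (stddev[0] + 1e-5));
}

// Step 3-4: keep the scale with the maximum PSR; report failure if even the
// best PSR falls below the threshold.
bool selectScale(const std::vector<cv::Mat>& responses, int& bestScale, float& bestPsr) {
    const float threshold = 7.0f;  // hypothetical; the patent leaves it unspecified
    bestPsr = -1.f;
    for (int i = 0; i < static_cast<int>(responses.size()); ++i) {
        float p = psr(responses[i]);
        if (p > bestPsr) { bestPsr = p; bestScale = i; }
    }
    return bestPsr > threshold;
}
```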