CN111915653A - Method for tracking double-station visual target - Google Patents

Method for tracking double-station visual target

Info

Publication number
CN111915653A
CN111915653A CN202010823124.5A
Authority
CN
China
Prior art keywords
target
tracking
filter
image
station
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010823124.5A
Other languages
Chinese (zh)
Other versions
CN111915653B (en)
Inventor
王小凌
孙忠海
史文浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Aircraft Industry Group Co Ltd
Original Assignee
Shenyang Aircraft Industry Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Aircraft Industry Group Co Ltd filed Critical Shenyang Aircraft Industry Group Co Ltd
Priority to CN202010823124.5A priority Critical patent/CN111915653B/en
Publication of CN111915653A publication Critical patent/CN111915653A/en
Application granted granted Critical
Publication of CN111915653B publication Critical patent/CN111915653B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/262Analysis of motion using transform domain methods, e.g. Fourier domain methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20112Image segmentation details
    • G06T2207/20132Image cropping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

A method for dual-station visual target tracking belongs to the technical field of target tracking in computer vision. The method comprises the following steps: acquiring the image region of the tracked target by target detection; extracting features from the two images, establishing a discriminative correlation filter model for each, and fusing the models; performing target tracking with the fusion model in subsequent frames; and judging the tracking result from the filter response, performing target re-identification, and updating the model online. The target tracking device for implementing the method comprises a binocular image acquisition unit, a detection unit, a tracking unit, and a servo drive unit.

Description

Method for tracking double-station visual target
Technical Field
The invention relates to the technical field of target tracking in computer vision, in particular to a method and a device for tracking a target by double-station vision.
Background
Visual target tracking is an important research topic in computer vision and is widely applied in fields such as video surveillance, security, and intelligent transportation. Owing to challenging factors such as illumination change, scale change, target deformation, fast motion, and occlusion, designing a real-time, accurate, and robust visual tracking system remains a difficult task. A typical visual tracking method consists of initialization, feature extraction, a motion model, an appearance model, and model updating.
Initialization of visual tracking is usually done by manually selecting the target region or automatically locating the target with a detection algorithm; in automated scenarios the initial tracking box is usually given by target detection. Feature extraction must not only be strongly discriminative but also computationally efficient, so as to meet the real-time requirement of tracking. The features used by tracking algorithms to date can be broadly divided into hand-crafted and learned features. The motion model is based on the assumption that the target's motion is continuous: candidate target positions may be sampled around the target position of the previous frame, e.g., by sliding windows, mean shift, or particle filters. The appearance model selects the target from the set of candidate positions and can be divided into generative and discriminative methods. Model updating adaptively adjusts to changes in target appearance: generative methods update a subspace or template of the target, while discriminative methods update the classifier by continuously adding positive and negative samples.
Visual tracking algorithms have developed significantly over the last decade. The article "Tracking-Learning-Detection" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 34(7): 1409-1422) combines tracking, online learning, and detection for long-term tracking, and the article "High-Speed Tracking with Kernelized Correlation Filters" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 583-596) achieves fast and accurate tracking with kernelized correlation filters.
However, these advanced trackers are prone to failure when the target pose changes dramatically. Moreover, a visual tracking system based on a single station has field-of-view blind areas and cannot track a moving target over its entire course.
Disclosure of Invention
In view of the above problems, the present invention provides a dual-station visual tracking method and a hardware device to achieve real-time visual tracking of a target. The method eliminates the field-of-view blind area of single-station tracking and improves the ability to track objects whose pose changes rapidly, so that the target can be tracked continuously.
The technical scheme of the invention is as follows:
a method of dual station visual target tracking, the method comprising: training a specific class target detector to complete the initialization positioning of the target; respectively extracting features of the binocular images, establishing a relevant filtering model, and fusing; in the subsequent frame, target tracking is carried out by utilizing the fusion model; and judging that the target is lost or shielded by utilizing the related filtering response, and rediscovery by utilizing target detection.
The method comprises the following concrete steps:
Step 1: train a target detector offline for a particular target class; during online operation, acquire the images of the two stations and obtain the regions of the target in the two images with the target detector.
Step 2: process the two images separately: crop an image sample centered on the target position, extract features, train a correlation filter model for each image, and fuse the two models.
Step 3: in each subsequent frame, crop a search image region in each of the two images centered on the previous frame's target position, extract features, and obtain the position and scale of the target with the fusion model.
Step 4: set a response threshold; when the filter response falls below the threshold, judge the target lost or occluded and start online target detection to re-detect it; once the target is re-detected, repeat steps 2 and 3 until the last frame.
Further, the off-line training of the target detector in step 1 and the on-line detection process of the target detector in step 4 specifically include the following steps:
(1) loading a pre-trained YOLOv3 target detector model;
(2) annotating historical images and inputting them into the detection network for fine-tuning training;
(3) inputting the image to be detected into the detection network and outputting the target area.
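A minimal sketch of steps (1) and (3) is given below, assuming OpenCV's DNN module is used to run a Darknet-format YOLOv3 model; the file names, input resolution, and confidence threshold are illustrative assumptions, not values from the patent. The fine-tuning of step (2) would be done in the detector's own training framework and is not shown.

import cv2
import numpy as np

def detect_target(image, net, conf_thresh=0.5):
    """Return the best (x1, y1, x2, y2) box found by the detector, or None."""
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())
    best = None
    for out in outputs:                      # each row: cx, cy, w, h, objectness, class scores...
        for det in out:
            scores = det[5:]
            conf = det[4] * scores.max()
            if conf > conf_thresh and (best is None or conf > best[0]):
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                best = (conf, (int(cx - bw / 2), int(cy - bh / 2),
                               int(cx + bw / 2), int(cy + bh / 2)))
    return None if best is None else best[1]

# hypothetical model files produced by the fine-tuning of step (2):
# net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3_finetuned.weights")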
Furthermore, the annotation comprises the category of the target and a rectangular target frame. Let the ith frame image be I_i, the target category be C, and the target box have upper-left corner (x1, y1) and lower-right corner (x2, y2); the annotated content is then: C, x1, y1, x2, y2. The annotated samples are divided into a training set and a validation set and then used to train the YOLOv3 detector network. The binocular images are denoted I_i^L and I_i^R, i ∈ {0, 1, 2, ..., N}, where the right superscript indicates the left or right image; for the initial frame i = 0, the trained detector detects the target in I_0^L and I_0^R, giving target frames B^L and B^R.
Further, in step 2 and step 3, the specific steps of cropping the image are as follows (a sketch follows the list):
(1) the target box is (cx, cy, w, h), where (cx, cy) is the target center and (w, h) are the width and height of the target;
(2) with (cx, cy) as the center, a square image block whose side length is determined by w and h is cropped and resized to 240 × 240.
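The sketch below illustrates this cropping step. The exact side-length expression is given only as an equation image in the original, so the code assumes a common correlation-filter choice of a padded square with side 2·sqrt(w·h); that padding rule is an assumption, not the patent's formula.

import cv2
import numpy as np

def crop_square(image, box, out_size=240):
    """box = (cx, cy, w, h); returns an out_size x out_size patch centered on the target."""
    cx, cy, w, h = box
    side = int(round(2 * np.sqrt(w * h)))        # assumed padding rule
    x0, y0 = int(cx - side / 2), int(cy - side / 2)
    # pad the image so the crop never leaves the frame
    pad = side
    padded = cv2.copyMakeBorder(image, pad, pad, pad, pad, cv2.BORDER_REPLICATE)
    patch = padded[y0 + pad:y0 + pad + side, x0 + pad:x0 + pad + side]
    return cv2.resize(patch, (out_size, out_size))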
Further, in step 2 and step 3, the specific steps of feature extraction are as follows:
(1) extract histogram of oriented gradients and color features from the cropped image, normalize the feature-map scales, and concatenate them to obtain the multi-channel feature p;
(2) construct a two-dimensional cosine window matrix ω and multiply it element-wise with the multi-channel feature p to obtain the final multi-channel feature map x = ω ⊙ p.
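As an illustration of step (2), the sketch below builds a small multi-channel feature map and applies the two-dimensional cosine window; simple intensity and gradient channels stand in for the HOG and color features named above, and the cell size is an assumed value.

import cv2
import numpy as np

def multichannel_features(patch, cell=4):
    """Return a windowed M x N x D feature map for a cropped patch."""
    gray = cv2.cvtColor(patch, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1)
    p = np.stack([gray, np.abs(gx), np.abs(gy)], axis=-1)          # stand-in channels
    p = cv2.resize(p, (patch.shape[1] // cell, patch.shape[0] // cell))
    # 2-D cosine (Hann) window suppresses boundary effects
    window = np.outer(np.hanning(p.shape[0]), np.hanning(p.shape[1]))
    return p * window[:, :, None]                                   # x = window ⊙ p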
Further, in step 2, the specific steps for building the correlation filter model are:
(1) construct a two-dimensional Gaussian matrix y as the regression target, with its scale equal to the feature-map resolution and its Gaussian center at the feature-map center, and apply a fast Fourier transform to it to obtain Y;
(2) apply a two-dimensional fast Fourier transform to the multi-channel feature map x to obtain X;
(3) obtain the correlation filter coefficients F from the results of steps (1) and (2), i.e., the correlation filter model;
(4) perform steps (1)-(3) on the images of the two stations to obtain correlation filter models F^L and F^R; the fused correlation filter model is F = (1 - μ)F^L + μF^R, where μ ∈ (0, 1) is a weight coefficient.
Further, the multi-channel feature map is denoted x ∈ R^{M×N×D}, where M and N are the spatial dimensions of the feature map and D is the number of feature channels. The optimal correlation filter f is learned by assuming the target confidence obeys a spatial Gaussian distribution y ∈ R^{M×N}, i.e., the two-dimensional Gaussian matrix y, and minimizing the regression loss over all training samples. The objective function is defined as:

$$\min_{f}\;\left\|\sum_{d=1}^{D} f^{d} \ast x^{d} - y\right\|^{2} + \lambda \sum_{d=1}^{D}\left\|f^{d}\right\|^{2} \qquad (1)$$

where f is the discriminative correlation filter, * denotes convolution, λ is the regularization coefficient, and d indexes the feature channels. By transforming to the frequency domain, this ridge regression problem has the closed-form solution:

$$F^{d} = \frac{\bar{X}^{d} \odot Y}{\sum_{k=1}^{D} X^{k} \odot \bar{X}^{k} + \lambda} \qquad (2)$$

where capital letters denote the corresponding DFTs, ⊙ denotes element-wise multiplication, and X̄ denotes the complex conjugate of X.

Following the above steps, the two correlation filters F^L and F^R are obtained; the final fused correlation filter is then:

F = (1 - μ)F^L + μF^R    (3)

where μ is a weight coefficient.
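A compact sketch of equations (1)-(3) follows: the Gaussian regression target, the closed-form per-channel filter in the Fourier domain, and the fusion of the two station filters. The Gaussian width, regularization weight, and fusion weight are assumed values.

import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Gaussian regression target with its peak rolled to (0, 0)."""
    M, N = shape
    ys, xs = np.mgrid[0:M, 0:N]
    g = np.exp(-((ys - M // 2) ** 2 + (xs - N // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(g, (-(M // 2), -(N // 2)), axis=(0, 1))

def train_filter(x, lam=1e-3):
    """x: M x N x D windowed feature map -> per-channel filter F in the Fourier domain."""
    Y = np.fft.fft2(gaussian_label(x.shape[:2]))
    X = np.fft.fft2(x, axes=(0, 1))
    num = np.conj(X) * Y[:, :, None]
    den = np.sum(X * np.conj(X), axis=2) + lam
    return num / den[:, :, None]

def fuse_filters(FL, FR, mu=0.5):
    return (1 - mu) * FL + mu * FR            # F = (1 - mu) F^L + mu F^R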
Further, in step 3, the position and scale of the target are obtained with the fusion model as follows:
Based on the target position and scale of the previous frame, an image block is cropped as the target search region as in step 3 and preprocessed to obtain the multi-channel feature map x of the search region; the response z of the target over the search region is:

$$z = \mathcal{F}^{-1}\!\left(\sum_{d=1}^{D} F^{d} \odot X^{d}\right) \qquad (4)$$

where $\mathcal{F}^{-1}$ denotes the inverse discrete Fourier transform and ⊙ denotes element-wise multiplication.
The response maps of the binocular images are z^L and z^R respectively, and the position of the peak of each response map is the position of the target; the target scale is determined by a multi-scale search, giving the target frame in each of the two images.
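The localization step of equation (4) can be sketched as below: the response map is the inverse FFT of the filtered search-region spectrum, and its peak gives the target's displacement from the search-window center. A multi-scale search would repeat this over a few resized crops and is omitted for brevity.

import numpy as np

def track_step(F, x_search):
    """F: fused filter; x_search: windowed feature map of the search region."""
    X = np.fft.fft2(x_search, axes=(0, 1))
    z = np.real(np.fft.ifft2(np.sum(F * X, axis=2)))      # response map
    peak = np.unravel_index(np.argmax(z), z.shape)
    M, N = z.shape
    dy = (peak[0] + M // 2) % M - M // 2                   # wrap to signed offsets
    dx = (peak[1] + N // 2) % N - N // 2
    return (dx, dy), z.max()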
Further, the specific process of step 4 is as follows: a threshold τ is set for the filter response of each binocular image; when max(z) < τ, target tracking is judged to have failed, so the detector is used again for target re-identification, and the correlation filter is re-initialized once the target is detected. A filter that is still valid is updated in a linear fashion with learning rate η:

F = (1 - η)F_{t-1} + ηF_t    (5)

where the subscript t-1 denotes the filter of the previous frame and t denotes the current frame.
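A short sketch of this failure test and the linear update of equation (5); the threshold τ and learning rate η are assumed values.

def update_or_redetect(F_prev, F_new, peak_value, tau=0.15, eta=0.02):
    """Return the updated filter, or None to signal that the detector should be re-run."""
    if peak_value < tau:                       # max(z) < tau -> tracking failed
        return None                            # caller re-detects and re-initializes the filter
    return (1 - eta) * F_prev + eta * F_new    # F = (1 - eta) F_{t-1} + eta F_t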
The invention also provides a device for dual-station visual target tracking, which comprises a binocular image acquisition unit, a detection unit, a tracking unit and a servo drive unit, wherein:
the binocular image acquisition unit is mainly used for clearly imaging a target;
the detection unit is mainly used for marking the area of the target in the image;
the tracking unit is mainly used for marking the area of the target in the subsequent frame;
the servo drive unit drives the visual axis of the lens to follow the spatial position of the target, mainly according to the angle by which the target deviates from the center of the field of view.
Compared with the prior art, the invention has the following advantages: the dual-station tracking method and device eliminate the field-of-view blind area of single-station tracking and improve tracking accuracy and stability; combined with target detection, they achieve automatic discovery of the target after it enters the field of view and real-time tracking over the whole course.
Drawings
FIG. 1 is a flow chart of a method of detection-assisted two-station visual target tracking in the present invention;
FIG. 2 is a schematic diagram of the dual-station tracking apparatus of the present invention.
Detailed Description
The invention is further illustrated by the accompanying drawings and the detailed description below.
The basic idea of the invention is as follows: the region of the target in the image is automatically marked by a YOLOv3 target detector, and the tracking stage is entered; the images of the two stations are modeled separately with multi-channel-feature discriminative correlation filters, which are fused into an enhanced model; when the target is occluded or lost, this is detected automatically and the target is re-found using the detector.
As shown in fig. 1, the method steps of the present invention are implemented as follows:
1. First, the target images are annotated with the category and the rectangular bounding box of the target. Suppose the ith frame image is I_i, the target category is C, and the target box has upper-left corner (x1, y1) and lower-right corner (x2, y2); the annotated content is then: C, x1, y1, x2, y2. The annotated samples are divided into a training set and a validation set in a 3:1 ratio and then used to train the YOLOv3 network.
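A small sketch of this annotation format C, x1, y1, x2, y2 and the 3:1 train/validation split; file paths and helper names are illustrative assumptions.

import random

def write_label(path, c, x1, y1, x2, y2):
    """Write one annotation line in the format C, x1, y1, x2, y2."""
    with open(path, "w") as f:
        f.write(f"{c},{x1},{y1},{x2},{y2}\n")

def split_train_val(samples, ratio=3):
    """Shuffle and split sample paths into training and validation sets at ratio:1."""
    random.shuffle(samples)
    k = len(samples) * ratio // (ratio + 1)    # 3:1 split
    return samples[:k], samples[k:]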
2. The binocular images are denoted I_i^L and I_i^R, i ∈ {0, 1, 2, ..., N}, where the right superscript indicates the left or right image. For the initial frame i = 0, the detector trained in step 1 detects the target in I_0^L and I_0^R, giving bounding boxes B^L and B^R.
3. Correlation filters are trained on the initial frames I_0^L and I_0^R of the binocular images respectively; the specific steps are as follows:
For an image I, the initial target frame is B = (cx, cy, w, h), where (cx, cy) is the target center position and w and h are the width and height of the target. First, an image block is cropped: with (cx, cy) as the target center, a square block whose side length is determined by w and h is cropped and resampled to a size of 240 × 240.
Then, a histogram of oriented gradients (HOG) and color names (CN) are extracted from the image block as mixed features, the feature-map scales are normalized, and the features are concatenated to obtain the multi-channel feature map x ∈ R^{M×N×D}, where M and N are the spatial dimensions of the feature map and D is the number of feature channels.
Finally, the optimal correlation filter f is learned by assuming the target confidence obeys a spatial Gaussian distribution y ∈ R^{M×N} and minimizing the regression loss over all training samples. The objective function is defined as:

$$\min_{f}\;\left\|\sum_{d=1}^{D} f^{d} \ast x^{d} - y\right\|^{2} + \lambda \sum_{d=1}^{D}\left\|f^{d}\right\|^{2} \qquad (6)$$

where f is the discriminative correlation filter, * denotes convolution, λ is the regularization coefficient, and d indexes the feature channels. By transforming to the frequency domain, this ridge regression problem has the closed-form solution:

$$F^{d} = \frac{\bar{X}^{d} \odot Y}{\sum_{k=1}^{D} X^{k} \odot \bar{X}^{k} + \lambda} \qquad (7)$$

where capital letters denote the corresponding DFTs.

Following the above steps, the two correlation filters F^L and F^R are obtained; the final fused correlation filter is:

F = (1 - μ)F^L + μF^R    (8)

where μ is a weight coefficient.
4. In subsequent frames, target tracking is performed using the fusion model.
The fusion model is used to track the target in each of the binocular images; the specific steps are as follows:
Based on the target position and scale of the previous frame, an image block is cropped as the target search region as in step 3 and preprocessed to obtain the sample x; the response z of the target over the search region is:

$$z = \mathcal{F}^{-1}\!\left(\sum_{d=1}^{D} F^{d} \odot X^{d}\right) \qquad (9)$$

where $\mathcal{F}^{-1}$ denotes the inverse discrete Fourier transform.
The response maps of the binocular images are z^L and z^R respectively, and the peak of each response map is located at the target. The target scale can be determined by a multi-scale search, giving the target frame in each of the two images.
5. Judge whether target tracking has succeeded via a response threshold, perform target re-identification, and update the filter.
For the filter response of each binocular image, a threshold τ is set; when max(z) < τ, target tracking is judged to have failed, so the detector is used again for target re-identification, and the correlation filter is re-initialized once the target is detected. A filter that is still valid is updated in a linear fashion with learning rate η:

F = (1 - η)F_{t-1} + ηF_t    (10)

where the subscript t-1 denotes the filter of the previous frame and t denotes the current frame.
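Tying steps 3-5 together, the sketch below runs one tracking step for a single station using the helper functions sketched earlier (crop_square, multichannel_features, track_step); it is called once per station per frame, and when the returned peak falls below τ the detector of step 1 is re-run for that station, as in the embodiment above. All names and constants are illustrative, not the patent's reference implementation.

def step_station(F, img, box, cell=4):
    """One tracking step for a single station: returns the updated box and the response peak."""
    cx, cy, w, h = box
    patch = crop_square(img, box)                        # 240 x 240 search region
    (dx, dy), peak = track_step(F, multichannel_features(patch, cell))
    scale = cell * 2.0 * (w * h) ** 0.5 / 240.0          # feature-map cells -> image pixels
    return (cx + dx * scale, cy + dy * scale, w, h), peak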
As shown in fig. 2, in order to implement the above tracking method, the present invention provides a dual-station visual target tracking apparatus:
the binocular image acquisition unit is mainly used for clearly imaging a target;
the detection unit is mainly used for marking the area of the target in the image in an initialization stage and a target re-identification stage;
the tracking unit is mainly used for marking the area of the target in the subsequent frame;
the servo drive units drive the lenses to follow the spatial position of the target according to the angle by which the target deviates from the center of each station's field of view (see the sketch below).
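As a hypothetical illustration of the servo drive unit's input, the sketch below converts the target's pixel offset from the image center into azimuth and elevation errors under a pinhole-camera assumption; the focal length in pixels is a calibration parameter, not a value given in the patent.

import math

def boresight_error(box, image_size, focal_px):
    """box = (cx, cy, w, h); returns (azimuth, elevation) errors in radians."""
    cx, cy = box[0], box[1]
    W, H = image_size
    az = math.atan2(cx - W / 2.0, focal_px)       # positive = target to the right of center
    el = math.atan2(H / 2.0 - cy, focal_px)       # positive = target above center
    return az, el                                 # drive the pan/tilt servo with these errors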
Example:
in this embodiment, the flying target shot by the two eyes is tracked, and the scene includes the challenging factors such as illumination change, rapid movement, attitude change and the like. The computer processing platform is provided with a CPU of Intel E5-1650v2, a memory of 64GB and a GPU of NVIDIA RTX2080 Ti.
The algorithm was verified on an outdoor binocular sequence with an aircraft as the tracking target; the video contains 5000 frames in total. To show the tracking behavior of the method intuitively, the results are as follows:
In frames 1-479, both stations detect and track the target normally. At frame 480, the left station tracks and detects normally while the right station loses the target; after the right station reads the left station's target position information, it re-detects the target at frame 540 and continues the tracking and detection task. At frame 780, the right station's image information is clearer than the left station's. At frame 4822, the left station's target detection disappears, and the left station reads the right station's target information; at frame 5000, the left station detects the target again and continues the tracking and detection task.
The results show that the algorithm handles challenging factors such as illumination change, occlusion, and attitude change well and outputs an accurate target region.

Claims (9)

1. The method for tracking the double-station visual target is characterized by comprising the following steps:
step 1: training a target detector offline for a particular target class; acquiring images of the two stations during online operation, and acquiring the areas of the target in the two images by using a target detector;
step 2: respectively processing two images, firstly cutting an image sample by taking a target position as a center, extracting characteristics, training a related filtering model, and fusing the two models;
step 3: in each subsequent frame, cropping a search image region in each of the two images centered on the previous frame's target position, extracting features, and obtaining the position and scale of the target with the fusion model;
step 4: setting a response threshold; when the filter response falls below the threshold, judging that the target is lost or occluded and starting online target detection to re-detect it; once the target is re-detected, repeating steps 2 and 3 until the last frame.
2. The method for two-station visual target tracking according to claim 1, wherein the off-line training of the target detector in step 1 and the on-line detection of the target detector in step 4 comprise the following steps:
(1) loading a pre-trained YOLOv3 target detector model;
(2) annotating historical images and inputting them into the detection network for fine-tuning training;
(3) inputting the image to be detected into the detection network and outputting the target area.
3. The method of claim 2, wherein the annotation comprises the category of the target and a rectangular target frame; the ith frame image is denoted I_i, the target category is C, and the target box has upper-left corner (x1, y1) and lower-right corner (x2, y2), so the annotated content is: C, x1, y1, x2, y2; the annotated samples are divided into a training set and a validation set and then used to train the YOLOv3 detector network; the binocular images are denoted I_i^L and I_i^R, where the right superscript indicates the left or right image; for the initial frame i = 0, the trained detector detects the target in I_0^L and I_0^R, giving target frames B^L and B^R.
4. The method for double-station visual target tracking according to claim 1, wherein in the step 2 and the step 3, the specific step of cropping the image is as follows:
(1) the target box is (cx, cy, w, h), where (cx, cy) is the target center and (w, h) is the width and height of the target;
(2) with (cx, cy) as the center, a square image block whose side length is determined by w and h is cropped and resized to 240 × 240.
5. The method for tracking a two-station visual target according to claim 1, wherein in the steps 2 and 3, the specific steps of feature extraction are as follows:
(1) extract histogram of oriented gradients and color features from the cropped image, normalize the feature-map scales, and concatenate them to obtain the multi-channel feature p;
(2) construct a two-dimensional cosine window matrix ω and multiply it element-wise with the multi-channel feature p to obtain the final multi-channel feature map x = ω ⊙ p.
6. The method for two-station visual target tracking according to claim 5, wherein in the step 2, the specific steps of the creation of the correlation filtering model are:
(1) construct a two-dimensional Gaussian matrix y as the regression target, with its scale equal to the feature-map resolution and its Gaussian center at the feature-map center, and apply a fast Fourier transform to it to obtain Y;
(2) apply a two-dimensional fast Fourier transform to the multi-channel feature map x to obtain X;
(3) obtain the correlation filter coefficients F from the results of steps (1) and (2), i.e., the correlation filter model;
(4) perform steps (1)-(3) on the images of the two stations to obtain correlation filter models F^L and F^R; the fused correlation filter model is F = (1 - μ)F^L + μF^R, where μ ∈ (0, 1) is a weight coefficient.
7. The method of two-station visual target tracking according to claim 6, wherein the multi-channel feature map is denoted x ∈ R^{M×N×D}, where M and N are the spatial dimensions of the feature map and D is the number of feature channels; an optimal correlation filter f is learned by assuming the target confidence obeys a spatial Gaussian distribution y ∈ R^{M×N}, i.e., the two-dimensional Gaussian matrix y, and minimizing the regression loss over all training samples, the objective function being defined as:

$$\min_{f}\;\left\|\sum_{d=1}^{D} f^{d} \ast x^{d} - y\right\|^{2} + \lambda \sum_{d=1}^{D}\left\|f^{d}\right\|^{2} \qquad (1)$$

where f is the discriminative correlation filter, * denotes convolution, λ is the regularization coefficient, and d indexes the feature channels; by transforming to the frequency domain, the ridge regression problem has the closed-form solution:

$$F^{d} = \frac{\bar{X}^{d} \odot Y}{\sum_{k=1}^{D} X^{k} \odot \bar{X}^{k} + \lambda} \qquad (2)$$

where capital letters denote the corresponding DFTs, ⊙ denotes element-wise multiplication, and X̄ denotes the complex conjugate of X;
following the above steps, the two correlation filters F^L and F^R are obtained, and the final fused correlation filter is:

F = (1 - μ)F^L + μF^R    (3)

where μ is a weight coefficient.
8. The method for tracking a two-station visual target according to claim 7, wherein in step 3 the position and scale of the target are obtained with the fusion model as follows:
based on the target position and scale of the previous frame, an image block is cropped as the target search region as in step 3 and preprocessed to obtain the multi-channel feature map x of the search region; the response z of the target over the search region is:

$$z = \mathcal{F}^{-1}\!\left(\sum_{d=1}^{D} F^{d} \odot X^{d}\right) \qquad (4)$$

where $\mathcal{F}^{-1}$ denotes the inverse discrete Fourier transform and ⊙ denotes element-wise multiplication;
the response maps of the binocular images are z^L and z^R respectively, and the position of the peak of each response map is the position of the target; the target scale is determined by a multi-scale search, giving the target frame in each of the two images.
9. The method for tracking a visual target of a double station as claimed in claim 7, wherein the specific process of step 4 is as follows: a threshold τ is set for the filter response of each binocular image; when max(z) < τ, target tracking is judged to have failed, so the detector is used again for target re-identification, and the correlation filter is re-initialized once the target is detected; a filter that is still valid is updated in a linear fashion with learning rate η:

F = (1 - η)F_{t-1} + ηF_t    (5)

where the subscript t-1 denotes the filter of the previous frame and t denotes the current frame.
CN202010823124.5A 2020-08-17 2020-08-17 Dual-station visual target tracking method Active CN111915653B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010823124.5A CN111915653B (en) 2020-08-17 2020-08-17 Dual-station visual target tracking method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010823124.5A CN111915653B (en) 2020-08-17 2020-08-17 Dual-station visual target tracking method

Publications (2)

Publication Number Publication Date
CN111915653A true CN111915653A (en) 2020-11-10
CN111915653B CN111915653B (en) 2024-06-14

Family

ID=73279863

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010823124.5A Active CN111915653B (en) 2020-08-17 2020-08-17 Dual-station visual target tracking method

Country Status (1)

Country Link
CN (1) CN111915653B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767450A (en) * 2021-01-25 2021-05-07 开放智能机器(上海)有限公司 Multi-loss learning-based related filtering target tracking method and system
CN113608618A (en) * 2021-08-11 2021-11-05 兰州交通大学 Hand region tracking method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015163830A1 (en) * 2014-04-22 2015-10-29 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Target localization and size estimation via multiple model learning in visual tracking
US20180182109A1 (en) * 2016-12-22 2018-06-28 TCL Research America Inc. System and method for enhancing target tracking via detector and tracker fusion for unmanned aerial vehicles
CN108875588A (en) * 2018-05-25 2018-11-23 武汉大学 Across camera pedestrian detection tracking based on deep learning
CN109461172A (en) * 2018-10-25 2019-03-12 南京理工大学 Manually with the united correlation filtering video adaptive tracking method of depth characteristic

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015163830A1 (en) * 2014-04-22 2015-10-29 Aselsan Elektronik Sanayi Ve Ticaret Anonim Sirketi Target localization and size estimation via multiple model learning in visual tracking
US20180182109A1 (en) * 2016-12-22 2018-06-28 TCL Research America Inc. System and method for enhancing target tracking via detector and tracker fusion for unmanned aerial vehicles
CN108875588A (en) * 2018-05-25 2018-11-23 武汉大学 Across camera pedestrian detection tracking based on deep learning
CN109461172A (en) * 2018-10-25 2019-03-12 南京理工大学 Manually with the united correlation filtering video adaptive tracking method of depth characteristic

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李娜; 赵祥模; 赵凤; 刘卫华; 王倩: "Research progress on target tracking algorithms based on appearance models", 计算机工程与科学 (Computer Engineering & Science), no. 03 *
董美宝; 杨涵文; 郭文; 马思源; 郑创: "Correlation-filter UAV visual tracking with multi-feature re-detection", 图学学报 (Journal of Graphics), no. 06 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112767450A (en) * 2021-01-25 2021-05-07 开放智能机器(上海)有限公司 Multi-loss learning-based related filtering target tracking method and system
CN113608618A (en) * 2021-08-11 2021-11-05 兰州交通大学 Hand region tracking method and system
CN113608618B (en) * 2021-08-11 2022-07-29 兰州交通大学 Hand region tracking method and system

Also Published As

Publication number Publication date
CN111915653B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN108647588A (en) Goods categories recognition methods, device, computer equipment and storage medium
CN109146911B (en) Target tracking method and device
CN105930822A (en) Human face snapshot method and system
CN108537751B (en) Thyroid ultrasound image automatic segmentation method based on radial basis function neural network
CN110189257B (en) Point cloud acquisition method, device, system and storage medium
CN112926410A (en) Target tracking method and device, storage medium and intelligent video system
CN101406390A (en) Method and apparatus for detecting part of human body and human, and method and apparatus for detecting objects
Ali et al. Visual tree detection for autonomous navigation in forest environment
Xing et al. Traffic sign recognition using guided image filtering
CN111915653A (en) Method for tracking double-station visual target
CN113936210A (en) Anti-collision method for tower crane
CN112164093A (en) Automatic person tracking method based on edge features and related filtering
Gal Automatic obstacle detection for USV’s navigation using vision sensors
CN108664918B (en) Intelligent vehicle front pedestrian tracking method based on background perception correlation filter
Hempel et al. Pixel-wise motion segmentation for SLAM in dynamic environments
CN110827327A (en) Long-term target tracking method based on fusion
CN111640138A (en) Target tracking method, device, equipment and storage medium
CN114842506A (en) Human body posture estimation method and system
CN115311680A (en) Human body image quality detection method and device, electronic equipment and storage medium
CN111899284B (en) Planar target tracking method based on parameterized ESM network
Ginhoux et al. Model-based object tracking using stereo vision
Du et al. A high-precision vision-based mobile robot slope detection method in unknown environment
Andonovski et al. Development of a novel visual feature detection-based method for aircraft door identification using vision approach
Azi et al. Car tracking technique for DLES project

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant