CN111260687B - Aerial video target tracking method based on semantic perception network and correlation filtering


Info

Publication number
CN111260687B
Authority
CN
China
Prior art keywords
target
frame
mask
network
output
Prior art date
Legal status
Active
Application number
CN202010028112.3A
Other languages
Chinese (zh)
Other versions
CN111260687A (en)
Inventor
李映
尹霄越
朱奕昕
薛希哲
Current Assignee
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202010028112.3A
Publication of CN111260687A
Application granted
Publication of CN111260687B
Status: Active
Anticipated expiration

Classifications

    • G06T7/246: Image analysis - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N3/045: Neural networks - Architecture, e.g. interconnection topology - Combinations of networks
    • G06N3/08: Neural networks - Learning methods
    • G06T7/11: Image analysis - Segmentation; Edge detection - Region-based segmentation
    • G06T2207/20016: Indexing scheme for image analysis or image enhancement - Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform
    • G06T2207/20024: Indexing scheme for image analysis or image enhancement - Filtering details

Abstract

The invention relates to an aerial video target tracking method based on a semantic perception network and correlation filtering, aimed at the target blurring and occlusion problems that correlation filtering algorithms find difficult to handle. A detection module and a segmentation module are introduced: the category information of the target is recorded in the first frame, the target candidate region in subsequent frames is detected and semantically segmented to obtain candidate boxes and segmentation masks of the same category in that region, and the candidate box and mask are fused to process the response map of the correlation filtering algorithm, suppressing non-target regions with large response values to obtain accurate target localization. Thanks to these measures, the present invention can achieve very robust results in a variety of challenging aerial scenes.

Description

Aerial video target tracking method based on semantic perception network and correlation filtering
Technical Field
The invention belongs to the technical field of image processing, and in particular relates to an aerial video target tracking method based on a semantic perception network and correlation filtering.
Background
In recent years, aerial video tracking technology has developed remarkably in both the military and civil fields, offering notable diversity and flexibility. Compared with video captured by common handheld devices, aerial video provides more flexible viewing angles, scales and fields of view. The development of aerial video target tracking has enabled many novel and important applications, such as crowd monitoring, target tracking and aerial navigation. In traditional general-scene target tracking, many algorithms localize a bounding box throughout a video from a given initial state in the first frame; however, factors specific to aerial imaging, such as weather conditions, flight altitude, target size and camera viewing angle, affect the tracking result. At the same time, shadows, background interference and low-light conditions introduced by the highly tilted shooting angle cause much of the originally rich texture and detail of an object to be lost in aerial video. Correlation filtering methods have advanced greatly in recent years and show good tracking performance in both accuracy and speed, so they can meet the requirements of aerial video to a certain extent. However, a tracker can be misled when the target captured in an aerial video appears blurred in shadow or is occluded by other objects. In such cases, if the target is lost for a period of time, a conventional correlation filtering method suffers model drift, so the target can no longer be relocated and tracked. Designing a robust target tracking algorithm for aerial scenes is therefore significant and urgent.
Disclosure of Invention
Technical problem to be solved
In aerial video, camera motion causes targets to become blurred or occluded, which leads to appearance-model drift and, in turn, tracking failure. To address this problem, a robust real-time target tracking method is designed on the basis of efficient correlation filtering algorithms, by fully exploiting the fact that target semantic information is not easily affected by appearance change and by combining it with target detection techniques.
Technical scheme
An aerial video target tracking method based on a semantic perception network and correlation filtering, characterized by comprising the following steps:
Step 1: read the first frame image and the parameters R_target = [x, y, w, h] of the target block in the first frame, where x, y are the horizontal and vertical coordinates of the upper-left corner of the target and w, h are its width and height;
Step 2: determine the target region R_1 from the target center position and size in the first frame, R_1 = [x_center, y_center, 1.5w, 1.5h];
Step 3: extract features in the region R_1; the feature extraction network uses a ResNet50 residual network with a feature pyramid FPN, producing 256-dimensional depth features J at 5 different scales S = {1.0, 0.8, 0.4, 0.2, 0.1} times the original image size;
Step 4: input the features obtained in step 3 into the correlation filtering module and the detection module respectively; in the correlation filtering module, the part J_target of the features J corresponding to R_target is cropped out and used as the target template y_1; the detection module feeds the target feature J_target into its category judgment branch, and the network outputs the category information of the target;
Step 5: read the k-th frame image, where k ≥ 2 with initial value 2, determine the k-th frame target region R_k from the target parameters [x_{k-1}, y_{k-1}, w_{k-1}, h_{k-1}] of the previous frame, extract features of R_k with the method of step 3 to obtain the target feature J_k, and input J_k into the correlation filtering module, the detection module and the semantic segmentation module respectively;
Step 6: in the correlation filtering module, let J_k be the training sample x_k of this frame and, together with the target template y_k of this frame, train the correlation filter w; training of w uses the optimization model:
w = argmin_w L( f(x_k, w), y_k ) + λ‖w‖²
where f(·) denotes the correlation operation, L(·) denotes the squared loss function, and λ is the regularization parameter; for ease of solution, x_k and y_k are transformed by the discrete Fourier transform into X_k and Y_k, the above formula is computed in the frequency domain, and W denotes the frequency-domain form of w; solving gives
W^i = ( Ȳ_k ∘ X_k^i ) / ( Σ_{j=1}^{h} X_k^j ∘ X̄_k^j + λ ),  i = 1, …, h
where h denotes the feature dimension of the training sample; after the correlation filter W is obtained, the initial response map r output by the correlation filtering module is computed as:
r = F^{-1}( Σ_{i=1}^{h} W̄^i ∘ X_k^i )
where F^{-1}(·) denotes the inverse Fourier transform, ∘ denotes the element-wise (dot) product, and ¯ denotes the complex conjugate;
Step 7: the detection module first applies a convolution with kernel size 3×3 to J_k; the output of this convolution is fed into a category judgment branch, a target box regression branch and a mask branch respectively; the category judgment branch applies a 3×3 convolution to its input with output dimension 80, i.e. the number of categories in the COCO dataset, each dimension representing the confidence score of the corresponding category; the target box regression branch applies a 3×3 convolution with output dimension 4, comprising the coordinates of the upper-left and lower-right corners of the target box; the mask branch applies a 3×3 convolution with output dimension 32 and a tanh activation on the output, producing the coefficients c_i corresponding to each pixel, which are used by the semantic segmentation module to generate the target mask; the detection module is pre-trained before the tracking algorithm is executed;
Step 8: combining the category confidence and the regression box, the category and target box can be obtained pixel by pixel; anchors are set with aspect ratios {1:2, 1:1, 2:1}, and candidate boxes are obtained through non-maximum suppression NMS; the candidate boxes are screened according to the target category obtained in the first frame, and the detection boxes in region R_k whose category matches the target are taken as the output of the detection module; at the same time, the mask coefficients of the corresponding pixels are obtained and expressed as C = tanh([c_1, c_2, …, c_t]) ∈ R^{t×32}, where t denotes the number of screened target boxes;
Step 9: the semantic segmentation module feeds J_k into a fully convolutional network FCN, which first applies 3 convolution layers with kernel size 3×3 keeping the dimensionality unchanged, then one 2× upsampling layer, then one 3×3 convolution, and finally a 1×1 convolution outputting a 32-dimensional semantic segmentation prototype, expressed as D = [d_1, d_2, …, d_32] ∈ R^{32×n}, where n is the dimension of the feature map, i.e. the product of the feature map height and width; the semantic segmentation module is pre-trained before the tracking algorithm is executed;
Step 10: combining the mask coefficients C output in step 7, generate the target masks M_t according to the following formula, where p_{i,x,y} denotes an element of the matrix C·D and t denotes that there are t target masks in total:
M_i(x, y) = σ( p_{i,x,y} ),  i = 1, …, t,  where σ(·) is the sigmoid function
Step 11: select from M_t according to the following formula to obtain the final target mask M, where score denotes the category confidence, dist denotes the distance from the mask center to the center of region R_k, and i denotes the index of the mask; the mask with the largest ratio is taken as the final target mask M:
M = M_{i*},  i* = argmax_i ( score_i / dist_i )
Step 12: crop the initial response map r of the correlation filter according to the target box output by the detection module, keeping the values inside the target box and setting the values outside it to 0, to obtain a new response map r_b; then combine the output of the segmentation module according to the following formula to obtain the final semantically fused response map r_m, where p denotes the weight of the mask M;
r_m = (1 - p) r_b + p M
Step 13: find the position of the maximum value of r_m, take it as the target position in this frame, and update the correlation filter w according to the following formula:
w_k = (1 - η) w_{k-1} + η w_k
where η denotes the learning rate;
Step 14: judge whether all images have been processed; if so, end; otherwise, return to step 5.
λ takes the value 0.003.
h takes the value 50.
p takes the value 0.2.
η takes the value 0.03.
The detection module and the semantic segmentation module are jointly pre-trained as follows:
1) normalize the images of the COCO2017 dataset so that the data distribution follows a standard normal distribution, then randomly crop the images and fix their size to 500×500;
2) the category judgment branch uses the smooth-L1 loss function, the target box regression branch uses the standard cross-entropy loss function, and the semantic segmentation module, combining the mask coefficients output by the detection network, uses the loss function:
L_mask = -(1/S) Σ_{i=1}^{S} Σ_{j=1}^{n} [ G_{i,j} log M_{i,j} + (1 - G_{i,j}) log(1 - M_{i,j}) ],  with M_i = σ( c_i · D )
where G denotes the ground-truth mask label and S denotes the number of masks in the image;
the total loss function of the network is the sum of the above 3 loss functions;
3) initialize the feature extraction network FPN + ResNet50 with network parameters pre-trained on ImageNet; optimize training with the stochastic gradient descent SGD algorithm, with optimizer parameters: learning rate 0.001, momentum 0.9, weight decay 5×10^{-4};
4) input the data into the network for training; 27 epochs are trained with 20000 images per epoch, and the resulting network model is used in the tracking process.
Advantageous effects
The invention provides an aerial video target tracking method based on a semantic perception network and correlation filtering, which adopts a correlation filtering tracking algorithm and locates the target more robustly and accurately by fusing semantic information of the target region. To address the target blurring and occlusion that correlation filtering algorithms handle poorly, the invention introduces a detection module and a segmentation module: the category information of the target is recorded in the first frame, the target candidate region is detected and semantically segmented in subsequent frames to obtain candidate boxes and segmentation masks of the same category in that region, and the candidate box and mask are then fused to process the response map of the correlation filtering algorithm, suppressing non-target regions with large response values and yielding accurate target localization. Thanks to these measures, the invention achieves very robust results in a variety of challenging aerial scenes.
Drawings
FIG. 1 flow chart of the present invention
Detailed Description
The invention will now be further described with reference to the following examples and drawings:
1.1 tracking procedure
1) Read the first frame image and the parameters R_target = [x, y, w, h] of the target block in the first frame, where x, y are the horizontal and vertical coordinates of the upper-left corner of the target and w, h are its width and height.
2) Determine the target region R_1 from the target center position and size in the first frame, R_1 = [x_center, y_center, 1.5w, 1.5h].
3) Extract features in the region R_1; the feature extraction network uses a ResNet50 residual network with a feature pyramid FPN, producing 256-dimensional depth features J at 5 different scales S = {1.0, 0.8, 0.4, 0.2, 0.1} times the original image size.
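A minimal sketch of this feature extraction stage, assuming PyTorch and torchvision's resnet_fpn_backbone helper (the keyword names vary across torchvision versions, older releases take pretrained=True instead of weights; the crop handling of R_1 is an illustrative choice, not specified by the patent):

import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet50 + FPN backbone; every pyramid level outputs 256 channels,
# matching the 256-dimensional depth features J described in step 3.
backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None).eval()

def extract_features(frame, region):
    """Crop the region R = [x_center, y_center, width, height] and return FPN features J."""
    xc, yc, w, h = region
    x0, y0 = int(xc - w / 2), int(yc - h / 2)
    patch = frame[:, :, y0:y0 + int(h), x0:x0 + int(w)]   # NCHW crop of the search region
    with torch.no_grad():
        feats = backbone(patch)                           # dict with 5 pyramid levels
    return feats                                          # each level: (1, 256, H_l, W_l)

frame = torch.rand(1, 3, 720, 1280)                       # dummy aerial frame
J = extract_features(frame, (640.0, 360.0, 180.0, 120.0)) # region already scaled by 1.5
print({k: v.shape for k, v in J.items()})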
4) Input the features obtained in step 3) into the correlation filtering module and the detection module respectively. In the correlation filtering module, we crop out the part J_target of the features J corresponding to R_target and use it as the target template y_1; the detection module feeds the target feature J_target into its category judgment branch, and the network outputs the category information of the target.
5) Read the k-th frame image, where k ≥ 2 with initial value 2, determine the k-th frame target region R_k from the target parameters [x_{k-1}, y_{k-1}, w_{k-1}, h_{k-1}] of the previous frame, extract features of R_k with the method of step 3) to obtain the target feature J_k, and input J_k into the correlation filtering module, the detection module and the semantic segmentation module respectively.
6) In the correlation filtering module, let J_k be the training sample x_k of this frame and, together with the target template y_k of this frame, train the correlation filter w. Training of w uses the optimization model:
w = argmin_w L( f(x_k, w), y_k ) + λ‖w‖²
f(·) denotes the correlation operation, L(·) denotes the squared loss function, and λ is the regularization parameter, taking the value 0.003. For ease of solution, x_k and y_k are transformed by the discrete Fourier transform into X_k and Y_k, the above formula is computed in the frequency domain, and W denotes the frequency-domain form of w; solving gives
W^i = ( Ȳ_k ∘ X_k^i ) / ( Σ_{j=1}^{h} X_k^j ∘ X̄_k^j + λ ),  i = 1, …, h
h denotes the feature dimension of the training sample, h = 50. After the correlation filter W is obtained, the initial response map r output by the correlation filtering module is computed according to the following formula.
r = F^{-1}( Σ_{i=1}^{h} W̄^i ∘ X_k^i )
F^{-1}(·) denotes the inverse Fourier transform, ∘ denotes the element-wise (dot) product, and ¯ denotes the complex conjugate.
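The closed-form frequency-domain solution and the response map can be sketched with NumPy as below. This follows the standard single-sample discrete correlation filter (MOSSE/DCF-style) formulation and is only an illustration under that assumption; the conjugation convention and the Gaussian label construction are assumptions, not taken from the patent:

import numpy as np

def train_filter(x, y, lam=0.003):
    """x: (h, H, W) multi-channel training sample, y: (H, W) desired response.
    Returns the frequency-domain filter W, one plane per feature channel."""
    X = np.fft.fft2(x, axes=(-2, -1))
    Y = np.fft.fft2(y)
    denom = (X * np.conj(X)).sum(axis=0) + lam        # shared denominator over channels
    return (np.conj(Y)[None] * X) / denom             # W^i = conj(Y) . X^i / (sum_j X^j . conj(X^j) + lam)

def response(W, x):
    """Initial response map r = F^{-1}( sum_i conj(W^i) . X^i )."""
    X = np.fft.fft2(x, axes=(-2, -1))
    return np.real(np.fft.ifft2((np.conj(W) * X).sum(axis=0)))

# toy example: h = 50 feature channels on a 40x40 search window
x_k = np.random.rand(50, 40, 40)
yy, xx = np.mgrid[0:40, 0:40]
y_k = np.exp(-((yy - 20) ** 2 + (xx - 20) ** 2) / (2 * 3.0 ** 2))  # Gaussian label centered on the target
W = train_filter(x_k, y_k)
r = response(W, x_k)
print(r.shape, np.unravel_index(r.argmax(), r.shape))              # peak near the label center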
7) The detection module (for the pre-training procedure see 1.2) first applies a convolution with kernel size 3×3 to J_k, leaving the feature dimension unchanged at 256. The output of this convolution is fed into a category judgment branch, a target box regression branch and a mask branch respectively. The category judgment branch applies a 3×3 convolution to its input with output dimension 80, i.e. the number of categories in the COCO dataset, each dimension representing the confidence score of the corresponding category; the target box regression branch applies a 3×3 convolution with output dimension 4, comprising the coordinates of the upper-left and lower-right corners of the target box; the mask branch applies a 3×3 convolution with output dimension 32 and a tanh activation on the output, producing the coefficients c_i corresponding to each pixel, which are used by the semantic segmentation module to generate the target mask.
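A minimal PyTorch sketch of this detection head, following the layer sizes given above (a shared 3×3 stem and three 3×3 branch convolutions); the padding choice and the absence of intermediate nonlinearities are assumptions of the sketch:

import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Shared 3x3 conv followed by class (80), box (4) and mask-coefficient (32) branches."""
    def __init__(self, in_ch=256, num_classes=80, num_coeffs=32):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, in_ch, 3, padding=1)               # keeps 256 channels
        self.cls_branch = nn.Conv2d(in_ch, num_classes, 3, padding=1)   # per-pixel class scores
        self.box_branch = nn.Conv2d(in_ch, 4, 3, padding=1)             # (x1, y1, x2, y2)
        self.coef_branch = nn.Conv2d(in_ch, num_coeffs, 3, padding=1)   # mask coefficients c_i

    def forward(self, feat):
        x = self.stem(feat)
        return self.cls_branch(x), self.box_branch(x), torch.tanh(self.coef_branch(x))

head = DetectionHead()
cls, box, coef = head(torch.rand(1, 256, 32, 32))
print(cls.shape, box.shape, coef.shape)   # (1,80,32,32) (1,4,32,32) (1,32,32,32)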
8) Combining the category confidence and the regression box, the category and target box can be obtained pixel by pixel. Anchors are set with aspect ratios {1:2, 1:1, 2:1}, and more accurate candidate boxes are obtained through non-maximum suppression (NMS). The candidate boxes are screened according to the target category obtained in the first frame, and the detection boxes in region R_k whose category matches the target are taken as the output of the detection module. At the same time, the mask coefficients of the corresponding pixels are obtained and expressed as C = tanh([c_1, c_2, …, c_t]) ∈ R^{t×32}, where t denotes the number of screened target boxes.
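The screening in step 8) can be sketched with torchvision's NMS as follows; the IoU and score thresholds are illustrative assumptions, not values stated in the patent:

import torch
from torchvision.ops import nms

def screen_candidates(boxes, cls_scores, coeffs, target_class, iou_thr=0.5, score_thr=0.05):
    """boxes: (N, 4) xyxy, cls_scores: (N, 80), coeffs: (N, 32).
    Keep NMS survivors whose best class equals the first-frame target class."""
    scores, labels = cls_scores.max(dim=1)
    keep = nms(boxes, scores, iou_thr)
    keep = keep[(labels[keep] == target_class) & (scores[keep] > score_thr)]
    return boxes[keep], scores[keep], coeffs[keep]        # coeffs[keep] plays the role of C (t x 32)

# toy usage
b = torch.tensor([[10., 10., 50., 50.], [12., 11., 52., 49.], [80., 80., 120., 130.]])
s = torch.rand(3, 80)
c = torch.tanh(torch.rand(3, 32))
print(screen_candidates(b, s, c, target_class=2)[0])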
9) The segmentation module (for the pre-training procedure see 1.2) feeds J_k into a fully convolutional network (FCN), which first applies 3 convolution layers with kernel size 3×3 keeping the dimensionality unchanged, then one 2× upsampling layer, then one 3×3 convolution, and finally a 1×1 convolution outputting a 32-dimensional semantic segmentation prototype, expressed as D = [d_1, d_2, …, d_32] ∈ R^{32×n}, where n is the dimension of the feature map, i.e. the product of the feature map height and width.
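A sketch of this prototype branch in PyTorch, directly following the layer sequence above; the ReLU activations and bilinear upsampling mode are assumptions of the sketch:

import torch
import torch.nn as nn

class PrototypeFCN(nn.Module):
    """3x (3x3 conv, 256 ch) -> 2x upsample -> 3x3 conv -> 1x1 conv to 32 prototype channels."""
    def __init__(self, in_ch=256, num_protos=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, num_protos, 1),
        )

    def forward(self, feat):
        proto = self.net(feat)                 # (B, 32, 2H, 2W)
        b, k, h, w = proto.shape
        return proto.view(b, k, h * w)         # D in R^{32 x n}, n = h * w

D = PrototypeFCN()(torch.rand(1, 256, 32, 32))
print(D.shape)                                 # (1, 32, 4096)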
10) Combining the mask coefficients C output in step 7), generate the target masks M_t according to the following formula, where p_{i,x,y} denotes an element of the matrix C·D and t denotes that there are t target masks in total:
M_i(x, y) = σ( p_{i,x,y} ),  i = 1, …, t,  where σ(·) is the sigmoid function
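Assembling the candidate masks from the coefficients C and the prototypes D can then be sketched as below; the sigmoid squashing and the reshape back to the spatial grid are assumptions consistent with the matrix shapes given in steps 8) and 9):

import torch

def assemble_masks(C, D, height, width):
    """C: (t, 32) mask coefficients, D: (32, n) prototypes with n = height * width.
    Returns t candidate masks of shape (t, height, width) with values in [0, 1]."""
    P = C @ D                                   # (t, n), elements p_{i,x,y}
    return torch.sigmoid(P).view(-1, height, width)

C = torch.tanh(torch.rand(3, 32))               # t = 3 screened target boxes
D = torch.rand(32, 64 * 64)                     # prototypes for a 64x64 map
M_t = assemble_masks(C, D, 64, 64)
print(M_t.shape)                                # (3, 64, 64)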
11) Select from M_t according to the following formula to obtain the final target mask M, where score denotes the category confidence, dist denotes the distance from the mask center to the center of region R_k, and i denotes the index of the mask; the mask with the largest ratio is taken as the final target mask M.
M = M_{i*},  i* = argmax_i ( score_i / dist_i )
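A sketch of this selection rule; computing the mask center as the centroid of the mask values is an assumption, since the patent only states that the distance is measured from the mask center to the center of R_k:

import torch

def select_mask(masks, scores, region_center, eps=1e-6):
    """masks: (t, H, W), scores: (t,), region_center: (cx, cy) of R_k in mask coordinates."""
    t, H, W = masks.shape
    ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing='ij')
    area = masks.sum(dim=(1, 2)) + eps
    cx = (masks * xs).sum(dim=(1, 2)) / area              # mask centroids
    cy = (masks * ys).sum(dim=(1, 2)) / area
    dist = ((cx - region_center[0]) ** 2 + (cy - region_center[1]) ** 2).sqrt() + eps
    i_star = torch.argmax(scores / dist)                  # largest score/dist ratio
    return masks[i_star]

M = select_mask(torch.rand(3, 64, 64), torch.tensor([0.9, 0.7, 0.8]), (32.0, 32.0))
print(M.shape)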
12) Crop the initial response map r of the correlation filter according to the target box output by the detection module, keeping the values inside the target box and setting the values outside it to 0, to obtain a new response map r_b. Then combine the output of the segmentation module according to the following formula to obtain the final semantically fused response map r_m, where p denotes the weight of the mask M and is taken as 0.2 in the invention.
r_m = (1 - p) r_b + p M
13) Find the position of the maximum value of r_m, take it as the target position in this frame, and update the correlation filter w according to the following formula, where η denotes the learning rate and is taken as 0.03.
w_k = (1 - η) w_{k-1} + η w_k
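Steps 12) and 13) amount to a few array operations. The sketch below assumes the response map, the mask and the detection box have already been resampled to a common grid (how that alignment is done is not specified in the patent), with hypothetical helper names:

import numpy as np

def fuse_and_locate(r, box, M, p=0.2):
    """r: (H, W) correlation filter response, box: (x1, y1, x2, y2) detection box in r's coordinates,
    M: (H, W) selected target mask. Returns the fused map r_m and the peak position."""
    x1, y1, x2, y2 = map(int, box)
    r_b = np.zeros_like(r)
    r_b[y1:y2, x1:x2] = r[y1:y2, x1:x2]               # keep values inside the box only
    r_m = (1 - p) * r_b + p * M                       # semantic fusion
    peak = np.unravel_index(r_m.argmax(), r_m.shape)  # (row, col) target position
    return r_m, peak

def update_filter(w_prev, w_new, eta=0.03):
    """Linear interpolation update of the correlation filter."""
    return (1 - eta) * w_prev + eta * w_new

r_m, peak = fuse_and_locate(np.random.rand(64, 64), (20, 18, 44, 46), np.random.rand(64, 64))
print(peak)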
14) Judge whether all images have been processed; if so, end; otherwise return to step 5).
1.2 Joint pre-training of the detection and semantic segmentation modules
1) Normalize the images of the COCO2017 dataset so that the data distribution follows a standard normal distribution, then randomly crop the images and fix their size to 500×500.
2) The network structure of the detection module and the segmentation module is as described in 1.1. The category judgment branch uses the smooth-L1 loss function, the target box regression branch uses the standard cross-entropy loss function, and the semantic segmentation module, combining the mask coefficients output by the detection network, uses the loss function shown below, where the meanings of C, D and n are as in 1.1, G denotes the ground-truth mask label, and S denotes the number of masks in the image:
L_mask = -(1/S) Σ_{i=1}^{S} Σ_{j=1}^{n} [ G_{i,j} log M_{i,j} + (1 - G_{i,j}) log(1 - M_{i,j}) ],  with M_i = σ( c_i · D )
The total loss function of the network is the sum of the above 3 loss functions.
3) Initialize the feature extraction network FPN + ResNet50 with network parameters pre-trained on ImageNet. Optimize training with the stochastic gradient descent (SGD) algorithm, with optimizer parameters: learning rate 0.001, momentum 0.9, weight decay 5×10^{-4}.
4) Input the data into the network for training; 27 epochs are trained with 20000 images per epoch, and the resulting network model is used in the tracking process.
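A minimal sketch of the joint pre-training setup with the stated optimizer hyperparameters; the model wrapper, data loader and the compute_losses helper are hypothetical placeholders, since the patent only specifies that the total loss is the sum of the three branch losses:

import torch
from torch.optim import SGD

# `model` is assumed to bundle the FPN+ResNet50 backbone, the detection head and the prototype FCN,
# and `coco_loader` to yield 500x500 normalized COCO2017 crops with box, class and mask labels.
def pretrain(model, coco_loader, epochs=27, iters_per_epoch=20000):
    opt = SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
    for epoch in range(epochs):
        for it, (images, targets) in zip(range(iters_per_epoch), coco_loader):
            cls_loss, box_loss, mask_loss = model.compute_losses(images, targets)  # hypothetical helper
            loss = cls_loss + box_loss + mask_loss    # total loss = sum of the 3 branch losses
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model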

Claims (6)

1. An aerial video target tracking method based on a semantic perception network and correlation filtering, characterized by comprising the following steps:
Step 1: read the first frame image and the parameters R_target = [x, y, w, h] of the target block in the first frame, where x, y are the horizontal and vertical coordinates of the upper-left corner of the target and w, h are its width and height;
Step 2: determine the target region R_1 from the target center position and size in the first frame, R_1 = [x_center, y_center, 1.5w, 1.5h];
Step 3: extract features in the region R_1; the feature extraction network uses a ResNet50 residual network with a feature pyramid FPN, producing 256-dimensional depth features J at 5 different scales S = {1.0, 0.8, 0.4, 0.2, 0.1} times the original image size;
Step 4: input the features obtained in step 3 into the correlation filtering module and the detection module respectively; in the correlation filtering module, the part J_target of the features J corresponding to R_target is cropped out and used as the target template y_1; the detection module feeds the target feature J_target into its category judgment branch, and the network outputs the category information of the target;
Step 5: read the k-th frame image, where k ≥ 2 with initial value 2, determine the k-th frame target region R_k from the target parameters [x_{k-1}, y_{k-1}, w_{k-1}, h_{k-1}] of the previous frame, extract features of R_k with the method of step 3 to obtain the target feature J_k, and input J_k into the correlation filtering module, the detection module and the semantic segmentation module respectively;
Step 6: in the correlation filtering module, let J_k be the training sample x_k of this frame and, together with the target template y_k of this frame, train the correlation filter w; training of w uses the optimization model:
w = argmin_w L( f(x_k, w), y_k ) + λ‖w‖²
where f(·) denotes the correlation operation, L(·) denotes the squared loss function, and λ is the regularization parameter; for ease of solution, x_k and y_k are transformed by the discrete Fourier transform into X_k and Y_k, the above formula is computed in the frequency domain, and W denotes the frequency-domain form of w; solving gives
W^i = ( Ȳ_k ∘ X_k^i ) / ( Σ_{j=1}^{h} X_k^j ∘ X̄_k^j + λ ),  i = 1, …, h
where h denotes the feature dimension of the training sample; after the correlation filter W is obtained, the initial response map r output by the correlation filtering module is computed as:
r = F^{-1}( Σ_{i=1}^{h} W̄^i ∘ X_k^i )
where F^{-1}(·) denotes the inverse Fourier transform, ∘ denotes the element-wise (dot) product, and ¯ denotes the complex conjugate;
Step 7: the detection module first applies a convolution with kernel size 3×3 to J_k; the output of this convolution is fed into a category judgment branch, a target box regression branch and a mask branch respectively; the category judgment branch applies a 3×3 convolution to its input with output dimension 80, i.e. the number of categories in the COCO dataset, each dimension representing the confidence score of the corresponding category; the target box regression branch applies a 3×3 convolution with output dimension 4, comprising the coordinates of the upper-left and lower-right corners of the target box; the mask branch applies a 3×3 convolution with output dimension 32 and a tanh activation on the output, producing the coefficients c_i corresponding to each pixel, which are used by the semantic segmentation module to generate the target mask; the detection module is pre-trained before the tracking algorithm is executed;
Step 8: combining the category confidence and the regression box, the category and target box can be obtained pixel by pixel; anchors are set with aspect ratios {1:2, 1:1, 2:1}, and candidate boxes are obtained through non-maximum suppression NMS; the candidate boxes are screened according to the target category obtained in the first frame, and the detection boxes in region R_k whose category matches the target are taken as the output of the detection module; at the same time, the mask coefficients of the corresponding pixels are obtained and expressed as C = tanh([c_1, c_2, …, c_t]) ∈ R^{t×32}, where t denotes the number of screened target boxes;
Step 9: the semantic segmentation module feeds J_k into a fully convolutional network FCN, which first applies 3 convolution layers with kernel size 3×3 keeping the dimensionality unchanged, then one 2× upsampling layer, then one 3×3 convolution, and finally a 1×1 convolution outputting a 32-dimensional semantic segmentation prototype, expressed as D = [d_1, d_2, …, d_32] ∈ R^{32×n}, where n is the dimension of the feature map, i.e. the product of the feature map height and width; the semantic segmentation module is pre-trained before the tracking algorithm is executed;
Step 10: combining the mask coefficients C output in step 7, generate the target masks M_t according to the following formula, where p_{i,x,y} denotes an element of the matrix C·D and t denotes that there are t target masks in total:
M_i(x, y) = σ( p_{i,x,y} ),  i = 1, …, t,  where σ(·) is the sigmoid function
Step 11: select from M_t according to the following formula to obtain the final target mask M, where score denotes the category confidence, dist denotes the distance from the mask center to the center of region R_k, and i denotes the index of the mask; the mask with the largest ratio is taken as the final target mask M:
M = M_{i*},  i* = argmax_i ( score_i / dist_i )
Step 12: crop the initial response map r of the correlation filter according to the target box output by the detection module, keeping the values inside the target box and setting the values outside it to 0, to obtain a new response map r_b; then combine the output of the segmentation module according to the following formula to obtain the final semantically fused response map r_m, where p denotes the weight of the mask M;
r_m = (1 - p) r_b + p M
Step 13: find the position of the maximum value of r_m, take it as the target position, and update the correlation filter w according to the following formula:
w_k = (1 - η) w_{k-1} + η w_k
where η denotes the learning rate;
Step 14: judge whether all images have been processed; if so, end; otherwise, return to step 5.
2. The aerial video target tracking method based on a semantic perception network and correlation filtering according to claim 1, characterized in that λ takes the value 0.003.
3. The aerial video target tracking method based on a semantic perception network and correlation filtering according to claim 1, characterized in that h takes the value 50.
4. The aerial video target tracking method based on a semantic perception network and correlation filtering according to claim 1, characterized in that p takes the value 0.2.
5. The aerial video target tracking method based on a semantic perception network and correlation filtering according to claim 1, characterized in that η takes the value 0.03.
6. The aerial video target tracking method based on a semantic perception network and correlation filtering according to claim 1, characterized in that the detection module and the semantic segmentation module are jointly pre-trained as follows:
1) normalize the images of the COCO2017 dataset so that the data distribution follows a standard normal distribution, then randomly crop the images and fix their size to 500×500;
2) the category judgment branch uses the smooth-L1 loss function, the target box regression branch uses the standard cross-entropy loss function, and the semantic segmentation module, combining the mask coefficients output by the detection network, uses the loss function:
L_mask = -(1/S) Σ_{i=1}^{S} Σ_{j=1}^{n} [ G_{i,j} log M_{i,j} + (1 - G_{i,j}) log(1 - M_{i,j}) ],  with M_i = σ( c_i · D )
where G denotes the ground-truth mask label and S denotes the number of masks in the image;
the total loss function of the network is the sum of the above 3 loss functions;
3) initialize the feature extraction network FPN + ResNet50 with network parameters pre-trained on ImageNet; optimize training with the stochastic gradient descent SGD algorithm, with optimizer parameters: learning rate 0.001, momentum 0.9, weight decay 5×10^{-4};
4) input the data into the network for training; 27 epochs are trained with 20000 images per epoch, and the resulting network model is used in the tracking process.
CN202010028112.3A 2020-01-10 2020-01-10 Aerial video target tracking method based on semantic perception network and related filtering Active CN111260687B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010028112.3A CN111260687B (en) 2020-01-10 2020-01-10 Aerial video target tracking method based on semantic perception network and related filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010028112.3A CN111260687B (en) 2020-01-10 2020-01-10 Aerial video target tracking method based on semantic perception network and related filtering

Publications (2)

Publication Number Publication Date
CN111260687A CN111260687A (en) 2020-06-09
CN111260687B true CN111260687B (en) 2022-09-27

Family

ID=70943935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010028112.3A Active CN111260687B (en) 2020-01-10 2020-01-10 Aerial video target tracking method based on semantic perception network and related filtering

Country Status (1)

Country Link
CN (1) CN111260687B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI797527B (en) * 2020-12-28 2023-04-01 National Chung-Shan Institute of Science and Technology Object re-identification detection system and method
CN112883836B (en) * 2021-01-29 2024-04-16 China University of Mining and Technology Video detection method for deformation of underground coal mine roadway
CN113298036B (en) * 2021-06-17 2023-06-02 Zhejiang University Unsupervised video object segmentation method


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106952288A (en) * 2017-03-31 2017-07-14 Northwestern Polytechnical University Long-term occlusion-robust tracking method based on convolutional features and global search detection
WO2018232378A1 (en) * 2017-06-16 2018-12-20 Markable, Inc. Image processing system
CN108734151A (en) * 2018-06-14 2018-11-02 Xiamen University Robust long-term target tracking method based on correlation filtering and a deep Siamese network
CN109740448A (en) * 2018-12-17 2019-05-10 Northwestern Polytechnical University Robust aerial video target tracking method based on correlation filtering and image segmentation
CN109816689A (en) * 2018-12-18 2019-05-28 Kunming University of Science and Technology Moving target tracking method with adaptive fusion of multi-layer convolutional features
CN110310303A (en) * 2019-05-06 2019-10-08 Nanchang Jiayan Technology Co., Ltd. Image analysis multi-object tracking method
CN110163887A (en) * 2019-05-07 2019-08-23 State Grid Jiangxi Electric Power Co., Ltd. Maintenance Branch Video target tracking method based on motion interpolation estimation combined with foreground segmentation

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A solution for probabilistic inference and tracking of obstacles classification in urban traffic scenarios;Ion Giosan et al;《2012 IEEE 8th International Conference on Intelligent Computer Communication and Processing》;20121126;第221-227页 *
Learning Background-Aware Correlation Filters for Visual Tracking;Hamed Kiani Galoogahi et al;《arXiv:1703.04590v2 [cs.CV]》;20170331;第1-10页 *
Target-Aware Correlation Filter Tracking in RGBD Videos;Yangliu Kuai et al;《IEEE Sensors Journal 》;20190701;第9522-9531页 *
Adaptive decision fusion target tracking algorithm based on multi-layer convolutional features; Sun Yanjing et al.; Journal of Electronics & Information Technology; 31 October 2019; vol. 41, no. 10; pp. 2465-2470 *
Research on video target tracking technology; Xie Chao et al.; China Doctoral Dissertations Full-text Database, Information Science and Technology; 15 February 2019; vol. 2019, no. 2; I138-66 *

Also Published As

Publication number Publication date
CN111260687A (en) 2020-06-09

Similar Documents

Publication Publication Date Title
US10719940B2 (en) Target tracking method and device oriented to airborne-based monitoring scenarios
CN110929578B (en) Anti-shielding pedestrian detection method based on attention mechanism
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN111260687B (en) Aerial video target tracking method based on semantic perception network and related filtering
CN104378582B (en) A kind of intelligent video analysis system and method cruised based on Pan/Tilt/Zoom camera
CN104106260B (en) Control based on geographical map
CN107204010A (en) A kind of monocular image depth estimation method and system
CN111539370A (en) Image pedestrian re-identification method and system based on multi-attention joint learning
CN108121931B (en) Two-dimensional code data processing method and device and mobile terminal
CN105930822A (en) Human face snapshot method and system
CN105160310A (en) 3D (three-dimensional) convolutional neural network based human body behavior recognition method
CN108960404B (en) Image-based crowd counting method and device
CN111126278B (en) Method for optimizing and accelerating target detection model for few-class scene
CN107909081A (en) The quick obtaining and quick calibrating method of image data set in a kind of deep learning
CN112926410A (en) Target tracking method and device, storage medium and intelligent video system
CN110084837B (en) Target detection and tracking method based on unmanned aerial vehicle video
CN110941996A (en) Target and track augmented reality method and system based on generation of countermeasure network
CN112084952B (en) Video point location tracking method based on self-supervision training
CN111914938B (en) Image attribute classification and identification method based on full convolution two-branch network
CN110310305A (en) A kind of method for tracking target and device based on BSSD detection and Kalman filtering
CN115272876A (en) Remote sensing image ship target detection method based on deep learning
CN111683221A (en) Real-time video monitoring method and system for natural resources embedded with vector red line data
Li et al. Weak moving object detection in optical remote sensing video with motion-drive fusion network
CN116630828A (en) Unmanned aerial vehicle remote sensing information acquisition system and method based on terrain environment adaptation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant