CN111508002B - Small-sized low-flying target visual detection tracking system and method thereof - Google Patents


Info

Publication number
CN111508002B
Authority
CN
China
Prior art keywords
target
frame
tracking
unit
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010309617.7A
Other languages
Chinese (zh)
Other versions
CN111508002A (en)
Inventor
陶然
李伟
黄展超
马鹏阁
揭斐然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Luoyang Institute of Electro Optical Equipment AVIC
Zhengzhou University of Aeronautics
Original Assignee
Beijing Institute of Technology BIT
Luoyang Institute of Electro Optical Equipment AVIC
Zhengzhou University of Aeronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT, Luoyang Institute of Electro Optical Equipment AVIC, Zhengzhou University of Aeronautics filed Critical Beijing Institute of Technology BIT
Priority to CN202010309617.7A
Publication of CN111508002A
Application granted
Publication of CN111508002B

Classifications

    • G06T 7/246 (Physics; Computing; Image data processing or generation; Image analysis; Analysis of motion): analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F 18/214 (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Design or setup of recognition systems or techniques): generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 (Physics; Computing; Computing arrangements based on specific computational models; Neural networks; Architecture): combinations of networks


Abstract

The invention discloses a small low-flying target visual detection and tracking system and method. The system comprises: a video data input unit, a video preprocessing unit, a training data construction unit, a detection model training unit, a target comparison screening unit, a detection correction unit, a reference frame initialization unit, a sample library dynamic construction unit, an online learning unit, a position refinement unit, a decision control unit and a tracking result output unit. The method comprises the following steps: constructing a target detection network and comparing and screening targets; online learning for target tracking; dynamically constructing a classifier training sample library and refining the target tracking position. The invention effectively mitigates tracking drift caused by occlusion, scale change, illumination and similar factors, achieving robust target tracking. It can update the reference-frame features promptly as the target changes, while the introduced feature point matching algorithm avoids erroneous tracking caused by those updates.

Description

Small-sized low-flying target visual detection tracking system and method thereof
Technical Field
The invention relates to the technical field of flying-target tracking, and in particular to a small low-flying target detection and tracking system based on a neural network and online learning, and to its detection and tracking method.
Background
At present there exist methods for joint detection and tracking, and tracking methods suited to slow small targets. Existing methods achieve short-term target tracking through correlation filtering and use neural-network-based target detection to re-locate the target when tracking fails. The related patents and research are as follows:
Chinese invention patent "A robust long-term tracking method based on correlation filtering and target detection", application number CN201910306616.4. This method achieves target tracking through correlation filtering and performs detection with the one-stage detector YOLO; after a detection result is obtained, SURF feature point matching selects the candidate box with the most matching points as the bounding box used to reinitialize the tracker, achieving long-term tracking. However, the method does not consider the extreme imbalance between target and background in the detector, and cannot be applied to long-term tracking of small targets.
Chinese invention patent "A low-altitude slow unmanned aerial vehicle tracking method combining correlation filtering and visual saliency", application number 201910117155.6. The center position of the predicted target is determined from a response map obtained by correlation filtering over a small search area, and the target scale is determined by saliency detection over a large search area, yielding a tracking method suited to low-altitude slow unmanned aerial vehicles. However, the method performs no further processing after tracking fails, and its precision needs further improvement.
In summary, the prior art neither solves the extreme target-background imbalance in single-target detection and tracking nor achieves optimal network performance, and the tracking precision for small targets remains to be improved.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a small low-flying target visual detection and tracking system and method that remedy those defects.
To achieve this purpose, the invention adopts the following technical scheme:
A small low-flying target detection and tracking system comprises: a video data input unit, a video preprocessing unit, a training data construction unit, a detection model training unit, a target comparison screening unit, a detection correction unit, a reference frame initialization unit, a sample library dynamic construction unit, an online learning unit, a position refinement unit, a decision control unit and a tracking result output unit.
The video data input unit is configured to: a plurality of video sequences containing targets are input and randomly divided into two parts, wherein one part is used for training a target detection model, and the other part is used for online testing of a target tracking model.
The video pre-processing unit is configured to: complete the preliminary video preprocessing required by the target detection and tracking units, specifically deleting video segments that contain no target for a long time and removing video segments that obviously do not match the characteristics of small, slow targets flying in low-altitude airspace.
The training data construction unit is configured to: the completeness and richness of training data are guaranteed, a training set and a verification set are constructed in a mode of extracting video frames at equal intervals, and data labeling is carried out, namely the information of the center position, the width and the height of a target in an image is determined and is used for a supervised training target detection model.
The detection model training unit is used for: creating a pyramid-structured target detection network and using focal loss to alleviate the imbalance between target and background. Training stops once the training loss function is observed to stabilize, and the model file with the best validation performance is saved to provide reset-box information when target tracking fails.
The target alignment screening unit is used for: and comparing the first frame true value frame of the tracked video with the detection result by using a SURF (speeded up robust features) feature matching algorithm, and eliminating false alarms which are obviously not low-slow small targets, thereby further ensuring the robustness of long-term stable tracking.
The detection correction unit is used for: starting when either of two conditions occurs: first, when the position confidence of the tracking box falls below a set threshold, indicating that target tracking of the current frame has failed; second, automatically when a specified frame interval is reached, ensuring that the currently tracked target does not differ too much from the reference-frame target features.
The reference frame initialization unit is configured to: according to the received reference-frame target position and scale information, crop a search area at 5 times the target size, scale the image to the specified size of 288 x 288, and input the specified-size search-area image block together with the target position and scale information into the dynamically constructed classifier sample library.
The sample library dynamic construction unit is used for: and performing data enhancement on the reference frame sample, including basic operations of rotation, scaling, dithering and blurring, and receiving a sample newly added in the tracking process.
The online learning unit extracts features from the samples stored in the sample library using a deep ResNet18 network and obtains a predicted Gaussian response map after two fully connected layers. From the label information in the sample library, a true Gaussian label is generated with the target center as the peak of the Gaussian distribution. Taking the reduction of the difference between the predicted response map and the true label as the optimization target, the parameters of the two fully connected layers are adjusted online, so that in the unlabeled case the feature extraction network and the fully connected layers can still produce the predicted Gaussian response map and yield the predicted target position center.
The position refinement unit is configured to: refine the initial tracking result produced by the online learning unit. Several jittered boxes are generated around the predicted target center from the current frame's online learning unit, with the target width and height taken from the previous frame, and mapped onto the current frame's search area. Features are extracted for the jittered boxes with the precise region-of-interest pooling layer and concatenated with the modulation features obtained from the reference frame; the confidence of each jittered box's predicted position is then obtained through a fully connected layer, and the results of the 3 most confident boxes are merged into the refined tracking result of the current frame.
The decision control unit is configured to: and in the tracking process, the target tracking state is judged according to the relation between the confidence coefficient of the predicted position of the tracking frame and a set threshold value, if the target is stably tracked, the tracking of the next frame is continued, and if the target is lost, the detection and correction unit is activated to detect the target of the current frame and update the reference frame so as to realize the long-term stable tracking of the low-slow small target.
The tracking result output unit is used for: and after traversing all the frames of the video, outputting the position and scale information of each frame.
The invention also discloses a small low-flying target visual detection tracking method, which comprises the following steps:
step 1, constructing a target detection network;
1) constructing a network structure comprising: a backbone network, a classification subnet and a regression subnet;
2) A loss function: focal loss is used to address the severe imbalance between positive and negative samples in target detection and to reduce the weight of easy negative samples during training.
Step 2, comparing and screening targets;
Before the result is sent to the detection correction unit, one round of SURF feature point matching is performed: the first-frame target of the current video is matched against the detection result. When the number of matching points exceeds a set value, the detection result is indeed the target currently being tracked and detection succeeds; the detection box is then sent to the detection correction unit for the subsequent process.
Step 3, target tracking online learning;
predicting the target center position, comprising two parts of an initialization classifier and an online classification process:
1) initialization classifier
For the data-enhanced reference frame in the sample library, features are extracted with the feature extraction network, and a two-dimensional Gaussian truth label ygt of the same size as the feature map is generated with the reference-frame target center as its peak; the classifier is initialized from the features and the label, the distance between actual and true values is minimized with a least-squares optimization algorithm, and the nonlinear least-squares problem is solved by the Gauss-Newton iteration method; the formula is as follows:

ygt(x, y) = exp(-((x - x0)^2 + (y - y0)^2) / (2σ^2))   (1)

In formula (1), x ∈ {1, …, M} is the horizontal coordinate on the feature map, with M the feature map width; y ∈ {1, …, N} is the vertical coordinate, with N the feature map height; (x0, y0) is the target center point of the reference frame; σ is the Gaussian bandwidth.
2) Online classification process
Based on the tracking result (x_{t-1}, y_{t-1}, w_{t-1}, h_{t-1}) of the previous frame (frame t-1), where (x_{t-1}, y_{t-1}) is the estimated target center and (w_{t-1}, h_{t-1}) the estimated target width and height, the previous frame's target center is taken as the center of the current frame's (frame t) search region, whose width and height are expanded by a specified ratio k, generating the current-frame search region (x_{t-1}, y_{t-1}, k·w_{t-1}, k·h_{t-1}). The feature extraction network then extracts the search-region features f_t, and after two fully connected layers a predicted Gaussian response map of the same size as the search region is generated:

ŷ_t = φ2(φ1(f_t; weight1); weight2)

where φ1 and φ2 denote the mapping functions of the two fully connected layers, and weight1, weight2 their weight coefficient matrices. The position of maximum response is the target center coordinate (x_t, y_t) estimated for the current frame. The online-trained classifier fully considers both the tracked target and the background region, and is continuously updated to estimate the target position.
Step 4, dynamically constructing a classifier training sample library
1) A reference frame update interval T is set; whenever the current frame number t is divisible by T, the target detection unit is called to update the reference frame, all outdated samples in the sample library are cleared, and the classifier is re-initialized with the updated reference frame. Meanwhile, newly generated samples are added to the sample library in sequence as tracking proceeds, so that the target features in the sample library stay highly similar to the currently tracked sample and the target center position can be estimated accurately.
2) When the confidence coefficient of the predicted position of the tracking frame is smaller than a set threshold, the current frame target tracking fails, the decision control unit sends information of the reference frame needing to be reinitialized to the detection and correction unit, and after the detection and correction unit receives the information of the detection frame of the target detection unit, the information is sent to the reference frame initialization unit, data enhancement and other operations are carried out, and finally the information is sent to a dynamically constructed sample library.
Step 5, fine trimming of the target tracking position;
the method comprises two parts of a feature extraction network and a similarity evaluation network:
1) feature extraction network
Feature extraction uses a ResNet18 network. It balances retention of previous template information against the updated current reference-frame information, providing the neural network with features that combine the target's current and historical states and improving tracking stability. Search-area features are extracted for the reference frame, the current frame, and the image frame at the intermediate time between them, and each is fed into the precise region-of-interest pooling layer for the similarity evaluation network to compute the predicted-position confidence.
2) Similarity evaluation network
The core of the similarity evaluation network is the precise region-of-interest pooling layer, whose input comprises two parts. The first part is the image feature map extracted by the network, bilinearly interpolated with interpolation coefficient IC

IC(x, y, i, j) = max(0, 1-|x-i|) × max(0, 1-|y-j|)   (2)

mapping the discrete feature map into a continuous space and obtaining the feature map f(x, y)

f(x, y) = Σ_{i,j} IC(x, y, i, j) × w_{i,j}   (3)

In formulas (2) and (3), (x, y) is a continuous coordinate on the feature map, (i, j) is the coordinate index on the feature map, and w_{i,j} is the weight corresponding to position (i, j). The second part of the input is the upper-left corner coordinates (x1, y1) and lower-right corner coordinates (x2, y2) of the rectangular box. The precise region-of-interest pooling operation is performed from the obtained continuous spatial feature map and the rectangular-box coordinates, preserving the target features on the image to the greatest extent in preparation for further comparing the similarity of the reference target and historical-frame targets. Finally, the feature map f(x, y) is doubly integrated over the box and divided by the area of the rectangular box, giving the precise region-of-interest pooling (PrROI Pooling) output

PrPool = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 - x1) × (y2 - y1))   (4)
After the features from the precise region-of-interest pooling layer are obtained, the three features of the reference frame, the intermediate frame and the current frame are concatenated and fed into the fully connected layer, which outputs the final position confidence. The similarity between each candidate target and the historical targets is compared, and the most similar target is taken as the tracking result.
Compared with the prior art, the invention has the advantages that:
1) By combining detection and tracking, the invention effectively mitigates tracking drift of the tracked target caused by occlusion, scale change, illumination and similar factors, and achieves robust target tracking.
2) The method can update the reference-frame features promptly as the target changes, while the introduced feature point matching algorithm avoids erroneous tracking caused by those updates.
3) The method is suitable for long-term stable tracking of slow small targets in optical aerial remote sensing imagery.
Drawings
FIG. 1 is a block diagram of a small low-flying target detection and tracking system according to an embodiment of the present invention;
FIG. 2 is a diagram of a target detection network architecture according to an embodiment of the present invention;
FIG. 3 is a flow chart of online learning according to an embodiment of the present invention;
fig. 4 is a flowchart of position refinement according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings by way of examples.
As shown in fig. 1, a small low-flying target visual detection tracking system includes the following units:
(1) a video data input unit. A plurality of video sequences containing targets are input and randomly divided into two parts, wherein one part is used for training a target detection model, and the other part is used for online testing of a target tracking model.
(2) A video preprocessing unit. It completes the preliminary video preprocessing required by the target detection and tracking units, specifically deleting video segments that contain no target for a long time and removing video segments that obviously do not match the characteristics of low-flying slow small targets.
(3) A training data construction unit. To ensure the completeness and richness of the training data, a training set and a validation set are constructed by extracting video frames at equal intervals, and the data are labeled, i.e. the center position, width and height of the target in each image are determined for supervised training of the target detection model.
(4) A detection model training unit. Because low-flying slow small targets vary greatly in scale and the classes are extremely imbalanced during single-target training, a pyramid-structured target detection network is designed and focal loss is used to alleviate the target-background imbalance. Training stops once the training loss function is observed to stabilize, and the model file with the best validation performance is saved to provide reset-box information when target tracking fails.
(5) A target comparison screening unit. Since the target detection result may contain false alarms, the SURF feature matching algorithm compares the first-frame truth box of the tracked video with the detection result, eliminating false alarms that are obviously not low-flying slow small targets and further ensuring robust long-term stable tracking.
(6) A detection correction unit. It is started in two cases: first, when the position confidence of the tracking box falls below a set threshold, indicating that target tracking of the current frame has failed; second, automatically when a specified frame interval is reached, ensuring that the currently tracked target does not differ too much from the reference-frame target features.
(7) A reference frame initialization unit. According to the received reference-frame target position and scale information, a search area is cropped at 5 times the target size and the image is scaled to the specified size of 288 x 288; the specified-size search-area image block and the target position and scale information are then input into the dynamically constructed classifier sample library (a code sketch of this cropping step follows the unit list below).
(8) A sample library dynamic construction unit. Data enhancement is performed on the reference-frame sample, including the basic operations of rotation, scaling, jittering and blurring, and samples newly added during tracking are received.
(9) An online learning unit. Features are extracted from the samples stored in the sample library using a deep ResNet18 network, and a predicted Gaussian response map is obtained after two fully connected layers. From the label information in the sample library, a true Gaussian label is generated with the target center as the peak of the Gaussian distribution. Taking the reduction of the difference between the predicted response map and the true label as the optimization target, the parameters of the two fully connected layers are adjusted online, so that in the unlabeled case the feature extraction network and the fully connected layers can still produce the predicted Gaussian response map and yield the predicted target position center.
(10) A position refinement unit. It refines the initial tracking result produced by the online learning unit: several jittered boxes are generated around the predicted target center from the current frame's online learning unit, with the target width and height taken from the previous frame, and mapped onto the current frame's search area; features are extracted for the jittered boxes with the precise region-of-interest pooling layer and concatenated with the modulation features obtained from the reference frame; the confidence of each jittered box's predicted position is obtained through a fully connected layer, and the results of the 3 most confident boxes are merged into the refined tracking result of the current frame.
(11) A decision control unit. During tracking, the target tracking state is judged from the relation between the tracking box's predicted-position confidence and a set threshold: if the target is tracked stably, tracking continues with the next frame; if the target is lost, the detection correction unit is activated to detect the target in the current frame and the reference frame is updated, achieving long-term stable tracking of low-flying slow small targets.
(12) A tracking result output unit. After all frames of the video are traversed, the position and scale information of each frame is output.
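As promised in (7) above, the cropping and scaling performed by the reference frame initialization unit can be sketched as follows. This is a minimal sketch: the helper name, integer rounding, and the simple boundary clamping are assumptions, not details fixed by the patent.

```python
import cv2

def init_reference_patch(frame, cx, cy, w, h, scale=5, out_size=288):
    """Crop a search area 5x the target size centred on (cx, cy), then
    scale it to the specified 288 x 288 input size."""
    sw, sh = int(scale * w), int(scale * h)
    x0 = max(int(cx - sw / 2), 0)       # clamp the crop to the image
    y0 = max(int(cy - sh / 2), 0)
    patch = frame[y0:y0 + sh, x0:x0 + sw]
    return cv2.resize(patch, (out_size, out_size))
```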
A small-sized low-flying target visual detection tracking method comprises the following steps:
step 1, constructing a target detection network;
the detection network structure for the low-slow small target mainly comprises two parts, namely a neural network structure with a multi-stage pyramid structure and a loss function for relieving extreme unbalance of a target background in a training process:
1) network architecture
Backbone network:
The backbone network for target detection uses feature pyramid levels P3 through P7. P3 and P4 are each computed from two parts: one comes from the output of the corresponding feature extraction network ResNet stage (C3 and C4) through lateral connections; the other uses a top-down pathway to upsample the deep, small-size feature map to the same size as the shallow one, followed by element-wise addition and a convolution so that the number of output channels remains 256; the channel count stays fixed while the information content grows. P5 is computed from the output of the corresponding ResNet stage (C5) through a lateral connection. P6 and P7 are obtained by applying a convolution layer and an activation layer on top of C5. Pl has a resolution 2^l times lower than the input image (l denotes the pyramid level), and all pyramid levels have C = 256 channels.
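A minimal PyTorch sketch of this pyramid construction, assuming ResNet18 stage widths (128, 256, 512 channels for C3 to C5); the class and layer names are illustrative, not the patent's implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidNeck(nn.Module):
    """Builds P3-P7 from ResNet stage outputs C3, C4, C5, all with 256 channels."""
    def __init__(self, c3_ch=128, c4_ch=256, c5_ch=512, out_ch=256):
        super().__init__()
        self.lat3 = nn.Conv2d(c3_ch, out_ch, 1)   # lateral connections
        self.lat4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat5 = nn.Conv2d(c5_ch, out_ch, 1)
        self.out3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.out4 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.p6 = nn.Conv2d(c5_ch, out_ch, 3, stride=2, padding=1)  # conv on C5
        self.p7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        p5 = self.lat5(c5)
        # top-down: upsample the deeper map and add the lateral connection
        p4 = self.out4(self.lat4(c4) + F.interpolate(p5, scale_factor=2))
        p3 = self.out3(self.lat3(c3) + F.interpolate(p4, scale_factor=2))
        p6 = self.p6(c5)
        p7 = self.p7(F.relu(p6))                  # activation layer before P7
        return p3, p4, p5, p6, p7
```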
Classification subnet:
The target classification subnet predicts, at each spatial position, the probability of target presence for each of the A = 9 anchors and K target classes. This subnet is a small fully convolutional network attached to each backbone pyramid level, and all pyramid levels share its parameters. Given a C-channel input feature map from a pyramid level, the subnet applies four 3 × 3 convolution layers, each with C = 256 filters and ReLU activations, then a 3 × 3 convolution layer with K·A filters; finally a sigmoid activation is attached at each spatial position, outputting K·A binary predictions.
Regression subnet:
The bounding box regression subnet runs in parallel with the target classification subnet: another small fully convolutional network is attached after each pyramid level to regress the offset from each anchor box to a nearby true target, if one exists. The regression subnet has the same design as the classification subnet, except that it has 4·A linear outputs at each spatial position: for the A anchor boxes centered at each position, the 4 outputs predict the relative offsets between the upper-left and lower-right corner coordinates of the anchor box and the corresponding truth-box positions. Although the target classification subnet and the box regression subnet share a common structure, they use separate parameters. Fig. 2 shows the main structure of the target detection network.
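The two subnets can be sketched as follows, a minimal sketch under the parameters stated above (four 3 × 3 convolutions with 256 filters and ReLU, then K·A or 4·A outputs); K = 1 below is an assumption for a single "low-flying slow small target" class:

```python
import torch.nn as nn

def make_subnet(in_ch, out_ch, mid_ch=256):
    """Four 3x3 conv + ReLU layers followed by a 3x3 prediction layer."""
    layers, ch = [], in_ch
    for _ in range(4):
        layers += [nn.Conv2d(ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
        ch = mid_ch
    layers.append(nn.Conv2d(mid_ch, out_ch, 3, padding=1))
    return nn.Sequential(*layers)

K, A = 1, 9                           # K target classes, A = 9 anchors per position
cls_subnet = make_subnet(256, K * A)  # sigmoid applied per spatial position
reg_subnet = make_subnet(256, 4 * A)  # 4 corner offsets per anchor, linear outputs
```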
2) Loss function
Focal Loss (FL) is used:

FL(p_t) = -α_t (1 - p_t)^γ log(p_t)   (5)

p_t = p if y = 1, and p_t = 1 - p otherwise   (6)

It addresses the severely imbalanced proportion of positive and negative samples in target detection and reduces the weight of easy negative samples during training. In formula (5), FL(·) denotes the focal loss; p_t is the value of the discrimination function of the target classification probability; α is a balance factor compensating for the unequal proportion of positive and negative samples; γ is a balance factor for easy and hard samples: taking γ greater than 0 reduces the loss on easily classified samples, making the model focus on learning hard and misclassified samples. Formula (6) is the expression of the probability discrimination function: y is the prediction output label after the activation function, with value between 0 and 1; p is the probability that the target belongs to the labeled category.
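Formulas (5) and (6) translate directly into code. A minimal sketch follows; the defaults α = 0.25 and γ = 2 are common choices in the focal-loss literature, not values fixed by the patent:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with p_t from formula (6).
    logits: raw scores, shape (N,); targets: 0/1 ground-truth labels, shape (N,)."""
    targets = targets.float()
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # formula (6)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (alpha_t * (1 - p_t).pow(gamma) * ce).mean()    # ce equals -log(p_t)
```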
Step 2, comparing and screening targets;
To prevent tracking failure caused by no target being present in the image, or by a false alarm being detected as the target so that the reference frame is updated with background information, one round of SURF feature point matching is performed before the result is sent to the detection correction unit: the first-frame target of the current video is matched against the detection result. When the number of matching points exceeds a set value, the detection result is indeed the target currently being tracked and detection succeeds; the detection box is then sent to the detection correction unit for the subsequent process.
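A minimal OpenCV sketch of this screening step (SURF is provided by opencv-contrib-python; the Hessian threshold and the Lowe ratio below are assumptions, as is the use of a ratio test at all; the patent only specifies counting matching points):

```python
import cv2

def surf_match_count(template_bgr, candidate_bgr, ratio=0.7):
    """Count good SURF matches between the first-frame target patch and a
    detection patch; the detection is accepted if the count exceeds a set value."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    _, des1 = surf.detectAndCompute(cv2.cvtColor(template_bgr, cv2.COLOR_BGR2GRAY), None)
    _, des2 = surf.detectAndCompute(cv2.cvtColor(candidate_bgr, cv2.COLOR_BGR2GRAY), None)
    if des1 is None or des2 is None:
        return 0
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    # Lowe's ratio test keeps only distinctive correspondences
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    return len(good)
```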
Step 3, target tracking online learning;
as shown in fig. 3, the on-line learning of target tracking realizes the function of predicting the target center position in real time during the tracking process. The method mainly comprises two parts of an initialization classifier and an online classification process:
1) initialization classifier
For the data-enhanced reference frame in the sample library, features are extracted with the feature extraction network, and a two-dimensional Gaussian truth label y_gt of the same size as the feature map is generated with the reference-frame target center position as its peak:

y_gt(x, y) = exp(-((x - x0)^2 + (y - y0)^2) / (2σ^2))   (7)

where (x0, y0) is the reference-frame target center and σ the Gaussian bandwidth, as in formula (1).
The classifier is initialized from the features and the labels, and the distance between actual and true values is minimized with a least-squares optimization algorithm; the nonlinear least-squares problem is solved by the Gauss-Newton iteration method. Its basic idea is to approximate the nonlinear regression model with a Taylor-series expansion and then, through repeated iterations that correct the regression coefficients, drive them toward the optimal coefficients of the model, finally minimizing the residual sum of squares of the original model.
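Generating the label of formula (7) is straightforward. A minimal sketch; the 1-based indexing follows the coordinate ranges given for formula (1):

```python
import numpy as np

def gaussian_label(M, N, cx, cy, sigma):
    """2-D Gaussian truth label with peak 1.0 at the target centre (cx, cy);
    M, N are the feature map width and height."""
    xs = np.arange(1, M + 1)
    ys = np.arange(1, N + 1)
    X, Y = np.meshgrid(xs, ys)   # shape (N, M): rows = vertical, cols = horizontal
    return np.exp(-((X - cx) ** 2 + (Y - cy) ** 2) / (2.0 * sigma ** 2))
```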
2) Online classification process
As shown in Fig. 3, based on the tracking result (x_{t-1}, y_{t-1}, w_{t-1}, h_{t-1}) of the previous frame, where (x_{t-1}, y_{t-1}) is the estimated target center and (w_{t-1}, h_{t-1}) the estimated target width and height, the width and height are expanded by a specified ratio k around the previous frame's target center, generating the current frame's search region (x_{t-1}, y_{t-1}, k·w_{t-1}, k·h_{t-1}). The feature extraction network then extracts the search-region features f_t, and after two fully connected layers a predicted Gaussian response map of the same size as the search region is generated:

ŷ_t = φ2(φ1(f_t; weight1); weight2)

The position of maximum response is the target center coordinate (x_t, y_t) estimated for the current frame. The online-trained classifier fully considers both the tracked target and the background region, and is continuously updated to estimate the target position.
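One online-classification step can be sketched as follows. Here extract_features and classifier stand in for the ResNet18 backbone and the two online-trained fully connected layers; the names, the boundary handling, and the coordinate mapping back to the image are assumptions:

```python
import numpy as np

def track_step(frame, prev_box, extract_features, classifier, k=5.0):
    """Crop the search region around the previous centre, predict the Gaussian
    response map, and return the argmax as the new target centre.
    Boundary clipping is simplified for brevity."""
    cx, cy, w, h = prev_box
    sw, sh = k * w, k * h                              # expand by ratio k
    x0, y0 = int(cx - sw / 2), int(cy - sh / 2)
    region = frame[max(y0, 0):y0 + int(sh), max(x0, 0):x0 + int(sw)]
    response = classifier(extract_features(region))    # predicted Gaussian map
    ry, rx = np.unravel_index(int(np.argmax(response)), response.shape)
    # map the response-map peak back to image coordinates
    return (x0 + rx * sw / response.shape[1], y0 + ry * sh / response.shape[0])
```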
Step 4, dynamically constructing a classifier training sample library
Since the motion state of a low-flying slow small target is complex and its size and appearance change greatly during tracking, using only the first frame of the video as the supervision frame clearly cannot adapt to the target's dynamic changes in real time. The following two steps further alleviate this problem:
Firstly, a reference frame update interval T is set; whenever the current frame number t is divisible by T, the target detection unit is called to update the reference frame, all outdated samples in the sample library are cleared, and the classifier is re-initialized with the updated reference frame. Meanwhile, newly generated samples are added to the sample library in sequence as tracking proceeds, so that the target features in the sample library stay highly similar to the currently tracked sample, which helps estimate the target center position accurately.
And secondly, when the confidence coefficient of the predicted position of the tracking frame is smaller than a set threshold, the current frame target tracking fails, the decision control unit sends information of the reference frame needing to be reinitialized to the detection and correction unit, and after the detection and correction unit receives the information of the detection frame of the target detection unit, the information is sent to the reference frame initialization unit, data enhancement and other operations are carried out, and finally the information is sent to a dynamically constructed sample library.
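The two update rules combine naturally into one routine. A minimal sketch; the class name, callback, buffer size, and default interval are illustrative assumptions:

```python
from collections import deque

class SampleLibrary:
    """Dynamic sample library: refresh the reference frame every T frames or on
    tracking failure, clear outdated samples, and keep adding new ones."""
    def __init__(self, T=50, max_samples=50):
        self.T = T
        self.samples = deque(maxlen=max_samples)   # oldest samples fall off

    def update(self, t, new_sample, confidence, threshold, detect_reference):
        if t % self.T == 0 or confidence < threshold:
            reference = detect_reference()         # detection unit supplies a box
            self.samples.clear()                   # remove all outdated samples
            self.samples.append(reference)         # classifier re-initialised here
        self.samples.append(new_sample)
```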
Step 5, fine trimming of target tracking position
As shown in Fig. 4, the main function of target tracking position refinement is to finally determine the position and scale of the current frame's tracking box from the target center position predicted by the classifier and the target width and height of the previous frame; a sketch of the jittered-box merging follows. The refinement mainly comprises a feature extraction network and a similarity evaluation network:
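The jittered-box scheme of the position refinement unit can be sketched as follows. The jitter magnitudes and the number of boxes are assumptions, and score_fn stands in for the PrPool-plus-fully-connected confidence network described below:

```python
import numpy as np

def refine_position(cx, cy, w, h, score_fn, n_boxes=10, top_k=3, seed=0):
    """Jitter boxes around the predicted centre with the previous frame's
    width/height, score each box, and merge the top-3 most confident boxes."""
    rng = np.random.default_rng(seed)
    boxes = np.array([
        [cx + rng.normal(0, 0.1 * w), cy + rng.normal(0, 0.1 * h),
         w * rng.uniform(0.9, 1.1), h * rng.uniform(0.9, 1.1)]
        for _ in range(n_boxes)
    ])
    scores = np.array([score_fn(b) for b in boxes])
    top = boxes[np.argsort(scores)[-top_k:]]       # 3 highest-confidence boxes
    return top.mean(axis=0)                        # merged (cx, cy, w, h) result
```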
1) feature extraction network
The ResNet-18 network is still used for feature extraction. To make full use of historical information, previously kept template information is balanced against the current reference-frame update information, providing the neural network with features that combine the target's current and historical states and improving tracking stability. Search-area features are extracted for three image frames (the reference frame, the current frame, and the frame at the intermediate time between them), and each is fed into the precise region-of-interest pooling layer for the similarity evaluation network to compute the predicted-position confidence.
2) Similarity evaluation network
The core of the similarity evaluation network is the precise region-of-interest pooling layer, whose input comprises two parts. The first is the image feature map extracted by the network, where (i, j) is a coordinate on the feature map and w_{i,j} the weight at position (i, j); bilinear interpolation with interpolation coefficient IC

IC(x, y, i, j) = max(0, 1-|x-i|) × max(0, 1-|y-j|)   (8)

maps the discrete feature map into a continuous space:

f(x, y) = Σ_{i,j} IC(x, y, i, j) × w_{i,j}   (9)

The second is the upper-left and lower-right corner coordinates of the rectangular box, (x1, y1) and (x2, y2). The precise region-of-interest pooling operation is performed from the obtained continuous spatial feature map and the rectangular-box coordinates:

PrPool = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 - x1) × (y2 - y1))   (10)

preserving the target features on the image to the greatest extent and preparing for the further comparison of the similarity between the reference target and historical-frame targets. After the features from the precise region-of-interest pooling layer are obtained, the three features of the reference frame, the intermediate frame and the current frame are concatenated and fed into the fully connected layer, which outputs the final position confidence. The similarity between each candidate target and the historical targets is compared, and the most similar target is taken as the tracking result.
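Formulas (8) to (10) can be checked numerically with a brute-force sketch: a grid-sampling approximation of the double integral, not the closed-form implementation used in practice:

```python
import numpy as np

def prroi_pool(feature, x1, y1, x2, y2, grid=32):
    """Approximate PrROI pooling: bilinearly interpolate f(x, y) on a dense grid
    inside the box and average, i.e. the double integral divided by the box area."""
    H, W = feature.shape
    total = 0.0
    for y in np.linspace(y1, y2, grid):
        for x in np.linspace(x1, x2, grid):
            i0, j0 = int(np.floor(x)), int(np.floor(y))
            for i in (i0, i0 + 1):                 # the four neighbouring cells
                for j in (j0, j0 + 1):
                    if 0 <= i < W and 0 <= j < H:
                        # IC(x, y, i, j) = max(0, 1-|x-i|) * max(0, 1-|y-j|)
                        total += max(0.0, 1 - abs(x - i)) * max(0.0, 1 - abs(y - j)) * feature[j, i]
    return total / (grid * grid)                   # mean over the box = integral / area
```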
It will be appreciated by those of ordinary skill in the art that the examples described herein are intended to assist the reader in understanding the manner in which the invention is practiced, and it is to be understood that the scope of the invention is not limited to such specifically recited statements and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (2)

1. A small low-flying target detection and tracking system, comprising: the system comprises a video data input unit, a video preprocessing unit, a training data construction unit, a detection model training unit, a target comparison screening unit, a detection correction unit, a reference frame initialization unit, a sample library dynamic construction unit, an online learning unit, a position fine modification unit, a decision control unit and a tracking result output unit;
the video data input unit is configured to: inputting a plurality of video sequences containing targets, and randomly dividing the video sequences into two parts, wherein one part is used for training a target detection model, and the other part is used for online testing of a target tracking model;
the video pre-processing unit is configured to: complete the preliminary video preprocessing required by the target detection and tracking units, specifically deleting video segments that contain no target for a long time and removing video segments that obviously do not match the characteristics of small, slow targets flying in low-altitude airspace;
the training data construction unit is configured to: ensuring the completeness and richness of training data, constructing a training set and a verification set by extracting video frames at equal intervals, and carrying out data labeling, namely determining the central position, width and height information of a target in an image, wherein the information is used for a supervised training target detection model;
the detection model training unit is used for: creating a pyramid-structured target detection network and using focal loss to alleviate the target-background imbalance; stopping training once the training loss function is observed to stabilize, and saving the model file with the best validation performance to provide reset-box information when target tracking fails;
the target alignment screening unit is used for: comparing the first frame true value frame of the tracked video with the detection result by using a SURF (speeded up robust features) feature matching algorithm, eliminating false alarms which are obviously not low-slow small targets, and further ensuring the robustness of long-term stable tracking;
the detection correction unit is used for: starting when either of two conditions occurs: first, when the position confidence of the tracking box falls below a set threshold, indicating that target tracking of the current frame has failed; second, automatically when a specified frame interval is reached, ensuring that the currently tracked target does not differ too much from the reference-frame target features;
the reference frame initialization unit is configured to: cutting out a search area according to the received target position and scale information of the reference frame and according to the size 5 times of the target size, zooming the image to the specified size 288 x 288, and inputting the image block of the search area with the specified size and the scale information of the target position into a dynamic construction sample library of the classifier;
the sample library dynamic construction unit is used for: performing data enhancement on the reference frame sample, including basic operations of rotation, scaling, dithering and blurring, and receiving a newly added sample in the tracking process;
the online learning unit extracts features from the samples stored in the sample library using a deep ResNet18 network and obtains a predicted Gaussian response map after two fully connected layers; from the label information in the sample library, a true Gaussian label is generated with the target center as the peak of the Gaussian distribution; taking the reduction of the difference between the predicted response map and the true label as the optimization target, the parameters of the two fully connected layers are adjusted online, so that in the unlabeled case the feature extraction network and the fully connected layers can still produce the predicted Gaussian response map and yield the predicted target position center;
the position refinement unit is configured to: performing position refinement on an initial tracking result obtained by an online learning unit, obtaining a plurality of shaking frames by taking the center of a predicted target position obtained by the online learning unit of a current frame and the width and height of a target scale obtained by a previous frame as references, mapping the shaking frames onto a search area of the current frame, performing feature extraction on the shaking frames by using an accurate region-of-interest pooling layer, splicing modulation features obtained by a reference frame, and finally obtaining the confidence coefficient of the predicted position of each shaking frame through a full-connection layer, and merging the results of 3 shaking frames with the highest confidence coefficients to obtain the tracking result of the current frame after the refinement;
the decision control unit is configured to: judging a target tracking state through the relation between the confidence coefficient of the predicted position of the tracking frame and a set threshold value in the tracking process, if the target is stably tracked, continuing to track the next frame, and if the target is lost, activating a detection and correction unit to detect the target of the current frame and updating a reference frame so as to realize long-term stable tracking of low and slow small targets;
the tracking result output unit is used for: and after traversing all the frames of the video, outputting the position and scale information of each frame.
2. The detection and tracking method of the small-sized low-flying target detection and tracking system according to claim 1, characterized by comprising the following steps:
step 1, constructing a target detection network;
1) constructing a network structure comprising: a backbone network, a classification subnet and a regression subnet;
2) a loss function: focal loss is used to address the severe imbalance between positive and negative samples in target detection and to reduce the weight of easy negative samples during training;
step 2, comparing and screening targets;
before the result is sent to the detection correction unit, one round of SURF feature point matching is performed: the first-frame target of the current video is matched against the detection result; when the number of matching points exceeds a set value, the detection result is indeed the target needing to be tracked at present and detection succeeds, and the detection box is sent to the detection correction unit for the subsequent process;
step 3, target tracking online learning;
predicting the target center position, comprising two parts of an initialization classifier and an online classification process:
1) initialization classifier
for the data-enhanced reference frame in the sample library, extracting features with the feature extraction network, and generating a two-dimensional Gaussian truth label y_gt of the same size as the feature map with the reference-frame target center position as its peak; initializing the classifier from the features and the label, minimizing the distance between actual and true values with a least-squares optimization algorithm, and solving the nonlinear least-squares problem by the Gauss-Newton iteration method; the formula is as follows:

y_gt(x, y) = exp(-((x - x0)^2 + (y - y0)^2) / (2σ^2))   (1)

in formula (1), x ∈ {1, …, M} is the horizontal coordinate on the feature map, with M the feature map width; y ∈ {1, …, N} is the vertical coordinate, with N the feature map height; (x0, y0) is the target center point of the reference frame; σ is the Gaussian bandwidth;
2) online classification process
based on the tracking result (x_{t-1}, y_{t-1}, w_{t-1}, h_{t-1}) of the previous frame (frame t-1), where (x_{t-1}, y_{t-1}) is the estimated target center and (w_{t-1}, h_{t-1}) the estimated target width and height, taking the previous frame's target center as the center of the current frame's (frame t) search region and expanding its width and height by a specified ratio k, generating the current-frame search region (x_{t-1}, y_{t-1}, k·w_{t-1}, k·h_{t-1}); then extracting the search-region features f_t with the feature extraction network, and after two fully connected layers generating a predicted Gaussian response map of the same size as the search region:

ŷ_t = φ2(φ1(f_t; weight1); weight2)

where φ1 and φ2 denote the mapping functions of the two fully connected layers, and weight1, weight2 their weight coefficient matrices; the position of maximum response is the target center coordinate (x_t, y_t) estimated for the current frame; the online-trained classifier fully considers both the tracked target and the background region, and the classifier is continuously updated to estimate the target position;
step 4, dynamically constructing a classifier training sample library
1) setting a reference frame update interval T; whenever the current frame number t is divisible by T, calling the target detection unit to update the reference frame, clearing all outdated samples in the sample library, and re-initializing the classifier with the updated reference frame; meanwhile, newly generated samples are added to the sample library in sequence as tracking proceeds, so that the target features in the sample library stay highly similar to the currently tracked sample and the target center position can be estimated accurately;
2) when the confidence coefficient of the predicted position of the tracking frame is smaller than a set threshold, the current frame target tracking fails, the decision control unit sends information of a reference frame needing to be reinitialized to the detection and correction unit, and the detection and correction unit sends the information to the reference frame initialization unit after receiving the information of the detection frame of the target detection unit, performs data enhancement operation and finally sends the information to a dynamically constructed sample library;
step 5, fine trimming of the target tracking position;
the method comprises two parts of a feature extraction network and a similarity evaluation network:
1) feature extraction network
feature extraction uses a ResNet18 network; previously kept template information is balanced against the current reference-frame update information, providing the neural network with features that combine the target's current and historical states and improving tracking stability; search-area features are extracted for three image frames (the reference frame, the current frame, and the frame at the intermediate time between them), and each is fed into the precise region-of-interest pooling layer for the similarity evaluation network to compute the predicted-position confidence;
2) similarity evaluation network
the core of the similarity evaluation network is the precise region-of-interest pooling layer, whose input comprises two parts: the first part is the image feature map extracted by the network, bilinearly interpolated with interpolation coefficient IC

IC(x, y, i, j) = max(0, 1-|x-i|) × max(0, 1-|y-j|)   (2)

mapping the discrete feature map into a continuous space and obtaining the feature map f(x, y)

f(x, y) = Σ_{i,j} IC(x, y, i, j) × w_{i,j}   (3)

in formulas (2) and (3), (x, y) is a continuous coordinate on the feature map, (i, j) is the coordinate index on the feature map, and w_{i,j} is the weight corresponding to position (i, j) on the feature map; the second part of the input is the upper-left corner coordinates (x1, y1) and the lower-right corner coordinates (x2, y2) of the rectangular box; the precise region-of-interest pooling operation is performed from the obtained continuous spatial feature map and the rectangular-box coordinates, preserving the target features on the image to the greatest extent in preparation for further comparing the similarity of the reference target and historical-frame targets; finally, the feature map f(x, y) is doubly integrated over the box and divided by the area of the rectangular box to obtain the precise region-of-interest pooling (PrROI Pooling) output

PrPool = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 - x1) × (y2 - y1))   (4)
after the features from the precise region-of-interest pooling layer are obtained, the three features of the reference frame, the intermediate frame and the current frame are concatenated and fed into the fully connected layer, which outputs the final position confidence; the similarity between each candidate target and the historical targets is compared, and the most similar target is taken as the tracking result.
CN202010309617.7A 2020-04-20 2020-04-20 Small-sized low-flying target visual detection tracking system and method thereof Active CN111508002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010309617.7A CN111508002B (en) 2020-04-20 2020-04-20 Small-sized low-flying target visual detection tracking system and method thereof


Publications (2)

Publication Number Publication Date
CN111508002A CN111508002A (en) 2020-08-07
CN111508002B true CN111508002B (en) 2020-12-25

Family

ID=71869437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010309617.7A Active CN111508002B (en) 2020-04-20 2020-04-20 Small-sized low-flying target visual detection tracking system and method thereof

Country Status (1)

Country Link
CN (1) CN111508002B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489081B (en) * 2020-11-30 2022-11-08 北京航空航天大学 Visual target tracking method and device
CN112633162B (en) * 2020-12-22 2024-03-22 重庆大学 Pedestrian rapid detection and tracking method suitable for expressway external field shielding condition
CN115335872A (en) * 2021-02-26 2022-11-11 京东方科技集团股份有限公司 Training method of target detection network, target detection method and device
CN112949480A (en) * 2021-03-01 2021-06-11 浙江大学 Rail elastic strip detection method based on YOLOV3 algorithm
CN113012203B (en) * 2021-04-15 2023-10-20 南京莱斯电子设备有限公司 High-precision multi-target tracking method under complex background
CN113449680B (en) * 2021-07-15 2022-08-30 北京理工大学 Knowledge distillation-based multimode small target detection method
CN113724290B (en) * 2021-07-22 2024-03-05 西北工业大学 Multi-level template self-adaptive matching target tracking method for infrared image
CN113658222A (en) * 2021-08-02 2021-11-16 上海影谱科技有限公司 Vehicle detection tracking method and device
CN114066936B (en) * 2021-11-06 2023-09-12 中国电子科技集团公司第五十四研究所 Target reliability tracking method in small target capturing process
CN114241008B (en) * 2021-12-21 2023-03-07 北京航空航天大学 Long-time region tracking method adaptive to scene and target change
CN116596958B (en) * 2023-07-18 2023-10-10 四川迪晟新达类脑智能技术有限公司 Target tracking method and device based on online sample augmentation
CN117292306A (en) * 2023-11-27 2023-12-26 四川迪晟新达类脑智能技术有限公司 Edge equipment-oriented vehicle target detection optimization method and device
CN117576164B (en) * 2023-12-14 2024-05-03 中国人民解放军海军航空大学 Remote sensing video sea-land movement target tracking method based on feature joint learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108230367A (en) * 2017-12-21 2018-06-29 西安电子科技大学 A kind of quick method for tracking and positioning to set objective in greyscale video
CN109584269A (en) * 2018-10-17 2019-04-05 龙马智芯(珠海横琴)科技有限公司 A kind of method for tracking target

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7379652B2 (en) * 2005-01-14 2008-05-27 Montana State University Method and apparatus for detecting optical spectral properties using optical probe beams with multiple sidebands
JP6042146B2 (en) * 2012-09-18 2016-12-14 株式会社東芝 Object detection apparatus and object detection method
CN107862705B (en) * 2017-11-21 2021-03-30 重庆邮电大学 Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics
CN108154118B (en) * 2017-12-25 2018-12-18 北京航空航天大学 A kind of target detection system and method based on adaptive combined filter and multistage detection
CN110363789B (en) * 2019-06-25 2022-03-25 电子科技大学 Long-term visual tracking method for practical engineering application
CN110533691B (en) * 2019-08-15 2021-10-22 合肥工业大学 Target tracking method, device and storage medium based on multiple classifiers
CN110717934B (en) * 2019-10-17 2023-04-28 湖南大学 Anti-occlusion target tracking method based on STRCF

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108230367A (en) * 2017-12-21 2018-06-29 西安电子科技大学 A kind of quick method for tracking and positioning to set objective in greyscale video
CN109584269A (en) * 2018-10-17 2019-04-05 龙马智芯(珠海横琴)科技有限公司 A kind of method for tracking target

Also Published As

Publication number Publication date
CN111508002A (en) 2020-08-07

Similar Documents

Publication Publication Date Title
CN111508002B (en) Small-sized low-flying target visual detection tracking system and method thereof
CN109614985B (en) Target detection method based on densely connected feature pyramid network
CN107563313B (en) Multi-target pedestrian detection and tracking method based on deep learning
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
Zhou et al. Efficient road detection and tracking for unmanned aerial vehicle
CN110287826B (en) Video target detection method based on attention mechanism
US20160154999A1 (en) Objection recognition in a 3d scene
CN107633226B (en) Human body motion tracking feature processing method
CN111882586B (en) Multi-actor target tracking method oriented to theater environment
CN108446634B (en) Aircraft continuous tracking method based on combination of video analysis and positioning information
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN113483747B (en) Improved AMCL (advanced metering library) positioning method based on semantic map with corner information and robot
CN112836639A (en) Pedestrian multi-target tracking video identification method based on improved YOLOv3 model
JP2006209755A (en) Method for tracing moving object inside frame sequence acquired from scene
CN111160407A (en) Deep learning target detection method and system
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
CN110555868A (en) method for detecting small moving target under complex ground background
CN111126278A (en) Target detection model optimization and acceleration method for few-category scene
CN111354022A (en) Target tracking method and system based on kernel correlation filtering
CN112164093A (en) Automatic person tracking method based on edge features and related filtering
CN113052136B (en) Pedestrian detection method based on improved Faster RCNN
CN113379789A (en) Moving target tracking method in complex environment
CN112347967A (en) Pedestrian detection method fusing motion information in complex scene
CN111275733A (en) Method for realizing rapid tracking processing of multiple ships based on deep learning target detection technology
CN113628242A (en) Satellite video target tracking method and system based on background subtraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant