CN111508002B - Small-sized low-flying target visual detection tracking system and method thereof - Google Patents


Info

Publication number
CN111508002B
Authority
CN
China
Prior art keywords
target
frame
tracking
unit
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010309617.7A
Other languages
Chinese (zh)
Other versions
CN111508002A (en)
Inventor
陶然
李伟
黄展超
马鹏阁
揭斐然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Luoyang Institute of Electro Optical Equipment AVIC
Zhengzhou University of Aeronautics
Original Assignee
Beijing Institute of Technology BIT
Luoyang Institute of Electro Optical Equipment AVIC
Zhengzhou University of Aeronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT, Luoyang Institute of Electro Optical Equipment AVIC, Zhengzhou University of Aeronautics filed Critical Beijing Institute of Technology BIT
Priority to CN202010309617.7A
Publication of CN111508002A
Application granted
Publication of CN111508002B

Classifications

    • G06T 7/246 (Physics; Computing; Image data processing or generation; Image analysis; Analysis of motion): analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06F 18/214 (Physics; Computing; Electric digital data processing; Pattern recognition; Analysing; Design or setup of recognition systems or techniques): generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 (Physics; Computing; Computing arrangements based on specific computational models; Neural networks; Architecture): combinations of networks


Abstract

The invention discloses a small low-flying target visual detection and tracking system and method. The system comprises: a video data input unit, a video preprocessing unit, a training data construction unit, a detection model training unit, a target comparison screening unit, a detection correction unit, a reference frame initialization unit, a sample library dynamic construction unit, an online learning unit, a position refinement unit, a decision control unit and a tracking result output unit. The method comprises the following steps: constructing a target detection network and comparing and screening targets; online learning for target tracking; dynamically constructing a classifier training sample library and refining the target tracking position. The invention effectively mitigates tracking drift caused by occlusion, scale change, illumination and similar factors, achieving robust target tracking. It can update the reference-frame features promptly as the target changes, while the introduced feature point matching algorithm avoids erroneous tracking caused by those updates.

Description

Small-sized low-flying target visual detection tracking system and method thereof
Technical Field
The invention relates to the technical field of flying-target tracking, and in particular to a small low-flying target detection and tracking system based on a neural network and online learning, and to its detection and tracking method.
Background
At present there exist methods for joint detection and tracking, and tracking methods suited to slow small targets. Existing methods achieve short-term target tracking through correlation filtering and use neural-network-based target detection to re-locate the target when tracking fails. The related patents and research are as follows:
Chinese invention patent "A robust long-term tracking method based on correlation filtering and target detection", application number CN201910306616.4. This method achieves target tracking through correlation filtering and performs detection with the one-stage detector YOLO; after a detection result is obtained, SURF feature point matching selects the candidate box with the most matching points as the bounding box used to reinitialize the tracker, achieving long-term tracking. However, the method does not consider the extreme imbalance between target and background in the detector, and cannot be applied to long-term tracking of small targets.
Chinese invention patent "A low-altitude slow unmanned aerial vehicle tracking method combining correlation filtering and visual saliency", application number 201910117155.6. The center position of the predicted target is determined from a response map obtained by correlation filtering over a small search area, and the target scale is determined by saliency detection over a large search area, yielding a tracking method suited to low-altitude slow unmanned aerial vehicles. However, the method performs no further processing after tracking fails, and its precision needs further improvement.
In summary, the prior art neither solves the extreme target-background imbalance in single-target detection and tracking nor achieves optimal network performance, and the tracking precision for small targets remains to be improved.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a small low-flying target visual detection and tracking system and method that remedy those defects.
To achieve this purpose, the invention adopts the following technical scheme:
A small low-flying target detection and tracking system comprises: a video data input unit, a video preprocessing unit, a training data construction unit, a detection model training unit, a target comparison screening unit, a detection correction unit, a reference frame initialization unit, a sample library dynamic construction unit, an online learning unit, a position refinement unit, a decision control unit and a tracking result output unit.
The video data input unit is configured to: a plurality of video sequences containing targets are input and randomly divided into two parts, wherein one part is used for training a target detection model, and the other part is used for online testing of a target tracking model.
The video pre-processing unit is configured to: complete the preliminary video preprocessing required by the target detection and tracking units, specifically deleting video segments that contain no target for a long time and removing video segments that obviously do not match the characteristics of small, slow targets flying in low-altitude airspace.
The training data construction unit is configured to: the completeness and richness of training data are guaranteed, a training set and a verification set are constructed in a mode of extracting video frames at equal intervals, and data labeling is carried out, namely the information of the center position, the width and the height of a target in an image is determined and is used for a supervised training target detection model.
The detection model training unit is used for: creating a pyramid-structured target detection network and using focal loss to alleviate the imbalance between target and background. Training stops once the training loss function is observed to stabilize, and the model file with the best validation performance is saved to provide reset-box information when target tracking fails.
The target alignment screening unit is used for: and comparing the first frame true value frame of the tracked video with the detection result by using a SURF (speeded up robust features) feature matching algorithm, and eliminating false alarms which are obviously not low-slow small targets, thereby further ensuring the robustness of long-term stable tracking.
The detection correction unit is used for: starting when either of two conditions occurs: first, when the position confidence of the tracking box falls below a set threshold, indicating that target tracking of the current frame has failed; second, automatically when a specified frame interval is reached, ensuring that the currently tracked target does not differ too much from the reference-frame target features.
The reference frame initialization unit is configured to: according to the received reference-frame target position and scale information, crop a search area at 5 times the target size, scale the image to the specified size of 288 x 288, and input the specified-size search-area image block together with the target position and scale information into the dynamically constructed classifier sample library.
The sample library dynamic construction unit is used for: and performing data enhancement on the reference frame sample, including basic operations of rotation, scaling, dithering and blurring, and receiving a sample newly added in the tracking process.
The online learning unit extracts features from the samples stored in the sample library using a deep ResNet18 network and obtains a predicted Gaussian response map after two fully connected layers. From the label information in the sample library, a true Gaussian label is generated with the target center as the peak of the Gaussian distribution. Taking the reduction of the difference between the predicted response map and the true label as the optimization target, the parameters of the two fully connected layers are adjusted online, so that in the unlabeled case the feature extraction network and the fully connected layers can still produce the predicted Gaussian response map and yield the predicted target position center.
The position refinement unit is configured to: refine the initial tracking result produced by the online learning unit. Several jittered boxes are generated around the predicted target center from the current frame's online learning unit, with the target width and height taken from the previous frame, and mapped onto the current frame's search area. Features are extracted for the jittered boxes with the precise region-of-interest pooling layer and concatenated with the modulation features obtained from the reference frame; the confidence of each jittered box's predicted position is then obtained through a fully connected layer, and the results of the 3 most confident boxes are merged into the refined tracking result of the current frame.
The decision control unit is configured to: and in the tracking process, the target tracking state is judged according to the relation between the confidence coefficient of the predicted position of the tracking frame and a set threshold value, if the target is stably tracked, the tracking of the next frame is continued, and if the target is lost, the detection and correction unit is activated to detect the target of the current frame and update the reference frame so as to realize the long-term stable tracking of the low-slow small target.
The tracking result output unit is used for: and after traversing all the frames of the video, outputting the position and scale information of each frame.
The invention also discloses a small low-flying target visual detection tracking method, which comprises the following steps:
step 1, constructing a target detection network;
1) constructing a network structure comprising: a backbone network, a classification subnet and a regression subnet;
2) A loss function: focal loss is used to address the severe imbalance between positive and negative samples in target detection and to reduce the weight of easy negative samples during training.
Step 2, comparing and screening targets;
Before the result is sent to the detection correction unit, one round of SURF feature point matching is performed: the first-frame target of the current video is matched against the detection result. When the number of matching points exceeds a set value, the detection result is indeed the target currently being tracked and detection succeeds; the detection box is then sent to the detection correction unit for the subsequent process.
Step 3, target tracking online learning;
predicting the target center position, comprising two parts of an initialization classifier and an online classification process:
1) initialization classifier
For the data-enhanced reference frame in the sample library, features are extracted with the feature extraction network, and a two-dimensional Gaussian truth label ygt of the same size as the feature map is generated with the reference-frame target center as its peak; the classifier is initialized from the features and the label, the distance between actual and true values is minimized with a least-squares optimization algorithm, and the nonlinear least-squares problem is solved by the Gauss-Newton iteration method; the formula is as follows:

ygt(x, y) = exp(-((x - x0)^2 + (y - y0)^2) / (2σ^2))   (1)

In formula (1), x ∈ {1, …, M} is the horizontal coordinate on the feature map, with M the feature map width; y ∈ {1, …, N} is the vertical coordinate, with N the feature map height; (x0, y0) is the target center point of the reference frame; σ is the Gaussian bandwidth.
2) Online classification process
Based on the tracking result (x_{t-1}, y_{t-1}, w_{t-1}, h_{t-1}) of the previous frame (frame t-1), where (x_{t-1}, y_{t-1}) is the estimated target center and (w_{t-1}, h_{t-1}) the estimated target width and height, the previous frame's target center is taken as the center of the current frame's (frame t) search region, whose width and height are expanded by a specified ratio k, generating the current-frame search region (x_{t-1}, y_{t-1}, k·w_{t-1}, k·h_{t-1}). The feature extraction network then extracts the search-region features f_t, and after two fully connected layers a predicted Gaussian response map of the same size as the search region is generated:

ŷ_t = φ2(φ1(f_t; weight1); weight2)

where φ1 and φ2 denote the mapping functions of the two fully connected layers, and weight1, weight2 their weight coefficient matrices. The position of maximum response is the target center coordinate (x_t, y_t) estimated for the current frame. The online-trained classifier fully considers both the tracked target and the background region, and is continuously updated to estimate the target position.
Step 4, dynamically constructing a classifier training sample library
1) A reference frame update interval T is set; whenever the current frame number t is divisible by T, the target detection unit is called to update the reference frame, all outdated samples in the sample library are cleared, and the classifier is re-initialized with the updated reference frame. Meanwhile, newly generated samples are added to the sample library in sequence as tracking proceeds, so that the target features in the sample library stay highly similar to the currently tracked sample and the target center position can be estimated accurately.
2) When the confidence coefficient of the predicted position of the tracking frame is smaller than a set threshold, the current frame target tracking fails, the decision control unit sends information of the reference frame needing to be reinitialized to the detection and correction unit, and after the detection and correction unit receives the information of the detection frame of the target detection unit, the information is sent to the reference frame initialization unit, data enhancement and other operations are carried out, and finally the information is sent to a dynamically constructed sample library.
Step 5, fine trimming of the target tracking position;
the method comprises two parts of a feature extraction network and a similarity evaluation network:
1) feature extraction network
Feature extraction uses a ResNet18 network. It balances retention of previous template information against the updated current reference-frame information, providing the neural network with features that combine the target's current and historical states and improving tracking stability. Search-area features are extracted for the reference frame, the current frame, and the image frame at the intermediate time between them, and each is fed into the precise region-of-interest pooling layer for the similarity evaluation network to compute the predicted-position confidence.
2) Similarity evaluation network
The core of the similarity evaluation network is the precise region-of-interest pooling layer, whose input comprises two parts. The first part is the image feature map extracted by the network, bilinearly interpolated with interpolation coefficient IC

IC(x, y, i, j) = max(0, 1-|x-i|) × max(0, 1-|y-j|)   (2)

mapping the discrete feature map into a continuous space and obtaining the feature map f(x, y)

f(x, y) = Σ_{i,j} IC(x, y, i, j) × w_{i,j}   (3)

In formulas (2) and (3), (x, y) is a continuous coordinate on the feature map, (i, j) is the coordinate index on the feature map, and w_{i,j} is the weight corresponding to position (i, j). The second part of the input is the upper-left corner coordinates (x1, y1) and lower-right corner coordinates (x2, y2) of the rectangular box. The precise region-of-interest pooling operation is performed from the obtained continuous spatial feature map and the rectangular-box coordinates, preserving the target features on the image to the greatest extent in preparation for further comparing the similarity of the reference target and historical-frame targets. Finally, the feature map f(x, y) is doubly integrated over the box and divided by the area of the rectangular box, giving the precise region-of-interest pooling (PrROI Pooling) output

PrPool = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 - x1) × (y2 - y1))   (4)
After the features from the precise region-of-interest pooling layer are obtained, the three features of the reference frame, the intermediate frame and the current frame are concatenated and fed into the fully connected layer, which outputs the final position confidence. The similarity between each candidate target and the historical targets is compared, and the most similar target is taken as the tracking result.
Compared with the prior art, the invention has the advantages that:
1) By combining detection and tracking, the invention effectively mitigates tracking drift of the tracked target caused by occlusion, scale change, illumination and similar factors, and achieves robust target tracking.
2) The method can update the reference-frame features promptly as the target changes, while the introduced feature point matching algorithm avoids erroneous tracking caused by those updates.
3) The method is suitable for long-term stable tracking of slow small targets in optical aerial remote sensing imagery.
Drawings
FIG. 1 is a block diagram of a small low-flying target detection and tracking system according to an embodiment of the present invention;
FIG. 2 is a diagram of a target detection network architecture according to an embodiment of the present invention;
FIG. 3 is a flow chart of online learning according to an embodiment of the present invention;
fig. 4 is a flowchart of position refinement according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings by way of examples.
As shown in fig. 1, a small low-flying target visual detection tracking system includes the following units:
(1) a video data input unit. A plurality of video sequences containing targets are input and randomly divided into two parts, wherein one part is used for training a target detection model, and the other part is used for online testing of a target tracking model.
(2) A video preprocessing unit. It completes the preliminary video preprocessing required by the target detection and tracking units, specifically deleting video segments that contain no target for a long time and removing video segments that obviously do not match the characteristics of low-flying slow small targets.
(3) A training data construction unit. To ensure the completeness and richness of the training data, a training set and a validation set are constructed by extracting video frames at equal intervals, and the data are labeled, i.e. the center position, width and height of the target in each image are determined for supervised training of the target detection model.
(4) A detection model training unit. Because low-flying slow small targets vary greatly in scale and the classes are extremely imbalanced during single-target training, a pyramid-structured target detection network is designed and focal loss is used to alleviate the target-background imbalance. Training stops once the training loss function is observed to stabilize, and the model file with the best validation performance is saved to provide reset-box information when target tracking fails.
(5) A target comparison screening unit. Since the target detection result may contain false alarms, the SURF feature matching algorithm compares the first-frame truth box of the tracked video with the detection result, eliminating false alarms that are obviously not low-flying slow small targets and further ensuring robust long-term stable tracking.
(6) A detection correction unit. It is started in two cases: first, when the position confidence of the tracking box falls below a set threshold, indicating that target tracking of the current frame has failed; second, automatically when a specified frame interval is reached, ensuring that the currently tracked target does not differ too much from the reference-frame target features.
(7) A reference frame initialization unit. According to the received reference-frame target position and scale information, a search area is cropped at 5 times the target size and the image is scaled to the specified size of 288 x 288; the specified-size search-area image block and the target position and scale information are then input into the dynamically constructed classifier sample library (a code sketch of this cropping step follows the unit list below).
(8) A sample library dynamic construction unit. Data enhancement is performed on the reference-frame sample, including the basic operations of rotation, scaling, jittering and blurring, and samples newly added during tracking are received.
(9) An online learning unit. Features are extracted from the samples stored in the sample library using a deep ResNet18 network, and a predicted Gaussian response map is obtained after two fully connected layers. From the label information in the sample library, a true Gaussian label is generated with the target center as the peak of the Gaussian distribution. Taking the reduction of the difference between the predicted response map and the true label as the optimization target, the parameters of the two fully connected layers are adjusted online, so that in the unlabeled case the feature extraction network and the fully connected layers can still produce the predicted Gaussian response map and yield the predicted target position center.
(10) A position refinement unit. It refines the initial tracking result produced by the online learning unit: several jittered boxes are generated around the predicted target center from the current frame's online learning unit, with the target width and height taken from the previous frame, and mapped onto the current frame's search area; features are extracted for the jittered boxes with the precise region-of-interest pooling layer and concatenated with the modulation features obtained from the reference frame; the confidence of each jittered box's predicted position is obtained through a fully connected layer, and the results of the 3 most confident boxes are merged into the refined tracking result of the current frame.
(11) A decision control unit. During tracking, the target tracking state is judged from the relation between the tracking box's predicted-position confidence and a set threshold: if the target is tracked stably, tracking continues with the next frame; if the target is lost, the detection correction unit is activated to detect the target in the current frame and the reference frame is updated, achieving long-term stable tracking of low-flying slow small targets.
(12) A tracking result output unit. After all frames of the video are traversed, the position and scale information of each frame is output.
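As promised in (7) above, the cropping and scaling performed by the reference frame initialization unit can be sketched as follows. This is a minimal sketch: the helper name, integer rounding, and the simple boundary clamping are assumptions, not details fixed by the patent.

```python
import cv2

def init_reference_patch(frame, cx, cy, w, h, scale=5, out_size=288):
    """Crop a search area 5x the target size centred on (cx, cy), then
    scale it to the specified 288 x 288 input size."""
    sw, sh = int(scale * w), int(scale * h)
    x0 = max(int(cx - sw / 2), 0)       # clamp the crop to the image
    y0 = max(int(cy - sh / 2), 0)
    patch = frame[y0:y0 + sh, x0:x0 + sw]
    return cv2.resize(patch, (out_size, out_size))
```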
A small-sized low-flying target visual detection tracking method comprises the following steps:
step 1, constructing a target detection network;
the detection network structure for the low-slow small target mainly comprises two parts, namely a neural network structure with a multi-stage pyramid structure and a loss function for relieving extreme unbalance of a target background in a training process:
1) network architecture
Backbone network:
The backbone network for target detection uses feature pyramid levels P3 through P7. P3 and P4 are each computed from two parts: one comes from the output of the corresponding feature extraction network ResNet stage (C3 and C4) through lateral connections; the other uses a top-down pathway to upsample the deep, small-size feature map to the same size as the shallow one, followed by element-wise addition and a convolution so that the number of output channels remains 256; the channel count stays fixed while the information content grows. P5 is computed from the output of the corresponding ResNet stage (C5) through a lateral connection. P6 and P7 are obtained by applying a convolution layer and an activation layer on top of C5. Pl has a resolution 2^l times lower than the input image (l denotes the pyramid level), and all pyramid levels have C = 256 channels.
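A minimal PyTorch sketch of this pyramid construction, assuming ResNet18 stage widths (128, 256, 512 channels for C3 to C5); the class and layer names are illustrative, not the patent's implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class PyramidNeck(nn.Module):
    """Builds P3-P7 from ResNet stage outputs C3, C4, C5, all with 256 channels."""
    def __init__(self, c3_ch=128, c4_ch=256, c5_ch=512, out_ch=256):
        super().__init__()
        self.lat3 = nn.Conv2d(c3_ch, out_ch, 1)   # lateral connections
        self.lat4 = nn.Conv2d(c4_ch, out_ch, 1)
        self.lat5 = nn.Conv2d(c5_ch, out_ch, 1)
        self.out3 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.out4 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.p6 = nn.Conv2d(c5_ch, out_ch, 3, stride=2, padding=1)  # conv on C5
        self.p7 = nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1)

    def forward(self, c3, c4, c5):
        p5 = self.lat5(c5)
        # top-down: upsample the deeper map and add the lateral connection
        p4 = self.out4(self.lat4(c4) + F.interpolate(p5, scale_factor=2))
        p3 = self.out3(self.lat3(c3) + F.interpolate(p4, scale_factor=2))
        p6 = self.p6(c5)
        p7 = self.p7(F.relu(p6))                  # activation layer before P7
        return p3, p4, p5, p6, p7
```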
Classification subnet:
The target classification subnet predicts, at each spatial position, the probability of target presence for each of the A = 9 anchors and K target classes. This subnet is a small fully convolutional network attached to each backbone pyramid level, and all pyramid levels share its parameters. Given a C-channel input feature map from a pyramid level, the subnet applies four 3 × 3 convolution layers, each with C = 256 filters and ReLU activations, then a 3 × 3 convolution layer with K·A filters; finally a sigmoid activation is attached at each spatial position, outputting K·A binary predictions.
Regression subnet:
The bounding box regression subnet runs in parallel with the target classification subnet: another small fully convolutional network is attached after each pyramid level to regress the offset from each anchor box to a nearby true target, if one exists. The regression subnet has the same design as the classification subnet, except that it has 4·A linear outputs at each spatial position: for the A anchor boxes centered at each position, the 4 outputs predict the relative offsets between the upper-left and lower-right corner coordinates of the anchor box and the corresponding truth-box positions. Although the target classification subnet and the box regression subnet share a common structure, they use separate parameters. Fig. 2 shows the main structure of the target detection network.
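The two subnets can be sketched as follows, a minimal sketch under the parameters stated above (four 3 × 3 convolutions with 256 filters and ReLU, then K·A or 4·A outputs); K = 1 below is an assumption for a single "low-flying slow small target" class:

```python
import torch.nn as nn

def make_subnet(in_ch, out_ch, mid_ch=256):
    """Four 3x3 conv + ReLU layers followed by a 3x3 prediction layer."""
    layers, ch = [], in_ch
    for _ in range(4):
        layers += [nn.Conv2d(ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True)]
        ch = mid_ch
    layers.append(nn.Conv2d(mid_ch, out_ch, 3, padding=1))
    return nn.Sequential(*layers)

K, A = 1, 9                           # K target classes, A = 9 anchors per position
cls_subnet = make_subnet(256, K * A)  # sigmoid applied per spatial position
reg_subnet = make_subnet(256, 4 * A)  # 4 corner offsets per anchor, linear outputs
```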
2) Loss function
Focal Loss (FL) is used:

FL(p_t) = -α_t (1 - p_t)^γ log(p_t)   (5)

p_t = p if y = 1, and p_t = 1 - p otherwise   (6)

It addresses the severely imbalanced proportion of positive and negative samples in target detection and reduces the weight of easy negative samples during training. In formula (5), FL(·) denotes the focal loss; p_t is the value of the discrimination function of the target classification probability; α is a balance factor compensating for the unequal proportion of positive and negative samples; γ is a balance factor for easy and hard samples: taking γ greater than 0 reduces the loss on easily classified samples, making the model focus on learning hard and misclassified samples. Formula (6) is the expression of the probability discrimination function: y is the prediction output label after the activation function, with value between 0 and 1; p is the probability that the target belongs to the labeled category.
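Formulas (5) and (6) translate directly into code. A minimal sketch follows; the defaults α = 0.25 and γ = 2 are common choices in the focal-loss literature, not values fixed by the patent:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with p_t from formula (6).
    logits: raw scores, shape (N,); targets: 0/1 ground-truth labels, shape (N,)."""
    targets = targets.float()
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # formula (6)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (alpha_t * (1 - p_t).pow(gamma) * ce).mean()    # ce equals -log(p_t)
```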
Step 2, comparing and screening targets;
To prevent tracking failure caused by no target being present in the image, or by a false alarm being detected as the target so that the reference frame is updated with background information, one round of SURF feature point matching is performed before the result is sent to the detection correction unit: the first-frame target of the current video is matched against the detection result. When the number of matching points exceeds a set value, the detection result is indeed the target currently being tracked and detection succeeds; the detection box is then sent to the detection correction unit for the subsequent process.
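A minimal OpenCV sketch of this screening step (SURF is provided by opencv-contrib-python; the Hessian threshold and the Lowe ratio below are assumptions, as is the use of a ratio test at all; the patent only specifies counting matching points):

```python
import cv2

def surf_match_count(template_bgr, candidate_bgr, ratio=0.7):
    """Count good SURF matches between the first-frame target patch and a
    detection patch; the detection is accepted if the count exceeds a set value."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    _, des1 = surf.detectAndCompute(cv2.cvtColor(template_bgr, cv2.COLOR_BGR2GRAY), None)
    _, des2 = surf.detectAndCompute(cv2.cvtColor(candidate_bgr, cv2.COLOR_BGR2GRAY), None)
    if des1 is None or des2 is None:
        return 0
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    # Lowe's ratio test keeps only distinctive correspondences
    good = [m[0] for m in matches if len(m) == 2 and m[0].distance < ratio * m[1].distance]
    return len(good)
```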
Step 3, target tracking online learning;
as shown in fig. 3, the on-line learning of target tracking realizes the function of predicting the target center position in real time during the tracking process. The method mainly comprises two parts of an initialization classifier and an online classification process:
1) initialization classifier
For the data-enhanced reference frame in the sample library, features are extracted with the feature extraction network, and a two-dimensional Gaussian truth label y_gt of the same size as the feature map is generated with the reference-frame target center position as its peak:

y_gt(x, y) = exp(-((x - x0)^2 + (y - y0)^2) / (2σ^2))   (7)

where (x0, y0) is the reference-frame target center and σ the Gaussian bandwidth, as in formula (1).
The classifier is initialized from the features and the labels, and the distance between actual and true values is minimized with a least-squares optimization algorithm; the nonlinear least-squares problem is solved by the Gauss-Newton iteration method. Its basic idea is to approximate the nonlinear regression model with a Taylor-series expansion and then, through repeated iterations that correct the regression coefficients, drive them toward the optimal coefficients of the model, finally minimizing the residual sum of squares of the original model.
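Generating the label of formula (7) is straightforward. A minimal sketch; the 1-based indexing follows the coordinate ranges given for formula (1):

```python
import numpy as np

def gaussian_label(M, N, cx, cy, sigma):
    """2-D Gaussian truth label with peak 1.0 at the target centre (cx, cy);
    M, N are the feature map width and height."""
    xs = np.arange(1, M + 1)
    ys = np.arange(1, N + 1)
    X, Y = np.meshgrid(xs, ys)   # shape (N, M): rows = vertical, cols = horizontal
    return np.exp(-((X - cx) ** 2 + (Y - cy) ** 2) / (2.0 * sigma ** 2))
```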
2) Online classification process
As shown in Fig. 3, based on the tracking result (x_{t-1}, y_{t-1}, w_{t-1}, h_{t-1}) of the previous frame, where (x_{t-1}, y_{t-1}) is the estimated target center and (w_{t-1}, h_{t-1}) the estimated target width and height, the width and height are expanded by a specified ratio k around the previous frame's target center, generating the current frame's search region (x_{t-1}, y_{t-1}, k·w_{t-1}, k·h_{t-1}). The feature extraction network then extracts the search-region features f_t, and after two fully connected layers a predicted Gaussian response map of the same size as the search region is generated:

ŷ_t = φ2(φ1(f_t; weight1); weight2)

The position of maximum response is the target center coordinate (x_t, y_t) estimated for the current frame. The online-trained classifier fully considers both the tracked target and the background region, and is continuously updated to estimate the target position.
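One online-classification step can be sketched as follows. Here extract_features and classifier stand in for the ResNet18 backbone and the two online-trained fully connected layers; the names, the boundary handling, and the coordinate mapping back to the image are assumptions:

```python
import numpy as np

def track_step(frame, prev_box, extract_features, classifier, k=5.0):
    """Crop the search region around the previous centre, predict the Gaussian
    response map, and return the argmax as the new target centre.
    Boundary clipping is simplified for brevity."""
    cx, cy, w, h = prev_box
    sw, sh = k * w, k * h                              # expand by ratio k
    x0, y0 = int(cx - sw / 2), int(cy - sh / 2)
    region = frame[max(y0, 0):y0 + int(sh), max(x0, 0):x0 + int(sw)]
    response = classifier(extract_features(region))    # predicted Gaussian map
    ry, rx = np.unravel_index(int(np.argmax(response)), response.shape)
    # map the response-map peak back to image coordinates
    return (x0 + rx * sw / response.shape[1], y0 + ry * sh / response.shape[0])
```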
Step 4, dynamically constructing a classifier training sample library
Since the motion state of a low-flying slow small target is complex and its size and appearance change greatly during tracking, using only the first frame of the video as the supervision frame clearly cannot adapt to the target's dynamic changes in real time. The following two steps further alleviate this problem:
Firstly, a reference frame update interval T is set; whenever the current frame number t is divisible by T, the target detection unit is called to update the reference frame, all outdated samples in the sample library are cleared, and the classifier is re-initialized with the updated reference frame. Meanwhile, newly generated samples are added to the sample library in sequence as tracking proceeds, so that the target features in the sample library stay highly similar to the currently tracked sample, which helps estimate the target center position accurately.
And secondly, when the confidence coefficient of the predicted position of the tracking frame is smaller than a set threshold, the current frame target tracking fails, the decision control unit sends information of the reference frame needing to be reinitialized to the detection and correction unit, and after the detection and correction unit receives the information of the detection frame of the target detection unit, the information is sent to the reference frame initialization unit, data enhancement and other operations are carried out, and finally the information is sent to a dynamically constructed sample library.
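The two update rules combine naturally into one routine. A minimal sketch; the class name, callback, buffer size, and default interval are illustrative assumptions:

```python
from collections import deque

class SampleLibrary:
    """Dynamic sample library: refresh the reference frame every T frames or on
    tracking failure, clear outdated samples, and keep adding new ones."""
    def __init__(self, T=50, max_samples=50):
        self.T = T
        self.samples = deque(maxlen=max_samples)   # oldest samples fall off

    def update(self, t, new_sample, confidence, threshold, detect_reference):
        if t % self.T == 0 or confidence < threshold:
            reference = detect_reference()         # detection unit supplies a box
            self.samples.clear()                   # remove all outdated samples
            self.samples.append(reference)         # classifier re-initialised here
        self.samples.append(new_sample)
```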
Step 5, fine trimming of target tracking position
As shown in Fig. 4, the main function of target tracking position refinement is to finally determine the position and scale of the current frame's tracking box from the target center position predicted by the classifier and the target width and height of the previous frame; a sketch of the jittered-box merging follows. The refinement mainly comprises a feature extraction network and a similarity evaluation network:
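The jittered-box scheme of the position refinement unit can be sketched as follows. The jitter magnitudes and the number of boxes are assumptions, and score_fn stands in for the PrPool-plus-fully-connected confidence network described below:

```python
import numpy as np

def refine_position(cx, cy, w, h, score_fn, n_boxes=10, top_k=3, seed=0):
    """Jitter boxes around the predicted centre with the previous frame's
    width/height, score each box, and merge the top-3 most confident boxes."""
    rng = np.random.default_rng(seed)
    boxes = np.array([
        [cx + rng.normal(0, 0.1 * w), cy + rng.normal(0, 0.1 * h),
         w * rng.uniform(0.9, 1.1), h * rng.uniform(0.9, 1.1)]
        for _ in range(n_boxes)
    ])
    scores = np.array([score_fn(b) for b in boxes])
    top = boxes[np.argsort(scores)[-top_k:]]       # 3 highest-confidence boxes
    return top.mean(axis=0)                        # merged (cx, cy, w, h) result
```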
1) feature extraction network
The ResNet-18 network is still used for feature extraction. To make full use of historical information, previously kept template information is balanced against the current reference-frame update information, providing the neural network with features that combine the target's current and historical states and improving tracking stability. Search-area features are extracted for three image frames (the reference frame, the current frame, and the frame at the intermediate time between them), and each is fed into the precise region-of-interest pooling layer for the similarity evaluation network to compute the predicted-position confidence.
2) Similarity evaluation network
The core of the similarity evaluation network is the precise region-of-interest pooling layer, whose input comprises two parts. The first is the image feature map extracted by the network, where (i, j) is a coordinate on the feature map and w_{i,j} the weight at position (i, j); bilinear interpolation with interpolation coefficient IC

IC(x, y, i, j) = max(0, 1-|x-i|) × max(0, 1-|y-j|)   (8)

maps the discrete feature map into a continuous space:

f(x, y) = Σ_{i,j} IC(x, y, i, j) × w_{i,j}   (9)

The second is the upper-left and lower-right corner coordinates of the rectangular box, (x1, y1) and (x2, y2). The precise region-of-interest pooling operation is performed from the obtained continuous spatial feature map and the rectangular-box coordinates:

PrPool = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 - x1) × (y2 - y1))   (10)

preserving the target features on the image to the greatest extent and preparing for the further comparison of the similarity between the reference target and historical-frame targets. After the features from the precise region-of-interest pooling layer are obtained, the three features of the reference frame, the intermediate frame and the current frame are concatenated and fed into the fully connected layer, which outputs the final position confidence. The similarity between each candidate target and the historical targets is compared, and the most similar target is taken as the tracking result.
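Formulas (8) to (10) can be checked numerically with a brute-force sketch: a grid-sampling approximation of the double integral, not the closed-form implementation used in practice:

```python
import numpy as np

def prroi_pool(feature, x1, y1, x2, y2, grid=32):
    """Approximate PrROI pooling: bilinearly interpolate f(x, y) on a dense grid
    inside the box and average, i.e. the double integral divided by the box area."""
    H, W = feature.shape
    total = 0.0
    for y in np.linspace(y1, y2, grid):
        for x in np.linspace(x1, x2, grid):
            i0, j0 = int(np.floor(x)), int(np.floor(y))
            for i in (i0, i0 + 1):                 # the four neighbouring cells
                for j in (j0, j0 + 1):
                    if 0 <= i < W and 0 <= j < H:
                        # IC(x, y, i, j) = max(0, 1-|x-i|) * max(0, 1-|y-j|)
                        total += max(0.0, 1 - abs(x - i)) * max(0.0, 1 - abs(y - j)) * feature[j, i]
    return total / (grid * grid)                   # mean over the box = integral / area
```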
It will be appreciated by those of ordinary skill in the art that the examples described herein are intended to assist the reader in understanding the manner in which the invention is practiced, and it is to be understood that the scope of the invention is not limited to such specifically recited statements and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from the spirit of the invention, and these changes and combinations are within the scope of the invention.

Claims (2)

1. A small low-flying target detection and tracking system, comprising: the system comprises a video data input unit, a video preprocessing unit, a training data construction unit, a detection model training unit, a target comparison screening unit, a detection correction unit, a reference frame initialization unit, a sample library dynamic construction unit, an online learning unit, a position fine modification unit, a decision control unit and a tracking result output unit;
the video data input unit is configured to: inputting a plurality of video sequences containing targets, and randomly dividing the video sequences into two parts, wherein one part is used for training a target detection model, and the other part is used for online testing of a target tracking model;
the video pre-processing unit is configured to: complete the preliminary video preprocessing required by the target detection and tracking units, specifically deleting video segments that contain no target for a long time and removing video segments that obviously do not match the characteristics of small, slow targets flying in low-altitude airspace;
the training data construction unit is configured to: ensuring the completeness and richness of training data, constructing a training set and a verification set by extracting video frames at equal intervals, and carrying out data labeling, namely determining the central position, width and height information of a target in an image, wherein the information is used for a supervised training target detection model;
the detection model training unit is used for: creating a pyramid-structured target detection network and using focal loss to alleviate the target-background imbalance; stopping training once the training loss function is observed to stabilize, and saving the model file with the best validation performance to provide reset-box information when target tracking fails;
the target alignment screening unit is used for: comparing the first frame true value frame of the tracked video with the detection result by using a SURF (speeded up robust features) feature matching algorithm, eliminating false alarms which are obviously not low-slow small targets, and further ensuring the robustness of long-term stable tracking;
the detection correction unit is used for: starting when either of two conditions occurs: first, when the position confidence of the tracking box falls below a set threshold, indicating that target tracking of the current frame has failed; second, automatically when a specified frame interval is reached, ensuring that the currently tracked target does not differ too much from the reference-frame target features;
the reference frame initialization unit is configured to: cutting out a search area according to the received target position and scale information of the reference frame and according to the size 5 times of the target size, zooming the image to the specified size 288 x 288, and inputting the image block of the search area with the specified size and the scale information of the target position into a dynamic construction sample library of the classifier;
the sample library dynamic construction unit is used for: performing data enhancement on the reference frame sample, including basic operations of rotation, scaling, dithering and blurring, and receiving a newly added sample in the tracking process;
the online learning unit extracts features from the samples stored in the sample library using a deep ResNet18 network and obtains a predicted Gaussian response map after two fully connected layers; from the label information in the sample library, a true Gaussian label is generated with the target center as the peak of the Gaussian distribution; taking the reduction of the difference between the predicted response map and the true label as the optimization target, the parameters of the two fully connected layers are adjusted online, so that in the unlabeled case the feature extraction network and the fully connected layers can still produce the predicted Gaussian response map and yield the predicted target position center;
the position refinement unit is configured to: performing position refinement on an initial tracking result obtained by an online learning unit, obtaining a plurality of shaking frames by taking the center of a predicted target position obtained by the online learning unit of a current frame and the width and height of a target scale obtained by a previous frame as references, mapping the shaking frames onto a search area of the current frame, performing feature extraction on the shaking frames by using an accurate region-of-interest pooling layer, splicing modulation features obtained by a reference frame, and finally obtaining the confidence coefficient of the predicted position of each shaking frame through a full-connection layer, and merging the results of 3 shaking frames with the highest confidence coefficients to obtain the tracking result of the current frame after the refinement;
the decision control unit is configured to: judging a target tracking state through the relation between the confidence coefficient of the predicted position of the tracking frame and a set threshold value in the tracking process, if the target is stably tracked, continuing to track the next frame, and if the target is lost, activating a detection and correction unit to detect the target of the current frame and updating a reference frame so as to realize long-term stable tracking of low and slow small targets;
the tracking result output unit is used for: and after traversing all the frames of the video, outputting the position and scale information of each frame.
2. The detection and tracking method of the small-sized low-flying target detection and tracking system according to claim 1, characterized by comprising the following steps:
step 1, constructing a target detection network;
1) constructing a network structure comprising: a backbone network, a classification subnet and a regression subnet;
2) a loss function: focal loss is used to address the severe imbalance between positive and negative samples in target detection and to reduce the weight of easy negative samples during training;
step 2, comparing and screening targets;
before the result is sent to the detection correction unit, one round of SURF feature point matching is performed: the first-frame target of the current video is matched against the detection result; when the number of matching points exceeds a set value, the detection result is indeed the target needing to be tracked at present and detection succeeds, and the detection box is sent to the detection correction unit for the subsequent process;
step 3, target tracking online learning;
predicting the target center position, comprising two parts of an initialization classifier and an online classification process:
1) initialization classifier
for the data-enhanced reference frame in the sample library, extracting features with the feature extraction network, and generating a two-dimensional Gaussian truth label y_gt of the same size as the feature map with the reference-frame target center position as its peak; initializing the classifier from the features and the label, minimizing the distance between actual and true values with a least-squares optimization algorithm, and solving the nonlinear least-squares problem by the Gauss-Newton iteration method; the formula is as follows:

y_gt(x, y) = exp(-((x - x0)^2 + (y - y0)^2) / (2σ^2))   (1)

in formula (1), x ∈ {1, …, M} is the horizontal coordinate on the feature map, with M the feature map width; y ∈ {1, …, N} is the vertical coordinate, with N the feature map height; (x0, y0) is the target center point of the reference frame; σ is the Gaussian bandwidth;
2) online classification process
based on the tracking result (x_{t-1}, y_{t-1}, w_{t-1}, h_{t-1}) of the previous frame (frame t-1), where (x_{t-1}, y_{t-1}) is the estimated target center and (w_{t-1}, h_{t-1}) the estimated target width and height, taking the previous frame's target center as the center of the current frame's (frame t) search region and expanding its width and height by a specified ratio k, generating the current-frame search region (x_{t-1}, y_{t-1}, k·w_{t-1}, k·h_{t-1}); then extracting the search-region features f_t with the feature extraction network, and after two fully connected layers generating a predicted Gaussian response map of the same size as the search region:

ŷ_t = φ2(φ1(f_t; weight1); weight2)

where φ1 and φ2 denote the mapping functions of the two fully connected layers, and weight1, weight2 their weight coefficient matrices; the position of maximum response is the target center coordinate (x_t, y_t) estimated for the current frame; the online-trained classifier fully considers both the tracked target and the background region, and the classifier is continuously updated to estimate the target position;
step 4, dynamically constructing a classifier training sample library
1) setting a reference frame update interval T; whenever the current frame number t is divisible by T, calling the target detection unit to update the reference frame, clearing all outdated samples in the sample library, and re-initializing the classifier with the updated reference frame; meanwhile, newly generated samples are added to the sample library in sequence as tracking proceeds, so that the target features in the sample library stay highly similar to the currently tracked sample and the target center position can be estimated accurately;
2) when the confidence coefficient of the predicted position of the tracking frame is smaller than a set threshold, the current frame target tracking fails, the decision control unit sends information of a reference frame needing to be reinitialized to the detection and correction unit, and the detection and correction unit sends the information to the reference frame initialization unit after receiving the information of the detection frame of the target detection unit, performs data enhancement operation and finally sends the information to a dynamically constructed sample library;
step 5, fine trimming of the target tracking position;
the method comprises two parts of a feature extraction network and a similarity evaluation network:
1) feature extraction network
feature extraction uses a ResNet18 network; previously kept template information is balanced against the current reference-frame update information, providing the neural network with features that combine the target's current and historical states and improving tracking stability; search-area features are extracted for three image frames (the reference frame, the current frame, and the frame at the intermediate time between them), and each is fed into the precise region-of-interest pooling layer for the similarity evaluation network to compute the predicted-position confidence;
2) similarity evaluation network
the core of the similarity evaluation network is the precise region-of-interest pooling layer, whose input comprises two parts: the first part is the image feature map extracted by the network, bilinearly interpolated with interpolation coefficient IC

IC(x, y, i, j) = max(0, 1-|x-i|) × max(0, 1-|y-j|)   (2)

mapping the discrete feature map into a continuous space and obtaining the feature map f(x, y)

f(x, y) = Σ_{i,j} IC(x, y, i, j) × w_{i,j}   (3)

in formulas (2) and (3), (x, y) is a continuous coordinate on the feature map, (i, j) is the coordinate index on the feature map, and w_{i,j} is the weight corresponding to position (i, j) on the feature map; the second part of the input is the upper-left corner coordinates (x1, y1) and the lower-right corner coordinates (x2, y2) of the rectangular box; the precise region-of-interest pooling operation is performed from the obtained continuous spatial feature map and the rectangular-box coordinates, preserving the target features on the image to the greatest extent in preparation for further comparing the similarity of the reference target and historical-frame targets; finally, the feature map f(x, y) is doubly integrated over the box and divided by the area of the rectangular box to obtain the precise region-of-interest pooling (PrROI Pooling) output

PrPool = (∫_{y1}^{y2} ∫_{x1}^{x2} f(x, y) dx dy) / ((x2 - x1) × (y2 - y1))   (4)
after the features from the precise region-of-interest pooling layer are obtained, the three features of the reference frame, the intermediate frame and the current frame are concatenated and fed into the fully connected layer, which outputs the final position confidence; the similarity between each candidate target and the historical targets is compared, and the most similar target is taken as the tracking result.
CN202010309617.7A 2020-04-20 2020-04-20 Small-sized low-flying target visual detection tracking system and method thereof Active CN111508002B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010309617.7A CN111508002B (en) 2020-04-20 2020-04-20 Small-sized low-flying target visual detection tracking system and method thereof


Publications (2)

Publication Number Publication Date
CN111508002A CN111508002A (en) 2020-08-07
CN111508002B true CN111508002B (en) 2020-12-25

Family

ID=71869437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010309617.7A Active CN111508002B (en) 2020-04-20 2020-04-20 Small-sized low-flying target visual detection tracking system and method thereof

Country Status (1)

Country Link
CN (1) CN111508002B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489081B (en) * 2020-11-30 2022-11-08 北京航空航天大学 Visual target tracking method and device
CN112633162B (en) * 2020-12-22 2024-03-22 重庆大学 Pedestrian rapid detection and tracking method suitable for expressway external field shielding condition
CN115335872A (en) * 2021-02-26 2022-11-11 京东方科技集团股份有限公司 Training method of target detection network, target detection method and device
CN112949480A (en) * 2021-03-01 2021-06-11 浙江大学 Rail elastic strip detection method based on YOLOV3 algorithm
CN113012203B (en) * 2021-04-15 2023-10-20 南京莱斯电子设备有限公司 High-precision multi-target tracking method under complex background
CN113449680B (en) * 2021-07-15 2022-08-30 北京理工大学 Knowledge distillation-based multimode small target detection method
CN113724290B (en) * 2021-07-22 2024-03-05 西北工业大学 Multi-level template self-adaptive matching target tracking method for infrared image
CN113658222A (en) * 2021-08-02 2021-11-16 上海影谱科技有限公司 Vehicle detection tracking method and device
CN114066936B (en) * 2021-11-06 2023-09-12 中国电子科技集团公司第五十四研究所 Target reliability tracking method in small target capturing process
CN114241008B (en) * 2021-12-21 2023-03-07 北京航空航天大学 Long-time region tracking method adaptive to scene and target change
CN116596958B (en) * 2023-07-18 2023-10-10 四川迪晟新达类脑智能技术有限公司 Target tracking method and device based on online sample augmentation
CN117292306A (en) * 2023-11-27 2023-12-26 四川迪晟新达类脑智能技术有限公司 Edge equipment-oriented vehicle target detection optimization method and device
CN117576164B (en) * 2023-12-14 2024-05-03 中国人民解放军海军航空大学 Remote sensing video sea-land movement target tracking method based on feature joint learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108230367A (en) * 2017-12-21 2018-06-29 西安电子科技大学 A kind of quick method for tracking and positioning to set objective in greyscale video
CN109584269A (en) * 2018-10-17 2019-04-05 龙马智芯(珠海横琴)科技有限公司 A kind of method for tracking target

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7379652B2 (en) * 2005-01-14 2008-05-27 Montana State University Method and apparatus for detecting optical spectral properties using optical probe beams with multiple sidebands
JP6042146B2 (en) * 2012-09-18 2016-12-14 株式会社東芝 Object detection apparatus and object detection method
CN107862705B (en) * 2017-11-21 2021-03-30 重庆邮电大学 Unmanned aerial vehicle small target detection method based on motion characteristics and deep learning characteristics
CN108154118B (en) * 2017-12-25 2018-12-18 北京航空航天大学 A kind of target detection system and method based on adaptive combined filter and multistage detection
CN110363789B (en) * 2019-06-25 2022-03-25 电子科技大学 Long-term visual tracking method for practical engineering application
CN110533691B (en) * 2019-08-15 2021-10-22 合肥工业大学 Target tracking method, device and storage medium based on multiple classifiers
CN110717934B (en) * 2019-10-17 2023-04-28 湖南大学 Anti-occlusion target tracking method based on STRCF

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108230367A (en) * 2017-12-21 2018-06-29 西安电子科技大学 A kind of quick method for tracking and positioning to set objective in greyscale video
CN109584269A (en) * 2018-10-17 2019-04-05 龙马智芯(珠海横琴)科技有限公司 A kind of method for tracking target

Also Published As

Publication number Publication date
CN111508002A (en) 2020-08-07

Similar Documents

Publication Publication Date Title
CN111508002B (en) Small-sized low-flying target visual detection tracking system and method thereof
CN109614985B (en) Target detection method based on densely connected feature pyramid network
CN107563313B (en) Multi-target pedestrian detection and tracking method based on deep learning
CN107609525B (en) Remote sensing image target detection method for constructing convolutional neural network based on pruning strategy
Zhou et al. Efficient road detection and tracking for unmanned aerial vehicle
CN110287826B (en) Video target detection method based on attention mechanism
US20160154999A1 (en) Objection recognition in a 3d scene
CN107633226B (en) Human body motion tracking feature processing method
CN111882586B (en) Multi-actor target tracking method oriented to theater environment
CN108446634B (en) Aircraft continuous tracking method based on combination of video analysis and positioning information
CN111738055B (en) Multi-category text detection system and bill form detection method based on same
CN113483747B (en) Improved AMCL (advanced metering library) positioning method based on semantic map with corner information and robot
CN112836639A (en) Pedestrian multi-target tracking video identification method based on improved YOLOv3 model
JP2006209755A (en) Method for tracing moving object inside frame sequence acquired from scene
CN111160407A (en) Deep learning target detection method and system
CN112884742A (en) Multi-algorithm fusion-based multi-target real-time detection, identification and tracking method
CN110555868A (en) method for detecting small moving target under complex ground background
CN111126278A (en) Target detection model optimization and acceleration method for few-category scene
CN111354022A (en) Target tracking method and system based on kernel correlation filtering
CN112164093A (en) Automatic person tracking method based on edge features and related filtering
CN113052136B (en) Pedestrian detection method based on improved Faster RCNN
CN113379789A (en) Moving target tracking method in complex environment
CN112347967A (en) Pedestrian detection method fusing motion information in complex scene
CN111275733A (en) Method for realizing rapid tracking processing of multiple ships based on deep learning target detection technology
CN113628242A (en) Satellite video target tracking method and system based on background subtraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant