CN110517285B - Large-scene minimum target tracking based on motion estimation ME-CNN network - Google Patents

Large-scene minimum target tracking based on motion estimation ME-CNN network

Info

Publication number
CN110517285B
CN110517285B (application CN201910718847.6A)
Authority
CN
China
Prior art keywords
target
network
motion
cnn
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910718847.6A
Other languages
Chinese (zh)
Other versions
CN110517285A (en)
Inventor
焦李成
杨晓岩
李阳阳
唐旭
程曦娜
刘旭
杨淑媛
冯志玺
侯彪
张丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201910718847.6A priority Critical patent/CN110517285B/en
Publication of CN110517285A publication Critical patent/CN110517285A/en
Application granted granted Critical
Publication of CN110517285B publication Critical patent/CN110517285B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/207Analysis of motion for motion estimation over a hierarchy of resolutions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10032Satellite or aerial image; Remote sensing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a large-scene minimum target tracking method based on a motion estimation ME-CNN network, which tracks extremely small targets from motion parameters without image registration and comprises the following implementation steps: acquiring an initial training set D for the target motion estimation network ME-CNN; constructing the network ME-CNN for estimating the motion of the target; calculating the loss function of the network ME-CNN from the target motion parameters; judging whether the training set is the initial training set; updating the training labels of the loss function; obtaining an initial model for predicting the motion position of the target; correcting the position predicted by the model; updating the training data set with the corrected target position to complete target tracking for one frame; and obtaining the remote sensing video target tracking result. The method predicts the target motion position with the deep learning network ME-CNN, avoids the problems of large-scene image registration and of extracting features from extremely blurred targets during tracking, reduces the dependence on target features, and improves the accuracy of target tracking in extremely blurred video.

Description

Large-scene minimum target tracking based on motion estimation ME-CNN network
Technical Field
The invention belongs to the technical field of remote sensing video processing, relates to remote sensing video target tracking of a large-scene minimum target, and particularly relates to a large-scene minimum target remote sensing video tracking method based on a motion estimation ME-CNN network. The method is used for safety monitoring, smart city construction, traffic facility monitoring and the like.
Background
Remote sensing target tracking is an important research direction in the field of computer vision, and tracking small, low-resolution targets in large-scene remote sensing video shot by a moving satellite is an especially challenging problem. Such large-scene, small-target remote sensing video records the daily activity of an area over a period of time. Because the satellite's shooting altitude is very high and the footage covers most of a city, the video resolution is low; the vehicles, ships and airplanes in the video are extremely small, with a vehicle occupying only about 3×3 pixels, and their contrast with the surrounding environment is extremely low, so the human eye can observe only a small bright spot. Tracking such ultra-low-pixel, extremely small targets in a large scene is therefore particularly difficult. Moreover, because the satellite shooting the video keeps moving, the whole video drifts noticeably in one direction, and some regions are additionally scaled owing to terrain height, so the usual approach of first registering the images and then applying a frame-difference method to obtain the target's motion position is hard to apply, which poses a great challenge to remote sensing video target tracking of extremely small targets in large scenes.
Video target tracking means predicting the target's position and size in subsequent video frames once its position and size in an initial frame are given. Most current algorithms in the video tracking field are based on neural networks or on correlation filters. Neural-network-based algorithms such as CNN-SVM first feed the target into a multilayer neural network to learn target features and then track with a traditional SVM; the target features learned from large amounts of training data are more discriminative than traditional hand-crafted features. Correlation-filter-based algorithms such as KCF instead learn a filter template, convolve it with candidate search regions of the next frame, and take the search region with the largest response as the predicted target position; a minimal sketch of this response-map idea is given below.
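For illustration only (this is not part of the patent's method), a small sketch of the correlation-filter idea just described, assuming grayscale frames held as NumPy arrays:

```python
# Illustrative sketch of correlation-filter tracking: correlate a template
# with the next frame and take the peak of the response map as the target.
import numpy as np
from scipy.signal import correlate2d

def correlation_track(frame, template):
    response = correlate2d(frame, template, mode='same')     # filter response map
    y, x = np.unravel_index(np.argmax(response), response.shape)
    return x, y                                              # predicted target position
```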
Algorithms designed for natural optical video are difficult to apply to remote sensing video of extremely small targets in a large scene, because the target is so tiny and blurred that a neural network cannot learn effective target features from it. Traditional remote sensing tracking methods are likewise unsuitable for video with continuous background drift and partial region scaling, since image registration and frame-difference techniques cannot be carried out, and with the target's extremely low contrast against its surroundings the target is easily lost.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a large-scene small-target remote sensing video tracking method based on motion estimation with low computational complexity and higher precision.
The invention relates to a large-scene minimum target remote sensing video tracking method based on a motion estimation ME-CNN network, which is characterized by comprising the following steps of:
(1) obtaining an initial training set D of a minimum target motion estimation network ME-CNN:
taking the front F frame images of the original remote sensing data video A, continuously marking a boundary box for the same target of each image, and arranging the vertex coordinates of the upper left corner of each boundary box together according to the video frame number sequence to be used as a training set D;
(2) constructing a network ME-CNN for estimating the movement of the minimum target: the network comprises three parallel convolution modules that extract different features of the training data, followed in sequence by a concatenation layer, a fully connected layer and an output layer;
(3) calculating the loss function of the network ME-CNN by using the minimum target motion parameter: calculating to obtain the motion trend of the target according to the motion rule of the target, taking the motion trend as a training label corresponding to the target, and calculating the Euclidean spatial distance between the training label and the prediction result of the ME-CNN network as a loss function of the ME-CNN network optimization training;
(4) judging whether the training set is an initial training set: judging whether the current training set is an initial training set, if not, executing the step (5) and updating the training labels in the loss function; otherwise, if the training set is the initial training set, executing the step (6) and entering the circular training of the network;
(5) updating the training labels in the loss function: when the current training set is not the initial training set, recalculating the training labels of the loss function from the data of the current training set, computing them from the minimum target motion parameters in the same way as in step (3); the recalculated training labels take part in training the motion estimation network ME-CNN, and the procedure moves on to step (6);
(6) obtaining an initial model M1 for predicting the movement position of the target: inputting the training set D into a target motion estimation network ME-CNN, training the network according to the current loss function, and obtaining an initial model M1 for predicting the motion position of the target;
(7) correcting the position result of the prediction model: calculating the auxiliary position offset of the target, and correcting the position result predicted by the motion estimation network ME-CNN with the offset;
(7a) obtaining a target gray level image block: obtaining the target position (Px, Py) of the next frame according to the initial model M1 for predicting the target motion position, taking out a gray image block of the target from the image of the next frame based on the obtained target position (Px, Py), and normalizing to obtain a normalized target gray image block;
(7b) obtaining a target position offset: carrying out brightness grading on the normalized target gray image block, determining the position of a target in the image block by using a vertical projection method, and calculating the distance between the center position of the target and the center position of the image block to obtain the offset of the target position;
(7c) obtaining a corrected target position: correcting the position of the predicted target by the motion estimation network ME-CNN by using the obtained target position offset to obtain all positions of the corrected target;
(8) updating the training data set with the corrected target position to complete target tracking of one frame: adding the obtained top-left corner position of the target as the last row of the training set D and removing the first row of the training set D; after this operation the corrected and updated training set D is obtained, the training for one frame is completed, and the target position result for that frame is obtained;
(9) judging whether the current video frame number is less than the total number of video frames: if it is less, repeating steps (4) to (9) in a loop and continuing the tracking optimization training of the target until all video frames have been traversed; otherwise, if the current frame number equals the total number of video frames, the training ends and step (10) is executed;
(10) obtaining a remote sensing video target tracking result: the accumulated output of target positions is the remote sensing video target tracking result.
The invention solves the problems of high calculation complexity and low tracking precision of the existing video tracking algorithm.
Compared with the prior art, the invention has the following advantages:
(1) the ME-CNN adopted by the invention does not need the traditional procedure of image registration followed by a frame-difference method, nor complex image background modelling, to obtain the motion trajectory of the target; it analyses, through a neural network, a training set consisting of the target positions of the first F frames, and because no target position labels need to be marked manually in subsequent video frames, the network can train itself in a self-looping manner driven by its own predictions, which greatly reduces the complexity of the tracking algorithm and improves the practicability of the algorithm.
(2) The algorithm adopted by the invention automatically corrects the target position of the remote sensing video by combining the ME-CNN network and the auxiliary position offset method, modifies the loss function of the motion estimation network according to the motion rule of the target, reduces the calculated amount of the network and improves the robustness of target tracking.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
fig. 2 is a schematic structural diagram of an ME-CNN network proposed by the present invention;
FIG. 3 is a graph comparing the trajectory predicted by the present invention for an extremely small target in a large scene with the standard target trajectory, where the prediction of the present invention is the green curve and the red curve is the accurate target trajectory.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
Example 1
The remote sensing video tracking of extremely small targets in large scenes plays an important role in safety monitoring, smart city construction, traffic facility monitoring and the like. The remote sensing video studied by the invention is a large-scene, low-resolution video of very small targets shot by a moving satellite. The tracking target studied here is extremely blurred and extremely small, with low contrast against its surroundings, so that when the target is not moving the human eye can hardly tell that it is a vehicle; in addition, the motion of the satellite and the changing altitude of the imaged area cause image translation and partial zooming of the video, so target tracking in such video is much harder than in clear video and remains a challenge for remote sensing video tracking. Existing methods fall mainly into two classes. One class uses neural network learning to extract target features, extracts several search boxes in the next frame, and selects the box with the highest target feature score as the target position. The other class first registers the images, then applies a frame-difference method to obtain the target motion trajectory, then learns a filter template, convolves the image of the next frame with the filter template, and takes the region with the largest response as the predicted target. In view of these circumstances, the invention proposes, through research, a large-scene minimum target remote sensing video tracking method based on a motion estimation ME-CNN network; referring to FIG. 1, the method comprises the following steps:
(1) obtaining an initial training set D of a minimum target motion estimation network ME-CNN:
the method comprises the steps of taking front F frame images of an original remote sensing data video A, selecting only one target in each image, and continuously marking a boundary frame for the same target of each image.
(2) Constructing a network ME-CNN for estimating the movement of the minimum target: the ME-CNN network comprises three parallel convolution modules that extract different features of the training data to obtain different motion characteristics of the target; the extracted motion features are then fused by a concatenation layer and passed in sequence through a fully connected layer and an output layer to obtain the output result, which together form the ME-CNN network. Three convolution modules are used to obtain different motion characteristics of the target because a single convolution module can hardly capture the characteristics of the whole training set, and a deep network would suffer from vanishing gradients; the invention therefore widens the network and extracts features of the training set under different conditions at multiple scales, which reduces network complexity and speeds the network up. Because the video of the invention keeps drifting and some regions are zoomed owing to differing terrain heights, methods such as image registration with a frame-difference method or background modelling cannot be used on this video, and the ME-CNN network is used instead to obtain the target motion trajectory.
(3) Calculating the loss function of the network ME-CNN by using the minimum target motion parameter: the method comprises the steps of calculating the motion trend of a target according to the motion rule of the target, using the motion trend as a training label corresponding to the target, and calculating the Euclidean space distance between the training label and the prediction result of the ME-CNN network to be used as a loss function for optimizing the ME-CNN network.
(4) Judging whether the training set is an initial training set: judge whether the current training set is the initial training set; if not, execute step (5) to update the training labels in the loss function, which then take part in the network training. Otherwise, if the current training set is the initial training set, execute step (6) and enter the loop training of the network.
(5) Updating training labels in the loss function: because the training set D is continuously updated in the subsequent step (8), the training labels in the loss function need to be continuously adjusted according to the updated training set D during training. When the current training set is not the initial training set, the training labels of the loss function are recalculated from the data of the current training set, computed from the minimum target motion parameters in the same way as in step (3); the recalculated training labels take part in training the motion estimation network ME-CNN, and step (6) follows.
(6) Obtaining an initial model M1 for predicting the movement position of the target: and inputting the training set D into the object motion estimation network ME-CNN, training the network according to the current loss function, and obtaining an initial model M1 for predicting the motion position of the object.
(7) Correcting the position result of the prediction model: calculate the auxiliary position offset of the target, and correct the position result predicted by the motion estimation network ME-CNN with the offset.
(7a) Obtaining a target gray level image block: obtain the target position (Px, Py) of the next frame from the initial model M1 for predicting the target motion position, take the gray image block of the target out of the image of the next frame based on the obtained target position (Px, Py), and normalize it to obtain the normalized target gray image block. Because the target is extremely small and its contrast with the surrounding environment is extremely low, judging the offset with a neural network works poorly on such image blocks, so it is better to first take a smaller target box and then judge the offset inside that box.
(7b) Obtaining a target position offset: grade the brightness of the normalized target gray image block so that the target and the road appear at different brightness levels; because the contrast between the target and the surrounding road environment is extremely low, determine the position of the target in the image block with a vertical projection method, and calculate the distance between the centre position of the target and the centre position of the image block to obtain the target position offset.
(7c) Obtaining a corrected target position: and correcting the position of the predicted target by the motion estimation network ME-CNN by using the obtained target position offset to obtain all the corrected positions of the target, including the position of the upper left corner of the target.
(8) And updating the training data set by using the corrected target position to complete target tracking of one frame: and adding the obtained position of the upper left corner of the target into the last line of the training set D, removing the first line of the training set D, performing one-time operation to obtain a corrected and updated training set D, completing the training of one frame, and obtaining the target position result of one frame.
(9) Judging whether the current video frame number is less than the total number of video frames: if so, repeat steps (4) to (9) in a loop, updating the model parameters again to improve the model's adaptability and continuing the tracking optimization training of the target until all video frames have been traversed; otherwise, if the current frame number equals the total number of video frames, end the training and execute step (10).
(10) Obtaining a remote sensing video target tracking result: and after the training is finished, the accumulated target position output is the remote sensing video target tracking result.
The ME-CNN adopted by the invention does not need the traditional procedure of image registration followed by a frame-difference method, nor complex image background modelling, to obtain the target motion trajectory; the new algorithm proposed by the invention can effectively extract the target motion characteristics by analysing, through a neural network, the training set formed by the target positions of the first F frames. Because problems such as vanishing gradients appear when the network is too deep, a multi-scale ME-CNN network is adopted to predict the motion trend of the target; since no target position labels need to be marked manually in subsequent video frames, the network can train itself in a self-looping manner, which greatly reduces the complexity of the tracking algorithm, improves its practicability, and allows the target position to be found quickly and accurately through the target motion estimation network without image registration. The ME-CNN network is combined with the auxiliary position offset method to determine the target position in the remote sensing video automatically; the motion speed of the target is obtained from its motion, its likely motion trend is analysed, and the loss function of the motion estimation network is modified accordingly, which improves the robustness of target tracking.
The method uses deep learning to analyse the motion of the extremely blurred target, predicts its next direction of motion, and corrects the motion estimation network with the position offset; it can track the target without labels for subsequent frames, thereby avoiding the problems of large-scene image registration during tracking and of extracting features from extremely blurred targets, noticeably improving the accuracy of target tracking in extremely blurred video, and it is also suitable for tracking in various other remote sensing videos.
Example 2
The method for tracking a large-scene minimum target remote sensing video based on a motion estimation ME-CNN network is the same as that in embodiment 1, and the method for constructing the network ME-CNN for estimating the minimum target motion described in step (2) comprises the following steps as shown in FIG. 2:
(2a) Overall structure of the motion estimation network: the motion estimation network ME-CNN of the invention comprises three convolution modules connected in parallel; in the network constructed for estimating the motion of the minimum target, a concatenation layer fuses the different motion features extracted by these modules, a fully connected layer refines and analyses them, and the result is produced by the output layer.
(2b) Structure of three convolution modules in parallel: the convolution modules in parallel are convolution module I, convolution module I and convolution module I respectively, wherein
The convolution module I comprises a locally connected Locallyconnected1D convolution layer, and the step length is 2 to extract the coordinate position information of the target;
the convolution module I comprises cavity convolution, and the step length is 1;
the convolution module I comprises one-dimensional convolution with the step length of 2;
the convolution modules I, II and III obtain position features of the target at different scales, yielding three output tensors; the outputs of the three convolution modules are then concatenated to obtain a fused convolution result, which is fed into the fully connected layer and the output layer to obtain the final prediction result, as sketched below. Three convolution modules are used to obtain different motion characteristics of the target because a single convolution module can hardly capture the characteristics of the whole training set, and a deep network would suffer from vanishing gradients; widening the network instead extracts features of the training set under different conditions at multiple scales, reduces network complexity and speeds the network up. Because the video of the invention keeps drifting and some regions are zoomed owing to differing terrain heights, methods such as image registration with a frame-difference method or background modelling cannot be used on this video, and the ME-CNN network is used to obtain the target motion trajectory.
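A minimal Keras sketch of the ME-CNN structure described above, assuming the training set D is presented as an F×2 sequence of (x, y) coordinates; the filter counts, kernel sizes, dilation rate and the choice of F = 10 are illustrative assumptions (the patent fixes only the layer types and the strides 2, 1 and 2):

```python
# ME-CNN sketch: three parallel convolution branches, concatenation,
# a fully connected layer and a 2-unit output for the predicted position.
from keras.layers import (Input, Conv1D, LocallyConnected1D,
                          Concatenate, Flatten, Dense)
from keras.models import Model
from keras import backend as K

F = 10                                            # assumed number of history frames
inp = Input(shape=(F, 2))                         # F rows of (x, y) coordinates

b1 = LocallyConnected1D(16, 3, strides=2, activation='relu')(inp)       # module I
b2 = Conv1D(16, 3, strides=1, dilation_rate=2, padding='same',
            activation='relu')(inp)                                     # module II (dilated)
b3 = Conv1D(16, 3, strides=2, padding='same', activation='relu')(inp)   # module III

merged = Concatenate()([Flatten()(b1), Flatten()(b2), Flatten()(b3)])   # fusion layer
out = Dense(2, activation='linear')(Dense(64, activation='relu')(merged))

def euclidean_loss(y_true, y_pred):
    # Euclidean-distance loss of step (3) between label and prediction
    return K.sqrt(K.sum(K.square(y_true - y_pred), axis=-1))

me_cnn = Model(inp, out)
me_cnn.compile(optimizer='adam', loss=euclidean_loss)
```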
Example 3
The method for tracking the large-scene minimum target remote sensing video based on the motion estimation ME-CNN network is the same as in embodiments 1-2. Calculating the loss function of the network ME-CNN with the minimum target motion parameters in step (3) roughly analyses the motion of the target by processing the data of the training set D and gives a guiding direction to the optimization of the motion estimation network ME-CNN; it comprises the following steps:
(3a) Acquiring the target displacement of the training set D: take out the data in rows F, F-2 and F-4 of the training set D and subtract the data of the first row of D from each, obtaining in turn the target displacements S1, S2, S3 between the F-th, (F-2)-th and (F-4)-th frames and the first frame. S1 is the target displacement between the F-th frame and the first frame, S2 is the target displacement between the (F-2)-th frame and the first frame, and S3 is the target displacement between the (F-4)-th frame and the first frame. If the training set is not the initial training set but a training set D that has already been updated i times, the frame number corresponding to each row of the training set changes accordingly to frames 1+i, 2+i, ..., F+i; rows F, F-2 and F-4 of D are then taken out and the first row of D is subtracted from each, giving in turn the displacements S1, S2, S3 of the target in frames F+i, F+i-2 and F+i-4 relative to the first row's frame.
(3b) Obtaining the motion trend of the target:
according to the motion rule of the target, the obtained target displacements are used to calculate the motion trend (Gx, Gy) of the target in the x and y directions of the image coordinate system by the following formulas;
V1=(S1-S2)/2
V2=(S2-S3)/2
a=(V1-V2)/2
G=V1+a/2
The invention uses an image coordinate system whose origin is the upper-left corner of the image, with the x direction horizontal to the right and the y direction vertical downwards. In the above formulas, V1 is the motion speed of the target between displacements S1 and S2, V2 is the motion speed of the target between displacements S2 and S3, a is the motion acceleration, and G is the motion trend of the target.
(3c) Constructing a loss function of a motion estimation network ME-CNN:
calculate the motion trend of the target from its motion rule and use it as the training label corresponding to the target; the Euclidean spatial distance between the calculated target motion trend (Gx, Gy) and the predicted position (Px, Py) output by the motion estimation network ME-CNN is constructed as the loss function of the motion estimation network ME-CNN:
Loss = √((Gx − Px)² + (Gy − Py)²)
where Gx is the motion trend of the target in the x direction of the image coordinate system, Gy is the motion trend of the target in the y direction of the image coordinate system, Px is the prediction result of the motion estimation network in the x direction of the image coordinate system, and Py is the prediction result of the motion estimation network in the y direction of the image coordinate system. A sketch of this label and loss computation is given below.
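A small NumPy sketch of the label and loss computation above, assuming D is the current F×2 training set of top-left coordinates (row selection follows the F, F-2, F-4 rule of step (3a)); the helper names are assumptions for illustration:

```python
import numpy as np

def motion_trend(D):
    """Training label (Gx, Gy) computed from the current training set D."""
    F = D.shape[0]
    S1 = D[F - 1] - D[0]          # displacement between frame F and the first frame
    S2 = D[F - 3] - D[0]          # displacement between frame F-2 and the first frame
    S3 = D[F - 5] - D[0]          # displacement between frame F-4 and the first frame
    V1 = (S1 - S2) / 2.0          # motion speed between displacements S1 and S2
    V2 = (S2 - S3) / 2.0          # motion speed between displacements S2 and S3
    a = (V1 - V2) / 2.0           # motion acceleration
    return V1 + a / 2.0           # motion trend G = (Gx, Gy)

def me_cnn_loss(G, P):
    """Euclidean distance between the trend label G and the prediction P."""
    return float(np.sqrt((G[0] - P[0]) ** 2 + (G[1] - P[1]) ** 2))
```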
A comprehensive example is given below to further illustrate the invention.
Example 4
The method for tracking the remote sensing video of the large-scene tiny target based on the motion estimation ME-CNN network is the same as the embodiment 1-3,
referring to fig. 1, a large-scene minimal target remote sensing video tracking method based on a motion estimation ME-CNN network includes the following steps:
(1) obtaining an initial training set D of a minimum target motion estimation network ME-CNN:
Take the first F frames of the original remote sensing data video A, mark a bounding box for the target continuously in each frame, and stack the top-left vertex coordinates of the bounding boxes together to form the training set D, a matrix of F rows and 2 columns in which each row holds the target coordinates of one video frame (see the sketch below). The position of the target may be represented either by the top-left vertex coordinates or by the centre coordinates without affecting the analysis of its motion; the minimum target is simply called the target in the invention.
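An illustrative sketch, under the assumptions above, of assembling the initial training set D from the F annotated bounding boxes (the variable names are hypothetical):

```python
import numpy as np

def build_training_set(boxes):
    """boxes: list of (x, y, w, h) for the same target in frames 1..F,
    with (x, y) the top-left vertex; returns D of shape (F, 2)."""
    return np.array([[x, y] for (x, y, _w, _h) in boxes], dtype=np.float32)

# e.g. D = build_training_set(annotated_boxes)   # annotated_boxes: F hand-marked boxes
```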
(2) Constructing a network ME-CNN for estimating the movement of the minimum target: the network comprises three parallel convolution modules that extract different features of the training data so as to obtain different motion characteristics of the target. A single convolution layer can hardly capture the characteristics of the whole training set, and a deep network would suffer from vanishing gradients, so the network is widened instead and features of the training set under different conditions are extracted at multiple scales, which reduces network complexity and speeds the network up. A concatenation layer is then stacked to fuse the extracted motion features, followed by a fully connected analysis layer and an output layer that produces the result.
(2a) Overall structure of the motion estimation network: the motion estimation network ME-CNN comprises three convolution modules connected in parallel, and a connection layer, a full connection layer and an output layer are sequentially stacked;
(2b) Structure of the three parallel convolution modules: the parallel convolution modules are convolution module I, convolution module II and convolution module III respectively, wherein
the convolution module I comprises a locally connected LocallyConnected1D convolution layer with a stride of 2, which extracts the coordinate position information of the target;
the convolution module II comprises a dilated (hole) convolution with a stride of 1;
the convolution module III comprises a one-dimensional convolution with a stride of 2;
the convolution modules I, II and III obtain position features of the target at different scales, yielding three output tensors; the outputs of the three convolution modules are then concatenated to obtain a fused convolution result, which is fed into the fully connected layer and the output layer to obtain the final prediction result.
(3) Constructing the loss function of the minimum target motion estimation network ME-CNN: calculate the motion trend of the target according to its motion rule, take the motion trend as the training label corresponding to the target, and calculate the Euclidean spatial distance between it and the prediction result of the ME-CNN network as the loss function of the ME-CNN network;
(3a) Acquiring the target displacement of the training set D: if the training set is the initial training set, take out the data in rows F, F-2 and F-4 of the training set D and subtract the data of the first row of D from each, obtaining in turn the target displacements S1, S2, S3 between the F-th, (F-2)-th and (F-4)-th frames and the first frame. S1 is the target displacement between the F-th frame and the first frame, S2 is the target displacement between the (F-2)-th frame and the first frame, and S3 is the target displacement between the (F-4)-th frame and the first frame. If the training set is not the initial training set but a training set D that has already been updated i times, the frame number corresponding to each row changes accordingly to frames 1+i, 2+i, ..., F+i; rows F, F-2 and F-4 of D are then taken out and the first row of D is subtracted from each, giving in turn the displacements S1, S2, S3 of the target in frames F+i, F+i-2 and F+i-4 relative to the first row's frame.
(3b) Obtaining the motion trend of the target:
according to the motion rule of the target, the obtained training-data target displacements are used to calculate the motion trend (Gx, Gy) of the target in the x and y directions of the image coordinate system by the following formulas.
V1=(S1-S2)/2
V2=(S2-S3)/2
a=(V1-V2)/2
G=V1+a/2
(3c) Constructing a loss function of a motion estimation network ME-CNN:
the calculated target motion trend (Gx, Gy) and the predicted position (Px, Py) output by the estimation network are used to construct the Euclidean spatial distance between them as the loss function of the motion estimation network ME-CNN:
Loss = √((Gx − Px)² + (Gy − Py)²)
(4) Updating training labels in the loss function: because the training set D is continuously updated in the subsequent step (7), the training labels in the loss function need to be continuously adjusted according to the updated training set D in the training process, and participate in the ME-CNN training of the motion estimation network.
(5) Obtaining an initial model M1 for predicting the movement position of the target: and inputting the training set D into the object motion estimation network ME-CNN, training the network according to the loss function, and obtaining an initial model M1 for predicting the motion position of the object.
(6) Correcting the position result of the prediction model: calculate the auxiliary position offset of the target, and correct the position result predicted by the motion estimation network ME-CNN with the offset.
(6a) Obtaining a target gray level image block: obtain the target position (Px, Py) of the next frame from the initial model M1 for predicting the target motion position, take the gray image block of the target out of the image of the next frame based on the obtained position (Px, Py), and normalize it to obtain the normalized target gray image block. Because the target is extremely small and its contrast with the surrounding environment is extremely low, judging the offset with a neural network works poorly on such image blocks, so it is better to first take a smaller target box and then judge the offset inside that box.
(6b) Obtaining a target position offset: grade the brightness of the normalized target gray image block, determine the position of the target in the image block with a vertical projection method, and calculate the distance between the centre position of the target and the centre position of the image block to obtain the target position offset; a sketch of this step is given below.
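A hedged sketch of this auxiliary offset step, assuming the gray image block is a 2-D NumPy array; the number of brightness levels and the assumption that the target falls in the brightest level are illustrative choices, not fixed by the patent:

```python
import numpy as np

def position_offset(patch, levels=4):
    """Offset (dx, dy) of the target centre from the centre of the image block."""
    patch = (patch - patch.min()) / (patch.max() - patch.min() + 1e-8)  # normalize
    graded = np.floor(patch * levels).clip(0, levels - 1)               # brightness grading
    mask = graded == levels - 1                  # assume the target is the brightest level
    col_proj = mask.sum(axis=0)                  # vertical (column-wise) projection
    row_proj = mask.sum(axis=1)
    cx, cy = float(np.argmax(col_proj)), float(np.argmax(row_proj))     # target centre
    h, w = patch.shape
    return cx - (w - 1) / 2.0, cy - (h - 1) / 2.0

# corrected position: (Px + dx, Py + dy)
```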
(6c) Obtaining a corrected target position: and correcting the position of the predicted target by the motion estimation network ME-CNN by using the obtained target position offset to obtain all the corrected positions of the target, including the position of the upper left corner of the target.
(7) And updating the training data set by using the corrected target position to complete target tracking of one frame: and adding the obtained position of the upper left corner of the target into the last line of the training set D, removing the first line of the training set D, performing one-time operation to obtain a corrected and updated training set, completing the training of one frame, and obtaining the target position result of one frame.
(8) Obtaining a remote sensing video target tracking result: repeat steps (4) to (7) in a loop, continuously recomputing the training labels from the updated training set by the method of step (3), updating the network model and iterating, so that the tracking of the target is progressively refined until all video frames have been traversed; the accumulated output is the remote sensing video target tracking result. The overall self-looping procedure is sketched below.
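A compact sketch of this self-looping procedure (steps (4)-(8)), reusing the motion_trend and position_offset helpers and the compiled me_cnn model from the earlier sketches; crop_patch is a hypothetical helper that cuts a gray image block around the predicted position:

```python
import numpy as np

def track(me_cnn, D, frames, crop_patch, epochs=50):
    positions = []
    for frame in frames:                           # remaining video frames
        G = motion_trend(D)                        # training label from step (3)
        x = D[np.newaxis, ...]                     # shape (1, F, 2)
        me_cnn.fit(x, G[np.newaxis, :], epochs=epochs, verbose=0)
        Px, Py = me_cnn.predict(x)[0]              # predicted position for this frame
        dx, dy = position_offset(crop_patch(frame, Px, Py))   # auxiliary correction
        pos = np.array([Px + dx, Py + dy])
        D = np.vstack([D[1:], pos])                # slide the training window by one frame
        positions.append(pos)
    return np.asarray(positions)                   # accumulated tracking result
```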
In this embodiment, the motion estimation model of the target can also extract road information from the target's motion in the preceding frames, locate on a map the city at the same longitude and latitude, and predict the target's motion by matching it to the corresponding road conditions; making full use of the three-dimensional road information, the target can be tracked accurately even when the road height changes sharply and parts of the video are zoomed. The auxiliary position offset of the target could also be obtained by training a neural network, but the target and its surroundings would first need to be pre-processed into image blocks of higher contrast before such a network could be trained.
The technical effects of the invention are further explained by combining simulation tests as follows:
example 5
The method for tracking the remote sensing video of the large-scene tiny target based on the motion estimation ME-CNN network is the same as the embodiment 1-4,
simulation conditions and contents:
the simulation platform of the invention is as follows: intel Xeon CPU E5-2630v3CPU with a main frequency of 2.40GHz, 64GB running memory, Ubuntu16.04 operating system and Keras and Python software platforms. A display card: GeForce GTX TITAN X/PCIe/SSE2 × 2.
The invention uses a remote sensing video of the Derna area of Libya shot by the Jilin-1 video satellite; a vehicle in the first 10 frames is taken as the target, a box is marked around the target in each image, and the positions of the boxes' top-left vertices form the training set DateSet. The target video is then tracked and simulated with the method of the invention and with the existing KCF-based target tracking method respectively.
Simulation content and results:
The comparison method, the existing KCF-based target tracking method, and the method of the invention were both run under the above simulation conditions to track the vehicle target in the remote sensing video of the Derna area of Libya. A comparison of the target trajectory predicted by the ME-CNN network (green curve) with the accurate target trajectory (red curve) is shown in FIG. 3, and the results listed in Table 1 were obtained.
TABLE 1 Remote sensing video target tracking results, Derna area of Libya

  Method     Precision    IOU
  KCF        63.21%       58.72%
  ME-CNN     85.63%       76.51%
And (3) simulation result analysis:
In Table 1, Precision denotes the area overlap rate between the target position predicted by the ME-CNN network and the label position, and IOU denotes the percentage of frames in which the average Euclidean distance between the centre of the bounding box and the centre of the label is smaller than a given threshold (chosen as 5 in this example); KCF denotes the comparison method and ME-CNN denotes the method of the invention. Both scores can be computed as sketched below.
Comparing the data in Table 1 shows that the invention greatly improves tracking accuracy: Precision rises from 63.21% to 85.63%, and the percentage IOU of frames whose centre-to-label Euclidean distance is below the given threshold rises from 58.72% with the KCF-based comparison method to 76.51%.
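A short sketch of the two scores as they are defined in the text above (an area-overlap rate and a thresholded centre-distance rate); boxes are assumed to be given as (x, y, w, h) with (x, y) the top-left vertex:

```python
import numpy as np

def box_overlap(a, b):
    """Area overlap rate (intersection over union) of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def centre_hit_rate(pred_centres, gt_centres, thresh=5.0):
    """Fraction of frames whose centre-to-centre distance is below thresh."""
    d = np.linalg.norm(np.asarray(pred_centres) - np.asarray(gt_centres), axis=1)
    return float((d < thresh).mean())
```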
Referring to FIG. 3, the red curve in FIG. 3 is the standard target trajectory curve and the green curve is the tracking prediction curve of the method for the same target, with the extremely small target in the large scene shown in the green box; comparing the two curves shows that they are highly consistent and essentially coincide, which demonstrates the high tracking accuracy of the method.
In short, the large-scene minimum target remote sensing video tracking method based on the motion estimation ME-CNN network proposed by the invention improves tracking accuracy under conditions where the shooting satellite keeps moving, the video undergoes overall translation and partial scaling, the video resolution is extremely low and the target is extremely small, and it solves the problem of tracking extremely small targets from motion parameters without registration. The implementation steps are: obtain the initial training set D of the minimum target motion estimation network ME-CNN; construct the network ME-CNN for estimating the motion of the minimum target; calculate the loss function of the network ME-CNN from the minimum target motion parameters; judge whether the training set is the initial training set; update the training labels in the loss function; obtain the initial model M1 for predicting the motion position of the target; correct the position result of the prediction model; update the training data set with the corrected target position to complete target tracking of one frame; judge whether the current video frame number is less than the total number of video frames; and obtain the remote sensing video target tracking result. The invention uses the deep learning network ME-CNN to predict the target motion position, avoids the problems of large-scene image registration and of extracting features from extremely blurred targets that existing methods face during tracking, reduces the dependence on target features, noticeably improves the accuracy of target tracking in extremely blurred video, and is also suitable for tracking in various other remote sensing videos.

Claims (3)

1. A large-scene minimum target tracking method based on a motion estimation ME-CNN network is characterized by comprising the following steps:
(1) obtaining an initial training set D of a minimum target motion estimation network ME-CNN:
taking the front F frame images of the original remote sensing data video A, continuously marking a boundary box for the same target of each image, and arranging the vertex coordinates of the upper left corner of each boundary box together according to the video frame number sequence to be used as a training set D;
(2) constructing a network ME-CNN for estimating the movement of the minimum target: the system comprises three parallel convolution modules for extracting different characteristics of training data, and a connecting layer, a full connecting layer and an output layer are sequentially stacked;
(3) calculating the loss function of the network ME-CNN by using the minimum target motion parameter: calculating to obtain the motion trend of the target according to the motion rule of the target, taking the motion trend as a training label corresponding to the target, and calculating the Euclidean spatial distance between the training label and the prediction result of the ME-CNN network as a loss function of the ME-CNN network optimization training;
(4) judging whether the training set is an initial training set: judging whether the current training set is an initial training set, if not, executing the step (5) and updating the training labels in the loss function; otherwise, if the training set is the initial training set, executing the step (6) and entering the circular training of the network;
(5) updating training labels in the loss function: when the current training set is not the initial training set, recalculating the training labels of the loss function by using the data of the current training set, calculating the training labels by using the minimum target motion parameters by using the calculation method, wherein the method is the same as the method in the step (3), the recalculated training labels participate in the ME-CNN training of the motion estimation network, and entering the step (6);
(6) obtaining an initial model M1 for predicting the movement position of the target: inputting the training set D into a target motion estimation network ME-CNN, training the network according to the current loss function, and obtaining an initial model M1 for predicting the motion position of the target;
(7) position result of the corrected prediction model: calculating the auxiliary position offset of the target, and correcting the position result predicted by the motion estimation network ME-CNN by using the offset;
(7a) obtaining a target gray level image block: obtaining the target position (Px, Py) of the next frame according to the initial model M1 for predicting the target motion position, taking out a gray image block of the target from the image of the next frame based on the obtained target position (Px, Py), and normalizing to obtain a normalized target gray image block;
(7b) obtaining a target position offset: carrying out brightness grading on the normalized target gray image block, determining the position of a target in the image block by using a vertical projection method, and calculating the distance between the center position of the target and the center position of the image block to obtain the offset of the target position;
(7c) obtaining a corrected target position: correcting the position of the predicted target by the motion estimation network ME-CNN by using the obtained target position offset to obtain all positions of the corrected target;
(8) and updating the training data set by using the corrected target position to complete target tracking of one frame: adding the obtained upper left corner position of the target into the last line of the training set D, removing the first line of the training set D, performing one-time operation to obtain a corrected and updated training set D, completing the training of one frame, and obtaining the target position result of one frame;
(9) judging whether the current video frame number is less than the total video frame number: if the number of the video frames is less than the total video frame number, repeating the steps (4) to (9) in a circulating way, performing tracking optimization training on the target until all the video frames are traversed, if the number of the video frames is equal to the total video frame number, finishing the training, and executing the step (10);
(10) obtaining a remote sensing video target tracking result: and the accumulated output is the remote sensing video target tracking result.
2. The large-scene extremely small target tracking method based on the motion estimation ME-CNN network as claimed in claim 1, wherein: the construction of the network ME-CNN for estimating the movement of the minimum target in the step (2) comprises the following steps:
(2a) overall structure of the motion estimation network: the motion estimation network ME-CNN comprises three convolution modules connected in parallel, and a connection layer, a full connection layer and an output layer are sequentially stacked;
(2b) structure of the three parallel convolution modules: the parallel convolution modules are convolution module I, convolution module II and convolution module III respectively, wherein
the convolution module I comprises a locally connected LocallyConnected1D convolution layer with a stride of 2, which extracts the coordinate position information of the target;
the convolution module II comprises a dilated (hole) convolution with a stride of 1;
the convolution module III comprises a one-dimensional convolution with a stride of 2;
the convolution modules I, II and III obtain position features of the target at different scales, yielding three output tensors; the outputs of the three convolution modules are then concatenated to obtain a fused convolution result, which is fed into the fully connected layer and the output layer to obtain the final prediction result.
3. The large-scene extremely small target tracking method based on the motion estimation ME-CNN network as claimed in claim 1, wherein: the step 3 of calculating the loss function of the network ME-CNN by using the minimum target motion parameter comprises the following steps:
(3a) acquiring the target displacement of a training set D: taking out the data of row F, row F-2 and row F-4 of the training set D, and subtracting the data of the first row of the training set D from each, to obtain in sequence the target displacements S1, S2, S3 between the F-th frame, the (F-2)-th frame and the (F-4)-th frame and the first frame;
(3b) Obtaining the motion trend of the target:
according to the motion rule of the target, the obtained target displacements are used to calculate the motion trend (Gx, Gy) of the target in the x and y directions of the image coordinate system according to the following formulas;
V1=(S1-S2)/2
V2=(S2-S3)/2
a=(V1-V2)/2
G=V1+a/2
In the formulas, V1 is the motion speed of the target between displacements S1 and S2, V2 is the motion speed of the target between displacements S2 and S3, a is the motion acceleration, and G is the motion trend of the target;
(3c) constructing a loss function of a motion estimation network ME-CNN:
calculating the motion trend of the target according to its motion rule, using it as the training label corresponding to the target, and constructing the Euclidean spatial distance between the calculated target motion trend (Gx, Gy) and the predicted position (Px, Py) output by the motion estimation network ME-CNN as the loss function of the motion estimation network ME-CNN;
Loss = √((Gx − Px)² + (Gy − Py)²)
in the formula, Gx is the motion trend of the target in the x direction of the image coordinate system, Gy is the motion trend of the target in the y direction of the image coordinate system, Px is the prediction result of the motion estimation network in the x direction of the image coordinate system, and Py is the prediction result of the motion estimation network in the y direction of the image coordinate system.
CN201910718847.6A 2019-08-05 2019-08-05 Large-scene minimum target tracking based on motion estimation ME-CNN network Active CN110517285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910718847.6A CN110517285B (en) 2019-08-05 2019-08-05 Large-scene minimum target tracking based on motion estimation ME-CNN network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910718847.6A CN110517285B (en) 2019-08-05 2019-08-05 Large-scene minimum target tracking based on motion estimation ME-CNN network

Publications (2)

Publication Number Publication Date
CN110517285A CN110517285A (en) 2019-11-29
CN110517285B true CN110517285B (en) 2021-09-10

Family

ID=68624473

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910718847.6A Active CN110517285B (en) 2019-08-05 2019-08-05 Large-scene minimum target tracking based on motion estimation ME-CNN network

Country Status (1)

Country Link
CN (1) CN110517285B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986233B (en) * 2020-08-20 2023-02-10 西安电子科技大学 Large-scene minimum target remote sensing video tracking method based on feature self-learning
CN114066937B (en) * 2021-11-06 2022-09-02 中国电子科技集团公司第五十四研究所 Multi-target tracking method for large-scale remote sensing image
CN115086718A (en) * 2022-07-19 2022-09-20 广州万协通信息技术有限公司 Video stream encryption method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886120A (en) * 2017-11-03 2018-04-06 北京清瑞维航技术发展有限公司 Method and apparatus for target detection tracking
CN108154522A (en) * 2016-12-05 2018-06-12 北京深鉴科技有限公司 Target tracking system
US10176388B1 (en) * 2016-11-14 2019-01-08 Zoox, Inc. Spatial and temporal information for semantic segmentation
CN109242884A (en) * 2018-08-14 2019-01-18 西安电子科技大学 Remote sensing video target tracking method based on JCFNet network
CN109376736A (en) * 2018-09-03 2019-02-22 浙江工商大学 A kind of small video target detection method based on depth convolutional neural networks
CN109636829A (en) * 2018-11-24 2019-04-16 华中科技大学 A kind of multi-object tracking method based on semantic information and scene information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A target tracking method based on CNN-AE feature extraction; Yin Henan, Tong Guoxiang; 《软件导刊》 (Software Guide); 2018-06-30; Vol. 17, No. 6; pp. 22-26 *

Also Published As

Publication number Publication date
CN110517285A (en) 2019-11-29

Similar Documents

Publication Publication Date Title
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN108416266B (en) Method for rapidly identifying video behaviors by extracting moving object through optical flow
CN112215128B (en) FCOS-fused R-CNN urban road environment recognition method and device
CN112750150B (en) Vehicle flow statistical method based on vehicle detection and multi-target tracking
JP6650657B2 (en) Method and system for tracking moving objects in video using fingerprints
Gao et al. A real-time defect detection method for digital signal processing of industrial inspection applications
CN110517285B (en) Large-scene minimum target tracking based on motion estimation ME-CNN network
CN110287826B (en) Video target detection method based on attention mechanism
CN104463903B (en) A kind of pedestrian image real-time detection method based on goal behavior analysis
CN111161313B (en) Multi-target tracking method and device in video stream
CN104615986B (en) The method that pedestrian detection is carried out to the video image of scene changes using multi-detector
CN109101932B (en) Multi-task and proximity information fusion deep learning method based on target detection
CN112836640A (en) Single-camera multi-target pedestrian tracking method
CN110647836B (en) Robust single-target tracking method based on deep learning
JP2014071902A5 (en)
CN111445497B (en) Target tracking and following method based on scale context regression
CN112434566B (en) Passenger flow statistics method and device, electronic equipment and storage medium
CN117949942B (en) Target tracking method and system based on fusion of radar data and video data
CN107808524A (en) A kind of intersection vehicle checking method based on unmanned plane
CN113033482A (en) Traffic sign detection method based on regional attention
CN111414938B (en) Target detection method for bubbles in plate heat exchanger
CN112949453A (en) Training method of smoke and fire detection model, smoke and fire detection method and smoke and fire detection equipment
CN109558877B (en) KCF-based offshore target tracking algorithm
CN107316030A (en) Unmanned plane is to terrain vehicle automatic detection and sorting technique
CN111986233B (en) Large-scene minimum target remote sensing video tracking method based on feature self-learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant