CN110992378A - Dynamic update visual tracking aerial photography method and system based on rotor flying robot - Google Patents

Dynamic update visual tracking aerial photography method and system based on rotor flying robot Download PDF

Info

Publication number
CN110992378A
CN110992378A (application CN201911220924.1A)
Authority
CN
China
Prior art keywords
target
image
frame
network
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911220924.1A
Other languages
Chinese (zh)
Other versions
CN110992378B (en)
Inventor
谭建豪
谭姗姗
殷旺
刘力铭
王耀南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN201911220924.1A priority Critical patent/CN110992378B/en
Publication of CN110992378A publication Critical patent/CN110992378A/en
Application granted granted Critical
Publication of CN110992378B publication Critical patent/CN110992378B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention belongs to the technical field of unmanned aerial vehicles and discloses a dynamic update visual tracking aerial photography method and system based on a rotor flying robot. HOG features combined with an SVM are used to detect the target in the picture; the AlexNet network structure is then improved by designing around three important influence factors of the twin (Siamese) network, namely the receptive field size, the total network stride and the feature padding, and a smoothing matrix and a background suppression matrix are added so that the features of the first frames are used effectively; multi-layer features are fused element-wise to learn the appearance change and background suppression of the target online, and training uses continuous video sequences. The invention uses the dynamic twin network to balance precision and real-time tracking, uses the dynamically updated network to learn the appearance change of the target quickly, makes full use of the spatio-temporal information of the target, and effectively alleviates problems such as drift and target occlusion. A deeper network is selected to obtain target features, and appearance learning and background suppression are used for dynamic tracking, which effectively increases robustness.

Description

Dynamic update visual tracking aerial photography method and system based on rotor flying robot
Technical Field
The invention belongs to the technical field of unmanned aerial vehicles, and particularly relates to a dynamic update visual tracking aerial photography method and system based on a rotor flying robot.
Background
Currently, the closest prior art is as follows. Unmanned Aerial Vehicles (UAVs) are aircraft operated by radio remote control or by on-board program control that can carry out flight missions autonomously without human intervention. In the military field, thanks to characteristics such as small size, strong maneuverability and easy control, rotor flying robots can operate in extreme environments and are therefore widely used in counter-terrorism and explosive disposal, traffic monitoring, and earthquake and disaster relief. In the civil field, unmanned aerial vehicles can be used in fields such as aerial photography and pedestrian detection. When a rotor flying robot performs a specific task, it is generally required to track a specific target in flight and transmit information about the target to a ground station in real time. Vision-based tracking flight of rotor flying robots has therefore attracted great attention and is a current research hotspot.
Tracking flight of a rotor flying robot means that a camera is carried on a robot flying at low altitude, a sequence of image frames of a moving ground target is acquired in real time, and the image coordinates of the target are calculated and used as the input of visual servo control; from this the required aircraft velocity is obtained, the position and attitude of the rotor flying robot are controlled automatically, and the tracked ground target is kept near the centre of the camera's field of view. The traditional twin network tracking method has good real-time performance, but when the target is occluded or affected by a complex background or illumination, using only the first frame as the standard reference may cause the target to be tracked incorrectly or lost. The invention addresses the loss of the target caused by occlusion, changes of target appearance, tracker drift, interference from background factors and similar influences during aerial photography with a rotor flying robot.
In summary, the problems of the prior art are as follows: (1) during aerial photography, existing rotor flying robots are prone to drift, target loss and similar situations caused by occlusion, illumination and interference from background factors.
(2) In the prior art, trackers mostly use the AlexNet network to extract features; adopting the deeper CIResNet network allows deeper features related to the target to be extracted, so that the tracker can lock onto the target in the search area and the influence of a complex background is reduced.
(3) Although existing twin network trackers run at a high frame rate, the absence of an update part in their framework means that the tracker cannot quickly cope with drastic changes of the target or background, which may lead to tracking drift in some cases.
The difficulty of solving these technical problems is as follows: when the appearance of the target changes drastically during tracking, methods that locate the target in the search area using colour and contour features may fail.
If every frame is re-detected during tracking, or a threshold is used to judge whether tracking has been lost, the running time increases.
Using the CIResNet network for feature extraction yields more feature information, but because CIResNet is deeper than AlexNet the tracker frame rate drops slightly.
The significance of solving these technical problems is as follows: extracting features with a deeper network can improve the tracking precision and the overall performance of the tracker.
The dynamic update part improves the robustness of the tracker: instead of learning only the feature information of the first frame, the tracker continuously learns from the tracking result of the previous frame and thus adapts to changes of the target.
The CIResNet network can effectively extract more sample features, so the tracker learns more feature information about the target and its ability to cope with complex backgrounds increases.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a dynamic update visual tracking aerial photography method and system based on a rotor flying robot.
The invention is realized as follows: a dynamic update visual tracking aerial photography method based on a rotor flying robot comprises the following steps:
step one, performing target detection on the input image with the HOG feature extraction algorithm and the support vector machine algorithm SVM;
step two, transmitting the target frame information obtained by target detection to the visual tracking part, and tracking the target in real time with a dynamically updated twin network based on the CIResNet network.
Further, in step one, the target detection method comprises:
(1) dividing the image into connected regions of 8 × 8 pixels, called cell units;
(2) collecting the gradient magnitude and gradient direction of each pixel in a cell unit, dividing the gradient directions in [-90°, 90°] evenly into 9 intervals (bins), and using the gradient magnitude as the weight;
(3) performing histogram statistics of the gradient magnitudes of the pixels in the cell over the direction bins to obtain a one-dimensional gradient direction histogram;
(4) performing contrast normalization of the histograms over spatial blocks;
(5) extracting HOG descriptors through a detection window, and combining the HOG descriptors of all blocks in the detection window into the final feature vector;
(6) inputting the feature vector into a linear SVM and performing target detection with the SVM classifier;
(7) dividing the detection window into overlapping blocks, computing HOG descriptors for these blocks, and feeding the resulting feature vectors into the linear SVM for target/non-target binary classification;
(8) scanning the detection window over all positions and scales of the whole image, and applying non-maximum suppression to the output pyramid to detect the target;
the contrast normalization of the histograms in step (4) is performed as follows:
the density of each histogram within the block is first calculated, and each cell unit in the block is then normalized according to this density. A minimal sketch of this detection pipeline is given below.
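A minimal sketch of the HOG + linear-SVM detection stage described above, using OpenCV's built-in HOGDescriptor (whose defaults already use 8 × 8 cells, 16 × 16 blocks and 9 orientation bins). The pretrained people detector stands in for the application-specific SVM classifier, which is an assumption made purely for illustration.

```python
import cv2

def detect_targets(image_bgr):
    hog = cv2.HOGDescriptor()  # defaults: 8x8 cells, 16x16 blocks, 9 bins
    # Load a linear SVM; the stock people detector is used here as a stand-in
    # for the SVM trained on the actual target class.
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
    # Sliding-window scan over positions and scales; overlapping detections are
    # grouped, which plays the role of the non-maximum suppression step.
    rects, weights = hog.detectMultiScale(image_bgr,
                                          winStride=(8, 8),
                                          padding=(8, 8),
                                          scale=1.05)
    return rects, weights
```

The highest-scoring rectangle can then be handed to the tracking part as the initial target frame.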
Further, in step one, the HOG feature extraction method specifically comprises:
① normalizing the whole image: the colour space of the input image is normalized with the Gamma correction method, where the Gamma correction formula is:
f(I) = I^γ
where I is the image pixel value and γ is the Gamma correction coefficient;
② computing the gradients along the horizontal and vertical coordinates of the image, and computing the gradient direction value of each pixel position;
Gx(x, y) = H(x+1, y) - H(x-1, y);
Gy(x, y) = H(x, y+1) - H(x, y-1);
where Gx(x, y) and Gy(x, y) denote the horizontal and vertical gradients at pixel (x, y) of the input image;
G(x, y) = √(Gx(x, y)² + Gy(x, y)²);
α(x, y) = arctan(Gy(x, y) / Gx(x, y));
where G(x, y), H(x, y) and α(x, y) denote the gradient magnitude, the pixel value and the gradient direction at pixel (x, y);
③ histogram computation: the image is divided into small cell units, providing an encoding of the local image region;
④ the cell units are grouped into large blocks, and the gradient histograms are normalized within each block;
⑤ the HOG features of all overlapping blocks in the detection window are collected and combined into the final feature vector for classification.
Further, the step of tracking the target in real time comprises:
(1) obtaining the first frame of the video sequence as the template frame O1, obtaining the search area Zt from the current frame, and separately obtaining f^l(O1) and f^l(Zt) through the CIResNet-16 network;
(2) the network adds a transformation matrix V and a transformation matrix W, both of which can be computed quickly in the frequency domain by FFT; the transformation matrix V is obtained from the tracking result of frame t-1 and the first-frame target, acts on the convolution feature of the target template, and learns the change of the target so that the template feature at time t is approximately equal to the template feature at time t-1, smoothing the change of the current frame relative to the previous frames;
the transformation matrix W is obtained from the tracking result of frame t-1, acts on the convolution feature of the candidate region at time t, and learns background suppression so as to eliminate the influence of irrelevant background features in the target region;
for the transformation matrix V and the transformation matrix W, training is performed using regularized linear regression; after the transformation matrices are applied, f^l(O1) and f^l(Zt) become V_t^l ⊛ f^l(O1) and W_t^l ⊛ f^l(Zt) respectively, where ⊛ denotes the circular convolution operation, V_t^l ⊛ f^l(O1) represents the change of the target's appearance and gives the currently updated target template, and W_t^l ⊛ f^l(Zt) represents the background suppression transform and gives a more suitable current search template; the final model is:
S_t^l = corr(V_t^l ⊛ f^l(O1), W_t^l ⊛ f^l(Zt));
the final model thus adds two transformation matrices, a smoothing matrix V and a background suppression matrix W, to the twin network; the smoothing matrix V learns the appearance changes of the previous frames, and the background suppression matrix W eliminates clutter in the background.
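A minimal numpy sketch of the dynamic response described above: the smoothing matrix V is applied to the template feature f^l(O1) and the background suppression matrix W to the search feature f^l(Zt) by circular convolution (evaluated in the frequency domain), and the two results are then cross-correlated. The array shapes and function names are illustrative assumptions, not the exact patented implementation.

```python
import numpy as np
from scipy.signal import correlate2d

def circ_conv(kernel, feat):
    # circular convolution per channel via the FFT; kernel and feat are C x H x W
    return np.real(np.fft.ifft2(np.fft.fft2(kernel) * np.fft.fft2(feat)))

def dynamic_response(f_o1, f_zt, V, W):
    template = circ_conv(V, f_o1)   # V ⊛ f^l(O1): updated target template
    search = circ_conv(W, f_zt)     # W ⊛ f^l(Zt): background-suppressed search feature
    # sum the per-channel valid cross-correlations into a single response map
    return sum(correlate2d(search[c], template[c], mode='valid')
               for c in range(template.shape[0]))
```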
Further, in step two, the dynamically updated twin network based on CIResNet comprises:
(I) performing a 7 × 7 convolution together with a crop operation to delete the features affected by padding;
(II) after a max-pooling layer with stride 2, entering the improved CIResNet unit; the CIR-unit stage has 3 layers in total: the first layer is a 1 × 1 convolution with 64 channels, the second layer is a 3 × 3 convolution with 64 channels, and the third layer is a 1 × 1 convolution with 256 channels; the feature maps are added after the convolution layers and then pass through a crop operation, which follows the 3 × 3 convolution and offsets the influence of its padding of 1;
(III) entering the CIR-D unit; the CIR-D stage has 12 layers in total, with the first, second and third layers cycled 4 times as a unit block; the first layer is a 1 × 1 convolution with 128 channels, the second layer is a 3 × 3 convolution with 128 channels, and the third layer is a 1 × 1 convolution with 512 channels;
(IV) cross-correlation operation: the improved twin network structure takes an image pair as input, consisting of an exemplar image Z and a candidate search image X; Z represents the object of interest, while X represents the search area, typically larger, in a subsequent video frame; both inputs are processed by a ConvNet with parameters θ; two feature maps are produced, and their cross-correlation is:
f(Z, X) = φθ(Z) ⋆ φθ(X) + b;
where b is a bias term; the formula searches the image X exhaustively for the pattern Z so that the maximum of the response map f coincides with the target position; the network is trained offline on random image pairs (Z, X) with corresponding ground-truth labels y obtained from training videos, and the parameter θ of the ConvNet is obtained by minimizing the following loss over the training set:
θ* = arg minθ E(Z,X,y) L(y, f(Z, X; θ));
the basic formula of the loss function is:
l(y,v)=log(1+exp(-yv));
where y ∈ {+1, -1} is the true label and v is the actual score of the sample in the search image; according to the sigmoid function, the above formula gives the probability of a positive sample as
p = 1 / (1 + e^(-v))
and the probability of a negative sample as
1 - p = 1 / (1 + e^(v));
then from the cross-entropy formula it is easy to obtain -log p = log(1 + e^(-v)) for y = +1 and -log(1 - p) = log(1 + e^(v)) for y = -1, i.e. the loss l(y, v) = log(1 + exp(-yv)).
further, in step (iii), the first block of the CIR-D unit stage is down-sampled by the proposed CIR-D unit, and the number of filters is doubled after down-sampling the feature size; changing the step length of the volume in the bottleneck layer and the quick connection layer from 2 to 1 by CIR-D, and inserting and cutting again after adding operation to delete the characteristics influenced by filling; finally, performing spatial down-sampling of the feature map using maximum pooling; the spatial size of the output feature map is 7 × 7, each feature receiving information from an area of 77 × 77 pixels in size on the input image plane; performing addition operation on the feature graph passing through the convolution layer, and then entering a crop operation and a maximum pooling layer; the key idea of these modifications is to ensure that only the functions affected by the padding are deleted while keeping the inherent block structure unchanged.
Further, in step two, when the dynamically updated twin network based on CIResNet is used to track the target in real time, the dynamic update algorithm comprises:
(1) inputting a picture to obtain the template image O1;
(2) determining the candidate search area Zt in the frame to be tracked;
(3) mapping the original images into a specific feature space by feature mapping to obtain the two depth features f^l(O1) and f^l(Zt);
(4) learning the change between the tracking result of the previous frame and the first-frame template according to regularized linear regression (RLR):
V_t^l = arg min_V || V ⊛ f^l(O1) - f^l(O_{t-1}) ||² + λ_v || V ||²;
fast calculation in the frequency domain gives the variation V_t^l as:
V_t^l = IFFT( (conj(F_1^l) ⊙ F_{t-1}^l) / (conj(F_1^l) ⊙ F_1^l + λ_v) ),
where F_1^l = FFT(f_1^l), F_{t-1}^l = FFT(f_{t-1}^l), f_1^l = f^l(O1), f_{t-1}^l = f^l(O_{t-1}), ⊙ denotes element-wise multiplication and conj(·) the complex conjugate; O denotes the target, f denotes the feature matrix, the superscript denotes the l-th channel and the subscript denotes the frame index, i.e. the tracking result of the previous frame and the target of the first frame are used;
(5) obtaining the background suppression transform of the current frame from the RLR calculation formula in the frequency domain:
W_t^l = IFFT( (conj(F_G^l) ⊙ F_Ḡ^l) / (conj(F_G^l) ⊙ F_G^l + λ_w) ),
where F_G^l = FFT(f^l(G_{t-1})), F_Ḡ^l = FFT(f^l(Ḡ_{t-1})), G_{t-1} is an image region of the same size as the search area of the previous frame, and Ḡ_{t-1} is G_{t-1} multiplied by a Gaussian centred at the picture centre point; the target variation V_t^l and the background suppression transform W_t^l are thus learned online;
(6) element-wise multi-layer feature fusion:
S_t = Σ_l γ^l ⊙ S_t^l, where γ^l is an element-wise weight map for the response map S_t^l of layer l;
(7) joint training is performed; first, by forward propagation, for a given N-frame video sequence {I_t | t = 1, ..., N}, tracking is carried out to obtain the N response maps {S_t | t = 1, ..., N}, while {J_t | t = 1, ..., N} denotes the N ground-truth target maps;
L_t = || S_t - J_t ||²;
(8) gradient propagation and parameter updating are carried out with BPTT and SGD; to obtain all parameters of L_t, the gradients of the per-layer responses are computed from ∂L_t/∂S_t, and propagation through the left-hand CirConv and RLR layers ensures that the loss gradient propagates efficiently to f^l; the corresponding chain-rule expressions are evaluated in the frequency domain, where the Fourier-transformed features and the discrete Fourier transform matrix E appear, and for the multi-feature fusion formula the gradient is converted accordingly.
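A minimal numpy sketch of the regularized linear regression (RLR) update solved in the frequency domain, as used above for the smoothing matrix V (mapping f^l(O1) onto f^l(O_{t-1})) and, analogously, for the background suppression matrix W. The closed form is the standard ridge-regression solution under a circular-convolution model; the regularization weight and the exact pairing of inputs and targets are assumptions for illustration.

```python
import numpy as np

def rlr_filter(src_feat, dst_feat, lam=1e-2):
    """Return a per-channel filter V such that V circularly convolved with src_feat ≈ dst_feat."""
    F_src = np.fft.fft2(src_feat)   # features are C x H x W; FFT over the last two axes
    F_dst = np.fft.fft2(dst_feat)
    V_hat = (np.conj(F_src) * F_dst) / (np.conj(F_src) * F_src + lam)
    return np.real(np.fft.ifft2(V_hat))

# V_t^l would be rlr_filter(f_l_O1, f_l_Ot_minus_1); W_t^l would be
# rlr_filter(f_l_G_prev, f_l_G_prev_gauss) with the Gaussian-weighted region as target.
```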
The invention also aims to provide a dynamic update visual tracking aerial photography system based on a rotor flying robot, which implements the above dynamic update visual tracking aerial photography method.
The invention further aims to provide an information data processing terminal for realizing the dynamic update visual tracking aerial photography method based on the rotor flying robot.
It is another object of the present invention to provide a computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the dynamic update visual tracking aerial photography method based on the rotor flying robot.
In summary, the advantages and positive effects of the invention are: (1) by adopting the deeper CIResNet network and a sample-learning method, the classification criterion is established automatically, the adaptability to complex backgrounds is enhanced, and more sample features can be extracted effectively.
(2) The invention adds the smoothing transformation matrix V to the traditional twin network so that the target appearance changes of the previous frames can be learned online and the spatio-temporal information is used effectively, and it also adds the background suppression matrix W so that the influence of background clutter factors can be controlled effectively.
(3) The first frame is no longer used alone as the standard reference; dynamic tracking with appearance learning and background suppression effectively alleviates problems such as occlusion.
(4) Both the precision and the overlap rate increase, and the speed reaches 16 fps, which basically meets the real-time requirement.
Table 1: comparison of tracking indicators
Tracker    Precision    Overlap rate    Speed (fps)
Ours       0.5512       0.2905          16
SiamFC     0.5355       0.2889          65
DSiam      0.5414       0.2804          25
DSST       0.5078       0.1678          134
The algorithm was implemented and debugged under the Ubuntu 16.04 operating system; the computer hardware is an Intel Core i7-8700K (3.7 GHz main frequency) with a GeForce RTX 2080 Ti graphics card.
In the dynamic update visual tracking aerial photography method based on a rotor flying robot provided by the invention, the CIResNet network replaces the original AlexNet network; compared with AlexNet its network hierarchy is deeper, which benefits the acquisition of target features. Compared with the traditional twin network, the smoothing transformation matrix V is added to learn the target appearance changes of the previous frames online and to use the spatio-temporal information effectively, and the background suppression matrix W is added to control the influence of background clutter factors effectively. The method selects a deeper network to obtain the target features instead of relying solely on the first frame as the standard reference, and uses appearance learning and background suppression for dynamic tracking, which effectively increases robustness.
Drawings
Fig. 1 is a flowchart of the dynamic update visual tracking aerial photography method based on a rotor flying robot provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of the dynamic update visual tracking aerial photography method based on a rotor flying robot provided by an embodiment of the present invention.
Fig. 3 is a block diagram of the detection part provided by an embodiment of the present invention.
Fig. 4 is a block diagram of the tracking part provided by an embodiment of the present invention.
Fig. 5 is a schematic diagram of the CIResNet network provided by an embodiment of the present invention.
Fig. 6 is a schematic diagram of the single-layer network structure provided by an embodiment of the present invention.
Fig. 7 is a diagram of the results on the UAV data set provided by an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Tracking flight of a rotor flying robot means that a camera is carried on a robot flying at low altitude, a sequence of image frames of a moving ground target is acquired in real time, and the image coordinates of the target are calculated and used as the input of visual servo control; from this the required aircraft velocity is obtained, the position and attitude of the rotor flying robot are controlled automatically, and the tracked ground target is kept near the centre of the camera's field of view. The traditional twin network tracking method has good real-time performance, but when the target is occluded or affected by a complex background or illumination, using only the first frame as the standard reference may cause the target to be tracked incorrectly or lost.
Aiming at the problems in the prior art, the invention provides a dynamic update visual tracking aerial photography method based on a rotor flying robot in which the CIResNet network replaces the original AlexNet network; compared with AlexNet, the CIResNet network hierarchy is deeper, which benefits the acquisition of target features. Compared with the traditional twin network, the smoothing transformation matrix V is added to learn the target appearance changes of the previous frames online and to use the spatio-temporal information effectively, and the background suppression matrix W is added to control the influence of background clutter factors effectively. The method selects a deeper network to obtain the target features instead of relying solely on the first frame as the standard reference, and uses appearance learning and background suppression for dynamic tracking, which effectively increases robustness. The present invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, a dynamic update visual tracking aerial photography method based on a rotor flying robot provided by an embodiment of the present invention includes the following steps:
s101: and performing target detection on the input image by using an HOG (histogram of ordered gradient) feature + Support Vector Machine (SVM) algorithm.
Even if the gradient and edge position information of the target in the image is unknown, the appearance and shape of the target can still be described by the distribution of local gradients or edge directions. The HOG feature builds its feature description by computing and counting gradient direction histograms of the target region, a principle that maintains good invariance to geometric changes and optical deformations of the image.
First, the image is divided into connected regions, usually cells of 8 × 8 pixels, called cell units; then the gradient magnitude and direction of every pixel in a cell unit are collected, the gradient directions in [-90°, 90°] are divided evenly into 9 intervals (bins), and histogram statistics of the gradient magnitudes of the pixels in the cell are computed over the bins to obtain a one-dimensional gradient direction histogram. To improve the invariance of the features to illumination and shadow, the histograms need to be contrast-normalized, usually over a larger range: the density of each histogram within the block is computed first, and each cell unit in the block is then normalized according to this density; the normalized block descriptor is called the HOG descriptor.
The HOG descriptors of all blocks in the detection window are combined to form the final feature vector, and target detection is then carried out with the SVM classifier. Fig. 3 depicts the feature extraction and target detection process: the detection window is divided into overlapping blocks, HOG descriptors are computed for these blocks, and the resulting feature vectors are fed into a linear SVM for target/non-target binary classification. The detection window scans all positions and scales of the whole image, and non-maximum suppression is applied to the output pyramid to detect the target; a sketch of this suppression step is given below.
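A minimal numpy sketch of the non-maximum suppression step mentioned above: overlapping detection windows from the multi-scale scan are pruned so that only the highest-scoring window of each overlapping group survives. Boxes are assumed to be given as (x1, y1, x2, y2) with an associated SVM score, and the 0.5 IoU threshold is an illustrative choice.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_thresh=0.5):
    boxes = boxes.astype(float)
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter)
        order = order[1:][iou <= iou_thresh]  # drop windows overlapping the kept one
    return keep
```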
S102: the target frame information obtained by target detection is transmitted to the visual tracking part, and a dynamically updated twin network based on CIResNet is used to track the target in real time; the tracking framework is shown in Fig. 4.
The first frame of the video sequence is taken as the template frame O1, the search area Zt is obtained from the current frame, and f^l(O1) and f^l(Zt) are obtained separately through the CIResNet-16 network.
The response of the conventional twin network is expressed as follows:
S_t^l = corr(f^l(O1), f^l(Zt))   (1)
the result of this formula is a similarity, where corr denotes the correlation filtering (which can be replaced by other metric functions), t denotes time and l denotes the l-th layer.
Unlike the traditional Siamese network, the network proposed by the invention adds two transformation matrices. The first transformation matrix V acts on the convolution feature of the target template so that the template feature at time t is approximately equal to the template feature at time t-1; this matrix is learned from frame t-1 and is regarded as a smooth deformation of the target. The second transformation matrix W acts on the convolution feature of the candidate region at time t so as to emphasize the target region and eliminate irrelevant background features.
For the transformation matrices V and W, the invention uses regularized linear regression for training; after the transformation matrices are applied, f^l(O1) and f^l(Zt) become V_t^l ⊛ f^l(O1) and W_t^l ⊛ f^l(Zt) respectively, where ⊛ denotes the circular convolution operation, V_t^l ⊛ f^l(O1) represents the change of the target's appearance and W_t^l ⊛ f^l(Zt) represents the background suppression transform. The final model is:
S_t^l = corr(V_t^l ⊛ f^l(O1), W_t^l ⊛ f^l(Zt))   (2)
the model adds two transformation matrixes of smoothing and background suppression on the basis of the twin network, and the smoothing matrix learns the appearance change of the previous frame, so that the spatiotemporal information can be effectively utilized; the background suppression matrix eliminates clutter influence factors in the background and enhances robustness. Meanwhile, the CIRESNet-16 network is used for replacing an AlexNet network in the traditional twin network, and the precision is higher.
Fig. 2 is a schematic diagram of a dynamic update visual tracking aerial photography method based on a rotor flying robot according to an embodiment of the present invention.
The HOG feature extraction in step S101 is described in detail as follows:
1) To reduce the influence of illumination, the whole image is first normalized. Local surface exposure contributes a large proportion of the texture intensity of the image, so compression processing effectively reduces local shadow and illumination changes. The image is usually converted to a gray-scale map, and the colour space of the input image is normalized (standardized) with Gamma correction. Gamma correction can be understood as improving the contrast of the dark or bright parts of the image, effectively reducing local shadow and illumination changes; the Gamma correction formula is:
f(I) = I^γ   (3)
where I is the image pixel value and γ is the Gamma correction coefficient.
2) The gradients along the horizontal and vertical coordinates of the image are computed, and the gradient direction value of each pixel position is computed from them; the derivative operation captures contours and some texture information and further weakens the influence of illumination;
Gx(x,y)=H(x+1,y)-H(x-1,y) (4)
Gy(x,y)=H(x,y+1)-H(x,y-1) (5)
in the above formula, Gx (x, y), Gy (x, y) respectively represent the horizontal gradient and the vertical gradient at the pixel point (x, y) in the input image.
G(x, y) = √(Gx(x, y)² + Gy(x, y)²)   (6)
α(x, y) = arctan(Gy(x, y) / Gx(x, y))   (7)
G (x, y), H (x, y), α (x, y) respectively represent the gradient magnitude, pixel value and gradient direction of the pixel point at (x, y).
3) Histogram computation: the image is divided into small cell units (which may be rectangular or circular) in order to provide an encoding of the local image region.
4) The cell units are grouped into large blocks, and the gradient histograms are normalized within each block.
5) The HOG features of all overlapping blocks in the detection window are collected and combined into the final feature vector for classification; a sketch of the gradient and histogram computation is given below.
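A minimal numpy sketch of steps 2)–4) above: per-pixel gradients following Gx = H(x+1, y) − H(x−1, y) and Gy = H(x, y+1) − H(x, y−1), 9-bin orientation histograms over 8 × 8 cells, and a simple per-cell L2 normalization standing in for the block normalization. Interpolation between neighbouring bins and the block grouping itself are omitted simplifications.

```python
import numpy as np

def hog_cell_histograms(gray, cell=8, bins=9):
    g = np.asarray(gray, dtype=float)
    gx = np.zeros_like(g); gy = np.zeros_like(g)
    gx[:, 1:-1] = g[:, 2:] - g[:, :-2]              # horizontal gradient Gx
    gy[1:-1, :] = g[2:, :] - g[:-2, :]              # vertical gradient Gy
    mag = np.hypot(gx, gy)                          # gradient magnitude G(x, y)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0    # unsigned gradient direction
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    h, w = g.shape
    hist = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            b = bin_idx[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
            for k in range(bins):
                hist[i, j, k] = m[b == k].sum()     # magnitude-weighted votes per bin
    return hist / (np.linalg.norm(hist, axis=-1, keepdims=True) + 1e-6)
```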
The improved network CIResNet-16 in step S102 is described in detail as follows:
CIResNet-16 is divided into three stages (total stride 8) and consists of 18 weighted convolutional layers.
(1) The features affected by padding are removed by a crop operation (size 2) combined with the 7 × 7 convolution.
(2) After a max-pooling layer with stride 2, the improved CIResNet unit is entered. As shown in Fig. 5(a), the CIR-unit stage has 3 layers in total: the first layer is a 1 × 1 convolution with 64 channels, the second layer is a 3 × 3 convolution with 64 channels, and the third layer is a 1 × 1 convolution with 256 channels. As described in Fig. 5, the feature maps passing through the convolution layers are added and then enter a crop operation that follows the 3 × 3 convolution and cancels the features affected by its padding of 1.
(3) The CIR-D (down-sampling CIR) unit is entered, as shown in Fig. 5(b); this stage of the network has 12 layers in total, and the first, second and third layers are cycled 4 times as a unit block. The first layer is a 1 × 1 convolution with 128 channels, the second layer is a 3 × 3 convolution with 128 channels, and the third layer is a 1 × 1 convolution with 512 channels.
The first block of this stage (4 blocks in total) is down-sampled by the proposed CIR-D unit, and after the feature map size is down-sampled the number of filters is doubled to improve feature discriminability. CIR-D changes the stride of the convolutions in the bottleneck and shortcut layers from 2 to 1, and the crop is inserted again after the addition operation to remove the features affected by padding. Finally, spatial down-sampling of the feature map is performed with max pooling. The spatial size of the output feature map is 7 × 7, and each feature receives information from a region of 77 × 77 pixels on the input image plane. As shown in Fig. 5, the feature maps passing through the convolution layers are added and then enter a crop operation and a max-pooling layer. The key idea of these modifications is to ensure that only the features affected by padding are deleted while the inherent block structure is kept unchanged.
(4) Cross-correlation operations:
the improved twin network structure takes as input an image pair, comprising an example image Z and a candidate search image X. Image Z represents an object of interest (e.g., an image block centered on the target object in the first video frame), while X represents a search area, typically larger, in subsequent video frames. Both inputs are processed by ConvNet with parameter theta. This will produce two signatures that are cross-correlated as:
f(Z, X) = φθ(Z) ⋆ φθ(X) + b   (8)
where b is a bias term; the whole formula corresponds to an exhaustive search of the image X for the pattern Z, with the goal of making the maximum of the response map f coincide with the target position. To achieve this, the network is trained offline with random image pairs (Z, X) and corresponding ground-truth labels y obtained from training videos, and the parameter θ of the ConvNet is obtained by minimizing the following loss over the training set:
θ* = arg minθ E(Z,X,y) L(y, f(Z, X; θ))   (9)
the basic formula of the loss function is:
l(y,v)=log(1+exp(-yv)) (10)
where y ∈ {+1, -1} is the true label and v is the actual score of the sample in the search image. According to the sigmoid function, the above formula gives the probability of a positive sample as
p = 1 / (1 + e^(-v))   (11)
and the probability of a negative sample as
1 - p = 1 / (1 + e^(v))   (12)
then from the cross-entropy formula it is easy to obtain -log p = log(1 + e^(-v)) for y = +1 and -log(1 - p) = log(1 + e^(v)) for y = -1, i.e. the loss l(y, v) = log(1 + exp(-yv)).
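A minimal PyTorch sketch of formulas (8) and (10) above: the template feature map is used as a correlation kernel over the search feature map (PyTorch's conv2d is in fact a cross-correlation), a bias b is added, and the response is scored with the logistic loss l(y, v) = log(1 + exp(−yv)). The feature shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def siamese_response(feat_z, feat_x, b=0.0):
    """feat_z: (C, Hz, Wz) template feature; feat_x: (C, Hx, Wx) search feature."""
    resp = F.conv2d(feat_x.unsqueeze(0), feat_z.unsqueeze(0))  # (1, 1, H', W') response map
    return resp.squeeze(0).squeeze(0) + b

def logistic_loss(response, labels):
    """labels is a map of +1 / -1 ground-truth values with the same shape as the response."""
    return torch.log1p(torch.exp(-labels * response)).mean()
```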
the step of dynamically updating the algorithm in step S102 is:
(1) inputting a picture to obtain a template image O1;
(2) determining a candidate frame search area Zt in a frame to be tracked;
(3) mapping the original images into a specific feature space by feature mapping to obtain the two depth features f^l(O1) and f^l(Zt);
(4) learning the change between the tracking result of the previous frame and the first-frame template according to regularized linear regression (RLR):
V_t^l = arg min_V || V ⊛ f^l(O1) - f^l(O_{t-1}) ||² + λ_v || V ||²;
fast calculation in the frequency domain gives the variation V_t^l as:
V_t^l = IFFT( (conj(F_1^l) ⊙ F_{t-1}^l) / (conj(F_1^l) ⊙ F_1^l + λ_v) ),
where F_1^l = FFT(f_1^l), F_{t-1}^l = FFT(f_{t-1}^l), f_1^l = f^l(O1), f_{t-1}^l = f^l(O_{t-1}), ⊙ denotes element-wise multiplication and conj(·) the complex conjugate; O denotes the target, f denotes the feature matrix, the superscript denotes the l-th channel and the subscript denotes the frame index, i.e. the tracking result of the previous frame and the target of the first frame.
(5) The background suppression transform of the current frame is obtained from the RLR calculation formula in the frequency domain:
W_t^l = IFFT( (conj(F_G^l) ⊙ F_Ḡ^l) / (conj(F_G^l) ⊙ F_G^l + λ_w) ),
where F_G^l = FFT(f^l(G_{t-1})), F_Ḡ^l = FFT(f^l(Ḡ_{t-1})), G_{t-1} is an image region of the same size as the search area of the previous frame, and Ḡ_{t-1} is G_{t-1} multiplied by a Gaussian centred at the picture centre point, the purpose of which is to highlight the centre and suppress the edges. The target variation V_t^l and the background suppression transform W_t^l are thus learned online.
By endowing the static twin network with this online adaptive capability, the improved model improves tracking precision while maintaining real-time speed.
(6) Element-wise multi-layer feature fusion:
S_t = Σ_l γ^l ⊙ S_t^l, where γ^l is an element-wise weight map for the response map S_t^l of layer l.
the center weight of the shallow feature is high, the peripheral weight of the deep feature is high, the center of the deep feature is low, if the target is in the center of the search area, the shallow feature can better locate the target, and if the target is in the periphery of the search area, the deep feature can also effectively determine the position of the target.
That is, when the target is close to the center of the search area, the deeper layer features are helpful for eliminating background interference, and the shallower layer features are helpful for obtaining accurate positioning of the target; and if the target is positioned at the periphery of the search area, only the deeper layer characteristics can effectively determine the position of the target.
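A minimal sketch of the element-wise multi-layer fusion described above: the per-layer response maps are combined with element-wise weight maps (trained as the matrix γ mentioned in step (8) below), so that shallow layers dominate near the centre of the search area and deeper layers near its periphery. The weight maps here are illustrative placeholders.

```python
import numpy as np

def fuse_responses(responses, gammas):
    """responses, gammas: lists of H x W arrays, one pair per feature layer l."""
    return sum(g * s for g, s in zip(gammas, responses))
```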
(7) Joint training is performed; first, by forward propagation, for a given N-frame video sequence {I_t | t = 1, ..., N}, tracking is carried out to obtain the N response maps {S_t | t = 1, ..., N}, while {J_t | t = 1, ..., N} denotes the N ground-truth target maps;
L_t = || S_t - J_t ||².
(8) A schematic diagram of the single-layer network structure is shown in Fig. 6, where "Eltwise" (element-wise multi-layer fusion) is trained as a matrix γ whose values represent the weights of different positions of the different feature maps. BPTT (back-propagation through time) and SGD (stochastic gradient descent) are used for gradient propagation and parameter updating. To train the network effectively with BPTT and SGD, all parameters of L_t must be obtained; as shown in Fig. 6, from ∂L_t/∂S_t the gradients with respect to the per-layer responses S_t^l are computed, and the loss gradient is then propagated through the left-hand "CirConv" and "RLR" layers so that it reaches f^l efficiently.
The corresponding chain-rule expressions for the gradients with respect to V_t^l, W_t^l and f^l are evaluated in the frequency domain, where the Fourier-transformed features and the discrete Fourier transform matrix E appear; the gradient of the element-wise multi-layer fusion can be computed by the same process, and for the multi-feature fusion formula the gradient is converted accordingly.
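A minimal PyTorch sketch of one joint-training step consistent with the description above: the response maps {S_t} of an N-frame sequence are compared with their ground-truth maps {J_t}, and automatic differentiation propagates the loss gradient back through the whole pipeline, playing the role of the BPTT/SGD derivation. The squared-error loss and the model interface are assumptions for illustration.

```python
import torch

def train_step(model, optimizer, frames, gt_maps):
    optimizer.zero_grad()
    loss = 0.0
    for frame, j_t in zip(frames, gt_maps):         # video sequence {I_t}
        s_t = model(frame)                          # response map S_t for frame t
        loss = loss + torch.mean((s_t - j_t) ** 2)  # distance to the ground-truth map J_t
    loss.backward()                                 # gradients via autograd (BPTT)
    optimizer.step()                                # SGD parameter update
    return float(loss)
```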
The model has reliable online adaptability: it effectively learns foreground and background changes and suppresses background interference without harming the real-time response capability, and it shows an excellent balance of tracking performance in experiments. In addition, the model is trained directly on annotated video sequences as a whole rather than on image pairs, so the rich spatio-temporal information of moving objects is captured better. Meanwhile, the model uses joint training, and all parameters can be learned offline through back-propagation, which facilitates training on data. The specific effect is shown in Fig. 7.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware or any combination thereof. When software is used wholly or partially, the implementation can take the form of a computer program product that includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the flows or functions according to the embodiments of the invention are produced wholly or partially. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data centre to another by wire (e.g. coaxial cable, optical fibre, Digital Subscriber Line (DSL)) or wirelessly (e.g. infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data centre that integrates one or more available media. The available medium may be a magnetic medium (e.g. floppy disk, hard disk, magnetic tape), an optical medium (e.g. DVD), or a semiconductor medium (e.g. a Solid State Disk (SSD)), among others.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A dynamic update visual tracking aerial photography method based on a rotor wing flying robot is characterized by comprising the following steps:
firstly, carrying out target detection on an input image by using an HOG feature extraction algorithm and a support vector machine algorithm SVM;
and step two, transmitting target frame information obtained by target detection to a visual tracking part, and tracking the target in real time by adopting a dynamic updating twin network based on a CIRESNet network.
2. The dynamic update visual tracking aerial photography method based on the rotor flying robot as claimed in claim 1, wherein in step one the target detection method comprises:
(1) dividing the image into connected regions of 8 × 8 pixels, called cell units;
(2) collecting the gradient magnitude and gradient direction of each pixel in a cell unit, dividing the gradient directions in [-90°, 90°] evenly into 9 intervals (bins), and using the gradient magnitude as the weight;
(3) performing histogram statistics of the gradient magnitudes of the pixels in the cell over the direction bins to obtain a one-dimensional gradient direction histogram;
(4) performing contrast normalization of the histograms over spatial blocks;
(5) extracting HOG descriptors through a detection window, and combining the HOG descriptors of all blocks in the detection window into the final feature vector;
(6) inputting the feature vector into a linear SVM and performing target detection with the SVM classifier;
(7) dividing the detection window into overlapping blocks, computing HOG descriptors for these blocks, and feeding the resulting feature vectors into the linear SVM for target/non-target binary classification;
(8) scanning the detection window over all positions and scales of the whole image, and applying non-maximum suppression to the output pyramid to detect the target;
the contrast normalization of the histograms in step (4) is performed as follows:
the density of each histogram within the block is first calculated, and each cell unit in the block is then normalized according to this density.
3. The dynamic update visual tracking aerial photography method based on the rotor flying robot as claimed in claim 1, wherein in step one the HOG feature extraction method specifically comprises:
① normalizing the whole image: the colour space of the input image is normalized with the Gamma correction method, where the Gamma correction formula is:
f(I) = I^γ
where I is the image pixel value and γ is the Gamma correction coefficient;
② computing the gradients along the horizontal and vertical coordinates of the image, and computing the gradient direction value of each pixel position;
Gx(x, y) = H(x+1, y) - H(x-1, y);
Gy(x, y) = H(x, y+1) - H(x, y-1);
where Gx(x, y) and Gy(x, y) denote the horizontal and vertical gradients at pixel (x, y) of the input image;
G(x, y) = √(Gx(x, y)² + Gy(x, y)²);
α(x, y) = arctan(Gy(x, y) / Gx(x, y));
where G(x, y), H(x, y) and α(x, y) denote the gradient magnitude, the pixel value and the gradient direction at pixel (x, y);
③ histogram computation: the image is divided into small cell units, providing an encoding of the local image region;
④ the cell units are grouped into large blocks, and the gradient histograms are normalized within each block;
⑤ the HOG features of all overlapping blocks in the detection window are collected and combined into the final feature vector for classification.
4. The dynamic update visual tracking aerial photography method based on the rotor flying robot as claimed in claim 1, wherein the step of tracking the target in real time comprises:
(1) obtaining the first frame of the video sequence as the template frame O1, obtaining the search area Zt from the current frame, and separately obtaining f^l(O1) and f^l(Zt) through the CIResNet-16 network;
(2) adding a transformation matrix V and a transformation matrix W, both of which are computed rapidly in the frequency domain by FFT; the transformation matrix V is obtained from the tracking result of frame t-1 and the first-frame target, acts on the convolution feature of the target template, and learns the change of the target so that the template feature at time t is approximately equal to the template feature at time t-1, smoothing the change of the current frame relative to the previous frames; the transformation matrix W is obtained from the tracking result of frame t-1, acts on the convolution feature of the candidate region at time t, and learns background suppression so as to eliminate the influence of irrelevant background features in the target region;
for the transformation matrix V and the transformation matrix W, training is performed using regularized linear regression; after the transformation matrices are applied, f^l(O1) and f^l(Zt) become V_t^l ⊛ f^l(O1) and W_t^l ⊛ f^l(Zt) respectively, where ⊛ denotes the circular convolution operation, V_t^l ⊛ f^l(O1) represents the change of the target's appearance and gives the currently updated target template, and W_t^l ⊛ f^l(Zt) represents the background suppression transform and gives a more suitable current search template; the final model is:
S_t^l = corr(V_t^l ⊛ f^l(O1), W_t^l ⊛ f^l(Zt));
the final model thus adds two transformation matrices, a smoothing matrix V and a background suppression matrix W, to the twin network; the smoothing matrix V learns the appearance changes of the previous frames, and the background suppression matrix W eliminates clutter in the background.
5. The method of claim 1, wherein in step two the dynamically updated twin network based on CIResNet comprises:
(I) performing a 7 × 7 convolution together with a crop operation to delete the features affected by padding;
(II) after a max-pooling layer with stride 2, entering the improved CIResNet unit; the CIR-unit stage has 3 layers in total: the first layer is a 1 × 1 convolution with 64 channels, the second layer is a 3 × 3 convolution with 64 channels, and the third layer is a 1 × 1 convolution with 256 channels; the feature maps are added after the convolution layers and then pass through a crop operation, which follows the 3 × 3 convolution and offsets the influence of its padding of 1;
(III) entering the CIR-D unit; the CIR-D stage has 12 layers in total, with the first, second and third layers cycled 4 times as a unit block; the first layer is a 1 × 1 convolution with 128 channels, the second layer is a 3 × 3 convolution with 128 channels, and the third layer is a 1 × 1 convolution with 512 channels;
(IV) cross-correlation operation: the improved twin network structure takes an image pair as input, consisting of an exemplar image Z and a candidate search image X; Z represents the object of interest, while X represents the search area, typically larger, in a subsequent video frame; both inputs are processed by a ConvNet with parameters θ; two feature maps are produced, and their cross-correlation is:
f(Z, X) = φθ(Z) ⋆ φθ(X) + b;
where b is a bias term; the formula searches the image X exhaustively for the pattern Z so that the maximum of the response map f coincides with the target position; the network is trained offline on random image pairs (Z, X) with corresponding ground-truth labels y obtained from training videos, and the parameter θ of the ConvNet is obtained by minimizing the following loss over the training set:
θ* = arg minθ E(Z,X,y) L(y, f(Z, X; θ));
the basic formula of the loss function is:
l(y, v) = log(1 + exp(-yv));
where y ∈ {+1, -1} is the true label and v is the actual score of the sample in the search image; according to the sigmoid function, the above formula gives the probability of a positive sample as
p = 1 / (1 + e^(-v))
and the probability of a negative sample as
1 - p = 1 / (1 + e^(v));
then from the cross-entropy formula it is easy to obtain -log p = log(1 + e^(-v)) for y = +1 and -log(1 - p) = log(1 + e^(v)) for y = -1, i.e. the loss l(y, v) = log(1 + exp(-yv)).
6. a method for dynamically updating vision tracking aerial photography based on rotor-flying robots as claimed in claim 5 wherein in step (iii) the first block of the CIR-D unit stage is downsampled by the proposed CIR-D unit and the number of filters is doubled after downsampling the signature size; changing the step length of the volume in the bottleneck layer and the quick connection layer from 2 to 1 by CIR-D, and inserting and cutting again after adding operation to delete the characteristics influenced by filling; finally, performing spatial down-sampling of the feature map using maximum pooling; the spatial size of the output feature map is 7 × 7, each feature receiving information from an area of 77 × 77 pixels in size on the input image plane; performing addition operation on the feature graph passing through the convolution layer, and then entering a crop operation and a maximum pooling layer; the key idea of these modifications is to ensure that only the functions affected by the padding are deleted while keeping the inherent block structure unchanged.
7. The method according to claim 1, wherein in step two, the dynamic update twin network based on CIResNet is used for real-time tracking of the target, and the dynamic update algorithm comprises:
(1) inputting a picture to obtain a template image O1;
(2) determining a candidate frame search area Zt in a frame to be tracked;
(3) mapping the original image to a specific feature space through feature mapping to respectively obtain fl(O1) and fl(Zt) These two depth features;
(4) learning the change of the tracking result of the previous frame and the template frame of the first frame according to the RLR;
Figure FDA0002300823480000051
fast calculation in the frequency domain yields:
Figure FDA0002300823480000052
thereby obtaining the variation
Figure FDA0002300823480000053
As follows:
Figure FDA0002300823480000054
wherein ,f1 l=fl(O1),
Figure FDA0002300823480000055
Wherein O represents the target, f represents the matrix, the upper right index represents the l channel, and the lower right index represents the several frames, namely, the tracking result of the previous frame and the target of the first frame are obtained;
(5) obtaining the background suppression transform of the current frame according to the same RLR calculation formula in the frequency domain:
Ŵ_t^l = ( conj(F_{G,t−1}^l) ⊙ F_{Ḡ,t−1}^l ) / ( conj(F_{G,t−1}^l) ⊙ F_{G,t−1}^l + λ ),   W_t^l = F^{−1}( Ŵ_t^l ),
wherein G_{t−1} is a region of the same size as the search area of the previous frame, Ḡ_{t−1} is G_{t−1} multiplied by a Gaussian weight centred on the target position, and F_{G,t−1}^l, F_{Ḡ,t−1}^l are the Fourier transforms of f^l(G_{t−1}) and f^l(Ḡ_{t−1}); the target variation V_t^l and the background suppression transform W_t^l are thus learned online (a numerical sketch of this closed-form RLR update is given after this claim);
(6) element-wise multi-layer feature fusion of the response maps:
S_t = Σ_l w_l ⊙ S_t^l,  with  S_t^l = corr( V_t^l ⊛ f^l(O_1), W_t^l ⊛ f^l(Z_t) ),
wherein corr(·,·) is the cross-correlation of claim 5 and w_l is the fusion weight of the l-th layer;
(7) joint training, first by forward propagation: for a given N-frame video sequence {I_t | t = 1, ..., N}, tracking is performed to obtain N response maps {S_t | t = 1, ..., N}, while {J_t | t = 1, ..., N} denotes the N ground-truth target maps; the per-frame loss is
L_t = || S_t − J_t ||^2;
(8) gradient propagation and parameter updating: BPTT and SGD are used to update all parameters with respect to L_t; from ∂L_t/∂S_t, the gradients ∂L_t/∂V_t^l and ∂L_t/∂W_t^l are calculated, and through the left-branch CirConv and RLR layers the efficient propagation of the loss gradient back to the features f^l is ensured, according to the following relations:
[the five frequency-domain gradient equations are given as images in the original claim]
wherein the hat notation denotes f after the Fourier transform and E is the discrete Fourier transform matrix; for the multi-layer feature fusion formula, the corresponding frequency-domain form is likewise given as an image in the original claim.
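As a numerical illustration of steps (4)-(5) of claim 7 (referenced above), the sketch below solves the regularized linear regression in the frequency domain and applies the resulting transform by circular convolution; the regularization constant, the feature shapes and the exact per-channel formulation are assumptions, since the original equations are only available as images.

import numpy as np

def rlr_transform(f_ref: np.ndarray, f_target: np.ndarray, lam: float = 1e-2) -> np.ndarray:
    """Closed-form transform V such that V (circularly convolved with) f_ref ~= f_target.

    Solves min_V ||V ⊛ f_ref - f_target||^2 + lam * ||V||^2 per channel via the FFT.
    f_ref, f_target: (C, H, W) feature maps. Returns V with the same shape.
    """
    F_ref = np.fft.fft2(f_ref, axes=(-2, -1))
    F_tgt = np.fft.fft2(f_target, axes=(-2, -1))
    V_hat = (np.conj(F_ref) * F_tgt) / (np.conj(F_ref) * F_ref + lam)
    return np.real(np.fft.ifft2(V_hat, axes=(-2, -1)))

def apply_transform(V: np.ndarray, feat: np.ndarray) -> np.ndarray:
    """Circular convolution of V with feat, computed as an element-wise product in frequency."""
    return np.real(np.fft.ifft2(np.fft.fft2(V, axes=(-2, -1)) *
                                np.fft.fft2(feat, axes=(-2, -1)), axes=(-2, -1)))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    f1 = rng.standard_normal((8, 22, 22))       # first-frame template features f^l(O_1), assumed shape
    f_prev = rng.standard_normal((8, 22, 22))   # previous-frame result features f^l(O_{t-1})
    V = rlr_transform(f1, f_prev)               # target-variation transform V_t^l
    print(np.linalg.norm(apply_transform(V, f1) - f_prev) / np.linalg.norm(f_prev))

With a small λ the recovered transform maps f^l(O_1) almost exactly onto f^l(O_{t−1}); increasing λ trades this fidelity for stability, which is the usual ridge-regression behaviour.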
8. A dynamic update visual tracking aerial photography system based on a rotor flying robot, implementing the dynamic update visual tracking aerial photography method based on a rotor flying robot of claim 1.
9. An information data processing terminal for realizing the dynamic update visual tracking aerial photography method based on a rotor flying robot according to any one of claims 1 to 7.
10. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the dynamic update visual tracking aerial photography method based on a rotor flying robot according to any one of claims 1 to 7.
CN201911220924.1A 2019-12-03 2019-12-03 Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot Active CN110992378B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911220924.1A CN110992378B (en) 2019-12-03 2019-12-03 Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911220924.1A CN110992378B (en) 2019-12-03 2019-12-03 Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot

Publications (2)

Publication Number Publication Date
CN110992378A true CN110992378A (en) 2020-04-10
CN110992378B CN110992378B (en) 2023-05-16

Family

ID=70089566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911220924.1A Active CN110992378B (en) 2019-12-03 2019-12-03 Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot

Country Status (1)

Country Link
CN (1) CN110992378B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610888A (en) * 2021-06-29 2021-11-05 南京信息工程大学 Twin network target tracking method based on Gaussian smoothness
CN114863267A (en) * 2022-03-30 2022-08-05 南京邮电大学 Aerial tree number accurate statistical method based on multi-track intelligent prediction
CN115984333A (en) * 2023-02-14 2023-04-18 北京拙河科技有限公司 Smooth tracking method and device for airplane target
CN116088580A (en) * 2023-02-15 2023-05-09 北京拙河科技有限公司 Flying object tracking method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108898620A (en) * 2018-06-14 2018-11-27 厦门大学 Method for tracking target based on multiple twin neural network and regional nerve network
CN109272530A (en) * 2018-08-08 2019-01-25 北京航空航天大学 Method for tracking target and device towards space base monitoring scene
EP3506166A1 (en) * 2017-12-29 2019-07-03 Bull SAS Prediction of movement and topology for a network of cameras
CN109993774A (en) * 2019-03-29 2019-07-09 大连理工大学 Online Video method for tracking target based on depth intersection Similarity matching
CN110070562A (en) * 2019-04-02 2019-07-30 西北工业大学 A kind of context-sensitive depth targets tracking
CN110443827A (en) * 2019-07-22 2019-11-12 浙江大学 A kind of UAV Video single goal long-term follow method based on the twin network of improvement
CN110490906A (en) * 2019-08-20 2019-11-22 南京邮电大学 A kind of real-time vision method for tracking target based on twin convolutional network and shot and long term memory network

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3506166A1 (en) * 2017-12-29 2019-07-03 Bull SAS Prediction of movement and topology for a network of cameras
CN108898620A (en) * 2018-06-14 2018-11-27 厦门大学 Method for tracking target based on multiple twin neural network and regional nerve network
CN109272530A (en) * 2018-08-08 2019-01-25 北京航空航天大学 Method for tracking target and device towards space base monitoring scene
CN109993774A (en) * 2019-03-29 2019-07-09 大连理工大学 Online Video method for tracking target based on depth intersection Similarity matching
CN110070562A (en) * 2019-04-02 2019-07-30 西北工业大学 A kind of context-sensitive depth targets tracking
CN110443827A (en) * 2019-07-22 2019-11-12 浙江大学 A kind of UAV Video single goal long-term follow method based on the twin network of improvement
CN110490906A (en) * 2019-08-20 2019-11-22 南京邮电大学 A kind of real-time vision method for tracking target based on twin convolutional network and shot and long term memory network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN XIAOLIN et al.: "Research on small target detection and tracking algorithms based on machine learning" *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113610888A (en) * 2021-06-29 2021-11-05 南京信息工程大学 Twin network target tracking method based on Gaussian smoothness
CN113610888B (en) * 2021-06-29 2023-11-24 南京信息工程大学 Twin network target tracking method based on Gaussian smoothing
CN114863267A (en) * 2022-03-30 2022-08-05 南京邮电大学 Aerial tree number accurate statistical method based on multi-track intelligent prediction
CN115984333A (en) * 2023-02-14 2023-04-18 北京拙河科技有限公司 Smooth tracking method and device for airplane target
CN115984333B (en) * 2023-02-14 2024-01-19 北京拙河科技有限公司 Smooth tracking method and device for airplane target
CN116088580A (en) * 2023-02-15 2023-05-09 北京拙河科技有限公司 Flying object tracking method and device
CN116088580B (en) * 2023-02-15 2023-11-07 北京拙河科技有限公司 Flying object tracking method and device

Also Published As

Publication number Publication date
CN110992378B (en) 2023-05-16

Similar Documents

Publication Publication Date Title
CN110222787B (en) Multi-scale target detection method and device, computer equipment and storage medium
CN110992378B (en) Dynamic updating vision tracking aerial photographing method and system based on rotor flying robot
CN106960446B (en) Unmanned ship application-oriented water surface target detection and tracking integrated method
CN109800689B (en) Target tracking method based on space-time feature fusion learning
CN113034548B (en) Multi-target tracking method and system suitable for embedded terminal
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN110675423A (en) Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN108805906A (en) A kind of moving obstacle detection and localization method based on depth map
CN108446634B (en) Aircraft continuous tracking method based on combination of video analysis and positioning information
Wang et al. Window zooming–based localization algorithm of fruit and vegetable for harvesting robot
CN110766723B (en) Unmanned aerial vehicle target tracking method and system based on color histogram similarity
CN112818905B (en) Finite pixel vehicle target detection method based on attention and spatio-temporal information
CN110006444B (en) Anti-interference visual odometer construction method based on optimized Gaussian mixture model
CN112364865B (en) Method for detecting small moving target in complex scene
CN111160365A (en) Unmanned aerial vehicle target tracking method based on combination of detector and tracker
CN104715251A (en) Salient object detection method based on histogram linear fitting
CN109949229A (en) A kind of target cooperative detection method under multi-platform multi-angle of view
CN109389609B (en) Interactive self-feedback infrared target detection method based on FART neural network
Zou et al. Microarray camera image segmentation with Faster-RCNN
CN108345835B (en) Target identification method based on compound eye imitation perception
CN112651381A (en) Method and device for identifying livestock in video image based on convolutional neural network
CN117409339A (en) Unmanned aerial vehicle crop state visual identification method for air-ground coordination
CN104715476A (en) Salient object detection method based on histogram power function fitting
CN109635649B (en) High-speed detection method and system for unmanned aerial vehicle reconnaissance target
CN110287957B (en) Low-slow small target positioning method and positioning device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant