CN116797628A - Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device - Google Patents



Publication number
CN116797628A
CN116797628A (application number CN202310429983.XA)
Authority
CN
China
Prior art keywords
feature
branch
weighted
unmanned aerial
feature map
Prior art date
Legal status
Pending
Application number
CN202310429983.XA
Other languages
Chinese (zh)
Inventor
金国栋
薛远亮
谭力宁
高晶
龙江雄
田思远
Current Assignee
Rocket Force University of Engineering of PLA
Original Assignee
Rocket Force University of Engineering of PLA
Priority date
Filing date
Publication date
Application filed by Rocket Force University of Engineering of PLA filed Critical Rocket Force University of Engineering of PLA
Priority to CN202310429983.XA
Publication of CN116797628A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/20: Analysis of motion
    • G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10016: Video; Image sequence
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale unmanned aerial vehicle aerial target tracking method and device in the technical field of image tracking, comprising the following steps: acquiring an unmanned aerial vehicle aerial video; inputting an initial frame and a current frame of the video into the template branch and the search branch of a twin tracking network constructed on the basis of a G-ResNet network; outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively; and carrying out weighted fusion on the three groups of first and second weighted feature maps with a plurality of anchor-free region proposal networks to obtain the target tracking result of the current frame. The method addresses the problem that existing unmanned aerial vehicle tracking algorithms cannot reach a good balance between precision and speed.

Description

Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device
Technical Field
The invention relates to the technical field of image tracking, in particular to a multi-scale unmanned aerial vehicle aerial photographing target tracking method and device.
Background
Object tracking estimates the state of a tracked object in every frame of a video sequence, given information about the object only in the first frame. When an unmanned aerial vehicle tracks a target, video images are transmitted to a ground station for display through a data link system. An operator uses control-stick and other commands to steer the stabilized unmanned aerial vehicle platform and camera system in search of a reconnaissance target. When a target of interest appears in the picture, it is selected, and a ground computer extracts a series of features of the target to serve as a template. The computer then confirms the position of the target of interest in subsequent images by computing the similarity between the template image and each subsequent image, thereby tracking the target continuously.
The main problems of the unmanned aerial vehicle tracking algorithm are divided into two aspects:
Tracking accuracy: video captured by an unmanned aerial vehicle has a large field of view and a wide range, so the photographed target is relatively small, contains few pixels, has few and indistinct features, and is surrounded by abundant background information. The target is therefore easily disturbed by many similar objects, the algorithm struggles to distinguish background from target, and the wrong target is easily tracked. Camera shake and changes in flight speed during flight cause motion blur, appearance change and similar problems, testing an algorithm's ability to characterize and discriminate small targets. Unmanned aerial vehicles are also highly maneuverable: flight generally has a high degree of freedom and few constraints, so rapid movement and large scale changes occur easily, and a tracking algorithm with insufficient scale adaptability will include too much background information and pollute the target information.
Tracking speed: the imaging equipment carried by an unmanned aerial vehicle during a mission may be visible light, thermal infrared, SAR and so on, and a single mission can collect a large amount of data. Missions are usually executed by several unmanned aerial vehicles in cooperation, which further increases the amount of information to be processed, so the tracking algorithm is required to process a large amount of data in real time.
Traditional correlation filtering tracking algorithms are fast, but they represent the target with hand-designed features whose representational capability is insufficient, making tracking precision difficult to improve. Most twin tracking algorithms pursue tracking accuracy with a series of complicated operations and neglect the requirement on tracking speed; the resulting tracking speed is unsatisfactory and hard to deploy on an unmanned aerial vehicle platform in real time. Existing unmanned aerial vehicle tracking algorithms therefore cannot reach a good balance between precision and speed.
Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the first aspect of the present invention provides a multi-scale unmanned aerial vehicle aerial photographing target tracking method, which comprises:
acquiring an unmanned aerial vehicle aerial video;
inputting an initial frame and a current frame of the unmanned aerial vehicle aerial video into a twin tracking network constructed based on a G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3×3 convolution kernel of the residual module of each Bottleneck in a ResNet50 network with a plurality of parallel-stacked convolution layer groups of the same topological structure, and adding a dual multi-scale attention module behind each Bottleneck;
and carrying out weighted fusion on the three groups of first weighted feature maps and second weighted feature maps by utilizing a plurality of anchor-free region proposal networks, and tracking the target of the current frame according to the predicted frame and the predicted position in the weighted fusion result.
Further, replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network with a plurality of parallel-stacked convolution layer groups of the same topological structure comprises:
in layer1, dividing the 64-channel 3×3 convolution kernel in the residual modules of the 3 Bottlenecks, by grouped convolution, into 32 parallel-stacked groups of 4-channel 3×3 convolution kernels;
in layer2, dividing the 128-channel 3×3 convolution kernel in the residual modules of the 4 Bottlenecks into 32 parallel-stacked groups of 8-channel 3×3 convolution kernels;
in layer3, dividing the 256-channel 3×3 convolution kernel in the residual modules of the 6 Bottlenecks, by grouped convolution, into 32 parallel-stacked groups of 16-channel 3×3 convolution kernels;
in layer4, dividing the 512-channel 3×3 convolution kernel in the residual modules of the 3 Bottlenecks, by grouped convolution, into 32 parallel-stacked groups of 32-channel 3×3 convolution kernels.
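The channel arithmetic above can be sanity-checked with a quick parameter count. This is a minimal sketch under the split stated for layer1 (32 parallel groups of 4-channel 3×3 kernels); the helper name `conv3x3_params` is ours, not the patent's:

```python
def conv3x3_params(c_in, c_out, groups=1, k=3):
    """Weight count of a k x k convolution; each output channel only
    sees c_in/groups input channels when grouped convolution is used."""
    assert c_in % groups == 0
    return k * k * (c_in // groups) * c_out

# Standard 64-channel 3x3 convolution (layer1 residual module).
dense = conv3x3_params(64, 64)        # 3*3*64*64 = 36864

# The same stage split into 32 parallel-stacked groups of 4 channels:
# each group is an independent 3x3 convolution over 4 input channels.
grouped = 32 * conv3x3_params(4, 4)   # 32 * 3*3*4*4 = 4608

print(dense, grouped)
```

The same count applied to layer2 through layer4 shows why raising cardinality this way keeps the 3×3 stage cheap while multiplying the number of learned subspaces.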
Further, outputting the first weighted feature map and the second weighted feature map from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
extracting, through the dual multi-scale attention module, the first feature map and the second feature map output by the first Bottleneck in layer2, layer3 and layer4 of the template branch and of the search branch respectively;
grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to each;
decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
processing the first sub-feature map and the second sub-feature map with the position attention module and the channel attention module respectively, to obtain a third sub-feature map with a position attention response and a fourth sub-feature map with a channel attention response;
carrying out channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
acquiring the plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
shuffling the fifth sub-feature maps to obtain the weighted feature maps output by the first Bottleneck of the template branch and by the first Bottleneck of the search branch;
and propagating the weighted feature maps output by the first Bottlenecks of the template branch and the search branch sequentially backward, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively.
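The shuffling step above can be sketched as a plain channel shuffle (reshape into groups, transpose, flatten); the function name `channel_shuffle` and the flat list-of-channels representation are our illustrative assumptions, not the patent's code:

```python
def channel_shuffle(feature_maps, groups):
    """Interleave the channels of `groups` equal-sized groups so that
    information mixes across groups before the next grouped operation.
    `feature_maps` is a flat list of per-channel maps (any objects)."""
    n = len(feature_maps)
    assert n % groups == 0
    per_group = n // groups
    # Equivalent to reshape (groups, per_group) -> transpose -> flatten.
    return [feature_maps[g * per_group + c]
            for c in range(per_group)
            for g in range(groups)]

# 2 groups of 3 channels: a0 a1 a2 | b0 b1 b2  ->  a0 b0 a1 b1 a2 b2
print(channel_shuffle(["a0", "a1", "a2", "b0", "b1", "b2"], 2))
```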
Further, the position attention response is expressed as:
X'_k1 = X_k1 · σ(W_1 · IN(X_k1) + b_1)
wherein X_k1 represents the first sub-feature map; IN(X_k1) represents the spatial information statistics of X_k1 computed with instance normalization; W_1 and b_1 are parameters used to reinforce the representation of IN(X_k1); and σ is the sigmoid nonlinear activation function.
Further, the channel attention response is expressed as:
s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j)
X'_k2 = X_k2 · σ(W_2 · s + b_2)
wherein H and W represent the height and width of the second sub-feature map respectively; X_k2 represents the second sub-feature map; F_gap represents the global average pooling function; W_2 and b_2 perform scaling and offset operations on s; and σ represents the sigmoid nonlinear activation function.
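A rough sketch of the channel attention response described above, with pure-Python lists standing in for feature maps and scalars `w2`, `b2` standing in for the scaling and offset parameters (our simplification for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(channels, w2=1.0, b2=0.0):
    """channels: list of HxW maps (lists of rows). For each channel,
    global-average-pool to a scalar s, gate it with sigmoid(w2*s + b2),
    and rescale every value of the channel by that gate."""
    out = []
    for ch in channels:
        h, w = len(ch), len(ch[0])
        s = sum(sum(row) for row in ch) / (h * w)   # F_gap
        g = sigmoid(w2 * s + b2)                    # scaling/offset + sigmoid
        out.append([[v * g for v in row] for row in ch])
    return out
```

Each channel is reduced to a single statistic by global average pooling, so the gate reweights whole channels rather than individual positions.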
Further, carrying out weighted fusion on the three groups of first weighted feature maps and second weighted feature maps by utilizing a plurality of anchor-free region proposal networks comprises:
arranging an RPN module with an anchor-free strategy between each of the three convolution blocks layer2, layer3 and layer4 of the template branch and of the search branch of the G-ResNet network, wherein the RPN module with the anchor-free strategy comprises a classification branch and a regression branch, and the regression branch predicts the offset between a target pixel point and the real frame;
inputting the first weighted feature map and the second weighted feature map respectively into the convolution networks of the regression branch and the classification branch of the RPN module, outputting a regression map for each from the regression branch and a classification map for each from the classification branch;
performing a depthwise cross-correlation operation on the two regression maps output by the regression branch to obtain a regression result;
performing a depthwise cross-correlation operation on the two classification maps output by the classification branch to obtain a classification result;
acquiring the position of the maximum value of the classification result as the predicted position of the target;
and obtaining the prediction bounding box corresponding to the predicted position from the regression result as the target prediction frame.
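The last two steps can be sketched as follows. The (l, t, r, b) distance parameterization and the stride used to map the response peak back to image coordinates are illustrative assumptions; the patent only states that offsets between target pixel points and the real frame are regressed:

```python
def decode_af_rpn(cls_map, reg_map, stride=8):
    """cls_map: HxW foreground scores; reg_map: HxW of (l, t, r, b)
    distances from each pixel to the box sides (anchor-free style).
    Returns the peak position and its predicted box in image coords."""
    best, by, bx = max((cls_map[y][x], y, x)
                       for y in range(len(cls_map))
                       for x in range(len(cls_map[0])))
    l, t, r, b = reg_map[by][bx]
    cx, cy = bx * stride, by * stride   # map the peak back to the image
    return (cx, cy), (cx - l, cy - t, cx + r, cy + b)

cls = [[0.1, 0.2], [0.9, 0.3]]
reg = [[(0, 0, 0, 0)] * 2, [(4, 4, 4, 4)] * 2]
print(decode_af_rpn(cls, reg))   # peak at grid cell (x=0, y=1)
```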
The invention also provides a multi-scale unmanned aerial vehicle aerial photographing target tracking device, which comprises:
the acquisition module is used for acquiring the aerial video of the unmanned aerial vehicle;
the processing module is used for inputting an initial frame and a current frame of the unmanned aerial vehicle aerial video into the template branch and the search branch of a twin tracking network constructed based on a G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3×3 convolution kernel of the residual module of each Bottleneck in a ResNet50 network with a plurality of parallel-stacked convolution layer groups of the same topological structure and adding a dual multi-scale attention module behind each Bottleneck;
and the output module is used for carrying out weighted fusion on the three groups of first weighted feature maps and second weighted feature maps by utilizing a plurality of anchor-free region proposal networks, and tracking the target of the current frame according to the predicted frame and predicted position in the weighted fusion result.
The invention also provides an electronic device comprising a processor and a memory, wherein at least one instruction, at least one program, code set or instruction set is stored in the memory, and the at least one instruction, the at least one program, code set or instruction set is loaded and executed by the processor to implement the multi-scale unmanned aerial vehicle aerial photographing target tracking method according to any one of the first aspect.
The present invention also provides a computer readable storage medium having stored therein at least one instruction, at least one program, code set or instruction set, the at least one instruction, at least one program, code set or instruction set being loaded and executed by a processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method according to any of the first aspects.
The embodiment of the invention provides a multi-scale unmanned aerial vehicle aerial photographing target tracking method and device, which have the following beneficial effects compared with the prior art:
1) Using the grouping-transformation-fusion idea of subspace learning, a grouped residual network G-ResNet is designed that extracts deep semantic features and diversified features of the target, effectively coping with challenges such as appearance change and motion blur and enhancing the representation of small targets.
2) A dual multi-scale attention module DMSAM is designed: feature maps are grouped to extract target feature information at different scales; dual attention then extracts local features of the target in the spatial and channel dimensions respectively and establishes the global dependence between target and background; finally, information communication between different channels is established, enhancing the scale adaptability and anti-interference capability of the invention.
3) A region proposal module AF-RPN based on an anchor-free strategy is provided to replace predefined anchor frames, distinguishing target from background pixel by pixel and realizing adaptive perception of the target scale. Multiple AF-RPNs are cascaded on the G-ResNet, so complementary detail and semantic information are effectively utilized to realize robust tracking and accurate positioning of the tracked target. Meanwhile, the speed reaches 40.5 FPS, meeting the real-time requirement.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the following description will make a brief introduction to the drawings used in the description of the embodiments or the prior art. It should be apparent that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from these drawings without inventive effort to those of ordinary skill in the art.
Fig. 1 is a flowchart of an aerial target tracking method of a multi-scale unmanned aerial vehicle provided by an embodiment of the present invention;
fig. 2 is a network model diagram of an unmanned aerial vehicle target tracking method based on a dual multi-scale attention module;
fig. 3 is a layer1 replacement example diagram of a multi-scale unmanned aerial vehicle aerial target tracking method according to an embodiment of the present invention;
fig. 4 is a DMSAM schematic diagram of a multi-scale unmanned aerial vehicle aerial target tracking method according to an embodiment of the present invention;
fig. 5 is a schematic shuffle diagram of a multi-scale unmanned aerial vehicle aerial target tracking method according to an embodiment of the present invention;
FIG. 6 is an AF-RPN schematic diagram of an aerial target tracking method for a multi-scale unmanned aerial vehicle according to an embodiment of the present invention;
fig. 7 is a block diagram of an aerial target tracking device of a multi-scale unmanned aerial vehicle according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The present specification provides method operational steps as described in the examples or flowcharts, but may include more or fewer operational steps based on conventional or non-inventive labor. When implemented in a real system or server product, the methods illustrated in the embodiments or figures may be performed sequentially or in parallel (e.g., in a parallel processor or multithreaded environment).
At present, the target tracking algorithm is mainly divided into a tracking algorithm based on correlation filtering and a tracking algorithm based on deep learning. The correlation filter tracking algorithm uses a correlation filter in the signal processing field to calculate the similarity between the template and the search image, and the Fourier transform is utilized to accelerate in the frequency domain, so that the operation amount is greatly reduced, the operation speed is improved, and hundreds of frames per second can be reached. However, most of related filtering algorithms are used for representing a tracking target by using a traditional feature extraction algorithm, so that the robustness and the accuracy are insufficient, and the target tracking task in a complex scene cannot be effectively processed.
Due to its great potential in both precision and speed, the twin tracking algorithm has gradually become the mainstream algorithm in the field of target tracking, and most subsequent tracking algorithms are based on the twin structure. The working principle of the twin tracking algorithm can be expressed as formula (1); it mainly consists of a feature extraction part φ, a similarity calculation part (the cross-correlation ⋆) and a tracking result generation part.
f(z, x) = φ(z) ⋆ φ(x) + b·𝟙   (1)
wherein f(z, x) is the similarity response map; φ is the feature extraction part; ⋆ is the cross-correlation operation; b is the deviation applied at each position; and 𝟙 is a matrix in which every entry is one.
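The cross-correlation in formula (1) can be sketched as a naive sliding inner product; real trackers compute it over deep feature channels (and the depthwise variant per channel), but this 2-D single-channel version shows the shape of the response map:

```python
def cross_correlate(template, search):
    """Slide `template` over `search` (valid positions only) and return
    the response map of inner products at each offset."""
    th, tw = len(template), len(template[0])
    sh, sw = len(search), len(search[0])
    resp = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            row.append(sum(template[i][j] * search[y + i][x + j]
                           for i in range(th) for j in range(tw)))
        resp.append(row)
    return resp

# A 2x2 template matched against a 3x3 search region.
t = [[1, 0], [0, 1]]
s = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(cross_correlate(t, s))   # -> [[2, 0], [0, 2]]
```

The peaks of the response map mark the offsets where the template pattern aligns with the search region, which is exactly what the tracking result generation part reads out.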
1) Feature extraction part: features are extracted with a twin neural network whose two branches are the template branch and the search branch. The template branch takes the target image z of the initial frame as the template and outputs the template feature map φ(z); the search branch takes the search image x of a subsequent frame as input and outputs the search feature map φ(x).
2) Similarity calculation part (⋆): integrates the feature information of the two branches' feature maps, calculates the similarity between the search feature map φ(x) and the template feature map φ(z), and generates the similarity response map f(z, x).
3) A tracking result generation section: and predicting the target position on the search image according to the obtained response graph, wherein the position with the maximum response is generally regarded as the target predicted position, and then carrying out target scale estimation and bounding box regression.
The process of online tracking with the twin tracking algorithm mainly comprises the following steps:
(1) input the video sequence into the feature extraction part frame by frame;
(2) if the frame is the first frame, the template branch extracts the target features as the template features;
(3) if the frame is not the first frame, the search branch extracts the features of the current frame as the search features;
(4) the similarity calculation part calculates the similarity between the feature maps and generates a response map;
(5) the tracking result generation part predicts the target position in the current frame from the similarity response map;
repeat steps (3)-(5) until the last frame of the video sequence.
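The steps above can be sketched as a loop skeleton; `extract` and `correlate` are placeholders for the feature extraction and similarity parts, shown here with toy stand-ins rather than a real network:

```python
def track_sequence(frames, extract, correlate):
    """Skeleton of the online twin-tracking loop: the first frame fixes
    the template features; each later frame is matched against them and
    the peak of the response map is taken as the predicted position."""
    template, positions = None, []
    for frame in frames:
        if template is None:                 # first frame -> template
            template = extract(frame)
            continue
        response = correlate(template, extract(frame))
        _, y, x = max((v, y, x)
                      for y, row in enumerate(response)
                      for x, v in enumerate(row))
        positions.append((x, y))             # position of maximum response
    return positions

# Toy stand-ins: features are the frame itself and the "response map"
# is simply the search features, so the brightest pixel is the prediction.
frames = [[[1]], [[0, 1], [0, 0]], [[0, 0], [1, 0]]]
print(track_sequence(frames, lambda f: f, lambda t, s: s))  # [(1, 0), (0, 1)]
```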
Fig. 1 is a flowchart of a multi-scale unmanned aerial vehicle aerial target tracking method provided by an embodiment of the present invention, where, as shown in fig. 1, the method includes:
step 101, acquiring an unmanned aerial vehicle aerial video;
step 102, inputting an initial frame and a current frame of the unmanned aerial vehicle aerial video into a twin tracking network constructed based on a G-ResNet network, and outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3×3 convolution kernel of the residual module of each Bottleneck in a ResNet50 network with a plurality of parallel-stacked convolution layer groups of the same topological structure, and adding a dual multi-scale attention module behind each Bottleneck;
and step 103, carrying out weighted fusion on the three groups of first weighted feature maps and second weighted feature maps by utilizing a plurality of anchor-free region proposal networks, and tracking the target of the current frame according to the predicted frame and predicted position in the weighted fusion result.
Fig. 2 is a network model diagram of the unmanned aerial vehicle target tracking method based on the dual multi-scale attention module. As shown in fig. 2, first, a grouped residual network (Group Residual Network, G-ResNet) is designed: convolution blocks with the same topology are stacked in parallel to extract diversified features of the target, enhancing the representation of the tracked target without increasing the network depth. Second, to screen features better, a dual multi-scale attention module (Dual Multi-Scale Attention Module, DMSAM) is used to extract multi-scale feature information of the target and suppress interference in both the channel and spatial dimensions. In the final tracking-frame generation stage, a plurality of anchor-free region proposal networks (Anchor Free Region Proposal Network, AF-RPN) adaptively perceive the scale change of the target, effectively solving the scale-change problem. Experiments show that the method copes more effectively with scale change, small targets, motion blur, partial occlusion and similar problems, improves the tracking effect on aerial targets, reaches a speed of 40.5 FPS and meets the real-time requirement.
In one possible implementation, replacing the 3×3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network with a plurality of parallel-stacked convolution layer groups of the same topological structure comprises:
in layer1, dividing the 64-channel 3×3 convolution kernel in the residual modules of the 3 Bottlenecks, by grouped convolution, into 32 parallel-stacked groups of 4-channel 3×3 convolution kernels;
in layer2, dividing the 128-channel 3×3 convolution kernel in the residual modules of the 4 Bottlenecks into 32 parallel-stacked groups of 8-channel 3×3 convolution kernels;
in layer3, dividing the 256-channel 3×3 convolution kernel in the residual modules of the 6 Bottlenecks, by grouped convolution, into 32 parallel-stacked groups of 16-channel 3×3 convolution kernels;
in layer4, dividing the 512-channel 3×3 convolution kernel in the residual modules of the 3 Bottlenecks, by grouped convolution, into 32 parallel-stacked groups of 32-channel 3×3 convolution kernels.
In the embodiment provided by the invention, cardinality is added to the deep ResNet-50, which already has a relatively large number of network layers, to increase network performance. Compared with increasing the number of network layers, increasing the cardinality of the network raises its feature description capability more effectively without increasing the number of network parameters. The design follows the split-transform-merge concept. As shown in fig. 3, which gives the replacement example for layer1, the 3×3 convolution in the residual block is the main extraction part of the feature information, so the 3×3 convolution in the residual block is replaced with multiple parallel-stacked convolutions of the same topology. In ordinary convolution, one channel of the output feature map requires all channels of the input feature map to participate in the calculation. In the parallel stacking operation, the 64-channel 3×3 convolution is divided into 32 parallel-stacked groups of 4-channel 3×3 convolutions. Different convolution groups can be regarded as different subspaces, and the feature information learned by each subspace has a different emphasis, i.e., diversified feature information of the target is extracted.
In one possible implementation, outputting three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively comprises:
extracting, through the dual multi-scale attention module, the first feature map and the second feature map output by the first Bottleneck in layer2, layer3 and layer4 of the template branch and of the search branch respectively;
grouping the first feature map and the second feature map respectively to obtain a plurality of grouped feature maps corresponding to each;
decomposing each grouped feature map into a first sub-feature map and a second sub-feature map;
processing the first sub-feature map and the second sub-feature map with the position attention module and the channel attention module respectively, to obtain a third sub-feature map with a position attention response and a fourth sub-feature map with a channel attention response;
carrying out channel fusion on the third sub-feature map and the fourth sub-feature map to obtain a fifth sub-feature map corresponding to the grouped feature map;
acquiring the plurality of fifth sub-feature maps corresponding to the plurality of grouped feature maps;
shuffling the fifth sub-feature maps to obtain the weighted feature maps output by the first Bottleneck of the template branch and by the first Bottleneck of the search branch;
and propagating the weighted feature maps output by the first Bottlenecks of the template branch and the search branch sequentially backward, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively.
In the embodiment provided by the invention, the attention module can adaptively allocate weights and selectively screen feature map information, helping the network attend better to the target of interest and effectively compensating for the shortcomings of G-ResNet. Therefore, to enhance the discrimination capability of the invention, a dual multi-scale attention module (DMSAM) is introduced on G-ResNet. As shown in fig. 4, to let the network learn feature information at different scales, the DMSAM first extracts features at various scales and groups them; then it adaptively captures local features and global dependencies using the position and channel attention modules in parallel; finally it fuses and shuffles the feature maps of all channels to strengthen information exchange between different channels.
First assume that the input feature map is X ∈ R^(C×H×W), wherein C, H and W represent the number of channels, the height and the width of the feature map respectively. To reduce the computation cost, X is divided into G groups of sub-feature maps, X = [X_1, ..., X_G], X_k ∈ R^((C/G)×H×W). Because the sub-feature maps are divided according to channels, each sub-feature map can capture specific semantic information during training. Each X_k is further divided into two parts, X_k1 and X_k2 ∈ R^((C/2G)×H×W): one uses channel attention to capture the interrelationship between channels, and the other uses position attention to find the spatial relationship between features. Thus, through the weight allocation of the attention modules, the network better knows what is meaningful (what) and where is meaningful (where).
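The grouping and splitting described above can be sketched as follows. This is an illustrative numpy sketch under assumed shapes, not the patented implementation; the function name `group_and_split` and the concrete sizes are ours.

```python
import numpy as np

# Group a feature map X of shape (C, H, W) into G sub-feature maps along the
# channel axis, then split each group into two halves X_k1 / X_k2 for the
# position-attention and channel-attention paths respectively.
def group_and_split(X, G):
    C, H, W = X.shape
    assert C % (2 * G) == 0, "channels must divide evenly into 2*G parts"
    groups = X.reshape(G, C // G, H, W)   # G sub-feature maps
    half = C // (2 * G)
    X_k1 = groups[:, :half]               # inputs to position attention
    X_k2 = groups[:, half:]               # inputs to channel attention
    return X_k1, X_k2

X = np.random.rand(64, 7, 7).astype(np.float32)
X_k1, X_k2 = group_and_split(X, G=8)
print(X_k1.shape, X_k2.shape)  # (8, 4, 7, 7) (8, 4, 7, 7)
```

Because the split is purely a channel reshape, concatenating the two halves back along the channel axis recovers the original feature map.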
In one possible embodiment, the expression of the position attention response includes:
X'_k1 = σ(W1·IN(X_k1) + b1) ⊗ X_k1    (2)

wherein X_k1 represents the first sub-feature map, IN(X_k1) represents the spatial information statistics of X_k1 obtained by instance normalization, W1 and b1 are parameters for strengthening the representation of IN(X_k1), σ is the sigmoid nonlinear activation function, and ⊗ denotes element-wise multiplication.
In the embodiment provided by the invention, objects similar to the tracking target are often present during unmanned aerial vehicle tracking, so the feature map contains not only the feature information of the tracking target but also the feature information of the similar objects. The position attention enhances the discrimination of similar objects and gives a larger degree of attention to the position of the target. The present invention uses instance normalization (Instance Normalization, IN) to obtain the spatial information statistics IN(X_k1) of X_k1, and the position attention response X'_k1 is then obtained from formula (3):

X'_k1 = σ(W1·IN(X_k1) + b1) ⊗ X_k1    (3)

in the formula: W1 and b1 are parameters for strengthening the representation capability of IN(X_k1). By designing a weight for each position of the feature map, the position attention response effectively suppresses the interference of similar objects, and the network clearly focuses on where (where) the target is on the image.
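A position attention of this kind can be sketched in numpy as below. This is a hedged illustration: the scalar `W1`/`b1` stand in for the learned per-channel parameters, and the function name is ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Instance-normalize the sub-feature map to get per-channel spatial
# statistics, rescale with W1/b1, squash with a sigmoid, and reweight the
# input position by position (one weight per spatial location).
def position_attention(X_k1, W1, b1, eps=1e-5):
    mu = X_k1.mean(axis=(-2, -1), keepdims=True)   # per-channel mean
    var = X_k1.var(axis=(-2, -1), keepdims=True)   # per-channel variance
    inst_norm = (X_k1 - mu) / np.sqrt(var + eps)   # IN(X_k1)
    weights = sigmoid(W1 * inst_norm + b1)         # position-wise weights
    return weights * X_k1

X_k1 = np.random.rand(4, 7, 7).astype(np.float32)
out = position_attention(X_k1, W1=1.0, b1=0.0)
print(out.shape)  # (4, 7, 7)
```

Since the sigmoid weights lie in (0, 1), each position of the output is a damped copy of the input, with the largest weights at positions whose normalized response is high.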
In one possible implementation, the channel attention response expression includes:

s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j)    (4)

X'_k2 = σ(W2·s + b2) ⊗ X_k2    (5)

wherein H and W represent the height and width of the second sub-feature map respectively, X_k2 represents the second sub-feature map, F_gap represents a global average pooling function, W2 and b2 perform scaling and offset operations on s, and σ represents a sigmoid nonlinear activation function.
Different channels on the feature map of a deep network represent different semantic information. The process by which channel attention allocates weights can be seen as a process of selecting semantic attributes for the different channels. The present invention uses global average pooling (GAP) to compress the feature layer on each channel of X_k2, obtaining the result s:

s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j)    (6)

To learn the nonlinear relationship between channels, s is then passed through the sigmoid nonlinear activation function σ to obtain the weight coefficients, adaptively guiding the network to select the appropriate feature maps. The channel attention response X'_k2 is obtained from formula (7):

X'_k2 = σ(W2·s + b2) ⊗ X_k2    (7)

in the formula: W2 and b2 perform scaling and offset operations on s. Weights are allocated to the feature maps according to their different semantic information, and the channel where the target is located receives the largest weight. In the cross-correlation operation, the responses on the other channels are suppressed, so the network is clear about what class (what) it should pay attention to.
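Formulas (6)-(7) can be sketched as follows. Again a hedged illustration: scalar `W2`/`b2` stand in for the learned scaling and offset parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Global average pooling compresses each channel of X_k2 to a scalar s
# (formula (6)); s is scaled/offset and passed through a sigmoid to give one
# weight per channel, which then reweights the input (formula (7)).
def channel_attention(X_k2, W2, b2):
    s = X_k2.mean(axis=(-2, -1), keepdims=True)   # F_gap: shape (C, 1, 1)
    weights = sigmoid(W2 * s + b2)                # one weight per channel
    return weights * X_k2

X_k2 = np.random.rand(4, 7, 7).astype(np.float32)
out = channel_attention(X_k2, W2=1.0, b2=0.0)
print(out.shape)  # (4, 7, 7)
```

Every position within a channel receives the same weight, which is what makes this a channel-selection rather than a position-selection mechanism.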
Before shuffling, the attention responses X'_k1 and X'_k2 are concatenated to obtain a new sub-feature map X'_k = [X'_k1, X'_k2]. All the new sub-feature maps are superimposed according to the channel and combined into a feature map X' ∈ R^(C×H×W), as shown in formula (8). The channel shuffle (channel_shuffle) operation shown in fig. 5 is then applied, as in formula (9): X' is first expanded into a (G, C/G, H, W) four-dimensional matrix; keeping the H and W dimensions unchanged, the G and C/G dimensions are transposed, and the dimensions of the matrix are then compressed to obtain the output feature map. The shuffling operation effectively integrates the feature information on each channel and strengthens the information exchange between channels.

X' = Concat(X'_1, X'_2, ..., X'_G), X'_k = [X'_k1, X'_k2]    (8)

Y = channel_shuffle(X')    (9)
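The channel shuffle of formula (9) is a pure reshape-transpose-reshape, which can be sketched directly:

```python
import numpy as np

# channel_shuffle: expand the channel axis into (G, C/G), transpose those two
# dimensions, and flatten back, so channels from different groups interleave.
def channel_shuffle(X, G):
    C, H, W = X.shape
    return X.reshape(G, C // G, H, W).transpose(1, 0, 2, 3).reshape(C, H, W)

X = np.arange(8 * 2 * 2, dtype=np.float32).reshape(8, 2, 2)
Y = channel_shuffle(X, G=2)
# With G=2, channel order [0,1,2,3 | 4,5,6,7] becomes [0,4,1,5,2,6,3,7].
```

No values are created or destroyed; only the channel ordering changes, which is why the operation costs almost nothing yet mixes information across the G groups.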
In the DMSAM, target feature information of different scales is first extracted from the grouped feature maps; the double attention then extracts local features of the target in the channel and spatial dimensions respectively and establishes the global dependency relationship between the target and the background; finally, information communication between the different channels is established, increasing the difference between the target and interference information and improving the scale adaptability and discrimination capability of the invention.
In one possible implementation, the weighted fusion of the three sets of first weighted feature maps and second weighted feature maps using a plurality of area suggestion networks without anchor frames includes:
an RPN module without an anchor frame strategy is respectively arranged after the three convolution blocks layer2, layer3 and layer4 of the template branch and the search branch of the G-ResNet network; the RPN module without the anchor frame strategy comprises a classification branch and a regression branch, and the regression branch predicts the offset between a target pixel point and the real frame;
respectively inputting the first weighted feature map and the second weighted feature map into the convolution networks of the regression branch and the classification branch of the RPN module without the anchor frame strategy, so that the regression branch outputs two regression maps and the classification branch outputs two classification maps;

performing a depth cross-correlation operation on the two regression maps output by the regression branch to obtain a regression result;

performing a depth cross-correlation operation on the two classification maps output by the classification branch to obtain a classification result;
acquiring the position of the maximum value of the classification result as the predicted position of the target;
and obtaining a prediction boundary frame corresponding to the prediction position from the regression result as a target prediction frame.
In one possible implementation manner, a set of anchor frames with different scales is conventionally predefined in the RPN module to perform scale estimation. The prior information of these anchor frames is obtained by analyzing the video in advance, which runs against the starting point of the tracking task, and the tracking performance is sensitive to the anchor frame parameters, which need to be set manually and carefully. Therefore, in order to get rid of excessive dependence on target prior information, the adaptive estimation of the target scale is completed in the RPN module by using an anchor-free frame strategy. In the RPN module based on the anchor-free frame strategy (AF-RPN), the bounding box regression branch no longer regresses the size of the anchor (length, width, center point position), but instead predicts the offsets l, t, b, r between the target pixel point and the real frame (ground-truth); previously, the classification branch judged whether the target in an anchor is a positive sample by calculating the area intersection over union (Intersection over Union, IoU) between the anchor and the real frame. The anchor-free frame strategy therefore needs a new positive and negative sample discrimination method: the pixel points of the similarity response map are mapped back into the search image; those falling outside ellipse E1 are negative samples, and those falling inside ellipse E2 are positive samples, as shown in fig. 6.
S_{w×h×2} = [φ(x)]_cls ⋆ [φ(z)]_cls
R_{w×h×4} = [φ(x)]_reg ⋆ [φ(z)]_reg    (10)

in the formula: S and R are the classification result and the regression result; ⋆ represents the depth cross-correlation operation; φ is the feature extraction network; z and x are the template image and the search image; w, h and c are the width, height and number of channels of the feature map.
The maximum value on the classification result S is then found; its position is the predicted position of the target, and the corresponding position in the regression result gives a predicted bounding box, which is used as the prediction frame of the target.
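The depth cross-correlation ⋆ of formula (10) treats the template feature map as a per-channel sliding kernel over the search feature map. A naive numpy sketch (illustrative only; a real implementation would use a grouped convolution primitive):

```python
import numpy as np

# Depth-wise cross-correlation: the template features z_f slide over the
# search features x_f channel by channel, giving one response map per channel.
def depthwise_xcorr(x_f, z_f):
    C, Hx, Wx = x_f.shape
    _, Hz, Wz = z_f.shape
    Ho, Wo = Hx - Hz + 1, Wx - Wz + 1
    out = np.zeros((C, Ho, Wo), dtype=x_f.dtype)
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                out[c, i, j] = np.sum(x_f[c, i:i+Hz, j:j+Wz] * z_f[c])
    return out

x_f = np.random.rand(4, 8, 8).astype(np.float32)  # search features
z_f = np.random.rand(4, 3, 3).astype(np.float32)  # template features
resp = depthwise_xcorr(x_f, z_f)
print(resp.shape)  # (4, 6, 6)
```

Keeping the correlation depth-wise (rather than summing over channels) preserves per-channel responses, which is what lets the channel attention weights of the DMSAM remain meaningful at this stage.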
The invention also provides a multi-scale unmanned aerial vehicle aerial photographing target tracking device 200, as shown in fig. 7, comprising:
an acquisition module 201, configured to acquire an aerial video of the unmanned aerial vehicle;
the processing module 202 is configured to input an initial frame and a current frame of the aerial video of the unmanned aerial vehicle into the template branch and the search branch of a twin tracking network constructed based on a G-ResNet network, and to output three groups of first weighted feature maps and second weighted feature maps from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3 by 3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network with a plurality of parallel-stacked convolution groups of the same topological structure, and adding a double multi-scale attention module behind each Bottleneck;
and the output module 203 is configured to perform weighted fusion on the three sets of first weighted feature maps and the second weighted feature maps by using a plurality of area suggestion networks without anchor frames, and track the target of the current frame according to the prediction frame and the prediction position in the weighted fusion result.
In yet another embodiment of the present invention, there is further provided an apparatus, including a processor and a memory, where the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method described in the embodiments of the present invention.
In yet another embodiment of the present invention, a computer readable storage medium is provided, where at least one instruction, at least one section of program, a code set, or an instruction set is stored, where the at least one instruction, the at least one section of program, the code set, or the instruction set is loaded and executed by a processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method described in the embodiments of the present invention.
In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another, for example by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
It is noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. The multi-scale unmanned aerial vehicle aerial photographing target tracking method is characterized by comprising the following steps of:
acquiring an unmanned aerial vehicle aerial video;
inputting an initial frame and a current frame of an aerial video of an unmanned aerial vehicle into a twin tracking network constructed based on a G-ResNet network, and outputting three groups of first weighted feature images and second weighted feature images from the three convolution blocks layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing the 3 by 3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network with a plurality of parallel-stacked convolution layer groups of the same topological structure, and adding a double multi-scale attention module behind each Bottleneck;
and carrying out weighted fusion on the three groups of first weighted feature images and the second weighted feature images by utilizing a plurality of area suggestion networks without anchor frames, and tracking the target of the current frame according to the predicted frames and the predicted positions in the weighted fusion results.
2. The multi-scale unmanned aerial vehicle aerial target tracking method of claim 1, wherein the replacing of the 3 by 3 convolution kernel of the residual module of each Bottleneck in the ResNet50 network with a plurality of parallel-stacked convolution layer groups of the same topological structure comprises:
in layer1, the 3 by 3 convolution kernel with 64 channels in the residual modules of the 3 Bottlenecks is divided by grouped convolution into 32 parallel-stacked groups of 3 by 3 convolution kernels with 4 channels each;

in layer2, the 3 by 3 convolution kernel with 128 channels in the residual modules of the 4 Bottlenecks is divided by grouped convolution into 32 parallel-stacked groups of 3 by 3 convolution kernels with 8 channels each;

in layer3, the 3 by 3 convolution kernel with 256 channels in the residual modules of the 6 Bottlenecks is divided by grouped convolution into 32 parallel-stacked groups of 3 by 3 convolution kernels with 16 channels each;

in layer4, the 3 by 3 convolution kernel with 512 channels in the residual modules of the 3 Bottlenecks is divided by grouped convolution into 32 parallel-stacked groups of 3 by 3 convolution kernels with 32 channels each.
3. The method for tracking an aerial target of a multi-scale unmanned aerial vehicle according to claim 1, wherein the outputting three sets of the first weighted feature map and the second weighted feature map from three convolution blocks of layer2, layer3 and layer4 of the G-res net network respectively comprises:
extracting a first characteristic diagram and a second characteristic diagram output by a first Bottleneck in layers 2, 3 and 4 of a template branch and a search branch respectively through a double multi-scale attention module;
grouping the first feature map and the second feature map respectively to obtain a plurality of grouping feature maps corresponding to the first feature map and the second feature map respectively;
decomposing each grouping feature map into a first sub-feature map and a second sub-feature map;
processing the first sub-feature map and the second sub-feature map by using the position attention module and the channel attention module respectively, to obtain a third sub-feature map with position attention response and a fourth sub-feature map with channel attention response respectively;
channel fusion is carried out on the third sub-feature diagram and the fourth sub-feature diagram to obtain a fifth sub-feature diagram corresponding to the grouping feature diagram;
acquiring a plurality of fifth sub-feature graphs corresponding to the plurality of grouping feature graphs;
shuffling the plurality of fifth sub-feature maps to obtain the weighted feature maps output by the first Bottleneck of the template branch and of the search branch;
and propagating the weighted feature maps output by the first Bottleneck of the template branch and the search branch sequentially through the subsequent Bottlenecks, and outputting the first weighted feature map and the second weighted feature map from the last Bottleneck of layer2, layer3 and layer4 respectively.
4. A multi-scale unmanned aerial vehicle aerial target tracking method as claimed in claim 3, wherein the expression of the position attention response comprises:

X'_k1 = σ(W1·IN(X_k1) + b1) ⊗ X_k1

wherein X_k1 represents the first sub-feature map, IN(X_k1) represents the spatial information statistics of X_k1 obtained by instance normalization, W1 and b1 are parameters for strengthening the representation of IN(X_k1), and σ is the sigmoid nonlinear activation function.
5. A multi-scale unmanned aerial vehicle aerial target tracking method as claimed in claim 3, wherein the channel attention response expression comprises:

s = F_gap(X_k2) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_k2(i, j)

X'_k2 = σ(W2·s + b2) ⊗ X_k2

wherein H and W represent the height and width of the second sub-feature map respectively, X_k2 represents the second sub-feature map, F_gap represents a global average pooling function, W2 and b2 perform scaling and offset operations on s, and σ represents a sigmoid nonlinear activation function.
7. The method for tracking an aerial target of a multi-scale unmanned aerial vehicle according to claim 1, wherein the weighting and fusing the three sets of the first weighted feature map and the second weighted feature map by using a plurality of area suggestion networks without anchor frames comprises:
an RPN module without an anchor frame strategy is respectively arranged among three convolution blocks of a template branch and a layer2, a layer3 and a layer4 of a search branch of the G-ResNet network, the RPN module without the anchor frame strategy comprises a classification branch and a regression branch, and the regression branch is used for predicting the offset between a target pixel point and a real frame;
respectively inputting the first weighted feature map and the second weighted feature map into the convolution networks of the regression branch and the classification branch of the RPN module without the anchor frame strategy, so that the regression branch outputs two regression maps and the classification branch outputs two classification maps;

performing a depth cross-correlation operation on the two regression maps output by the regression branch to obtain a regression result;

performing a depth cross-correlation operation on the two classification maps output by the classification branch to obtain a classification result;
acquiring the position of the maximum value of the classification result as the predicted position of the target;
and obtaining a prediction boundary frame corresponding to the prediction position from the regression result as a target prediction frame.
8. Multi-scale unmanned aerial vehicle target tracking means that takes photo by plane, its characterized in that includes:
the acquisition module is used for acquiring the aerial video of the unmanned aerial vehicle;
the processing module is used for inputting an initial frame and a current frame of an aerial video of the unmanned aerial vehicle into a template branch and a search branch in a twin tracking network constructed based on a G-ResNet network, outputting three groups of first weighted feature graphs and second weighted feature graphs from three convolution blocks of layer2, layer3 and layer4 of the G-ResNet network respectively, wherein the G-ResNet network is obtained by replacing a convolution kernel of 3 times 3 of a residual module of each Bottleneck in a ResNet50 network through a plurality of convolution layer groups of the same topological structure stacked in parallel and adding a double multi-scale attention module behind each Bottleneck;
and the output module is used for carrying out weighted fusion on the three groups of first weighted feature images and the second weighted feature images by utilizing a plurality of area suggestion networks without anchor frames, and tracking the target of the current frame according to the predicted frame and the predicted position in the weighted fusion result.
9. An electronic device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method of any of claims 1-6.
10. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set loaded and executed by a processor to implement the multi-scale unmanned aerial vehicle aerial target tracking method of any of claims 1-6.
CN202310429983.XA 2023-04-21 2023-04-21 Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device Pending CN116797628A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310429983.XA CN116797628A (en) 2023-04-21 2023-04-21 Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device


Publications (1)

Publication Number Publication Date
CN116797628A true CN116797628A (en) 2023-09-22

Family

ID=88036950

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310429983.XA Pending CN116797628A (en) 2023-04-21 2023-04-21 Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device

Country Status (1)

Country Link
CN (1) CN116797628A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437530A (en) * 2023-10-12 2024-01-23 中国科学院声学研究所 Synthetic aperture sonar interest small target twin matching identification method and system


Similar Documents

Publication Publication Date Title
CN110334779B (en) Multi-focus image fusion method based on PSPNet detail extraction
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN112446380A (en) Image processing method and device
CN111667399A (en) Method for training style migration model, method and device for video style migration
CN110807757B (en) Image quality evaluation method and device based on artificial intelligence and computer equipment
CN109492596B (en) Pedestrian detection method and system based on K-means clustering and regional recommendation network
CN112488978A (en) Multi-spectral image fusion imaging method and system based on fuzzy kernel estimation
CN111126278A (en) Target detection model optimization and acceleration method for few-category scene
CN112215332A (en) Searching method of neural network structure, image processing method and device
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN110610143A (en) Crowd counting network method, system, medium and terminal for multi-task joint training
CN111260687B (en) Aerial video target tracking method based on semantic perception network and related filtering
CN116797628A (en) Multi-scale unmanned aerial vehicle aerial photographing target tracking method and device
CN116486288A (en) Aerial target counting and detecting method based on lightweight density estimation network
CN111291631A (en) Video analysis method and related model training method, device and apparatus
Zhuang et al. Blind image deblurring with unknown kernel size and substantial noise
CN116977674A (en) Image matching method, related device, storage medium and program product
Meng et al. A mobilenet-SSD model with FPN for waste detection
CN113554656B (en) Optical remote sensing image example segmentation method and device based on graph neural network
CN114358204A (en) No-reference image quality evaluation method and system based on self-supervision
CN112801890B (en) Video processing method, device and equipment
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
CN114372931A (en) Target object blurring method and device, storage medium and electronic equipment
Wei et al. Lightweight multimodal feature graph convolutional network for dangerous driving behavior detection
CN113256546A (en) Depth map completion method based on color map guidance

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination