CN111340850A - Ground target tracking method of unmanned aerial vehicle based on twin network and central logic loss - Google Patents
Info
- Publication number: CN111340850A
- Application number: CN202010198544.9A
- Authority: CN (China)
- Prior art keywords: network, frame image, target, feature extraction, search
- Prior art date: 2020-03-20
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments (G: Physics; G06: Computing, calculating or counting; G06T: Image data processing or generation, in general; G06T7/00: Image analysis; G06T7/20: Analysis of motion)
- G06T2207/10016: Video; image sequence (G06T2207/00: Indexing scheme for image analysis or image enhancement; G06T2207/10: Image acquisition modality)
- G06T2207/20081: Training; learning (G06T2207/20: Special algorithmic details)
- G06T2207/20084: Artificial neural networks [ANN] (G06T2207/20: Special algorithmic details)
Abstract
The invention discloses an unmanned aerial vehicle ground target tracking method based on a twin network and a central logic loss, which aims to solve problems of the existing target tracking technology such as a large number of network parameters, heavy computation, and imbalance between positive and negative training samples; it belongs to the technical field of computer image processing. The method comprises the following steps: extracting a target feature map of a first frame image, in which the target position is known, using a first feature extraction network, and extracting a search feature map of a second frame image using a second feature extraction network; calculating the cross-correlation between the search area of the second frame image and the target area of the first frame image from the target feature map and the search feature map to obtain a score response map of the second frame image, and then obtaining the target position in the second frame image from the score response map. The first feature extraction network and the second feature extraction network are the two branches of a twin convolutional network, each consisting of a lightweight convolutional neural network.
Description
Technical Field
The invention belongs to the technical field of computer image processing, and particularly relates to a ground target tracking method of an unmanned aerial vehicle based on a twin network and central logic loss.
Background
Visual tracking algorithms based on deep learning, such as the fully convolutional twin-network tracker shown in fig. 1, have been widely adopted by developers and users.
In this method, a template image extracted from the first frame and a search image extracted from each subsequent frame are fed into two sub-networks that extract high-level semantic features, and the features are then cross-correlated to obtain the similarity of the template image at each position in the search image. Typically the parameters of the two sub-networks are shared and can be learned offline from training data. Because the high-level semantic features carry rich semantics related to the target category, the method is robust to target appearance changes caused by occlusion, distortion and the like, and the network does not need to be updated during tracking, which greatly reduces the computation of the algorithm and ensures real-time performance. The network has two inputs, the target template image and the search area image, which are passed through the parameter-sharing twin sub-networks of a twin neural network for feature extraction.
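As an illustration only, this cross-correlation step can be sketched in PyTorch (the framework choice and the feature-map shapes below are assumptions for the example, not specified by the method):

```python
import torch
import torch.nn.functional as F

def cross_correlation(target_feat: torch.Tensor, search_feat: torch.Tensor) -> torch.Tensor:
    # conv2d with the template feature map as the kernel computes the
    # similarity of the template at every position of the search features
    return F.conv2d(search_feat, target_feat)

# assumed SiamFC-style shapes: 256-channel features, 6x6 template, 22x22 search
target_feat = torch.randn(1, 256, 6, 6)    # target feature map (template branch)
search_feat = torch.randn(1, 256, 22, 22)  # search feature map (search branch)
score_map = cross_correlation(target_feat, search_feat)  # shape (1, 1, 17, 17)
```

The peak of the resulting score map indicates the most likely position of the target within the search region.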
However, such methods are mainly aimed at general visual tracking tasks and are ill-suited to the hardware platform of an unmanned aerial vehicle, whose computing and storage resources are limited. First, a convolutional neural network contains a very large number of weight parameters, and storing them places heavy demands on device memory. Second, the limited computing resources of embedded hardware platforms such as UAVs make it difficult to perform the convolution operations in a convolutional neural network efficiently and in real time.
Meanwhile, in the actual tracking process, such implementations require dense detection of the target at every position of the search area. In a UAV aerial image, the search area generally contains many negative samples with a simple background (such as area 3 in fig. 2), few difficult negative samples (such as area 1 in fig. 2), and a positive sample containing the foreground target (such as area 2 in fig. 2). The large number of simple-background negative samples unbalances the training samples and dominates the training of the network, causing model degradation.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle ground target tracking method based on a twin network and a central logic loss, designed for the characteristics of UAV hardware platforms, so as to solve problems of the existing target tracking technology such as a large number of network parameters, heavy computation, and imbalance between positive and negative training samples.
According to a first aspect of the present invention, a training method for a model for visual tracking of ground targets by an unmanned aerial vehicle is provided, where the tracking model is a twin convolutional network whose two branches are a first feature extraction network and a second feature extraction network respectively. The training method includes:
acquiring a video sequence dataset, wherein the dataset comprises paired template images and search images;
extracting a target feature map of the template image using the first feature extraction network, and extracting a search feature map of the search image using the second feature extraction network;
calculating the cross-correlation between the search area of the search image and the target area of the template image from the target feature map and the search feature map to obtain a score response map of the search image;
calculating the difference between the score response map and the true value according to a central logic loss function to obtain a difference result; and
back-propagating the difference result to adjust the weights of the layers in the twin convolutional network;
wherein the first feature extraction network and the second feature extraction network each consist of a lightweight convolutional neural network.
Optionally, the lightweight convolutional neural network is a MobileNetV2 model.
Optionally, the central logic loss function is:
ℓ(y, v) = [a / (1 + exp(b · yv))] · log(1 + exp(−yv))
wherein v ∈ R^(m×n) is the score map output by the network, y ∈ {+1, −1} is the manually annotated true value, and a / (1 + exp(b · yv)) is a modulation factor for the logic loss that adaptively adjusts the contribution of each training sample to the training loss according to the input yv.
Further, when yv > 0, the modulation factor assigns a first weight to the logic loss; when yv < 0, the modulation factor assigns a second weight to the logic loss, the first weight being less than the second weight.
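For illustration, the true-value label map y over the score map might be constructed as follows; the radius rule is a SiamFC-style assumption, since the text only specifies y ∈ {+1, −1}:

```python
import torch

def make_label_map(size: int, radius: float) -> torch.Tensor:
    """Label map y in {+1, -1} for a size x size score map: positions near
    the map centre (where the target lies) are positive, the rest negative."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    yy, xx = torch.meshgrid(coords, coords, indexing="ij")
    dist = torch.sqrt(xx ** 2 + yy ** 2)
    return torch.where(dist <= radius, torch.ones_like(dist), -torch.ones_like(dist))

label = make_label_map(17, radius=2.0)  # e.g. for a 17 x 17 score response map
```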
According to a second aspect of the invention, an unmanned aerial vehicle ground target visual tracking method comprises the following steps:
extracting a target feature map of a first frame image, in which the target position is known, using a first feature extraction network, and extracting a search feature map of a second frame image using a second feature extraction network;
calculating the cross-correlation between the search area of the second frame image and the target area of the first frame image from the target feature map and the search feature map to obtain a score response map of the second frame image, and then obtaining the target position in the second frame image from the score response map;
wherein the first feature extraction network and the second feature extraction network are the two branches of a twin convolutional network and each consist of a lightweight convolutional neural network, the lightweight convolutional neural network being the MobileNetV2 model.
According to a third aspect of the present invention, an unmanned aerial vehicle visual tracking device for a ground target comprises:
the identification unit is used for extracting a target feature map of a first frame image with a known target position by using a first feature extraction network and extracting a search feature map of a second frame image by using a second feature extraction network;
the calculating unit is used for calculating the cross correlation between the searching area of the second frame image and the target area of the first frame image according to the target characteristic graph and the searching characteristic graph to obtain a score response graph of the second frame image;
the determining unit is used for obtaining the target position of the second frame image according to the score response image of the second frame image;
the first feature extraction network and the second feature extraction network in the identification unit are two branches of a twin convolution network and respectively consist of a lightweight convolution neural network.
According to a fourth aspect of the invention, an electronic device comprises:
at least one processor; and
a memory communicatively coupled to the processor and storing instructions executable by the processor; when the instructions are executed by the processor, the processor performs the unmanned aerial vehicle ground-target visual tracking method, or the training method.
According to a fifth aspect of the present invention, a readable storage medium stores a computer program which, when executed by a processor, implements the unmanned aerial vehicle ground-target visual tracking method or the training method.
The invention adopts the MobileNetV2 model, a lightweight network, as the feature extraction sub-network at the front end of the deep framework, which reduces the computational complexity and the number of parameters of the convolutional neural network. At the same time it strikes a good balance between processing speed and accuracy, so the method can adapt to the limited storage and computing resources of a UAV hardware platform.
In addition, by adopting the central logic loss function the invention applies different weights to different training samples in the search area, which solves the imbalance between positive and negative training samples, avoids network degradation during offline training, and makes the learned convolutional features more discriminative.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the invention, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a conventional twin network structure;
FIG. 2 is a practical target tracking scenario including a simple negative example (area 3), a difficult negative example (area 1), and a positive example (area 2) containing a foreground target;
FIG. 3 is a schematic diagram of a visual tracking network model structure according to an embodiment of the invention;
FIG. 4 is a schematic diagram of a MobileNet V2 network structure;
FIG. 5 is a schematic flow chart of a visual tracking network training and tracking method according to an embodiment of the present invention;
fig. 6 is a schematic flow chart of a method for visually tracking a ground target by an unmanned aerial vehicle according to an embodiment of the invention;
fig. 7 is a schematic structural diagram of a visual tracking device for a ground target of an unmanned aerial vehicle according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the following description, UAV (Unmanned Aerial Vehicle) refers primarily to an aircraft without an onboard pilot that is operated using a radio remote-control device and its own program-control device, or that operates autonomously, either completely or intermittently, under an onboard computer.
FIG. 3 illustrates a twin network based visual tracking network constructed in accordance with an embodiment of the invention.
As shown in fig. 3, the visual tracking network includes a first branch and a second branch with the same structure, and each branch includes a feature extraction network for extracting a feature map of an image. In the invention, the feature extraction network of each branch consists of a lightweight convolutional neural network. According to one embodiment of the invention, the lightweight convolutional neural network is the MobileNetV2 model.
Fig. 4 shows the network structure of the MobileNetV2 model. MobileNetV2 uses depthwise separable convolution as an efficient building block. In addition, MobileNetV2 introduces two new architectural features: 1) linear bottleneck layers between the layers; and 2) shortcut connections between the bottleneck layers. This architecture also brings MobileNetV2 additional benefits in security, privacy, and power consumption.
In each building block, the MobileNetV2 network therefore places a first 1 × 1 convolutional layer before the depthwise (DW) convolutional layer to expand the number of channels, extracts features with the DW convolutional layer, and then compresses the result with a third convolutional layer. The network thus realizes an expansion → convolutional feature extraction → compression process and avoids the problem of channel information loss.
Meanwhile, ReLU functions are placed at the outputs of the first and second convolutional layers. Because the expansion → convolution → compression process above is used, a problem arises after compression: the ReLU function destroys features, since it outputs zero for all negative inputs. To avoid further loss of features, a linear function is therefore used as the activation of the third convolutional layer.
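A minimal sketch of such a building block, assuming PyTorch and typical MobileNetV2 hyperparameters (expansion ratio 6, 3 × 3 depthwise kernel); note that the reference MobileNetV2 uses ReLU6 where the description above says ReLU:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Expansion -> depthwise convolution -> linear compression, as described above."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1, expand_ratio: int = 6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_shortcut = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),   # first 1x1 conv: expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),                    # ReLU after the first conv
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),      # DW conv: feature extraction
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),                    # ReLU after the second (DW) conv
            nn.Conv2d(hidden, out_ch, 1, bias=False),  # third 1x1 conv: compression
            nn.BatchNorm2d(out_ch),                    # linear output: no ReLU here
        )

    def forward(self, x):
        out = self.block(x)
        # shortcut connection between bottleneck layers when shapes match
        return x + out if self.use_shortcut else out
```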
The twin-network-based visual tracking network provided by the invention adopts a central logic loss function:
ℓ(y, v) = [a / (1 + exp(b · yv))] · log(1 + exp(−yv))
wherein:
v ∈ R^(m×n) is the score map output by the network, and m × n is the size of the score map;
y ∈ {+1, −1} is the manually annotated true value;
a / (1 + exp(b · yv)) is a modulation factor for the logic loss, and a and b are parameters of the modulation factor; for example, according to an embodiment of the present invention, a = 2 and b = 1.
The modulation factor adaptively adjusts the contribution of each training sample to the training loss according to the input yv. When yv > 0, the sample is a simple sample and the modulation factor assigns a smaller weight to the logic loss; conversely, when yv < 0, the sample is a difficult sample and the modulation factor assigns a greater weight to the logic loss.
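Under the assumption, reconstructed above, that the modulation factor multiplies the standard logistic loss log(1 + exp(−yv)) and the result is averaged over the score map, the central logic loss can be sketched as:

```python
import torch
import torch.nn.functional as F

def central_logic_loss(v: torch.Tensor, y: torch.Tensor,
                       a: float = 2.0, b: float = 1.0) -> torch.Tensor:
    """v: score map output by the network; y: labels in {+1, -1} of the same shape."""
    yv = y * v
    modulation = a / (1.0 + torch.exp(b * yv))  # small when yv > 0 (simple sample),
                                                # close to a when yv < 0 (difficult sample)
    logistic = F.softplus(-yv)                  # numerically stable log(1 + exp(-yv))
    return (modulation * logistic).mean()       # average over the m x n score map
```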
Most convolutional neural networks use the logic loss or the cross-entropy loss as the supervisory signal when training a deep model; features learned with these loss functions have good separability but poor discriminability.
Unlike closed-set problems such as object classification and recognition, target tracking is an open-set problem: the features output by the deep model must be not only separable but also strongly discriminative. When processing a long-tailed dataset, in which most samples belong to a few classes and many other classes have very few samples, weighting the losses of the different classes is tricky. For the target tracking problem, foreground targets are easy to collect as positive samples, while the difficult negative samples in the background that are useful for training are few.
Therefore, the invention uses the central logic loss function in an end-to-end form to adaptively adjust the proportion of positive and negative samples, preventing the learned network from being affected by training-sample imbalance. Specifically, by applying different weights to different samples with the central logic loss function, the invention resolves the imbalance between foreground and background training samples.
Fig. 5 shows an exemplary process of visual tracking network training and unmanned aerial vehicle-to-ground target visual tracking using the trained network model according to an embodiment of the present invention.
As shown in fig. 5, the training process of the visual tracking network in the present invention includes a pre-training phase and a fine-tuning phase.
In the pre-training stage, for example, the video database of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is adopted as the sample video sequences, and the annotated sample video sequences are used to train the visual tracking network. During training, the maximum number of iterations and the learning rate of the visual tracking network are set, a network parameter initialization method and a back-propagation method are selected, and the network parameters are optimized.
According to an embodiment of the invention, the maximum number of iterations, the learning rate, the initialization method and the back-propagation method are set as follows (a training sketch follows the list):
maximum number of iterations: 50 epochs;
initial learning rate: 0.001;
network initialization method: Xavier initialization;
back-propagation method: stochastic gradient descent.
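A sketch of this pre-training setup, reusing the central_logic_loss sketch above; the model name tracking_net, its two-input forward signature, and the loader interface are assumptions:

```python
import torch

def pretrain(tracking_net: torch.nn.Module, train_loader) -> None:
    # Xavier initialization of the convolutional weights
    for m in tracking_net.modules():
        if isinstance(m, torch.nn.Conv2d):
            torch.nn.init.xavier_uniform_(m.weight)
    # stochastic gradient descent with initial learning rate 0.001
    optimizer = torch.optim.SGD(tracking_net.parameters(), lr=0.001)
    tracking_net.train()
    for epoch in range(50):  # maximum number of iterations: 50 epochs
        for template, search, label in train_loader:
            score_map = tracking_net(template, search)   # forward both branches
            loss = central_logic_loss(score_map, label)  # difference vs. true value
            optimizer.zero_grad()
            loss.backward()                              # back-propagate the difference
            optimizer.step()                             # adjust the layer weights
```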
After the pre-training phase is completed, the network parameters are further optimized through a fine-tuning phase.
In the fine-tuning stage, a video dataset is formed from footage of ground targets collected by an unmanned aerial vehicle; each frame of the video dataset is labeled by category; the labeled video dataset is divided into a training set, a validation set and a test set; and finally the labeled data are processed into data types that the visual tracking network model can read.
The pre-trained visual tracking network is then trained with the training set and the validation set so as to fine-tune it, and the structure and parameters of the fine-tuned visual tracking network model are retained.
The fine-tuned visual tracking network is tested with the test set to obtain the tracking accuracy. If the tracking accuracy meets the actual engineering requirements, the visual tracking network model can be applied to the actual task of the unmanned aerial vehicle identifying and tracking a specific ground target. Otherwise, the training set does not meet the requirements and needs to be expanded, and the pre-training and fine-tuning steps are restarted until the requirements are met.
During training, the invention adopts the central logic loss function:
ℓ(y, v) = [a / (1 + exp(b · yv))] · log(1 + exp(−yv))
wherein:
v ∈ R^(m×n) is the score map output by the network;
y ∈ {+1, −1} is the manually annotated true value;
a / (1 + exp(b · yv)) is a modulation factor for the logic loss that adaptively adjusts the contribution of each training sample to the training loss according to the input yv.
When yv > 0, the sample is a simple sample and the modulation factor assigns a smaller weight to the logic loss; conversely, when yv < 0, the sample is a difficult sample and the modulation factor assigns a greater weight to the logic loss.
Therefore, by applying different weights to different samples with the central logic loss function, the invention can handle the imbalance between foreground and background training samples.
After the network model training is completed, the visual tracking network is applied to an actual scene of the unmanned aerial vehicle for tracking the ground target, and the target in the video acquired by the unmanned aerial vehicle is tracked.
Fig. 5 and 6 show schematic flows of the twin network-based unmanned aerial vehicle ground target tracking method according to the embodiment of the invention.
As shown in fig. 6, the method includes the steps of:
extracting a target feature map of a first frame image (i.e., the template image), in which the target position is known, using a first feature extraction network, and extracting a search feature map of a second frame image (i.e., the search image) using a second feature extraction network;
calculating the cross-correlation between the search area of the second frame image and the target area of the first frame image from the target feature map and the search feature map to obtain a score response map of the second frame image, and then obtaining the target position in the second frame image from the score response map;
the first feature extraction network and the second feature extraction network are two branches of a twin convolutional network and respectively consist of a lightweight convolutional neural network.
For example, the target in the 1st frame of the video sequence is calibrated, and the target position in the 2nd frame can then be obtained with the visual tracking network; the target position in the 3rd frame can in turn be obtained with the visual tracking network from the calibration result of the 1st frame or the tracking result of the 2nd frame. By analogy, the target position in every frame of the video sequence can be obtained, realizing target tracking over the video sequence, as sketched below.
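A sketch of this frame-by-frame procedure follows; the attribute names branch1/branch2 and the helpers crop_template, crop_search and box_from_peak are hypothetical, standing in for details the description leaves open:

```python
import torch

def track_sequence(tracking_net, frames, init_box):
    """frames: list of images; init_box: calibrated target box in the 1st frame."""
    template = crop_template(frames[0], init_box)       # hypothetical helper
    with torch.no_grad():
        target_feat = tracking_net.branch1(template)    # extract template features once
        boxes = [init_box]
        for frame in frames[1:]:
            search = crop_search(frame, boxes[-1])      # hypothetical helper: region
                                                        # around the previous position
            search_feat = tracking_net.branch2(search)
            score_map = cross_correlation(target_feat, search_feat)
            peak = score_map.flatten().argmax()         # highest score = target location
            boxes.append(box_from_peak(peak, score_map.shape, boxes[-1]))  # hypothetical
    return boxes
```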
As shown in fig. 5, the method further includes updating the parameters of the feature extraction networks according to the score response map obtained by the cross-correlation calculation, so as to further improve the accuracy and reliability of the feature extraction networks.
Fig. 7 shows an unmanned aerial vehicle ground-target visual tracking apparatus according to an embodiment of the present invention, including:
the identification unit 701 is used for extracting a target feature map of a first frame image with a known target position by using a first feature extraction network and extracting a search feature map of a second frame image by using a second feature extraction network;
a calculating unit 702, configured to calculate, according to the target feature map and the search feature map, a cross-correlation between a search region of the second frame image and a target region of the first frame image, to obtain a score response map of the second frame image;
a determining unit 703, configured to obtain a target position of the second frame image according to the score response map of the second frame image;
the first feature extraction network and the second feature extraction network in the identification unit are two branches of a twin convolution network and respectively consist of a lightweight convolution neural network.
An embodiment of the present invention further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the processor and storing instructions executable by the processor; when the instructions are executed by the processor, the processor performs the unmanned aerial vehicle ground-target visual tracking method, or the training method.
Alternatively, the memory may be separate or integrated with the processor.
When the memory is independently provided, the electronic device further comprises a bus for connecting the memory and the processor.
Further, an embodiment of the present invention also provides a readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the unmanned aerial vehicle ground-target visual tracking method or the training method.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit. The unit formed by the modules can be realized in a hardware form, and can also be realized in a form of hardware and a software functional unit.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor to execute some steps of the methods described in the embodiments of the present application.
It should be understood that the processor may be a Central Processing Unit (CPU), but may also be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory and may further comprise non-volatile memory (NVM), such as at least one disk memory, and may also be a USB disk, a removable hard disk, a read-only memory, a magnetic disk or an optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC); alternatively, the processor and the storage medium may reside as discrete components in an electronic device or host device.
Those skilled in the art will appreciate that all or part of the steps for implementing the above-described method embodiments may be implemented by hardware associated with program instructions. The program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the method embodiments described above. The storage medium includes various media that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.
The above embodiments are only used to illustrate the technical solution of the present invention, not to limit it. While the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications or substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (9)
1. A training method for a model for visual tracking of ground targets by an unmanned aerial vehicle, the tracking model being a twin convolutional network whose two branches are a first feature extraction network and a second feature extraction network respectively, characterized in that the training method comprises:
acquiring a video sequence data set, wherein the data set comprises a template image and a search image which are paired;
extracting a target feature map of the template image by using a first feature extraction network, and extracting a search feature map of the search image by using a second feature extraction network;
calculating the cross-correlation between the search area of the search image and the target area of the template image from the target feature map and the search feature map to obtain a score response map of the search image;
calculating the difference between the score response map and the true value according to a central logic loss function to obtain a difference result; and
back-propagating the difference result to adjust the weights of the layers in the twin convolutional network;
the first feature extraction network and the second feature extraction network are respectively composed of a lightweight convolutional neural network.
2. The training method of claim 1, wherein the lightweight convolutional neural network is a MobileNetV2 model.
3. The training method according to claim 1 or 2, characterized in that the central logic loss function is:
ℓ(y, v) = [a / (1 + exp(b · yv))] · log(1 + exp(−yv))
wherein v ∈ R^(m×n) is the score map output by the network, y ∈ {+1, −1} is the manually annotated true value, a / (1 + exp(b · yv)) is a modulation factor for the logic loss that adaptively adjusts the contribution of each training sample to the training loss according to the input yv, and a and b are parameters of the modulation factor.
4. The training method according to claim 3, wherein when yv > 0 the modulation factor assigns a first weight to the logic loss, and when yv < 0 the modulation factor assigns a second weight to the logic loss, the first weight being less than the second weight.
5. The training method according to claim 3, wherein the parameters of the modulation factor of the central logic loss function are a = 2 and b = 1.
6. An unmanned aerial vehicle ground target visual tracking method is characterized by comprising the following steps:
extracting a target feature map of a first frame image at a known target position by using a first feature extraction network, and extracting a search feature map of a second frame image by using a second feature extraction network;
calculating the cross-correlation between the search area of the second frame image and the target area of the first frame image from the target feature map and the search feature map to obtain a score response map of the second frame image, and then obtaining the target position of the second frame image from the score response map;
wherein the first feature extraction network and the second feature extraction network are the two branches of a twin convolutional network and each consist of a lightweight convolutional neural network, the lightweight convolutional neural network being the MobileNetV2 model.
7. An unmanned aerial vehicle ground-target visual tracking device, characterized by comprising:
the identification unit is used for extracting a target feature map of a first frame image with a known target position by using a first feature extraction network and extracting a search feature map of a second frame image by using a second feature extraction network;
the calculating unit is used for calculating the cross correlation between the searching area of the second frame image and the target area of the first frame image according to the target characteristic graph and the searching characteristic graph to obtain a score response graph of the second frame image;
the determining unit is used for obtaining the target position of the second frame image according to the score response image of the second frame image;
the first feature extraction network and the second feature extraction network in the identification unit are two branches of a twin convolutional network and respectively comprise a lightweight convolutional neural network, and the lightweight convolutional neural network is a MobileNet V2 model.
8. An electronic device, comprising:
at least one processor; and a memory communicatively coupled to the processor and storing instructions executable by the processor; when the instructions are executed by the processor, the processor performs the unmanned aerial vehicle ground-target visual tracking method of claim 6, or the training method of any one of claims 1-5.
9. A readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the unmanned aerial vehicle ground-target visual tracking method of claim 6, or the training method of any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010198544.9A CN111340850A (en) | 2020-03-20 | 2020-03-20 | Ground target tracking method of unmanned aerial vehicle based on twin network and central logic loss |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010198544.9A CN111340850A (en) | 2020-03-20 | 2020-03-20 | Ground target tracking method of unmanned aerial vehicle based on twin network and central logic loss |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111340850A (en) | 2020-06-26
Family
ID=71184196
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010198544.9A Pending CN111340850A (en) | 2020-03-20 | 2020-03-20 | Ground target tracking method of unmanned aerial vehicle based on twin network and central logic loss |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111340850A (en) |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180129906A1 (en) * | 2016-11-07 | 2018-05-10 | Qualcomm Incorporated | Deep cross-correlation learning for object tracking |
CN107562805A (en) * | 2017-08-08 | 2018-01-09 | Method and device for searching images by image |
CN107992826A (en) * | 2017-12-01 | 2018-05-04 | People-flow detection method based on a deep twin network |
CN108665485A (en) * | 2018-04-16 | 2018-10-16 | Target tracking method based on correlation filtering fused with a twin convolutional network |
CN109191491A (en) * | 2018-08-03 | 2019-01-11 | Target tracking method and system using a fully convolutional twin network based on multi-layer feature fusion |
CN110084777A (en) * | 2018-11-05 | 2019-08-02 | Micro-parts positioning and tracking method based on deep learning |
CN109815332A (en) * | 2019-01-07 | 2019-05-28 | Loss function optimization method and device, computer equipment and storage medium |
CN109816695A (en) * | 2019-01-31 | 2019-05-28 | Target detection and tracking method for infrared small unmanned aerial vehicle under complex background |
CN112580416A (en) * | 2019-09-27 | 2021-03-30 | 英特尔公司 | Video tracking based on deep Siam network and Bayesian optimization |
US20220130130A1 (en) * | 2019-09-27 | 2022-04-28 | Intel Corporation | Video tracking with deep siamese networks and bayesian optimization |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860540B (en) * | 2020-07-20 | 2024-01-12 | 深圳大学 | Neural network image feature extraction system based on FPGA |
CN111860540A (en) * | 2020-07-20 | 2020-10-30 | 深圳大学 | Neural network image feature extraction system based on FPGA |
CN111627050A (en) * | 2020-07-27 | 2020-09-04 | 杭州雄迈集成电路技术股份有限公司 | Training method and device for target tracking model |
CN112037254A (en) * | 2020-08-11 | 2020-12-04 | 浙江大华技术股份有限公司 | Target tracking method and related device |
CN112070175A (en) * | 2020-09-04 | 2020-12-11 | 湖南国科微电子股份有限公司 | Visual odometer method, device, electronic equipment and storage medium |
CN112070175B (en) * | 2020-09-04 | 2024-06-07 | 湖南国科微电子股份有限公司 | Visual odometer method, visual odometer device, electronic equipment and storage medium |
CN112116635A (en) * | 2020-09-17 | 2020-12-22 | 赵龙 | Visual tracking method and device based on rapid human body movement |
CN112633517B (en) * | 2020-12-29 | 2024-02-02 | 重庆星环人工智能科技研究院有限公司 | Training method of machine learning model, computer equipment and storage medium |
CN112633517A (en) * | 2020-12-29 | 2021-04-09 | 重庆星环人工智能科技研究院有限公司 | Training method of machine learning model, computer equipment and storage medium |
CN113033397A (en) * | 2021-03-25 | 2021-06-25 | 开放智能机器(上海)有限公司 | Target tracking method, device, equipment, medium and program product |
WO2022236824A1 (en) * | 2021-05-14 | 2022-11-17 | 北京大学深圳研究生院 | Target detection network construction optimization method, apparatus and device, and medium and product |
CN113379797A (en) * | 2021-06-01 | 2021-09-10 | 大连海事大学 | Real-time tracking method and system for observation target of unmanned aerial vehicle |
CN113569696A (en) * | 2021-07-22 | 2021-10-29 | 福建师范大学 | Method for extracting human body micro tremor signal based on video |
CN113569696B (en) * | 2021-07-22 | 2023-06-06 | 福建师范大学 | Method for extracting human body micro tremor signals based on video |
CN113888595A (en) * | 2021-09-29 | 2022-01-04 | 中国海洋大学 | Twin network single-target visual tracking method based on difficult sample mining |
CN113888595B (en) * | 2021-09-29 | 2024-05-14 | 中国海洋大学 | Twin network single-target visual tracking method based on difficult sample mining |
WO2024065389A1 (en) * | 2022-09-29 | 2024-04-04 | 京东方科技集团股份有限公司 | Method and system for detecting camera interference, and electronic device |
Legal Events

Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20200626