CN116805327A - An infrared small target tracking method based on dynamic convolution kernel

Info

Publication number: CN116805327A
Application number: CN202310828952.1A
Authority: CN
Inventors: 马超; 赵晗馨; 黄源; 候毅; 乔木
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-07-07
Filing date: 2023-07-07
Publication date: 2023-09-26

Abstract

The invention discloses a method for tracking small infrared targets based on a dynamic convolution kernel. To track a weak, small target in an infrared image sequence, the method comprises: obtaining an infrared video frame sequence containing a small target; selecting the target in the first frame of the sequence, expanding outward from the target center by N times the side length, and reshaping the result into a square to serve as the template; expanding, in the i-th frame of the sequence, by 2N times the side length around the target center coordinates, and reshaping the result into a square to serve as the search area; inputting the template and the search area into a tracking network based on a dynamic convolution module for matching, and outputting the target in the single frame; then expanding by 2N times the side length around the target center coordinates obtained in that frame, reshaping the result into a square to serve as the new search area, inputting the template and the new search area into the tracking network for matching, and outputting the target in the single frame, until all frames have been tracked. The invention addresses the difficulty of extracting features from small infrared targets.

Description

Infrared small target tracking method based on dynamic convolution kernel

Technical Field

The application relates to infrared target tracking technology, and in particular to an infrared small target tracking method based on a dynamic convolution kernel.

Background

Infrared small target tracking is a fundamental but challenging research task in computer vision. Its aim is to predict the position and shape of a target in subsequent frames after the target of interest has been selected in the initial frame of a video. Infrared imaging systems offer a long operating range, good adaptability to weather and background, strong resistance to interference, and all-weather operation, so infrared target tracking plays an important role in fields such as security and the military.

Infrared targets often appear in complex environments such as ground backgrounds, ground-sky boundary backgrounds, and sky backgrounds. When the environment is complex, interference from similar objects arises easily because small infrared targets are hard to distinguish. In addition, the signal-to-noise ratio of infrared images is low and the target is often submerged in large areas of noise, which readily causes tracking drift.

Traditional infrared target tracking algorithms mainly include algorithms based on image filtering, algorithms based on feature matching, and algorithms based on target-background classification. When the signal-to-noise ratio of the infrared image is low, the background is complex, and interference from similar targets (false alarms) is present, the performance of these algorithms degrades markedly.

Existing methods enhance local contrast in a preprocessing stage, for example by using Gaussian curvature filtering, IHBF, or LoG (Laplacian of Gaussian) filtering to raise the contrast between target and background so that strong features can be extracted for tracking. These methods are designed for specific classes of small infrared targets and lack generality. Neural-network-based methods can learn deep features suited to a variety of complex scenes by training on a large number of samples. However, current methods mainly use backbone networks that are generic to the image domain (such as the residual network ResNet) to extract features of small infrared targets, and are not adapted to the characteristics of such targets.

Disclosure of Invention

The technical problem to be solved by the application is, in view of the shortcomings of the prior art, to provide an infrared small target tracking method based on a dynamic convolution kernel that reduces the difficulty of extracting features of small infrared targets.

To solve the above technical problem, the application adopts the following technical solution: an infrared small target tracking method based on a dynamic convolution kernel, comprising the following steps:

S1: acquiring an infrared video frame sequence containing a small target;

S2: selecting a target in the first frame of the infrared video frame sequence, expanding from the center of the target by N times the side length, and reshaping the result into a square to serve as a template, wherein the template contains appearance information of the target and of the local scene surrounding the target; expanding, in the i-th frame of the infrared video frame sequence, by 2N times the side length around the target center coordinates, and reshaping the result into a square to serve as a search area, where i > 1;

S3: inputting the template and the search area into a tracking network based on a dynamic convolution module for matching, and outputting the target in the single frame;

S4: expanding by 2N times the side length around the center coordinates of the target in the single frame, reshaping the result into a square to serve as a new search area, inputting the template and the new search area into the tracking network based on the dynamic convolution module for matching, and outputting the target in the single frame, until all frames have been tracked.
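The cropping in steps S2 and S4 can be illustrated with a minimal NumPy sketch. It is not taken from the application: the function name `crop_square`, the choice of the longer box side as "the side length", and the mean-value fill for pixels outside the frame are all assumptions made for illustration, and the resizing of the crop to a fixed network input size is omitted.

```python
import numpy as np

def crop_square(frame, center, box_size, scale):
    """Crop a square of side scale * max(box_size) centred on `center`.

    frame: 2-D array (H, W); center: (row, col); box_size: (height, width).
    Pixels outside the frame are filled with the frame mean (an assumption).
    """
    side = int(round(scale * max(box_size)))
    half = side // 2
    top = int(round(center[0])) - half
    left = int(round(center[1])) - half
    out = np.full((side, side), frame.mean(), dtype=np.float64)
    # Intersection of the requested window with the frame.
    r0, r1 = max(top, 0), min(top + side, frame.shape[0])
    c0, c1 = max(left, 0), min(left + side, frame.shape[1])
    if r0 < r1 and c0 < c1:
        out[r0 - top:r1 - top, c0 - left:c1 - left] = frame[r0:r1, c0:c1]
    return out

N = 2                                   # expansion factor from the method
frame = np.zeros((64, 64))
frame[30:33, 40:43] = 1.0               # synthetic 3 x 3 target
template = crop_square(frame, (31, 41), (3, 3), N)       # N times the side
search = crop_square(frame, (31, 41), (3, 3), 2 * N)     # 2N times the side
print(template.shape, search.shape)
```

With N = 2 the search area has twice the side length of the template, so the target can move by roughly one template width between frames and still fall inside the search area.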

To address the difficulty of extracting features from small infrared targets, the dynamic-convolution-based tracking method of the application uses multiple layers of dynamic convolution modules to mine the most informative content in the template and search-area features, thereby making feature extraction for small infrared targets easier. The dynamic convolution module generates dynamic convolution kernels guided by the template, so strong features are extracted and the method suits weak targets against complex backgrounds; the trained network is therefore more flexible and the tracking accuracy is improved.

The tracking network based on the dynamic convolution modules comprises at least one dynamic convolution module connected in series, and all the dynamic convolution modules are connected with a backbone network of the Siamese tracker.

The application introduces dynamic convolution into the field of infrared small target tracking, and provides a method for mapping template features into dynamic convolution kernels, so that feature expression capability is effectively improved.

The number of the dynamic convolution modules is greater than or equal to 1.

The dynamic convolution module comprises a first convolution layer and a convolution unit formed by a plurality of second convolution layers connected in series; the input of the first convolution layer is the template features, and the input of the convolution unit is the search-area features; the features output by the first convolution layer and by the convolution unit are concatenated and input into a third convolution layer; the third convolution layer maps the concatenated features into dynamic convolution kernels, and the dynamic convolution kernels are convolved with the search-area features to obtain a response map; the response map is concatenated with the search-area features to obtain the output of the dynamic convolution module.

The Siamese tracker comprises a backbone network and a similarity calculation part; the backbone network is used for extracting features of the template and the search area, and the similarity calculation part obtains a subarea similar to the template in the search area, and the subarea is the output target.

The backbone network comprises two feature extraction modules, which are used to extract the features of the template and the features of the search area, respectively; each feature extraction module is a ResNet50 network, and the output of the fourth stage of the ResNet50 network is the final output.

The convolution stride of the downsampling unit in the fourth stage of the ResNet50 network is 1, to obtain a higher feature resolution.

The 3×3 convolution of the fourth stage is replaced with a dilated convolution with a stride of 2, to enlarge the receptive field.

The sub-region of the search area that is similar to the template is obtained by cross-correlation.

Compared with the prior art, the application has the following beneficial effects:

1. The method introduces dynamic convolution into the field of infrared small target tracking and effectively improves feature expression capability.

2. Multiple layers of dynamic convolution modules are used to mine the most informative content in the template and search-area features, which resolves the difficulty of extracting features from small infrared targets.

Drawings

FIG. 1 is a diagram of the tracking network architecture based on a dynamic convolution kernel according to an embodiment of the present application;

FIG. 2 illustrates the process of convolving with a dynamic convolution kernel according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a dynamic convolution module according to an embodiment of the present application;

FIG. 4 is a block diagram of a Siamese tracker according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The terms "first," "second," and the like herein do not denote any order, quantity, or importance, but are used to distinguish one element from another. The terms "a," "an," and similar words do not mean that there is only one of the thing; they indicate that the description is directed to one of the things, of which there may be one or more. In this document, the terms "comprise," "include," and similar words denote a logical relationship and are not to be construed as implying a spatial or structural relationship. For example, "A includes B" means that B logically belongs to A, not that B is spatially located inside A. In addition, these terms should be construed as open-ended rather than closed-ended. For example, "A includes B" means that B belongs to A, but B does not necessarily constitute all of A, and A may also include other elements such as C, D, and E.

The embodiment of the application provides an infrared small target tracking method based on a dynamic convolution kernel, which comprises the following steps:

Step one: inputting an infrared video sequence containing a small target;

Step two: giving the ground truth of the target of interest in the first frame (the ground truth comprises the coordinate values and the length and width of the target), or selecting the target with a bounding box in the first frame; expanding from the target center by two times the side length and reshaping the result into a square to serve as the template, which contains appearance information of the target and of its local surroundings;

Step three: in the second frame, expanding by four times the side length around the center coordinates of the target and reshaping the result into a square to serve as the search area, which generally covers the possible range of motion of the target;

Step four: inputting the template and the search area into the tracking network based on the dynamic convolution kernel for matching, and outputting the target in the single frame;

Step five: expanding by four times the side length around the target center coordinates obtained in the previous step, reshaping the result into a square to serve as the new search area, inputting the template and the new search area into the tracking network based on the dynamic convolution kernel for matching, and outputting the target in the single frame.

Steps three to five are repeated for the remaining frames until all frames have been tracked.
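The frame-by-frame loop of steps one to five can be sketched as follows. This is only an illustration under stated assumptions: `window`, `track_sequence`, and `match_fn` are hypothetical names, zero padding at the frame border is assumed, and `match_fn` stands in for the tracking network (in the demo it simply returns the brightest pixel of the search area and ignores the template).

```python
import numpy as np

def window(frame, center, side):
    """Return a side x side crop centred on `center` and its top-left offset."""
    top = int(center[0]) - side // 2
    left = int(center[1]) - side // 2
    padded = np.pad(frame, side, mode="constant")   # zero padding (assumption)
    crop = padded[top + side:top + 2 * side, left + side:left + 2 * side]
    return crop, (top, left)

def track_sequence(frames, init_center, init_size, match_fn, n=2):
    """Track a target through `frames`; returns one (row, col) centre per frame.

    match_fn(template, search) -> (row, col) inside `search`; a hypothetical
    stand-in for the matching network.
    """
    side_t = n * max(init_size)         # template: n times the side length
    side_s = 2 * n * max(init_size)     # search area: 2n times the side length
    template, _ = window(frames[0], init_center, side_t)
    centers = [(int(init_center[0]), int(init_center[1]))]
    for frame in frames[1:]:
        # The search area is centred on the target position of the previous frame.
        search, (top, left) = window(frame, centers[-1], side_s)
        r, c = match_fn(template, search)
        centers.append((top + int(r), left + int(c)))
    return centers

# Demo: a single bright pixel drifting one pixel per frame along the diagonal.
frames = []
for k in range(5):
    f = np.zeros((32, 32))
    f[10 + k, 12 + k] = 1.0
    frames.append(f)

brightest = lambda template, search: np.unravel_index(np.argmax(search), search.shape)
path = track_sequence(frames, (10, 12), (2, 2), brightest)
print(path)
```

Because the template is cut once from the first frame and only the search area is updated, the loop mirrors the description: the template stays fixed while the search area follows the most recent target position.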

The tracking network based on the dynamic convolution kernel designed in this embodiment comprises one or more dynamic convolution modules and a Siamese tracker; the dynamic convolution modules share the same structure but do not share parameters. When an infrared video sequence containing a small target is input, a template image containing the target and a search-area image are determined and then fed into the tracking network. The images first pass through the backbone network of the Siamese tracker, which outputs the search-area features and the template features. Each dynamic convolution module takes the template features and the search-area features produced by the previous step as input, and outputs new, processed search-area features. The dynamic convolution module mines the parts of the template and search-area features that deserve the most attention, so that more effective features are extracted. After one or more dynamic convolution modules have extracted effective features, the template features and the search-area features are sent to the similarity calculation part of the Siamese tracker for tracking.

The process of convolving with a dynamic convolution kernel in accordance with embodiments of the present application may be expressed as:

where K denotes a convolution kernel of size k × k, and I_in and I_out denote the input and output images, respectively. i and j are coordinates in the image, and u and v are coordinates within each kernel K_{i,j}. These per-pixel convolution kernels perform a weighted summation over the neighboring pixels of the image.
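The per-pixel weighted summation can be sketched in NumPy as follows. The sketch assumes the common form I_out(i, j) = sum over u, v of K_{i,j}(u, v) * I_in(i + u - r, j + v - r), with r = k // 2 and zero padding at the border; the index convention, the padding, and the function name `dynamic_conv` are assumptions and are not stated in the text.

```python
import numpy as np

def dynamic_conv(image, kernels):
    """Per-pixel dynamic convolution.

    image:   (H, W) input I_in.
    kernels: (H, W, k, k); kernels[i, j] is the kernel K_{i,j}, with k odd.
    Returns I_out, where each output pixel is the weighted sum of the k x k
    neighbourhood of the same input pixel, using that pixel's own kernel.
    """
    h, w, k, _ = kernels.shape
    r = k // 2
    padded = np.pad(image, r, mode="constant")   # zero padding (assumption)
    out = np.zeros((h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + k, j:j + k]     # neighbourhood centred on (i, j)
            out[i, j] = np.sum(kernels[i, j] * patch)
    return out

image = np.arange(16, dtype=np.float64).reshape(4, 4)

# Every pixel gets an identity kernel (1 at the centre), so the output equals the input.
identity = np.zeros((4, 4, 3, 3))
identity[:, :, 1, 1] = 1.0
same = dynamic_conv(image, identity)

# Changing the kernel of a single pixel changes only that output pixel.
mixed = identity.copy()
mixed[2, 2] = 1.0 / 9.0                          # 3 x 3 mean filter at pixel (2, 2)
local_mean = dynamic_conv(image, mixed)
print(local_mean[2, 2], image[1:4, 1:4].mean())
```

Unlike an ordinary convolution, where one kernel is shared by all positions, each position here has its own kernel, which is what lets the kernels depend on the template.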

The dynamic convolution module designed in this embodiment is shown in FIG. 3. First, the dimension of the template features is reduced by a 1×1 convolution, and the search-area features pass through 3×3, 2×2 and 3×3 convolutions in sequence for further feature extraction. The two sets of features are then concatenated and mapped into a dynamic convolution kernel by a 3×3 convolution, which helps identify, from the template, the parts of the search area that need attention. The search-area features are convolved with the dynamic convolution kernel to obtain a response map. Finally, the search-area features and the response map are concatenated and, after size adjustment, yield the new search-area features. Dynamic convolution modules can be stacked to strengthen feature extraction and improve tracking.

The Siamese tracker adopted in this embodiment consists of two components, a backbone network and a similarity calculation part. The backbone network extracts the features of the template and of the search area; the similarity calculation part finds the sub-region of the search area that is similar to the template, and the similarity estimate is obtained by computing the cross-correlation.

This embodiment takes the template image and the search-area image as the inputs of the backbone network and uses a modified ResNet50 for feature extraction. The first stage of ResNet50 consists of a 7×7 convolution layer with a stride of 2 followed by a 3×3 max-pooling layer with a stride of 2; the second, third and fourth stages consist of three-layer bottleneck blocks. The convolution stride of the downsampling unit in the fourth stage is changed from 2 to 1 to obtain a higher feature resolution. The 3×3 convolution of the fourth stage is changed to a dilated convolution with a stride of 2 to enlarge the receptive field. This embodiment removes the last stage of ResNet50 and takes the output of the fourth stage as the final output.
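The effect of the stride change on feature resolution can be checked with the standard output-size formula for convolution and pooling layers. The sketch below is illustrative only: the input side length of 255, the padding values (3 for the 7×7 convolution, 1 for ordinary 3×3 layers, 2 for the dilated 3×3 layer), the stride of 2 in the third stage, and the reading of the dilated convolution as having a dilation rate of 2 are all assumptions that the text does not state.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution or pooling layer along one axis."""
    effective_kernel = dilation * (kernel - 1) + 1   # extent covered by the kernel
    return (size + 2 * padding - effective_kernel) // stride + 1

size = 255                                                     # hypothetical input side
size = conv_out_size(size, kernel=7, stride=2, padding=3)      # first stage: 7x7 conv
size = conv_out_size(size, kernel=3, stride=2, padding=1)      # first stage: 3x3 max pool
stage3 = conv_out_size(size, kernel=3, stride=2, padding=1)    # third stage downsampling

# Fourth stage, unmodified: stride 2 halves the resolution again.
original = conv_out_size(stage3, kernel=3, stride=2, padding=1)
# Fourth stage, modified: stride 1 with a dilated 3x3 kernel keeps the resolution.
modified = conv_out_size(stage3, kernel=3, stride=1, padding=2, dilation=2)

print(stage3, original, modified)
```

Under these assumptions the modified fourth stage keeps a 32 × 32 feature map where the unmodified one would produce 16 × 16, while the dilated 3×3 kernel spans a 5 × 5 neighbourhood instead of 3 × 3, which is how a larger receptive field is retained without downsampling.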

In the embodiment of the application, similarity calculation is realized through cross-correlation operation, namely template features are used as convolution kernels, convolution operation is carried out on a search area, and the similarity of each position is calculated to obtain a score map, wherein the position with the highest score corresponds to a target position.
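The cross-correlation described above can be sketched in NumPy: the template features act as a kernel that slides over the search-area features, and the highest score marks the match. The function name `cross_correlate`, the feature shapes, and the synthetic data are illustrative assumptions; a real tracker would operate on backbone features rather than random arrays.

```python
import numpy as np

def cross_correlate(search, template):
    """Slide `template` over `search` and return the score map.

    search:   (C, Hs, Ws) search-area features.
    template: (C, Ht, Wt) template features, used as the convolution kernel.
    Only positions where the template fits entirely are evaluated.
    """
    _, hs, ws = search.shape
    _, ht, wt = template.shape
    scores = np.empty((hs - ht + 1, ws - wt + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(search[:, i:i + ht, j:j + wt] * template)
    return scores

rng = np.random.default_rng(0)
template = rng.normal(size=(4, 3, 3))      # synthetic template features
search = np.zeros((4, 10, 10))
search[:, 5:8, 2:5] = template             # place the template at offset (5, 2)

scores = cross_correlate(search, template)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(scores), scores.shape))
print(scores.shape, peak)
```

The peak index is the top-left corner of the best-matching window; adding half the template size gives the corresponding target center in the search area.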

Embodiment 2

Embodiment 2 of the present application provides a terminal device corresponding to embodiment 1, where the terminal device may be a processing device for a client, for example, a mobile phone, a notebook computer, a tablet computer, a desktop computer, etc., so as to execute the method of the embodiment.

The terminal device of the present embodiment includes a memory, a processor, and a computer program stored on the memory; the processor executes the computer program on the memory to implement the steps of the method of embodiment 1 described above.

In some implementations, the memory may be high-speed random access memory (RAM), and may also include non-volatile memory, such as at least one disk memory.

In other implementations, the processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or other general-purpose processor, which is not limited herein.

Embodiment 3

Embodiment 3 of the present application provides a computer-readable storage medium corresponding to embodiment 1 described above, on which a computer program/instructions is stored. The steps of the method of embodiment 1 described above are implemented when the computer program/instructions are executed by a processor.

The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any combination of the preceding.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The solutions in the embodiments of the application can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. An infrared small target tracking method based on a dynamic convolution kernel, characterized by comprising the following steps:

S1: acquiring an infrared video frame sequence containing a small target;

2. The dynamic convolution kernel-based target tracking method according to claim 1, wherein the dynamic convolution module-based tracking network comprises at least one dynamic convolution module connected in series, all of which are connected to a backbone network of a Siamese tracker.

3. The method of claim 2, wherein the number of dynamic convolution modules is greater than or equal to 1.

4. The target tracking method based on a dynamic convolution kernel according to claim 2, wherein the dynamic convolution module comprises a first convolution layer and a convolution unit formed by a plurality of second convolution layers connected in series; the input of the first convolution layer is the template features, and the input of the convolution unit is the search-area features; the features output by the first convolution layer and by the convolution unit are concatenated and input into a third convolution layer; the third convolution layer maps the concatenated features into dynamic convolution kernels, and the dynamic convolution kernels are convolved with the search-area features to obtain a response map; the response map is concatenated with the search-area features to obtain the output of the dynamic convolution module.

5. The dynamic convolution kernel-based object tracking method according to claim 2, wherein the Siamese tracker comprises a backbone network and a similarity calculation part; the backbone network is used for extracting features of the template and the search area, and the similarity calculation part obtains a subarea similar to the template in the search area, and the subarea is the output target.

6. The method for tracking a target based on a dynamic convolution kernel according to claim 5, wherein the backbone network comprises two feature extraction modules sharing parameters, each of which is used for extracting features of a template and features of a search area; the feature extraction module is a ResNet50 network, and the fourth stage of the ResNet50 network is the final output.

7. The method of claim 5, wherein the convolution stride of the downsampling unit in the fourth stage of the ResNet50 network is 1.

8. The method of claim 5, wherein the 3×3 convolution of the fourth stage of the ResNet50 network is replaced with a dilated convolution with a stride of 2.

9. The method for tracking a target based on a dynamic convolution kernel according to claim 5, wherein the sub-regions similar to the template in the search region are obtained by cross-correlation calculation.