Disclosure of Invention
The technical problem addressed by the application is to remedy the shortcomings of the prior art by providing an infrared small target tracking method based on a dynamic convolution kernel that reduces the difficulty of extracting features of infrared small targets.
To solve this technical problem, the application adopts the following technical solution: an infrared small target tracking method based on a dynamic convolution kernel, comprising the following steps:
S1, acquiring an infrared video frame sequence containing a small target;
S2, selecting a target in the first frame image of the infrared video frame sequence, expanding the side length N times about the center of the target, and reshaping the result into a square to serve as a template, wherein the template contains appearance information of the target and of its local surrounding scene; in the i-th frame image of the infrared video frame sequence, where i > 1, expanding the side length 2N times about the target center coordinates and reshaping the result into a square to serve as a search area;
S3, inputting the template and the search area into a tracking network based on a dynamic convolution module for matching, and outputting the target in the single-frame image;
S4, expanding the side length 2N times about the center coordinates of the target output for the single-frame image and reshaping the result into a square to serve as a new search area, inputting the template and the new search area into the tracking network based on the dynamic convolution module for matching, and outputting the target in the single-frame image, until all frames have been tracked.
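The region construction in steps S2 and S4 can be sketched as follows. This is a minimal illustration, not the method as claimed: the function name `square_region`, the `(cx, cy, w, h)` box convention, and the choice of the longer side of the target box as the side being expanded are all assumptions made here so that the result is square.

```python
def square_region(cx, cy, w, h, scale):
    """Return (x0, y0, side) of a square centered on (cx, cy).

    The side is `scale` times the longer side of the target box. Using the
    longer side is an assumption of this sketch; the text only says that the
    side length is expanded and the region reshaped into a square.
    """
    side = scale * max(w, h)
    return cx - side / 2.0, cy - side / 2.0, side


N = 2  # expansion factor; 2 matches the twofold/fourfold expansion of the detailed description
template = square_region(100.0, 80.0, 6.0, 4.0, N)      # template: N times the side
search = square_region(100.0, 80.0, 6.0, 4.0, 2 * N)    # search area: 2N times the side
print(template)  # (94.0, 74.0, 12.0)
print(search)    # (88.0, 68.0, 24.0)
```

With these conventions the search area is twice as wide as the template and shares its center, so a target that moves by less than its own template margin between frames stays inside the search area.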
To address the difficulty of extracting features of infrared small targets, the dynamic-convolution-based tracking method of the application uses multi-layer dynamic convolution modules to mine the more critical information in the template features and the search area features, thereby reducing the difficulty of feature extraction for infrared small targets. The dynamic convolution module generates the dynamic convolution kernel from the template, so that strong features are extracted; the method is therefore suited to weak targets against complex backgrounds, the trained network is more flexible, and tracking accuracy is improved.
The tracking network based on the dynamic convolution modules comprises at least one dynamic convolution module connected in series, and all the dynamic convolution modules are connected with a backbone network of the Siamese tracker.
The application introduces dynamic convolution into the field of infrared small target tracking, and provides a method for mapping template features into dynamic convolution kernels, so that feature expression capability is effectively improved.
The number of the dynamic convolution modules is greater than or equal to 1.
The dynamic convolution module comprises a first convolution layer and a convolution unit formed by cascading a plurality of second convolution layers; the input of the first convolution layer is the template features, and the input of the convolution unit is the search area features; the features output by the first convolution layer and by the convolution unit are concatenated and input into a third convolution layer; the third convolution layer maps the concatenated features into a dynamic convolution kernel, and the dynamic convolution kernel is convolved with the search area features to obtain a response map; and the response map is concatenated with the search area features to obtain the output of the dynamic convolution module.
The Siamese tracker comprises a backbone network and a similarity calculation part; the backbone network is used for extracting features of the template and the search area, and the similarity calculation part obtains a subarea similar to the template in the search area, and the subarea is the output target.
The backbone network comprises two feature extraction modules, used to extract the features of the template and of the search area respectively; each feature extraction module is a ResNet50 network, and the output of the fourth stage of the ResNet50 network is taken as the final output.
The convolution stride of the fourth-stage downsampling unit of the ResNet50 network is 1, to obtain greater feature resolution.
The 3×3 convolution of the fourth stage is replaced with a dilated convolution with a stride of 2 to increase the receptive field.
The sub-region of the search area that is similar to the template is obtained by cross-correlation calculation.
Compared with the prior art, the application has the following beneficial effects:
1. The method introduces dynamic convolution into the field of infrared small target tracking and effectively improves feature expression capability.
2. The multi-layer dynamic convolution modules mine more key information from the template features and the search area features, which solves the problem that the features of infrared small targets are difficult to extract.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," and the like herein do not denote any order, quantity, or importance, but are used to distinguish one element from another. The terms "a," "an," and similar words do not mean that there is only one of the thing described, but that the description is directed to one such thing, of which there may be one or more. The terms "comprise," "include," and similar words denote a logical relationship and are not to be construed as implying a spatial structural relationship. For example, "A includes B" means that B logically belongs to A, not that B is spatially located inside A. In addition, the terms "comprising," "including," and similar terms are to be construed as open-ended rather than closed-ended. For example, "A includes B" means that B belongs to A, but B does not necessarily constitute all of A, and A may also include other elements such as C, D, and E.
The embodiment of the application provides an infrared small target tracking method based on a dynamic convolution kernel, which comprises the following steps:
Step one: inputting an infrared video sequence containing a small target;
Step two: giving the ground truth of the target of interest in the first frame image (the ground truth comprising coordinate values and a length and width), or framing the target in the first frame image; expanding the side length by a factor of two about the center of the target and reshaping the result into a square to serve as a template, wherein the square contains appearance information of the target and of its local surrounding scene;
Step three: in the second frame image, expanding the side length by a factor of four about the center coordinates of the target and reshaping the result into a square to serve as a search area, wherein the search area generally covers the possible range of movement of the target;
Step four: inputting the template and the search area into the tracking network based on the dynamic convolution kernel for matching, and outputting the target in the single-frame image;
Step five: expanding the side length by a factor of four about the center coordinates of the target obtained in the previous step and reshaping the result into a square to serve as a new search area; inputting the template and the new search area into the tracking network based on the dynamic convolution kernel for matching, and outputting the target in the single-frame image.
Steps three to five are repeated for the remaining frames until all frames have been tracked.
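The frame-by-frame procedure of steps one to five can be sketched as a simple loop. This is an illustrative outline under stated assumptions: `track_sequence`, `square_region`, and the `match` callable (standing in for the tracking network) are hypothetical names, and the toy matcher at the end only exists to make the example runnable.

```python
def square_region(cx, cy, w, h, scale):
    """Square (x0, y0, side) centered on (cx, cy); side = scale * longer box side (assumed)."""
    side = scale * max(w, h)
    return cx - side / 2.0, cy - side / 2.0, side


def track_sequence(frames, first_box, match, n=2):
    """Track one target through `frames`.

    frames    : list of frames (any objects that `match` understands)
    first_box : (cx, cy, w, h) of the target in the first frame
    match     : hypothetical callable(template, frame, search_region) -> (cx, cy, w, h)
    """
    cx, cy, w, h = first_box
    # Step two: the template is built once from the first frame and then kept fixed.
    template = (frames[0], square_region(cx, cy, w, h, n))
    results = [first_box]
    for frame in frames[1:]:
        # Steps three and five: center the search area on the previous result.
        pcx, pcy, pw, ph = results[-1]
        region = square_region(pcx, pcy, pw, ph, 2 * n)
        # Step four: match the template against the search area.
        results.append(match(template, frame, region))
    return results


def toy_match(template, frame, region):
    """Toy stand-in for the network: reports the region center shifted by one pixel in x."""
    x0, y0, side = region
    return (x0 + side / 2.0 + 1.0, y0 + side / 2.0, 6.0, 4.0)


boxes = track_sequence(["f0", "f1", "f2"], (100.0, 80.0, 6.0, 4.0), toy_match)
print(boxes[-1])  # (102.0, 80.0, 6.0, 4.0)
```

Only the search area moves from frame to frame in this outline; the template stays tied to the first frame, as in step two.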
The tracking network based on the dynamic convolution kernel designed in the embodiment of the application comprises one or more dynamic convolution modules and a Siamese tracker, wherein the dynamic convolution modules have the same structure but do not share parameters. When an infrared video sequence containing a small target is input, a template image containing the target and a search area image are determined and then input into the tracking network for tracking. The images are first sent to the backbone network of the Siamese tracker, which outputs the template features and the search area features. Each dynamic convolution module takes as input the template features and the search area features produced by the preceding step, and outputs new, processed search area features. The dynamic convolution module mines the parts of the template and search area features that are of greater interest, so that more effective features are extracted. After the one or more dynamic convolution modules have extracted effective features, the template features and the search area features are sent to the similarity calculation part of the Siamese tracker for tracking.
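The data flow described above can be outlined in a few lines. This sketch only shows the order of operations; `run_tracking_network` and its callable arguments are hypothetical placeholders, and the toy lambdas operate on plain lists rather than real feature maps.

```python
def run_tracking_network(template_img, search_img,
                         backbone_z, backbone_x, dyn_modules, similarity):
    """Order of operations only; every callable is a hypothetical placeholder."""
    z = backbone_z(template_img)   # template features
    x = backbone_x(search_img)     # search area features
    for module in dyn_modules:
        # Each module sees the same template features and the search area
        # features produced by the preceding step, and returns new ones.
        x = module(z, x)
    return similarity(z, x)


# Toy stand-ins so the outline runs end to end.
double = lambda img: [2 * p for p in img]
add_template_energy = lambda z, x: [v + sum(z) for v in x]
argmax = lambda z, x: max(range(len(x)), key=lambda i: x[i])

position = run_tracking_network([1, 2], [0, 5, 1],
                                double, double,
                                [add_template_energy, add_template_energy],
                                argmax)
print(position)  # 1
```

Passing the modules as a list mirrors the statement that one or more modules may be used in series, each with its own parameters.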
The process of convolving with a dynamic convolution kernel in accordance with embodiments of the present application may be expressed as:
where K denotes a convolution kernel of size K × K; I_in and I_out denote the input and output images, respectively; i and j are coordinates in the image; and u and v are coordinates within each kernel K_{i,j}. These pixel-wise convolution kernels perform a weighted summation over the neighboring pixels of the image.
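The weighted summation described above can be sketched in plain Python. The sketch assumes the form I_out(i, j) = Σ_{u,v} K_{i,j}(u, v) · I_in(i + u, j + v), with zero padding at the borders and an odd kernel size; the sign convention of the offsets, the padding, and the function name `dynamic_conv` are assumptions of this sketch rather than details given in the text.

```python
def dynamic_conv(image, kernels):
    """Per-pixel (dynamic) convolution.

    image   : H x W list of lists of numbers
    kernels : kernels[i][j] is the ksize x ksize kernel applied at pixel (i, j)
              (ksize odd; an assumption of this sketch)
    """
    h, w = len(image), len(image[0])
    ksize = len(kernels[0][0])
    r = ksize // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(-r, r + 1):
                for v in range(-r, r + 1):
                    y, x = i + u, j + v
                    if 0 <= y < h and 0 <= x < w:  # zero padding outside the image
                        acc += kernels[i][j][u + r][v + r] * image[y][x]
            out[i][j] = acc
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
box = [[1.0] * 3 for _ in range(3)]                       # sums the neighborhood
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
kernels = [[box for _ in range(3)] for _ in range(3)]
kernels[1][1] = identity                                  # a different kernel at the center pixel
result = dynamic_conv(image, kernels)
print(result[0][0], result[1][1], result[2][2])  # 12.0 5.0 28.0
```

The difference from an ordinary convolution is that the kernel is indexed by the pixel position (i, j), so each location can weight its neighborhood differently.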
The dynamic convolution module designed in the embodiment of the application is shown in Fig. 3. First, the dimension of the template features is reduced by a 1×1 convolution kernel, and the search area features are passed through 3×3, 2×2 and 3×3 convolution kernels to extract further features. The two sets of features are then concatenated and mapped into a dynamic convolution kernel by a 3×3 convolution kernel, which helps to mine, from the template, the parts of the search area that need attention. The search area features are convolved with the dynamic convolution kernel to obtain a response map. Finally, the search area features and the response map are concatenated and resized to obtain the new search area features. Dynamic convolution modules can be stacked to strengthen feature extraction and improve tracking performance.
The Siamese tracker adopted in the embodiment of the application has two components, a backbone network and a similarity calculation part. The backbone network extracts the features of the template and of the search area, and the similarity calculation part obtains the sub-region of the search area that is similar to the template; the similarity estimate is obtained by computing the cross-correlation.
The embodiment of the application takes the template image and the search area image as the inputs of the backbone network and uses a modified version of ResNet50 for feature extraction. The first stage of ResNet50 consists of a 7×7 convolution layer with a stride of 2 followed by a 3×3 max pooling layer with a stride of 2, and the second, third and fourth stages consist of three-layer bottleneck blocks. The convolution stride of the fourth-stage downsampling unit is changed from 2 to 1 to obtain greater feature resolution. The 3×3 convolution of the fourth stage is changed to a dilated convolution with a stride of 2 to increase the receptive field. The embodiment removes the last stage of ResNet50 and takes the output of the fourth stage as the final output.
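The effect of the stride change on feature resolution can be checked with a short calculation. The stage numbering follows the text (stage 1 is the 7×7 convolution plus max pooling); the strides of 1, 2 and 2 for stages 2 to 4 are those of a standard ResNet50. The last line additionally assumes that the factor 2 of the dilated convolution acts as a dilation rate, which the text does not state explicitly. The function names are illustrative.

```python
def output_stride(stage_strides):
    """Total downsampling factor after the given stages."""
    total = 1
    for s in stage_strides:
        total *= s
    return total


def effective_kernel(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Stage 1 = 7x7 conv (stride 2) followed by max pooling (stride 2).
standard = [2 * 2, 1, 2, 2]   # stages 1-4 of a standard ResNet50
modified = [2 * 2, 1, 2, 1]   # fourth-stage downsampling stride changed from 2 to 1

print(output_stride(standard))   # 16
print(output_stride(modified))   # 8
print(effective_kernel(3, 2))    # 5 (assumes a dilation rate of 2)
```

Under these assumptions the modified backbone keeps twice the spatial resolution of the standard fourth stage, which matters for targets that occupy only a few pixels.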
In the embodiment of the application, the similarity calculation is implemented by a cross-correlation operation: the template features are used as a convolution kernel and slid over the search area features, and the similarity at each position is computed to obtain a score map, in which the position with the highest score corresponds to the target position.
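The cross-correlation step can be illustrated on a single-channel toy example. Real feature maps have many channels and the network is not shown; the function names `cross_correlate` and `argmax2d` and the small arrays are assumptions made for this sketch.

```python
def cross_correlate(search, template):
    """Slide `template` over `search` (both 2-D lists) and return the score map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    scores = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            s = 0.0
            for u in range(th):
                for v in range(tw):
                    s += template[u][v] * search[i + u][j + v]
            row.append(s)
        scores.append(row)
    return scores


def argmax2d(scores):
    """Return (row, col) of the highest score."""
    positions = [(i, j) for i in range(len(scores)) for j in range(len(scores[0]))]
    return max(positions, key=lambda p: scores[p[0]][p[1]])


search = [[0, 0, 0, 0],
          [0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, 0, 0, 0]]
template = [[1, 1],
            [1, 1]]
score_map = cross_correlate(search, template)
print(argmax2d(score_map))  # (1, 2)
```

The peak at (1, 2) is the top-left corner of the 2×2 bright patch, that is, the position where the template overlaps the target completely.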
Embodiment 2
Embodiment 2 of the present application provides a terminal device corresponding to embodiment 1, where the terminal device may be a processing device for a client, for example, a mobile phone, a notebook computer, a tablet computer, a desktop computer, etc., so as to execute the method of the embodiment.
The terminal device of the present embodiment includes a memory, a processor, and a computer program stored on the memory; the processor executes the computer program on the memory to implement the steps of the method of embodiment 1 described above.
In some implementations, the memory may be high-speed random access memory (RAM), and may also include non-volatile memory, such as at least one disk memory.
In other implementations, the processor may be a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or other general-purpose processor, which is not limited herein.
Embodiment 3
Embodiment 3 of the present application provides a computer-readable storage medium corresponding to embodiment 1 described above, on which a computer program or instructions are stored. The steps of the method of embodiment 1 described above are implemented when the computer program or instructions are executed by a processor.
The computer readable storage medium may be a tangible device that retains and stores instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any combination of the preceding.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The solutions in the embodiments of the application can be implemented in various computer languages, such as the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.