CN116342649A - Method, device and equipment for tracking specific target object in low-resolution image in complex scene - Google Patents

Method, device and equipment for tracking specific target object in low-resolution image in complex scene

Info

Publication number
CN116342649A
CN116342649A
Authority
CN
China
Prior art keywords
image
target object
resolution
tracked
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310232244.1A
Other languages
Chinese (zh)
Inventor
刘通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202310232244.1A priority Critical patent/CN116342649A/en
Publication of CN116342649A publication Critical patent/CN116342649A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/09 Supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/52 Scale-space analysis, e.g. wavelet analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of image processing, and particularly relates to a method, a device and equipment for tracking a specific target object in a low-resolution image in a complex scene.

Description

Method, device and equipment for tracking specific target object in low-resolution image in complex scene
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method for tracking a specific target object in a low-resolution image under a complex scene.
Background
In complex natural scenes in the wild, vision-based tracking of a specific moving target is limited by the following factors: (1) smoke interference; (2) low-cost imaging devices have insufficient image resolution, and shooting a moving object easily blurs the object in the image; (3) the target object spans a wide range of scales. The first two factors make the target object in an image or video blurred and hard to distinguish, and the third makes adaptive detection and tracking of multi-scale objects difficult for a tracking algorithm.
Disclosure of Invention
The invention aims to provide a method, a device and equipment for tracking a specific target object in a low-resolution image in a complex scene, thereby solving the technical problem in the prior art that a specific target object in a low-resolution image in a complex scene is difficult to track.
To solve the above technical problem, a first aspect of the present invention is:
the method for tracking the specific target object in the low-resolution image in the complex scene comprises the following steps:
acquiring an image containing a target object to be tracked as a template frame;
obtaining a first high-resolution sharpened image corresponding to the template frame through a smoke removal and sharpening module;
Specifying a target object in the first high resolution sharpened image;
acquiring an image sequence needing target object tracking, and acquiring a second high-resolution sharpened image corresponding to each image frame in the image sequence through a smoke removal and sharpening module;
and obtaining, through a target object tracking module, the position coordinates in each second high-resolution sharpened image of the target object matching the target object specified in the first high-resolution sharpened image.
Preferably, the obtaining, by the smoke removal and sharpening module, the first high resolution sharpened image corresponding to the template frame includes:
two sub-networks are provided, a first sub-network and a second sub-network; the first sub-network removes smoke interference in the template frame through fusion of multi-scale images, and the second sub-network realizes sharpening of the low-resolution image through an encoder-decoder network structure, finally obtaining the first high-resolution sharpened image.
Preferably, the removing smoke interference in the template frame by the first sub-network through fusion of the multi-scale images includes:
letting the first sub-network be G_1 and the template frame be a smoke-contaminated low-resolution image I_{LR,0}, up-sampling and down-sampling I_{LR,0} several times with a deep convolutional network model and extracting image features to obtain feature maps;
fusing the acquired feature maps to capture the local and global features of I_{LR,0}, obtaining a feature map f_1.
Preferably, during training the smoke removal and sharpening module acquires a real, smoke-free image containing the target object to be tracked as a first contrast image, applies a convolution operation to the fused feature map f_1 to obtain an intermediate predicted feature map $\hat{I}_{LR}$, and calculates the degree of difference between $\hat{I}_{LR}$ and the first contrast image through a first loss function.
Preferably, the smoke removal and sharpening module introduces a first discriminator D_1 during training; D_1 judges whether the image output by the first sub-network G_1 is a generated smoke-free image or a real smoke-free image, and drives the first sub-network through an adversarial loss to generate more realistic images.
Preferably, the second sub-network realizing sharpening of the low-resolution image through the encoder-decoder network structure includes:
setting the second sub-network as G_2, and up-sampling the feature map processed by the first sub-network to obtain a feature map f_2 whose size matches the first high-resolution sharpened image;
then further encoding and decoding f_2 using the encoder-decoder network structure to obtain the first high-resolution sharpened image.
Preferably, during training the smoke removal and sharpening module acquires a clear, high-resolution real image containing the target object to be tracked as a second contrast image, applies a convolution operation to the first high-resolution sharpened image to obtain a predicted high-resolution image $\hat{I}_{HR}$, and calculates the degree of difference between $\hat{I}_{HR}$ and the second contrast image through a second loss function.
Preferably, the smoke removal and sharpening module introduces a second discriminator D_2 during training; D_2 judges whether the image output by the second sub-network G_2 is a generated high-resolution sharpened image or a real high-resolution sharpened image, and drives the second sub-network through an adversarial loss to generate more realistic images.
Preferably, the obtaining, by the target object tracking module, the position coordinates in each second high-resolution sharpened image of the target object matching the target object specified in the first high-resolution sharpened image includes:
setting the target object specified in the first high-resolution sharpened image as a target object template Z, and setting the current second high-resolution sharpened image as the image X to be tracked;
the target object tracking module uses a lightweight network model as a backbone network to extract image features of the target object template Z and the image X to be tracked;
the target object tracking module uses a multi-scale feature fusion unit to fuse the image features of different scales extracted from the target object template Z and the image X to be tracked;
the target object tracking module uses a cross-attention fusion unit to implicitly calculate the correlation between the target object template Z and the image X to be tracked through a single-branch structure, so as to detect the position of the target object in the image X to be tracked.
Preferably, the target object tracking module uses a multi-scale feature fusion unit to fuse image features of different scales extracted from the target object template Z and the image X to be tracked, including:
the multi-scale feature fusion unit divides the input image features into n groups, learns the image features of n scales through n groups of separable convolution networks of different depths, restores the dimension of the image features by tensor concatenation, fuses the n groups of multi-scale features using an n-by-n convolution kernel, and inputs the result to the cross-attention fusion unit.
Preferably, the target object tracking module using a cross-attention fusion unit to implicitly calculate the correlation between the target object template Z and the image X to be tracked through a single-branch structure includes:
the image features of the target object template Z pass through the multi-scale feature fusion unit to obtain a target object template feature map Z_l, and the image features of the image X to be tracked pass through the multi-scale feature fusion unit to obtain a feature map X_l of the image to be tracked;
the cross-attention fusion unit fuses the target object template feature map Z_l and the feature map X_l passed in from the corresponding multi-scale feature fusion units, implicitly obtaining the correlation between each pixel position of X_l and the target object template feature map Z_l.
Preferably, the target object tracking module predicts using two detection branches, a first detection branch and a second detection branch; the first detection branch is used for outputting the probability that each pixel position in the image X to be tracked contains the target object, and the second detection branch is used for outputting the size of the target object detected in the image X to be tracked.
Preferably, in the training process the target object tracking module introduces a third loss function and a fourth loss function to supervise training, and calculates through the third and fourth loss functions, respectively, the degree of difference between the outputs of the first detection branch and the second detection branch and the ground-truth results.
Preferably, the image containing the target object to be tracked is captured using a low-resolution camera; the image sequence is obtained by shooting the target object with the same camera in a dynamic environment.
In a second aspect of the present invention, there is provided an apparatus for tracking a specific target object in a low resolution image in a complex scene, including:
the smoke removing and sharpening module is used for obtaining a first high-resolution sharpened image corresponding to the template frame; the method comprises the steps of obtaining a second high-resolution sharpened image corresponding to each image frame in an image sequence to be tracked;
and the target object tracking module is used for obtaining, in each second high-resolution sharpened image, the position coordinates of the target object matching the target object specified in the first high-resolution sharpened image.
Preferably, the smoke removing and sharpening module comprises a first sub-network and a second sub-network, two sub-tasks are realized through cascade connection of the two sub-networks, the first sub-network removes smoke interference in the template frame through fusion of the multi-scale image, the second sub-network realizes sharpening of the low-resolution image through an encoder-decoder network structure, and finally the first high-resolution sharpened image is obtained.
Preferably, the target object tracking module uses a lightweight network model as a backbone network to extract image features of the target object template Z and the image X to be tracked;
and a multi-scale feature fusion unit and a cross attention fusion unit are arranged in the target object tracking module, the multi-scale feature fusion unit fuses the image features of different scales extracted from the target object template Z and the image X to be tracked, and the cross attention fusion unit implicitly calculates the correlation between the target object template Z and the image X to be tracked through a single branch structure, so that the position of the target object in the image X to be tracked is detected.
Preferably, the target object tracking module includes a first detection branch and a second detection branch, where the first detection branch is used to output a probability that each pixel position in the image X to be tracked includes a target object, and the second detection branch is used to output a size of the target object detected in the image X to be tracked.
A third aspect of the invention is:
there is provided an apparatus for tracking a specific target object in a low resolution image of a complex scene comprising at least one processor and a memory communicatively coupled to the processor, the memory having stored therein instructions executable by the processor for execution by the processor to enable the processor to perform the method of the first aspect of the invention.
Compared with the prior art, the invention has the following beneficial effects: smoke interference in the template frame and in each image of the image sequence to be tracked is removed through fusion of multi-scale images, and sharpening of the low-resolution images is realized through an encoder-decoder network structure, finally yielding high-resolution sharpened versions of the template frame and of each image in the sequence, which lays the foundation for accurately tracking the target object of the template frame through the image sequence. A lightweight network model serves as the backbone network to extract image features of the target object template and the image to be tracked, and multi-scale feature fusion enables the model to identify target objects with a large size span in the image; the correlation between the target object template and the image to be detected is then calculated by cross-attention fusion, so that the position of the target object in any image frame is rapidly detected and tracking of the target object through the image sequence is realized.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:
FIG. 1 is a flowchart of a method for tracking a specific target object in a low resolution image in a complex scene according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of tracking an embodiment of a method for tracking a specific target object in a low resolution image in a complex scene according to the present invention.
FIG. 3 is a flowchart illustrating a method for tracking a specific target object in a low-resolution image in a complex scene according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a smoke removal and sharpening module in an embodiment of a method for tracking a specific target object in a low resolution image in a complex scene according to the present invention.
FIG. 5 is a flowchart of a specific object tracking module in an embodiment of a method for tracking a specific object in a low resolution image in a complex scene according to the present invention.
Fig. 6 is a schematic diagram of a target object tracking module in an embodiment of a method for tracking a specific target object in a low resolution image in a complex scene according to the present invention.
FIG. 7 is a block diagram illustrating an embodiment of an apparatus for tracking a specific target object in a low resolution image in a complex scene according to the present invention.
Fig. 8 is a block diagram of a smoke removal and sharpening module in an embodiment of an apparatus for tracking a specific target object in a low resolution image in a complex scene according to the present invention.
FIG. 9 is a block diagram of a target object tracking module in an embodiment of an apparatus for tracking a specific target object in a low resolution image in a complex scenario of the present invention.
FIG. 10 is a block diagram of a first detection branch in an embodiment of an apparatus for tracking a specific target object in a low resolution image in a complex scenario according to the present invention.
FIG. 11 is a block diagram of one embodiment of an apparatus for tracking a specific target object in a low resolution image in a complex scene of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In one embodiment, a method for tracking a specific target object in a low-resolution image in a complex scene is provided. The method can track dynamic target objects such as vehicles running on the ground and aerial vehicles; in such dynamic scenes, each frame of the video stream containing the target object suffers from smoke interference, low resolution and lack of sharpness.
As shown in fig. 1, the method for tracking a specific target object in a low-resolution image in a complex scene comprises the following steps:
s100, acquiring an image containing a target object to be tracked as a template frame.
S200, obtaining a first high-resolution sharpened image corresponding to the template frame through the smoke removal and sharpening module.
The template frame in step S100 is a low-resolution image captured by a low-cost, low-resolution camera, and the image contains the target object to be tracked; the smoke removal and sharpening module removes the smoke interference present in the image and improves its resolution to obtain a sharpened image.
S300, a target object is specified in the first high-resolution sharpened image.
The target object may be specified in the first high-resolution sharpened image by manually drawing a bounding box on an electronic device.
S400, acquiring an image sequence needing target object tracking, and obtaining a second high-resolution sharpened image corresponding to each image frame in the image sequence through a smoke removal and sharpening module.
The image sequence is obtained by shooting the target object in a dynamic environment using the same camera that captured the template frame; the obtained image sequence is a video stream containing multiple frames of the same specification as the template frame, so that the target object of the template frame can be tracked through the image sequence.
S500, obtaining position coordinates of the target object matched with the target object appointed in the first high-resolution clear image in each second high-resolution clear image through the target object tracking module.
Tracking of the target object through the video stream can be achieved by obtaining, in each second high-resolution sharpened image, the position coordinates of the target object matching the one specified in the first high-resolution sharpened image.
Referring to fig. 2, in the method for tracking a specific target object in a low-resolution image in a complex scene, smoke removal and sharpening are performed both on the template frame and on each image frame of the image sequence in which the target object is to be tracked; this reduces tracking difficulty and improves tracking accuracy when the target object tracking module subsequently performs tracking.
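The control flow of steps S100 to S500 can be summarized in a short sketch. This is a minimal Python illustration, in which desmoke_sharpen, select_target and track are hypothetical callables standing in for the smoke removal and sharpening module, the manual target specification, and the target object tracking module described below:

    # Minimal pipeline sketch of steps S100-S500. The three callables are
    # hypothetical stand-ins, not APIs defined by this patent.
    def track_sequence(template_frame, frames, desmoke_sharpen, select_target, track):
        hr_template = desmoke_sharpen(template_frame)  # S200: first HR sharpened image
        target_template = select_target(hr_template)   # S300: user specifies the target
        positions = []
        for frame in frames:                           # S400: image sequence
            hr_frame = desmoke_sharpen(frame)          # second HR sharpened image
            positions.append(track(target_template, hr_frame))  # S500: position coords
        return positions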
Further, in an embodiment, as shown in fig. 3, in the method for tracking a specific target object in a low-resolution image in a complex scene, the first high-resolution sharpened image corresponding to the template frame is obtained by the smoke removal and sharpening module, and two sub-tasks are implemented by cascading two sub-networks, a first sub-network G_1 and a second sub-network G_2: G_1 removes smoke interference in the template frame through fusion of multi-scale images to obtain a feature map f_1, and G_2 applies an encoder-decoder network structure to f_1 to sharpen the low-resolution image, finally obtaining the first high-resolution sharpened image.
Fig. 4 is a schematic diagram of the architecture of the smoke removal and sharpening module in an embodiment of the method for tracking a specific target object in a low-resolution image in a complex scene; in fig. 4, the dashed-box portion is the model that is deployed, while the other portions are used only in the training phase. Referring to fig. 4, the first sub-network G_1 and the second sub-network G_2 work as follows:
First sub-network G_1: let the input smoke-contaminated low-resolution template frame be image I_{LR,0}. First, the image features of I_{LR,0} are extracted using a classical deep convolutional network model such as ResNet18 or VGGNet-19; considering that the template frame has low resolution and small size, only three levels of downsampling are adopted to preserve the feature resolution. Second, starting from the last level of downsampled image features, each level is up-sampled and fused with the image features of the level above it, with simple addition used during fusion to increase inference speed; fusing the second-, third- and fourth-level downsampled features captures both the local and the global features of the template frame image, and the resulting feature map is defined as f_1.
Second sub-network G_2: first, the feature map f_1 is up-sampled to obtain a feature map f_2 whose size matches the high-resolution image to be produced; second, a classical encoder-decoder network structure such as UNet or SegNet further encodes and decodes the input feature map f_2 to obtain the first high-resolution sharpened image.
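A minimal PyTorch sketch of this cascade follows. The description names classical backbones (ResNet18, VGGNet-19) and encoder-decoders (UNet, SegNet); the small generic convolution blocks, the channel width, the class names G1Desmoke/G2Sharpen and the upsampling factor below are illustrative assumptions that keep only the stated structure: three-level downsampling and additive top-down fusion in G_1, then upsampling plus encoder-decoder sharpening in G_2.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_block(cin, cout, stride=1):
        return nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )

    class G1Desmoke(nn.Module):
        """First sub-network: three-level feature extraction and additive fusion."""
        def __init__(self, width=32):
            super().__init__()
            self.down1 = conv_block(3, width, stride=2)      # second-level features
            self.down2 = conv_block(width, width, stride=2)  # third-level features
            self.down3 = conv_block(width, width, stride=2)  # fourth-level features

        def forward(self, x):
            f2 = self.down1(x)
            f3 = self.down2(f2)
            f4 = self.down3(f3)
            # Top-down fusion by simple addition, as in the description.
            f3 = f3 + F.interpolate(f4, size=f3.shape[-2:], mode="bilinear", align_corners=False)
            return f2 + F.interpolate(f3, size=f2.shape[-2:], mode="bilinear", align_corners=False)

    class G2Sharpen(nn.Module):
        """Second sub-network: up-sample f_1, then encode-decode to an HR image."""
        def __init__(self, width=32, scale=8):
            super().__init__()
            self.scale = scale
            self.encoder = nn.Sequential(conv_block(width, width, stride=2),
                                         conv_block(width, width))
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(width, width, 4, stride=2, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(width, 3, 3, padding=1),
            )

        def forward(self, f1):
            # f_2: feature map up-sampled to the target HR size.
            f2 = F.interpolate(f1, scale_factor=self.scale, mode="bilinear", align_corners=False)
            return self.decoder(self.encoder(f2))   # first HR sharpened image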
Further, in one embodiment, with reference to fig. 4, the smoke removal and sharpening module used in the method for tracking a specific target object in a low-resolution image in a complex scene is obtained by training: during training, loss functions measure the difference between the images predicted by the first sub-network G_1 and the second sub-network G_2 and the corresponding real images, and this difference is fed back to improve the prediction accuracy of the module, so that the predicted images are as close as possible to the real images.
For the first sub-network G_1: first, a real, smoke-free image containing the target object to be tracked is acquired as the first contrast image; it is shot in a smoke-free scene with the same camera used to shoot the template frame. A convolution operation applied to the fused feature map f_1 yields an intermediate predicted feature map $\hat{I}_{LR}$ (the image after smoke removal), and a first loss function (loss function 1) calculates the degree of difference between $\hat{I}_{LR}$ and the first contrast image.
Likewise, as shown in fig. 4, for the second sub-network G_2: during training, the smoke removal and sharpening module first acquires a clear, high-resolution real image containing the target object to be tracked as the second contrast image; it is shot by a high-resolution camera in a smoke-free scene. A convolution operation applied to the first high-resolution sharpened image yields a predicted high-resolution image $\hat{I}_{HR}$, and a second loss function (loss function 2) calculates the degree of difference between $\hat{I}_{HR}$ and the second contrast image.
Equation (1) below gives the loss function: the final loss function used by the smoke removal and sharpening module is defined as the sum of the first and second loss functions, both of which adopt the minimum mean-square-error loss. Training of the smoke removal and sharpening module is implemented with a gradient back-propagation algorithm; the parameters of the model are corrected through the feedback of the loss function, driving the images generated by the model as close as possible to the corresponding real images:

$\mathcal{L}_{rec} = \big\|\hat{I}_{LR} - I^{*}_{LR}\big\|_2^2 + \big\|\hat{I}_{HR} - I^{*}_{HR}\big\|_2^2$   (1)

where $I^{*}_{LR}$ and $I^{*}_{HR}$ denote the first and second contrast images.
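As a minimal sketch, assuming the predicted and contrast images are aligned tensors of equal shape, the combined loss of equation (1) can be computed as follows; the function name and arguments are illustrative, not defined by the patent:

    import torch.nn.functional as F

    # Sketch of equation (1): the sum of two minimum-mean-square-error terms.
    # pred_lr / pred_hr are the intermediate smoke-free prediction and the final
    # HR prediction; gt_lr / gt_hr are the first and second contrast images.
    def reconstruction_loss(pred_lr, gt_lr, pred_hr, gt_hr):
        loss1 = F.mse_loss(pred_lr, gt_lr)   # first loss function (smoke removal)
        loss2 = F.mse_loss(pred_hr, gt_hr)   # second loss function (sharpening)
        return loss1 + loss2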
Further, in one embodiment, to obtain more realistic results, adversarial losses are used to supervise the training of the smoke removal and sharpening module. Two discriminators, denoted first discriminator D_1 and second discriminator D_2, are introduced and implemented with a small convolutional classification network such as AlexNet. The role of each discriminator is to determine whether an output image is a generated sharpened image or a real sharpened image, and to drive the first sub-network G_1 and the second sub-network G_2 through adversarial losses to generate more realistic images.
Equation (2) below gives the adversarial loss function introduced with the first discriminator D_1 and the second discriminator D_2. In equation (2), $\hat{I}_{LR}$ denotes the smoke-free low-resolution image predicted by the first sub-network G_1, and $\hat{I}_{HR}$ denotes the high-resolution image predicted by the second sub-network G_2. The final adversarial loss function used by the smoke removal and sharpening module is defined as the sum of the adversarial losses of the first sub-network G_1 and the second sub-network G_2; training of the module is again implemented with a gradient back-propagation algorithm, the parameters of the model are corrected through the feedback of the adversarial loss, and the adversarial training drives the model to generate images as close as possible to the corresponding real images:

$\mathcal{L}_{adv} = \mathbb{E}\big[\log D_1(I^{*}_{LR}) + \log\big(1 - D_1(\hat{I}_{LR})\big)\big] + \mathbb{E}\big[\log D_2(I^{*}_{HR}) + \log\big(1 - D_2(\hat{I}_{HR})\big)\big]$   (2)
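A hedged sketch of this adversarial supervision follows, using the standard binary cross-entropy GAN formulation; the description fixes only that D_1/D_2 classify generated versus real images and that the two adversarial terms are summed, so the exact loss form below is an assumption:

    import torch
    import torch.nn.functional as F

    # d1/d2 are the two discriminator networks; fake_lr/fake_hr are the
    # outputs of G_1/G_2, and real is the corresponding contrast image.
    def discriminator_loss(d, real, fake):
        real_logits, fake_logits = d(real), d(fake.detach())
        return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
                + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

    def generator_adversarial_loss(d1, fake_lr, d2, fake_hr):
        # G_1 and G_2 are driven to make their discriminators score the outputs as real.
        lr_logits, hr_logits = d1(fake_lr), d2(fake_hr)
        return (F.binary_cross_entropy_with_logits(lr_logits, torch.ones_like(lr_logits))
                + F.binary_cross_entropy_with_logits(hr_logits, torch.ones_like(hr_logits)))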
The smoke removal and sharpening module used to obtain the second high-resolution sharpened image corresponding to each image frame in the image sequence is the same as the one used to obtain the first high-resolution sharpened image, and is not described again here.
Further, in one embodiment, as shown in fig. 5, in the method for tracking a specific target object in a low-resolution image in a complex scene, obtaining, by the target object tracking module, the position coordinates in each second high-resolution sharpened image of the target object matching the target object specified in the first high-resolution sharpened image includes:
s510, marking a target object designated in the first high-resolution sharpened image as a target object template Z, and marking the current second high-resolution sharpened image as an image X to be tracked.
S520, the target object tracking module uses the lightweight network model as a backbone network to extract image features of the target object template Z and the image X to be tracked.
The lightweight network model used by the target object tracking module can be ShuffleNetV2; a lightweight network model increases the model inference speed.
S530, the target object tracking module fuses the image features of different scales extracted from the target object template Z and the image X to be tracked by using a multi-scale feature fusion unit.
The purpose of fusing features of different scales in the multi-scale feature fusion unit is to enable the model of the target object tracking module to identify target objects with a large size span in the image. As shown in fig. 6, the multi-scale feature fusion unit uses grouped convolution and depthwise-separable convolution, which on the one hand reduces the amount of computation to increase the model inference speed, and on the other hand fuses multi-scale image features to adapt to target objects with large size variation. The unit divides the input image features into three groups, learns features at three scales through three groups of separable convolution networks of different depths, restores the feature dimensions by tensor concatenation, fuses the three groups of multi-scale image features with a 3x3 convolution kernel, and inputs the result to the next-level network.
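A minimal sketch of such a fusion unit, assuming the stated three-group split with illustrative branch depths of 1, 2 and 3 and an assumed channel count:

    import torch
    import torch.nn as nn

    def sep_conv(ch):
        # Depthwise-separable convolution: depthwise 3x3 followed by pointwise 1x1.
        return nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1, groups=ch),
            nn.Conv2d(ch, ch, 1),
            nn.ReLU(inplace=True),
        )

    class MultiScaleFusion(nn.Module):
        def __init__(self, channels=96):
            super().__init__()
            assert channels % 3 == 0
            gch = channels // 3
            # Three branches of increasing depth learn features at three scales.
            self.branches = nn.ModuleList(
                nn.Sequential(*[sep_conv(gch) for _ in range(depth)])
                for depth in (1, 2, 3)
            )
            self.fuse = nn.Conv2d(channels, channels, 3, padding=1)

        def forward(self, x):
            groups = torch.chunk(x, 3, dim=1)  # split input features into three groups
            out = torch.cat([b(grp) for b, grp in zip(self.branches, groups)], dim=1)
            return self.fuse(out)              # 3x3 kernel fuses the three groups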
S540, the target object tracking module uses the cross attention fusion unit to implicitly calculate the correlation between the target object template Z and the image X to be tracked through a single branch structure, so as to detect the position of the target object in the image X to be tracked.
Referring to fig. 6, the cross-attention fusion unit fuses the image features of the target object template Z and of the image X to be tracked based on the attention mechanism commonly used in the Transformer model, implicitly obtains the correlation between each pixel position in the image X to be tracked and the target object, and then fuses the correlation result with the original image features.
The working mechanism follows that of the Transformer model. Let the present cross-attention fusion unit be a layer-l network (the target object tracking module is provided with two branch networks, each of which is provided with a cross-attention fusion unit and a multi-scale feature fusion unit), and let the input image features of the target object template Z and of the image X to be tracked at this layer be Z_l and X_l. Tensor dimension operations reshape Z_l and X_l into one-dimensional features, and three fully connected layers (K, Q, V) then produce two groups of features, (K_Z, Q_Z, V_Z) and (K_X, Q_X, V_X). The correlation between the target object template Z and the image X to be tracked is calculated from these two groups of features, as shown in equation (3):

$f_{Z,X} = \mathrm{softmax}\!\left(Q_Z K_X^{\top}/\sqrt{d}\right) V_X, \quad f_{X,Z} = \mathrm{softmax}\!\left(Q_X K_Z^{\top}/\sqrt{d}\right) V_Z$   (3)

The implicitly expressed correlation of equation (3) is dimension-changed to restore dimensions consistent with Z_l and X_l, and fused with the input image features as per equation (4):

$Z_{l+1} = Z_l + f_{Z,X}, \quad X_{l+1} = X_l + f_{X,Z}$   (4)
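A sketch of this fusion under the usual scaled-dot-product reading of equation (3); the single-head form, the shared K/Q/V projections between template and search features, and the embedding width are assumptions:

    import torch
    import torch.nn as nn

    class CrossAttentionFusion(nn.Module):
        def __init__(self, dim):
            super().__init__()
            # Three fully connected layers produce K, Q, V.
            self.k = nn.Linear(dim, dim)
            self.q = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)

        @staticmethod
        def attend(q, k, v):
            # Equation (3): f = softmax(Q K^T / sqrt(d)) V
            attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
            return attn @ v

        def forward(self, z, x):
            # z: (B, Nz, C) template tokens; x: (B, Nx, C) search-image tokens,
            # i.e. the feature maps flattened to one-dimensional sequences.
            kz, qz, vz = self.k(z), self.q(z), self.v(z)
            kx, qx, vx = self.k(x), self.q(x), self.v(x)
            f_zx = self.attend(qz, kx, vx)  # template attends to search image
            f_xz = self.attend(qx, kz, vz)  # search image attends to template
            return z + f_zx, x + f_xz       # residual fusion, equation (4)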
In the target detection process, the classical twin (Siamese) network model uses a two-branch structure to calculate the correlation between the target object template and the image to be detected, which is computationally heavy and slow; the cross-attention fusion unit used in this embodiment of the invention implicitly calculates that correlation through a single-branch structure, so the position of the target object in any image frame can be detected rapidly.
Further, in one embodiment, as shown in fig. 6, the target object tracking module in the method for tracking a specific target object in a low-resolution image in a complex scene predicts using two detection branches, a first detection branch and a second detection branch. As shown in fig. 6, the target object tracking module comprises the network layers Conv3x3-BN-ReLU, MaxPooling, ShuffleNetV2 convolution group 1, multi-scale feature fusion unit, cross-attention fusion unit, MaxPooling, ShuffleNetV2 convolution group 2, multi-scale feature fusion unit and cross-attention fusion unit. The first MaxPooling, ShuffleNetV2 convolution group 1, multi-scale feature fusion unit and cross-attention fusion unit constitute the first detection branch; the following MaxPooling, ShuffleNetV2 convolution group 2, multi-scale feature fusion unit and cross-attention fusion unit constitute the second detection branch. Conv3x3-BN-ReLU is used to extract, normalize and activate the input image data.
The first detection branch is used for outputting the probability that each pixel position in the image X to be tracked contains the target object, and the second detection branch is used for outputting the size of the target object detected in the image X to be tracked.
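A hedged sketch of the two prediction heads on top of the fused search-image features; the Conv3x3-BN-ReLU stems and the 1x1 output convolutions are assumptions about layer details that the figure only outlines:

    import torch.nn as nn

    class DetectionHeads(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.cls_head = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, 1, 1),  # per-pixel objectness logit
            )
            self.reg_head = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, 4, 1),  # per-pixel (x, y, w, h) offsets
            )

        def forward(self, feat):
            # feat: fused search-image features from the cross-attention unit.
            return self.cls_head(feat).sigmoid(), self.reg_head(feat)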
Further, in an embodiment, during the training process the target object tracking module introduces a third loss function and a fourth loss function to supervise training, calculating through them, respectively, the degree of difference between the outputs of the first detection branch and the second detection branch and the ground-truth results. The final loss of the two detection branches is the sum of the third and fourth loss functions, as shown in equation (5):

$\mathcal{L} = \mathcal{L}_{cls}(c, c^{*}) + \lambda\,\mathcal{L}_{reg}(r, r^{*})$   (5)

In the above formula, the hyper-parameter λ adjusts the weight between the two losses; X_L is the feature output by the last layer of the model; c* and c are, for any pixel position of the training image, the ground-truth value of belonging to the target object (1 for the target object, 0 otherwise) and the model's predicted value (a probability between 0 and 1); r* and r are the real size and the predicted size of the target object at any pixel position of the training image. The target object position is r = (x, y, w, h), where (x, y) is the center-point offset to be predicted and (w, h) is the bounding-box size offset to be predicted. The classification loss $\mathcal{L}_{cls}$ and the position regression loss $\mathcal{L}_{reg}$ use a smooth-L1 loss function.
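A minimal sketch of this supervision following equation (5); restricting the regression term to positive pixels is an assumption about how the per-pixel sizes are supervised:

    import torch.nn.functional as F

    def tracking_loss(cls_pred, cls_gt, reg_pred, reg_gt, lam=1.0):
        # cls_pred/cls_gt: (B, 1, H, W) predicted probability and 0/1 target.
        loss_cls = F.smooth_l1_loss(cls_pred, cls_gt)
        # reg_pred/reg_gt: (B, 4, H, W); regress only where the object is present.
        pos = cls_gt.bool().expand_as(reg_pred)
        if pos.any():
            loss_reg = F.smooth_l1_loss(reg_pred[pos], reg_gt[pos])
        else:
            loss_reg = reg_pred.sum() * 0.0
        return loss_cls + lam * loss_reg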
In one embodiment, an apparatus for tracking a specific target object in a low-resolution image in a complex scene is provided; as shown in fig. 7, the apparatus 600 comprises a smoke removal and sharpening module 610 and a target object tracking module 620.
Firstly, an image containing the target object to be tracked is taken as a template frame, an image sequence in which the target object is to be tracked is acquired, and the image data of the template frame and of the image sequence are input into the smoke removal and sharpening module 610.
The smoke removal and sharpening module 610 is configured to obtain the first high-resolution sharpened image corresponding to the template frame, and to obtain the second high-resolution sharpened image corresponding to each image frame in the image sequence.
A target object is designated as a target object template in the first high resolution sharpened image, and image data of the target object template and a second high resolution sharpened image corresponding to each image frame in the image sequence are input to the target object tracking module 620.
The target object tracking module 620 is configured to obtain, in each of the second high resolution sharpened images, a position coordinate of a target object that matches the target object template specified in the first high resolution sharpened image.
After the target object tracking module 620 obtains the position coordinates of the target object matched with the target object template in each second high-resolution sharpened image, the position coordinate data are stored and marked, and the marked second high-resolution sharpened images are sequentially played, so that the tracking of the target object is realized.
Further, in one embodiment, as shown in fig. 8, the smoke removing and sharpening module 610 in the apparatus for tracking a specific target object in a low resolution image in the complex scene includes a first sub-network 611 and a second sub-network 612 to implement two sub-tasks through cascade connection of the two sub-networks, where the first sub-network 611 is used to remove smoke interference in a template frame through fusion of multi-scale images, and the second sub-network 612 is used to implement sharpening of the low resolution image through a network structure of an encoder-decoder, so as to finally obtain a first high resolution sharpened image.
For example, taking the processing of the template frame by the smoke removal and sharpening module to obtain a first high-resolution sharpened image: the first sub-network 611 can use a classical deep convolutional network model such as ResNet18 or VGGNet-19 to extract the image features of the template frame; considering that the template frame has low resolution and small size, only three levels of downsampling are adopted to preserve the feature resolution. The first sub-network 611 then fuses the downsampled image features; for example, the last level of downsampled features can be up-sampled and fused with the features of the level above, with simple addition used during fusion to increase inference speed. Fusing the second-, third- and fourth-level downsampled features captures the local and global features of the template frame image, and the resulting feature map is defined as f_1. Next, the second sub-network 612 up-samples f_1 to obtain a feature map f_2 whose size matches the high-resolution image to be produced; the second sub-network 612 can then use a classical encoder-decoder network structure such as UNet or SegNet to further encode and decode the input feature map f_2 to obtain the first high-resolution sharpened image.
Further, in one embodiment, the target object tracking module 620 in the apparatus for tracking a specific target object in a low-resolution image in a complex scene uses a lightweight network model as the backbone network to extract image features of the target object template Z and the image X to be tracked; the lightweight network model may be ShuffleNetV2, and a lightweight network model increases the model inference speed.
The target object tracking module 620 is provided with a multi-scale feature fusion unit and a cross attention fusion unit, wherein the multi-scale feature fusion unit fuses the image features of different scales extracted from the target object template Z and the image X to be tracked, and the cross attention fusion unit implicitly calculates the correlation between the target object template Z and the image X to be tracked through a single-branch structure, so that the position of the target object in the image X to be tracked is detected.
Further, as shown in fig. 9, the target object tracking module 620 in the apparatus for tracking a specific target object in a low resolution image under the complex scene includes a first detection branch 621 and a second detection branch 622, where the first detection branch 621 and the second detection branch 622 are both provided with a multi-scale feature fusion unit and a cross attention fusion unit, the first detection branch 621 is used for outputting a probability that each pixel position in the image X to be tracked includes the target object, and the second detection branch 622 is used for outputting a size of the target object detected in the image X to be tracked.
Taking the first detection branch 621 as an example: its multi-scale feature fusion unit 6211 uses grouped convolution and depthwise-separable convolution, which reduces the amount of computation and increases the model inference speed while fusing multi-scale image features to adapt to target objects with large size variation. Specifically, the unit divides the input image features into three groups, learns features at three scales through three groups of separable convolution networks of different depths, restores the feature dimensions by tensor concatenation, fuses the three groups of multi-scale image features with a 3x3 convolution kernel, and inputs the result to the cross-attention fusion unit 6212. The cross-attention fusion unit 6212 fuses the image features of the target object template Z and of the image X to be tracked based on the attention mechanism commonly used in the Transformer model, implicitly obtains the correlation between each pixel position in the image X to be tracked and the target object, and then fuses the correlation result with the original image features.
Each functional module or unit in the device for tracking a specific target object in a low-resolution image in a complex scene is a program code module implementing the corresponding function; the program code modules can be executed by a computer, a handheld electronic device or a cloud server to realize tracking of the specific target object.
In one embodiment, an apparatus for tracking a specific target object in a low resolution image in a complex scene is provided, and as shown in fig. 11, the apparatus 700 includes a processor 710, and a memory 720 communicatively connected to the processor 710, where the memory 720 stores instructions executable by the processor, and the instructions are executed by the processor to enable the processor 710 to perform some or all of the steps in the method in the above embodiment.
The device 700 is an electronic device, and the processor 710 may employ a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), or another processor that meets the requirements. The memory 720 includes internal memory and external storage, the internal memory including read-only memory (ROM) and random-access memory (RAM). The device 700 further comprises a bus 730, an I/O interface 740 and a peripheral unit 750; the peripheral unit 750 is communicatively connected to the bus 730 via the I/O interface 740, and the processor 710 and the memory 720 are also communicatively connected to the bus 730, enabling communication among the memory 720, the peripheral unit 750 and the processor 710. The peripheral unit 750 may include a keyboard, a mouse, a communication module, a display, and the like. Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A method for tracking a specific target object in a low resolution image in a complex scene, comprising:
acquiring an image containing a target object to be tracked as a template frame;
obtaining a first high-resolution sharpened image corresponding to the template frame through a smoke removal and sharpening module;
specifying a target object in the first high resolution sharpened image;
acquiring an image sequence needing target object tracking, and acquiring a second high-resolution sharpened image corresponding to each image frame in the image sequence through a smoke removal and sharpening module;
and obtaining, through a target object tracking module, the position coordinates in each second high-resolution sharpened image of the target object matching the target object specified in the first high-resolution sharpened image.
2. The method for tracking a specific target object in a low-resolution image in a complex scene according to claim 1, wherein the obtaining, by a smoke removal and sharpening module, a first high-resolution sharpened image corresponding to the template frame comprises:
two sub-networks are provided, a first sub-network and a second sub-network; the first sub-network removes smoke interference in the template frame through fusion of multi-scale images, and the second sub-network realizes sharpening of the low-resolution image through an encoder-decoder network structure, finally obtaining the first high-resolution sharpened image.
3. The method of claim 2, wherein the first sub-network removing smoke interference in the template frame through fusion of multi-scale images comprises:
letting the first sub-network be G_1 and the template frame be a smoke-contaminated low-resolution image I_{LR,0}, up-sampling and down-sampling I_{LR,0} several times with a deep convolutional network model and extracting image features to obtain feature maps;
fusing the acquired feature maps to capture the local and global features of I_{LR,0}, obtaining a feature map f_1;
during training, the smoke removal and sharpening module acquires a real, smoke-free image containing the target object to be tracked as a first contrast image, applies a convolution operation to the fused feature map f_1 to obtain an intermediate predicted feature map $\hat{I}_{LR}$, and calculates the degree of difference between $\hat{I}_{LR}$ and the first contrast image through a first loss function;
during training, the smoke removal and sharpening module introduces a first discriminator D_1, which judges whether the image output by the first sub-network G_1 is a generated smoke-free image or a real smoke-free image, and drives the first sub-network through an adversarial loss to generate more realistic images.
4. The method of claim 2, wherein the second sub-network realizing sharpening of the low-resolution image through the encoder-decoder network structure comprises:
setting the second sub-network as G_2, and up-sampling the feature map processed by the first sub-network to obtain a feature map f_2 whose size matches the first high-resolution sharpened image;
then further encoding and decoding f_2 using the encoder-decoder network structure to obtain the first high-resolution sharpened image;
during training, the smoke removal and sharpening module acquires a clear, high-resolution real image containing the target object to be tracked as a second contrast image, applies a convolution operation to the first high-resolution sharpened image to obtain a predicted high-resolution image $\hat{I}_{HR}$, and calculates the degree of difference between $\hat{I}_{HR}$ and the second contrast image through a second loss function;
during training, the smoke removal and sharpening module introduces a second discriminator D_2, which judges whether the image output by the second sub-network G_2 is a generated high-resolution sharpened image or a real high-resolution sharpened image, and drives the second sub-network through an adversarial loss to generate more realistic images.
5. The method for tracking a specific target object in a low-resolution image in a complex scene according to claim 1, wherein the obtaining, by the target object tracking module, the position coordinates in each second high-resolution sharpened image of the target object matching the target object specified in the first high-resolution sharpened image comprises:
setting the target object specified in the first high-resolution sharpened image as a target object template Z, and setting the current second high-resolution sharpened image as the image X to be tracked;
the target object tracking module uses a lightweight network model as a backbone network to extract image features of the target object template Z and the image X to be tracked;
the target object tracking module uses a multi-scale feature fusion unit to fuse the image features of different scales extracted from the target object template Z and the image X to be tracked;
the target object tracking module uses a cross-attention fusion unit to implicitly calculate the correlation between the target object template Z and the image X to be tracked through a single-branch structure, so as to detect the position of the target object in the image X to be tracked;
the target object tracking module using the multi-scale feature fusion unit to fuse the image features of different scales extracted from the target object template Z and the image X to be tracked comprises:
the multi-scale feature fusion unit divides the input image features into n groups, learns the image features of n scales through n groups of separable convolution networks of different depths, restores the dimension of the image features by tensor concatenation, fuses the n groups of multi-scale features using an n-by-n convolution kernel, and inputs the result to the cross-attention fusion unit;
the target object tracking module using the cross-attention fusion unit to implicitly calculate the correlation between the target object template Z and the image X to be tracked through a single-branch structure comprises:
the image features of the target object template Z pass through the multi-scale feature fusion unit to obtain a target object template feature map Z_l, and the image features of the image X to be tracked pass through the multi-scale feature fusion unit to obtain a feature map X_l of the image to be tracked;
the cross-attention fusion unit fuses the target object template feature map Z_l and the feature map X_l passed in from the corresponding multi-scale feature fusion units, implicitly obtaining the correlation between each pixel position of X_l and the target object template feature map Z_l.
6. The method according to claim 5, wherein the target object tracking module predicts using two detection branches, namely a first detection branch and a second detection branch; the first detection branch is used for outputting the probability that each pixel position in the image X to be tracked contains the target object, and the second detection branch is used for outputting the size of the target object detected in the image X to be tracked;
in the training process, the target object tracking module introduces a third loss function and a fourth loss function to supervise training, and calculates, through the third loss function and the fourth loss function respectively, the degree of difference between the output results of the first detection branch and of the second detection branch and the real results.
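A compact sketch of claim 6's two detection heads follows: one branch emits a per-pixel probability that the target is present, the other regresses the target's size at each pixel. Supervising the first with binary cross-entropy and the second with an L1 loss is an illustrative reading of the "third" and "fourth" loss functions; the claim does not name the exact losses, and the ground-truth tensors below are placeholders.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # first detection branch: probability each pixel contains the target
        self.cls = nn.Conv2d(channels, 1, 3, padding=1)
        # second detection branch: target size (w, h) predicted at each pixel
        self.reg = nn.Conv2d(channels, 2, 3, padding=1)

    def forward(self, fused):
        return self.cls(fused), self.reg(fused)

heads = DetectionHeads()
fused = torch.rand(2, 64, 16, 16)  # output of the cross attention fusion unit
cls_logits, sizes = heads(fused)

# Placeholder supervision: per-pixel target presence and (w, h) size maps.
gt_prob = torch.randint(0, 2, (2, 1, 16, 16)).float()
gt_size = torch.rand(2, 2, 16, 16)

third_loss = nn.BCEWithLogitsLoss()(cls_logits, gt_prob)  # classification branch
fourth_loss = nn.L1Loss()(sizes, gt_size)                 # size-regression branch
(third_loss + fourth_loss).backward()
```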
7. The method for tracking a specific target object in a low-resolution image in a complex scene according to claim 1, wherein the image containing the target object to be tracked is acquired using a low-resolution camera, and the image sequence to be tracked is obtained by shooting the target object in a dynamic environment using the same camera.
8. An apparatus for tracking a specific target object in a low resolution image in a complex scene, comprising:
the smoke removal and sharpening module is used for obtaining a first high-resolution sharpened image corresponding to the template frame, and for obtaining a second high-resolution sharpened image corresponding to each image frame in the image sequence to be tracked;
a target object tracking module, configured to obtain, in each of the second high-resolution sharpened images, a position coordinate of a target object that matches a target object specified in the first high-resolution sharpened image;
the smoke removal and sharpening module comprises a first sub-network and a second sub-network and realizes two sub-tasks by cascading them: the first sub-network removes smoke interference in the template frame through multi-scale image fusion, and the second sub-network sharpens the low-resolution image through an encoder-decoder network structure, finally obtaining the first high-resolution sharpened image;
the target object tracking module uses a lightweight network model as a backbone network to extract image features of the target object template Z and the image X to be tracked;
and a multi-scale feature fusion unit and a cross attention fusion unit are arranged in the target object tracking module, the multi-scale feature fusion unit fuses the image features of different scales extracted from the target object template Z and the image X to be tracked, and the cross attention fusion unit implicitly calculates the correlation between the target object template Z and the image X to be tracked through a single branch structure, so that the position of the target object in the image X to be tracked is detected.
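Pulling the device claims together, here is a minimal end-to-end sketch of how the claimed modules chain: the cascaded de-smoke and sharpening stage produces the sharpened template and search images, a lightweight backbone extracts features, and a crude correlation plus classification head stands in for the fusion units and detection branch to yield a position per frame. Every component here is a placeholder assumption, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

# Placeholder stages, one per claimed module of the device.
desmoke_sharpen = nn.Sequential(                 # smoke removal and sharpening module
    nn.Conv2d(3, 3, 3, padding=1),               # first sub-network (de-smoke) stand-in
    nn.Conv2d(3, 3, 3, padding=1))               # second sub-network (sharpen) stand-in
backbone = nn.Sequential(                        # lightweight backbone stand-in
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU())
cls_head = nn.Conv2d(64, 1, 1)                   # first detection branch stand-in

def track(template_frame, frames):
    """Return a (row, col) best-match position for each frame (sketch only)."""
    z = desmoke_sharpen(template_frame)          # first high-resolution sharpened image
    z_feat = backbone(z).mean(dim=(2, 3), keepdim=True)  # pooled template feature
    positions = []
    for frame in frames:
        x_feat = backbone(desmoke_sharpen(frame))        # second sharpened image -> features
        # crude correlation stand-in for the fusion units + detection branch
        score = cls_head(x_feat * z_feat)
        flat = score.flatten(2).argmax(dim=2)
        positions.append(divmod(int(flat[0, 0]), score.shape[-1]))
    return positions

frames = [torch.rand(1, 3, 64, 64) for _ in range(3)]
print(track(torch.rand(1, 3, 64, 64), frames))   # e.g. [(r1, c1), (r2, c2), (r3, c3)]
```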
9. The apparatus for tracking a specific target object in a low-resolution image under a complex scene according to claim 8, wherein the target object tracking module comprises a first detection branch and a second detection branch, the first detection branch is used for outputting a probability that each pixel position in the image X to be tracked contains the target object, and the second detection branch is used for outputting a size of the target object detected in the image X to be tracked.
10. Equipment for tracking a specific target object in a low-resolution image in a complex scene, comprising at least one processor and a memory communicatively coupled to the processor, wherein the memory stores instructions executable by the processor to enable the processor to perform the method of any one of claims 1 to 7.
CN202310232244.1A 2023-03-10 2023-03-10 Method, device and equipment for tracking specific target object in low-resolution image in complex scene Pending CN116342649A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310232244.1A CN116342649A (en) 2023-03-10 2023-03-10 Method, device and equipment for tracking specific target object in low-resolution image in complex scene

Publications (1)

Publication Number Publication Date
CN116342649A true CN116342649A (en) 2023-06-27

Family

ID=86885113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310232244.1A Pending CN116342649A (en) 2023-03-10 2023-03-10 Method, device and equipment for tracking specific target object in low-resolution image in complex scene

Country Status (1)

Country Link
CN (1) CN116342649A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881639A (en) * 2023-07-10 2023-10-13 国网四川省电力公司营销服务中心 Electricity larceny data synthesis method based on generation countermeasure network


Similar Documents

Publication Publication Date Title
Yeh et al. Multi-scale deep residual learning-based single image haze removal via image decomposition
CN108805789B (en) Method, device and equipment for removing watermark based on antagonistic neural network and readable medium
CN111062880A (en) Underwater image real-time enhancement method based on condition generation countermeasure network
CN111696196B (en) Three-dimensional face model reconstruction method and device
CN111047541A (en) Image restoration method based on wavelet transformation attention model
JP2013037488A (en) Image recognition device
CN111695421B (en) Image recognition method and device and electronic equipment
CN108665484B (en) Danger source identification method and system based on deep learning
CN111626176A (en) Ground object target detection method and system of remote sensing image
CN112949493A (en) Lane line detection method and system combining semantic segmentation and attention mechanism
CN116342649A (en) Method, device and equipment for tracking specific target object in low-resolution image in complex scene
CN115578460B (en) Robot grabbing method and system based on multi-mode feature extraction and dense prediction
CN114004766A (en) Underwater image enhancement method, system and equipment
CN115880495A (en) Ship image target detection method and system under complex environment
CN115631107A (en) Edge-guided single image noise removal
CN112528782A (en) Underwater fish target detection method and device
CN112509144A (en) Face image processing method and device, electronic equipment and storage medium
Zhong et al. Deep attentional guided image filtering
CN113158860B (en) Deep learning-based multi-dimensional output face quality evaluation method and electronic equipment
CN114529890A (en) State detection method and device, electronic equipment and storage medium
Ramwala et al. Reminiscent net: conditional GAN-based old image de-creasing
TW202221639A (en) Saliency map generation method and image processing system using the same
CN113256541B (en) Method for removing water mist from drilling platform monitoring picture by machine learning
CN116309158A (en) Training method, three-dimensional reconstruction method, device, equipment and medium of network model
CN116935044A (en) Endoscopic polyp segmentation method with multi-scale guidance and multi-level supervision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination