CN114241274A - Small target detection method based on super-resolution multi-scale feature fusion - Google Patents

Small target detection method based on super-resolution multi-scale feature fusion

Info

Publication number
CN114241274A
Authority
CN
China
Prior art keywords
feature
image
network
target detection
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111473712.1A
Other languages
Chinese (zh)
Other versions
CN114241274B (en)
Inventor
徐洁
叶娅兰
刘紫奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202111473712.1A priority Critical patent/CN114241274B/en
Publication of CN114241274A publication Critical patent/CN114241274A/en
Application granted granted Critical
Publication of CN114241274B publication Critical patent/CN114241274B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a small target detection method based on super-resolution multi-scale feature fusion, and belongs to the technical field of image processing. A low-resolution image to be identified is input into a feature extractor to obtain a first feature map; the low-resolution image is subjected to data enhancement, superposed with a noise disturbance, and input into a generator to obtain a superposition amount. The superposition result of the first feature map and the superposition amount is taken as a first reconstruction feature and input into a decoder to obtain second reconstruction features of different sizes, which are input into a feature fusion network; the feature fusion network samples all the second reconstruction features to the same size and superposes them to obtain a third reconstruction feature, which is input into the image target detection network; the category of the small target and the position of its detection frame are obtained based on the output of the image target detection network. The invention achieves short training time, fast inference and high precision in small target detection, and has an industry-leading small target detection effect.

Description

Small target detection method based on super-resolution multi-scale feature fusion
Technical Field
The invention relates to the technical field of image processing, in particular to a small target detection method based on super-resolution multi-scale feature fusion.
Background
Target detection is a hot direction of computer vision and digital image processing, and is widely applied in fields such as robot navigation, intelligent video surveillance, industrial inspection, and aerospace. With the trend toward intelligence in various fields, realizing target detection is of great practical significance for reducing labor costs. Small target detection is a crucial link in downstream tasks of target detection. For example, detecting small or distant objects in a high-resolution scene captured from a vehicle is a necessary condition for safely deploying automated driving; likewise, in satellite image analysis, it is important to efficiently annotate objects such as cars, ships, and houses. Small target detection is therefore receiving increasing attention.
With the recent progress of deep learning, target detection has made great progress in both performance and speed. Currently, some of the most advanced target detectors have achieved extremely high precision on large and medium size targets, which can meet the requirements of many practical applications. These detectors usually do not distinguish between small targets and medium-to-large targets, and both are processed and identified in the same way. However, they neglect the common difficulties of the small target itself, such as low resolution, blurred appearance, little information, and much noise; as a result, when these methods are used for detecting small targets, the average accuracy obtained may be only half of that achieved on medium and large size targets.
In order to improve the accuracy of small target detection, researchers first tried to adjust the feature extraction stage of general detectors, hoping to solve the problem of the low resolution of small target features. For example, some methods reduce the compression ratio of image data processing so that small objects have higher resolution in the extracted features. However, these methods do not take into account that the resolution of many target detection datasets is itself not high, and that small target features already suffer from low resolution and little information before extraction.
In recent years, some researchers have chosen to design detectors specifically for small target objects. They found that shallow features are more beneficial for distinguishing small target objects, and chose to extract features directly from shallow convolutions to improve the detection accuracy of small targets. This relieves the problem of insufficient feature information for small targets to a certain extent. However, such detectors lose a relatively large amount of the semantic information of the image, and generalize poorly in general target detection involving large-sized objects.
Furthermore, most existing small target detectors use general target detection datasets. Most of the data in these datasets are medium and large objects, and only a few images contain small target objects, so for much of the training the detection model cannot learn the characteristics of small targets. At the same time, small target objects cover a much smaller area than large targets, which causes an imbalance with few small-target matches and many large-target matches, so that even specialized small target detectors still pay more attention to large and medium sized objects.
Disclosure of Invention
The invention provides a small target detection method based on super-resolution multi-scale feature fusion, which is used for solving the problem of low resolution of a small target object so as to improve the detection performance of the small target during image target detection processing.
The technical scheme adopted by the invention is as follows:
a small target detection method based on super-resolution multi-scale feature fusion comprises the following steps:
network model configuration and training:
acquiring pairs of high-resolution and low-resolution images as training images to obtain a training image set;
configuring a network model, comprising: an encoder-decoder network for high-resolution images, a feature extractor G_L for low-resolution images, a generator G, a feature fusion network, and an image target detection network;
the encoder part of the encoder-decoder network is denoted as encoder G_H and the decoder part as decoder D_H; the encoder G_H comprises a plurality of convolution layers and pooling layers arranged in an alternating structure; the decoder D_H comprises a plurality of deconvolution layers whose number corresponds to the number of convolution layers of the encoder G_H, with corresponding feature dimensions and sizes;
the low-resolution image LR in each high/low-resolution image pair is input into the feature extractor G_L, and the feature f_L is obtained based on its output; the high-resolution image HR of the pair is input into the encoder G_H, and the feature f_H is obtained based on its output; the loss function used in encoder-decoder network training is:
L_rc1 = ||HR' - HR||_2^2
where HR' represents the output of the decoder D_H;
the feature extractor G_L comprises multiple feature extraction blocks, each of which consists of a multi-scale feature fusion network and local residual learning;
the input of the generator G is: the image LR' obtained by performing data enhancement processing on the low-resolution image LR, superposed with a randomly generated noise disturbance; the output of the generator G is recorded as the superposition amount p, and the loss function adopted by the generator G during training is: L_p = ||p||;
The output of the generator G is superposed with the output of the feature extractor G_L to obtain a first reconstruction feature, which is input into the decoder D_H; the output of each deconvolution layer of the decoder D_H is used as an input of the feature fusion network, which samples the input feature maps of different sizes to the same size and superposes them, and the superposition result is then input into the image target detection network;
the image target detection network comprises a classification branch and a positioning branch, and the classification branch of the image target detection network classifies targets based on an attention mechanism when the classification branch of the image target detection network classifies the targets;
the total loss adopted during the training of the configured network model is as follows: l ═ λLr+μLlocLregWherein L isrRepresents a loss of super-resolution reconstruction, and Lr=Lrc1+Lrc2+Lp,Lrc2Represents the first reconstruction loss as:
Figure BDA0003382451610000022
Lloc、Lregrespectively representing the classification loss of the classification branch of the image target detection network and the positioning loss (i.e. regression loss) of the positioning branch, wherein lambda, mu and eta are respectively the loss Lr、LlocAnd LregThe weighting factor of (1);
a step of detecting a low-resolution image to be identified:
inputting the low-resolution image to be recognized into the feature extractor G_L, and obtaining a first feature map of the low-resolution image to be recognized based on the output of the feature extractor G_L;
after data enhancement processing is carried out on the low-resolution image, it is superposed with a randomly generated noise disturbance and input into the generator G, and the superposition amount is obtained based on the output of the generator G; the superposition result of the first feature map and the superposition amount is taken as the first reconstruction feature of the low-resolution image to be identified;
inputting the first reconstruction feature into the decoder D_H, generating second reconstruction features of different sizes based on the output of each deconvolution layer of the decoder D_H, and inputting them into the feature fusion network;
the feature fusion network samples all the second reconstruction features to the same size for superposition to obtain third reconstruction features and inputs the third reconstruction features into the image target detection network;
and obtaining the category of the small target and the position of the detection frame thereof based on the output of the image target detection network.
The technical scheme provided by the invention at least has the following beneficial effects:
compared with traditional small target detection approaches, the detection method of the invention maintains state-of-the-art real-time detection performance while balancing training time, inference time, and detection precision for small target detection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is an xx diagram of a xxx method provided by embodiments of the present invention;
FIG. 2 is an xx diagram of a xxx method provided by embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
In order to solve the problem of insufficient precision caused by insufficient features of small targets (i.e., detected targets smaller than a specified size in the image to be detected), the embodiment of the invention applies a super-resolution technique at the feature level, improves the semantic information of deep features by combining a feature fusion technique, and improves the detection performance of target detection by using an attention mechanism.
Referring to fig. 1 and fig. 2, the method for detecting a small target based on super-resolution multi-scale feature fusion provided by the embodiment of the invention comprises:
inputting a low-resolution image to be recognized into a configured feature extractor G_L, and obtaining a first feature map of the low-resolution image to be recognized based on the output of the feature extractor G_L; after data enhancement processing is carried out on the low-resolution image, superposing it with a randomly generated noise disturbance and inputting it into a configured generator G, and obtaining the superposition amount based on the output of the generator G; taking the superposition result of the first feature map and the superposition amount as the first reconstruction feature of the low-resolution image to be identified;
inputting the first reconstruction feature into the configured decoder D_H, which sequentially generates second reconstruction features of different sizes, and sampling all the second reconstruction features to the same size and superposing them to obtain a third reconstruction feature; the decoder D_H includes a plurality of deconvolution layers, each of which outputs a second reconstruction feature of one size;
inputting the third reconstruction feature into a configured image target detection network to perform target detection on the small target, wherein the image target detection network comprises a classification branch and a positioning branch; the category of the small target and the position of its detection frame are obtained based on the output of the image target detection network, and the classification branch performs target classification based on an attention mechanism.
The specific implementation of the decoder D_H, the feature extractor G_L, the generator G, the third reconstruction feature, and the image target detection network comprises:
(1) Realizing the conversion of the image from low resolution to high resolution at the feature level, so as to enhance the semantic information of the subsequent low-resolution input. The low-resolution image LR and the high-resolution image HR of each image pair are taken as network inputs, and the corresponding features f_L and f_H are obtained through the different feature extractors G_L and G_H. The generator G obtains the superposition amount p at the feature level that converts the low-resolution image feature f_L into the high-resolution image feature f_H, realizing super-resolution at the feature level; the deep feature f_H of the high-resolution image is restored to the original high-resolution image through the decoder to ensure the validity of the semantic information of the deep feature.
(1-1): taking the high resolution image HR as input to the encoder-decoder part of the network, where GHI.e. the encoder part, the decoder is denoted as DHPerforming convolution pooling for multiple times to obtain deep layer characteristic fH
In the embodiment of the present invention, the encoder-decoder may adopt any conventional network structure, and specifically, the encoder G may be usedHIs set to be 7 layers, and is subjected to convolution pooling by adopting three convolution kernels of 7 multiplied by 7, 5 multiplied by 5 and 3 multiplied by 3 and a pooling kernel of 2 multiplied by 2 to obtain fH. For example, each convolution pooling process is first run through three convolution layers (which may typically include convolution operations, batch normalizationProcessing and activation function mapping), and then through a pooling layer.
(1-2): decoder DHComposed of multiple deconvolution layers, and deep layer characteristics fHAs input to the decoder, the deconvolution layers correspond to the number of convolution layers and the characteristic dimensions and sizes, for fHPerforming dimensionality raising to obtain an output HR'; the HR' and the HR have the same resolution size and the same channel number; i.e. encoder GHHas the function of generating a characteristic image with semantic information, which is then passed through a decoder DHMapping low resolution feature images output by encoder GH back to the size of the input image
(1-3): taking the L2 distance as the reconstruction loss of HR and HR ', optimizing the L2 loss (L2 norm loss function) to make HR' and HR closer, and making the decoder part possess the deep characteristic fHAbility to reconstruct original image, only deep features fHContains the necessary semantic information to ensure that the slave fHReverting to the original image.
Specifically, the reconstruction loss is as follows:
Figure BDA0003382451610000051
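By way of illustration and not limitation, the encoder-decoder of steps (1-1) to (1-3) could be sketched in Python (PyTorch) as below; the three-stage layout, channel widths and use of batch normalization are assumptions of this sketch, since the embodiment only fixes an alternating convolution/pooling encoder, a deconvolution decoder, and the L2 reconstruction loss L_rc1.

# Hypothetical sketch of the encoder G_H / decoder D_H pair and the loss L_rc1.
# Stage count, channel widths and batch normalization are illustrative assumptions.
import torch
import torch.nn as nn

class EncoderGH(nn.Module):
    def __init__(self, in_ch=3, widths=(32, 64, 128), kernels=(7, 5, 3)):
        super().__init__()
        layers, prev = [], in_ch
        for w, k in zip(widths, kernels):
            layers += [nn.Conv2d(prev, w, k, padding=k // 2),
                       nn.BatchNorm2d(w), nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]                    # alternating convolution and 2x2 pooling
            prev = w
        self.body = nn.Sequential(*layers)

    def forward(self, hr):                                 # hr: (B, 3, H, W)
        return self.body(hr)                               # deep feature f_H: (B, 128, H/8, W/8)

class DecoderDH(nn.Module):
    def __init__(self, out_ch=3, widths=(128, 64, 32)):
        super().__init__()
        layers = []
        for i, w in enumerate(widths):
            nxt = widths[i + 1] if i + 1 < len(widths) else out_ch
            layers += [nn.ConvTranspose2d(w, nxt, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True) if nxt != out_ch else nn.Identity()]
        self.body = nn.Sequential(*layers)

    def forward(self, f_h):                                # f_h: (B, 128, H/8, W/8)
        return self.body(f_h)                              # HR': (B, 3, H, W), same size as HR

def l_rc1(hr_pred, hr):
    # L_rc1 = ||HR' - HR||_2^2, averaged over pixels (the L2 reconstruction loss of step (1-3))
    return torch.mean((hr_pred - hr) ** 2)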
(1-4): using the low resolution image LR as the feature extractor GLThe feature f is obtained by multi-scale feature fusion and local residual error learningL
Specifically, the feature extractor GLThe number of the characteristic layers is set to be 5, each layer is formed by multi-scale characteristic fusion and local residual error learning, and image characteristics of different scales can be obtained, so that the image characteristics are fully extracted.
In the i-th layer, M_{i-1} is used as the input of the next multi-scale residual block to obtain its output M_i; this is repeated until M_n is obtained. In the embodiment of the present invention, each layer includes three convolutional layers.
M_{i-1} is taken as the input of the first convolutional layer, and 3×3 and 5×5 convolutions followed by the ReLU function produce the outputs S_1 and P_1, respectively. S_1 and P_1 are concatenated as the input of the second convolutional layer, and 3×3 and 5×5 convolutions followed by the ReLU function produce the outputs S_2 and P_2, respectively. S_2 and P_2 are concatenated as the input of the third convolutional layer, and a 1×1 convolution produces the output S'. M_{i-1} is connected to the output through a residual connection and combined with S' to obtain the final output M_i.
All the outputs M_0 to M_n are used as the input of the hierarchical feature fusion structure to obtain the extracted feature M_5.
All inputs of the hierarchical feature fusion structure are concatenated, and the fused feature channels are compressed to the required number of channels with a 1×1 convolution to obtain the extracted feature M_5, i.e., the feature f_L.
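Under the naming assumptions below, the feature extraction block of G_L described in step (1-4) could be sketched as follows; the channel count and the stem convolution are assumptions of this sketch rather than features fixed by the embodiment.

# Hypothetical sketch of one multi-scale residual block of the feature extractor G_L,
# following step (1-4): parallel 3x3 / 5x5 branches, concatenated intermediate outputs,
# 1x1 fusion, and a local residual connection, plus hierarchical feature fusion.
import torch
import torch.nn as nn

class MultiScaleResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        c = channels
        self.conv3_1 = nn.Conv2d(c, c, 3, padding=1)
        self.conv5_1 = nn.Conv2d(c, c, 5, padding=2)
        self.conv3_2 = nn.Conv2d(2 * c, c, 3, padding=1)   # takes the concatenation [S1, P1]
        self.conv5_2 = nn.Conv2d(2 * c, c, 5, padding=2)   # takes the concatenation [S1, P1]
        self.fuse = nn.Conv2d(2 * c, c, 1)                 # 1x1 fusion of [S2, P2]
        self.relu = nn.ReLU(inplace=True)

    def forward(self, m_prev):
        s1 = self.relu(self.conv3_1(m_prev))
        p1 = self.relu(self.conv5_1(m_prev))
        cat1 = torch.cat([s1, p1], dim=1)
        s2 = self.relu(self.conv3_2(cat1))
        p2 = self.relu(self.conv5_2(cat1))
        s_prime = self.fuse(torch.cat([s2, p2], dim=1))
        return m_prev + s_prime                            # local residual learning

class FeatureExtractorGL(nn.Module):
    """Stacks n blocks and fuses all intermediate outputs with a 1x1 convolution."""
    def __init__(self, in_ch=3, channels=64, n_blocks=5):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, channels, 3, padding=1)
        self.blocks = nn.ModuleList([MultiScaleResidualBlock(channels) for _ in range(n_blocks)])
        self.hff = nn.Conv2d(channels * (n_blocks + 1), channels, 1)  # hierarchical feature fusion

    def forward(self, lr):
        m = self.stem(lr)
        outs = [m]                                         # M_0
        for blk in self.blocks:
            m = blk(m)
            outs.append(m)                                 # M_1 ... M_n
        return self.hff(torch.cat(outs, dim=1))            # extracted feature f_L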
(1-5): LR' is obtained by data enhancement processing of LR, and a noise disturbance is randomly generated; the superposition of LR' and the noise disturbance is used as the input of the generator G to obtain the superposition amount p, and an L1 regularization term on p is calculated to ensure the sparsity of p.
Specifically, data enhancement generally improves the display quality of a quantized coarse image by adjusting or varying the amplitude values of the image. Dithering can eliminate part of the false contours produced by too few gray levels, and the effect is more obvious when the superimposed dither value is larger. However, the superposition of dither values also introduces noise into the image, and the larger the dither value, the larger the noise influence. Dithering is generally achieved by adding a small random noise d(x, y) to the original image f(x, y); the value of d(x, y) is generally not linked to f(x, y) in any regular way. Color dithering and the addition of noise data improve the generalization capability and robustness of the trained model.
The regularization term is as follows:
L_p = ||p||_1
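A minimal sketch of the generator G and the sparsity term L_p of step (1-5) is given below; the network depth, channel width and noise amplitude are assumptions, and the sketch presumes that f_L keeps the spatial size of LR so that p can be superposed on it directly.

# Hypothetical sketch of the generator G: it maps the augmented low-resolution
# image LR' plus a random noise disturbance to a superposition amount p with the
# same shape as the feature f_L. Depth, width and noise scale are assumptions.
import torch
import torch.nn as nn

class GeneratorG(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )

    def forward(self, lr_aug, noise_std=0.05):
        noise = noise_std * torch.randn_like(lr_aug)   # randomly generated noise disturbance
        return self.body(lr_aug + noise)               # superposition amount p

def l_p(p):
    # L1 regularization term enforcing the sparsity of the superposition amount p
    return p.abs().mean()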
(1-6): will f isLThe result of superposition of p
Figure BDA0003382451610000052
As a reconstruction feature, calculate
Figure BDA0003382451610000053
And fHThe L2 distance of (a) is taken as a reconstruction loss, let GLAnd G possesses the ability to increase image resolution at the feature level.
Specifically, the reconstruction loss is as follows:
L_rc2 = ||(f_L + p) - f_H||_2^2
The overall loss of the feature-level super-resolution part is thus:
L_r = L_rc1 + L_rc2 + L_p
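Assembling steps (1-1) to (1-6), the feature-level super-resolution loss L_r could be computed as in the following sketch, which is a hedged composition of the quantities defined above rather than the exact training code of the embodiment.

# Hypothetical assembly of L_r = L_rc1 + L_rc2 + L_p from the quantities defined above.
import torch

def super_resolution_loss(f_l, p, f_h, hr_pred, hr):
    l_rc1 = torch.mean((hr_pred - hr) ** 2)        # decoder reconstruction of the original HR image
    l_rc2 = torch.mean(((f_l + p) - f_h) ** 2)     # feature-level reconstruction toward f_H
    l_p = p.abs().mean()                           # L1 sparsity of the superposition amount p
    return l_rc1 + l_rc2 + l_p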
(2): by reconstruction features
Figure BDA0003382451610000061
And a decoder DH generates depth features of different scales, and retains semantic information of small targets in different feature layers through multi-scale feature fusion. Generating a class-dependent feature map
Figure BDA0003382451610000062
The attention mechanism is utilized to increase the loss specific gravity of the interested target so as to improve the performance of target detection.
In particular, the amount of the solvent to be used,
Figure BDA0003382451610000063
wherein C, H, W, r represents the number of categories, the height and width of the input image, and the output stride, respectively;
(2-1): will be provided with
Figure BDA0003382451610000064
Input to a decoder DHPerforming up-sampling to generate reconstruction features with different sizes in sequenced1、d2、d3、d4、d5Due to DHFinally, the features are restored to the original image, so that the generated features can be regarded as depth features of a super-resolution image, namely the reconstructed features are more than the low-resolution image features fLMore semantic information is contained.
(2-2): feature d is reconstructed1、d2、d3、d4、d5All upsampled to the same size for superposition. Generally, a small target retains more semantic information in a shallow feature, but as the network goes deeper, the semantic information of the small target is gradually lost, and the semantic information of the large target is gradually abstracted to meet the application requirements of the network. Therefore, the semantic information of the small target can be kept while the abstract semantic information of the large target is obtained through the fusion of the features under different levels. Recording the final feature superposition result as d;
Specifically, the feature superposition follows a feature pyramid model that combines multi-level features to address the multi-scale problem; the whole structure is composed of bottom-up down-sampling, top-down up-sampling, and lateral connections. For example, the low-resolution feature map d_1 is up-sampled by a factor of 2 to obtain d'_1, and the two are added, i.e., the up-sampled map is combined with the corresponding bottom-up feature map to obtain an intermediate feature, as in the following formula:
d_1^t = d_1 + d'_1
This process is iterated until the final-resolution map d is generated.
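The upsample-and-add fusion of step (2-2) might be realized as in the sketch below; bilinear interpolation and the assumption that all five reconstruction features share the same channel number are choices of this sketch.

# Hypothetical sketch of the multi-scale feature fusion in step (2-2): every
# second reconstruction feature is upsampled to the size of the finest one and
# the maps are added, as in a feature-pyramid top-down pathway.
import torch
import torch.nn.functional as F

def fuse_pyramid(features):
    """features: list of tensors (B, C, h_i, w_i), ordered from coarse to fine,
    all with the same channel count C."""
    target_size = features[-1].shape[-2:]          # spatial size of the finest map
    fused = 0
    for d in features:
        if d.shape[-2:] != target_size:
            d = F.interpolate(d, size=target_size, mode="bilinear", align_corners=False)
        fused = fused + d                          # element-wise superposition
    return fused                                   # final fused feature d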
(2-3): feature d obtains a class-related feature map through a convolution layer
Figure BDA0003382451610000065
The method comprises C channels corresponding to the number of the classes of the target to be identified, wherein each channel is used for extracting the characteristics of the object corresponding to the class and ignoring the characteristics of other classes. Generating channel weights W using a soft attention mechanismcAnd the loss ratio of the category to be identified is further improved.
Specifically, in the attention mechanism a weighting operation is performed along the channel dimension, which makes the model pay more attention to the channel features carrying the most information, i.e., to the categories of the targets to be identified rather than to other categories. First, the feature obtained by the convolution is compressed to obtain the global feature d' at the channel level, where the number of channels C is equal to the number of categories to be identified; then the relationships among the channels are learned from the global feature to obtain the weights W_c of the different channels; finally, the weights W_c are multiplied channel-wise with the original feature to obtain the final class-related feature map Ĉ.
Next, the feature classification of each channel is regarded as a binary classification problem, namely whether the extracted features belong to the class to be identified or not; a binary cross-entropy loss is calculated for each channel, and the proportion of each channel's loss is balanced by the attention mechanism weights, so that the network finally tends to extract the features of objects of the corresponding class from each specific channel; the optimization target is this channel-wise binary cross-entropy weighted by W_c.
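One possible realization of the soft channel attention and the channel-wise binary cross-entropy of step (2-3) is sketched below along squeeze-and-excitation lines; the reduction ratio and the exact way the weights W_c enter the loss are assumptions of this sketch.

# Hypothetical sketch of the attention-based classification branch: a 1x1 conv maps
# the fused feature d to C class channels, channel weights W_c come from a
# squeeze-style soft attention, and each channel is trained with a binary
# cross-entropy whose contribution is scaled by its weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassAttentionHead(nn.Module):
    def __init__(self, in_ch=64, num_classes=80, reduction=4):
        super().__init__()
        self.to_classes = nn.Conv2d(in_ch, num_classes, 1)
        self.attn = nn.Sequential(                      # learns the channel weights W_c
            nn.AdaptiveAvgPool2d(1),                    # squeeze: one global value per channel
            nn.Conv2d(num_classes, num_classes // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(num_classes // reduction, num_classes, 1), nn.Sigmoid(),
        )

    def forward(self, d):
        logits = self.to_classes(d)                     # (B, C, H/r, W/r)
        w_c = self.attn(logits)                         # (B, C, 1, 1) channel weights W_c
        c_map = w_c * torch.sigmoid(logits)             # class-related feature map
        return c_map, logits, w_c

def channel_bce_loss(logits, targets, w_c):
    # Per-channel binary cross-entropy whose contribution is balanced by W_c.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (w_c * bce).mean()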
(2-4): similarly, feature d is mapped by convolutional layers
Figure BDA0003382451610000072
It contains 4 channels for the subsequent target size regression task.
In particular, the amount of the solvent to be used,
Figure BDA0003382451610000073
wherein H, W, r represents the number of categories, the height and width of the input image, and the output stride, respectively;
(3): training using two-dimensional Gaussian kernels and labeledData generation thermodynamic diagram H for supervised training, characterization
Figure BDA0003382451610000074
For a central positioning task. The target center is used as a positive sample, other pixel points are used as negative samples, the problem of the number unbalance of the positive and negative samples is solved through Focal local, and the Loss L is obtainedloc
The general structure of the network is shown in Fig. 2, and the extracted features are used to perform the center localization task. The feature pyramid structure enlarges feature maps of different depths to the size of the last layer and adds them directly, so that the high-resolution information of shallow features and the semantic information of deep features are both retained, enhancing the target detection effect; research shows that shallow features are more suitable for small target detection. The extracted class-related feature map Ĉ, of dimensions C × (H/r) × (W/r), is used for the center localization task, where C, H, W and r are the number of categories, the height and width of the input image, and the output stride. In this embodiment, C = 80 and r = 4 are set; a Gaussian kernel is used for both center localization and detection box regression, and scalars α and β are defined to control the size of the respective kernels.
given the genus CmThe mth label box of a class is first linearly mapped to the scale of the feature map. Then, 2-dimensional Gaussian kernel is adopted
Figure BDA0003382451610000076
To generate
Figure BDA0003382451610000077
Wherein
Figure BDA0003382451610000078
Finally, by applying HmMaximum value of element(s) in H to update C in HmA channel. Generation of HmM is marked with the center of the box as (x) determined by the parameter alpha0,y0) m, the size of the mark frame is (h, w)m. By using
Figure BDA0003382451610000079
To ensure that the center is located in the pixel. In the network setting, α may be made 0.54.
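A sketch of the heatmap generation in step (3) follows; mapping boxes to the feature scale, flooring the centre and taking the element-wise maximum follow the description above, while the relation sigma = alpha * box_size / 6 between alpha and the Gaussian standard deviations is an assumption of this sketch.

# Hypothetical sketch of the supervision heatmap H: for every labeled box a 2-D
# Gaussian centred on the box centre is drawn into the channel of its class,
# keeping the element-wise maximum.
import numpy as np

def build_heatmap(boxes, classes, num_classes, out_h, out_w, stride=4, alpha=0.54):
    """boxes: (M, 4) array of (x1, y1, x2, y2) at image scale; classes: (M,) class indices."""
    heat = np.zeros((num_classes, out_h, out_w), dtype=np.float32)
    ys, xs = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    for (x1, y1, x2, y2), c in zip(boxes, classes):
        # linearly map the label box to the feature-map scale
        x1, y1, x2, y2 = (v / stride for v in (x1, y1, x2, y2))
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)      # floor keeps the centre on a pixel
        w, h = x2 - x1, y2 - y1
        sx = max(alpha * w / 6.0, 1e-3)                      # assumed relation between alpha and sigma
        sy = max(alpha * h / 6.0, 1e-3)
        g = np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) + ((ys - cy) ** 2) / (2 * sy ** 2)))
        heat[int(c)] = np.maximum(heat[int(c)], g)           # element-wise maximum into channel c_m
    return heat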
The peak of the Gaussian distribution, i.e., the pixel at the center of the box, is considered as a positive sample, while any other pixel is considered as a negative sample. Focal Loss is adopted to address the imbalance in the number of positive and negative samples.
Given the predicted heatmap Ĥ and the localization target H, the loss takes the form shown below:
L_loc = -(1/M) Σ_ijc f_ijc, with
f_ijc = (1 - Ĥ_ijc)^α_f · log(Ĥ_ijc)                        if H_ijc = 1
f_ijc = (1 - H_ijc)^β_f · (Ĥ_ijc)^α_f · log(1 - Ĥ_ijc)      otherwise
where α_f and β_f are hyper-parameters and M represents the number of label boxes; in this embodiment, α_f = 2 and β_f = 4 are set. Ĥ_ijc denotes an element of the predicted feature map Ĥ, c denotes the channel index, (i, j) denotes the spatial position, and H_ijc denotes the corresponding element, i.e., label value, of the localization target H.
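The centre-localization loss described above could then be computed as in the following sketch, assuming the Gaussian-penalized focal loss form with alpha_f = 2 and beta_f = 4.

# Hypothetical sketch of the centre-localization loss L_loc: a Gaussian-penalised
# focal loss over the predicted heatmap, with the box centres (H == 1) as
# positives and every other pixel as a negative.
import torch

def focal_loss_loc(pred, target, alpha_f=2.0, beta_f=4.0, eps=1e-6):
    """pred, target: (B, C, H/r, W/r); target holds the Gaussian heatmap H."""
    pred = pred.clamp(eps, 1.0 - eps)
    pos = (target == 1.0).float()
    neg = 1.0 - pos
    pos_loss = pos * ((1.0 - pred) ** alpha_f) * torch.log(pred)
    neg_loss = neg * ((1.0 - target) ** beta_f) * (pred ** alpha_f) * torch.log(1.0 - pred)
    num_boxes = pos.sum().clamp(min=1.0)                  # M, the number of label boxes
    return -(pos_loss + neg_loss).sum() / num_boxes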
(4): thermodynamic diagrams H and features
Figure BDA0003382451610000083
For the size regression task, calculating the effectiveness of the prediction frame by overlapping the positions of the prediction frame and the real frame to obtain the loss Lreg
For size regression, given the m-th label box on the scale of the feature map, another Gaussian kernel is used to generate S_m, whose kernel size is determined by the parameter β. Note that when α and β are the same, the same kernel can be used to save computation. The non-zero Gaussian region of S_m is denoted A_m. Since A_m always lies inside the m-th label box, it is also called the sub-region in the rest of the embodiments of the present invention.
Each pixel point in the sub-region is considered as a regression sample. Given the region A_m and the output stride r, the regression target at pixel (i, j) is defined as the distances from (ir, jr) to the four sides of the m-th label box, expressed as a four-dimensional vector (w_l, h_t, w_r, h_b), where w_l and w_r are the distances to the left and right sides and h_t and h_b are the distances to the top and bottom sides. The prediction box at pixel point (i, j) can then be represented as
b̂_ij = (ir - ŵ_l·s, jr - ĥ_t·s, ir + ŵ_r·s, jr + ĥ_b·s)
where s is a fixed scalar used to scale up the prediction results to ease optimization; in the present embodiment, s = 16 is set. Note that the prediction box b̂_ij is at the image scale rather than the feature map scale, i.e., the prediction box is located by the two vertices on a diagonal of the rectangle; ŵ_l and ŵ_r denote the predicted values of w_l and w_r, and ĥ_t and ĥ_b denote the predicted values of h_t and h_b.
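Decoding prediction boxes from the four regressed side distances, as described above, might look like the sketch below; the (w_l, h_t, w_r, h_b) channel order and the tensor layout are assumptions of this sketch.

# Hypothetical sketch of box decoding in step (4): at every pixel (i, j) of the
# size-regression map the four predicted side distances are scaled by s and turned
# into an (x1, y1, x2, y2) box at image scale.
import torch

def decode_boxes(size_pred, stride=4, s=16.0):
    """size_pred: (B, 4, h, w) holding the (w_l, h_t, w_r, h_b) predictions."""
    b, _, h, w = size_pred.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(size_pred) * stride                        # pixel positions at image scale
    ys = ys.to(size_pred) * stride
    wl, ht, wr, hb = (size_pred[:, k] * s for k in range(4))
    x1, y1 = xs - wl, ys - ht
    x2, y2 = xs + wr, ys + hb
    return torch.stack([x1, y1, x2, y2], dim=1)           # (B, 4, h, w) decoded boxes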
If a pixel is not contained by any sub-region, it is ignored during training. If a pixel is contained in multiple sub-regions, it is an ambiguous sample whose training target is set to be the target with smaller area.
Given the predicted size map Ŝ and the regression target S, the training targets are collected from S and the corresponding prediction results are collected from Ŝ, where N_reg denotes the number of regression samples. For all samples, the prediction box and the corresponding label box are decoded as above, and the overlap GIoU between the positions of the prediction box and the ground-truth box is used as the optimization target, as below:
L_reg = (1/N_reg) · Σ_{(i,j)∈A_m} GIoU(b̂_ij, B_m) × W_ij
where b̂_ij represents the decoded prediction box, B_m is the m-th label box scaled to the image, and W_ij is the sample weight used to balance the loss contributed by each sample.
Due to the scale variation of targets, a large target (with a size larger than a specified size) may generate thousands of regression samples, while a small target may generate only a few. After the losses of all samples are normalized, the losses caused by small targets would be almost negligible, which would impair the detection performance on small targets. Thus, the sample weight W_ij plays an important role in balancing the loss. Assuming that (i, j) lies inside the sub-region A_m of the m-th annotated box, there is:
W_ij = log(a_m) · G_m(i, j) / Σ_{(x,y)∈A_m} G_m(x, y)
where G_m(i, j) is the Gaussian probability at (i, j), G_m(x, y) denotes the Gaussian probability at (x, y), and a_m is the area of the m-th detection box. This treatment makes full use of the richer annotation information contained in large targets while preserving the annotation information of small targets. It also emphasizes the samples near the center of the target, reducing the influence of blurred and low-quality samples.
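The sample weights W_ij and the GIoU-based regression objective could be combined as in the sketch below; normalising the Gaussian within each sub-region and scaling by log(a_m) follow the description above, torchvision's generalized_box_iou is used as a stand-in for the GIoU computation, and writing the objective as minimising 1 - GIoU is an assumption of this sketch.

# Hypothetical sketch of the sample weights W_ij and the size-regression loss L_reg.
import torch
from torchvision.ops import generalized_box_iou

def regression_loss(decoded, gauss, gt_boxes, assignment):
    """
    decoded:    (N, 4) prediction boxes (x1, y1, x2, y2) gathered at regression pixels
    gauss:      (N,)   Gaussian probability G_m(i, j) at those pixels
    gt_boxes:   (M, 4) label boxes at image scale
    assignment: (N,)   index m of the label box whose sub-region contains each pixel
    """
    target = gt_boxes[assignment]                                     # B_m per sample
    area = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    # W_ij: Gaussian probability normalised inside each sub-region, scaled by log(a_m)
    # so that large boxes do not drown out small ones.
    weights = torch.zeros_like(gauss)
    for m in assignment.unique():
        mask = assignment == m
        weights[mask] = gauss[mask] / gauss[mask].sum().clamp(min=1e-6)
    weights = weights * torch.log(area.clamp(min=1.0))
    giou = generalized_box_iou(decoded, target).diagonal()            # per-sample GIoU
    n_reg = max(decoded.shape[0], 1)
    return ((1.0 - giou) * weights).sum() / n_reg                     # weighted GIoU objective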
Finally, the reconstruction losses L_rc1 and L_rc2, the regularization term L_p, the center localization loss L_loc, and the size regression loss L_reg are used as inputs to calculate the total loss L of small target detection; the network weights are optimized according to the total loss L, and a balance between speed and precision is achieved after the optimization is completed.
Specifically, the formula for the total loss L is:
L = λL_r + μL_loc + ηL_reg
where λ, μ, η are the weighting factors of the super-resolution reconstruction loss, the center localization loss, and the size regression loss, respectively.
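The total objective can then be assembled as in the short sketch below; the default weighting factors are illustrative only, since the embodiment does not fix λ, μ and η.

# Hypothetical assembly of the total loss L = lambda*L_r + mu*L_loc + eta*L_reg.
def total_loss(l_r, l_loc, l_reg, lam=1.0, mu=1.0, eta=1.0):
    return lam * l_r + mu * l_loc + eta * l_reg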
Aiming at the problem of insufficient precision caused by the insufficient features available for small target detection in most current detectors, the embodiment of the invention provides a small target detection method based on super-resolution multi-scale feature fusion. First, a feature-level super-resolution technique enhances the semantic information of the low-resolution input; then multi-scale image feature fusion is realized by means of the feature pyramid structure, so that the semantic information of small target objects is prevented from being lost. An attention mechanism makes the feature extractor focus on extracting the features that identify the class to which the object belongs. Finally, center localization and size regression are performed using the extracted features to achieve target detection. The invention achieves short training time, fast inference, and high precision in small target detection, and has an industry-leading small target detection effect.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the inventive concept thereof, and these changes and modifications can be made without departing from the spirit and scope of the invention.

Claims (6)

1. A small target detection method based on super-resolution multi-scale feature fusion is characterized by comprising the following steps:
network model configuration and training:
acquiring a high-resolution image pair and a low-resolution image pair as training images to obtain a training image set;
configuring a network model, comprising: an encoder-decoder network for high-resolution images, a feature extractor G_L for low-resolution images, a generator G, a feature fusion network, and an image target detection network;
the encoder part of the encoder-decoder network is denoted as encoder G_H and the decoder part as decoder D_H; the encoder G_H comprises a plurality of convolution layers and pooling layers arranged in an alternating structure; the decoder D_H comprises a plurality of deconvolution layers whose number corresponds to the number of convolution layers of the encoder G_H, with corresponding feature dimensions and sizes;
the low-resolution image LR in each high/low-resolution image pair is input into the feature extractor G_L, and the feature f_L is obtained based on its output; the high-resolution image HR of the pair is input into the encoder G_H, and the feature f_H is obtained based on its output; the loss function used in encoder-decoder network training is:
L_rc1 = ||HR' - HR||_2^2
where HR' represents the output of the decoder D_H;
the feature extractor GLThe method comprises a multi-layer feature extraction block, wherein the feature extraction block consists of a multi-scale feature fusion network and local residual learning;
the input of the generator G is: performing data enhancement processing on the low-resolution image LR to obtain an image LR ', and disturbing the image LR' and randomly generated noise
Figure FDA0003382451600000013
As input to generator G; the output of the generator G is recorded as the superposition p, and the loss function adopted by the generator G during training is as follows: l isp=||p||;
the output of the generator G is superposed with the output of the feature extractor G_L to obtain a first reconstruction feature, which is input into the decoder D_H; the output of each deconvolution layer of the decoder D_H is used as an input of the feature fusion network, which samples the input feature maps of different sizes to the same size and superposes them, and the superposition result is then input into the image target detection network;
the image target detection network comprises a classification branch and a positioning branch, and the classification branch of the image target detection network classifies targets based on an attention mechanism when the classification branch of the image target detection network classifies the targets;
the total loss adopted during the training of the configured network model is as follows: l ═ λ Lr+μLloc+ηLregWherein L isrRepresents a loss of super-resolution reconstruction, and Lr=Lrc1+Lrc2+Lp,Lrc2Represents the first reconstruction loss as:
Figure FDA0003382451600000012
Lloc、Lregrespectively representing the classification loss of the classification branch of the image target detection network and the positioning loss of the positioning branch, wherein lambda, mu and eta are respectively the loss Lr、LlocAnd LregThe weighting factor of (1);
a step of detecting a low-resolution image to be identified:
inputting the low-resolution image to be recognized into the feature extractor G_L, and obtaining a first feature map of the low-resolution image to be identified based on the output of the feature extractor G_L;
after data enhancement processing is carried out on the low-resolution image, it is superposed with a randomly generated noise disturbance and input into the generator G, and the superposition amount is obtained based on the output of the generator G; the superposition result of the first feature map and the superposition amount is taken as the first reconstruction feature of the low-resolution image to be identified;
inputting the first reconstruction feature into the decoder D_H, generating second reconstruction features of different sizes based on the output of each deconvolution layer of the decoder D_H, and inputting them into the feature fusion network;
the feature fusion network samples all the second reconstruction features to the same size for superposition to obtain third reconstruction features and inputs the third reconstruction features into the image target detection network;
and obtaining the category of the small target and the position of the detection frame thereof based on the output of the image target detection network.
2. The method of claim 1, wherein, in the feature extractor G_L, the network structure of the feature extraction block comprises two parallel branches: one branch comprises two sequentially connected first convolution blocks, each first convolution block consisting of a 5×5 convolution layer and a ReLU layer; the other branch comprises two sequentially connected second convolution blocks, each second convolution block consisting of a 3×3 convolution layer and a ReLU layer; the output of the first first convolution block is also connected into the second second convolution block, the output of the first second convolution block is also connected into the second first convolution block, and the outputs of the two branches are merged into a convolution layer with a 1×1 convolution kernel.
3. The method of claim 2, wherein the number of layers of the feature extraction block is 5.
4. The method according to claim 1, wherein, when the classification branch of the image target detection network performs target classification based on the attention mechanism, a compression operation is first performed on the superposition result output by the feature fusion network to obtain the global feature d' at the channel level, where the number of channels C is equal to the number of categories to be identified; then, based on the weights W_c of the different channels, the final class-related feature map Ĉ is obtained.
5. The method of claim 4, wherein the training treats the feature classification of each channel of the classification branch as a binary classification problem, computing a binary cross-entropy penalty for each channel.
6. The method of claim 1, wherein the validity of the prediction box is calculated using the overlap of the positions of the prediction box and the ground-truth box, resulting in the loss L_reg:
L_reg = (1/N_reg) · Σ_{(i,j)∈A_m} GIoU(b̂_ij, B_m) × W_ij
where N_reg indicates the number of samples of the positioning branch, b̂_ij represents the prediction box output by the positioning branch, B_m represents the m-th label box scaled to the image, (i, j) represents the spatial position of a pixel point, A_m represents the sub-region of the given m-th annotated box, and W_ij represents the sample weight:
W_ij = log(a_m) · G_m(i, j) / Σ_{(x,y)∈A_m} G_m(x, y)
where G_m(i, j) denotes the Gaussian probability at (i, j), G_m(x, y) denotes the Gaussian probability at (x, y), and a_m indicates the area of the m-th annotated box.
CN202111473712.1A 2021-11-30 2021-11-30 Small target detection method based on super-resolution multi-scale feature fusion Active CN114241274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111473712.1A CN114241274B (en) 2021-11-30 2021-11-30 Small target detection method based on super-resolution multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111473712.1A CN114241274B (en) 2021-11-30 2021-11-30 Small target detection method based on super-resolution multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN114241274A true CN114241274A (en) 2022-03-25
CN114241274B CN114241274B (en) 2023-04-07

Family

ID=80753196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111473712.1A Active CN114241274B (en) 2021-11-30 2021-11-30 Small target detection method based on super-resolution multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN114241274B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546473A (en) * 2022-12-01 2022-12-30 珠海亿智电子科技有限公司 Target detection method, apparatus, device and medium
CN116309274A (en) * 2022-12-12 2023-06-23 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN116309431A (en) * 2023-03-14 2023-06-23 中国人民解放军空军军医大学 Visual interpretation method based on medical image
CN117542105A (en) * 2024-01-09 2024-02-09 江西师范大学 Facial super-resolution and expression recognition method for low-resolution images under classroom teaching
CN117576488A (en) * 2024-01-17 2024-02-20 海豚乐智科技(成都)有限责任公司 Infrared dim target detection method based on target image reconstruction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
CN108564109A (en) * 2018-03-21 2018-09-21 天津大学 A kind of Remote Sensing Target detection method based on deep learning
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110009679A (en) * 2019-02-28 2019-07-12 江南大学 A kind of object localization method based on Analysis On Multi-scale Features convolutional neural networks
CN112183203A (en) * 2020-08-26 2021-01-05 北京工业大学 Real-time traffic sign detection method based on multi-scale pixel feature fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
CN108564109A (en) * 2018-03-21 2018-09-21 天津大学 A kind of Remote Sensing Target detection method based on deep learning
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110009679A (en) * 2019-02-28 2019-07-12 江南大学 A kind of object localization method based on Analysis On Multi-scale Features convolutional neural networks
CN112183203A (en) * 2020-08-26 2021-01-05 北京工业大学 Real-time traffic sign detection method based on multi-scale pixel feature fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YALAN YE et al.: "End-to-end versatile human activity recognition with activity image transfer learning" *
刘颖; 刘红燕; 范九伦; 公衍超; 李莹华; 王富平; 卢津: "基于深度学习的小目标检测研究与应用综述 (A survey of research and applications of small object detection based on deep learning)" *
李希; 徐翔; 李军: "面向航空飞行安全的遥感图像小目标检测 (Small object detection in remote sensing images for aviation flight safety)" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546473A (en) * 2022-12-01 2022-12-30 珠海亿智电子科技有限公司 Target detection method, apparatus, device and medium
CN116309274A (en) * 2022-12-12 2023-06-23 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN116309274B (en) * 2022-12-12 2024-01-30 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN116309431A (en) * 2023-03-14 2023-06-23 中国人民解放军空军军医大学 Visual interpretation method based on medical image
CN116309431B (en) * 2023-03-14 2023-10-27 中国人民解放军空军军医大学 Visual interpretation method based on medical image
CN117542105A (en) * 2024-01-09 2024-02-09 江西师范大学 Facial super-resolution and expression recognition method for low-resolution images under classroom teaching
CN117576488A (en) * 2024-01-17 2024-02-20 海豚乐智科技(成都)有限责任公司 Infrared dim target detection method based on target image reconstruction
CN117576488B (en) * 2024-01-17 2024-04-05 海豚乐智科技(成都)有限责任公司 Infrared dim target detection method based on target image reconstruction

Also Published As

Publication number Publication date
CN114241274B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN114241274B (en) Small target detection method based on super-resolution multi-scale feature fusion
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
CN111563507A (en) Indoor scene semantic segmentation method based on convolutional neural network
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
Li et al. Toward in situ zooplankton detection with a densely connected YOLOV3 model
Han et al. L-Net: lightweight and fast object detector-based ShuffleNetV2
CN112270366A (en) Micro target detection method based on self-adaptive multi-feature fusion
Fan et al. A novel sonar target detection and classification algorithm
Li et al. A survey on deep-learning-based real-time SAR ship detection
Irfan et al. A novel feature extraction model to enhance underwater image classification
Yang et al. Side-scan sonar image segmentation based on multi-channel CNN for AUV navigation
Patel et al. A novel approach for semantic segmentation of automatic road network extractions from remote sensing images by modified UNet
Mehran et al. An effective deep learning model for ship detection from satellite images
Li et al. Adaptive fusion nestedUNet for change detection using optical remote sensing images
CN116935044A (en) Endoscopic polyp segmentation method with multi-scale guidance and multi-level supervision
CN115186804A (en) Encoder-decoder network structure and point cloud data classification and segmentation method adopting same
Liu et al. Learning to refine object contours with a top-down fully convolutional encoder-decoder network
Aung et al. Multitask learning via pseudo-label generation and ensemble prediction for parasitic egg cell detection: IEEE ICIP Challenge 2022
Wang et al. Dunhuang mural line drawing based on multi-scale feature fusion and sharp edge learning
Wang et al. A Novel Neural Network Based on Transformer for Polyp Image Segmentation
Burugupalli Image classification using transfer learning and convolution neural networks
Liu et al. A Novel Improved Mask RCNN for Multiple Targets Detection in the Indoor Complex Scenes
Haque et al. Multi scale object detection based on single shot multibox detector with feature fusion and inception network
Huang et al. Under water object detection based on convolution neural network
CN116051984B (en) Weak and small target detection method based on Transformer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant