CN114241274A - Small target detection method based on super-resolution multi-scale feature fusion - Google Patents

Small target detection method based on super-resolution multi-scale feature fusion

Info

Publication number
CN114241274A
Authority
CN
China
Prior art keywords
feature
image
network
target detection
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111473712.1A
Other languages
Chinese (zh)
Other versions
CN114241274B (en)
Inventor
徐洁
叶娅兰
刘紫奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202111473712.1A priority Critical patent/CN114241274B/en
Publication of CN114241274A publication Critical patent/CN114241274A/en
Application granted granted Critical
Publication of CN114241274B publication Critical patent/CN114241274B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention discloses a small target detection method based on super-resolution multi-scale feature fusion, and belongs to the technical field of image processing. A low-resolution image to be identified is input into a feature extractor to obtain a first feature map; the low-resolution image is subjected to data enhancement, superposed with a noise disturbance, and input into a generator to obtain a superposition amount. The superposition result of the first feature map and the superposition amount is taken as a first reconstruction feature and input into a decoder to obtain second reconstruction features of different sizes, which are input into a feature fusion network; the feature fusion network samples all the second reconstruction features to the same size and superposes them to obtain a third reconstruction feature, which is input into the image target detection network; the category of the small target and the position of its detection frame are obtained based on the output of the image target detection network. The invention achieves short training time, fast inference and high precision in small target detection, and has an industry-leading small target detection effect.

Description

Small target detection method based on super-resolution multi-scale feature fusion
Technical Field
The invention relates to the technical field of image processing, in particular to a small target detection method based on super-resolution multi-scale feature fusion.
Background
Target detection is a hot direction of computer vision and digital image processing, and is widely applied in fields such as robot navigation, intelligent video surveillance, industrial inspection, and aerospace. With the trend toward intelligence in various fields, realizing target detection is of great practical significance for reducing labor costs. Small target detection is a crucial link in downstream tasks of target detection. For example, detecting small or distant objects in a high-resolution scene captured from a vehicle is a necessary condition for safely deploying automated driving; likewise, in satellite image analysis, it is important to efficiently annotate objects such as cars, ships, and houses. Small target detection is therefore receiving increasing attention.
With the recent progress of deep learning, target detection has made great progress in both performance and speed. Currently, some of the most advanced target detectors have achieved extremely high precision on large and medium size targets, which can meet the requirements of many practical applications. These detectors usually do not distinguish between small targets and medium-to-large targets, and both are processed and identified in the same way. However, they neglect the common difficulties of the small target itself, such as low resolution, blurred appearance, little information, and much noise; as a result, when these methods are used for detecting small targets, the average accuracy obtained may be only half of that achieved on medium and large size targets.
In order to improve the accuracy of small target detection, researchers first tried to adjust the feature extraction stage of general detectors, hoping to solve the problem of the low resolution of small target features. For example, some methods reduce the compression ratio of image data processing so that small objects have higher resolution in the extracted features. However, these methods do not take into account that the resolution of many target detection datasets is itself not high, and that small target features already suffer from low resolution and little information before extraction.
In recent years, some researchers have chosen to design detectors specifically for small target objects. They found that shallow features are more beneficial for distinguishing small target objects, and chose to extract features directly from shallow convolutions to improve the detection accuracy of small targets. This relieves the problem of insufficient feature information for small targets to a certain extent. However, such detectors lose a relatively large amount of the semantic information of the image, and generalize poorly in general target detection involving large-sized objects.
Furthermore, most existing small target detectors use general target detection datasets. Most of the data in these datasets are medium and large objects, and only a few images contain small target objects, so for much of the training the detection model cannot learn the characteristics of small targets. At the same time, small target objects cover a much smaller area than large targets, which causes an imbalance with few small-target matches and many large-target matches, so that even specialized small target detectors still pay more attention to large and medium sized objects.
Disclosure of Invention
The invention provides a small target detection method based on super-resolution multi-scale feature fusion, which is used for solving the problem of low resolution of a small target object so as to improve the detection performance of the small target during image target detection processing.
The technical scheme adopted by the invention is as follows:
a small target detection method based on super-resolution multi-scale feature fusion comprises the following steps:
network model configuration and training:
acquiring pairs of high-resolution and low-resolution images as training images to obtain a training image set;
configuring a network model, comprising: an encoder-decoder network for high-resolution images, a feature extractor G_L for low-resolution images, a generator G, a feature fusion network, and an image target detection network;
the encoder part of the encoder-decoder network is denoted as encoder G_H and the decoder part as decoder D_H; the encoder G_H comprises a plurality of convolution layers and pooling layers arranged in an alternating structure; the decoder D_H comprises a plurality of deconvolution layers whose number corresponds to the number of convolution layers of the encoder G_H, with corresponding feature dimensions and sizes;
the low-resolution image LR in each high/low-resolution image pair is input into the feature extractor G_L, and the feature f_L is obtained based on its output; the high-resolution image HR of the pair is input into the encoder G_H, and the feature f_H is obtained based on its output; the loss function used in encoder-decoder network training is:
L_rc1 = ||HR' - HR||_2^2
where HR' represents the output of the decoder D_H;
the feature extractor G_L comprises multiple feature extraction blocks, each of which consists of a multi-scale feature fusion network and local residual learning;
the input of the generator G is: the image LR' obtained by performing data enhancement processing on the low-resolution image LR, superposed with a randomly generated noise disturbance; the output of the generator G is recorded as the superposition amount p, and the loss function adopted by the generator G during training is: L_p = ||p||;
The output of the generator G is superposed with the output of the feature extractor G_L to obtain a first reconstruction feature, which is input into the decoder D_H; the output of each deconvolution layer of the decoder D_H is used as an input of the feature fusion network, which samples the input feature maps of different sizes to the same size and superposes them, and the superposition result is then input into the image target detection network;
the image target detection network comprises a classification branch and a positioning branch, and the classification branch of the image target detection network classifies targets based on an attention mechanism when the classification branch of the image target detection network classifies the targets;
the total loss adopted during the training of the configured network model is as follows: l ═ λLr+μLlocLregWherein L isrRepresents a loss of super-resolution reconstruction, and Lr=Lrc1+Lrc2+Lp,Lrc2Represents the first reconstruction loss as:
Figure BDA0003382451610000022
Lloc、Lregrespectively representing the classification loss of the classification branch of the image target detection network and the positioning loss (i.e. regression loss) of the positioning branch, wherein lambda, mu and eta are respectively the loss Lr、LlocAnd LregThe weighting factor of (1);
a step of detecting a low-resolution image to be identified:
inputting the low-resolution image to be recognized into the feature extractor G_L, and obtaining a first feature map of the low-resolution image to be recognized based on the output of the feature extractor G_L;
after data enhancement processing is carried out on the low-resolution image, it is superposed with a randomly generated noise disturbance and input into the generator G, and the superposition amount is obtained based on the output of the generator G; the superposition result of the first feature map and the superposition amount is taken as the first reconstruction feature of the low-resolution image to be identified;
inputting the first reconstruction feature into the decoder D_H, generating second reconstruction features of different sizes based on the output of each deconvolution layer of the decoder D_H, and inputting them into the feature fusion network;
the feature fusion network samples all the second reconstruction features to the same size for superposition to obtain third reconstruction features and inputs the third reconstruction features into the image target detection network;
and obtaining the category of the small target and the position of the detection frame thereof based on the output of the image target detection network.
The technical scheme provided by the invention at least has the following beneficial effects:
compared with traditional small target detection approaches, the detection method of the invention maintains state-of-the-art real-time detection performance while balancing training time, inference time, and detection precision for small target detection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is an xx diagram of a xxx method provided by embodiments of the present invention;
FIG. 2 is an xx diagram of a xxx method provided by embodiments of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
In order to solve the problem of insufficient precision caused by insufficient features of small targets (i.e., detected targets smaller than a specified size in the image to be detected), the embodiment of the invention applies a super-resolution technique at the feature level, improves the semantic information of deep features by combining a feature fusion technique, and improves the detection performance of target detection by using an attention mechanism.
Referring to fig. 1 and fig. 2, the method for detecting a small target based on super-resolution multi-scale feature fusion provided by the embodiment of the invention comprises:
inputting a low-resolution image to be recognized into a configured feature extractor G_L, and obtaining a first feature map of the low-resolution image to be recognized based on the output of the feature extractor G_L; after data enhancement processing is carried out on the low-resolution image, superposing it with a randomly generated noise disturbance and inputting it into a configured generator G, and obtaining the superposition amount based on the output of the generator G; taking the superposition result of the first feature map and the superposition amount as the first reconstruction feature of the low-resolution image to be identified;
inputting the first reconstruction feature into the configured decoder D_H, which sequentially generates second reconstruction features of different sizes, and sampling all the second reconstruction features to the same size and superposing them to obtain a third reconstruction feature; the decoder D_H includes a plurality of deconvolution layers, each of which outputs a second reconstruction feature of one size;
inputting the third reconstruction feature into a configured image target detection network to perform target detection on the small target, wherein the image target detection network comprises a classification branch and a positioning branch; the category of the small target and the position of its detection frame are obtained based on the output of the image target detection network, and the classification branch performs target classification based on an attention mechanism.
The specific implementation of the decoder D_H, the feature extractor G_L, the generator G, the third reconstruction feature, and the image target detection network comprises:
(1) Realizing the conversion of the image from low resolution to high resolution at the feature level, so as to enhance the semantic information of the subsequent low-resolution input. The low-resolution image LR and the high-resolution image HR of each image pair are taken as network inputs, and the corresponding features f_L and f_H are obtained through the different feature extractors G_L and G_H. The generator G obtains the superposition amount p at the feature level that converts the low-resolution image feature f_L into the high-resolution image feature f_H, realizing super-resolution at the feature level; the deep feature f_H of the high-resolution image is restored to the original high-resolution image through the decoder to ensure the validity of the semantic information of the deep feature.
(1-1): taking the high resolution image HR as input to the encoder-decoder part of the network, where GHI.e. the encoder part, the decoder is denoted as DHPerforming convolution pooling for multiple times to obtain deep layer characteristic fH
In the embodiment of the present invention, the encoder-decoder may adopt any conventional network structure, and specifically, the encoder G may be usedHIs set to be 7 layers, and is subjected to convolution pooling by adopting three convolution kernels of 7 multiplied by 7, 5 multiplied by 5 and 3 multiplied by 3 and a pooling kernel of 2 multiplied by 2 to obtain fH. For example, each convolution pooling process is first run through three convolution layers (which may typically include convolution operations, batch normalizationProcessing and activation function mapping), and then through a pooling layer.
(1-2): decoder DHComposed of multiple deconvolution layers, and deep layer characteristics fHAs input to the decoder, the deconvolution layers correspond to the number of convolution layers and the characteristic dimensions and sizes, for fHPerforming dimensionality raising to obtain an output HR'; the HR' and the HR have the same resolution size and the same channel number; i.e. encoder GHHas the function of generating a characteristic image with semantic information, which is then passed through a decoder DHMapping low resolution feature images output by encoder GH back to the size of the input image
(1-3): taking the L2 distance as the reconstruction loss of HR and HR ', optimizing the L2 loss (L2 norm loss function) to make HR' and HR closer, and making the decoder part possess the deep characteristic fHAbility to reconstruct original image, only deep features fHContains the necessary semantic information to ensure that the slave fHReverting to the original image.
Specifically, the reconstruction loss is as follows:
Figure BDA0003382451610000051
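By way of illustration and not limitation, the encoder-decoder of steps (1-1) to (1-3) could be sketched in Python (PyTorch) as below; the three-stage layout, channel widths and use of batch normalization are assumptions of this sketch, since the embodiment only fixes an alternating convolution/pooling encoder, a deconvolution decoder, and the L2 reconstruction loss L_rc1.

# Hypothetical sketch of the encoder G_H / decoder D_H pair and the loss L_rc1.
# Stage count, channel widths and batch normalization are illustrative assumptions.
import torch
import torch.nn as nn

class EncoderGH(nn.Module):
    def __init__(self, in_ch=3, widths=(32, 64, 128), kernels=(7, 5, 3)):
        super().__init__()
        layers, prev = [], in_ch
        for w, k in zip(widths, kernels):
            layers += [nn.Conv2d(prev, w, k, padding=k // 2),
                       nn.BatchNorm2d(w), nn.ReLU(inplace=True),
                       nn.MaxPool2d(2)]                    # alternating convolution and 2x2 pooling
            prev = w
        self.body = nn.Sequential(*layers)

    def forward(self, hr):                                 # hr: (B, 3, H, W)
        return self.body(hr)                               # deep feature f_H: (B, 128, H/8, W/8)

class DecoderDH(nn.Module):
    def __init__(self, out_ch=3, widths=(128, 64, 32)):
        super().__init__()
        layers = []
        for i, w in enumerate(widths):
            nxt = widths[i + 1] if i + 1 < len(widths) else out_ch
            layers += [nn.ConvTranspose2d(w, nxt, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True) if nxt != out_ch else nn.Identity()]
        self.body = nn.Sequential(*layers)

    def forward(self, f_h):                                # f_h: (B, 128, H/8, W/8)
        return self.body(f_h)                              # HR': (B, 3, H, W), same size as HR

def l_rc1(hr_pred, hr):
    # L_rc1 = ||HR' - HR||_2^2, averaged over pixels (the L2 reconstruction loss of step (1-3))
    return torch.mean((hr_pred - hr) ** 2)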
(1-4): using the low resolution image LR as the feature extractor GLThe feature f is obtained by multi-scale feature fusion and local residual error learningL
Specifically, the feature extractor GLThe number of the characteristic layers is set to be 5, each layer is formed by multi-scale characteristic fusion and local residual error learning, and image characteristics of different scales can be obtained, so that the image characteristics are fully extracted.
In the i-th layer, M_{i-1} is used as the input of the next multi-scale residual block to obtain its output M_i; this is repeated until M_n is obtained. In the embodiment of the present invention, each layer includes three convolutional layers.
M_{i-1} is taken as the input of the first convolutional layer, and 3×3 and 5×5 convolutions followed by the ReLU function produce the outputs S_1 and P_1, respectively. S_1 and P_1 are concatenated as the input of the second convolutional layer, and 3×3 and 5×5 convolutions followed by the ReLU function produce the outputs S_2 and P_2, respectively. S_2 and P_2 are concatenated as the input of the third convolutional layer, and a 1×1 convolution produces the output S'. M_{i-1} is connected to the output through a residual connection and combined with S' to obtain the final output M_i.
All the outputs M_0 to M_n are used as the input of the hierarchical feature fusion structure to obtain the extracted feature M_5.
All inputs of the hierarchical feature fusion structure are concatenated, and the fused feature channels are compressed to the required number of channels with a 1×1 convolution to obtain the extracted feature M_5, i.e., the feature f_L.
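Under the naming assumptions below, the feature extraction block of G_L described in step (1-4) could be sketched as follows; the channel count and the stem convolution are assumptions of this sketch rather than features fixed by the embodiment.

# Hypothetical sketch of one multi-scale residual block of the feature extractor G_L,
# following step (1-4): parallel 3x3 / 5x5 branches, concatenated intermediate outputs,
# 1x1 fusion, and a local residual connection, plus hierarchical feature fusion.
import torch
import torch.nn as nn

class MultiScaleResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        c = channels
        self.conv3_1 = nn.Conv2d(c, c, 3, padding=1)
        self.conv5_1 = nn.Conv2d(c, c, 5, padding=2)
        self.conv3_2 = nn.Conv2d(2 * c, c, 3, padding=1)   # takes the concatenation [S1, P1]
        self.conv5_2 = nn.Conv2d(2 * c, c, 5, padding=2)   # takes the concatenation [S1, P1]
        self.fuse = nn.Conv2d(2 * c, c, 1)                 # 1x1 fusion of [S2, P2]
        self.relu = nn.ReLU(inplace=True)

    def forward(self, m_prev):
        s1 = self.relu(self.conv3_1(m_prev))
        p1 = self.relu(self.conv5_1(m_prev))
        cat1 = torch.cat([s1, p1], dim=1)
        s2 = self.relu(self.conv3_2(cat1))
        p2 = self.relu(self.conv5_2(cat1))
        s_prime = self.fuse(torch.cat([s2, p2], dim=1))
        return m_prev + s_prime                            # local residual learning

class FeatureExtractorGL(nn.Module):
    """Stacks n blocks and fuses all intermediate outputs with a 1x1 convolution."""
    def __init__(self, in_ch=3, channels=64, n_blocks=5):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, channels, 3, padding=1)
        self.blocks = nn.ModuleList([MultiScaleResidualBlock(channels) for _ in range(n_blocks)])
        self.hff = nn.Conv2d(channels * (n_blocks + 1), channels, 1)  # hierarchical feature fusion

    def forward(self, lr):
        m = self.stem(lr)
        outs = [m]                                         # M_0
        for blk in self.blocks:
            m = blk(m)
            outs.append(m)                                 # M_1 ... M_n
        return self.hff(torch.cat(outs, dim=1))            # extracted feature f_L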
(1-5): LR' is obtained by data enhancement processing of LR, and a noise disturbance is randomly generated; the superposition of LR' and the noise disturbance is used as the input of the generator G to obtain the superposition amount p, and an L1 regularization term on p is calculated to ensure the sparsity of p.
Specifically, data enhancement generally improves the display quality of a quantized coarse image by adjusting or varying the amplitude values of the image. Dithering can eliminate part of the false contours produced by too few gray levels, and the effect is more obvious when the superimposed dither value is larger. However, the superposition of dither values also introduces noise into the image, and the larger the dither value, the larger the noise influence. Dithering is generally achieved by adding a small random noise d(x, y) to the original image f(x, y); the value of d(x, y) is generally not linked to f(x, y) in any regular way. Color dithering and the addition of noise data improve the generalization capability and robustness of the trained model.
The regularization term is as follows:
L_p = ||p||_1
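A minimal sketch of the generator G and the sparsity term L_p of step (1-5) is given below; the network depth, channel width and noise amplitude are assumptions, and the sketch presumes that f_L keeps the spatial size of LR so that p can be superposed on it directly.

# Hypothetical sketch of the generator G: it maps the augmented low-resolution
# image LR' plus a random noise disturbance to a superposition amount p with the
# same shape as the feature f_L. Depth, width and noise scale are assumptions.
import torch
import torch.nn as nn

class GeneratorG(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )

    def forward(self, lr_aug, noise_std=0.05):
        noise = noise_std * torch.randn_like(lr_aug)   # randomly generated noise disturbance
        return self.body(lr_aug + noise)               # superposition amount p

def l_p(p):
    # L1 regularization term enforcing the sparsity of the superposition amount p
    return p.abs().mean()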
(1-6): will f isLThe result of superposition of p
Figure BDA0003382451610000052
As a reconstruction feature, calculate
Figure BDA0003382451610000053
And fHThe L2 distance of (a) is taken as a reconstruction loss, let GLAnd G possesses the ability to increase image resolution at the feature level.
Specifically, the reconstruction loss is as follows:
L_rc2 = ||(f_L + p) - f_H||_2^2
The overall loss of the feature-level super-resolution part is thus:
L_r = L_rc1 + L_rc2 + L_p
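Assembling steps (1-1) to (1-6), the feature-level super-resolution loss L_r could be computed as in the following sketch, which is a hedged composition of the quantities defined above rather than the exact training code of the embodiment.

# Hypothetical assembly of L_r = L_rc1 + L_rc2 + L_p from the quantities defined above.
import torch

def super_resolution_loss(f_l, p, f_h, hr_pred, hr):
    l_rc1 = torch.mean((hr_pred - hr) ** 2)        # decoder reconstruction of the original HR image
    l_rc2 = torch.mean(((f_l + p) - f_h) ** 2)     # feature-level reconstruction toward f_H
    l_p = p.abs().mean()                           # L1 sparsity of the superposition amount p
    return l_rc1 + l_rc2 + l_p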
(2): by reconstruction features
Figure BDA0003382451610000061
And a decoder DH generates depth features of different scales, and retains semantic information of small targets in different feature layers through multi-scale feature fusion. Generating a class-dependent feature map
Figure BDA0003382451610000062
The attention mechanism is utilized to increase the loss specific gravity of the interested target so as to improve the performance of target detection.
In particular, the amount of the solvent to be used,
Figure BDA0003382451610000063
wherein C, H, W, r represents the number of categories, the height and width of the input image, and the output stride, respectively;
(2-1): will be provided with
Figure BDA0003382451610000064
Input to a decoder DHPerforming up-sampling to generate reconstruction features with different sizes in sequenced1、d2、d3、d4、d5Due to DHFinally, the features are restored to the original image, so that the generated features can be regarded as depth features of a super-resolution image, namely the reconstructed features are more than the low-resolution image features fLMore semantic information is contained.
(2-2): feature d is reconstructed1、d2、d3、d4、d5All upsampled to the same size for superposition. Generally, a small target retains more semantic information in a shallow feature, but as the network goes deeper, the semantic information of the small target is gradually lost, and the semantic information of the large target is gradually abstracted to meet the application requirements of the network. Therefore, the semantic information of the small target can be kept while the abstract semantic information of the large target is obtained through the fusion of the features under different levels. Recording the final feature superposition result as d;
Specifically, the feature superposition follows a feature pyramid model that combines multi-level features to address the multi-scale problem; the whole structure is composed of bottom-up down-sampling, top-down up-sampling, and lateral connections. For example, the low-resolution feature map d_1 is up-sampled by a factor of 2 to obtain d'_1, and the two are added, i.e., the up-sampled map is combined with the corresponding bottom-up feature map to obtain an intermediate feature, as in the following formula:
d_1^t = d_1 + d'_1
This process is iterated until the final-resolution map d is generated.
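The upsample-and-add fusion of step (2-2) might be realized as in the sketch below; bilinear interpolation and the assumption that all five reconstruction features share the same channel number are choices of this sketch.

# Hypothetical sketch of the multi-scale feature fusion in step (2-2): every
# second reconstruction feature is upsampled to the size of the finest one and
# the maps are added, as in a feature-pyramid top-down pathway.
import torch
import torch.nn.functional as F

def fuse_pyramid(features):
    """features: list of tensors (B, C, h_i, w_i), ordered from coarse to fine,
    all with the same channel count C."""
    target_size = features[-1].shape[-2:]          # spatial size of the finest map
    fused = 0
    for d in features:
        if d.shape[-2:] != target_size:
            d = F.interpolate(d, size=target_size, mode="bilinear", align_corners=False)
        fused = fused + d                          # element-wise superposition
    return fused                                   # final fused feature d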
(2-3): feature d obtains a class-related feature map through a convolution layer
Figure BDA0003382451610000065
The method comprises C channels corresponding to the number of the classes of the target to be identified, wherein each channel is used for extracting the characteristics of the object corresponding to the class and ignoring the characteristics of other classes. Generating channel weights W using a soft attention mechanismcAnd the loss ratio of the category to be identified is further improved.
Specifically, in the attention mechanism a weighting operation is performed along the channel dimension, which makes the model pay more attention to the channel features carrying the most information, i.e., to the categories of the targets to be identified rather than to other categories. First, the feature obtained by the convolution is compressed to obtain the global feature d' at the channel level, where the number of channels C is equal to the number of categories to be identified; then the relationships among the channels are learned from the global feature to obtain the weights W_c of the different channels; finally, the weights W_c are multiplied channel-wise with the original feature to obtain the final class-related feature map Ĉ.
Next, the feature classification of each channel is regarded as a binary classification problem, namely whether the extracted features belong to the class to be identified or not; a binary cross-entropy loss is calculated for each channel, and the proportion of each channel's loss is balanced by the attention mechanism weights, so that the network finally tends to extract the features of objects of the corresponding class from each specific channel; the optimization target is this channel-wise binary cross-entropy weighted by W_c.
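One possible realization of the soft channel attention and the channel-wise binary cross-entropy of step (2-3) is sketched below along squeeze-and-excitation lines; the reduction ratio and the exact way the weights W_c enter the loss are assumptions of this sketch.

# Hypothetical sketch of the attention-based classification branch: a 1x1 conv maps
# the fused feature d to C class channels, channel weights W_c come from a
# squeeze-style soft attention, and each channel is trained with a binary
# cross-entropy whose contribution is scaled by its weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClassAttentionHead(nn.Module):
    def __init__(self, in_ch=64, num_classes=80, reduction=4):
        super().__init__()
        self.to_classes = nn.Conv2d(in_ch, num_classes, 1)
        self.attn = nn.Sequential(                      # learns the channel weights W_c
            nn.AdaptiveAvgPool2d(1),                    # squeeze: one global value per channel
            nn.Conv2d(num_classes, num_classes // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(num_classes // reduction, num_classes, 1), nn.Sigmoid(),
        )

    def forward(self, d):
        logits = self.to_classes(d)                     # (B, C, H/r, W/r)
        w_c = self.attn(logits)                         # (B, C, 1, 1) channel weights W_c
        c_map = w_c * torch.sigmoid(logits)             # class-related feature map
        return c_map, logits, w_c

def channel_bce_loss(logits, targets, w_c):
    # Per-channel binary cross-entropy whose contribution is balanced by W_c.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (w_c * bce).mean()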
(2-4): similarly, feature d is mapped by convolutional layers
Figure BDA0003382451610000072
It contains 4 channels for the subsequent target size regression task.
In particular, the amount of the solvent to be used,
Figure BDA0003382451610000073
wherein H, W, r represents the number of categories, the height and width of the input image, and the output stride, respectively;
(3): training using two-dimensional Gaussian kernels and labeledData generation thermodynamic diagram H for supervised training, characterization
Figure BDA0003382451610000074
For a central positioning task. The target center is used as a positive sample, other pixel points are used as negative samples, the problem of the number unbalance of the positive and negative samples is solved through Focal local, and the Loss L is obtainedloc
The general structure of the network is shown in Fig. 2, and the extracted features are used to perform the center localization task. The feature pyramid structure enlarges feature maps of different depths to the size of the last layer and adds them directly, so that the high-resolution information of shallow features and the semantic information of deep features are both retained, enhancing the target detection effect; research shows that shallow features are more suitable for small target detection. The extracted class-related feature map Ĉ, of dimensions C × (H/r) × (W/r), is used for the center localization task, where C, H, W and r are the number of categories, the height and width of the input image, and the output stride. In this embodiment, C = 80 and r = 4 are set; a Gaussian kernel is used for both center localization and detection box regression, and scalars α and β are defined to control the size of the respective kernels.
given the genus CmThe mth label box of a class is first linearly mapped to the scale of the feature map. Then, 2-dimensional Gaussian kernel is adopted
Figure BDA0003382451610000076
To generate
Figure BDA0003382451610000077
Wherein
Figure BDA0003382451610000078
Finally, by applying HmMaximum value of element(s) in H to update C in HmA channel. Generation of HmM is marked with the center of the box as (x) determined by the parameter alpha0,y0) m, the size of the mark frame is (h, w)m. By using
Figure BDA0003382451610000079
To ensure that the center is located in the pixel. In the network setting, α may be made 0.54.
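A sketch of the heatmap generation in step (3) follows; mapping boxes to the feature scale, flooring the centre and taking the element-wise maximum follow the description above, while the relation sigma = alpha * box_size / 6 between alpha and the Gaussian standard deviations is an assumption of this sketch.

# Hypothetical sketch of the supervision heatmap H: for every labeled box a 2-D
# Gaussian centred on the box centre is drawn into the channel of its class,
# keeping the element-wise maximum.
import numpy as np

def build_heatmap(boxes, classes, num_classes, out_h, out_w, stride=4, alpha=0.54):
    """boxes: (M, 4) array of (x1, y1, x2, y2) at image scale; classes: (M,) class indices."""
    heat = np.zeros((num_classes, out_h, out_w), dtype=np.float32)
    ys, xs = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    for (x1, y1, x2, y2), c in zip(boxes, classes):
        # linearly map the label box to the feature-map scale
        x1, y1, x2, y2 = (v / stride for v in (x1, y1, x2, y2))
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)      # floor keeps the centre on a pixel
        w, h = x2 - x1, y2 - y1
        sx = max(alpha * w / 6.0, 1e-3)                      # assumed relation between alpha and sigma
        sy = max(alpha * h / 6.0, 1e-3)
        g = np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2) + ((ys - cy) ** 2) / (2 * sy ** 2)))
        heat[int(c)] = np.maximum(heat[int(c)], g)           # element-wise maximum into channel c_m
    return heat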
The peak of the Gaussian distribution, i.e., the pixel at the center of the box, is considered as a positive sample, while any other pixel is considered as a negative sample. Focal Loss is adopted to address the imbalance in the number of positive and negative samples.
Given the predicted heatmap Ĥ and the localization target H, the loss takes the form shown below:
L_loc = -(1/M) Σ_ijc f_ijc, with
f_ijc = (1 - Ĥ_ijc)^α_f · log(Ĥ_ijc)                        if H_ijc = 1
f_ijc = (1 - H_ijc)^β_f · (Ĥ_ijc)^α_f · log(1 - Ĥ_ijc)      otherwise
where α_f and β_f are hyper-parameters and M represents the number of label boxes; in this embodiment, α_f = 2 and β_f = 4 are set. Ĥ_ijc denotes an element of the predicted feature map Ĥ, c denotes the channel index, (i, j) denotes the spatial position, and H_ijc denotes the corresponding element, i.e., label value, of the localization target H.
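The centre-localization loss described above could then be computed as in the following sketch, assuming the Gaussian-penalized focal loss form with alpha_f = 2 and beta_f = 4.

# Hypothetical sketch of the centre-localization loss L_loc: a Gaussian-penalised
# focal loss over the predicted heatmap, with the box centres (H == 1) as
# positives and every other pixel as a negative.
import torch

def focal_loss_loc(pred, target, alpha_f=2.0, beta_f=4.0, eps=1e-6):
    """pred, target: (B, C, H/r, W/r); target holds the Gaussian heatmap H."""
    pred = pred.clamp(eps, 1.0 - eps)
    pos = (target == 1.0).float()
    neg = 1.0 - pos
    pos_loss = pos * ((1.0 - pred) ** alpha_f) * torch.log(pred)
    neg_loss = neg * ((1.0 - target) ** beta_f) * (pred ** alpha_f) * torch.log(1.0 - pred)
    num_boxes = pos.sum().clamp(min=1.0)                  # M, the number of label boxes
    return -(pos_loss + neg_loss).sum() / num_boxes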
(4): thermodynamic diagrams H and features
Figure BDA0003382451610000083
For the size regression task, calculating the effectiveness of the prediction frame by overlapping the positions of the prediction frame and the real frame to obtain the loss Lreg
For size regression, given the m-th label box on the scale of the feature map, another Gaussian kernel is used to generate S_m, whose kernel size is determined by the parameter β. Note that when α and β are the same, the same kernel can be used to save computation. The non-zero Gaussian region of S_m is denoted A_m. Since A_m always lies inside the m-th label box, it is also called the sub-region in the rest of the embodiments of the present invention.
Each pixel point in the sub-region is considered as a regression sample. Given the region A_m and the output stride r, the regression target at pixel (i, j) is defined as the distances from (ir, jr) to the four sides of the m-th label box, expressed as a four-dimensional vector (w_l, h_t, w_r, h_b), where w_l and w_r are the distances to the left and right sides and h_t and h_b are the distances to the top and bottom sides. The prediction box at pixel point (i, j) can then be represented as
b̂_ij = (ir - ŵ_l·s, jr - ĥ_t·s, ir + ŵ_r·s, jr + ĥ_b·s)
where s is a fixed scalar used to scale up the prediction results to ease optimization; in the present embodiment, s = 16 is set. Note that the prediction box b̂_ij is at the image scale rather than the feature map scale, i.e., the prediction box is located by the two vertices on a diagonal of the rectangle; ŵ_l and ŵ_r denote the predicted values of w_l and w_r, and ĥ_t and ĥ_b denote the predicted values of h_t and h_b.
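Decoding prediction boxes from the four regressed side distances, as described above, might look like the sketch below; the (w_l, h_t, w_r, h_b) channel order and the tensor layout are assumptions of this sketch.

# Hypothetical sketch of box decoding in step (4): at every pixel (i, j) of the
# size-regression map the four predicted side distances are scaled by s and turned
# into an (x1, y1, x2, y2) box at image scale.
import torch

def decode_boxes(size_pred, stride=4, s=16.0):
    """size_pred: (B, 4, h, w) holding the (w_l, h_t, w_r, h_b) predictions."""
    b, _, h, w = size_pred.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.to(size_pred) * stride                        # pixel positions at image scale
    ys = ys.to(size_pred) * stride
    wl, ht, wr, hb = (size_pred[:, k] * s for k in range(4))
    x1, y1 = xs - wl, ys - ht
    x2, y2 = xs + wr, ys + hb
    return torch.stack([x1, y1, x2, y2], dim=1)           # (B, 4, h, w) decoded boxes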
If a pixel is not contained by any sub-region, it is ignored during training. If a pixel is contained in multiple sub-regions, it is an ambiguous sample whose training target is set to be the target with smaller area.
Given the predicted size map Ŝ and the regression target S, the training targets are collected from S and the corresponding prediction results are collected from Ŝ, where N_reg denotes the number of regression samples. For all samples, the prediction box and the corresponding label box are decoded as above, and the overlap GIoU between the positions of the prediction box and the ground-truth box is used as the optimization target, as below:
L_reg = (1/N_reg) · Σ_{(i,j)∈A_m} GIoU(b̂_ij, B_m) × W_ij
where b̂_ij represents the decoded prediction box, B_m is the m-th label box scaled to the image, and W_ij is the sample weight used to balance the loss contributed by each sample.
Due to the scale variation of targets, a large target (with a size larger than a specified size) may generate thousands of regression samples, while a small target may generate only a few. After the losses of all samples are normalized, the losses caused by small targets would be almost negligible, which would impair the detection performance on small targets. Thus, the sample weight W_ij plays an important role in balancing the loss. Assuming that (i, j) lies inside the sub-region A_m of the m-th annotated box, there is:
W_ij = log(a_m) · G_m(i, j) / Σ_{(x,y)∈A_m} G_m(x, y)
where G_m(i, j) is the Gaussian probability at (i, j), G_m(x, y) denotes the Gaussian probability at (x, y), and a_m is the area of the m-th detection box. This treatment makes full use of the richer annotation information contained in large targets while preserving the annotation information of small targets. It also emphasizes the samples near the center of the target, reducing the influence of blurred and low-quality samples.
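The sample weights W_ij and the GIoU-based regression objective could be combined as in the sketch below; normalising the Gaussian within each sub-region and scaling by log(a_m) follow the description above, torchvision's generalized_box_iou is used as a stand-in for the GIoU computation, and writing the objective as minimising 1 - GIoU is an assumption of this sketch.

# Hypothetical sketch of the sample weights W_ij and the size-regression loss L_reg.
import torch
from torchvision.ops import generalized_box_iou

def regression_loss(decoded, gauss, gt_boxes, assignment):
    """
    decoded:    (N, 4) prediction boxes (x1, y1, x2, y2) gathered at regression pixels
    gauss:      (N,)   Gaussian probability G_m(i, j) at those pixels
    gt_boxes:   (M, 4) label boxes at image scale
    assignment: (N,)   index m of the label box whose sub-region contains each pixel
    """
    target = gt_boxes[assignment]                                     # B_m per sample
    area = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    # W_ij: Gaussian probability normalised inside each sub-region, scaled by log(a_m)
    # so that large boxes do not drown out small ones.
    weights = torch.zeros_like(gauss)
    for m in assignment.unique():
        mask = assignment == m
        weights[mask] = gauss[mask] / gauss[mask].sum().clamp(min=1e-6)
    weights = weights * torch.log(area.clamp(min=1.0))
    giou = generalized_box_iou(decoded, target).diagonal()            # per-sample GIoU
    n_reg = max(decoded.shape[0], 1)
    return ((1.0 - giou) * weights).sum() / n_reg                     # weighted GIoU objective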
Finally, the reconstruction losses L_rc1 and L_rc2, the regularization term L_p, the center localization loss L_loc, and the size regression loss L_reg are used as inputs to calculate the total loss L of small target detection; the network weights are optimized according to the total loss L, and a balance between speed and precision is achieved after the optimization is completed.
Specifically, the formula for the total loss L is:
L = λL_r + μL_loc + ηL_reg
where λ, μ, η are the weighting factors of the super-resolution reconstruction loss, the center localization loss, and the size regression loss, respectively.
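The total objective can then be assembled as in the short sketch below; the default weighting factors are illustrative only, since the embodiment does not fix λ, μ and η.

# Hypothetical assembly of the total loss L = lambda*L_r + mu*L_loc + eta*L_reg.
def total_loss(l_r, l_loc, l_reg, lam=1.0, mu=1.0, eta=1.0):
    return lam * l_r + mu * l_loc + eta * l_reg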
Aiming at the problem of insufficient precision caused by the insufficient features available for small target detection in most current detectors, the embodiment of the invention provides a small target detection method based on super-resolution multi-scale feature fusion. First, a feature-level super-resolution technique enhances the semantic information of the low-resolution input; then multi-scale image feature fusion is realized by means of the feature pyramid structure, so that the semantic information of small target objects is prevented from being lost. An attention mechanism makes the feature extractor focus on extracting the features that identify the class to which the object belongs. Finally, center localization and size regression are performed using the extracted features to achieve target detection. The invention achieves short training time, fast inference, and high precision in small target detection, and has an industry-leading small target detection effect.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the inventive concept thereof, and these changes and modifications can be made without departing from the spirit and scope of the invention.

Claims (6)

1. A small target detection method based on super-resolution multi-scale feature fusion is characterized by comprising the following steps:
network model configuration and training:
acquiring a high-resolution image pair and a low-resolution image pair as training images to obtain a training image set;
configuring a network model, comprising: an encoder-decoder network for high-resolution images, a feature extractor G_L for low-resolution images, a generator G, a feature fusion network, and an image target detection network;
the encoder part of the encoder-decoder network is denoted as encoder G_H and the decoder part as decoder D_H; the encoder G_H comprises a plurality of convolution layers and pooling layers arranged in an alternating structure; the decoder D_H comprises a plurality of deconvolution layers whose number corresponds to the number of convolution layers of the encoder G_H, with corresponding feature dimensions and sizes;
the low-resolution image LR in each high/low-resolution image pair is input into the feature extractor G_L, and the feature f_L is obtained based on its output; the high-resolution image HR of the pair is input into the encoder G_H, and the feature f_H is obtained based on its output; the loss function used in encoder-decoder network training is:
L_rc1 = ||HR' - HR||_2^2
where HR' represents the output of the decoder D_H;
the feature extractor GLThe method comprises a multi-layer feature extraction block, wherein the feature extraction block consists of a multi-scale feature fusion network and local residual learning;
the input of the generator G is: performing data enhancement processing on the low-resolution image LR to obtain an image LR ', and disturbing the image LR' and randomly generated noise
Figure FDA0003382451600000013
As input to generator G; the output of the generator G is recorded as the superposition p, and the loss function adopted by the generator G during training is as follows: l isp=||p||;
the output of the generator G is superposed with the output of the feature extractor G_L to obtain a first reconstruction feature, which is input into the decoder D_H; the output of each deconvolution layer of the decoder D_H is used as an input of the feature fusion network, which samples the input feature maps of different sizes to the same size and superposes them, and the superposition result is then input into the image target detection network;
the image target detection network comprises a classification branch and a positioning branch, and the classification branch of the image target detection network classifies targets based on an attention mechanism when the classification branch of the image target detection network classifies the targets;
the total loss adopted during the training of the configured network model is as follows: l ═ λ Lr+μLloc+ηLregWherein L isrRepresents a loss of super-resolution reconstruction, and Lr=Lrc1+Lrc2+Lp,Lrc2Represents the first reconstruction loss as:
Figure FDA0003382451600000012
Lloc、Lregrespectively representing the classification loss of the classification branch of the image target detection network and the positioning loss of the positioning branch, wherein lambda, mu and eta are respectively the loss Lr、LlocAnd LregThe weighting factor of (1);
a step of detecting a low-resolution image to be identified:
inputting the low-resolution image to be recognized into the feature extractor G_L, and obtaining a first feature map of the low-resolution image to be identified based on the output of the feature extractor G_L;
after data enhancement processing is carried out on the low-resolution image, it is superposed with a randomly generated noise disturbance and input into the generator G, and the superposition amount is obtained based on the output of the generator G; the superposition result of the first feature map and the superposition amount is taken as the first reconstruction feature of the low-resolution image to be identified;
inputting the first reconstruction feature into the decoder D_H, generating second reconstruction features of different sizes based on the output of each deconvolution layer of the decoder D_H, and inputting them into the feature fusion network;
the feature fusion network samples all the second reconstruction features to the same size for superposition to obtain third reconstruction features and inputs the third reconstruction features into the image target detection network;
and obtaining the category of the small target and the position of the detection frame thereof based on the output of the image target detection network.
2. The method of claim 1, wherein, in the feature extractor G_L, the network structure of the feature extraction block comprises two parallel branches: one branch comprises two sequentially connected first convolution blocks, each first convolution block consisting of a 5×5 convolution layer and a ReLU layer; the other branch comprises two sequentially connected second convolution blocks, each second convolution block consisting of a 3×3 convolution layer and a ReLU layer; the output of the first first convolution block is also connected into the second second convolution block, the output of the first second convolution block is also connected into the second first convolution block, and the outputs of the two branches are merged into a convolution layer with a 1×1 convolution kernel.
3. The method of claim 2, wherein the number of layers of the feature extraction block is 5.
4. The method according to claim 1, wherein, when the classification branch of the image target detection network performs target classification based on the attention mechanism, a compression operation is first performed on the superposition result output by the feature fusion network to obtain the global feature d' at the channel level, where the number of channels C is equal to the number of categories to be identified; then, based on the weights W_c of the different channels, the final class-related feature map Ĉ is obtained.
5. The method of claim 4, wherein the training treats the feature classification of each channel of the classification branch as a binary classification problem, computing a binary cross-entropy penalty for each channel.
6. The method of claim 1, wherein the validity of the prediction box is calculated using the overlap of the positions of the prediction box and the ground-truth box, resulting in the loss L_reg:
L_reg = (1/N_reg) · Σ_{(i,j)∈A_m} GIoU(b̂_ij, B_m) × W_ij
where N_reg indicates the number of samples of the positioning branch, b̂_ij represents the prediction box output by the positioning branch, B_m represents the m-th label box scaled to the image, (i, j) represents the spatial position of a pixel point, A_m represents the sub-region of the given m-th annotated box, and W_ij represents the sample weight:
W_ij = log(a_m) · G_m(i, j) / Σ_{(x,y)∈A_m} G_m(x, y)
where G_m(i, j) denotes the Gaussian probability at (i, j), G_m(x, y) denotes the Gaussian probability at (x, y), and a_m indicates the area of the m-th annotated box.
CN202111473712.1A 2021-11-30 2021-11-30 Small target detection method based on super-resolution multi-scale feature fusion Active CN114241274B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111473712.1A CN114241274B (en) 2021-11-30 2021-11-30 Small target detection method based on super-resolution multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111473712.1A CN114241274B (en) 2021-11-30 2021-11-30 Small target detection method based on super-resolution multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN114241274A true CN114241274A (en) 2022-03-25
CN114241274B CN114241274B (en) 2023-04-07

Family

ID=80753196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111473712.1A Active CN114241274B (en) 2021-11-30 2021-11-30 Small target detection method based on super-resolution multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN114241274B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546473A (en) * 2022-12-01 2022-12-30 珠海亿智电子科技有限公司 Target detection method, apparatus, device and medium
CN116309274A (en) * 2022-12-12 2023-06-23 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN116309431A (en) * 2023-03-14 2023-06-23 中国人民解放军空军军医大学 Visual interpretation method based on medical image
CN117542105A (en) * 2024-01-09 2024-02-09 江西师范大学 Facial super-resolution and expression recognition method for low-resolution images under classroom teaching
CN117576488A (en) * 2024-01-17 2024-02-20 海豚乐智科技(成都)有限责任公司 Infrared dim target detection method based on target image reconstruction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
CN108564109A (en) * 2018-03-21 2018-09-21 天津大学 A kind of Remote Sensing Target detection method based on deep learning
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110009679A (en) * 2019-02-28 2019-07-12 江南大学 A kind of object localization method based on Analysis On Multi-scale Features convolutional neural networks
CN112183203A (en) * 2020-08-26 2021-01-05 北京工业大学 Real-time traffic sign detection method based on multi-scale pixel feature fusion

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3171297A1 (en) * 2015-11-18 2017-05-24 CentraleSupélec Joint boundary detection image segmentation and object recognition using deep learning
CN108564109A (en) * 2018-03-21 2018-09-21 天津大学 A kind of Remote Sensing Target detection method based on deep learning
CN109344821A (en) * 2018-08-30 2019-02-15 西安电子科技大学 Small target detecting method based on Fusion Features and deep learning
CN110009679A (en) * 2019-02-28 2019-07-12 江南大学 A kind of object localization method based on Analysis On Multi-scale Features convolutional neural networks
CN112183203A (en) * 2020-08-26 2021-01-05 北京工业大学 Real-time traffic sign detection method based on multi-scale pixel feature fusion

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YALAN YE et al.: "End-to-end versatile human activity recognition with activity image transfer learning" *
刘颖; 刘红燕; 范九伦; 公衍超; 李莹华; 王富平; 卢津: "基于深度学习的小目标检测研究与应用综述 (A survey of research and applications of small object detection based on deep learning)" *
李希; 徐翔; 李军: "面向航空飞行安全的遥感图像小目标检测 (Small object detection in remote sensing images for aviation flight safety)" *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115546473A (en) * 2022-12-01 2022-12-30 珠海亿智电子科技有限公司 Target detection method, apparatus, device and medium
CN116309274A (en) * 2022-12-12 2023-06-23 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN116309274B (en) * 2022-12-12 2024-01-30 湖南红普创新科技发展有限公司 Method and device for detecting small target in image, computer equipment and storage medium
CN116309431A (en) * 2023-03-14 2023-06-23 中国人民解放军空军军医大学 Visual interpretation method based on medical image
CN116309431B (en) * 2023-03-14 2023-10-27 中国人民解放军空军军医大学 Visual interpretation method based on medical image
CN117542105A (en) * 2024-01-09 2024-02-09 江西师范大学 Facial super-resolution and expression recognition method for low-resolution images under classroom teaching
CN117576488A (en) * 2024-01-17 2024-02-20 海豚乐智科技(成都)有限责任公司 Infrared dim target detection method based on target image reconstruction
CN117576488B (en) * 2024-01-17 2024-04-05 海豚乐智科技(成都)有限责任公司 Infrared dim target detection method based on target image reconstruction

Also Published As

Publication number Publication date
CN114241274B (en) 2023-04-07

Similar Documents

Publication Publication Date Title
CN114241274B (en) Small target detection method based on super-resolution multi-scale feature fusion
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
CN111563507A (en) Indoor scene semantic segmentation method based on convolutional neural network
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
Li et al. Toward in situ zooplankton detection with a densely connected YOLOV3 model
Han et al. L-Net: lightweight and fast object detector-based ShuffleNetV2
CN112270366A (en) Micro target detection method based on self-adaptive multi-feature fusion
Fan et al. A novel sonar target detection and classification algorithm
Li et al. A survey on deep-learning-based real-time SAR ship detection
Irfan et al. A novel feature extraction model to enhance underwater image classification
Yang et al. Side-scan sonar image segmentation based on multi-channel CNN for AUV navigation
Patel et al. A novel approach for semantic segmentation of automatic road network extractions from remote sensing images by modified UNet
Mehran et al. An effective deep learning model for ship detection from satellite images
Li et al. Adaptive fusion nestedUNet for change detection using optical remote sensing images
CN116935044A (en) Endoscopic polyp segmentation method with multi-scale guidance and multi-level supervision
CN115186804A (en) Encoder-decoder network structure and point cloud data classification and segmentation method adopting same
Liu et al. Learning to refine object contours with a top-down fully convolutional encoder-decoder network
Aung et al. Multitask learning via pseudo-label generation and ensemble prediction for parasitic egg cell detection: IEEE ICIP Challenge 2022
Wang et al. Dunhuang mural line drawing based on multi-scale feature fusion and sharp edge learning
Wang et al. A Novel Neural Network Based on Transformer for Polyp Image Segmentation
Burugupalli Image classification using transfer learning and convolution neural networks
Liu et al. A Novel Improved Mask RCNN for Multiple Targets Detection in the Indoor Complex Scenes
Haque et al. Multi scale object detection based on single shot multibox detector with feature fusion and inception network
Huang et al. Under water object detection based on convolution neural network
CN116051984B (en) Weak and small target detection method based on Transformer

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant