CN113139896A - Target detection system and method based on super-resolution reconstruction - Google Patents

Target detection system and method based on super-resolution reconstruction

Info

Publication number
CN113139896A
CN113139896A (application CN202010052220.4A)
Authority
CN
China
Prior art keywords: super, image data, resolution, resolution reconstruction, detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010052220.4A
Other languages
Chinese (zh)
Inventor
李永
明悦
张高鑫
刘莹
丰·石
叶翔
李慧
王凡
何子航
王伟刚
李凤男
赵家凤
李婉婷
胡嘉豪
李博瀚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Boeing Co
Original Assignee
Beijing University of Posts and Telecommunications
Boeing Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications and Boeing Co
Priority to CN202010052220.4A
Publication of CN113139896A
Legal status: Pending (Current)

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G06T3/4076 Super resolution by iteratively correcting the provisional high resolution image using the original low-resolution image
    • G06T3/4046 Scaling the whole image or part thereof using neural networks
    • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T7/269 Analysis of motion using gradient-based methods
    • G06T2207/10016 Video; image sequence
    • G06T2207/20221 Image fusion; image merging

Abstract

The application discloses a target detection system and method based on super-resolution reconstruction. The system comprises: a data acquisition module configured to acquire image data to be detected; a super-resolution reconstruction module configured to receive the image data acquired by the data acquisition module and perform super-resolution reconstruction processing on the image data; a target detection module configured to perform target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing module configured to crop the image data subjected to target detection into a plurality of sub-images and map the detection result of each sub-image into the combined image data for coordinate fusion, thereby obtaining a target detection result.

Description

Target detection system and method based on super-resolution reconstruction
Technical Field
The present invention relates to the field of image processing. In particular, the present invention relates to a system and method for object detection based on super-resolution reconstruction.
Background
Image super-resolution is a signal processing technique that improves the spatial resolution of an image or of a target on the basis of existing imaging equipment; it addresses the problem that the imaging resolution of a scene or target is too low in certain video- and image-based applications. Image super-resolution techniques fall into two categories: single-frame techniques that improve resolution using only a single image itself, such as SRCNN and EDSR; and multi-frame techniques that use adjacent frames to increase the resolution of a particular frame, such as sub-pixel convolutional neural networks and ESPCN. Image super-resolution also involves image quality assessment algorithms, which mainly include: algorithms based on convolutional neural networks; and algorithms based on image gradient features. The present invention generally relates to image super-resolution techniques that perform super-resolution reconstruction from multiple image frames.
Video super-resolution processing focuses mainly on three problems: 1) how to make full use of the correlated information among multiple frames; 2) how to efficiently fuse image details into a high-resolution image; and 3) how to increase computation speed. Video super-resolution sometimes requires first mapping the low-resolution image onto a high-definition grid by upsampling, but this operation increases computational complexity. To address this, Shi, Wenzhe et al. published "Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network" (2016): 1874-1883. In the proposed network, the front layers are all ordinary (integer-pixel) convolution layers with activation function layers, while the last layer is a new sub-pixel convolution layer that rearranges pixels by channel instead of performing a convolution operation: an H × W × C·r² feature map is rearranged into an (r·H) × (r·W) × C output. To further increase speed, Caballero, Jose et al. published "Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation" (2016): 2848-2857, which proposes an end-to-end jointly trained motion compensation and video super-resolution algorithm and introduces a spatio-temporal sub-pixel convolution network to achieve real-time video image super-resolution. The network mainly uses early fusion and slow fusion to process the time dimension, then builds a motion compensation framework based on spatial transformation, and combines it with the ESPCN spatio-temporal network to compute video super-resolution reconstruction in real time. The prior art for video super-resolution reconstruction has the following problems: (1) the performance of traditional super-resolution reconstruction (interpolation) and target detection methods has been surpassed by deep-learning-based methods: their reconstruction quality is lower, and template-based detection methods have limited descriptive capability and capture little semantic information; (2) dependence on prior knowledge: such algorithms rely on the accuracy of the prior knowledge (the target image template), and their accuracy drops when the actual application scenario does not match the introduced prior knowledge. To solve these problems of traditional super-resolution reconstruction (interpolation) and target detection methods, a method has been proposed that improves detection accuracy by adding context information (an attempt to improve the performance of SSD by adding context), but this method also has the following problems: (1) the parameter (computation) count is large and the algorithm is slow; (2) the larger number of parameters makes the model occupy more storage space.
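The sub-pixel rearrangement described above is now a standard operation. A minimal sketch follows, using PyTorch's built-in PixelShuffle; the tensor sizes are illustrative assumptions, not values from the cited papers:

```python
import torch
import torch.nn as nn

# Sub-pixel (pixel-shuffle) rearrangement: an H x W x C*r^2 feature map
# becomes an (r*H) x (r*W) x C output, replacing an explicit deconvolution.
r = 2                                   # upscaling factor (illustrative)
x = torch.randn(1, 3 * r * r, 32, 32)   # N x (C*r^2) x H x W feature map
shuffle = nn.PixelShuffle(r)            # rearranges channels into space
y = shuffle(x)
print(y.shape)                          # torch.Size([1, 3, 64, 64])
```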
As an important field of image processing using artificial intelligence, target detection is essentially multi-object localization: object detection is a combination of classification and localization tasks whose aim is, given a picture, to accurately find the positions (coordinates) of the objects in it and to label their categories. The main performance indicators of a target detection model are detection accuracy and speed. Mainstream target detection algorithms are currently based on deep learning models and can be divided into two categories: (1) two-stage detection algorithms, which divide the detection problem into two stages, first generating candidate regions and then classifying them; a typical representative is the R-CNN family of algorithms based on region proposals; (2) one-stage detection algorithms, which need no region proposal stage and directly generate class probabilities and position coordinates of objects; typical algorithms include YOLO and SSD.
A scale-invariant representation is crucial for identifying and locating objects. The deeper layers of modern CNNs have large strides (32 pixels), which yields a very coarse representation of the input image and makes small-target detection very challenging. When faced with small targets (in essence a scale-invariance problem), the methods above all detect them insufficiently well. This reflects a contradiction in convolutional network structures: the feature maps of the shallow layers of the network are large but their semantic (context) information is insufficient, while the semantic information of the deep layers is sufficient but their feature maps are too small. To detect multi-scale objects, various solutions have been proposed, such as:
(1) dilated (atrous) convolution is used to increase the resolution of the feature map; this preserves the weights and receptive field of the pre-trained network and does not degrade performance on large objects (see the sketch after this list);
(2) based on the fact that shallow and deep layers contain complementary information, shallow features and deep features (context information) are fused for prediction;
(3) predictions are made directly and independently on the feature maps of both shallow and deep network layers;
(4) the network input image is upsampled during training.
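As an illustration of option (1) above, here is a minimal sketch of a dilated (atrous) convolution in PyTorch; the channel counts and input size are assumptions for demonstration only:

```python
import torch
import torch.nn as nn

# A dilated 3x3 convolution enlarges the receptive field while keeping the
# feature-map resolution unchanged (padding chosen to preserve spatial size).
x = torch.randn(1, 64, 56, 56)
conv = nn.Conv2d(64, 64, kernel_size=3, dilation=2, padding=2)
y = conv(x)
print(y.shape)  # torch.Size([1, 64, 56, 56]): same size, larger receptive field
```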
Therefore, how to perform multi-frame super-resolution reconstruction to obtain a higher-resolution image while preserving the integrity of small targets in that image, so as to achieve fast and accurate target detection, is a problem well worth studying.
Disclosure of Invention
The embodiments of the invention provide an end-to-end target detection method. It introduces scale invariance through a multi-frame super-resolution module, designs a learnable data division module that adaptively crops the super-resolution-reconstructed image while keeping small targets intact within the sub-images, and finally feeds the sub-images into a target detection module for detection, thereby improving the detection of small targets in the application scenario of detecting targets in video (multiple frames).
According to an aspect of the embodiments of the present invention, there is provided a target detection system based on super-resolution reconstruction, the system comprising: a data acquisition module configured to acquire image data to be detected; a super-resolution reconstruction module configured to receive the image data acquired by the data acquisition module and perform super-resolution reconstruction processing on the image data; a target detection module configured to perform target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing module configured to crop the image data subjected to target detection into a plurality of sub-images and map the detection result of each sub-image into the combined image data for coordinate fusion, thereby obtaining a target detection result.
In the target detection module, the Single Shot MultiBox Detector (SSD) algorithm is used.
In the dividing and fusing module, the step size used to crop the image data subjected to target detection is a value obtained based on edge detection.
The super-resolution reconstruction module is a trained spatio-temporal sub-pixel convolutional network comprising a motion estimation portion and a super-resolution portion, wherein the spatio-temporal sub-pixel convolutional network is trained by:
the Loss formula of the super-resolution part is as follows:
Figure BDA0002371583680000051
the Loss formula of the motion estimation part is as follows:
Figure BDA0002371583680000052
wherein the content of the first and second substances,
Figure BDA0002371583680000057
is approximated to
Figure BDA0002371583680000053
ε=0.01
The total Loss formula of the spatio-temporal sub-pixel convolution network during end-to-end training is as follows:
Figure BDA0002371583680000054
wherein theta isΔIs a parameter of the motion estimation part, theta is a parameter of the super-resolution part,
Figure BDA0002371583680000055
which represents an image frame or frames of an image,
Figure BDA0002371583680000056
representing the image frame that has undergone the warping process.
A target loss function $L_{Det}$ is employed in the target detection module, obtained by the following equation:

$$L_{Det}(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

wherein: $N$ is the number of default boxes matching a real box, $L_{loc}$ is the Smooth L1 loss function from Fast R-CNN, $L_{conf}$ is the Softmax loss, $c$ is the confidence of each class, and $\alpha$ is a weight term set to 1.
According to another aspect of the embodiments of the present invention, there is also provided a target detection method based on super-resolution reconstruction, the method comprising the following steps: a data acquisition step of acquiring image data to be detected; a super-resolution reconstruction step of receiving the acquired image data and performing super-resolution reconstruction processing on it; a target detection step of performing target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing step of cropping the image data subjected to target detection into a plurality of sub-images and mapping the detection result of each sub-image into the combined image data for coordinate fusion, thereby obtaining a target detection result.
In the target detection step, the Single Shot MultiBox Detector (SSD) algorithm is used.
In the dividing and fusing step, the step size used to crop the image data subjected to target detection is a value obtained based on edge detection.
In the super-resolution reconstruction step, a trained spatio-temporal sub-pixel convolutional network is used, the spatio-temporal sub-pixel convolutional network comprising a motion estimation part and a super-resolution part, wherein the spatio-temporal sub-pixel convolutional network is trained by:
the Loss formula of the super-resolution part is as follows:
Figure BDA0002371583680000071
the Loss formula of the motion estimation part is as follows:
Figure BDA0002371583680000078
wherein the content of the first and second substances,
Figure BDA0002371583680000072
is approximated to
Figure BDA0002371583680000073
ε=0.01
The total Loss formula of the spatio-temporal sub-pixel convolution network during end-to-end training is as follows:
Figure BDA0002371583680000074
wherein theta isΔIs a parameter of the motion estimation part, and theta is a parameter of the super-resolution partThe number of the first and second groups is,
Figure BDA0002371583680000075
which represents an image frame or frames of an image,
Figure BDA0002371583680000076
representing the image frame that has undergone the warping process.
A target loss function $L_{Det}$ is employed in the target detection step, obtained by the following equation:

$$L_{Det}(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

wherein: $N$ is the number of default boxes matching a real box, $L_{loc}$ is the Smooth L1 loss function from Fast R-CNN, $L_{conf}$ is the Softmax loss, $c$ is the confidence of each class, and $\alpha$ is a weight term set to 1.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
fig. 1 shows a schematic diagram of a super-resolution reconstruction based object detection system according to an embodiment of the present invention.
Fig. 2 shows a schematic diagram of a super-resolution reconstruction module in a super-resolution reconstruction based object detection system according to an embodiment of the present invention.
Fig. 3 shows a schematic diagram of an object detection module in an object detection system based on super-resolution reconstruction according to an embodiment of the present invention.
Fig. 4 shows a schematic diagram of a partitioning and fusion module in a target detection system based on super-resolution reconstruction according to an embodiment of the present invention.
Fig. 5 shows a flowchart of a target detection method based on super-resolution reconstruction according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of this invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules or elements is not necessarily limited to those steps or modules or elements expressly listed, but may include other steps or modules or elements not expressly listed or inherent to such process, method, article, or apparatus.
In order to facilitate the following description of the present invention, a few basic concepts will be described first.
Deep neural network: a type of neural network; deep learning is a branch of machine learning.
Feature: a way of representing an image. Conventional methods represent an image with RGB three-channel pixels. To make recognition by computer more effective, redundant information in the RGB representation is filtered out and more semantic features are extracted. Image features contain salient information in the image, such as contour edges and color.
Dimension: the size of the image.
Super-resolution: improving the resolution of the original image by hardware or software methods; the process of obtaining a high-resolution image from a series of low-resolution images is super-resolution reconstruction.
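For contrast with learned super-resolution, a minimal sketch of plain bicubic interpolation (not the patented multi-frame method): it raises the pixel count but adds no new detail, which is exactly what learned reconstruction improves upon.

```python
import torch
import torch.nn.functional as F

# Bicubic upsampling: more pixels, but no recovered high-frequency detail.
lr = torch.randn(1, 3, 64, 64)
hr = F.interpolate(lr, scale_factor=4, mode="bicubic", align_corners=False)
print(hr.shape)  # torch.Size([1, 3, 256, 256])
```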
Target detection: object detection is a combination of classification and localization tasks; given a picture, the task is to find exactly where the objects in the picture are located (their coordinates) and to label the classes of the objects.
Flaw detection: a specific application scenario of the target detection problem. Given an image of the surface of some material, the damage appearing in the image and its category (categories include damaged rivets, scratches, cracks, paint loss, etc.) are identified.
Scale invariance (scale invariant): certain characteristics of a system remain unchanged after the system undergoes a scale transformation.
The target detection system and method based on super-resolution reconstruction provided by the invention can be used in practical scenarios for detecting (small) targets in video. The core of the system is to introduce scale invariance into video target detection to improve the detection of small targets; given enough training samples, the learned algorithm model has excellent discrimination capability and strong robustness. The algorithm can learn the characteristics of many kinds of images and can be widely applied to scenarios such as fine scar recognition on various material surfaces, for example rivet damage detection on aircraft surface materials.
Fig. 1 shows a schematic diagram of a super-resolution reconstruction based object detection system according to an embodiment of the present invention. As shown in fig. 1, the object detection system 100 based on super-resolution reconstruction includes: a data acquisition module 102 configured to acquire image data to be detected; a super-resolution reconstruction module 104 configured to receive the image data acquired by the data acquisition module 102 and perform super-resolution reconstruction processing on the image data; a target detection module 106 configured to perform target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing module 108 configured to cut the image data subjected to the target detection into a plurality of sub-image data, and map the detection result of each of the sub-image data into the combined image data to perform coordinate fusion, thereby obtaining a target detection result.
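A hypothetical sketch of how the four modules of fig. 1 could be wired together; sr_net and detector stand in for the super-resolution module 104 and detection module 106, and the fixed tile size replaces the learnable cropping step of module 108 purely for illustration:

```python
import torch
import torch.nn as nn

class SRDetectionSystem(nn.Module):
    """Pipeline sketch: acquisition -> super-resolution -> divide -> detect -> fuse."""

    def __init__(self, sr_net: nn.Module, detector: nn.Module, tile: int = 300):
        super().__init__()
        self.sr_net = sr_net      # super-resolution reconstruction module (104)
        self.detector = detector  # target detection module (106)
        self.tile = tile          # fixed crop step; the patent learns this instead

    def forward(self, lr_frames: torch.Tensor):
        hr = self.sr_net(lr_frames)              # super-resolved image, N x C x H x W
        _, _, height, width = hr.shape
        detections = []
        for top in range(0, height, self.tile):  # dividing step of module 108
            for left in range(0, width, self.tile):
                sub = hr[:, :, top:top + self.tile, left:left + self.tile]
                # detector is assumed to return (x1, y1, x2, y2, score, cls) tuples
                for x1, y1, x2, y2, score, cls in self.detector(sub):
                    # coordinate fusion: map sub-image boxes back to the full image
                    detections.append((x1 + left, y1 + top, x2 + left, y2 + top, score, cls))
        return detections
```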
Fig. 2 is a schematic diagram illustrating a super-resolution reconstruction module in a super-resolution reconstruction based object detection system according to an embodiment of the present invention. The super-resolution reconstruction module 200 shown in fig. 2, after being trained, may serve as the super-resolution reconstruction module 104 shown in fig. 1. Through training, the super-resolution reconstruction module becomes a trained spatio-temporal sub-pixel convolution network. As shown in fig. 2, the spatio-temporal sub-pixel convolution network includes a motion estimation portion 202 and a super-resolution portion 204. The multi-frame super-resolution module of this network structure is based on real-time video super-resolution with spatio-temporal networks and motion compensation. The network can perform video image super-resolution at real-time speed; an algorithm combining motion compensation and video super-resolution is also used and can be trained end-to-end. Compared with a single-frame model, the spatio-temporal network reduces computation while maintaining output quality.
in the motion estimation section 202, it is,
Figure BDA0002371583680000111
and
Figure BDA0002371583680000112
is distinguished in that
Figure BDA0002371583680000113
And
Figure BDA0002371583680000114
two different frames, the position of the object in the image may have changed, which may be achieved by warping (warp)
Figure BDA0002371583680000115
And
Figure BDA0002371583680000116
the position of the object is almost the same (slightly different) and then fed into the super-resolution section 204.
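A sketch of the warping operation assumed above, implemented with a dense flow field and bilinear resampling; the flow layout (N x 2 x H x W pixel offsets) is an assumption:

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Resample `frame` (N x C x H x W) at positions shifted by `flow` (N x 2 x H x W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow  # sampling positions
    # normalize positions to [-1, 1] as grid_sample expects
    grid[:, 0] = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid[:, 1] = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(frame, grid.permute(0, 2, 3, 1), align_corners=True)
```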
In training, the loss of the motion estimation portion 202 is an MSE loss plus a Huber loss; the Huber loss is added to make the flow spatially smooth. The Loss formula is as follows:

$$L_{ME}(\theta_\Delta) = \left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)$$

wherein the latter term is approximated as

$$\mathcal{H}(\partial_{x,y}\Delta) \approx \sqrt{\epsilon + \textstyle\sum_{i,j}\left[(\partial_x\Delta_{i,j})^2 + (\partial_y\Delta_{i,j})^2\right]}, \quad \epsilon = 0.01$$

The Loss formula of the super-resolution portion 204 is as follows:

$$L_{SR}(\theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2$$

Finally, when end-to-end training is performed through the motion estimation part 202 and the super-resolution part 204, the overall Loss is

$$L(\theta_\Delta, \theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2 + \sum_{i=\pm 1}\left[\left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)\right]$$

wherein $\theta_\Delta$ is the parameter set of the motion estimation section 202 and $\theta$ is the parameter set of the super-resolution section 204.
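A sketch of this combined objective under the reconstruction above; epsilon follows the text (0.01), while the smoothness weight lambda is an assumed hyperparameter:

```python
import torch

def huber(flow_grad: torch.Tensor, eps: float = 0.01) -> torch.Tensor:
    # Huber-style smoothness term: sqrt(eps + sum of squared flow gradients)
    return torch.sqrt(eps + (flow_grad ** 2).sum())

def total_loss(hr, sr_out, lr_t, warped, flow, lam=0.01):
    sr_term = ((hr - sr_out) ** 2).mean()      # super-resolution MSE
    me_term = ((warped - lr_t) ** 2).mean()    # motion-compensation MSE
    dx = flow[..., :, 1:] - flow[..., :, :-1]  # horizontal flow gradient
    dy = flow[..., 1:, :] - flow[..., :-1, :]  # vertical flow gradient
    return sr_term + me_term + lam * (huber(dx) + huber(dy))
```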
Fig. 3 is a schematic diagram illustrating an object detection module in an object detection system based on super-resolution reconstruction according to an embodiment of the present invention. The object detection module 300 shown in fig. 3, after being trained, may serve as the object detection module 106 shown in fig. 1. The target detection module mainly uses the SSD algorithm, but other detection algorithms (such as YOLOv3) may be substituted.
The object detection module 300 includes: an image input unit 302, a first set of convolutional layers 304, a second set of convolutional layers 306, and a detection output unit 308. In the second set of convolutional layers 306, a feature-pyramid structure is used for detection: feature maps of different sizes, such as conv4_3, conv7 (FC7), conv6_2, conv7_2, conv8_2 and conv9_2, are used, and object classification and position regression are performed simultaneously on multiple feature maps. The very beginning of the SSD model, referred to herein as the base network (VGG-16 is used here; lightweight base networks such as MobileNet or ShuffleNet can be used to speed up the algorithm), is an ordinary image-classification network. After the base network, additional auxiliary structures are added, mainly comprising the following three parts: (1) Multi-scale feature maps for detection: after the base network, additional convolutional layers are added whose sizes decrease layer by layer, allowing prediction at multiple scales. (2) Convolutional predictors for detection: each newly added layer (or feature layer in the base network) can use a set of convolution kernels to produce a fixed set of predictions. (3) Default boxes and aspect ratios: the position of each default box relative to its corresponding feature-map cell is fixed. In each feature-map cell, the offsets between the prediction boxes and the default boxes are predicted, together with the score of each box containing an object; that is, the prediction box actually predicts offsets relative to the default box.
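A minimal sketch of part (2), the convolutional predictors: each selected feature map gets a 3x3 convolution predicting, for k default boxes per cell, 4 offsets plus per-class scores. The channel counts, spatial sizes, k, and class count are assumptions:

```python
import torch
import torch.nn as nn

num_classes, k = 21, 4  # assumed values (e.g. 20 classes + background)
heads = nn.ModuleList(
    nn.Conv2d(c, k * (4 + num_classes), kernel_size=3, padding=1)
    for c in (512, 1024, 512, 256)  # e.g. conv4_3, conv7, conv8_2, conv9_2
)
feats = [torch.randn(1, c, s, s) for c, s in ((512, 38), (1024, 19), (512, 10), (256, 5))]
preds = [head(f) for head, f in zip(heads, feats)]  # per-scale offsets + class scores
print([p.shape for p in preds])
```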
The SSD training objective can handle multiple target classes. Let $x_{ij}^p = 1$ indicate that the $i$-th default box matches the $j$-th real box of category $p$, and $x_{ij}^p = 0$ otherwise. Under this matching strategy there must be $\sum_i x_{ij}^p \geq 1$, meaning that for the $j$-th real box there may be multiple default boxes matching it. The total objective loss function is obtained by a weighted sum of the localization loss (loc) and the confidence loss (conf):

$$L(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

The parameters have the following meanings: $N$ is the number of default boxes that match a real box. The localization loss is the Smooth L1 loss from Fast R-CNN, applied to the parameters of the predicted box ($l$) and the real box ($g$), i.e. center coordinates, width, and height. The confidence loss is the Softmax loss, whose input is the confidence $c$ of each class. The weight term $\alpha$ is set to 1.
After a series of predictions is generated, many prediction boxes fit the real boxes, but many more do not, and the negative samples far outnumber the positive samples, which makes training hard to converge; hard negative mining is therefore performed. The default boxes at each object position are sorted by confidence, and the top ones are selected so that the final negative-to-positive ratio is 3:1 (the MineHardExamples function of this algorithm is in bbox_util.cpp). Experiments found that this ratio leads to faster optimization and more stable training. Data augmentation is performed on the training data during training: to make the model more robust to target scale and size, the paper augments the training images, with each training sample generated randomly by one of the following methods: (1) use the original image; (2) sample a patch with a minimum Jaccard overlap with the objects of 0.1, 0.3, 0.5, 0.7, or 0.9; (3) sample a patch randomly.
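A sketch of the 3:1 hard-negative selection described above (not the Caffe implementation itself): negatives are ranked by confidence loss and only the hardest are kept.

```python
import torch

def mine_hard_negatives(conf_loss: torch.Tensor, pos: torch.Tensor, ratio: int = 3):
    # conf_loss: per-default-box confidence loss; pos: boolean mask of positives
    neg_loss = conf_loss.clone()
    neg_loss[pos] = 0.0                       # exclude positives from the ranking
    num_neg = int(ratio * pos.sum().item())   # 3:1 negative-to-positive ratio
    idx = neg_loss.argsort(descending=True)[:num_neg]
    neg = torch.zeros_like(pos)
    neg[idx] = True
    return neg                                # mask of selected hard negatives
```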
Fig. 4 shows a schematic diagram of a partitioning and fusion module in a target detection system based on super-resolution reconstruction according to an embodiment of the present invention. The partitioning and fusion module 400 shown in fig. 4 may be implemented as the dividing and fusing module 108 shown in fig. 1. The present invention adds a cropping and fusion operation on the data. The upper part of fig. 4 shows the data partitioning unit 402 in the partitioning and fusion module 400, a learnable module that partitions an image into n sub-images using a sliding window whose step size can change at each step. The purpose of cropping is to preserve the reconstructed resolution while fitting the fixed input size of the object detection module; the purpose of the variable step size is to preserve the integrity of the objects to be detected in each sub-image. In the data partitioning unit 402, the step size for cropping the image data is a value obtained based on edge detection. The cropping step size is thus learnable: the network learns to predict a value that specifies the crop size, with the goal of keeping complete detection targets within each sub-image. The learning label is the edge coordinates of the objects in each image (predicting this value keeps objects relatively complete), while the benefit brought by super-resolution (higher resolution) is preserved.
The lower part of fig. 4 shows the data fusion unit 404 in the partitioning and fusion module 400. The data fusion unit 404 is responsible for recombining the cropped, detected sub-images into one image and mapping the detection result of each sub-image into the recombined large image for coordinate fusion to obtain the final detection result. The data fusion unit 404 performs bounding-box coordinate recovery, illustrated here with a fixed step size and four sub-images. Finally, after data fusion is completed in the data fusion unit 404, the target detection result is output.
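A sketch of the coordinate-recovery step of unit 404, using the fixed-step, four-sub-image example; the box layout and class labels are illustrative assumptions:

```python
def fuse_boxes(per_tile_boxes, tile_offsets):
    """Map boxes detected in each sub-image back into full-image coordinates."""
    fused = []
    for boxes, (dx, dy) in zip(per_tile_boxes, tile_offsets):
        for x1, y1, x2, y2, score, cls in boxes:
            fused.append((x1 + dx, y1 + dy, x2 + dx, y2 + dy, score, cls))
    return fused

# Example: four sub-images of a 600x600 image, fixed step of 300.
tiles = [(0, 0), (300, 0), (0, 300), (300, 300)]
dets = [[(10, 20, 50, 60, 0.9, "rivet")], [], [], [(5, 5, 40, 30, 0.8, "crack")]]
print(fuse_boxes(dets, tiles))
```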
Fig. 5 shows a flowchart of a target detection method based on super-resolution reconstruction according to an embodiment of the present invention. The method comprises the following steps: a data acquisition step S502 of acquiring image data to be detected; a super-resolution reconstruction step S504 of receiving the acquired image data and performing super-resolution reconstruction processing on it; a target detection step S506 of performing target detection on the image data subjected to the super-resolution reconstruction processing; and a dividing and fusing step S508 of cropping the image data subjected to target detection into a plurality of sub-images and mapping the detection result of each sub-image into the combined image data for coordinate fusion, thereby obtaining a target detection result.
The target detection technology and method based on super-resolution reconstruction provided by the invention can achieve the following effects:
1) Accuracy: in the scenario of detecting targets in video, the super-resolution module increases the resolution of the input image, introducing scale invariance, improving the detection of small targets, and raising overall accuracy.
2) Flexibility: the whole network is end-to-end and easy to train; the lightweight base network and the hyper-parameters used in the model sub-modules can be replaced according to the user's needs.
3) Wide applicability: the method can be applied to real-time detection tasks involving small targets in a variety of scenarios.
4) Strong generalization: given enough training samples, the learned algorithm model has excellent accuracy (generalization ability) in practical applications.

Claims (10)

1. A target detection system based on super-resolution reconstruction, the system comprising:
a data acquisition module configured to acquire image data to be detected;
a super-resolution reconstruction module configured to receive the image data acquired by the data acquisition module and perform super-resolution reconstruction processing on the image data;
a target detection module configured to perform target detection on the image data subjected to the super-resolution reconstruction processing; and
a dividing and fusing module configured to crop the image data subjected to target detection into a plurality of sub-image data, and to map the detection result of each piece of sub-image data into the combined image data for coordinate fusion, thereby obtaining a target detection result.
2. The target detection system based on super-resolution reconstruction according to claim 1, wherein in the target detection module, a Single Shot MultiBox Detector (SSD) algorithm is used.
3. The system of claim 1, wherein in the dividing and fusing module, the step size for cropping the image data subjected to target detection is a value obtained based on edge detection.
4. The super resolution reconstruction based object detection system of claim 1, wherein the super resolution reconstruction module is a trained spatiotemporal sub-pixel convolution network comprising a motion estimation part and a super resolution part, wherein the spatiotemporal sub-pixel convolution network is trained by:
the Loss formula of the super-resolution part is as follows:

$$L_{SR}(\theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2$$

the Loss formula of the motion estimation part is as follows:

$$L_{ME}(\theta_\Delta) = \left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)$$

wherein the Huber term $\mathcal{H}(\partial_{x,y}\Delta)$ is approximated as

$$\mathcal{H}(\partial_{x,y}\Delta) \approx \sqrt{\epsilon + \textstyle\sum_{i,j}\left[(\partial_x\Delta_{i,j})^2 + (\partial_y\Delta_{i,j})^2\right]}, \quad \epsilon = 0.01$$

the total Loss formula of the spatio-temporal sub-pixel convolution network during end-to-end training is as follows:

$$L(\theta_\Delta, \theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2 + \sum_{i=\pm 1}\left[\left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)\right]$$

wherein $\theta_\Delta$ is the parameter set of the motion estimation part, $\theta$ is the parameter set of the super-resolution part, $I^{LR}$ represents an image frame, and $I^{LR\prime}$ represents the image frame that has undergone the warping process.
5. The system of claim 1, wherein a target loss function $L_{Det}$ is employed in the target detection module, the target loss function $L_{Det}$ being obtained by the following equation:

$$L_{Det}(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

wherein: $N$ is the number of default boxes matching a real box, $L_{loc}$ is the Smooth L1 loss function from Fast R-CNN, $L_{conf}$ is the Softmax loss, $c$ is the confidence of each class, and $\alpha$ is a weight term set to 1.
6. A target detection method based on super-resolution reconstruction is characterized by comprising the following steps:
a data acquisition step of acquiring image data to be detected;
a super-resolution reconstruction step of receiving the acquired image data and performing super-resolution reconstruction processing on the image data;
a target detection step of performing target detection on the image data subjected to the super-resolution reconstruction processing; and
a dividing and fusing step of cropping the image data subjected to target detection into a plurality of sub-image data and mapping the detection result of each piece of sub-image data into the combined image data for coordinate fusion, thereby obtaining a target detection result.
7. The target detection method based on super-resolution reconstruction according to claim 6, wherein in the target detection step, a Single Shot MultiBox Detector (SSD) algorithm is used.
8. The method of claim 6, wherein in the dividing and fusing step, the step size for cropping the image data subjected to target detection is a value obtained based on edge detection.
9. The super resolution reconstruction-based object detection method according to claim 8, wherein in the super resolution reconstruction step, a trained spatio-temporal sub-pixel convolution network is used, the spatio-temporal sub-pixel convolution network comprising a motion estimation part and a super resolution part,
wherein the spatiotemporal sub-pixel convolutional network is trained by:
the Loss formula of the super-resolution part is as follows:

$$L_{SR}(\theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2$$

the Loss formula of the motion estimation part is as follows:

$$L_{ME}(\theta_\Delta) = \left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)$$

wherein the Huber term $\mathcal{H}(\partial_{x,y}\Delta)$ is approximated as

$$\mathcal{H}(\partial_{x,y}\Delta) \approx \sqrt{\epsilon + \textstyle\sum_{i,j}\left[(\partial_x\Delta_{i,j})^2 + (\partial_y\Delta_{i,j})^2\right]}, \quad \epsilon = 0.01$$

the total Loss formula of the spatio-temporal sub-pixel convolution network during end-to-end training is as follows:

$$L(\theta_\Delta, \theta) = \left\| I_t^{HR} - f\left(I_{t-1:t+1}^{LR\prime};\, \theta\right) \right\|_2^2 + \sum_{i=\pm 1}\left[\left\| I_{t+i}^{LR\prime} - I_t^{LR} \right\|_2^2 + \lambda\,\mathcal{H}\left(\partial_{x,y}\Delta_{t+i}\right)\right]$$

wherein $\theta_\Delta$ is the parameter set of the motion estimation part, $\theta$ is the parameter set of the super-resolution part, $I^{LR}$ represents an image frame, and $I^{LR\prime}$ represents the image frame that has undergone the warping process.
10. The method of claim 6, wherein a target loss function $L_{Det}$ is employed in the target detection step, the target loss function being obtained by the following equation:

$$L_{Det}(x, c, l, g) = \frac{1}{N}\left(L_{conf}(x, c) + \alpha\,L_{loc}(x, l, g)\right)$$

wherein: $N$ is the number of default boxes matching a real box, $L_{loc}$ is the Smooth L1 loss function from Fast R-CNN, $L_{conf}$ is the Softmax loss, $c$ is the confidence of each class, and $\alpha$ is a weight term set to 1.
CN202010052220.4A 2020-01-17 2020-01-17 Target detection system and method based on super-resolution reconstruction Pending CN113139896A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202010052220.4A | 2020-01-17 | 2020-01-17 | Target detection system and method based on super-resolution reconstruction

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202010052220.4A | 2020-01-17 | 2020-01-17 | Target detection system and method based on super-resolution reconstruction

Publications (1)

Publication Number | Publication Date
CN113139896A | 2021-07-20

Family ID: 76808361

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN202010052220.4A | Target detection system and method based on super-resolution reconstruction | 2020-01-17 | 2020-01-17 | Pending

Country Status (1)

Country Link
CN (1) CN113139896A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420745A (en) * 2021-08-25 2021-09-21 江西中业智能科技有限公司 Image-based target identification method, system, storage medium and terminal equipment
CN115115611A (en) * 2022-07-21 2022-09-27 明觉科技(北京)有限公司 Vehicle damage identification method and device, electronic equipment and storage medium
WO2023060746A1 (en) * 2021-10-14 2023-04-20 中国科学院深圳先进技术研究院 Small image multi-object detection method based on super-resolution
WO2023123924A1 (en) * 2021-12-30 2023-07-06 深圳云天励飞技术股份有限公司 Target recognition method and apparatus, and electronic device and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination