CN113221925B - Target detection method and device based on multi-scale image - Google Patents


Info

Publication number
CN113221925B
Authority
CN
China
Prior art keywords
image
feature map
resolution
target detection
image reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110679907.5A
Other languages
Chinese (zh)
Other versions
CN113221925A (en
Inventor
单纯
王曦
宫英慧
周彦哲
李金泽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110679907.5A priority Critical patent/CN113221925B/en
Publication of CN113221925A publication Critical patent/CN113221925A/en
Application granted granted Critical
Publication of CN113221925B publication Critical patent/CN113221925B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformation in the plane of the image
    • G06T 3/40 - Scaling the whole image or part thereof
    • G06T 3/4053 - Super resolution, i.e. output image resolution higher than sensor resolution
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 - Image enhancement or restoration
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 - Target detection

Abstract

The invention provides a target detection method and device based on multi-scale images. The method comprises the steps of: inputting an original image to obtain candidate regions; acquiring an original feature map of each candidate region; comparing the original feature map of the candidate region against a preset resolution and, when it is below the preset resolution, inputting it into an image reconstruction network model for image enhancement; and inputting the enhanced image features together with the original feature map of the candidate region into YOLOv3 for target detection and classification. In the scheme of the invention, the output of a trained image reconstruction network is used to strengthen the target detection network's performance on low-resolution images, the detection of small targets is emphasized, and the detection effect is good.

Description

Target detection method and device based on multi-scale image
Technical Field
The invention relates to the technical field of computer vision, in particular to a target detection method and device based on multi-scale images.
Background
Matching and detection of image targets has always been an important problem in the field of computer vision. Target detection technology has a wide range of applications, so developing effective, accurate and widely applicable detection algorithms is particularly important. In the process of target detection, four classical errors are usually encountered: (a) classification errors, where the class is misidentified; (b) localization errors, where only part of the object body is localized; (c) recognition errors caused by occlusion; and (d) small-target errors, where the area occupied by the target is too small for its features to be recognized effectively, leading to misclassification.
In recent years, deep-learning-based image target detection algorithms have achieved breakthrough progress: detection with convolutional neural networks has greatly improved accuracy. For the above errors, many excellent algorithms have been proposed to optimize target detection and improve its accuracy and speed. Improvements to target detection algorithms mainly fall into the following categories: (1) improvements to the model infrastructure, i.e. the structure of the deep network, such as deepening the backbone network; (2) improvements to the features, where adding contextual information and multi-scale information to the features is currently a popular approach that improves the detection of small targets; and (3) improvements to data augmentation, which is the simplest and most effective way to improve model robustness and reduce overfitting. In addition, target detection algorithms are mainly improved at three stages: (1) the image processing stage, (2) the detection stage, and (3) the classification stage.
According to their structure, deep-learning-based target detection algorithms fall into two main categories: regression-based detection algorithms and region-proposal-based detection algorithms. Regression-based algorithms, such as YOLO, SSD, RetinaNet and RefineDet, obtain results through a single pass of regression and multi-class classification on features extracted by a backbone network. Region-proposal-based algorithms, such as R-CNN, SPP-Net, Fast R-CNN, Faster R-CNN, R-FCN and FPN, detect in two stages: the first stage performs coarse regression and classification on initial anchor boxes over the features extracted from the image to obtain proposal boxes; the second stage performs further regression and classification on these proposals to obtain the results. All network outputs then undergo post-processing operations such as non-maximum suppression and boundary clipping, and finally the resulting detection boxes are drawn on the original image to complete the detection.
However, for target scale variation both categories of algorithms rely entirely on varying the scales of the anchors, and therefore cannot handle scale variation in target detection well, especially the detection of small targets.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a target detection method and device based on multi-scale images, addressing the technical problem that target detection in the prior art cannot handle scale variation well, especially the detection of small targets.
According to a first aspect of the present invention, there is provided a method for multi-scale image-based object detection, the method comprising the steps of:
step S101: inputting an original image to obtain a candidate region;
step S102: acquiring an original feature map of the candidate region;
step S103: comparing the original feature map of the candidate region with a preset resolution and, if it is lower than the preset resolution, inputting it into an image reconstruction network model for image enhancement;
step S104: inputting the image features after image enhancement and the original feature map of the candidate region into YOLOv3 for target detection and classification.
According to a second aspect of the present invention, there is provided a multi-scale image-based object detection apparatus, the apparatus comprising:
a candidate region acquisition module: for inputting an original image to obtain a candidate region;
an original feature map acquisition module: for obtaining an original feature map of the candidate region;
an image enhancement module: for comparing the original feature map of the candidate region with a preset resolution and, if it is lower than the preset resolution, inputting it into an image reconstruction network model for image enhancement;
a target detection module: for inputting the image features after image enhancement and the original feature map of the candidate region into YOLOv3 for target detection and classification.
According to a third aspect of the present invention, there is provided a multi-scale image-based object detection system, comprising:
a processor for executing a plurality of instructions;
a memory to store a plurality of instructions;
wherein the instructions are stored by the memory, and loaded and executed by the processor to perform the multi-scale image-based object detection method as described above.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having a plurality of instructions stored therein; the instructions are used for loading and executing the multi-scale image-based target detection method by the processor.
According to the scheme of the invention, improved algorithms are proposed from the perspective of multi-scale features: multi-scale feature expression is realized through methods such as feature fusion and feature enhancement, and a new end-to-end network structure is proposed for this purpose: objects in low-resolution images are detected through the cooperative learning of two deep neural networks, an image reconstruction network (IRN) and a target detection network. First, the target detection network is trained; second, the image reconstruction network, assisted by the target detection network, enhances low-resolution images into high-resolution images; finally, the output of the trained image reconstruction network is used to strengthen the target detection performance of the target detection network on low-resolution images. The scheme of the invention emphasizes the detection of small targets, and low-resolution pictures can be detected by means of the IRN.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to make the technical solutions of the present invention practical in accordance with the contents of the specification, the following detailed description is given of preferred embodiments of the present invention with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention. In the drawings:
FIG. 1 is a flowchart of a multi-scale image-based target detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an overall structure of a multi-scale image-based target detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an overall structure of an image reconstruction network model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of image reconstruction for an image reconstruction network according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an upsampling process according to an embodiment of the present invention;
FIG. 6 is a schematic view of a downsampling according to one embodiment of the present invention;
fig. 7 is a block diagram of a multi-scale image-based object detection apparatus according to an embodiment of the present invention.
Detailed Description
First, a flow of a multi-scale image-based target detection method according to an embodiment of the present invention is described with reference to fig. 1. As shown in fig. 1-2, the method comprises the steps of:
step S101: inputting an original image to obtain a candidate region;
step S102: acquiring an original feature map of the candidate region;
step S103: comparing the original feature map of the candidate region with a preset resolution and, if it is lower than the preset resolution, inputting it into an image reconstruction network model for image enhancement;
step S104: inputting the image features after image enhancement and the original feature map of the candidate region into YOLOv3 for target detection and classification.
The step S101: an original image is input to obtain a candidate region; in the present embodiment, the candidate region is obtained by a region proposal network (RPN).
The step S102: obtaining the original feature map of the candidate region; in this embodiment, the original feature map of the candidate region is obtained by RoI pooling (RoIPooling).
The step S103: comparing the original feature map of the candidate region with a preset resolution and, if it is lower than the preset resolution, inputting it into an image reconstruction network model for image enhancement, wherein:
original feature maps of candidate regions at or above the preset resolution are left unprocessed.
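Steps S101 to S103 amount to routing candidate-region feature maps by resolution; the routing can be sketched as follows (a minimal sketch: the threshold value, array shapes and function names are illustrative assumptions, not values from the patent):

```python
import numpy as np

PRESET_RESOLUTION = (32, 32)  # assumed (H, W) threshold; the patent leaves it unspecified

def route_feature_maps(feature_maps):
    """Split candidate-region feature maps into those sent to the image
    reconstruction network (below the preset resolution) and those passed
    through unchanged to the detector."""
    to_enhance, passthrough = [], []
    for fm in feature_maps:
        h, w = fm.shape[-2], fm.shape[-1]
        if h < PRESET_RESOLUTION[0] or w < PRESET_RESOLUTION[1]:
            to_enhance.append(fm)    # goes to the image reconstruction network
        else:
            passthrough.append(fm)   # kept as-is for YOLOv3
    return to_enhance, passthrough

low = np.zeros((3, 16, 16))   # below the threshold -> enhance
high = np.zeros((3, 64, 64))  # at/above the threshold -> pass through
enh, keep = route_feature_maps([low, high])
```

Both lists are then fed to the detector in step S104, the first after enhancement by the IRN.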
In this embodiment, as shown in figs. 3 to 4, the image reconstruction network model comprises an image reconstruction network (IRN) and a target detection network. The input of the image reconstruction network is an image whose resolution is lower than the preset resolution, its output is a reconstructed image RLR, and the pixel size of the reconstructed image RLR is the same as that of the feature map HR obtained through the up-sampling operation of the image reconstruction network. With the reconstructed image RLR as the input of the target detection network, a loss is calculated based on the reconstructed image RLR and the feature map HR, and the parameters of the image reconstruction network are adjusted accordingly.
The image reconstruction network comprises a plurality of convolution layers and a plurality of branches of different levels. An input original feature map below the preset resolution passes through the convolution operations of the plurality of convolution layers, and the resulting feature vector is input into the lowest-level branch. Each branch comprises a plurality of sampling blocks, each comprising an up-sampling block and a down-sampling block; through the sampling blocks, the transmitted features are enhanced at a certain ratio during the forward propagation of each branch. For each branch of the plurality of branches: each sampling block transmits the branch's up-sampled features to the corresponding sampling block in the branches of higher level than itself, and transmits the branch's down-sampled features to the corresponding sampling block in the branches of lower level than itself.
In this embodiment, the upsampling operation of the upsampling block and the downsampling operation of the downsampling block may be performed concurrently.
The image reconstruction network adopting the structure has the advantages that:
(1) The overall architecture starts from the low-resolution feature map as the first stage and gradually adds low-to-high-resolution operations to form more stages, connecting subnetworks of different resolutions in parallel.
(2) Multi-scale fusion is performed: the high-resolution representation is boosted with the help of low-resolution representations of the same depth and similar level, i.e. each subnetwork repeatedly receives information from the other parallel subnetworks.
For 4x upscaling, a total of three branches are provided in this embodiment, and the size of the feature map remains unchanged during the forward propagation of each branch. The three branches differ, but information is exchanged between them. For example, in the forward pass, the lowest branch in the figure, branch 1, expands its feature map through an up-sampling block comprising three units (as shown in fig. 5) and passes it to branches 2 and 3, while branch 2 also passes its feature map through a down-sampling block (as shown in fig. 6) and sends the reduced feature map to branch 1. In this embodiment, the up-sampling operations of the up-sampling blocks and the down-sampling operations of the down-sampling blocks may be performed at the same stage.
In this embodiment, feature maps below the threshold are input into the network for image reconstruction: although three different branches are provided, each branch performs up-sampling and down-sampling in parallel to obtain target features at different pixel scales, while convolution provides automatic denoising and feature enhancement; the communication between branches takes the form of feature fusion, finally yielding a single fused feature map. The loss is calculated using the obtained enhanced low-resolution image (RLR) and the target detection result of the up-sampled feature map (HR).
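A minimal sketch of this repeated cross-branch exchange, with nearest-neighbour resizing and average pooling standing in for the learned up- and down-sampling blocks (the branch count, the 2x scale ratio between adjacent branches and all names are illustrative assumptions):

```python
import numpy as np

def up2(x):
    # nearest-neighbour 2x upsampling (placeholder for an up-sampling block)
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def down2(x):
    # 2x2 average pooling (placeholder for a down-sampling block)
    h, w = x.shape[-2] // 2, x.shape[-1] // 2
    return x[..., :2 * h, :2 * w].reshape(h, 2, w, 2).mean(axis=(-3, -1))

def exchange(branches):
    """One fusion step: every branch receives resized features from every
    other branch, so each subnetwork repeatedly accepts information from
    its parallel subnetworks."""
    fused = []
    for i, b in enumerate(branches):
        acc = b.copy()
        for j, other in enumerate(branches):
            if j == i:
                continue
            x = other
            while x.shape[-1] < b.shape[-1]:   # lower-level branch -> upsample
                x = up2(x)
            while x.shape[-1] > b.shape[-1]:   # higher-level branch -> downsample
                x = down2(x)
            acc = acc + x                      # feature fusion by pixel-wise sum
        fused.append(acc)
    return fused

b1, b2, b3 = np.ones((8, 8)), np.ones((16, 16)), np.ones((32, 32))
f1, f2, f3 = exchange([b1, b2, b3])  # each branch keeps its own resolution
```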
As shown in fig. 5, the up-sampling block consists of a first sub-pixel convolution unit, a first convolution unit and a second sub-pixel convolution unit. The low-resolution feature map L0 passes through the sub-pixel convolution of the first sub-pixel convolution unit to generate a high-resolution feature map H0; the high-resolution feature map H0 is converted back into a low-resolution feature map L1 by the convolution operation of the first convolution unit; L1 is subtracted from L0 pixel by pixel to obtain the difference between the low-resolution feature maps; the low-resolution feature map L1 passes through the sub-pixel convolution of the second sub-pixel convolution unit to generate a high-resolution feature map H1; and the two high-resolution feature maps H0 and H1 are added pixel by pixel to output the high-resolution feature map HR.
As shown in fig. 6, the down-sampling block consists of a first convolution unit, a first sub-pixel convolution unit and a second convolution unit. The high-resolution feature map H0' is convolved by the first convolution unit to generate a low-resolution feature map L0'; the low-resolution feature map L0' is converted back into a high-resolution feature map H1' by the sub-pixel convolution of the first sub-pixel convolution unit; H0' and H1' are added pixel by pixel for fusion; the high-resolution feature map H1' generates a low-resolution feature map L1' through the convolution of the second convolution unit; and the two low-resolution feature maps L0' and L1' are subtracted pixel by pixel to output the low-resolution feature map LR.
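The dataflow of the two sampling blocks can be sketched in NumPy, with parameter-free pixel shuffles standing in for the learned sub-pixel convolution and convolution units (an illustrative assumption: with these placeholders the subtraction branches are exactly zero, whereas learned units would yield non-trivial residuals):

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Sub-pixel rearrangement: (C*r*r, H, W) -> (C, H*r, W*r)."""
    c, h, w = x.shape
    C = c // (r * r)
    return x.reshape(C, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(C, h * r, w * r)

def pixel_unshuffle(x, r=2):
    """Inverse rearrangement: (C, H*r, W*r) -> (C*r*r, H, W)."""
    C, H, W = x.shape
    h, w = H // r, W // r
    return x.reshape(C, h, r, w, r).transpose(0, 2, 4, 1, 3).reshape(C * r * r, h, w)

def up_block(L0, r=2):
    """Up-sampling block of fig. 5, with placeholder units."""
    H0 = pixel_shuffle(L0, r)     # first sub-pixel convolution unit
    L1 = pixel_unshuffle(H0, r)   # first convolution unit, back to low resolution
    diff = L0 - L1                # pixel-wise LR difference described in fig. 5
    H1 = pixel_shuffle(L1, r)     # second sub-pixel convolution unit
    return H0 + H1, diff          # pixel-wise sum -> HR

def down_block(H0, r=2):
    """Down-sampling block of fig. 6, with placeholder units."""
    L0 = pixel_unshuffle(H0, r)   # first convolution unit
    H1 = pixel_shuffle(L0, r)     # first sub-pixel convolution unit
    fused = H0 + H1               # pixel-wise HR fusion described in fig. 6
    L1 = pixel_unshuffle(H1, r)   # second convolution unit
    return L0 - L1, fused         # pixel-wise difference -> LR

lr = np.arange(64, dtype=float).reshape(4, 4, 4)
hr, lr_diff = up_block(lr)        # hr: (1, 8, 8)
out_lr, hr_fused = down_block(hr) # out_lr: (4, 4, 4)
```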
The loss function of the Image Reconstruction Network (IRN) is:
the task of IRN is to reconstruct high resolution images from low resolution images, and it is important to design an appropriate loss function in order to obtain the desired enhancement effect. Since our ultimate goal is to improve the accuracy of object detection, we wish to focus on the information related to the object to reconstruct a high resolution image. Based on the typical reconstruction loss in super-resolution, we add three auxiliary loss functions that play a secondary role in reconstructing the image.
RLoss is the error between the RLR image generated by the image reconstruction network and the HR output by the upsampling module.
Eloss-the edges between RLR and HR were extracted separately using the classical Sobel operator and then the average of the pixel differences was calculated.
Ploss extracts perceptual features from Frozen layers in the object recognition network, respectively, and then calculates the perceptual loss using the euclidean distance between the two extracted feature vectors.
4. Total loss of image reconstruction network:
Total Loss=w 1 RLoss+w 2 ELoss+w 3 PLoss
wherein w 1 ,w 2 ,w 3 The weight coefficient is specifically set according to experiments.
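The three loss terms can be sketched in NumPy as follows (a single-channel sketch: the Sobel edge loss and Euclidean perceptual loss follow the description above, while the default weights and the mean-squared form of RLoss are illustrative assumptions):

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, k):
    # naive 3x3 'valid' convolution, enough for a single-channel sketch
    h, w = img.shape[0] - 2, img.shape[1] - 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return out

def sobel_edges(img):
    gx, gy = conv2d_valid(img, SOBEL_X), conv2d_valid(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def r_loss(rlr, hr):            # RLoss: pixel-wise reconstruction error
    return float(np.mean((rlr - hr) ** 2))

def e_loss(rlr, hr):            # ELoss: mean difference of Sobel edge maps
    return float(np.mean(np.abs(sobel_edges(rlr) - sobel_edges(hr))))

def p_loss(feat_rlr, feat_hr):  # PLoss: Euclidean distance of perceptual features
    return float(np.linalg.norm(feat_rlr - feat_hr))

def total_loss(rlr, hr, feat_rlr, feat_hr, w1=1.0, w2=0.1, w3=0.1):
    # w1, w2, w3 are set experimentally; these defaults are purely illustrative
    return w1 * r_loss(rlr, hr) + w2 * e_loss(rlr, hr) + w3 * p_loss(feat_rlr, feat_hr)

img, feat = np.ones((8, 8)), np.zeros(4)
same = total_loss(img, img, feat, feat)           # identical inputs -> 0.0
diff = total_loss(img, np.zeros((8, 8)), feat, feat)
```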
As shown in fig. 3, the training process of the image reconstruction network model includes:
step S301: training the target detection network with the HR images output by the up-sampling module, keeping the parameters of certain layers of YOLOv3 unchanged;
in this embodiment, the parameters of some layers of YOLOv3 are kept unchanged to retain the general feature-extraction capability of the target detection network, and this network is used to guide the image reconstruction network (IRN).
Step S302: fixing the parameters of the target detection network, and training the image reconstruction network (IRN) in a supervised manner using the training samples and the target detection network;
in this embodiment, the training samples are images whose resolution is below the preset threshold, taken as low-resolution images (LR). The output of the image reconstruction network is a reconstructed image (RLR) with the same pixel size as HR; the reconstruction loss and the edge loss are calculated from the difference between the RLR and HR images, and the total detection loss is calculated by the target detection network with the RLR image as input. By using the reconstruction loss, the image reconstruction network (IRN) focuses on information useful for target detection when reconstructing images. The total loss of the image reconstruction network is Total Loss = w1·RLoss + w2·ELoss + w3·PLoss.
Step S303: training the target detection network using reconstructed images (RLR) generated by the Image Reconstruction Network (IRN), at which stage the parameters of the Image Reconstruction Network (IRN) are fixed and the parameters of the target detection network are trained.
In this embodiment, none of the layers of the target detection network are frozen, so as to strengthen its target detection capability. After training is complete, the whole pipeline can be applied to new LR images: an LR image is input into the IRN to generate a reconstructed image, which is then fed into the target detection network, and the final result is predicted by the target detection network.
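The three training stages can be summarised in a small framework-free sketch (all names, layer counts and the 'frozen' bookkeeping are illustrative assumptions; in practice the flags would control gradient updates of the real IRN and YOLOv3 parameters):

```python
def make_net(name, n_layers):
    # a network is modelled only as a named list of per-layer 'frozen' flags
    return {"name": name, "frozen": [False] * n_layers}

def freeze(net, layers=None):
    idx = range(len(net["frozen"])) if layers is None else layers
    for i in idx:
        net["frozen"][i] = True

def unfreeze(net):
    net["frozen"] = [False] * len(net["frozen"])

def training_schedule(irn, detector, backbone_layers):
    log = []
    # Stage 1 (S301): train the detector on HR images; keep some YOLOv3 layers fixed
    unfreeze(detector)
    freeze(detector, backbone_layers)
    log.append("stage1: train detector, backbone partially frozen")
    # Stage 2 (S302): fix the detector, train the IRN under its guidance
    freeze(detector)
    unfreeze(irn)
    log.append("stage2: train IRN, detector frozen")
    # Stage 3 (S303): fix the IRN, fine-tune the full detector on RLR images
    freeze(irn)
    unfreeze(detector)
    log.append("stage3: train detector on RLR, IRN frozen")
    return log

irn = make_net("IRN", 4)
det = make_net("YOLOv3", 6)
history = training_schedule(irn, det, backbone_layers=[0, 1])
```

After the schedule ends, the IRN is fully frozen and the detector is fully trainable, matching the final state described above.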
An embodiment of the present invention further provides a target detection apparatus based on a multi-scale image, and as shown in fig. 7, the apparatus includes:
a candidate region acquisition module: for inputting an original image to obtain a candidate region;
an original feature map acquisition module: for obtaining an original feature map of the candidate region;
an image enhancement module: for comparing the original feature map of the candidate region with a preset resolution and, if it is lower than the preset resolution, inputting it into an image reconstruction network model for image enhancement;
a target detection module: for inputting the image features after image enhancement and the original feature map of the candidate region into YOLOv3 for target detection and classification.
The embodiment of the invention further provides a target detection system based on multi-scale images, which comprises the following steps:
a processor for executing a plurality of instructions;
a memory for storing a plurality of instructions;
wherein the instructions are stored by the memory, and loaded and executed by the processor to perform the multi-scale image-based object detection method as described above.
Embodiments of the present invention further provide a computer-readable storage medium having a plurality of instructions stored therein; the instructions are used for loading and executing the multi-scale image-based target detection method by the processor.
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In the several embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a physical server, a network cloud server, etc., running for example a Ubuntu operating system) to perform some steps of the method according to various embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent change and modification made to the above embodiment according to the technical spirit of the present invention are still within the scope of the technical solution of the present invention.

Claims (8)

1. A target detection method based on multi-scale images is characterized by comprising the following steps:
step S101: inputting an original image to obtain a candidate region;
step S102: acquiring an original feature map of the candidate region;
step S103: comparing the original feature map of the candidate region with a preset resolution and, if it is lower than the preset resolution, inputting it into an image reconstruction network model for image enhancement;
step S104: inputting the image features after image enhancement and the original feature map of the candidate region into YOLOv3 for target detection and classification;
the image reconstruction network model comprises an image reconstruction network and a target detection network, wherein the input of the image reconstruction network is an image whose resolution is lower than the preset resolution, the output of the image reconstruction network is a reconstructed image RLR, and the pixel size of the reconstructed image RLR is the same as that of the feature map HR obtained through the up-sampling operation of the image reconstruction network; with the reconstructed image RLR as the input of the target detection network, a loss is calculated based on the reconstructed image RLR and the feature map HR, so as to adjust the parameters of the image reconstruction network;
the image reconstruction network comprises a plurality of convolution layers and a plurality of branches of different levels; an input original feature map below the preset resolution passes through the convolution operations of the plurality of convolution layers, and the resulting feature vector is input into the lowest-level branch; each branch comprises a plurality of sampling blocks, each comprising an up-sampling block and a down-sampling block, and through the sampling blocks the transmitted features are enhanced at a certain ratio during the forward propagation of each branch; for each branch of the plurality of branches: each sampling block transmits the branch's up-sampled features to the corresponding sampling block in the branches of higher level than itself; and each sampling block transmits the branch's down-sampled features to the corresponding sampling block in the branches of lower level than itself.
2. The multi-scale image-based target detection method of claim 1, wherein the up-sampling block consists of a first sub-pixel convolution unit, a first convolution unit and a second sub-pixel convolution unit; the low-resolution feature map L0 passes through the sub-pixel convolution of the first sub-pixel convolution unit to generate a high-resolution feature map H0; the high-resolution feature map H0 is converted into a low-resolution feature map L1 by the convolution operation of the first convolution unit; L1 is subtracted from L0 pixel by pixel to obtain the difference between the low-resolution feature maps; the low-resolution feature map L1 generates a high-resolution feature map H1 through the sub-pixel convolution of the second sub-pixel convolution unit; and the two high-resolution feature maps H0 and H1 are added pixel by pixel to output the feature map HR.
3. The multi-scale image based object detection method of claim 2, wherein the down-sampling block consists of a first convolution unit, a first sub-pixel convolution unit and a second convolution unit; a high-resolution feature map H0' is convolved by the first convolution unit to generate a low-resolution feature map L0'; the low-resolution feature map L0' is converted into a high-resolution feature map H1' by the sub-pixel convolution of the first sub-pixel convolution unit; H0' and H1' are added pixel by pixel for fusion; the high-resolution feature map H1' is convolved by the second convolution unit to generate a low-resolution feature map L1', and the two low-resolution feature maps L0' and L1' are subtracted pixel by pixel to output a feature map LR.
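The projection blocks of claims 2 and 3 can be sketched as follows. This is an interpretation, not the claimed implementation: sub-pixel convolution is approximated by channel replication plus a pixel-shuffle rearrangement (which collapses to nearest-neighbour up-sampling), the convolution units by block averaging, and the wording of the claims is read in the style of back-projection networks, where the low-resolution residual feeds the second sub-pixel unit and the fused high-resolution map feeds the second convolution unit:

```python
import numpy as np

def subpixel_up(x, r=2):
    """Sub-pixel convolution stand-in: replicate channels, then pixel-shuffle
    (C, H, W) -> (C, H*r, W*r).  With identical channel copies this reduces
    to nearest-neighbour up-sampling."""
    c, h, w = x.shape
    y = np.repeat(x, r * r, axis=0).reshape(c, r, r, h, w)
    y = y.transpose(0, 3, 1, 4, 2)        # (c, h, r, w, r)
    return y.reshape(c, h * r, w * r)

def conv_down(x, r=2):
    """Strided-convolution stand-in: block averaging."""
    c, h, w = x.shape
    return x.reshape(c, h // r, r, w // r, r).mean(axis=(2, 4))

def up_block(L0, r=2):
    """Up-sampling block of claim 2 (residual reading)."""
    H0 = subpixel_up(L0, r)     # L0 -> H0 (first sub-pixel convolution unit)
    L1 = conv_down(H0, r)       # H0 -> L1 (first convolution unit)
    e = L0 - L1                 # pixel-wise low-resolution difference
    H1 = subpixel_up(e, r)      # residual -> H1 (second sub-pixel convolution unit)
    return H0 + H1              # pixel-wise fusion, output feature map HR

def down_block(H0, r=2):
    """Down-sampling block of claim 3 (fused map feeds the second conv unit)."""
    L0 = conv_down(H0, r)       # H0' -> L0' (first convolution unit)
    H1 = subpixel_up(L0, r)     # L0' -> H1' (first sub-pixel convolution unit)
    F = H0 + H1                 # pixel-wise high-resolution fusion
    L1 = conv_down(F, r)        # fused map -> L1' (second convolution unit)
    return L0 - L1              # pixel-wise difference, output feature map LR
```

Note that with these fixed toy operators the up-block residual vanishes (block averaging exactly inverts nearest-neighbour up-sampling); learned convolutions would not collapse this way.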
4. The multi-scale image based object detection method according to claim 3, wherein the loss function of the image reconstruction network (IRN) is:

Total Loss = w1·RLoss + w2·ELoss + w3·PLoss

wherein w1, w2 and w3 are weight coefficients; RLoss is the error between the RLR image generated by the image reconstruction network and HR; ELoss is obtained by extracting the edges of RLR and HR separately with the Sobel operator and then computing the mean of the pixel-wise differences; PLoss is the perceptual loss, computed by extracting perceptual features of RLR and HR from the frozen layers of the object recognition network and then taking the Euclidean distance between the two extracted feature vectors.
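A minimal NumPy sketch of this three-term loss, under stated assumptions: RLoss is taken as mean-squared error (the claim only says "error"), the Sobel magnitudes are compared by mean absolute difference, and `feat_rlr`/`feat_hr` stand in for the perceptual features that would in practice come from frozen layers of the detection network:

```python
import numpy as np

SOBEL_X = np.array([[-1., 0., 1.],
                    [-2., 0., 2.],
                    [-1., 0., 1.]])
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, k):
    """Minimal 'valid'-mode 2-D correlation used to apply the Sobel kernels."""
    h, w = img.shape
    kh, kw = k.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

def edge_map(img):
    """Sobel gradient magnitude, as used for ELoss."""
    gx = conv2d_valid(img, SOBEL_X)
    gy = conv2d_valid(img, SOBEL_Y)
    return np.sqrt(gx ** 2 + gy ** 2)

def total_loss(rlr, hr, feat_rlr, feat_hr, w=(1.0, 1.0, 1.0)):
    rloss = np.mean((rlr - hr) ** 2)                       # RLoss (MSE assumed)
    eloss = np.mean(np.abs(edge_map(rlr) - edge_map(hr)))  # ELoss: Sobel edge difference
    ploss = np.linalg.norm(feat_rlr - feat_hr)             # PLoss: Euclidean feature distance
    return w[0] * rloss + w[1] * eloss + w[2] * ploss
```

Because the Sobel kernels sum to zero, a constant brightness offset between RLR and HR changes RLoss but leaves ELoss untouched, which is what makes the edge term a complementary signal.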
5. The multi-scale image based target detection method of claim 4, wherein the training process of the image reconstruction network model is as follows:
step S301: training the target detection network with the HR images output by the up-sampling module, while keeping the parameters of certain layers of YOLOv3 unchanged;
step S302: fixing the parameters of the target detection network, and training the image reconstruction network in a supervised manner using training samples and the target detection network;
step S303: training the target detection network with the reconstructed images RLR generated by the image reconstruction network; at this stage, the parameters of the image reconstruction network are fixed and the parameters of the target detection network are trained.
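The alternating freeze/train schedule of steps S301-S303 can be summarised in a small lookup, where "irn" and "detector" are illustrative names for the two parameter groups (the partial freezing of individual YOLOv3 layers in S301 is not modelled by this coarse sketch):

```python
def trainable_groups(stage):
    """Return which parameter groups are updated at each training stage.

    Stage 1 (S301): train the detector on HR images (some YOLOv3 layers
    additionally stay frozen).  Stage 2 (S302): fix the detector, train the
    image reconstruction network (IRN).  Stage 3 (S303): fix the IRN and
    fine-tune the detector on the reconstructed RLR images.
    """
    schedule = {
        1: {"irn": False, "detector": True},
        2: {"irn": True,  "detector": False},
        3: {"irn": False, "detector": True},
    }
    return schedule[stage]
```

In a framework such as PyTorch, each stage would translate to toggling `requires_grad` on the corresponding parameter group before resuming optimisation.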
6. An apparatus for object detection based on multi-scale images, the apparatus comprising:
a candidate region acquisition module, configured to input an original image and obtain candidate regions;
an original feature map acquisition module, configured to obtain an original feature map of each candidate region;
an image enhancement module, configured to compare the original feature map of the candidate region with a preset resolution, and to input an original feature map whose resolution is lower than the preset resolution into the image reconstruction network model for image enhancement;
a target detection module, configured to input the image features after image enhancement, together with the original feature map of the candidate region, into YOLOv3 for target detection and classification;
the image reconstruction network model comprises an image reconstruction network and a target detection network, wherein the input of the image reconstruction network is an image whose resolution is lower than the preset resolution, and its output is a reconstructed image RLR whose pixel size is the same as that of the HR image output by the image reconstruction network; the reconstructed image RLR is taken as the input of the target detection network, and a loss is calculated from the reconstructed image RLR and the feature map HR obtained through the up-sampling operation of the image reconstruction network, so as to adjust the parameters of the image reconstruction network;
the image reconstruction network comprises a plurality of convolution layers and a plurality of branches at different levels; an input original feature map whose resolution is lower than the preset resolution is passed through the convolution operations of the plurality of convolution layers to obtain a feature vector, which is input to the branch at the lowest level; each branch comprises a plurality of sampling blocks, each sampling block comprising an up-sampling block and a down-sampling block, and through the sampling blocks the features transmitted along each branch are enhanced at a certain scale during forward propagation; for each branch of the plurality of branches: each sampling block transmits the up-sampled features of the branch to the corresponding sampling block in each branch at a higher level than itself, and transmits the down-sampled features of the branch to the corresponding sampling block in each branch at a lower level than itself.
7. A multi-scale image based object detection system, comprising:
a processor for executing a plurality of instructions;
a memory for storing a plurality of instructions;
wherein the plurality of instructions are stored by the memory, and are loaded and executed by the processor to perform the multi-scale image based object detection method according to any one of claims 1-5.
8. A computer-readable storage medium having a plurality of instructions stored therein, the plurality of instructions being loaded and executed by a processor to perform the multi-scale image based object detection method according to any one of claims 1-5.
CN202110679907.5A 2021-06-18 2021-06-18 Target detection method and device based on multi-scale image Active CN113221925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110679907.5A CN113221925B (en) 2021-06-18 2021-06-18 Target detection method and device based on multi-scale image


Publications (2)

Publication Number Publication Date
CN113221925A CN113221925A (en) 2021-08-06
CN113221925B true CN113221925B (en) 2022-11-11

Family

ID=77080572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110679907.5A Active CN113221925B (en) 2021-06-18 2021-06-18 Target detection method and device based on multi-scale image

Country Status (1)

Country Link
CN (1) CN113221925B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI779784B (en) * 2021-08-19 2022-10-01 中華電信股份有限公司 Feature analysis system, method and computer readable medium thereof
CN115601357B (en) * 2022-11-29 2023-05-26 南京航空航天大学 Stamping part surface defect detection method based on small sample
CN115937794B (en) * 2023-03-08 2023-08-15 成都须弥云图建筑设计有限公司 Small target object detection method and device, electronic equipment and storage medium
CN117197756B (en) * 2023-11-03 2024-02-27 深圳金三立视频科技股份有限公司 Hidden danger area intrusion detection method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110443172A (en) * 2019-07-25 2019-11-12 北京科技大学 A kind of object detection method and system based on super-resolution and model compression
CN111626208A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Method and apparatus for detecting small targets
CN112597887A (en) * 2020-12-22 2021-04-02 深圳集智数字科技有限公司 Target identification method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109345449B (en) * 2018-07-17 2020-11-10 西安交通大学 Image super-resolution and non-uniform blur removing method based on fusion network
CN112446826A (en) * 2019-09-03 2021-03-05 联咏科技股份有限公司 Method and device for image super-resolution, image enhancement and model training


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Mixed YOLOv3-LITE: A Lightweight Real-Time Object Detection Method; Haipeng Zhao; Sensors; 2020-03-27; pp. 1-18 *
Real Time Object Detection Using YOLOv3; Omkar Masurekar et al.; International Research Journal of Engineering and Technology (IRJET); 2020-03-31; vol. 7, no. 3; pp. 3764-3768 *
An Improved YOLOv3 Method for Dynamic Small-Target Detection; Cui Yanpeng et al.; Journal of Xidian University; 2020-06-30; vol. 47, no. 3; pp. 1-7 *
Ship Target Detection in High-Resolution Remote Sensing Images Based on a Feature Pyramid Model; Zhou Hui et al.; Journal of Dalian Maritime University; 2019-11-15; vol. 45, no. 4; pp. 131-138 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant