CN111091151B - Construction method of a generative adversarial network for target detection data enhancement - Google Patents

Construction method of a generative adversarial network for target detection data enhancement

Info

Publication number
CN111091151B
CN111091151B
Authority
CN
China
Prior art keywords
image
objects
generator
target detection
adversarial network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911301874.XA
Other languages
Chinese (zh)
Other versions
CN111091151A (en)
Inventor
Wang Zhihui (王智慧)
Li Haojie (李豪杰)
Liu Chongwei (刘崇威)
Wang Shijie (王世杰)
Tang Tao (唐涛)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN201911301874.XA priority Critical patent/CN111091151B/en
Publication of CN111091151A publication Critical patent/CN111091151A/en
Application granted granted Critical
Publication of CN111091151B publication Critical patent/CN111091151B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/251Fusion techniques of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the field of computer image generation and provides a method for constructing a generative adversarial network for target detection data enhancement. The method fuses Poisson fusion from traditional digital image processing with the generator of a generative adversarial network, so that the network can change the size, number, and position of detected objects in a picture. We also specifically design a loss function for the generator so that it produces better pictures. The method effectively alleviates the class-imbalance problem in target detection tasks, allowing the trained detection model to achieve better performance. At the same time, automatically amplifying a small dataset into a large one saves a great deal of manual labeling cost.

Description

Construction method of a generative adversarial network for target detection data enhancement
Technical Field
The invention belongs to the field of computer image generation and relates to a method for constructing a generative adversarial network for target detection data enhancement.
Background
Data enhancement refers to adding more variation to the training data to improve the generalization ability of the trained model. Data enhancement strategies such as flipping and scaling are widely applied in training CNNs. In recent years, generative adversarial networks (GANs) have performed excellently in many image-to-image tasks (Orest Kupyn, Volodymyr Budzan, Mykola Mykhailych, Dmytro Mishkin, and Jiri Matas. DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks. arXiv:1711.07064, Nov 2017.), and there has been work applying them to data enhancement. AugGAN (Sheng-Wei Huang, Che-Tsung Lin, Shu-Ping Chen, Yen-Yi Wu, Po-Hao Hsu, and Shang-Hong Lai. AugGAN: Cross domain adaptation with GAN-based data augmentation. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision - ECCV 2018, pages 731-744, Cham, 2018. Springer International Publishing.) has structure-aware semantic segmentation and soft weight sharing, so its generated images are realistic enough for training. However, the ground truth used by this method contains instance masks, which are inconvenient to label. CycleGAN (Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2017.) requires no paired training data, and a number of efforts have used CycleGAN for data enhancement (Weijian Deng, Liang Zheng, Guoliang Kang, Yi Yang, Qixiang Ye, and Jianbin Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. arXiv:1711.07027, 2017.). However, the drawbacks of CycleGAN cannot be ignored: it tends to overfit when generating images, which affects the accuracy of the model.
Moreover, existing GAN-based data enhancement methods realize data enhancement only by transforming the style of the image. While helpful for image classification tasks, this approach does not work well for target detection, because it cannot change the number, size, or location of objects in the image.
Disclosure of Invention
The invention aims to provide a construction method (Poisson GAN) for a generative adversarial network for target detection data enhancement, which fuses Poisson fusion from traditional digital image processing with the generator of a generative adversarial network so that the size, number, and position of detected objects in a picture can be changed. We also specifically design a loss function for the generator so that it generates better pictures. The method effectively alleviates the class-imbalance problem in target detection tasks, allowing the trained detection model to achieve better performance.
The technical scheme of the invention is as follows:
a method of constructing a spanning confrontation network for target detection data augmentation comprising the steps of:
1) Construct the Poisson fusion part of the generator; the flow is shown in FIG. 1 (left). We embed Poisson fusion into the generator to alter the number, position, or size of objects when generating a picture. Assuming 3 classes of objects, we select X, Y, and Z objects of each class from the original dataset and build an object set P. Each time a picture is generated, x, y, and z objects are randomly selected from the set P to form a subset P_a:

P_a = {A_1, ..., A_x, B_1, ..., B_y, C_1, ..., C_z} (1)

wherein A, B, and C respectively denote the categories in the original dataset. P_a is then embedded into a temporary image T ∈ R^(3×720×405) to generate a source image S ∈ R^(3×720×405). To eliminate sharp boundaries, a mask M (each instance having its own mask) is automatically created from the embedding positions in S and then combined with the background image B ∈ R^(3×720×405) and the source image S to obtain a clone image C.
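As an illustration of this step, the sketch below performs the embedding with OpenCV's Poisson blending (cv2.seamlessClone). The helper name embed_objects, the random placement policy, and the 720×405 canvas handling are our assumptions; the patent does not specify an implementation.

```python
# Sketch of the Poisson-fusion step using OpenCV's Poisson blending
# (cv2.seamlessClone). Names and placement policy are assumptions.
import random
import cv2
import numpy as np

def embed_objects(background, object_crops, canvas_hw=(405, 720)):
    """background:   HxWx3 uint8 background image B
    object_crops: list of HxWx3 uint8 crops (the randomly chosen subset P_a)
    Returns the clone image C; assumes crops are much smaller than the canvas."""
    clone = cv2.resize(background, canvas_hw[::-1])  # cv2.resize takes (w, h)
    for crop in object_crops:
        ch, cw = crop.shape[:2]
        # Random embedding position, kept fully inside the canvas.
        cx = random.randint(cw // 2 + 1, canvas_hw[1] - cw // 2 - 1)
        cy = random.randint(ch // 2 + 1, canvas_hw[0] - ch // 2 - 1)
        mask = np.full((ch, cw), 255, dtype=np.uint8)  # each instance gets its own mask
        clone = cv2.seamlessClone(crop, clone, mask, (cx, cy), cv2.NORMAL_CLONE)
    return clone
```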
2) Construct the network learning part of the generator; the structure is shown in FIG. 1 (right). We build our network on the basis of the work of Ci et al. (Yuanzheng Ci, Xinzhu Ma, Zhihui Wang, Haojie Li, and Zhongxuan Luo. User-guided deep anime line art colorization with conditional adversarial networks. 2018 ACM Multimedia Conference on Multimedia Conference - MM '18, 2018). Using U-Net as the backbone structure, we build the encoder from a stack of 3 x 3 convolutional layers. The decoder is constructed from 4 ResNeXt blocks, denoted block n, n ∈ {1, ..., 4}. In the experiments, we set block n to [20, 10, 10, 5].
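A PyTorch skeleton consistent with this description is sketched below. The patent fixes only the 3 x 3 encoder convolutions, the U-Net backbone, and the four-block ResNeXt decoder with depths [20, 10, 10, 5]; the channel widths, cardinality, and internal block layout here are assumptions.

```python
# Skeleton of the generator: U-Net encoder of 3x3 convolutions and a decoder of
# 4 ResNeXt blocks with depths [20, 10, 10, 5]. Widths/cardinality are assumed.
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    def __init__(self, channels, cardinality=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=cardinality),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual connection

class Generator(nn.Module):
    def __init__(self, depths=(20, 10, 10, 5), width=64):
        super().__init__()
        # Encoder: stacked 3x3 convolutions (U-Net style, one downsampling here).
        self.enc1 = nn.Sequential(nn.Conv2d(3, width, 3, padding=1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(width, width * 2, 3, stride=2, padding=1),
                                  nn.ReLU(inplace=True))
        # Decoder: block n, n in {1, ..., 4}, with [20, 10, 10, 5] ResNeXt units.
        self.blocks = nn.ModuleList(
            nn.Sequential(*[ResNeXtBlock(width * 2) for _ in range(d)]) for d in depths
        )
        self.up = nn.ConvTranspose2d(width * 2, width, 4, stride=2, padding=1)
        self.out = nn.Conv2d(width * 2, 3, 3, padding=1)

    def forward(self, x):
        e1 = self.enc1(x)
        h = self.enc2(e1)
        for blk in self.blocks:
            h = blk(h)
        h = self.up(h)
        h = torch.cat([h, e1], dim=1)  # U-Net skip connection
        return self.out(h)
```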
3) Construct the discriminator; the structure is shown in FIG. 1 (top). The discriminator is likewise made up of a stack of ResNeXt blocks and convolutional layers. The architecture is similar to the setup of SRGAN, with more layers added to process 512 x 512 resolution input.
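For completeness, a matching discriminator sketch follows, reusing ResNeXtBlock from the generator sketch above; the layer counts and widths for 512 x 512 inputs are assumed, since the patent only states that the design is SRGAN-like with extra layers.

```python
# SRGAN-like discriminator sketch for 512x512 inputs. Layer counts/widths are
# assumptions; ResNeXtBlock comes from the generator sketch above.
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, width=64):
        super().__init__()
        layers, ch = [], 3
        for out_ch in (width, width * 2, width * 4, width * 8, width * 8, width * 8):
            layers += [nn.Conv2d(ch, out_ch, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = out_ch
        layers += [ResNeXtBlock(ch), ResNeXtBlock(ch)]  # ResNeXt blocks in the stack
        layers += [nn.Conv2d(ch, 1, 4)]  # patch-level real/fake logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```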
4) Set the loss function. For the loss function of the discriminator, we follow the function proposed by Ci et al. For the loss function of the generator, we define:

L_G = L_cont + λ_1·L_adv + L_reg (2)

wherein λ_1 is 1e-4, and L_cont and L_adv are the same as in the setup of Ci et al. L_reg (whose formula appears only as an image in the original; the form below is reconstructed from the surrounding definitions) is a mask-weighted L1 reconstruction term:

L_reg = (1/(c·h·w))·‖M ⊙ (y − G(x))‖_1 (3)

wherein c, h, and w are the channel number, height, and width of the feature map, M is the mask, y is the real image, and G(x) is the generated image. M takes the value 100 in the fused portion and 0.1 in other regions.
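A sketch of this generator loss in PyTorch follows. L_cont and L_adv are deferred to Ci et al. in the text, so the plain L1 content term and non-saturating adversarial term below are stand-ins; L_reg follows Eq. (3).

```python
# Sketch of the generator loss L_G = L_cont + lambda_1 * L_adv + L_reg.
import torch
import torch.nn.functional as F

def generator_loss(fake, real, mask, disc_logits, lambda_1=1e-4):
    """fake: G(x); real: y; mask: 100.0 in fused regions, 0.1 elsewhere;
    disc_logits: discriminator output on `fake`."""
    c, h, w = fake.shape[1:]
    l_cont = F.l1_loss(fake, real)                       # stand-in for L_cont
    l_adv = F.binary_cross_entropy_with_logits(          # stand-in for L_adv
        disc_logits, torch.ones_like(disc_logits))
    # L_reg: mask-weighted L1, normalized by feature-map size (Eq. 3).
    per_image = (mask * (real - fake).abs()).sum(dim=(1, 2, 3)) / (c * h * w)
    l_reg = per_image.mean()
    return l_cont + lambda_1 * l_adv + l_reg
```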
5) Generate the image pairs required for training. The two images in an image pair need only differ in the edge information of the embedded portion for the generator to learn the mapping from the fused image to the normal image. Thus, we create an image pair by overlaying objects in images from the original dataset with objects of the same class automatically cropped from the clone image C. Given the appearance similarity within a species, the two images (the original image and the processed image) are almost identical except for the edge information, so we can directly regard the original image as the real image and the processed image as the fake image.
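A minimal sketch of this pairing step, assuming the detection labels provide axis-aligned bounding boxes and that same-class instances in the two images have already been matched; the box format and helper name are illustrative.

```python
# Minimal sketch of creating a (real, fake) training pair. Box format and
# same-class matching are assumptions; the patent only states that objects are
# overlaid with same-class crops from the clone image C.
import cv2

def make_training_pair(original, clone, matched_boxes):
    """original: HxWx3 array (real image from the dataset)
    clone:    HxWx3 array (clone image C from the Poisson-fusion step)
    matched_boxes: list of ((x1, y1, x2, y2), (cx1, cy1, cx2, cy2)) pairs,
                   matching a box in `original` to a same-class box in `clone`.
    """
    fake = original.copy()
    for (x1, y1, x2, y2), (cx1, cy1, cx2, cy2) in matched_boxes:
        crop = clone[cy1:cy2, cx1:cx2]
        # Fit the clone crop to the target box and paste it over the object;
        # afterwards only the edge information of the pasted region differs.
        fake[y1:y2, x1:x2] = cv2.resize(crop, (x2 - x1, y2 - y1))
    return original, fake  # (real image, fake image)
```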
The method can expand an original dataset of about 1000 to 2000 images to the scale of tens of thousands, saving a great deal of manual labeling cost.
Drawings
Fig. 1 is a diagram of a network architecture of the present invention.
Fig. 2 shows generation results of the present invention on the UDD dataset: (a) the original image; (b) the image after Poisson fusion; (c) the final generated image.
FIG. 3 shows an enhanced image produced by the Copy-Pasting method on UDD: (a) the input; (b) the output.
FIG. 4 shows enhanced images produced by other style-transfer GANs on UDD: (a) CycleGAN; (b) StarGAN.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following detailed description of the embodiments of the present invention is provided.
Our Poisson GAN is implemented in PyTorch. We train and run inference with an input size of 512 x 512. We use the Adam optimizer, initializing the learning rate to 1e-4 in both the generator and the discriminator and reducing it to 1e-5 after 125K iterations. Our experiments were performed on a single NVIDIA TITAN XP GPU with a batch size of 4. We use the UDD dataset as the original dataset for data enhancement. UDD is a real marine-ranch target detection dataset comprising 2227 pictures (1827 for training, 400 for testing) of three detection targets: sea cucumber, sea urchin, and scallop.
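The stated optimization schedule can be written down directly; realizing the 1e-4 to 1e-5 decay with MultiStepLR at 125K iterations is our assumption about the mechanics.

```python
# Optimizer and learning-rate schedule as stated above; MultiStepLR is an
# assumed way to realize the 1e-4 -> 1e-5 drop after 125K iterations.
import torch

G, D = Generator(), Discriminator()  # from the sketches in the Disclosure section
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
sched_g = torch.optim.lr_scheduler.MultiStepLR(opt_g, milestones=[125_000], gamma=0.1)
sched_d = torch.optim.lr_scheduler.MultiStepLR(opt_d, milestones=[125_000], gamma=0.1)
# During training (batch size 4, 512 x 512 inputs), step both schedulers once
# per iteration so the drop occurs at iteration 125K.
```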
To construct the object set P, we cropped 1000 sea cucumbers, 150 sea urchins, and 35 scallops from UDD and then fused them into background images with Poisson GAN. We used images from UDD as the background images when generating the Poisson GAN composite images, so these images are more realistic and can serve as a supplement to UDD. The resulting dataset, which we name AUDD, contains 18661 images with 18350 sea cucumbers, 101422 sea urchins, and 9624 scallops.
We also fed instances and background images cropped from the NSFC-dataset (another underwater target detection dataset) into the Poisson fusion step to generate clone images and construct a pre-training dataset. The main purpose of this dataset is to make the detector more robust in the automatic grabbing process; it includes up to 589080 images with different background colors, viewing angles, and terrains.
FIG. 4 shows images generated by CycleGAN and StarGAN. Both achieve data enhancement by changing the background color. As mentioned previously, they cannot solve the category imbalance problem. Moreover, some small objects may disappear during the transformation, which is detrimental to training a target detection model. In contrast, Poisson GAN can change the location of objects while retaining all of them.
The Copy-Pasting method is a small-object data enhancement method that uses instance segmentation masks to copy small objects from their original location to another location, as shown in FIG. 3. Unfortunately, this approach is only applicable to datasets with instance segmentation masks and is therefore difficult to use on UDD. Moreover, the method applies no extra smoothing to the edges of pasted objects, which reduces the quality of the generated images and makes the composites look clearly unnatural.
We used the expanded datasets to train YOLOv3 to demonstrate the effectiveness of Poisson GAN. The detector is first trained for 70 epochs on the pre-training dataset and then trained on AUDD using the pre-trained parameters. The results on the UDD test set are shown in Table 1. Compared with the results in the third row of Table 2, the model shows a significant improvement in mAP (about 30%), and the class-imbalance problem is alleviated to a great extent.
In addition, we performed experiments comparing different initializations. We trained YOLOv3 on AUDD using random initialization, an ImageNet pre-trained model, and the model pre-trained on our pre-training dataset, and tested on the UDD test set; the results are shown in Table 2. Clearly, AUDD helps address both insufficient training data and class imbalance. Using our pre-trained model, YOLOv3 achieves better accuracy than random initialization (+12%) and ImageNet initialization (+9%). Overall, Poisson GAN improves the accuracy of YOLOv3 by 33.7% over the original results on UDD.
TABLE 1. Accuracy of different detection networks on UDD (the table is reproduced as an image in the original; its values are not recoverable here).
TABLE 2. Accuracy of YOLOv3 on AUDD with different initialization modes (the table is reproduced as an image in the original; its values are not recoverable here).
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (2)

1. A method for constructing a generative adversarial network for target detection data enhancement, the method comprising the following steps:
1) Building the Poisson fusion part in the generator: embedding Poisson fusion into the generator to alter the number, position, or size of objects when generating a picture; assuming 3 classes of objects, respectively selecting X, Y, and Z objects from each class of objects in the original dataset and then establishing an object set P; each time a picture is generated, randomly selecting x, y, and z objects from the set P to form a subset P_a:

P_a = {N_1, ..., N_x, O_1, ..., O_y, U_1, ..., U_z} (1)

wherein N, O, and U respectively represent the categories in the original dataset; then embedding P_a into a temporary image T ∈ R^(3×720×405) to generate a source image S ∈ R^(3×720×405); to eliminate sharp boundaries, automatically creating a mask M from the embedding positions in S, and then combining it with the background image B ∈ R^(3×720×405) and the source image S to obtain a clone image C;
2) Building the network learning part in the generator: building the encoder using a stack of 3 x 3 convolutional layers with U-Net as the backbone structure; constructing the decoder using 4 ResNeXt blocks, denoted block n, n ∈ {1, ..., 4};
3) Building the discriminator: the discriminator is likewise made up of a stack of ResNeXt blocks and convolutional layers; the architecture follows the setup of SRGAN, with more layers added to handle 512 x 512 resolution input;
4) Setting the loss function: for the loss function of the generator, defining:

L_G = L_cont + λ_1·L_adv + L_reg (2)

wherein λ_1 is 1e-4, and L_reg (whose formula appears only as an image in the original; the form below is reconstructed from the surrounding definitions) is defined as:

L_reg = (1/(c·h·w))·‖M ⊙ (y − G(x))‖_1 (3)

wherein c, h, and w are the channel number, height, and width of the feature map, M is the mask, y is the real image, and G(x) is the generated image; the fused portion of M takes the value 100 and other regions take 0.1;
5) Generating the image pairs required for training: the two images in an image pair need only differ in the edge information of the embedded portion for the generator to learn the mapping from the fused image to the normal image; an image pair is created by overlaying objects in images from the original dataset with objects of the same class automatically cropped from the clone image C; given the appearance similarity within a species, the original image and the processed image are almost identical except for the edge information, so the original image is directly regarded as the real image and the processed image as the fake image.
2. The method of claim 1, wherein, for the decoder, 4 ResNeXt blocks are used, with block n set to [20, 10, 10, 5].
CN201911301874.XA 2019-12-17 2019-12-17 Construction method of a generative adversarial network for target detection data enhancement Active CN111091151B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911301874.XA CN111091151B (en) 2019-12-17 2019-12-17 Construction method of a generative adversarial network for target detection data enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911301874.XA CN111091151B (en) 2019-12-17 2019-12-17 Construction method of a generative adversarial network for target detection data enhancement

Publications (2)

Publication Number Publication Date
CN111091151A CN111091151A (en) 2020-05-01
CN111091151B true CN111091151B (en) 2021-11-05

Family

ID=70395675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911301874.XA Active CN111091151B (en) 2019-12-17 2019-12-17 Construction method of a generative adversarial network for target detection data enhancement

Country Status (1)

Country Link
CN (1) CN111091151B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111832443B (en) * 2020-06-28 2022-04-12 华中科技大学 Construction method and application of construction violation detection model
CN112800906B (en) * 2021-01-19 2022-08-30 吉林大学 Improved YOLOv 3-based cross-domain target detection method for automatic driving automobile
US11785262B1 (en) 2022-03-16 2023-10-10 International Business Machines Corporation Dynamic compression of audio-visual data
CN117409192B (en) * 2023-12-14 2024-03-08 武汉大学 Data enhancement-based infrared small target detection method and device

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018053340A1 (en) * 2016-09-15 2018-03-22 Twitter, Inc. Super resolution using a generative adversarial network
US20180336439A1 (en) * 2017-05-18 2018-11-22 Intel Corporation Novelty detection using discriminator of generative adversarial network
US10210631B1 (en) * 2017-08-18 2019-02-19 Synapse Technology Corporation Generating synthetic image data
CN109377448B (en) * 2018-05-20 2021-05-07 北京工业大学 Face image restoration method based on generation countermeasure network
CN109190504B (en) * 2018-08-10 2020-12-22 百度在线网络技术(北京)有限公司 Automobile image data processing method and device and readable storage medium
CN109409274B (en) * 2018-10-18 2020-09-04 四川云从天府人工智能科技有限公司 Face image transformation method based on face three-dimensional reconstruction and face alignment
CN109993825B (en) * 2019-03-11 2023-06-20 北京工业大学 Three-dimensional reconstruction method based on deep learning
CN110222628A (en) * 2019-06-03 2019-09-10 电子科技大学 A kind of face restorative procedure based on production confrontation network
CN110378985B (en) * 2019-07-19 2023-04-28 中国传媒大学 Animation drawing auxiliary creation method based on GAN

Also Published As

Publication number Publication date
CN111091151A (en) 2020-05-01

Similar Documents

Publication Publication Date Title
CN111091151B (en) Construction method of a generative adversarial network for target detection data enhancement
Pittaluga et al. Revealing scenes by inverting structure from motion reconstructions
Li et al. AADS: Augmented autonomous driving simulation using data-driven algorithms
CN111292264B (en) Image high dynamic range reconstruction method based on deep learning
CN110111236B (en) Multi-target sketch image generation method based on progressive confrontation generation network
CN101714262B (en) Method for reconstructing three-dimensional scene of single image
CN113158862B (en) Multitasking-based lightweight real-time face detection method
CN109919209B (en) Domain self-adaptive deep learning method and readable storage medium
CN109829391B (en) Significance target detection method based on cascade convolution network and counterstudy
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN108416751A (en) A kind of new viewpoint image combining method assisting full resolution network based on depth
Zhan et al. Towards realistic 3d embedding via view alignment
CN112101463A (en) Image semantic segmentation network training method, segmentation device and medium
CN115049556A (en) StyleGAN-based face image restoration method
Bonin‐Font et al. NetHALOC: A learned global image descriptor for loop closing in underwater visual SLAM
Lin et al. Immesh: An immediate lidar localization and meshing framework
Yang et al. Underwater self-supervised depth estimation
JP2023109570A (en) Information processing device, learning device, image recognition device, information processing method, learning method, and image recognition method
CN114255328A (en) Three-dimensional reconstruction method for ancient cultural relics based on single view and deep learning
Feng et al. Foreground-aware dense depth estimation for 360 images
Qiu et al. Multi-scale Fusion for Visible Watermark Removal
Newman et al. Investigating the optimisation of real-world and synthetic object detection training datasets through the consideration of environmental and simulation factors
Tang et al. NDPC-Net: A dehazing network in nighttime hazy traffic environments
Liang et al. Building placements in urban modeling using conditional generative latent optimization
Park et al. Improving Instance Segmentation using Synthetic Data with Artificial Distractors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant