CN111753849A - Detection method and system based on compact aggregation feature and cyclic residual learning - Google Patents
- Publication number
- CN111753849A (application CN202010606592.7A)
- Authority
- CN
- China
- Prior art keywords
- detection
- aggregation
- convolution
- features
- saliency
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention provides a detection method and system based on compact aggregation features and cyclic residual learning, belonging to the technical field of image processing. The system comprises a compact feature extraction module, an all-feature aggregation module and a cyclic residual optimization module. The method comprises the following steps: extracting compact convolution features by combining the output features of consecutive stages; applying an atrous spatial pyramid pooling module to the compact convolution features extracted from all layers to aggregate information across layers; and, under a deep supervision mechanism, continuously optimizing the prediction by residual learning. The whole cyclic residual network is tested on three visual saliency detection data sets; after testing, the cyclic residual network based on compact aggregation features can be applied to practical visual saliency detection in natural images. The invention improves visual saliency detection in complex scenes, strengthens the suppression of background noise, and enhances the continuity and integrity of the detected region.
Description
Technical Field
The invention relates to a detection method and system based on compact aggregation features and cyclic residual learning, and belongs to the technical field of image processing.
Background
Visual saliency detection aims to detect the most distinctive target in a natural image and extract its complete content. Because it helps reduce the complexity of computer understanding and analysis of natural images, visual saliency detection has become an important preprocessing step for many computer vision tasks, including image retrieval, visual tracking, scene classification and pedestrian re-identification.
Traditional algorithms rely on contrast or statistical information computed from hand-crafted features such as color, brightness and texture, accumulated from human prior knowledge. Convolutional neural networks, by contrast, can learn effective image features autonomously and quickly, and have ample room for development in the field of image processing. As convolutional and pooling layers are stacked, the resolution of the network's high-level output features gradually decreases while their semantic information is enhanced. When high-level features are applied directly to pixel-level visual saliency detection, they can locate salient objects accurately, but they lack detail information and produce a coarse overall appearance. Conversely, the shallow, high-resolution features of a convolutional neural network have the advantage of retaining spatial detail.
Disclosure of Invention
In order to improve visual saliency detection in complex scenes, strengthen the suppression of background noise, and enhance the continuity and integrity of the detected region, the invention provides a detection system comprising: a compact feature extraction module, which adopts dense connections within a single layer to aggregate information effectively, acting on the last convolutional layer features of the second to fifth stages of a ResNeXt101 network; an all-feature aggregation module, which uses an ASPP module to exchange and fuse feature information across layers of different resolutions; and a cyclic residual optimization module, which repeatedly reuses the compact aggregation features to continuously optimize the predicted saliency map under a deep supervision mechanism.
Another objective of the present invention is to provide a cyclic residual visual saliency detection method based on compact aggregation features. First, compact convolution features of different levels are extracted from the base network; then all multi-resolution features are aggregated; finally, under a deep supervision mechanism, the saliency map is continuously optimized by cyclically learning residuals. The method comprises the following steps:
S1, extracting compact convolution features from the base ResNeXt101 network: the output features of consecutive stages are combined in a densely connected manner, covering a larger receptive field and fusing information within each single layer;
S2, since aggregating only the basic features within a single layer neglects the fusion of information across features of different depths and resolutions in the deep neural network, which harms visual saliency detection, an atrous spatial pyramid pooling module is applied to the compact convolution features extracted from all layers to aggregate information across layers;
S3, under a deep supervision mechanism, the compact aggregation features are reused cyclically and the prediction is continuously optimized by residual learning, with a suitable number of cycles determined by experiment;
S4, the whole cyclic residual network is tested on three visual saliency detection data sets; after testing, the cyclic residual network based on compact aggregation features can be applied to practical visual saliency detection in natural images.
Optionally, the S1 includes:
First, for the 256-, 512-, 1024- and 2048-channel features of the last convolutional layer of the second to fifth stages of the ResNeXt101 network, a convolution with kernel size 3 and 128 channels performs dimensionality reduction. The reduced feature is multiplexed to each subsequent densely connected stage to guide the fusion of later information; that is, the input of each stage is the concatenation of the features of all previous stages, and a convolution with kernel size 3 and 64 channels uniformly extracts feature information. Finally, the output of the compact feature extraction module is obtained by concatenating the reduced feature with the intermediate outputs of the several stages.
Optionally, the S2 includes:
First, the compact convolution features extracted from all layers are concatenated and reduced in dimension by two convolutions with kernel size 3 and 256 channels. The result is fed into an atrous spatial pyramid pooling module for information fusion through five parallel branches: one convolution with kernel size 1 and 128 channels; three convolutions with kernel size 3, dilation rates 2, 4 and 6 respectively, and 128 channels; and a combination of global average pooling with a kernel-size-1, 128-channel convolution. Finally, the features of the five branches are concatenated and reduced by a convolution with kernel size 1 and 256 channels, yielding the compact aggregation features (DAF). These output features, aggregated both within single layers and across multiple layers, have strong representational power and contain rich saliency cues.
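As a rough aid to the aggregation step above, the following pure-Python sketch checks the arithmetic implied by the branch layout: the effective receptive field k + (k - 1)(d - 1) of a dilated convolution, and the channel count entering the final 1x1 reduction. The function names are illustrative, not from the patent.

```python
# Arithmetic behind the ASPP aggregation step: branch layout and channel
# counts follow the text (1x1/128, three dilated 3x3/128, global pooling
# + 1x1/128, concatenated and reduced to 256 by a 1x1 convolution).

def dilated_rf(kernel: int, dilation: int) -> int:
    """Effective receptive field of a single dilated convolution."""
    return kernel + (kernel - 1) * (dilation - 1)

def aspp_concat_channels(branch_channels=(128, 128, 128, 128, 128)) -> int:
    """Channels entering the final 1x1 reduction convolution."""
    return sum(branch_channels)

# 3x3 convolutions with dilation rates 2, 4 and 6 cover progressively
# larger windows without extra parameters.
fields = [dilated_rf(3, d) for d in (2, 4, 6)]
print(fields)                  # [5, 9, 13]
print(aspp_concat_channels())  # 640, reduced to 256 afterwards
```

The growing receptive fields (5, 9, 13) are what let the parallel branches mix context at several scales before the reduction convolution.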
Optionally, the S3 includes:
First, the compact aggregation features obtained in S2 pass through a convolution with kernel size 1 and 1 channel to produce an initial saliency map SM_0. The aggregation features and the saliency map are then repeatedly concatenated and fed to a residual convolution block (RCB) to learn the residual; the saliency map after the k-th cycle is SM_k = RCB(Cat(SM_{k-1}, DAF)) + SM_{k-1}, where RCB(·) consists of two convolutions with kernel size 3 and 128 channels followed by one convolution with kernel size 1 and 1 channel, and Cat(·) denotes channel-wise concatenation of the input features. After a suitable number of cycles K, the final saliency map is obtained through a sigmoid operation.
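A minimal numeric sketch of the cyclic residual update SM_k = RCB(Cat(SM_{k-1}, DAF)) + SM_{k-1} is given below. The residual convolution block is replaced by a hypothetical stand-in function (the real RCB is a learned network, and real maps are tensors rather than scalars); only the shape of the recurrence is illustrated.

```python
import math

def rcb_stub(sm_prev: float, daf: float) -> float:
    # Stand-in for RCB(Cat(SM_{k-1}, DAF)): some learned correction term
    # that pulls the prediction toward the aggregated-feature evidence.
    return 0.5 * (daf - sm_prev)

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def refine(sm0: float, daf: float, cycles: int = 6) -> float:
    sm = sm0
    for _ in range(cycles):
        sm = rcb_stub(sm, daf) + sm  # residual update SM_k = RCB(...) + SM_{k-1}
    return sigmoid(sm)               # final map passes through a sigmoid

# Starting from a rough prediction, the iterate converges toward the
# evidence carried by the aggregated features.
print(round(refine(0.0, 4.0), 4))
```

With this stand-in, each cycle halves the remaining gap to the evidence, mirroring how the patent's K = 6 cycles progressively sharpen the saliency map.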
The whole network is trained with the standard cross-entropy loss L = -Σ_{t=1}^{T} [GT_t · log(SM_t) + (1 - GT_t) · log(1 - SM_t)], where SM_t and GT_t denote the saliency of the t-th pixel on the saliency map and the truth map respectively, T is the total number of pixels in the image, GT_t = 1 and GT_t = 0 mark salient and non-salient pixels respectively, and SM_t ∈ [0, 1] is the saliency predicted by the algorithm. The closer the saliency map is to the truth map, the smaller the loss. The deep supervision mechanism places constraints on the middle of the network, driving the whole toward more detailed learning and optimization. The cycling process generates a sequence of saliency maps {SM_0, SM_1, ..., SM_K}; cross entropy is computed between each output map and the truth map, so the total loss is L_total = Σ_{k=0}^{K} L_k. During each cycle of residual optimization, both the input and the output are constrained by a loss, which further facilitates the learning and fusion of the deeply aggregated features. By minimizing the total loss, the network parameters are continuously refined to obtain the final model.
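The deeply supervised loss described above can be sketched in pure Python: per-map binary cross entropy, summed over all K + 1 intermediate saliency maps. Maps are flattened pixel lists here, and the numeric values are purely illustrative.

```python
import math

def bce(sm, gt, eps=1e-7):
    """Standard cross entropy between a saliency map and the truth map."""
    total = 0.0
    for s, g in zip(sm, gt):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(g * math.log(s) + (1.0 - g) * math.log(1.0 - s))
    return total

def deep_supervision_loss(maps, gt):
    """Total loss: sum of per-map cross entropies over every cycle's output."""
    return sum(bce(sm, gt) for sm in maps)

gt = [1.0, 0.0, 1.0, 0.0]
coarse = [0.6, 0.4, 0.7, 0.3]    # early cycle: uncertain prediction
sharp  = [0.9, 0.1, 0.95, 0.05]  # later cycle: closer to the truth map
assert bce(sharp, gt) < bce(coarse, gt)  # closer map, smaller loss
print(deep_supervision_loss([coarse, sharp], gt))
```

Because every cycle's output contributes a term, gradients reach the middle of the network directly, which is what the deep supervision mechanism is for.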
The invention has the following beneficial effects:
The invention realizes compact aggregation both within single layers and across multiple layers of the convolutional network, and continuously optimizes the saliency prediction in a cyclic residual manner, alleviating the problems of regional integrity and continuity in current deep visual saliency detection and improving detection accuracy and smoothness.
The invention designs a compact feature extraction module for single-layer features of the base framework that is simple, effective and highly portable, effectively enhancing feature reusability and continuity.
The invention uses the atrous spatial pyramid pooling module to effectively aggregate the compactly extracted features of multiple levels and different resolutions, directly promoting cross-layer information fusion and improving the visual saliency detection results.
Drawings
Fig. 1 is a schematic diagram of the cyclic residual saliency detection network based on compact aggregation features according to the present invention.
Fig. 2 shows the compact feature extraction module, acting within a single layer, proposed by the present invention.
Fig. 3 shows the all-feature aggregation module, acting across layers, employed by the present invention.
Fig. 4 compares detection results of the present invention and other deep visual saliency detection algorithms on public data sets.
Detailed Description
The first embodiment is as follows:
A detection method based on compact aggregation features and cyclic residual learning, see fig. 1, comprising the following steps:
step 1: the common data set MSRA10K is set as a training set containing 10000 natural RGB images and corresponding binary true value maps. In order to enhance the robustness of the network to image transformation and to solve the over-fitting problem, the method adopts the modes of random rotation, random cutting and horizontal turnover to realize the expansion of the training sample.
For the MSRA10K data set, see Cheng Ming-Ming et al., "Global Contrast Based Salient Region Detection", IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 409-416.
Step 2: referring to fig. 1, training samples are first fed into the ResNeXt101 base network with its fully connected layer removed, and convolution features with 256, 512, 1024 and 2048 channels are taken from the last convolutional layer of the second to fifth stages respectively. The higher the level, the richer the semantic information of the convolution features, while shallow features better retain detail and texture information.
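A back-of-envelope sketch of what these backbone stages emit for a square input, assuming the standard ResNet/ResNeXt stage strides of 4, 8, 16 and 32 (an assumption; the patent only quotes the channel counts):

```python
def stage_shapes(input_size=224):
    """(channels, height, width) per backbone stage for a square input."""
    strides = {"stage2": 4, "stage3": 8, "stage4": 16, "stage5": 32}
    channels = {"stage2": 256, "stage3": 512, "stage4": 1024, "stage5": 2048}
    return {
        name: (channels[name], input_size // s, input_size // s)
        for name, s in strides.items()
    }

for name, shape in stage_shapes().items():
    print(name, shape)
# Deeper stages: more channels (richer semantics), lower resolution
# (spatial detail lost) -- the trade-off the aggregation modules address.
```

This resolution/semantics trade-off is exactly why the method fuses shallow and deep features instead of using the last stage alone.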
For the ResNeXt101 network, see Xie Saining et al., "Aggregated Residual Transformations for Deep Neural Networks", IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5987-5995.
Step 3: referring to fig. 2, for the features of different layers obtained in step 2, feature information within each single layer is exchanged through the compact feature extraction module. A convolution with kernel size 3 × 3 and 128 channels first reduces the dimension of the original feature; the reduced feature is then multiplexed to each subsequent densely connected stage to guide the fusion of later information, so the input of each stage is the concatenation of the features of all previous stages; a convolution with kernel size 3 × 3 and 64 channels uniformly extracts feature information; finally, the module output is the concatenation of the reduced feature and the intermediate outputs of the several stages.
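The channel bookkeeping of this dense connection pattern can be sketched as follows: a 128-channel reduction, stages whose 3 × 3 convolutions each emit 64 channels, and a final concatenation of the reduced feature with every intermediate output. The number of dense stages is an assumption for illustration; the patent does not state it.

```python
def dense_stage_inputs(reduced=128, stage_out=64, stages=3):
    """Input channel count seen by each densely connected stage."""
    inputs = []
    for k in range(stages):
        # Each stage sees the reduced feature plus all earlier stage outputs.
        inputs.append(reduced + k * stage_out)
    return inputs

def module_output_channels(reduced=128, stage_out=64, stages=3):
    """Channels of the concatenated module output."""
    return reduced + stages * stage_out

print(dense_stage_inputs())      # [128, 192, 256]
print(module_output_channels())  # 320
```

The linear growth of stage inputs is the signature of dense connectivity: every stage is conditioned on all earlier ones, which is what "multiplexing the reduced feature to each stage" buys.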
Step 4: the four layers of convolution features of different resolutions obtained in step 3 are fused across layers by the all-feature aggregation module, which consists of two convolutions with kernel size 3 × 3 and 256 channels followed by an atrous spatial pyramid pooling module. The pooling module processes its input through five parallel branches: the first is a convolution with kernel size 1 × 1 and 128 channels; the middle three are convolutions with kernel size 3 × 3, dilation rates 2, 4 and 6 respectively, and 128 channels; the last applies global average pooling followed by a convolution with kernel size 1 × 1 and 128 channels. The outputs of the five branches are concatenated and reduced by a convolution with kernel size 1 × 1 and 256 channels. The compact aggregation features obtained by aggregating feature information both within single layers and across multiple layers have strong representational power and contain rich saliency cues.
Step 5: referring to fig. 1, the compact aggregation features first pass through a convolution with kernel size 1 × 1 and 1 channel to obtain an initial saliency map. The saliency map from the previous cycle and the compact aggregation features are then repeatedly concatenated and fed into a residual convolution block, formed by two convolutions with kernel size 3 × 3 and 128 channels and one convolution with kernel size 1 × 1 and 1 channel, to continuously optimize the predicted saliency map; after a suitable number of cycles, the final saliency map is obtained by a sigmoid operation on the result.
Step 6: under the PyTorch deep learning framework, the whole network is trained with the stochastic gradient descent algorithm until the loss converges, and the optimal network model is saved.
Step 7: to determine the number of cycles, the whole network was trained with different cycle counts and tested on the public DUT-OMRON data set. Table 1 gives the results for three objective evaluation metrics, the F-measure, MAE and S-measure; detection performance improves to a certain extent as the number of cycles increases. The number of cycles was finally set to 6.
Table 1: evaluation metrics of the present application on the DUT-OMRON data set under different numbers of cycles
For the DUT-OMRON data set, see Radhakrishna Achanta et al., "Frequency-tuned Salient Region Detection", IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1597-1604.
Step 8: to demonstrate the performance of the cyclic residual visual saliency detection network based on compact aggregation features, the present application is compared on the ECSSD, HKU-IS and DUT-OMRON data sets with the current state-of-the-art methods RFCN, UCF, NLDF, GBR, MPFF, R3Net and RefineNet. The objective evaluation metrics on the different test sets are shown in table 2; the proposed method ranks among the best.
Table 2: comparison of evaluation metrics between the present application and different algorithms on different test sets
For a visual comparison on part of the test images, see FIG. 4: the first four rows of images are from the ECSSD data set, the middle four rows from the HKU-IS data set, and the last four rows from the DUT-OMRON data set. Visually, the different algorithms all detect part of the salient regions effectively, but problems such as incomplete regions, unclear boundaries and background interference remain. The results on the people, flowers and objects in the first, fourth and seventh rows show that the regions detected by the proposed method are more complete and smooth. The result on the fish in the third row shows that, even when the target closely resembles the background, the proposed method remains robust and locates the target well. The result on the oranges in the fifth row shows that, under interference from targets of the same category, the proposed method still localizes directly and maintains high accuracy. The result on the balloon in the twelfth row shows that the proposed method still detects the salient target accurately in a dark environment. Overall, the compact aggregation features effectively improve the integrity of the detected region, suppress background noise, and bring the spatial structure of the predicted saliency map closer to that of the truth map.
For the ECSSD data set, see Yan Qiong et al., "Hierarchical Saliency Detection", IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1155-1162.
For the HKU-IS data set, see Li Guanbin et al., "Visual Saliency Based on Multiscale Deep Features", IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5455-5463.
For RFCN, see Wang Linzhao et al., "Saliency Detection with Recurrent Fully Convolutional Networks", European Conference on Computer Vision, 2016, pp. 825-841.
For UCF, see Zhang Pingping et al., "Learning Uncertain Convolutional Features for Accurate Saliency Detection", IEEE International Conference on Computer Vision, 2017, pp. 212-221.
For NLDF, see Luo Zhiming et al., "Non-Local Deep Features for Salient Object Detection", IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6609-6617.
For GBR, see Tan Xin et al., "Saliency Detection by Deep Network with Boundary Refinement and Global Context", IEEE International Conference on Multimedia and Expo, 2018, pp. 1-6.
For MPFF, see Zhu Hengliang et al., "Multi-Path Feature Fusion Network for Saliency Detection", IEEE International Conference on Multimedia and Expo, 2018, pp. 1-6.
For R3Net, see Deng Zijun et al., "R3Net: Recurrent Residual Refinement Network for Saliency Detection", International Joint Conference on Artificial Intelligence, 2018, pp. 684-690.
For RefineNet, see Keren Fu et al., "Refinet: A Deep Segmentation Assisted Refinement Network for Salient Object Detection", IEEE Transactions on Multimedia, 2019, pp. 457-469.
Example two
A detection system applying the detection method of the first embodiment, comprising a compact feature extraction module, an all-feature aggregation module and a cyclic residual optimization module. The compact feature extraction module adopts dense connections within a single layer to aggregate information effectively, acting on the last convolutional layer features of the second to fifth stages of the ResNeXt101 network.
The all-feature aggregation module uses the ASPP module to exchange and fuse feature information across layers of different resolutions.
The cyclic residual optimization module repeatedly reuses the compact aggregation features to continuously optimize the predicted saliency map under a deep supervision mechanism.
Although the present invention has been described with reference to the preferred embodiments, it should be understood that various changes and modifications can be made therein by those skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (8)
1. A detection system, comprising:
a compact feature extraction module, which adopts dense connections within a single layer to aggregate information effectively, acting on the last convolutional layer features of the second to fifth stages of a ResNeXt101 network;
an all-feature aggregation module, which uses an ASPP module to exchange and fuse feature information across layers of different resolutions;
and a cyclic residual optimization module, which repeatedly reuses the compact aggregation features to continuously optimize the predicted saliency map under a deep supervision mechanism.
2. A method of detection, comprising the steps of:
S1, extracting compact convolution features from the base ResNeXt101 network: the output features of consecutive stages are combined in a densely connected manner, covering a larger receptive field and fusing information within each single layer;
S2, since aggregating only the basic features within a single layer neglects the fusion of information across features of different depths and resolutions in the deep neural network, which harms visual saliency detection, an atrous spatial pyramid pooling module is applied to the compact convolution features extracted from all layers to aggregate information across layers;
S3, under a deep supervision mechanism, the compact aggregation features are reused cyclically and the prediction is continuously optimized by residual learning, with a suitable number of cycles determined by experiment;
S4, the whole cyclic residual network is tested on three visual saliency detection data sets; after testing, the cyclic residual network based on compact aggregation features can be applied to practical visual saliency detection in natural images.
3. The detection method according to claim 2, wherein in S1, for the last convolutional layer of the second to fifth stages of the ResNeXt101 network, with 256, 512, 1024 and 2048 channels respectively, a convolution with kernel size 3 and 128 channels first performs dimensionality reduction; the reduced feature is multiplexed to each subsequent densely connected stage to guide the fusion of later information, that is, the input of each stage is the concatenation of the features of all previous stages; a convolution with kernel size 3 and 64 channels uniformly extracts feature information; and the output of the compact feature extraction module is finally obtained by concatenating the reduced feature with the intermediate outputs of the several stages.
4. The detection method according to claim 2, wherein in S2, the compact convolution features extracted from all layers are first concatenated and reduced in dimension by two convolutions with kernel size 3 and 256 channels; the result is fed into an atrous spatial pyramid pooling module for information fusion through five parallel branches, namely one convolution with kernel size 1 and 128 channels, three convolutions with kernel size 3, dilation rates 2, 4 and 6 respectively and 128 channels, and a combination of global average pooling with a kernel-size-1, 128-channel convolution; and finally the features of the five branches are concatenated and reduced by a convolution with kernel size 1 and 256 channels, yielding the compact aggregation features.
5. The detection method according to claim 2, wherein in S3, the compact aggregation features obtained in S2 first pass through a convolution with kernel size 1 and 1 channel to produce an initial saliency map SM_0; the aggregation features and the saliency map are then repeatedly fed to a residual convolution block to learn the residual, the saliency map after the k-th cycle being SM_k = RCB(Cat(SM_{k-1}, DAF)) + SM_{k-1}, where RCB(·) consists of two convolutions with kernel size 3 and 128 channels followed by one convolution with kernel size 1 and 1 channel, and Cat(·) denotes channel-wise concatenation of the input features; and after a suitable number of cycles K, the final saliency map is obtained through a sigmoid operation.
6. The detection method according to claim 5, wherein the whole network is trained with the standard cross-entropy loss L = -(1/T) Σ_{t=1}^{T} [GT_t·log(SM_t) + (1 - GT_t)·log(1 - SM_t)], where SM_t and GT_t respectively denote the saliency of the t-th pixel on the saliency map and on the ground-truth map, T denotes the total number of pixels of the image, GT_t = 1 and GT_t = 0 denote salient and non-salient pixels respectively, and SM_t ∈ [0, 1] denotes the saliency predicted by the algorithm.
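The loss of claim 6 is the standard pixel-wise binary cross-entropy; a minimal NumPy sketch (the epsilon clamp for numerical stability is an implementation detail not in the claim):

```python
import numpy as np

def saliency_bce_loss(sm, gt, eps=1e-7):
    """Standard cross-entropy over all T pixels (claim 6).

    sm: predicted saliency in [0, 1]; gt: ground truth in {0, 1}.
    """
    sm = np.clip(sm, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(gt * np.log(sm) + (1.0 - gt) * np.log(1.0 - sm))
```

For a pixel predicted at 0.5 against a salient ground truth the loss is ln 2 ≈ 0.693, and it approaches 0 as the prediction approaches the ground truth.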
8. The detection method according to claim 7, wherein during each cycle of residual optimization, both the input and the output of the cycle are constrained by the loss.
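The deep supervision of claim 8 amounts to applying the cross-entropy constraint to every saliency map in the refinement cycle (SM_0 through SM_K) and summing; a self-contained NumPy sketch (summing rather than averaging over cycles is an assumption):

```python
import numpy as np

def cycle_supervision_loss(saliency_maps, gt, eps=1e-7):
    """Sum of per-cycle cross-entropy losses (claim 8).

    saliency_maps: iterable of predicted maps SM_0 .. SM_K, each in [0, 1];
    gt: ground-truth map in {0, 1}.
    """
    total = 0.0
    for sm in saliency_maps:
        sm = np.clip(sm, eps, 1.0 - eps)
        # each cycle's input and output map gets its own loss constraint
        total += -np.mean(gt * np.log(sm) + (1.0 - gt) * np.log(1.0 - sm))
    return total
```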
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010606592.7A CN111753849B (en) | 2020-06-29 | 2020-06-29 | Detection method and system based on tight aggregation feature and cyclic residual error learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111753849A true CN111753849A (en) | 2020-10-09 |
CN111753849B CN111753849B (en) | 2024-06-28 |
Family
ID=72678047
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010606592.7A Active CN111753849B (en) | 2020-06-29 | 2020-06-29 | Detection method and system based on tight aggregation feature and cyclic residual error learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111753849B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113869371A (en) * | 2021-09-03 | 2021-12-31 | 深延科技(北京)有限公司 | Model training method, clothing fine-grained segmentation method and related device |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875777A (en) * | 2018-05-03 | 2018-11-23 | 浙江大学 | Kinds of fibers and blending rate recognition methods in textile fabric based on two-way neural network |
CN109447976A (en) * | 2018-11-01 | 2019-03-08 | 电子科技大学 | A kind of medical image cutting method and system based on artificial intelligence |
US20200026942A1 (en) * | 2018-05-18 | 2020-01-23 | Fudan University | Network, System and Method for Image Processing |
CN111275718A (en) * | 2020-01-18 | 2020-06-12 | 江南大学 | Clothes amount detection and color protection washing discrimination method based on significant region segmentation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111210443B (en) | Deformable convolution mixing task cascading semantic segmentation method based on embedding balance | |
CN113052210B (en) | Rapid low-light target detection method based on convolutional neural network | |
CN110210539B (en) | RGB-T image saliency target detection method based on multi-level depth feature fusion | |
CN111582316B (en) | RGB-D significance target detection method | |
CN111291809B (en) | Processing device, method and storage medium | |
CN110533041B (en) | Regression-based multi-scale scene text detection method | |
CN110569851B (en) | Real-time semantic segmentation method for gated multi-layer fusion | |
CN111797841B (en) | Visual saliency detection method based on depth residual error network | |
CN111353544B (en) | Improved Mixed Pooling-YOLOV 3-based target detection method | |
CN113011329A (en) | Pyramid network based on multi-scale features and dense crowd counting method | |
CN112580458B (en) | Facial expression recognition method, device, equipment and storage medium | |
CN113033454B (en) | Method for detecting building change in urban video shooting | |
CN112597985A (en) | Crowd counting method based on multi-scale feature fusion | |
CN114022408A (en) | Remote sensing image cloud detection method based on multi-scale convolution neural network | |
CN112580480A (en) | Hyperspectral remote sensing image classification method and device | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN111401380A (en) | RGB-D image semantic segmentation method based on depth feature enhancement and edge optimization | |
CN113139544A (en) | Saliency target detection method based on multi-scale feature dynamic fusion | |
CN116740439A (en) | Crowd counting method based on trans-scale pyramid convertors | |
CN113850324A (en) | Multispectral target detection method based on Yolov4 | |
CN111898614B (en) | Neural network system and image signal and data processing method | |
CN114092540A (en) | Attention mechanism-based light field depth estimation method and computer readable medium | |
Chua et al. | Visual IoT: ultra-low-power processing architectures and implications | |
CN116229406B (en) | Lane line detection method, system, electronic equipment and storage medium | |
CN111753849A (en) | Detection method and system based on compact aggregation feature and cyclic residual learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |