CN113837058A - Lightweight rainwater grate detection method coupled with context aggregation network - Google Patents

Lightweight rainwater grate detection method coupled with context aggregation network

Info

Publication number
CN113837058A
Authority
CN
China
Prior art keywords
image
module
rainwater grate
model
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111102992.5A
Other languages
Chinese (zh)
Other versions
CN113837058B (en)
Inventor
车明亮
曹鑫亮
杨帆
郭有志
李凯隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202111102992.5A priority Critical patent/CN113837058B/en
Publication of CN113837058A publication Critical patent/CN113837058A/en
Application granted granted Critical
Publication of CN113837058B publication Critical patent/CN113837058B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight rainwater grate detection method coupled with a context aggregation network, comprising seven steps: image preprocessing, parameter initialization, data set generation, rainwater grate detection model construction, model training and prediction, mask processing, and post-processing. The prior probability of the spatial distribution of rainwater grates is incorporated into model prediction to further improve detection accuracy. The method has high real-time performance: its lightweight network architecture reduces the consumption of computing resources, shortens model loading and feed-forward time, and increases operating speed.

Description

Lightweight rainwater grate detection method coupled with context aggregation network
Technical Field
The invention relates to the field of image target detection, and in particular to a lightweight small-target detection method coupling attention and context.
Background
A rainwater grate is a rectangular drainage fixture set into an impervious surface; it drains rainwater while intercepting bulky debris such as leaves, plastic bags, cardboard, and food waste. After long service, however, a grate may be damaged or deformed to varying degrees, creating safety hazards; during sustained heavy rain, intercepted road runoff increases and the risk of waterlogging grows, so grates require regular maintenance and replacement. On urban roads, grate placement depends on local rainfall, road width, road area, and other factors. Later maintenance requires periodic checks of the positions, number, and condition of grates on both existing and newly built roads. For large-scale urban road networks, target detection technology is the principal means of acquiring basic rainwater grate information quickly and at low cost.
At present, deep learning is a major direction in target detection, giving rise to two-stage and single-stage models. The former complete the detection task by extracting and classifying proposed regions; representative methods include the R-CNN series, such as R-CNN and Fast R-CNN (Girshick et al., 2014; Girshick et al., 2015; Ren et al., 2015), and the SPP-Net model (He et al., 2015). The latter use a single network to output target categories and corresponding positions directly and quickly, without extracting proposal regions; representative models include the Single Shot MultiBox Detector (SSD) model (Liu et al., 2016) and the YOLO (You Only Look Once) series V1-V5 (Redmon et al., 2016; Redmon et al., 2017; Redmon et al., 2018; Bochkovskiy et al., 2020). Two-stage models usually achieve high detection accuracy but poor real-time performance; single-stage models are just the opposite. However, neither type performs well on small-size targets, with a large gap relative to their performance on large targets.
The rainwater grate is a small target: in street-view images captured by drones or street-view cars, it usually occupies a very small fraction of a single image. Taking Baidu Maps street-view images as an example, the average area fraction of a rainwater grate is below 8‰, far smaller than the smallest target in the VOC 2007 data set, the bottle (average area fraction about 5%). Existing deep learning models are therefore severely limited when detecting targets of such small size. In addition, grates degrade to varying degrees after being put into service, and in real scenes they may be shaded by shadows, covered by leaves and garbage, encroached on by cracked stones and weeds, or painted over by road traffic markings. These phenomena weaken the features that distinguish a grate from background objects and further increase detection difficulty. Advanced techniques such as feature pyramids (Lin et al., 2017), attention mechanisms (Wang et al., 2017; Woo et al., 2018), and context information (Lin et al., 2019) can be coupled into a model to improve detection accuracy, but their use generally increases the computational complexity and parameter count of the model and extends training and running time. To reduce computing resources while maintaining detection accuracy, especially on embedded devices, a lightweight detection method offers higher efficiency. The prior art therefore requires further technical optimization.
Disclosure of Invention
The purpose of the invention is as follows: the invention aims to overcome the defects of the prior art by providing a lightweight rainwater grate detection method coupled with a context aggregation network, so as to solve technical problems of existing rainwater grate detection methods such as high missed-detection and false-detection rates, large model parameter counts, and low operating efficiency.
The technical scheme is as follows: the lightweight rainwater grate detection method coupled with a context aggregation network according to the invention comprises the following steps:
(1) image preprocessing: according to the obtained street-view image data, mark the positions of rainwater grates in the street-view images with an image annotation tool to generate rainwater grate image-label data; enhance the street-view images using image processing techniques;
(2) parameter initialization: initialize the parameters involved in the detection method;
(3) data set generation: screen the image-label data from step 1 so that image data and label data correspond one to one; according to the initial parameters set in step 2, resize and channel-normalize the images and convert the label data to grid data; divide the data set into training data and test data according to an empirical ratio of training to test samples; generate the rainwater grate spatial distribution prior probability map from the training data;
(4) rainwater grate detection model construction: the model framework mainly comprises a backbone network and a context aggregation network; the backbone network is composed of a lightweight Conv1 module, a series of serially connected Block modules, and a Regressor module; the context aggregation network performs channel aggregation of shallow context feature maps with the deep target feature map, fusing shallow target detail information with deep target semantic information and thereby reducing the attenuation of target information, especially small-target information, during network propagation;
(5) model training and prediction: according to the initial parameters set in step 2, train the detection model with the training data generated in step 3 until convergence, recording and saving the optimal model weights; after training, load the optimal model weights and use the detection model constructed in step 4 for rainwater grate prediction;
(6) mask processing: using the rainwater grate spatial distribution prior probability map, mask the bounding-box results predicted by the model to obtain masked bounding-box results;
(7) post-processing: according to the initial parameters set in step 2, perform non-maximum suppression de-duplication on the boxes from step 6, spatially transform the box coordinates back to absolute image coordinates, and test using the test data obtained in step 3.
Preferably, in step (1), the obtained street-view image data comprise nearly 2,000 high-definition color images with a resolution of 1024 × 512 pixels; scenes in the images mainly include roads, sidewalks, street trees, road median strips, rainwater grates, and buildings; the rainwater grates are mainly distributed along both sides of roads and upstream of pedestrian crossings, and their area fraction in a whole image is very small, on average only about 5‰. The open-source annotation tool LabelMe is used for image annotation, with label data stored in JSON or XML format. Image enhancement mainly involves denoising, sharpening, and equalization: denoising uses a 3 × 3 median filter to remove salt-and-pepper noise; sharpening uses the Laplacian operator on the 4-neighborhood to highlight object contours; and equalization uses global histogram equalization to keep brightness consistent across the image and improve clarity in some regions.
Preferably, in step (2), the parameters to be initialized mainly include: number of classes CLS_NUM = 1, grid size S = 7, predicted number of boxes per grid cell B = 2, image size IMG_SIZE = 448, batch size BATCH_SIZE = 8, learning rate LR = 2 × 10⁻⁴, loss threshold LOSS_THR = 50, confidence threshold CONF_THR = 0.8, and intersection-over-union threshold IOU_THR = 0.5.
Preferably, in step (3), the image data are resampled by the nearest-neighbor method to the set image size IMG_SIZE and the label data are size-transformed accordingly; after channel normalization the image data become a tensor with values in [-1, 1]; the label data are converted to grid data according to the set grid number S, and the label box coordinates are converted from absolute image coordinates to grid-relative coordinates; the spatial positions of the rainwater grates in the training images are counted and the spatial distribution prior probability map is drawn.
Preferably, in step (4), the backbone network is mainly formed by connecting the Conv1, Block1, Block2, Block3, Block4, and Regressor modules in series; context aggregation network Feature Fusion modules are attached after the Conv1, Block1, and Block4 modules;
in the Conv1 module, a basic convolution operation with kernel size 3 and a max-pooling operation with stride 2 down-sample the input feature map by a factor of 4;
in the basic convolution module Conv, a convolution operation, BatchNorm batch normalization, and ReLU activation form a three-layer network module;
the Block module takes the Fire module of the SqueezeNet model as its basic unit and builds an effective Block by stacking different Fire modules and appending pooling layers; the Fire module comprises a squeeze layer and an expand layer; the squeeze layer is a group of successive 1 × 1 convolutions, and the expand layer is the concatenation of a group of successive 1 × 1 convolutions and a group of successive 3 × 3 convolutions; this particular combination of 1 × 1 and 3 × 3 convolutions greatly reduces the number of parameters; Block1 and Block2 have the same structure, each containing 2 Fire modules and 1 max-pooling layer, and differ only in channel count; after the Block1 and Block2 operations, the spatial size of the input image is reduced to 1/8 and 1/16 of the original size, respectively; Block3 contains 4 Fire modules and 1 max-pooling layer to further increase network depth; after the Block3 operation, the spatial size of the input image is reduced to 1/32 of the original size; Block4 contains 2 Fire modules and 1 global average pooling layer for reducing the feature map to the target feature map size;
in the context aggregation network, the context feature map passes through a 1 × 1 convolution module and a 3 × 3 convolution module, followed by global average pooling and channel aggregation, to form an aggregated feature map; the convolution module again consists of a convolution operation, BatchNorm batch normalization, and ReLU activation; the 1 × 1 convolution reduces the channel count of the context feature map to r times that of the target feature map, with r taking a value in [0.1, 0.5], so that the context information does not overwhelm the target feature map itself; the 3 × 3 convolution performs a first down-sampling; global average pooling performs a direct down-sampling; finally, the context feature maps are channel-aggregated into an aggregated feature map, which is channel-aggregated again with the target feature map and fed to the Regressor for processing;
the Regressor module consists of three serially connected 1 × 1 convolution modules and a Sigmoid function, where each convolution block again consists of a convolution operation, BatchNorm batch normalization, and ReLU activation; after the target feature map is processed by the Regressor, a feature map vector with spatial size S × S and C channels is obtained; similar to the design of the YOLO model, each feature vector contains B predicted boxes and one classification probability; the B predicted boxes are used to predict B targets respectively, overlapping boxes being treated as one target, and each box is a 5-dimensional vector comprising a target existence probability conf, top-left coordinates (x, y), and target size (w, h); if the detection targets comprise CLS_NUM classes, the channel count C is B × 5 + CLS_NUM.
Preferably, in step (5), the optimizer for training the model uses the RMSprop algorithm, and the learning rate decays with the equal-interval StepLR schedule; when the loss function value of the detection model falls below the loss threshold LOSS_THR, training ends and the model parameters are saved.
Preferably, in step (6), the rainwater grate spatial distribution prior probability map is read and binarized to generate a mask map; the mask map is applied to the model-predicted bounding-box results, i.e., the feature map vectors, through the product operation of the M operator, finally obtaining the masked bounding-box results; the M operator is defined as follows:
V'_{i,j}(b, conf) = V_{i,j}(b, conf) × P_{i,j},  b = 1, …, B
where V_{i,j} is the feature vector at row i, column j of the feature map, b is the bounding-box index, conf is the confidence probability of the bounding box, and P_{i,j} is the prior probability value at row i, column j of the mask map.
Preferably, in step (7), the NMS de-duplication first keeps the boxes whose confidence exceeds the threshold CONF_THR, then removes highly overlapping boxes according to the intersection-over-union threshold IOU_THR. The box coordinates are transformed from grid-relative coordinates to absolute image coordinates.
Compared with existing detection methods, the lightweight rainwater grate detection method coupled with a context aggregation network disclosed by the invention has the following beneficial effects:
1) Based on a lightweight convolution module and the Fire module, the invention designs a model structure in which the backbone network is coupled with context information; meanwhile, the prior probability of the spatial distribution of rainwater grates is incorporated into model prediction, significantly improving target detection accuracy. The Average Precision (AP) of the method for rainwater grate detection reaches 0.79, with a Recall of 0.87 and a Precision of 0.91; the accuracy is 12% higher than the YOLO model, 1% higher than the VGG-YOLO model, and 15% higher than the SSD model.
2) The model framework designed by the invention mainly consists of 1 × 1 and 3 × 3 convolution modules, so the model parameters are light: the model weight file is only 13.98 MB, which is only 6% of the YOLO model, 23% of the VGG-YOLO model, and 15% of the SSD model.
3) The detection method designed by the invention has high real-time performance: the lightweight network architecture reduces the consumption of computing resources, shortens model loading and feed-forward time, and increases operating speed. The frame rate (FPS) of the method reaches 56, which is 11 higher than the YOLO model, 45 higher than the VGG-YOLO model, and 22 higher than the SSD model.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is the rainwater grate spatial distribution prior probability map in the embodiment of the present invention;
FIG. 3 is a structural diagram of the rainwater grate detection model designed by the invention;
FIG. 4 shows the structures of the blocks included in the rainwater grate detection model of the present invention;
FIG. 5 is a structural diagram of the context aggregation network included in the rainwater grate detection model of the present invention;
FIG. 6 illustrates the mask processing designed in accordance with the present invention;
FIG. 7 shows actual detection results of the rainwater grate detection method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A light rainwater grate detection method coupled with a context aggregation network is shown in figure 1, and comprises the following specific steps:
(1) Image preprocessing: according to the obtained street-view image data, mark the positions of the rainwater grates in the street-view images with an image annotation tool to generate rainwater grate image-label data. Enhance the street-view images using image processing techniques.
(2) Parameter initialization: initialize the parameters involved in the detection method;
(3) Data set generation: screen the image-label data from step (1) so that image data and label data correspond one to one. According to the initial parameters set in step (2), resize and channel-normalize the images and convert the label data to grid data. Divide the data set into training data and test data according to an empirical ratio of training to test samples. Generate the rainwater grate spatial distribution prior probability map from the training data;
(4) Rainwater grate detection model construction: the model framework mainly comprises a backbone network and a context aggregation network. The backbone network is composed of a lightweight Conv1 module, a series of serially connected Block modules, and a Regressor module. The context aggregation network performs channel aggregation of shallow context feature maps with the deep target feature map, fusing shallow target detail information with deep target semantic information and thereby reducing the attenuation of target information, especially small-target information, during network propagation.
(5) Model training and prediction: according to the initial parameters set in step (2), train the detection model with the training data generated in step (3) until convergence, recording and saving the optimal model weights. After training, load the optimal model weights and use the constructed detection model for rainwater grate prediction;
(6) Mask processing: using the rainwater grate spatial distribution prior probability map, mask the bounding-box results predicted by the model to obtain masked bounding-box results.
(7) Post-processing: according to the initial parameters set in step (2), perform Non-Maximum Suppression (NMS) de-duplication on the boxes from step (6), spatially transform the box coordinates back to absolute image coordinates, and test using the test data obtained in step (3).
Further preferably, in step (1), the obtained street-view image data comprise nearly 2,000 high-definition color images with a resolution of 1024 × 512 pixels. Scenes in the images mainly include roads, sidewalks, street trees, road median strips, buildings, and the like. The rainwater grates are mainly distributed along both sides of roads and upstream of pedestrian crossings, and their area fraction in a whole image is very small (on average only about 5‰). The image annotation tool is not unique; this embodiment uses the open-source tool LabelMe, with label data stored in JSON or XML format. Image enhancement mainly involves denoising, sharpening, and equalization. Denoising uses a 3 × 3 median filter to remove salt-and-pepper noise. Sharpening uses the Laplacian operator on the 4-neighborhood to highlight object contours. Equalization uses global histogram equalization to keep brightness consistent across the image and improve clarity in some regions;
further preferred, inIn the step (2), the parameters to be initialized mainly include: number of classifications CLS _ NUM =1, grid SIZE S =7, predicted number of frames per grid B =2, image SIZE IMG _ SIZE =448, BATCH SIZE BATCH _ SIZE =8, and learning rate LR =2 × 10-4LOSS threshold LOSS _ THR =50, confidence threshold CONF _ THR =0.8, intersection ratio threshold IOU _ THR =0.5, and so on;
further preferably, in the step (3), the image data is resampled by a nearest neighbor method according to the set image SIZE IMG _ SIZE and the label data is SIZE-converted. The image data can be converted into a tensor with a value range of [ -1,1] after being normalized by a channel. And converting the label data into grid data according to the set grid number S, and simultaneously converting the frame coordinates of the label from image absolute coordinates into grid relative coordinates. The training sample to test sample ratio was maintained at 8: 2. According to the training data, the spatial position information of the rain grate in the image is counted, and a spatial distribution prior probability graph of the rain grate is drawn, wherein the spatial distribution prior probability graph is shown in figure 2;
further preferably, in the step (4), the backbone network is composed of a Conv1 module, a Block1 module, a Block2 module, a Block3 module, a Block4 module, a Regressor module, and the like, which are connected in series, as shown in fig. 3. The context aggregation network Feature Fusion is connected behind the Conv1 module, the Block1 module and the Block4 module, as shown in fig. 3.
In the Conv1 module, a basic convolution (Conv) operation with kernel size 3 and a max-pooling operation with stride 2 down-sample the input feature map by a factor of 4, as shown in FIG. 4a.
In the basic convolution module Conv, a convolution operation, BatchNorm batch normalization, and ReLU activation form a three-layer network module, as shown in FIG. 4f.
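In PyTorch terms, this basic convolution module can be sketched as follows; the class name and default arguments are illustrative (the code sketches below reuse this module):

```python
import torch.nn as nn

class ConvBNReLU(nn.Sequential):
    """Basic Conv module of FIG. 4f: convolution -> BatchNorm -> ReLU (a sketch)."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__(
            nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
```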
The Block module takes the Fire module of the SqueezeNet model (Iandola et al., 2017) as its basic unit and builds an effective Block by stacking different Fire modules and appending pooling layers. The Fire module comprises a squeeze layer and an expand layer: the squeeze layer is a group of successive 1 × 1 convolutions, and the expand layer is the concatenation of a group of successive 1 × 1 convolutions and a group of successive 3 × 3 convolutions. This particular combination of 1 × 1 and 3 × 3 convolutions considerably reduces the number of parameters. In the present invention, Block1 and Block2 have the same structure, each containing 2 Fire modules and 1 max-pooling layer; they differ only in channel count, as shown in FIG. 4b. After the Block1 and Block2 operations, the spatial size of the input image is reduced to 1/8 and 1/16 of its original size, respectively. Block3 contains 4 Fire modules and 1 max-pooling layer to further increase network depth, as shown in FIG. 4c; after the Block3 operation, the spatial size is reduced to 1/32 of the original. Block4 contains 2 Fire modules and 1 global average pooling layer to reduce the feature map to the target feature map size, as shown in FIG. 4d.
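A sketch of the Fire module, reusing the ConvBNReLU module above; the channel widths are not fixed by the patent and are left as parameters:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet-style Fire module (a sketch): a 1x1 squeeze followed by parallel
    1x1 and 3x3 expands whose outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = ConvBNReLU(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = ConvBNReLU(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = ConvBNReLU(squeeze_ch, expand_ch, kernel_size=3)

    def forward(self, x):
        x = self.squeeze(x)
        return torch.cat([self.expand1x1(x), self.expand3x3(x)], dim=1)
```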
In the context aggregation network, the context feature map passes through a 1 × 1 convolution module and a 3 × 3 convolution module, followed by global average pooling and channel aggregation, to form an aggregated feature map, as shown in FIG. 5. The convolution modules again consist of a convolution operation, BatchNorm batch normalization, and ReLU activation, as shown in FIG. 4f. The 1 × 1 convolution reduces the channel count of the context feature map to r times that of the target feature map, with r taking a value in [0.1, 0.5]; this ensures that the context information does not overwhelm the target feature map itself. The 3 × 3 convolution performs a first down-sampling, and global average pooling performs a direct down-sampling. Finally, the context feature maps are channel-aggregated into an aggregated feature map, which is channel-aggregated again with the target feature map and fed to the Regressor for processing.
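The fusion path can be sketched as follows, again reusing ConvBNReLU; the stride of the 3 × 3 convolution, the default ratio r = 0.25, and the use of adaptive average pooling to match the target spatial size are assumptions consistent with the description above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """Context aggregation sketch: reduce a shallow context map to r times the
    target channel count, down-sample it to the target spatial size, concatenate."""
    def __init__(self, ctx_ch, tgt_ch, r=0.25):
        super().__init__()
        reduced = max(1, int(tgt_ch * r))   # r in [0.1, 0.5] per the description
        self.reduce = ConvBNReLU(ctx_ch, reduced, kernel_size=1)
        self.down = ConvBNReLU(reduced, reduced, kernel_size=3, stride=2)

    def forward(self, context, target):
        ctx = self.down(self.reduce(context))               # 1x1 reduce, 3x3 down-sample
        ctx = F.adaptive_avg_pool2d(ctx, target.shape[2:])  # average-pool to target size
        return torch.cat([ctx, target], dim=1)              # channel aggregation
```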
The Regressor module consists of three serially connected 1 × 1 convolution modules and a Sigmoid function, as shown in FIG. 4e, where each convolution block again consists of a convolution operation, BatchNorm batch normalization, and ReLU activation, as shown in FIG. 4f. After the target feature map is processed by the Regressor, a feature map vector with spatial size S × S and C channels is obtained. Similar to the design of the YOLO model (Redmon et al., 2016), each feature vector contains B predicted boxes and one classification probability. The B predicted boxes are used to predict B targets (overlapping boxes are treated as one target), and each box is a 5-dimensional vector comprising a target existence probability conf, top-left coordinates (x, y), and target size (w, h). If the detection targets comprise CLS_NUM classes, the channel count C is B × 5 + CLS_NUM;
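A sketch of the Regressor head; the hidden channel width is an assumption, and the final 1 × 1 convolution here omits BatchNorm/ReLU before the Sigmoid, a common practical choice rather than something the patent prescribes:

```python
import torch.nn as nn

class Regressor(nn.Module):
    """Regression head sketch: serial 1x1 conv modules and a final Sigmoid,
    producing an S x S map with C = B*5 + CLS_NUM channels."""
    def __init__(self, in_ch, b=2, cls_num=1, hidden=256):
        super().__init__()
        c = b * 5 + cls_num                  # conf, x, y, w, h per box + class probs
        self.head = nn.Sequential(
            ConvBNReLU(in_ch, hidden, kernel_size=1),
            ConvBNReLU(hidden, hidden, kernel_size=1),
            nn.Conv2d(hidden, c, kernel_size=1),
            nn.Sigmoid(),                    # squash all outputs to (0, 1)
        )

    def forward(self, x):
        return self.head(x)                  # shape: (N, C, S, S)
```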
further preferably, in the step (5), the RMSprop algorithm is used by the optimizer for training the model, and the StepLR strategy is adjusted at equal intervals for learning rate attenuation. And when the LOSS function value of the detection model is lower than the LOSS threshold LOSS _ THR, finishing training and storing the model parameters.
Further preferably, in step (6), the rainwater grate spatial distribution prior probability map is read and binarized to generate the mask map. The mask map is applied to the model-predicted bounding-box results (i.e., the feature map vectors) through the product operation of the M operator, yielding the masked bounding-box results, as shown in FIG. 6. The M operator is defined as follows:
Figure 94184DEST_PATH_IMAGE001
where V_{i,j} is the feature vector at row i, column j of the feature map, b is the bounding-box index, conf is the confidence probability of the bounding box, and P_{i,j} is the prior probability value at row i, column j of the mask map.
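A sketch of the mask step, under the assumption that each of the B boxes occupies five consecutive channels laid out as [conf, x, y, w, h]:

```python
import torch

def apply_prior_mask(pred, prior, b=2, thr=0.5):
    """Scale each box confidence by the binarized spatial prior (a sketch).

    pred:  (S, S, C) feature map, boxes laid out as [conf, x, y, w, h] per box
    prior: (S, S) rainwater grate spatial distribution prior probability map
    """
    mask = (prior >= thr).float()                   # image binarization of the prior
    for k in range(b):
        pred[..., k * 5] = pred[..., k * 5] * mask  # conf channel of box k
    return pred
```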
Further preferably, in step (7), the NMS de-duplication first keeps the boxes whose confidence exceeds the threshold CONF_THR, then removes highly overlapping boxes according to the intersection-over-union threshold IOU_THR. The box coordinates are transformed from grid-relative coordinates to absolute image coordinates.
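A sketch of this post-processing using torchvision's NMS operator; the tensor layout is an assumption:

```python
import torch
from torchvision.ops import nms

def postprocess(boxes, scores, conf_thr=0.8, iou_thr=0.5):
    """Confidence screening followed by NMS de-duplication (a sketch).

    boxes:  (N, 4) absolute image coordinates (x1, y1, x2, y2)
    scores: (N,) box confidence probabilities
    """
    keep = scores >= conf_thr                # screen boxes above CONF_THR
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thr)        # drop highly overlapping boxes
    return boxes[idx], scores[idx]
```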
After the above steps, the actual effect of the detection method on rainwater grates is shown in FIG. 7 (black boxes are detected values, white boxes are ground-truth values). It can be seen that in both scenes the detection method designed by the invention detects the rainwater grates well.

Claims (8)

1. A lightweight rainwater grate detection method coupled with a context aggregation network, characterized by comprising the following steps:
(1) image preprocessing: according to the obtained street-view image data, mark the positions of rainwater grates in the street-view images with an image annotation tool to generate rainwater grate image-label data; enhance the street-view images using image processing techniques;
(2) parameter initialization: initialize the parameters involved in the detection method;
(3) data set generation: screen the image-label data from step 1 so that image data and label data correspond one to one; according to the initial parameters set in step 2, resize and channel-normalize the images and convert the label data to grid data; divide the data set into training data and test data according to an empirical ratio of training to test samples; generate the rainwater grate spatial distribution prior probability map from the training data;
(4) rainwater grate detection model construction: the model framework mainly comprises a backbone network and a context aggregation network, the backbone network being composed of a lightweight Conv1 module, a series of serially connected Block modules, and a Regressor module; the context aggregation network performs channel aggregation of shallow context feature maps with the deep target feature map, fusing shallow target detail information with deep target semantic information and thereby reducing the attenuation of target information, especially small-target information, during network propagation;
(5) model training and prediction: according to the initial parameters set in step 2, train the detection model with the training data generated in step 3 until convergence, recording and saving the optimal model weights; after training, load the optimal model weights and use the detection model constructed in step 4 for rainwater grate prediction;
(6) mask processing: using the rainwater grate spatial distribution prior probability map, mask the bounding-box results predicted by the model to obtain masked bounding-box results;
(7) post-processing: according to the initial parameters set in step 2, perform non-maximum suppression de-duplication on the boxes from step 6, spatially transform the box coordinates back to absolute image coordinates, and test using the test data obtained in step 3.
2. The lightweight rainwater grate detection method coupled with a context aggregation network according to claim 1, characterized in that: in step (1), the obtained street-view image data comprise nearly 2,000 high-definition color images with a resolution of 1024 × 512 pixels; scenes in the images mainly include roads, sidewalks, street trees, road median strips, rainwater grates, and buildings; the rainwater grates are mainly distributed along both sides of roads and upstream of pedestrian crossings, and their area fraction in a whole image is very small, on average only about 5‰; the open-source annotation tool LabelMe is used for image annotation, with label data stored in JSON or XML format; image enhancement mainly involves denoising, sharpening, and equalization: denoising uses a 3 × 3 median filter to remove salt-and-pepper noise, sharpening uses the Laplacian operator on the 4-neighborhood to highlight object contours, and equalization uses global histogram equalization to keep brightness consistent across the image and improve clarity in some regions.
3. The lightweight rainwater grate detection method coupled with a context aggregation network according to claim 1, characterized in that: in step (2), the parameters to be initialized mainly include: number of classes CLS_NUM = 1, grid size S = 7, predicted number of boxes per grid cell B = 2, image size IMG_SIZE = 448, batch size BATCH_SIZE = 8, learning rate LR = 2 × 10⁻⁴, loss threshold LOSS_THR = 50, confidence threshold CONF_THR = 0.8, and intersection-over-union threshold IOU_THR = 0.5.
4. The lightweight rainwater grate detection method coupled with a context aggregation network according to claim 1, characterized in that: in step (3), the image data are resampled by the nearest-neighbor method to the set image size IMG_SIZE and the label data are size-transformed accordingly; after channel normalization the image data become a tensor with values in [-1, 1]; the label data are converted to grid data according to the set grid number S, and the label box coordinates are converted from absolute image coordinates to grid-relative coordinates; the spatial positions of the rainwater grates in the training images are counted and the spatial distribution prior probability map is drawn.
5. The lightweight rainwater grate detection method coupled with a context aggregation network according to claim 1, characterized in that: in step (4), the backbone network is mainly formed by connecting the Conv1, Block1, Block2, Block3, Block4, and Regressor modules in series; context aggregation network Feature Fusion modules are attached after the Conv1, Block1, and Block4 modules;
in the Conv1 module, a basic convolution operation with kernel size 3 and a max-pooling operation with stride 2 down-sample the input feature map by a factor of 4;
in the basic convolution module Conv, a convolution operation, BatchNorm batch normalization, and ReLU activation form a three-layer network module;
the Block module takes the Fire module of the SqueezeNet model as its basic unit and builds an effective Block by stacking different Fire modules and appending pooling layers; the Fire module comprises a squeeze layer and an expand layer; the squeeze layer is a group of successive 1 × 1 convolutions, and the expand layer is the concatenation of a group of successive 1 × 1 convolutions and a group of successive 3 × 3 convolutions; this particular combination of 1 × 1 and 3 × 3 convolutions greatly reduces the number of parameters; Block1 and Block2 have the same structure, each containing 2 Fire modules and 1 max-pooling layer, and differ only in channel count; after the Block1 and Block2 operations, the spatial size of the input image is reduced to 1/8 and 1/16 of the original size, respectively; Block3 contains 4 Fire modules and 1 max-pooling layer to further increase network depth; after the Block3 operation, the spatial size of the input image is reduced to 1/32 of the original size; Block4 contains 2 Fire modules and 1 global average pooling layer for reducing the feature map to the target feature map size;
in the context aggregation network, the context feature map passes through a 1 × 1 convolution module and a 3 × 3 convolution module, followed by global average pooling and channel aggregation, to form an aggregated feature map; the convolution module again consists of a convolution operation, BatchNorm batch normalization, and ReLU activation; the 1 × 1 convolution reduces the channel count of the context feature map to r times that of the target feature map, with r taking a value in [0.1, 0.5], so that the context information does not overwhelm the target feature map itself; the 3 × 3 convolution performs a first down-sampling; global average pooling performs a direct down-sampling; finally, the context feature maps are channel-aggregated into an aggregated feature map, which is channel-aggregated again with the target feature map and fed to the Regressor for processing;
the Regressor module consists of three serially connected 1 × 1 convolution modules and a Sigmoid function, where each convolution block again consists of a convolution operation, BatchNorm batch normalization, and ReLU activation; after the target feature map is processed by the Regressor, a feature map vector with spatial size S × S and C channels is obtained; similar to the design of the YOLO model, each feature vector contains B predicted boxes and one classification probability; the B predicted boxes are used to predict B targets respectively, overlapping boxes being treated as one target, and each box is a 5-dimensional vector comprising a target existence probability conf, top-left coordinates (x, y), and target size (w, h); if the detection targets comprise CLS_NUM classes, the channel count C is B × 5 + CLS_NUM.
6. The lightweight rainwater grate detection method coupled with a context aggregation network according to claim 1, characterized in that: in step (5), the optimizer for training the model uses the RMSprop algorithm, and the learning rate decays with the equal-interval StepLR schedule; when the loss function value of the detection model falls below the loss threshold LOSS_THR, training ends and the model parameters are saved.
7. The lightweight rainwater grate detection method coupled with a context aggregation network according to claim 1, characterized in that: in step (6), the rainwater grate spatial distribution prior probability map is read and binarized to generate a mask map; the mask map is applied to the model-predicted bounding-box results, i.e., the feature map vectors, through the product operation of the M operator, finally obtaining the masked bounding-box results; the M operator is defined as follows:
V'_{i,j}(b, conf) = V_{i,j}(b, conf) × P_{i,j},  b = 1, …, B
where V_{i,j} is the feature vector at row i, column j of the feature map, b is the bounding-box index, conf is the confidence probability of the bounding box, and P_{i,j} is the prior probability value at row i, column j of the mask map.
8. The lightweight rainwater grate detection method coupled with a context aggregation network according to claim 1, characterized in that:
in step (7), the NMS de-duplication first keeps the boxes whose confidence exceeds the threshold CONF_THR, then removes highly overlapping boxes according to the intersection-over-union threshold IOU_THR; the box coordinates are transformed from grid-relative coordinates to absolute image coordinates.
CN202111102992.5A 2021-09-17 2021-09-17 Lightweight rainwater grate detection method coupled with context aggregation network Active CN113837058B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111102992.5A CN113837058B (en) 2021-09-17 2021-09-17 Lightweight rainwater grate detection method coupled with context aggregation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111102992.5A CN113837058B (en) 2021-09-17 2021-09-17 Lightweight rainwater grate detection method coupled with context aggregation network

Publications (2)

Publication Number Publication Date
CN113837058A true CN113837058A (en) 2021-12-24
CN113837058B CN113837058B (en) 2022-09-30

Family

ID=78960079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111102992.5A Active CN113837058B (en) 2021-09-17 2021-09-17 Lightweight rainwater grate detection method coupled with context aggregation network

Country Status (1)

Country Link
CN (1) CN113837058B (en)


Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800628A (en) * 2018-12-04 2019-05-24 华南理工大学 A kind of network structure and detection method for reinforcing SSD Small object pedestrian detection performance
CN110956119A (en) * 2019-11-26 2020-04-03 大连理工大学 Accurate and rapid target detection method in image
CN111144376A (en) * 2019-12-31 2020-05-12 华南理工大学 Video target detection feature extraction method
CN111666836A (en) * 2020-05-22 2020-09-15 北京工业大学 High-resolution remote sensing image target detection method of M-F-Y type lightweight convolutional neural network
CN111898410A (en) * 2020-06-11 2020-11-06 东南大学 Face detection method based on context reasoning under unconstrained scene
CN112232232A (en) * 2020-10-20 2021-01-15 城云科技(中国)有限公司 Target detection method
CN112329658A (en) * 2020-11-10 2021-02-05 江苏科技大学 Method for improving detection algorithm of YOLOV3 network
CN112396002A (en) * 2020-11-20 2021-02-23 重庆邮电大学 Lightweight remote sensing target detection method based on SE-YOLOv3
CN112418117A (en) * 2020-11-27 2021-02-26 北京工商大学 Small target detection method based on unmanned aerial vehicle image
CN112446388A (en) * 2020-12-05 2021-03-05 天津职业技术师范大学(中国职业培训指导教师进修中心) Multi-category vegetable seedling identification method and system based on lightweight two-stage detection model
CN112818862A (en) * 2021-02-02 2021-05-18 南京邮电大学 Face tampering detection method and system based on multi-source clues and mixed attention
CN113011336A (en) * 2021-03-19 2021-06-22 厦门大学 Real-time street view image semantic segmentation method based on deep multi-branch aggregation
CN113191296A (en) * 2021-05-13 2021-07-30 中国人民解放军陆军炮兵防空兵学院 Method for detecting five parameters of target in any orientation based on YOLOV5
CN113378890A (en) * 2021-05-17 2021-09-10 浙江工业大学 Lightweight pedestrian and vehicle detection method based on improved YOLO v4

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JIAHUI SUN et al.: "AS-YOLO: An Improved YOLOv4 based on Attention Mechanism and SqueezeNet for Person Detection", 2021 IEEE 5th Advanced Information Technology, Electronic and Automation Control Conference (IAEAC) *
YAZHOU LIU et al.: "Modular Lightweight Network for Road Object Detection Using a Feature Fusion Approach", IEEE Transactions on Systems, Man, and Cybernetics: Systems *
WANG Bing et al.: "Mask detection algorithm based on an improved lightweight YOLO network", Computer Engineering and Applications *
CHEN Chunlin: "Research on object detection algorithms in complex scenes", China Doctoral Dissertations Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN113837058B (en) 2022-09-30

Similar Documents

Publication Publication Date Title
Chen et al. Distribution line pole detection and counting based on YOLO using UAV inspection line video
CN111598030B (en) Method and system for detecting and segmenting vehicle in aerial image
CN110263706B (en) Method for detecting and identifying dynamic target of vehicle-mounted video in haze weather
Han et al. A point-based deep learning network for semantic segmentation of MLS point clouds
CN111899227A (en) Automatic railway fastener defect acquisition and identification method based on unmanned aerial vehicle operation
CN111527467A (en) Method and apparatus for automatically defining computer-aided design files using machine learning, image analysis, and/or computer vision
Jiang et al. Deep neural networks-based vehicle detection in satellite images
Dai et al. Residential building facade segmentation in the urban environment
CN109902676B (en) Dynamic background-based violation detection algorithm
CN104036323A (en) Vehicle detection method based on convolutional neural network
CN105405138A (en) Water surface target tracking method based on saliency detection
CN105574545A (en) Environment image multi-view-angle meaning cutting method and device
CN114708566A (en) Improved YOLOv 4-based automatic driving target detection method
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
Cheng et al. Semantic segmentation of road profiles for efficient sensing in autonomous driving
Jiang et al. Remote sensing object detection based on convolution and Swin transformer
CN114581307A (en) Multi-image stitching method, system, device and medium for target tracking identification
CN114399734A (en) Forest fire early warning method based on visual information
CN113673616B (en) Light-weight small target detection method coupling attention and context
CN113837058B (en) Lightweight rainwater grate detection method coupled with context aggregation network
CN116958911A (en) Traffic monitoring image target detection method oriented to severe weather
Huang et al. Detection of river floating debris in uav images based on improved yolov5
CN116152696A (en) Intelligent security image identification method and system for industrial control system
CN115546667A (en) Real-time lane line detection method for unmanned aerial vehicle scene
Samadzadegan et al. Automatic Road Crack Recognition Based on Deep Learning Networks from UAV Imagery

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant