CN116310764B - Intelligent detection method and system for road surface well lid - Google Patents

Intelligent detection method and system for road surface well lid

Info

Publication number
CN116310764B
CN116310764B CN202310562328.1A CN202310562328A CN116310764B CN 116310764 B CN116310764 B CN 116310764B CN 202310562328 A CN202310562328 A CN 202310562328A CN 116310764 B CN116310764 B CN 116310764B
Authority
CN
China
Prior art keywords
feature map
feature
well lid
output
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310562328.1A
Other languages
Chinese (zh)
Other versions
CN116310764A (en)
Inventor
张傲南
何安正
张航
董子硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202310562328.1A
Publication of CN116310764A
Application granted
Publication of CN116310764B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an intelligent detection method and system for a road surface well lid, relating to the technical field of road surface well lid detection. The method comprises: constructing a well lid target detection model that takes a road surface well lid image as the input image, takes an SSD network as the baseline, replaces the feature extraction network in the SSD network with a symmetrical feature pyramid network and an additional feature extraction layer, and retains the convolution predictor of the SSD network; acquiring historical road surface well lid images and constructing a training data set from them; inputting the training data set into the well lid target detection model and training the model; and classifying and positioning well lid targets in real-time road surface well lid images with the trained well lid target detection model. The invention improves the recognition precision and reliability for road surface well lids, recognizes and positions three conditions (a well lid in a healthy state, a collapsed well lid, and a missing well lid), and provides the data support needed for road traffic asset investigation.

Description

Intelligent detection method and system for road surface well lid
Technical Field
The invention belongs to the technical field of road pavement detection, in particular the field of pavement well lid detection, and specifically relates to an intelligent pavement well lid detection method and system.
Background
With the rapid development of China's economy, road mileage at every grade keeps increasing and the supporting infrastructure keeps improving, so the workload of operating, surveying, and maintaining these roads grows continuously. Road manhole covers, one of the main items of road traffic assets, particularly on municipal roads, play a large role in road management functions such as sewage discharge, cabling, fire protection, and the supply of electricity and natural gas. However, under frequent heavy traffic loads, human damage, and other external factors, well lid collapse, cracking of the surrounding concrete, and loss of the well lid occur often, leading to undrained surface water, damaged optical cables, pedestrians falling into wells, traffic accidents, and similar harms that seriously affect public property and personal safety. Regular inspection of road well lids therefore makes it possible to know their structural health in time and to avoid the serious safety hazards to vehicles, pedestrians, and optical cable facilities caused by the theft or damage of well lids.
Traditional detection of road surface well lids by manual visual survey is slow, inefficient, and costly, and it also suffers from strong subjectivity and safety risks, so it clearly cannot meet the rapidly growing demand for detection work. At present, deep-learning-based target detection and recognition methods are used across many industries to solve a wide range of problems. In the field of road surface detection, manhole cover target detection methods based on convolutional neural networks have achieved some success: they identify the well lids in images and confirm their state category, the common categories being a well lid in a healthy state, well lid collapse, and a missing well lid. In practice, however, these methods show the following drawback: in a real road environment, damaged concrete around the well lid, well lid collapse, ponding, various occlusions, and randomly occurring road noise make accurate quantitative evaluation of road well lids very difficult for existing convolutional neural network algorithms, producing many misidentifications and inaccurate target boundary localizations, a low recognition rate, and poor robustness. Convolutional neural networks for manhole cover target detection therefore still need further improvement.
Disclosure of Invention
In view of the above, the invention provides an intelligent detection method and system for the road well lid to solve the technical problems of the low recognition rate and poor robustness of conventional convolutional neural networks in road well lid target detection.
The aim of the invention is realized by the following technical scheme:
first aspect
The first aspect of the invention provides an intelligent detection method for a road well lid, which comprises the following steps:
constructing a well lid target detection model taking a road surface well lid image as an input image, wherein the well lid target detection model takes an SSD network as a baseline, replaces the feature extraction network in the SSD network with a symmetrical feature pyramid network and an additional feature extraction layer, and retains the convolution predictor in the SSD network;
acquiring historical road surface well lid images, and constructing a training data set by using the historical road surface well lid images;
inputting the training data set into a well lid target detection model, and training and optimizing parameters of the well lid target detection model;
performing well lid target classification and positioning on the real-time road well lid image by using the trained well lid target detection model;
the symmetrical feature pyramid network is used for extracting features of an input image to obtain a plurality of feature graphs with different scales; the additional feature extraction layer performs multi-stage downsampling on the feature graphs output by the symmetrical feature pyramid network to obtain a plurality of feature graphs with different scales in a one-to-one correspondence manner; the convolution predictor is used for classifying and positioning the well lid targets in all the feature diagrams.
Preferably, the construction process of the symmetrical feature pyramid network specifically comprises the following steps:
determining the scale of each level in the symmetrical feature pyramid network according to the size of the input image and the scale of the output feature map, wherein each level has different receptive fields;
feature fusion is carried out between two adjacent layers through path aggregation;
when feature fusion is carried out through path aggregation, up-sampling calculation is performed on the feature map output by the one of the two adjacent levels that extracts low-level features, feature selection for characterizing spatial details is performed on the feature map output by the other level, and the feature map output after the feature selection is fused with the feature map output after the up-sampling calculation, as sketched below.
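As an illustration only, this fusion step can be sketched in PyTorch (an assumed framework; the patent itself prescribes no implementation). The smaller-scale map is enlarged by bilinear upsampling while the lateral map passes through the feature selection module, and the two results are concatenated along the channel dimension:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_levels(deep_fm: torch.Tensor, lateral_fm: torch.Tensor,
                fsm: nn.Module) -> torch.Tensor:
    # Bilinearly upsample the smaller map by a factor of 2.
    up = F.interpolate(deep_fm, scale_factor=2, mode="bilinear",
                       align_corners=False)
    # Feature selection of spatial details on the other level's map.
    selected = fsm(lateral_fm)
    # "Concate" fusion: concatenate along the channel dimension.
    return torch.cat([up, selected], dim=1)
```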
Preferably, the symmetrical feature pyramid network comprises an image input layer, a first feature output layer, a first upsampling layer, a second upsampling layer and a third upsampling layer which are sequentially connected from top to bottom, and a first downsampling layer, a second downsampling layer and a third downsampling layer which are sequentially connected from bottom to top;
the image input layer is used for obtaining a first feature map after downsampling an input image;
the first upsampling layer is used for performing a first convolution operation on the first feature map to obtain a feature map F1;
The second up-sampling layer is used for carrying out first convolution operation on the feature map after carrying out maximum value pooling calculation on the feature map F1 to obtain a feature map F2;
the third upsampling layer is used for carrying out a first convolution operation on the feature map after carrying out maximum value pooling calculation on the feature map F2 to obtain a feature map F3;
performing bilinear upsampling calculation on the feature map F3, inputting the feature map F2 into a first FSM module, and performing first-time Concate feature fusion on the output obtained after the bilinear upsampling calculation and the output of the first FSM module to obtain a feature map F4;
performing bilinear upsampling calculation on the feature map F4, inputting the feature map F1 into a second FSM module, and performing second-time Concate feature fusion on the output obtained after the bilinear upsampling calculation and the output of the second FSM module to obtain a feature map F5;
performing a second convolution operation on the feature map F5 to obtain a feature map F6;
performing maximum value pooling calculation on the feature map F6 and performing second convolution operation on the feature map F4, and performing third-time Concate feature fusion on the output obtained after the maximum value pooling calculation and the output obtained after the second convolution operation to obtain a feature map F7;
performing a second convolution operation on the feature map F3, performing maximum value pooling calculation on the feature map F7, and performing fourth Concate feature fusion on the output after the second convolution operation and the output after the maximum value pooling to obtain a feature map F8;
The first downsampling layer is used for performing a first convolution operation on the feature map F8 to obtain a feature map F9;
performing bilinear upsampling calculation on the feature map F9, inputting the feature map F7 into a third FSM module, and performing fifth-time Concate feature fusion on the output obtained after the bilinear upsampling calculation and the output of the third FSM module to obtain a feature map F10;
the second downsampling layer is used for performing a first convolution operation on the feature map F10 to obtain a feature map F11;
performing bilinear upsampling calculation on the feature map F11, inputting the feature map F6 into a fourth FSM module, performing sixth Concate feature fusion on the output obtained after the bilinear upsampling calculation and the output of the fourth FSM module, inputting the feature map obtained after the sixth Concate feature fusion into a third downsampling layer,
the third downsampling layer is used for performing a first convolution operation on the feature map after the sixth Concate feature fusion to obtain a feature map F12;
and respectively inputting the feature map F12, the feature map F11 and the feature map F9 into a first feature output layer, wherein the first feature output layer is used for respectively performing a third convolution operation on the feature map F12, the feature map F11 and the feature map F9 to obtain feature maps FM1, FM2 and FM3 with different scales in a one-to-one correspondence manner.
Preferably, when the additional feature extraction layer performs multi-stage downsampling on the feature map output by the symmetrical feature pyramid network, residual connection is introduced between two adjacent downsampling stages.
Preferably, the additional feature extraction layer comprises a second feature output layer, a first convolution layer, a second convolution layer and a third convolution layer which are sequentially connected from top to bottom;
the first convolution layer is used for performing the fourth convolution operation on the feature map F9, then performing the third convolution operation on that output, and making a first residual connection between the output of the third convolution operation and the output of the fourth convolution operation on the feature map F9, obtaining a feature map F13 after the first residual connection;
performing a fourth convolution operation on the feature map F13, performing a second residual error connection on the output of the fourth convolution operation and the output of the fourth convolution operation on the feature map F9, and inputting the feature map obtained after the second residual error connection into a second convolution layer;
the second convolution layer is used for carrying out third convolution operation on the feature map obtained after the second residual error connection to obtain a feature map F14;
performing fifth convolution operation on the feature map F14, performing third residual connection on the output of the fifth convolution operation and the feature map obtained after the second residual connection, and inputting the feature map obtained after the third residual connection into a third convolution layer;
The third convolution layer is used for performing a third convolution operation on the feature map obtained after the third residual connection to obtain a feature map F15;
and respectively inputting the feature map F13, the feature map F14 and the feature map F15 into a second feature output layer, wherein the second feature output layer is used for respectively performing a third convolution operation on the feature map F13, the feature map F14 and the feature map F15 to obtain feature maps FM4, FM5 and FM6 with different scales in a one-to-one correspondence manner.
Preferably, the size of the input image is 320×320×3; the first convolution operation is two 3×3 convolutions with stride 1, the second convolution operation is one 1×1 convolution with stride 1, the third convolution operation is one 3×3 convolution with stride 1, the fourth convolution operation is one 3×3 convolution with stride 2, and the fifth convolution operation is one unpadded 3×3 convolution with stride 1; the factor of the bilinear upsampling calculation is 2 and the pooling kernel of all maximum pooling calculations is 2; the scale of the feature map FM1 is 40×40×1024, the scale of the feature map FM2 is 20×20×512, the scale of the feature map FM3 is 10×10×512, the scale of the feature map FM4 is 5×5×256, the scale of the feature map FM5 is 3×3×256, and the scale of the feature map FM6 is 1×1×256. A minimal mapping of these named operations to concrete layers is sketched below.
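The sketch assumes PyTorch; the channel counts are free parameters and the helper names are illustrative only:

```python
import torch.nn as nn

def conv_bn_relu(cin, cout, k, stride, padding):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=stride, padding=padding, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

def first_conv(cin, cmid, cout):   # two 3x3 convolutions, stride 1
    return nn.Sequential(conv_bn_relu(cin, cmid, 3, 1, 1),
                         conv_bn_relu(cmid, cout, 3, 1, 1))

def second_conv(cin, cout):        # one 1x1 convolution, stride 1
    return conv_bn_relu(cin, cout, 1, 1, 0)

def third_conv(cin, cout):         # one 3x3 convolution, stride 1
    return conv_bn_relu(cin, cout, 3, 1, 1)

def fourth_conv(cin, cout):        # one 3x3 convolution, stride 2
    return conv_bn_relu(cin, cout, 3, 2, 1)

def fifth_conv(cin, cout):         # one unpadded ("valid") 3x3, stride 1
    return conv_bn_relu(cin, cout, 3, 1, 0)
```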
Preferably, the feature selection for characterizing spatial details on the feature map output by the other level is specifically:
defining a feature map output by another level as an original feature map, aggregating spatial information of the original feature map in a maximum value pooling mode, and generating a second feature map after aggregating the spatial information;
aggregating the spatial information of the original feature map in an average value pooling mode, and generating a third feature map after aggregating the spatial information;
respectively feeding the second feature map and the third feature map into a shared multi-layer perceptron, adding the output of the second feature map after the multi-layer perceptron and the output of the third feature map after the multi-layer perceptron matrix element by element, activating and outputting the sum to obtain an importance vector, and scaling the original feature map with the importance vector;
adding the feature map obtained after scaling to the original feature map by adopting skip connection to obtain a fourth feature map;
and performing a 1×1 convolution with stride 1 and a BN normalization calculation on the fourth feature map to obtain a fifth feature map.
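The feature selection steps above amount to a channel-attention block. A minimal sketch in PyTorch, assuming a reduction ratio of 16 inside the shared multi-layer perceptron (the ratio and the framework are assumptions; the patent fixes neither):

```python
import torch
import torch.nn as nn

class FSM(nn.Module):
    """Feature Selecting Module sketch: aggregate spatial information with
    global max/average pooling, derive a channel importance vector through
    a shared MLP, rescale the input, add a skip connection, and finish
    with a stride-1 1x1 convolution plus BN."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(            # shared multi-layer perceptron
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.post = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1, stride=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        gmp = torch.amax(x, dim=(2, 3))      # second feature map (max pooled)
        gap = torch.mean(x, dim=(2, 3))      # third feature map (avg pooled)
        u = torch.sigmoid(self.mlp(gmp) + self.mlp(gap))  # importance vector
        scaled = x * u.view(b, c, 1, 1)      # scale the original feature map
        return self.post(scaled + x)         # skip connection, then 1x1 + BN
```

The skip connection preserves the original features while the importance vector re-weights the channels, which is what lets the module aggregate spatial details without discarding the input.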
Preferably, the road surface well lid image is acquired by vehicle-mounted 3D cameras mounted at the left and right ends of the vehicle tail; the road surface well lid image is preprocessed before being used as the input image of the well lid target detection model, and the preprocessing specifically comprises:
splicing the left-side road surface well lid image collected by the vehicle-mounted 3D camera mounted at the left end of the vehicle tail and the right-side road surface well lid image collected by the vehicle-mounted 3D camera mounted at the right end of the vehicle tail left-to-right, thereby forming the road surface well lid image;
and performing size reduction on the road surface well lid image.
Preferably, the classification categories when classifying well lid targets in all feature maps include the health-state well lid, well lid collapse, well lid missing, and background.
The first aspect of the invention has the following beneficial effects:
(1) When the well lid target detection model is constructed, the feature extraction network is redesigned on the basis of the design framework of the original SSD network. The new feature extraction network comprises two newly designed modules: a symmetrical feature pyramid network and an additional feature extraction layer. The symmetrical feature pyramid network merges the end-to-end, pyramid, and path aggregation design ideas and performs feature fusion across scales several times in the order pyramid, path aggregation, pyramid, so that the obtained feature maps of different scales all contain rich positioning information and semantic information features. The additional feature extraction layer uses multi-stage downsampling to reduce the size of the feature map output by the symmetrical feature pyramid network, so that the obtained small-scale feature maps retain more positioning information features. In the stage of regression positioning of the road surface well lid, the feature maps of different scales obtained by the feature extraction network are input together into the original convolution predictor of the SSD network, and small convolution filters of different sizes and numbers are applied to the feature maps of different scales to generate a fixed-size set of prediction bounding boxes with one-to-one corresponding category scores, finally realizing accurate classification and positioning of the well lid target and improving its recognition rate;
(2) Considering that the non-selective nature of the upsampling calculation can produce redundant feature maps, an FSM module (Feature Selecting Module) is also incorporated into the symmetric feature pyramid network to aggregate more spatial details of the feature maps, improving the classification and positioning accuracy of well lid targets;
(3) By fusing the residual connection concept into the additional feature extraction layer, a residual branch is introduced before each downsampling reduces the size of the feature map; this increases the trainable parameters of the model and avoids network degradation;
(4) When there is noise interference such as ponding, drain valves, and marking lines on the road surface, when the well lid features resemble other road surface features, or when there are pavement concrete notch grooves near the well lid, the redesigned feature extraction network can still learn more feature details related to the road surface well lid, recovering more local semantic information of the well lid, excluding interfering objects, reducing the influence of noise patterns, and reducing misjudgments, thereby accurately identifying and positioning the road surface well lid. The recognition performance of the well lid target detection model constructed according to the first aspect of the invention therefore has better robustness.
Second aspect
The second aspect of the invention provides an intelligent detection system for the road surface well lid, which comprises a memory and a processor, wherein the memory is in communication connection with the processor, the processor is also in communication connection with an external image acquisition device, the intelligent detection method for the road surface well lid according to the first aspect of the invention is stored in the memory, and the processor is used for calling the method stored in the memory to classify and position the well lid targets of the real-time road surface well lid images acquired by the image acquisition device.
The second aspect of the present invention brings about the same advantageous effects as the first aspect of the present invention and will not be described in detail here.
Drawings
FIG. 1 is a flow chart of an intelligent detection method of a road well lid;
FIG. 2 is a first part of a schematic overall structure of a manhole cover target detection model;
FIG. 3 is a second part of a schematic overall structure of a manhole cover target detection model;
fig. 4 is a schematic diagram of a first FSM module/second FSM module/third FSM module/fourth FSM module.
Detailed Description
The technical solutions of the present invention will be described clearly and completely below with reference to the embodiments; apparently, the described embodiments are only some rather than all of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort fall within the scope of the present invention.
For a better description of the present application and its embodiments, the terms appearing hereinafter will be explained.
Convk×k (stride=s, padding="same/valid"): a two-dimensional convolution operation with a convolution kernel of size k and a stride of s (padding=same means the matrix is padded with 0 on the top, bottom, left, and right so the spatial size is preserved; padding=valid means no padding is applied).
BN: short for Batch Normalization, a batch normalization calculation that is a form of standard regularization processing, making the image feature maps after convolution follow a distribution with mean 0 and variance 1.
Feature map: the feature output obtained after operations such as convolution are applied to the original input picture.
ReLU: short for Rectified Linear Unit, a rectified-linear activation function; its piecewise linear and nonlinear characteristic provides more sensitive activation of the input and avoids saturation.
Picture size H×W×C: H denotes the height of the picture, W the width, and C the number of channels.
Max-pooling k×k: a maximum pooling downsampling operation with a pooling kernel of size k and a stride of k, i.e., the maximum value in each k×k region of the feature map becomes the corresponding value of the new feature map, whose size changes from H×W×C to H/k×W/k×C; notably, the pooling layer has no parameters.
Global Average Pooling: average value pooling over the entire feature map (of size H×W×C) to form a 1×1×C feature point, i.e., all pixel values of each channel's feature map are added and averaged, and the average represents that feature map.
Global Max Pooling: maximum value pooling over the entire feature map (of size H×W×C) to form a 1×1×C feature point, i.e., the maximum of all pixel values of each channel's feature map represents that feature map.
Concate: the two feature maps are superimposed along the channel dimension to obtain a new feature map (the H and W of the two feature maps are equal, and the C of the new feature map is the sum of the two C values).
Up-sampling k×k: a bilinear upsampling operation by a factor of k, changing the size of the output feature map from H×W×C to k·H×k·W×C, i.e., the height and width are both enlarged k times.
Sigmoid: one of the activation functions; its S-shaped curve keeps the probability value finally obtained after model encoding and decoding between 0 and 1.
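These conventions can be made concrete with a short PyTorch snippet (an assumed framework, used here purely to exercise each defined operation on a dummy 40×40×64 feature map):

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 64, 40, 40)                       # H x W x C = 40 x 40 x 64

pooled = F.max_pool2d(x, kernel_size=2, stride=2)    # Max-pooling 2x2 -> 20 x 20
up = F.interpolate(pooled, scale_factor=2,
                   mode="bilinear", align_corners=False)  # Up-sampling 2x2 -> 40 x 40
gap = F.adaptive_avg_pool2d(x, 1)                    # Global Average Pooling -> 1 x 1 x 64
gmp = F.adaptive_max_pool2d(x, 1)                    # Global Max Pooling -> 1 x 1 x 64
cat = torch.cat([x, up], dim=1)                      # Concate: channel counts add -> 128
prob = torch.sigmoid(torch.randn(1))                 # Sigmoid squashes to (0, 1)

print(pooled.shape, up.shape, gap.shape, cat.shape, prob)
```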
Example 1
Referring to fig. 1-4, the embodiment provides an intelligent detection method for a road well lid, which specifically includes the following steps:
S100, constructing a well lid target detection model taking a road surface well lid image as an input image, wherein the well lid target detection model takes an SSD network as a baseline, replaces the feature extraction network in the SSD network with a symmetrical feature pyramid network and an additional feature extraction layer, and retains the convolution predictor in the SSD network. The well lid target detection model constructed on the SSD baseline thus includes a symmetric feature pyramid network, an additional feature extraction layer, and a convolution predictor. Feature fusion is achieved among levels with different receptive fields in the symmetric feature pyramid network through path aggregation, and the symmetric feature pyramid network extracts features of the input image to obtain several feature maps of different scales. The additional feature extraction layer performs multi-stage downsampling on the feature map output by the symmetric feature pyramid network, obtaining several feature maps of different scales in one-to-one correspondence. The convolution predictor classifies and positions the well lid targets in all the feature maps. The classification categories of the manhole cover targets in this embodiment include: the health-state well lid, well lid collapse, well lid missing, and background. Other classification standards can of course be adopted to obtain other categories according to the purpose of well lid target detection.
And S200, acquiring historical road surface well cover images, and constructing a training data set by using the historical road surface well cover images.
In some embodiments, the road surface well lid image is acquired by vehicle-mounted 3D cameras mounted at the left and right ends of the vehicle tail and is typically a single-channel grayscale image: the camera at the left end of the vehicle tail acquires the left-side road surface well lid image, and the camera at the right end acquires the right-side road surface well lid image. Before a road surface well lid image is used as the input image, the following preprocessing is needed: the left-side and right-side road surface well lid images are spliced left-to-right to form the complete road surface well lid image, and the complete image is then reduced in size.
During the left-right splicing, a temporary image object is created on the C++ platform, and the road surface feature information of the collected left-side and right-side well lid images is copied into the temporary image object to form the complete road surface well lid image. The formed road surface well lid image must also match the aspect ratio required of the input image by the well lid target detection model, so the aspect ratio has to be considered while copying. For example, if the heights of the collected left-side and right-side images are half the height of the input image required by the model, each row of feature information of the left-side and right-side images is copied twice, becoming two rows of the temporary image object (the first row becomes rows one and two, the second row becomes rows three and four, and so on); if the widths of the collected images are half the width required by the model, then for each row the feature information of the left-side image is copied first, followed immediately by that of the right-side image. The single-channel road surface well lid image is also converted to three channels.
In particular, in this embodiment the size of the road surface well lid image is reduced by a factor of eight.
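A much-simplified sketch of this preprocessing, assuming numpy/OpenCV instead of the C++ row-copying described above (the function name, target size, and libraries are illustrative only):

```python
import cv2
import numpy as np

def preprocess(left_gray: np.ndarray, right_gray: np.ndarray,
               target_hw=(320, 320)) -> np.ndarray:
    """Stitch the left/right single-channel road images side by side,
    reduce the stitched image eightfold, resize to the model input size,
    and expand the single channel to three channels."""
    stitched = np.hstack([left_gray, right_gray])    # left-right splicing
    h, w = stitched.shape
    small = cv2.resize(stitched, (w // 8, h // 8))   # eightfold size reduction
    small = cv2.resize(small, target_hw)             # match model input size
    return cv2.cvtColor(small, cv2.COLOR_GRAY2BGR)   # 1 channel -> 3 channels
```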
After preprocessing, the training data set is constructed from the historical road surface well lid images. A construction method from a common embodiment can be adopted, or the following optimized construction method can be used, which specifically comprises the following steps:
S201, a human-computer-interaction pre-training screening method is adopted to select three types of samples, namely health-state well lid images, well lid collapse images, and well lid missing images, from the historical road surface well lid images; the selected samples are input into the well lid target detection model in its constructed initial state, and the class probabilities of the well lid state in each sample are predicted, the probabilities of the health-state well lid, well lid collapse, well lid missing, and background being denoted P_i1, P_i2, P_i3, and P_i4 respectively, with i = 1, 2, 3, ..., M, where M is the total number of selected samples and i is the sample index;
S202, the maximum of P_i1, P_i2, P_i3, and P_i4 for a single sample is taken as the category of the well lid state in that sample; the quantity ratio between all types of well lid samples is then determined, a preset number of historical road surface well lid images is selected according to this ratio, and all the selected historical road surface well lid images form a training data set with balanced positive and negative samples.
It can be known that the verification data set for verifying the well lid target detection model and the test data set for testing the well lid target detection model are also constructed by adopting a construction method equivalent to the training data set. The data volume ratio comprised by the training data set, the validation data set and the test data set is preferably 5:2:3. In addition, the data sets can be uniformly constructed, and then the data sets are divided into training data sets, verification data sets and test data sets according to a preset data volume ratio.
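A hedged sketch of this screening-and-balancing procedure (the helper names, the per-class cap, and the use of Python are assumptions; only the argmax rule and the 5:2:3 split come from the text, and capping per-class counts is one simple way to realize the balanced-sample requirement):

```python
import random
from collections import defaultdict

CLASSES = ("healthy", "collapsed", "missing", "background")

def balance_and_split(samples, predict_probs, per_class=400, seed=0):
    """samples: list of image paths; predict_probs(path) -> (P1, P2, P3, P4)
    from the initial-state detection model. The argmax of the four
    probabilities is taken as the provisional well lid state."""
    buckets = defaultdict(list)
    for path in samples:
        probs = predict_probs(path)
        buckets[CLASSES[probs.index(max(probs))]].append(path)

    rng = random.Random(seed)
    selected = []
    for cls, paths in buckets.items():
        rng.shuffle(paths)
        selected += paths[:per_class]          # equalize class counts
    rng.shuffle(selected)

    n = len(selected)                          # 5:2:3 train/val/test split
    train = selected[: n * 5 // 10]
    val = selected[n * 5 // 10 : n * 7 // 10]
    test = selected[n * 7 // 10 :]
    return train, val, test
```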
S300, the training data set is input into the well lid target detection model, and the model is trained and its parameters optimized. When the well lid target detection model is trained, label images also need to be input. The label images are generated as in common embodiments, for example: bounding-box-level manual annotation of the well lid features with the existing LabelImg software, forming xml-format labels that contain the coordinate information and category information of the well lid features. Training and parameter optimization likewise follow common embodiments, for example: the training process is constrained by the loss function, and the model parameters are continuously optimized by back propagation, finally yielding the parameter file of the best model of the current training, as sketched below.
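A minimal training-loop sketch, assuming PyTorch; `criterion` stands in for the SSD loss over class scores and box offsets, and the optimizer settings and checkpoint name are illustrative only:

```python
import torch

def train(model, loader, criterion, epochs=100, lr=1e-3, device="cuda"):
    """The loss constrains training and back propagation continuously
    optimizes the model parameters; the current best weights are saved."""
    model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    best = float("inf")
    for epoch in range(epochs):
        total = 0.0
        for images, targets in loader:   # images + xml-derived box/class labels
            preds = model(images.to(device))
            loss = criterion(preds, targets)
            optimizer.zero_grad()
            loss.backward()              # back propagation
            optimizer.step()
            total += loss.item()
        if total < best:                 # keep the current best parameter file
            best = total
            torch.save(model.state_dict(), "best_manhole_ssd.pth")
```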
And S400, classifying and positioning the well lid targets of the real-time road well lid images by using the trained well lid target detection model.
As a preferred embodiment, the construction process of the symmetrical feature pyramid network specifically includes:
S101, determining the scale of each level in the symmetrical feature pyramid network according to the size of the input image and the scale of the output feature map, wherein each level has a different receptive field;
S102, feature fusion is carried out between two adjacent levels through path aggregation;
when feature fusion is carried out through path aggregation, up-sampling calculation is performed on the feature map output by the one of the two adjacent levels that extracts low-level features, feature selection for characterizing spatial details is performed on the feature map output by the other level, and the feature map output after the feature selection is fused with the feature map output after the up-sampling calculation.
As a preferred embodiment, as shown in fig. 4, the specific implementation of the feature selection for characterizing spatial details on the feature map output by the other level is as follows: the feature map output by the other level is defined as the original feature map; the spatial information of the original feature map is aggregated by maximum value pooling to generate a second feature map, and by average value pooling to generate a third feature map; the second and third feature maps are each fed into a shared multi-layer perceptron, their outputs after the multi-layer perceptron are added matrix element by element, and the sum is activated and output to obtain an importance vector, which is used to scale the original feature map; the scaled feature map is added to the original feature map through a skip connection to obtain a fourth feature map; and a 1×1 convolution with stride 1 and a BN normalization calculation are performed on the fourth feature map to obtain a fifth feature map. The Sigmoid function is preferably used to activate the output.
The symmetrical feature pyramid network based on this construction process comprises an image input layer, a first feature output layer, a first upsampling layer, a second upsampling layer, and a third upsampling layer sequentially connected from top to bottom, and a first downsampling layer, a second downsampling layer, and a third downsampling layer sequentially connected from bottom to top. The image input layer obtains a first feature map after downsampling the input image. The first upsampling layer performs the first convolution operation on the first feature map, obtaining a feature map F1. The second upsampling layer performs the first convolution operation on the feature map obtained after maximum value pooling calculation on the feature map F1, obtaining a feature map F2. The third upsampling layer performs the first convolution operation on the feature map obtained after maximum value pooling calculation on the feature map F2, obtaining a feature map F3. Bilinear upsampling calculation is performed on the feature map F3, the feature map F2 is input into a first FSM module, and first-time Concate feature fusion is performed on the output obtained after the bilinear upsampling calculation and the output of the first FSM module to obtain a feature map F4. Bilinear upsampling calculation is performed on the feature map F4, the feature map F1 is input into a second FSM module, and second-time Concate feature fusion is performed on the output obtained after the bilinear upsampling calculation and the output of the second FSM module to obtain a feature map F5. The second convolution operation is performed on the feature map F5 to obtain a feature map F6. Maximum value pooling calculation is performed on the feature map F6 and the second convolution operation is performed on the feature map F4, and third-time Concate feature fusion is performed on the output obtained after the maximum value pooling calculation and the output obtained after the second convolution operation to obtain a feature map F7. The second convolution operation is performed on the feature map F3 and maximum value pooling calculation is performed on the feature map F7, and fourth-time Concate feature fusion is performed on the output after the second convolution operation and the output after the maximum value pooling to obtain a feature map F8. The first downsampling layer performs the first convolution operation on the feature map F8 to obtain a feature map F9. Bilinear upsampling calculation is performed on the feature map F9, the feature map F7 is input into a third FSM module, and fifth-time Concate feature fusion is performed on the output obtained after the bilinear upsampling calculation and the output of the third FSM module to obtain a feature map F10. The second downsampling layer obtains a feature map F11 after performing the first convolution operation on the feature map F10.
And performing bilinear upsampling calculation on the feature map F11, inputting the feature map F6 into a fourth FSM module, performing sixth Concate feature fusion on the output obtained after the bilinear upsampling calculation and the output of the fourth FSM module, and inputting the feature map obtained after the sixth Concate feature fusion into a third downsampling layer. The third downsampling layer is used for performing a first convolution operation on the feature map after the sixth Concate feature fusion to obtain a feature map F12. And respectively inputting the feature map F12, the feature map F11 and the feature map F9 into a first feature output layer, wherein the first feature output layer is used for respectively performing a third convolution operation on the feature map F12, the feature map F11 and the feature map F9 to obtain feature maps FM1, FM2 and FM3 with different scales in a one-to-one correspondence manner.
Optionally, the first, second, third, and fourth FSM modules share the same network architecture and are all built with the feature selection process described above, giving the FSM module shown in fig. 4. The FSM module comprises a maximum pooling layer, an average pooling layer, a shared multi-layer perceptron, and so on. In fig. 4, u denotes the importance vector, the addition node denotes matrix element-wise addition, S denotes the activation output function, the superposition node denotes feature map superposition, BN denotes the batch normalization calculation, f_s(·) denotes a 1×1 convolution operation with stride 1, and M denotes matrix element-wise multiplication.
As a preferred embodiment, when the additional feature extraction layer performs multi-stage downsampling on the feature map output by the symmetrical feature pyramid network, residual connection is introduced between two adjacent downsampling stages.
The additional feature extraction layer introducing residual connection specifically comprises: the second characteristic output layer, the first convolution layer, the second convolution layer and the third convolution layer which are sequentially connected from top to bottom. The first convolution layer is configured to perform a third convolution operation again on the output after performing the fourth convolution operation on the feature map F9, and perform a first residual connection on the output after performing the fourth convolution operation on the feature map F9 and obtain a feature map F13 after the first residual connection. And performing a fourth convolution operation on the feature map F13, performing a second residual connection on the output of the fourth convolution operation and the output of the fourth convolution operation on the feature map F9, and inputting the feature map obtained after the second residual connection into a second convolution layer. The second convolution layer is used for performing a third convolution operation on the feature map obtained after the second residual connection to obtain a feature map F14. And executing fifth convolution operation on the feature map F14, performing third residual connection on the output of the fifth convolution operation and the feature map obtained after the second residual connection, and inputting the feature map obtained after the third residual connection into a third convolution layer. The third convolution layer is configured to perform a third convolution operation on the feature map obtained after the third residual connection to obtain a feature map F15. And respectively inputting the feature map F13, the feature map F14 and the feature map F15 into a second feature output layer, wherein the second feature output layer is used for respectively performing a third convolution operation on the feature map F13, the feature map F14 and the feature map F15 to obtain feature maps FM4, FM5 and FM6 with different scales in a one-to-one correspondence manner.
When the input image size of the well lid target detection model is 320×320×3, each level of the symmetrical feature pyramid network constructed in the above preferred embodiment is planned concretely: the first convolution operation is set to two 3×3 convolutions with stride 1, the second convolution operation to one 1×1 convolution with stride 1, the third convolution operation to one 3×3 convolution with stride 1, the factor of the bilinear upsampling calculation to 2, and the pooling kernel of all maximum pooling calculations to 2. The feature extraction procedure of the symmetrical feature pyramid network is described below in connection with fig. 2:
SS1. Performing two "Conv3×3 (stride=2)+BN+ReLU" operations and one "Max-pooling 2×2" operation on the 320×320×3 input image, resulting in a first feature map of size 40×40×64;
SS2. Performing two "Conv3×3 (stride=1) +BN+ReLU" operations on a first feature map of input size 40×40×64, resulting in a feature map F1 of size 40×40×128;
SS3. Performing a "Max-pooling 2×2" operation once on the feature map F1 to obtain a feature map of size 20×20×128, and performing a "Conv3×3 (stride=1)+BN+ReLU" operation twice to obtain a feature map F2 of size 20×20×256;
SS4. Performing a "Max-pooling 2×2" operation once on the feature map F2 to obtain a feature map of size 10×10×256, and performing a "Conv3×3 (stride=1)+BN+ReLU" operation twice to obtain a feature map F3 of size 10×10×512;
SS5. Executing an "Up-sampling 2×2" operation on the feature map F3, inputting the feature map F2 into a first FSM module, and performing a "Concate" operation on the two calculated results to obtain a feature map F4 of size 20×20×768;
SS6. Executing an "Up-sampling 2×2" operation on the feature map F4, inputting the feature map F1 into a second FSM module, and performing a "Concate" operation on the two calculated results to obtain a feature map F5 of size 40×40×896;
SS7. Performing a "Conv1×1 (stride=1)+BN+ReLU" operation on the feature map F5 to obtain a feature map F6 of size 40×40×128;
SS8. Performing a "Max-pooling 2×2" operation once on the feature map F6 and a "Conv1×1" operation once on the feature map F4, then performing a "Concate" operation on the two calculated results to obtain a feature map F7 of size 20×20×1024;
SS9. Performing a "Max-pooling 2×2" operation once on the feature map F7 and a "Conv1×1" operation once on the feature map F3, then performing a "Concate" operation on the two calculated results to obtain a feature map F8 of size 10×10×1536;
SS10. Performing a "Conv3×3 (stride=1)+BN+ReLU" operation twice on the feature map F8 to obtain a feature map F9 of size 10×10×512, then performing an "Up-sampling 2×2" operation once on the feature map F9, inputting the feature map F7 into a third FSM module, and performing a "Concate" operation on the two calculated results to obtain a feature map F10 of size 20×20×2048;
SS11. Performing a "Conv3×3 (stride=1, padding=same)+BN+ReLU" operation twice on the feature map F10 to obtain a feature map F11 of size 20×20×256, performing an "Up-sampling 2×2" operation once on the feature map F11, inputting the feature map F6 into a fourth FSM module, performing a "Concate" operation on the two calculated results, and performing a "Conv3×3 (stride=1, padding=same)+BN+ReLU" operation twice on the fused feature map to obtain a feature map F12 of size 40×40×128;
SS12. Performing a "Conv3×3 (stride=1)+BN+ReLU" operation once on each of the feature map F12 of size 40×40×128, the feature map F11 of size 20×20×256, and the feature map F9 of size 10×10×512, obtaining the feature map FM1 of size 40×40×1024, the feature map FM2 of size 20×20×512, and the feature map FM3 of size 10×10×512.
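As a consistency check on the sizes in SS1–SS4, the stem of the pyramid can be sketched in PyTorch (an assumed framework; the intermediate channel width of 32 in SS1 is also an assumption, since the patent only fixes the 40×40×64 output):

```python
import torch
import torch.nn as nn

def cbr(cin, cout, stride):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

stem = nn.Sequential(cbr(3, 32, 2), cbr(32, 64, 2), nn.MaxPool2d(2))      # SS1
f1 = nn.Sequential(cbr(64, 128, 1), cbr(128, 128, 1))                     # SS2
f2 = nn.Sequential(nn.MaxPool2d(2), cbr(128, 256, 1), cbr(256, 256, 1))   # SS3
f3 = nn.Sequential(nn.MaxPool2d(2), cbr(256, 512, 1), cbr(512, 512, 1))   # SS4

x = torch.randn(1, 3, 320, 320)
F1 = f1(stem(x)); F2 = f2(F1); F3 = f3(F2)
print(F1.shape, F2.shape, F3.shape)
# torch.Size([1, 128, 40, 40]) torch.Size([1, 256, 20, 20]) torch.Size([1, 512, 10, 10])
```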
After the symmetric feature pyramid network planning is completed, the additional feature extraction layers constructed in the above preferred embodiment are specifically planned for each level, where the fourth convolution operation is set to a 3×3 convolution with a step size of 2, and the fifth convolution operation is set to a 3×3 convolution with a step size of 1 without padding. The feature extraction process for the additional feature extraction layer is described below in conjunction with fig. 2:
SSS1. Performing a "Conv3×3 (stride=2, padding=same)+BN+ReLU" operation on the feature map F9 of size 10×10×512 to obtain a feature map of size 5×5×128, then performing a "Conv3×3 (stride=1, padding=same)+BN+ReLU" operation on it, and making a first residual connection between the output of that operation and the feature map of size 5×5×128 to obtain a feature map F13 of size 5×5×128;
SSS2. Performing a "Conv3×3 (stride=2, padding=same)+BN+ReLU" operation on the feature map F13, making a second residual connection between the output of that operation and the output of the "Conv3×3 (stride=2, padding=same)+BN+ReLU" operation on the feature map F9, and performing a "Conv3×3 (stride=1, padding=same)+BN+ReLU" operation on the residual-connected output to obtain a feature map F14 of size 3×3×128;
SSS3. Performing a "Conv3×3 (stride=1, padding=valid)+BN+ReLU" operation on the feature map F14, making a third residual connection between the output of that operation and the output after the second residual connection, and performing a "Conv3×3 (stride=1, padding=same)+BN+ReLU" operation on the residual-connected output to obtain a feature map F15 of size 1×1×128;
SSS4. Performing a "Conv3×3 (stride=1)+BN+ReLU" operation once on each of the feature map F13 of size 5×5×128, the feature map F14 of size 3×3×128, and the feature map F15 of size 1×1×128, obtaining the feature map FM4 of size 5×5×256, the feature map FM5 of size 3×3×256, and the feature map FM6 of size 1×1×256.
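A sketch of the first residual stage (SSS1), assuming PyTorch; it reproduces the 10×10×512 to 5×5×128 reduction and the first residual connection:

```python
import torch
import torch.nn as nn

class DownResBlock(nn.Module):
    """SSS1 sketch: a stride-2 3x3 conv shrinks the map, a stride-1 3x3
    conv refines it, and a residual connection adds the shrunken map
    back, yielding feature map F13."""
    def __init__(self, cin=512, cout=128):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(cin, cout, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.refine = nn.Sequential(
            nn.Conv2d(cout, cout, 3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

    def forward(self, f9: torch.Tensor) -> torch.Tensor:
        shrunk = self.down(f9)                 # 10x10x512 -> 5x5x128
        return self.refine(shrunk) + shrunk    # first residual connection

f13 = DownResBlock()(torch.randn(1, 512, 10, 10))
print(f13.shape)  # torch.Size([1, 128, 5, 5])
```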
After the six feature maps of different scales are extracted, they are input together into the convolution predictor for classifying and positioning well lid targets, obtaining the predicted position coordinate information. The convolution predictor is the original convolution predictor of the SSD network, so its classification and positioning process is not repeated here. An alternative convolution predictor construction is shown in fig. 3, in which a total of 9590 detection boxes are used and a non-maximum suppression algorithm is introduced; from top to bottom, feature map FM1 is input to a first detector and classifier, feature map FM2 to a second, feature map FM3 to a third, feature map FM4 to a fourth, feature map FM5 to a fifth, and feature map FM6 to a sixth.
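The 9590 figure is consistent with the standard SSD assignment of 4, 6, 6, 6, 4, 4 default boxes per location over the six feature maps; this per-location configuration is not stated in the patent, but the arithmetic below reproduces the total exactly:

```python
feature_maps = [40, 20, 10, 5, 3, 1]     # FM1..FM6 spatial sizes
boxes_per_loc = [4, 6, 6, 6, 4, 4]       # standard SSD configuration (assumed)

total = sum(s * s * b for s, b in zip(feature_maps, boxes_per_loc))
print(total)  # 9590
```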
The 1600 acquired road surface well lid images were tested with the above well lid target detection model, a conventional SSD algorithm model with VGG16 as the feature extraction backbone, a conventional YOLOX algorithm model with CSPDarkNet53 as the backbone, a conventional CenterNet algorithm model with ResNet-101 as the backbone, and a conventional EfficientDet algorithm model with EfficientNet as the backbone; the evaluation index values of the different algorithm models are shown in Table 1.
Table 1
The specific calculation formulas of the evaluation indices are as follows:
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)
F1 = 2 × Precision × Recall / (Precision + Recall)
Overlap-IOU = (1/M) × Σ area(ŷ ∩ y) / area(ŷ ∪ y)
where Recall denotes the recall rate, Precision denotes the precision rate, F1 is the comprehensive evaluation index determined from the recall rate and the precision rate, Overlap-IOU denotes the intersection-over-union ratio, TP is the number of true positives, FP the number of false positives, FN the number of false negatives, N the number of all pictures, M the number of pictures in which the prediction box and the ground-truth box intersect, ŷ the prediction of the model, and y the ground truth.
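A plain-Python sketch of these indices (the box format and function names are illustrative):

```python
def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def f1(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def iou(box_a, box_b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```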
Example two
This embodiment provides an intelligent detection system for the road surface well lid, which classifies and positions well lid targets based on the intelligent detection method realized in the first embodiment. Specifically, the system comprises a memory and a processor in communication connection with each other; the processor is also in communication connection with an external image acquisition device. The intelligent detection method for the road surface well lid realized in the first embodiment is stored in the memory, and the processor calls the method stored in the memory to classify and position well lid targets in the real-time road surface well lid images acquired by the image acquisition device. The external image acquisition device is preferably a vehicle-mounted 3D camera mounted at each of the left and right ends of the vehicle tail.
The foregoing is merely a preferred embodiment of the invention. It should be understood that the invention is not limited to the forms disclosed herein, and these are not to be construed as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments and can be altered within the scope of the inventive concept described herein through the above teachings or through the skill or knowledge of the relevant art. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (4)

1. The intelligent detection method for the road well lid is characterized by comprising the following steps of:
constructing a well lid target detection model taking a road surface well lid image as an input image, wherein the well lid target detection model takes an SSD network as a baseline, replaces the feature extraction network in the SSD network with a symmetrical feature pyramid network and an additional feature extraction layer, and retains the convolution predictor of the SSD network;
acquiring historical road surface well lid images, and constructing a training data set by using the historical road surface well lid images;
inputting the training data set into a well lid target detection model, and training and optimizing parameters of the well lid target detection model;
performing well lid target classification and positioning on the real-time road well lid image by using the trained well lid target detection model;
The symmetrical feature pyramid network is used for extracting features of the input image to obtain a plurality of feature maps with different scales; the additional feature extraction layer performs multi-stage downsampling on the feature maps output by the symmetrical feature pyramid network to obtain, in one-to-one correspondence, a plurality of feature maps with different scales; the convolution predictor is used for classifying and positioning the well lid targets in all the feature maps;
the construction process of the symmetrical characteristic pyramid network specifically comprises the following steps:
determining the scale of each level in the symmetrical feature pyramid network according to the size of the input image and the scale of the output feature map, wherein each level has different receptive fields;
feature fusion is carried out between two adjacent layers through path aggregation;
when feature fusion is carried out through path aggregation, an upsampling calculation is performed on the feature map output by that one of the two adjacent levels which extracts lower-level features, feature selection for characterizing spatial details is performed on the feature map output by the other level, and the feature map output after the feature selection is fused with the feature map output after the upsampling calculation;
The symmetrical characteristic pyramid network comprises an image input layer, a first characteristic output layer, a first upsampling layer, a second upsampling layer and a third upsampling layer which are sequentially connected from top to bottom, and a first downsampling layer, a second downsampling layer and a third downsampling layer which are sequentially connected from bottom to top;
the image input layer is used for obtaining a first feature map after downsampling an input image;
the first upsampling layer is used for performing a first convolution operation on the first feature map to obtain a feature map F1;
the second upsampling layer is used for performing a maximum value pooling calculation on the feature map F1 and then a first convolution operation on the result to obtain a feature map F2;
the third upsampling layer is used for performing a maximum value pooling calculation on the feature map F2 and then a first convolution operation on the result to obtain a feature map F3;
performing a bilinear upsampling calculation on the feature map F3, inputting the feature map F2 into a first FSM module, and performing a first Concate feature fusion of the output obtained after the bilinear upsampling calculation and the output of the first FSM module to obtain a feature map F4;
performing a bilinear upsampling calculation on the feature map F4, inputting the feature map F1 into a second FSM module, and performing a second Concate feature fusion of the output obtained after the bilinear upsampling calculation and the output of the second FSM module to obtain a feature map F5;
Performing a second convolution operation on the feature map F5 to obtain a feature map F6;
performing a maximum value pooling calculation on the feature map F6 and a second convolution operation on the feature map F4, and performing a third Concate feature fusion of the output obtained after the maximum value pooling calculation and the output obtained after the second convolution operation to obtain a feature map F7;
performing a second convolution operation on the feature map F3 and a maximum value pooling calculation on the feature map F7, and performing a fourth Concate feature fusion of the output after the second convolution operation and the output after the maximum value pooling to obtain a feature map F8;
the first downsampling layer is used for performing a first convolution operation on the feature map F8 to obtain a feature map F9;
performing a bilinear upsampling calculation on the feature map F9, inputting the feature map F7 into a third FSM module, and performing a fifth Concate feature fusion of the output obtained after the bilinear upsampling calculation and the output of the third FSM module to obtain a feature map F10;
the second downsampling layer is used for performing a first convolution operation on the feature map F10 to obtain a feature map F11;
performing a bilinear upsampling calculation on the feature map F11, inputting the feature map F6 into a fourth FSM module, performing a sixth Concate feature fusion of the output obtained after the bilinear upsampling calculation and the output of the fourth FSM module, and inputting the feature map obtained after the sixth Concate feature fusion into the third downsampling layer;
The third downsampling layer is used for performing a first convolution operation on the feature map after the sixth Concate feature fusion to obtain a feature map F12;
respectively inputting the feature map F12, the feature map F11 and the feature map F9 into a first feature output layer, wherein the first feature output layer is used for respectively performing a third convolution operation on the feature map F12, the feature map F11 and the feature map F9 to obtain feature maps FM1, FM2 and FM3 with different scales in a one-to-one correspondence manner;
when the additional feature extraction layer performs multi-stage downsampling on the feature map output by the symmetrical feature pyramid network, residual connection is introduced between two adjacent downsampling stages;
the additional feature extraction layer comprises a second feature output layer, a first convolution layer, a second convolution layer and a third convolution layer which are sequentially connected from top to bottom;
the first convolution layer is used for performing a fourth convolution operation on the feature map F9, then performing a third convolution operation on the resulting output, and performing a first residual connection between the output of the third convolution operation and the output of the fourth convolution operation on the feature map F9, a feature map F13 being obtained after the first residual connection;
performing a fourth convolution operation on the feature map F13, performing a second residual connection between the output of that fourth convolution operation and the output of the fourth convolution operation on the feature map F9, and inputting the feature map obtained after the second residual connection into the second convolution layer;
the second convolution layer is used for performing a third convolution operation on the feature map obtained after the second residual connection to obtain a feature map F14;
performing fifth convolution operation on the feature map F14, performing third residual connection on the output of the fifth convolution operation and the feature map obtained after the second residual connection, and inputting the feature map obtained after the third residual connection into a third convolution layer;
the third convolution layer is used for performing a third convolution operation on the feature map obtained after the third residual connection to obtain a feature map F15;
respectively inputting the feature map F13, the feature map F14 and the feature map F15 into a second feature output layer, wherein the second feature output layer is used for respectively performing a third convolution operation on the feature map F13, the feature map F14 and the feature map F15 to obtain feature maps FM4, FM5 and FM6 with different scales in a one-to-one correspondence manner;
the size of the input image is 320×320×3; the first convolution operation is two 3×3 convolutions with a stride of 1; the second convolution operation is one 1×1 convolution with a stride of 1; the third convolution operation is one 3×3 convolution with a stride of 1; the fourth convolution operation is one 3×3 convolution with a stride of 2; the fifth convolution operation is one 3×3 convolution with a stride of 1 and no padding; the stride in the bilinear upsampling calculation is 2; the pooling kernel of every maximum value pooling calculation is 2; the scale of the feature map FM1 is 40×40×1024, the scale of the feature map FM2 is 20×20×512, the scale of the feature map FM3 is 10×10×512, the scale of the feature map FM4 is 5×5×256, the scale of the feature map FM5 is 3×3×256, and the scale of the feature map FM6 is 1×1×256;
the classification categories used when classifying the well lid targets in all the feature maps comprise healthy well lid, collapsed well lid, missing well lid, and background.
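By way of illustration, a minimal PyTorch sketch of the claim-1 symmetric feature pyramid producing FM1-FM3 follows (the additional feature extraction layer yielding FM4-FM6 is built analogously from strided 3×3 convolutions with residual connections). The stem reducing the 320×320×3 input to the first 40×40 feature map, all channel widths other than the fixed FM1-FM3 widths, and the compact FSM stand-in (see the claim-2 sketch further below) are assumptions, not the patented implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def double_conv(cin, cout):   # "first convolution operation": two 3x3 convs, stride 1
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, 1, 1), nn.ReLU(inplace=True),
        nn.Conv2d(cout, cout, 3, 1, 1), nn.ReLU(inplace=True))

class FSM(nn.Module):         # compact stand-in for the claim-2 feature selection module
    def __init__(self, c):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(c, c // 4), nn.ReLU(inplace=True),
                                 nn.Linear(c // 4, c))
        self.out = nn.Sequential(nn.Conv2d(c, c, 1), nn.BatchNorm2d(c))
    def forward(self, x):
        b, c, _, _ = x.shape
        w = torch.sigmoid(self.mlp(F.adaptive_max_pool2d(x, 1).flatten(1))
                          + self.mlp(F.adaptive_avg_pool2d(x, 1).flatten(1)))
        return self.out(x * w.view(b, c, 1, 1) + x)

def up2(x):                   # bilinear upsampling, factor 2
    return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

class SymmetricFPN(nn.Module):
    def __init__(self, c=256):
        super().__init__()
        self.stem = nn.Sequential(    # assumed stem: 320 -> 40 via three stride-2 convs
            nn.Conv2d(3, c, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, 2, 1), nn.ReLU(inplace=True))
        self.c1, self.c2, self.c3 = double_conv(c, c), double_conv(c, c), double_conv(c, c)
        self.fsm_f1, self.fsm_f2 = FSM(c), FSM(c)
        self.p_f5 = nn.Conv2d(3 * c, c, 1)        # "second convolution operation": 1x1, stride 1
        self.p_f4 = nn.Conv2d(2 * c, c, 1)
        self.p_f3 = nn.Conv2d(c, c, 1)
        self.c9 = double_conv(3 * c, 512)         # first downsampling layer
        self.fsm_f7, self.fsm_f6 = FSM(2 * c), FSM(c)
        self.c11 = double_conv(512 + 2 * c, 512)  # second downsampling layer
        self.c12 = double_conv(512 + c, 1024)     # third downsampling layer
        # "third convolution operation" in the first feature output layer: one 3x3, stride 1
        self.o1 = nn.Conv2d(1024, 1024, 3, 1, 1)
        self.o2 = nn.Conv2d(512, 512, 3, 1, 1)
        self.o3 = nn.Conv2d(512, 512, 3, 1, 1)

    def forward(self, img):                       # img: (B, 3, 320, 320)
        f1 = self.c1(self.stem(img))              # 40x40
        f2 = self.c2(F.max_pool2d(f1, 2))         # 20x20
        f3 = self.c3(F.max_pool2d(f2, 2))         # 10x10
        f4 = torch.cat([up2(f3), self.fsm_f2(f2)], 1)             # 1st Concate fusion
        f5 = torch.cat([up2(f4), self.fsm_f1(f1)], 1)             # 2nd Concate fusion
        f6 = self.p_f5(f5)
        f7 = torch.cat([F.max_pool2d(f6, 2), self.p_f4(f4)], 1)   # 3rd Concate fusion
        f8 = torch.cat([self.p_f3(f3), F.max_pool2d(f7, 2)], 1)   # 4th Concate fusion
        f9 = self.c9(f8)                                          # 10x10x512
        f11 = self.c11(torch.cat([up2(f9), self.fsm_f7(f7)], 1))  # 5th fusion -> F10 -> F11
        f12 = self.c12(torch.cat([up2(f11), self.fsm_f6(f6)], 1)) # 6th fusion -> F12
        return self.o1(f12), self.o2(f11), self.o3(f9)            # FM1, FM2, FM3

# fm1, fm2, fm3 = SymmetricFPN()(torch.randn(1, 3, 320, 320))
# -> (1, 1024, 40, 40), (1, 512, 20, 20), (1, 512, 10, 10), matching the claimed scales
```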
2. The intelligent detection method for the road surface well lid according to claim 1, wherein the feature selection, for characterizing spatial details, of the feature map output by the other level is specifically:
defining a feature map output by another level as an original feature map, aggregating spatial information of the original feature map in a maximum value pooling mode, and generating a second feature map after aggregating the spatial information;
aggregating the spatial information of the original feature map in an average value pooling mode, and generating a third feature map after aggregating the spatial information;
feeding the second feature map and the third feature map respectively into a shared multi-layer perceptron, adding the output of the second feature map after the multi-layer perceptron and the output of the third feature map after the multi-layer perceptron element by element, and then applying an activation to the result to obtain an importance vector;
scaling the original feature map using the importance vector;
adding the feature map obtained after scaling to the original feature map by adopting skip connection to obtain a fourth feature map;
and performing a 1×1 convolution with a stride of 1 and a BN normalization calculation on the fourth feature map to obtain a fifth feature map.
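A minimal PyTorch sketch of this claim-2 feature selection module follows; the claim fixes the sequence of operations, while the MLP reduction ratio and the choice of sigmoid as the activation are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureSelection(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mlp = nn.Sequential(               # shared multi-layer perceptron
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.out = nn.Sequential(               # 1x1 convolution, stride 1, then BN
            nn.Conv2d(channels, channels, kernel_size=1, stride=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):                       # x: the original feature map
        b, c, _, _ = x.shape
        second = F.adaptive_max_pool2d(x, 1).flatten(1)   # max-pooled spatial info
        third = F.adaptive_avg_pool2d(x, 1).flatten(1)    # avg-pooled spatial info
        weight = torch.sigmoid(self.mlp(second) + self.mlp(third))  # importance vector
        fourth = x * weight.view(b, c, 1, 1) + x          # scaling plus skip connection
        return self.out(fourth)                           # the fifth feature map
```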
3. The intelligent detection method for the road surface well lid according to claim 1, wherein the road surface well lid image is acquired by vehicle-mounted 3D cameras arranged at the left and right ends of the vehicle tail; the road surface well lid image is preprocessed before being used as the input image of the well lid target detection model, and the preprocessing specifically comprises:
performing left-right splicing of the left-side road surface well lid image collected by the vehicle-mounted 3D camera arranged at the left end of the vehicle tail and the right-side road surface well lid image collected by the vehicle-mounted 3D camera arranged at the right end of the vehicle tail, so as to form the road surface well lid image;
and performing size reduction processing on the road surface well lid image.
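A minimal sketch of this claim-3 preprocessing; it assumes the two camera frames arrive as NumPy arrays of equal height, and the 320×320 target size is taken from the input size stated in claim 1.

```python
import cv2
import numpy as np

def preprocess(left_img: np.ndarray, right_img: np.ndarray) -> np.ndarray:
    """Splice the left and right tail-camera frames, then reduce to the model input size."""
    stitched = np.hstack([left_img, right_img])   # left-right splicing
    return cv2.resize(stitched, (320, 320))       # size reduction processing
```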
4. An intelligent detection system for a road surface well lid, characterized by comprising a memory and a processor in communication connection with each other, the processor further being in communication connection with an external image acquisition device, wherein the intelligent detection method for the road surface well lid is stored in the memory, and the processor is used for calling the method stored in the memory to classify and position the well lid targets in the real-time road surface well lid images acquired by the image acquisition device.
CN202310562328.1A 2023-05-18 2023-05-18 Intelligent detection method and system for road surface well lid Active CN116310764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310562328.1A CN116310764B (en) 2023-05-18 2023-05-18 Intelligent detection method and system for road surface well lid

Publications (2)

Publication Number Publication Date
CN116310764A (en) 2023-06-23
CN116310764B (en) 2023-07-21

Family

ID=86801755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310562328.1A Active CN116310764B (en) 2023-05-18 2023-05-18 Intelligent detection method and system for road surface well lid

Country Status (1)

Country Link
CN (1) CN116310764B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108474866A (en) * 2018-03-23 2018-08-31 深圳市锐明技术股份有限公司 A kind of manhole cover loss detecting system and method based on deep learning
EP3629231A1 (en) * 2018-09-28 2020-04-01 Aptiv Technologies Limited Object detection system of a vehicle
CN111914634A (en) * 2020-06-23 2020-11-10 华南理工大学 Complex-scene-interference-resistant automatic manhole cover type detection method and system
CN112906611A (en) * 2021-03-05 2021-06-04 新疆爱华盈通信息技术有限公司 Well lid detection method and device, electronic equipment and storage medium
CN113139477A (en) * 2021-04-27 2021-07-20 北京市商汤科技开发有限公司 Method, device and equipment for training well lid detection model and computer storage medium
AU2021102692A4 (en) * 2021-05-19 2021-11-25 Gill, Jasbir Singh A multidirectional feature fusion network-based system for efficient object detection
CN113724259A (en) * 2021-11-03 2021-11-30 城云科技(中国)有限公司 Well lid abnormity detection method and device and application thereof
CN114266980A (en) * 2022-03-03 2022-04-01 科大天工智能装备技术(天津)有限公司 Urban well lid damage detection method and system
CN115546768A (en) * 2022-12-01 2022-12-30 四川蜀道新能源科技发展有限公司 Pavement marking identification method and system based on multi-scale mechanism and attention mechanism
CN116071725A (en) * 2023-03-06 2023-05-05 四川蜀道新能源科技发展有限公司 Pavement marking recognition method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111931684B (en) * 2020-08-26 2021-04-06 北京建筑大学 Weak and small target detection method based on video satellite data identification features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"高速铁路无砟轨道表面裂缝三维图像自动识别算法";阳恩慧等;《铁道学报》;第41卷(第11期);第95-99页 *

Also Published As

Publication number Publication date
CN116310764A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
Tan et al. Automatic detection of sewer defects based on improved you only look once algorithm
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN113780296B (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
US11308714B1 (en) Artificial intelligence system for identifying and assessing attributes of a property shown in aerial imagery
Su et al. Deep convolutional neural network–based pixel-wise landslide inventory mapping
Zhang et al. Intelligent pixel‐level detection of multiple distresses and surface design features on asphalt pavements
Ye et al. Autonomous surface crack identification of concrete structures based on the YOLOv7 algorithm
CN114049356B (en) Method, device and system for detecting structure apparent crack
CN114913498A (en) Parallel multi-scale feature aggregation lane line detection method based on key point estimation
CN113903022A (en) Text detection method and system based on feature pyramid and attention fusion
CN115424237A (en) Forward vehicle identification and distance detection method based on deep learning
KR102186974B1 (en) Smart cctv system for analysis of parking
CN113158954A (en) Automatic traffic off-site zebra crossing area detection method based on AI technology
CN116310764B (en) Intelligent detection method and system for road surface well lid
CN116665153A (en) Road scene segmentation method based on improved deep bv3+ network model
Qaddour et al. Automatic damaged vehicle estimator using enhanced deep learning algorithm
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN110555425A (en) Video stream real-time pedestrian detection method
Sharma et al. Deep Learning-Based Object Detection and Classification for Autonomous Vehicles in Different Weather Scenarios of Quebec, Canada
Sari et al. Parking Lots Detection in Static Image Using Support Vector Machine Based on Genetic Algorithm.
CN115272814B (en) Long-distance space self-adaptive multi-scale small target detection method
Dong et al. Intelligent pixel-level pavement marking detection using 2D laser pavement images
CN116246128B (en) Training method and device of detection model crossing data sets and electronic equipment
Demirtaş et al. A Multi-channel Deep Learning Architecture for Understanding the Urban Scene Semantics
Xu A fusion-based approach to deep-learning and edge-cutting algorithms for identification and color recognition of traffic lights

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant