CN115937807A - Road disease identification and detection method and system - Google Patents
- Publication number
- CN115937807A CN115937807A CN202211522042.2A CN202211522042A CN115937807A CN 115937807 A CN115937807 A CN 115937807A CN 202211522042 A CN202211522042 A CN 202211522042A CN 115937807 A CN115937807 A CN 115937807A
- Authority
- CN
- China
- Prior art keywords
- data
- target
- image
- module
- road
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The application discloses a road disease identification and detection method and system, wherein the method comprises the following steps: collecting a data set and dividing it into a training set, a validation set and a test set; preprocessing the data set to obtain processed data; extracting target features of different levels of the image with a Transformer algorithm from the processed data to obtain a final target vector; and obtaining the class probability of the target based on the final target vector and regressing a target bounding box to achieve accurate detection of the target. Image features are extracted automatically by a visual Transformer structure, overcoming the CNN-based method's inability to exploit context information effectively and its insufficient global modeling. Performing road disease identification on image data can replace manual inspection of road pavement distress, effectively improving detection efficiency and reducing detection cost; compared with target detection algorithms based on conventional CNNs, the method and device achieve higher recall and accuracy.
Description
Technical Field
The application relates to the technical field of road damage detection, and in particular to a road disease identification and detection method and system.
Background
With the rapid development of road traffic, traffic accidents, vehicle overloading, malicious road damage, natural disasters and similar events impair the structural soundness and integrity of roads to varying degrees. Such damage must be discovered promptly during routine maintenance patrols, a technical condition evaluation must be given according to national standards, and road maintenance then carried out according to the evaluation results. Most existing road patrols are still manual, which is inefficient and costly; on some road sections, image processing algorithms based on traditional methods have low robustness, are easily disturbed by external factors such as illumination, noise and image sharpness, suffer relatively high rates of missed and false detections, and usually consume considerable computing resources.
In the field of computer vision, CNNs have been the leading models for vision tasks since 2012. As technology has developed, increasingly efficient algorithm structures have appeared in vision processing. The visual Transformer structure, an attention-based encoder architecture, has done pioneering work in computer vision. Compared with CNNs it has superior modeling capability: the self-attention mechanism in the Transformer structure makes full use of global context information and performs global modeling with global features, so it can effectively extract target features of pavement diseases against similar backgrounds. The Transformer structure therefore offers a new idea for improving road disease identification methods. A road disease identification method based on the visual Transformer structure extracts features from the training set, avoiding manual feature extraction and reducing complex image preprocessing; it overcomes the locality of CNN-based methods and their inability to exploit global context information; and, by using multi-dimensional data enhancement, it can obtain a large number of correlated images, reducing the risk of neural network overfitting and yielding better feature expression.
Disclosure of Invention
By analyzing and processing optical image data containing road surface information, the method and device of the present application can provide accurate road surface damage and disease identification results that accord with the national highway technical evaluation standard.
In order to achieve the above object, the present application provides a road disease identification and detection method, comprising the steps of:
collecting a data set, and dividing the data set into a training set, a validation set and a test set;
preprocessing the data set to obtain the processed data;
extracting target features of different levels of the image by using a Transformer algorithm according to the processed data to obtain a final target vector;
and obtaining the class probability of the target based on the final target vector, and regressing a target bounding box to achieve accurate target detection.
Preferably, the method of acquiring a data set comprises: producing disease annotation data according to a road technical condition evaluation standard, and constructing the data set from optical image data containing road surface disease information at multiple scales, under different background environments and different illumination conditions.
Preferably, the method of pretreatment comprises:
resizing the images of the training set and the test set to obtain a plurality of resized image samples, and expanding the data volume of the training set by a data enhancement method to obtain data samples;
carrying out dicing processing on the data samples to obtain a plurality of small three-dimensional image cubes;
and performing flattened linear mapping on each small three-dimensional image cube to obtain a one-dimensional vector for each cube, the one-dimensional vectors representing the original features of the image sample.
Preferably, the data enhancement method includes:
randomly reading four pictures from the data set each time;
applying flipping, scaling and color-gamut transformation to the four pictures respectively to obtain processed pictures;
placing the processed pictures at the top-left, top-right, bottom-left and bottom-right positions of a large picture of specified size;
and cropping fixed regions of the processed pictures by matrix indexing and stitching them into a new picture, with detection-box coordinates that exceed the boundary edge-processed correspondingly so that an XML file corresponding to the new picture can be generated.
Preferably, the dicing process includes:
setting the step length as S, and cutting each image sample of size C × C × 3 sequentially, with stride S, into N small cubes of size S × S × 3.
Preferably, the method for obtaining the processed data comprises: flattening and linearly mapping each small cube of size S × S × 3 into a one-dimensional vector of length D, the N one-dimensional vectors constituting the processed data.
Preferably, the method for extracting the target feature comprises:
constructing a vector with the same dimension as the original characteristic as a position code, and then adding the vector with the one-dimensional vector representing the original characteristic to obtain the input of a Transformer module;
and inputting the one-dimensional vector added with the position codes into a coding layer based on a visual Transformer module, and extracting the target features of different levels of the image.
The application also provides a road disease identification and detection system, which comprises: an acquisition module, a preprocessing module, an extraction module and a calculation module;
the acquisition module is used for collecting a data set and dividing the data set into a training set, a validation set and a test set;
the preprocessing module is used for preprocessing the data set to obtain the processed data;
the extraction module is used for extracting target features of different levels of the image according to the processed data to obtain a final target vector;
and the calculation module is used for obtaining the class probability of the target based on the final target vector and regressing a target bounding box.
Preferably, the acquisition module comprises an acquisition unit and a dividing unit;
the acquisition unit is used for acquiring the data set;
the dividing unit is used for dividing the data set into the training set, the validation set and the test set.
Compared with the prior art, the beneficial effects of the present application are as follows:
the method and the device can rapidly and accurately deduce the information such as the position, the type and the like of the road disease, and the detection result meets the national regulation 'road technical condition evaluation standard'. The image features are automatically extracted by using the visual Transformer structure, the defects that context information cannot be effectively utilized and the global property is insufficient based on a CNN method are overcome, road disease identification is carried out on image data, the condition of road pavement diseases can be replaced by manual inspection, the detection efficiency is effectively improved, the detection cost is reduced, compared with a target detection algorithm based on the traditional CNN, the image feature detection method has higher recall rate and accuracy, and the target detection effect on high background similarity is better.
Drawings
In order to more clearly illustrate the technical solution of the present application, the drawings needed in the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of a method according to a first embodiment of the present application;
FIG. 2 is a schematic diagram of a data sample enhancement and dicing process according to a first embodiment of the present application;
FIG. 3 is a schematic structural diagram of a standard Transformer according to a first embodiment of the present application;
fig. 4 is a schematic structural diagram of an MLP module according to a first embodiment of the present application;
FIG. 5 is a schematic diagram of a regressor module according to the first embodiment of the present application;
fig. 6 is a schematic structural diagram of a system according to a second embodiment of the present application;
fig. 7 is a schematic diagram of an LWT module according to a third embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. It is obvious that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.
Example one
As shown in fig. 1, which is a schematic flow chart of a method according to a first embodiment of the present application, the method includes:
s1, collecting a data set, and dividing the data set into a training set, a verification set and a test set.
Disease annotation data is produced according to the 'road technical condition evaluation standard'; optical image data containing road surface disease information at multiple scales, under different background environments and illumination conditions is used to form the data set; and the data set is reasonably divided into a training set, a validation set and a test set.
And S2, preprocessing the data set to obtain processed data.
In the first embodiment, the preprocessing method includes:
resizing the images of the training set and the test set to obtain a plurality of resized image samples, and expanding the data volume of the training set by a data enhancement method to obtain data samples; dicing the data samples to obtain a plurality of small three-dimensional image cubes; and performing flattened linear mapping on each small three-dimensional image cube to obtain a one-dimensional vector per cube, these one-dimensional vectors representing the original features of the image sample.
As shown in fig. 2, a schematic diagram of the data enhancement and dicing process according to the first embodiment of the present application, the method includes: randomly reading four pictures from the data set each time; applying flipping (left-right flipping of the original picture), scaling (resizing the original picture) and color-gamut changes (changing the brightness, saturation and hue of the original picture) to the four pictures respectively; placing the four pictures at the top-left, top-right, bottom-left and bottom-right positions of a large picture of specified size; and cropping fixed regions of the four pictures by matrix indexing and stitching them into a new picture, with detection-box coordinates that exceed the boundary edge-processed correspondingly so that an XML file for the new picture can be generated.
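The four-picture stitching above can be sketched as follows. This is a minimal numpy sketch, not the patent's implementation: the function names, the 640-pixel canvas size and the nearest-neighbour scaling are illustrative assumptions, and the color-gamut step is omitted.

```python
import numpy as np

def mosaic(pics, out_size=640, seed=0):
    """Stitch four pictures into one large picture: random left-right flip,
    scale each to a quadrant, place them at the four corner positions
    (color-gamut changes omitted for brevity)."""
    rng = np.random.default_rng(seed)
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    half = out_size // 2
    # quadrant offsets: top-left, top-right, bottom-left, bottom-right
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]
    for pic, (oy, ox) in zip(pics, offsets):
        if rng.random() < 0.5:                        # random left-right flip
            pic = pic[:, ::-1]
        ys = np.arange(half) * pic.shape[0] // half   # nearest-neighbour scaling
        xs = np.arange(half) * pic.shape[1] // half
        canvas[oy:oy + half, ox:ox + half] = pic[ys][:, xs]
    return canvas

def clip_box(box, out_size):
    """Edge-process a detection box whose coordinates exceed the boundary."""
    x1, y1, x2, y2 = box
    return max(x1, 0), max(y1, 0), min(x2, out_size), min(y2, out_size)
```

The clipped boxes would then be written out to the XML annotation file for the new picture.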
Then, with the step length set to S, each image sample of size C × C × 3 is cut sequentially, with stride S, into N small cubes of size S × S × 3.
Each small cube of size S × S × 3 is flattened and linearly mapped into a one-dimensional vector of length D, giving N such one-dimensional vectors in total.
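The dicing and flattening can be sketched as follows, under the assumption that each image sample is C × C × 3 with C divisible by S; the subsequent linear projection to the model dimension D is a matrix multiplication and is omitted here.

```python
import numpy as np

def dice_and_flatten(image, S):
    """Cut a C x C x 3 image into N = (C/S)^2 non-overlapping S x S x 3
    cubes with stride S, then flatten each cube to a 1-D vector of
    length 3*S^2 (the learned projection to length D would follow)."""
    C = image.shape[0]
    assert image.shape == (C, C, 3) and C % S == 0
    n = C // S
    # (row-block, row-in-block, col-block, col-in-block, channel) grid
    cubes = image.reshape(n, S, n, S, 3).transpose(0, 2, 1, 3, 4)
    return cubes.reshape(n * n, 3 * S * S)
```

For example, a 224 × 224 × 3 sample with S = 16 yields N = 196 vectors of length 768, the usual ViT token layout.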
And S3, extracting target features of different levels of the image according to the processed data to obtain a final target vector.
Following S2, a vector with the same dimension as the original feature is constructed as a position code and added to the one-dimensional vector representing the original feature to obtain the input of the Transformer module; the position-coded one-dimensional vectors are then fed into a coding layer based on the visual Transformer module, and target features of different levels of the image are extracted to obtain the final target vector. The position code records the position information of different image features in the original image; the coding scheme here is the straightforward absolute-position design, with the position code treated as a trainable parameter. The feature extraction derives from the self-attention mechanism in the standard Transformer coding structure; a lightweight convolution module LWC is designed using depthwise separable convolution, and, combined with the self-attention mechanism, a lightweight Transformer module LWT is further designed. That is, the one-dimensional vectors are used for interactive self-attention computation, where the coding-layer network is a stack of M identical layers, each layer connecting the multi-head attention network and the fully connected layer with residual connections. The structure of the standard Transformer in the first embodiment of the present application is shown in fig. 3.
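The interactive self-attention computation inside one coding layer can be sketched as follows. This is a single-head numpy sketch under simplifying assumptions: the multi-head split, residual connections and layer normalization of the standard Transformer layer are omitted, and the weight matrices are illustrative names.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over N token vectors X (N x d_in):
    every token (patch vector plus position code) attends to every
    other token, which is how global context enters the features."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (N, N) token interactions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ V                                     # (N, d) mixed features
```

Stacking M such layers (with residuals and fully connected sublayers) gives the coding-layer network described above.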
And S4, obtaining the class probability of the target based on the final target vector, and regressing a target bounding box.
Finally, the one-dimensional vector of length D output by the coding layer is fed into an MLP module to obtain the class probability of the target and regress a target bounding box; the MLP module structure of the first embodiment is shown in fig. 4. The final feature vector is input into an MLP module consisting of two fully connected layers and a softmax activation function to obtain the classification prediction, and into a regressor module consisting of several fully connected layers and a sigmoid function to obtain a four-dimensional vector (X, Y, W, H), whose four dimensions represent the center-point coordinates and the width and height of the target bounding box. The regressor module of the first embodiment is shown in fig. 5.
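The two output heads can be sketched as follows. This is a numpy sketch with assumptions: the patent specifies two fully connected layers with softmax for classification and fully connected layers with sigmoid for the box, but the ReLU between layers and all names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prediction_heads(feat, Wc1, Wc2, Wb1, Wb2):
    """Two heads over the final feature vector: class probabilities via
    softmax, and a normalised (X, Y, W, H) box via sigmoid."""
    h = np.maximum(0.0, feat @ Wc1)   # hidden layer; ReLU is an assumption
    class_prob = softmax(h @ Wc2)     # probabilities sum to 1
    g = np.maximum(0.0, feat @ Wb1)
    box = sigmoid(g @ Wb2)            # (X, Y, W, H), each in (0, 1)
    return class_prob, box
```

The sigmoid keeps the predicted center, width and height in a normalised range, to be rescaled by the image size.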
Example two
As shown in fig. 6, a schematic structural diagram of a system according to a second embodiment of the present application, the system includes: an acquisition module, a preprocessing module, an extraction module and a calculation module. The acquisition module is used for collecting a data set and dividing the data set into a training set, a validation set and a test set; the preprocessing module is used for preprocessing the data set to obtain processed data; the extraction module is used for extracting target features of different levels of the image according to the processed data to obtain a final target vector; and the calculation module is used for obtaining the class probability of the target based on the final target vector and regressing a target bounding box. The acquisition module includes an acquisition unit and a dividing unit.
The working flow of the system will be described in detail below with reference to the present embodiment.
First, the data set is collected by the acquisition unit, and divided into a training set, a validation set and a test set by the dividing unit.
The working process of the acquisition module comprises the following steps: disease annotation data is produced according to the 'road technical condition evaluation standard'; optical image data containing road surface disease information at multiple scales, under different background environments and illumination conditions is used to form the data set; and the data set is reasonably divided into a training set, a validation set and a test set.
And then, preprocessing the data set by using a preprocessing module to obtain processed data.
In the second embodiment, the work flow of the preprocessing module includes:
resizing the images of the training set and the test set to obtain a plurality of resized image samples, and expanding the data volume of the training set by a data enhancement method to obtain data samples; dicing the data samples to obtain a plurality of small three-dimensional image cubes; and performing flattened linear mapping on each small three-dimensional image cube to obtain a one-dimensional vector per cube, these one-dimensional vectors representing the original features of the image samples.
Then, four pictures are randomly read from the data set each time; flipping (left-right flipping of the original picture), scaling (resizing the original picture) and color-gamut changes (changing the brightness, saturation and hue of the original picture) are applied to the four pictures respectively; the four pictures are placed at the top-left, top-right, bottom-left and bottom-right positions of a large picture of specified size; and fixed regions of the four pictures are cropped by matrix indexing and stitched into a new picture, with detection-box coordinates that exceed the boundary edge-processed correspondingly so that an XML file for the new picture can be generated.
Then, a step length S is set, and each image sample of size C × C × 3 is cut sequentially, with stride S, into N small cubes of size S × S × 3.
Each small cube of size S × S × 3 is flattened and linearly mapped into a one-dimensional vector of length D, giving N such one-dimensional vectors in total.
And the extraction module is used for extracting target features of different levels of the image according to the processed data to obtain a final target vector. The work flow comprises the following steps:
According to the processed data obtained by the preprocessing module, a vector with the same dimension as the original feature is constructed as a position code and added to the one-dimensional vector representing the original feature to obtain the input of the Transformer module; the position-coded one-dimensional vectors are then fed into a coding layer based on the visual Transformer module, and target features of different levels of the image are extracted to obtain the final target vector. The position code records the position information of different image features in the original image; the coding scheme here is the straightforward absolute-position design, with the position code treated as a trainable parameter. The feature extraction derives from the self-attention mechanism in the standard Transformer coding structure, i.e. the one-dimensional vectors are used for interactive self-attention computation, where the coding-layer network is a stack of M identical layers, each layer connecting the multi-head attention network and the fully connected layer with residual connections.
Finally, the calculation module obtains the class probability of the target based on the final target vector and regresses a target bounding box. The work flow comprises the following steps:
The one-dimensional vector of length D output by the coding layer is fed into an MLP module to obtain the class probability of the target, and a target bounding box is regressed; the final feature vector is input into an MLP module consisting of two fully connected layers and a softmax activation function to obtain the classification prediction, and into a regressor module consisting of several fully connected layers and a sigmoid function to obtain a four-dimensional vector (X, Y, W, H), whose four dimensions represent the center-point coordinates and the width and height of the target bounding box.
EXAMPLE III
The LWT module will be described in detail below with reference to the third embodiment, as shown in fig. 7. The LWT module aims to model local and global information in the input tensor with fewer parameters. For a given input tensor X ∈ R^(H×W×C), the LWT module applies an n×n standard convolutional layer and then a point-wise (1×1) convolutional layer to generate X_L ∈ R^(H×W×d). The n×n convolutional layer encodes local spatial information, while the point-wise convolution projects the tensor into a higher-dimensional space by learning linear combinations of the input channels.
To enable the LWT module to learn a global representation while retaining spatial inductive bias, X_L is unfolded into N non-overlapping blocks, X_U ∈ R^(P×N×d), where P = wh is the number of pixels per block, N = HW/P is the number of blocks, and h ≤ n and w ≤ n are the height and width of the blocks, respectively. For each p ∈ {1, …, P}, X_G ∈ R^(P×N×d) is obtained by applying a Transformer to encode the inter-block relations:

X_G(p) = Transformer(X_U(p)), 1 ≤ p ≤ P    (1)

where X_G(p) encodes the global information of the block at the p-th position, and X_U(p) encodes the local information of an n×n region using convolution. Since the LWT module loses neither the order of the blocks nor the spatial order of the pixels within each block, X_G ∈ R^(P×N×d) can be folded to obtain X_F ∈ R^(H×W×d). X_F is then projected by point-wise convolution to the lower-dimensional (C-dimensional) space and combined with X by a concatenation operation; an n×n convolutional layer further fuses the local and global features in the concatenated tensor to obtain the final output tensor. Because X_U(p) encodes local information of an n×n region with convolution and X_G(p) encodes global information of the block at the p-th position, each pixel in X_G encodes information from all pixels in X, so the LWT module has a theoretical receptive field of H × W.
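The unfold/fold bookkeeping around equation (1) can be sketched as follows. This is a numpy sketch of the reshapes only; the Transformer applied per position p and the convolutions are not included, and the function names are illustrative.

```python
import numpy as np

def unfold(XL, h, w):
    """Unfold X_L of shape (H, W, d) into non-overlapping h-by-w blocks,
    giving X_U of shape (P, N, d) with P = h*w pixel positions and
    N = (H*W)/P blocks, the layout on which equation (1) applies a
    Transformer per position p to relate blocks globally."""
    H, W, d = XL.shape
    assert H % h == 0 and W % w == 0
    nH, nW = H // h, W // w
    blocks = XL.reshape(nH, h, nW, w, d).transpose(1, 3, 0, 2, 4)
    return blocks.reshape(h * w, nH * nW, d)

def fold(XU, H, W, h, w):
    """Inverse of unfold: recover X_F of shape (H, W, d) with the
    original block order and pixel order preserved."""
    P, N, d = XU.shape
    nH, nW = H // h, W // w
    blocks = XU.reshape(h, w, nH, nW, d).transpose(2, 0, 3, 1, 4)
    return blocks.reshape(H, W, d)
```

Because fold is an exact inverse of unfold, neither the order of the blocks nor the spatial order of pixels within each block is lost, as the text requires.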
A Lightweight Adaptive Clustering attention mechanism (L-ACT). A lightweight convolution module LWC is designed with depthwise separable convolution, and a lightweight Transformer module LWT is further designed by combining the self-attention mechanism to learn global feature representations; by arranging the positional order of the LWC and LWT, a feature map that effectively combines local and global information is output. The L-ACT adaptively clusters query features using Locality-Sensitive Hashing (LSH) and approximates query-key interactions with prototype-key interactions. L-ACT reduces the quadratic O(N²) complexity inside the attention to O(NK), where K is the number of prototypes per layer. Without affecting the performance of the pre-trained DETR model, L-ACT can replace the original self-attention module in DETR, achieving a good balance between accuracy and computational cost (FLOPs).
L-ACT selects representative prototypes from the queries using lightweight LSH and then broadcasts the feature updates of the selected prototypes to their nearest queries. L-ACT reduces the quadratic complexity of the original Transformer and is fully compatible with it;
without any training, this reduces the FLOPs in DETR from 73.4 to 58.2 GFLOPs (excluding the ResNet backbone FLOPs) while losing only 0.7% AP; with multi-task knowledge distillation (MTKD), the AP loss drops to 0.2%, achieving a seamless transition between ACT and the original Transformer.
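The prototype idea can be sketched as follows. This is a loose numpy sketch, not the patent's L-ACT: the random-projection hashing, bucket-mean prototypes and function name are illustrative stand-ins for its LSH clustering, shown only to make the O(N²) to O(NK) reduction concrete.

```python
import numpy as np

def prototype_attention(Q, K, V, n_bits=4, seed=0):
    """Hash queries with random projections (an LSH stand-in), average
    each bucket into a prototype, run prototype-key attention, and
    broadcast each prototype's update to the queries in its bucket.
    Attention cost is per prototype, not per query."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((Q.shape[1], n_bits))
    codes = (Q @ planes > 0) @ (1 << np.arange(n_bits))  # bucket id per query
    out = np.empty_like(Q)
    for c in np.unique(codes):
        idx = codes == c
        proto = Q[idx].mean(axis=0)                      # prototype query
        w = np.exp(proto @ K.T / np.sqrt(K.shape[1]))
        out[idx] = (w / w.sum()) @ V                     # broadcast to bucket
    return out
```

With K prototypes the query-key interaction costs O(NK) instead of O(N²), mirroring the complexity claim above.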
In the encoder, 2D features are extracted from the input image using an ImageNet-pretrained ResNet model. The position encoding module encodes the spatial information using sine and cosine functions of different frequencies. DETR flattens the 2D features, adds the position encoding, and passes the result to a 6-layer Transformer encoder. Each encoder layer has an identical structure, comprising an 8-head self-attention module and an FFN module. The decoder then takes a small, fixed number of learned position embeddings as input; these embeddings are called object queries, and they additionally attend to the encoder output. The decoder also has 6 layers, each including an 8-head self-attention module, an 8-head cross-attention module and an FFN module. Finally, DETR passes each output of the decoder to a shared feed-forward network that predicts either a detection (class and bounding box) or a no-object class.
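The sine/cosine position encoding mentioned above can be sketched as follows. This is the standard 1-D sinusoidal form as a numpy sketch; DETR itself applies it separately to the x and y axes and concatenates, which is omitted here.

```python
import numpy as np

def sinusoidal_encoding(num_pos, d):
    """Encode positions 0..num_pos-1 with sine and cosine functions of
    different frequencies: even channels get sin, odd channels cos."""
    pos = np.arange(num_pos)[:, None]           # (num_pos, 1)
    i = np.arange(d // 2)[None, :]              # frequency index
    angle = pos / (10000.0 ** (2 * i / d))      # geometric frequency ladder
    pe = np.zeros((num_pos, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

Each position thus gets a unique, smoothly varying code that the attention layers can use to recover spatial information from the flattened features.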
The above-described embodiments merely illustrate preferred embodiments of the present application and do not limit its scope. Various modifications and improvements made to the technical solutions of the present application by those skilled in the art without departing from its spirit shall fall within the protection scope defined by the claims of the present application.
Claims (9)
1. A road disease identification and detection method is characterized by comprising the following steps:
collecting a data set, and dividing the data set into a training set, a validation set and a test set;
preprocessing the data set to obtain the processed data;
extracting target features of different levels of the image by using a Transformer algorithm according to the processed data to obtain a final target vector;
and obtaining the class probability of the target based on the final target vector, and regressing a target bounding box to achieve accurate detection of the target.
2. The method for identifying and detecting the road diseases according to claim 1, wherein the method for collecting the data set comprises the following steps: and making disease marking data according to a road technical condition evaluation standard, and constructing the data set by adopting optical image data containing road surface disease information under multi-scale, different background environments and different illumination states.
3. The method for identifying and detecting the road diseases according to claim 1, characterized in that the preprocessing method comprises the following steps:
resizing the images of the training set and the test set to obtain a plurality of resized image samples, and expanding the data volume of the training set by a data enhancement method to obtain data samples;
carrying out dicing processing on the data samples to obtain a plurality of small three-dimensional image cubes;
and performing flattened linear mapping on each small three-dimensional image cube to obtain a plurality of one-dimensional vectors corresponding to the small three-dimensional image cube, wherein the one-dimensional vectors represent the original characteristics of the image sample.
4. The method for identifying and detecting the road diseases according to claim 3, characterized in that the data enhancement method comprises the following steps:
randomly reading four pictures from the data set each time;
applying flipping, scaling and color-gamut transformation to the four pictures respectively to obtain processed pictures;
placing the processed pictures at the top-left, top-right, bottom-left and bottom-right positions of a large picture of specified size;
and cropping fixed regions of the processed pictures by matrix indexing and stitching them into a new picture, with detection-box coordinates that exceed the boundary edge-processed correspondingly so that an XML file corresponding to the new picture can be generated.
5. The method for identifying and detecting the road diseases according to claim 3, wherein the dicing process comprises: setting the step length as S, and cutting each image sample of size C × C × 3 sequentially, with stride S, into N small cubes of size S × S × 3.
6. The method for identifying and detecting the road diseases according to claim 5, wherein the method for obtaining the processed data comprises: flattening and linearly mapping each small cube of size S × S × 3 into a one-dimensional vector of length D, the N one-dimensional vectors constituting the processed data.
7. The method for identifying and detecting road diseases according to claim 6, wherein the method for extracting the target features comprises the following steps:
constructing a vector with the same dimension as the original features to serve as a position code, and adding it to the one-dimensional vector representing the original features to obtain the input of the Transformer module;
and inputting the one-dimensional vectors with the position codes added into the coding layer of the visual Transformer module, and extracting target features of the image at different levels.
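A minimal sketch of the two steps in claim 7, under the usual ViT assumptions: the position code is a learnable embedding of the same shape as the patch vectors (randomly initialized below, since the trained weights are not given), and the encoder's core operation is scaled dot-product self-attention, shown single-headed with identity projections purely for illustration.

```python
import numpy as np

def add_position_codes(patches, rng=None):
    """Add a position code of the same dimension as the original features
    (claim 7, step 1). In practice this embedding is learned; here it is
    a random stand-in."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = patches.shape
    pos = rng.normal(0.0, 0.02, size=(n, d))
    return patches + pos

def self_attention(x):
    """Single-head scaled dot-product attention, the core operation of
    the visual Transformer coding layer (claim 7, step 2). Real layers
    add Q/K/V projections, multiple heads, an MLP and residual paths."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # rows sum to 1
    return attn @ x
```

Stacking several such layers and reading out their intermediate activations is what yields target features "of different levels" in the claim.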
8. A road disease identification and detection system, characterized by comprising: an acquisition module, a preprocessing module, an extraction module and a calculation module;
the acquisition module is used for acquiring a data set and dividing the data set into a training set, a verification set and a test set;
the preprocessing module is used for preprocessing the data set to obtain the processed data;
the extraction module is used for extracting target features of different levels of the image according to the processed data to obtain a final target vector;
and the calculation module is used for obtaining the class probability of the target based on the final target vector and regressing a target bounding box.
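The calculation module of claim 8 can be sketched as a linear classification head with softmax plus a linear box-regression head, which is how DETR-style Transformer detectors typically produce class probabilities and (cx, cy, w, h) boxes. The random weights and the sigmoid box normalization below are assumptions for illustration; the trained parameters are not part of the published text.

```python
import numpy as np

def detection_head(target_vectors, n_classes=5, rng=None):
    """Map final target vectors (N, d) to per-class probabilities (N, k)
    via linear layer + softmax, and regress normalized (cx, cy, w, h)
    boxes (N, 4) via linear layer + sigmoid. Weights are random
    stand-ins for the trained parameters."""
    if rng is None:
        rng = np.random.default_rng(0)
    d = target_vectors.shape[-1]
    w_cls = rng.normal(size=(d, n_classes))
    w_box = rng.normal(size=(d, 4))

    logits = target_vectors @ w_cls
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)     # softmax class probabilities

    boxes = 1.0 / (1.0 + np.exp(-(target_vectors @ w_box)))  # sigmoid -> [0, 1]
    return probs, boxes
```

Thresholding `probs` and rescaling `boxes` by the image size would then give the final road-disease detections.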
9. The road disease identification and detection system according to claim 8, wherein the collection module comprises: the device comprises an acquisition unit and a dividing unit;
the acquisition unit is used for acquiring the data set;
the dividing unit is used for dividing the data set into the training set, the verification set and the test set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211522042.2A CN115937807A (en) | 2022-11-30 | 2022-11-30 | Road disease identification and detection method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115937807A true CN115937807A (en) | 2023-04-07 |
Family
ID=86700164
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109840556B (en) | Image classification and identification method based on twin network | |
CN111311563A (en) | Image tampering detection method based on multi-domain feature fusion | |
CN111368886A (en) | Sample screening-based label-free vehicle picture classification method | |
CN109583483A (en) | A kind of object detection method and system based on convolutional neural networks | |
CN112949338A (en) | Two-dimensional bar code accurate positioning method combining deep learning and Hough transformation | |
CN114359130A (en) | Road crack detection method based on unmanned aerial vehicle image | |
CN108345900B (en) | Pedestrian re-identification method and system based on color texture distribution characteristics | |
CN105574545B (en) | The semantic cutting method of street environment image various visual angles and device | |
US20220147732A1 (en) | Object recognition method and system, and readable storage medium | |
CN114792372A (en) | Three-dimensional point cloud semantic segmentation method and system based on multi-head two-stage attention | |
CN113988147A (en) | Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device | |
CN116524189A (en) | High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization | |
CN115424059A (en) | Remote sensing land use classification method based on pixel level comparison learning | |
CN117197763A (en) | Road crack detection method and system based on cross attention guide feature alignment network | |
CN115311502A (en) | Remote sensing image small sample scene classification method based on multi-scale double-flow architecture | |
CN114494786A (en) | Fine-grained image classification method based on multilayer coordination convolutional neural network | |
CN113449784A (en) | Image multi-classification method, device, equipment and medium based on prior attribute map | |
Li et al. | An efficient method for DPM code localization based on depthwise separable convolution | |
CN116503399A (en) | Insulator pollution flashover detection method based on YOLO-AFPS | |
CN117132804A (en) | Hyperspectral image classification method based on causal cross-domain small sample learning | |
CN115937807A (en) | Road disease identification and detection method and system | |
CN113011506B (en) | Texture image classification method based on deep fractal spectrum network | |
CN115527064A (en) | Toxic mushroom fine-grained image classification method based on multi-stage ViT and contrast learning | |
CN108364256A (en) | A kind of image mosaic detection method based on quaternion wavelet transformation | |
CN113192018A (en) | Water-cooled wall surface defect video identification method based on fast segmentation convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||