CN116645592B - Crack detection method based on image processing and storage medium - Google Patents

Crack detection method based on image processing and storage medium Download PDF

Info

Publication number
CN116645592B
CN116645592B
Authority
CN
China
Prior art keywords
feature
window
image
layer
crack
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310914403.6A
Other languages
Chinese (zh)
Other versions
CN116645592A (en)
Inventor
牛伟龙
吴澄
盛洁
叶陆琴
吕景珑
钱曙杰
夏从东
吕志荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Rail Transit Group Co ltd
Suzhou University
Original Assignee
Suzhou Rail Transit Group Co ltd
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Rail Transit Group Co ltd, Suzhou University filed Critical Suzhou Rail Transit Group Co ltd
Priority to CN202310914403.6A priority Critical patent/CN116645592B/en
Publication of CN116645592A publication Critical patent/CN116645592A/en
Application granted granted Critical
Publication of CN116645592B publication Critical patent/CN116645592B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to the field of image processing. The deep-learning-based crack detection method provided by the invention adopts a Swin Mask RCNN algorithm for crack detection, with a Swin Transformer as the backbone network. Compared with prior target detection models, the Swin Transformer has stronger feature extraction and expression capability, so the extracted features are richer and the accuracy of crack detection is improved. The network constructed by the invention adds a multi-layer window fusion module after the sliding window operation of the Swin Transformer network, so that windows containing crack information can be fused, preventing the loss of crack information and better locating crack information in the image, while reducing the amount of computation and improving position detection accuracy.

Description

Crack detection method based on image processing and storage medium
Technical Field
The invention relates to the technical field of image processing, in particular to a crack detection method based on image processing and a storage medium.
Background
Mask RCNN, proposed in October 2017, is a deep learning model for object detection and instance segmentation. It inherits the two-stage detection framework of Faster RCNN and introduces an extra mask branch (Mask Branch) to solve the instance segmentation problem. The mask branch extracts a binary mask for each instance in the target image by adding convolution and deconvolution networks after the RoI pooling layer of Faster RCNN, fusing the ideas of convolutional neural networks, region-based convolutional neural networks (RCNN), and segmentation networks. The network structure consists of four parts (a minimal usage sketch follows the list):
Backbone: a convolutional neural network for feature extraction, typically ResNet or ResNeXt;
Region Proposal Network (RPN): a network for generating candidate regions, the same as in Faster RCNN;
Bounding Box Head: a network for classifying and regressing candidate regions, the same as in Faster RCNN;
Mask Head: a fully convolutional network for generating a mask for each candidate region.
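For orientation only, the sketch below instantiates this four-part structure through torchvision's reference Mask R-CNN (ResNet-50 backbone with FPN, RPN, box head and mask head, torchvision ≥ 0.13 API); it is the generic model, not the Swin-based network of this invention, and the 960×960 input size merely mirrors the example used later in the description.

```python
import torch
import torchvision

# Reference Mask R-CNN: backbone + FPN, RPN, bounding-box head, mask head.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights=None, num_classes=2)
model.eval()

img = torch.rand(3, 960, 960)        # one RGB image with values in [0, 1]
with torch.no_grad():
    out = model([img])[0]            # dict with 'boxes', 'labels', 'scores', 'masks'
print(out["boxes"].shape, out["masks"].shape)
```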
However, Mask RCNN performs only moderately on small target detection. First, Mask RCNN uses relatively coarse grid features, so it struggles to capture the detail information of fine targets, and detection performance naturally drops. In addition, practical images rarely show a single small target against a monotonic background; a collected picture generally contains multiple targets. The Mask RCNN classifier adapts well to large targets, but targets of the same class yet different sizes are hard to detect under a single parameter threshold. Moreover, when a picture contains large targets mixed among a set of small targets, Mask RCNN cannot segment them accurately, and these are precisely the cases that most need to be detected in crack detection. Therefore, although Mask RCNN detects large targets well, it struggles to show an advantage in the crack detection field.
Swin Transformer was proposed by Ze Liu et al. of Microsoft Research Asia in March 2021. Through a hierarchical structure and a shifted-window scheme with linear computational complexity, it solves two challenges of applying Transformers to the image domain: the scale variation of visual entities and the high computational cost of high-resolution images. It achieves excellent performance on multiple visual tasks, including image classification (up to 87.3% top-1 accuracy on ImageNet-1K), object detection and instance segmentation (up to 58.7 box AP and 51.1 mask AP on COCO test-dev), and semantic segmentation (up to 53.5 mIoU on ADE20K val).
Swin Mask RCNN is a Mask RCNN framework built on a Swin Transformer, implementing object detection and instance segmentation. By using a Swin Transformer as the backbone network in place of the convolutional backbone of Mask RCNN, and by using shifted windows of different sizes and a hierarchical structure, it resolves the challenges encountered in transferring Transformers from language to vision, such as the scale variation of visual entities and the high resolution of image pixels.
However, the Swin Transformer sliding window operation may split a fine target across multiple windows, affecting feature extraction and positioning accuracy, and the hierarchical structure may cause a fine target to be lost or blurred in the high-level feature maps, affecting classification and segmentation quality. Furthermore, the default position coding is not sensitive to small targets such as cracks. These factors make the Swin Transformer perform only moderately on small target detection tasks such as crack detection.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the defects of the prior art: Mask RCNN performs only moderately on small target detection; the Swin Transformer sliding window operation may split a small target across multiple windows, affecting feature extraction and positioning accuracy; the hierarchical structure may cause a small target to be lost or blurred in the high-level feature maps, affecting classification and segmentation quality; and the default position coding is not sensitive to small targets such as cracks, so the Swin Transformer also performs only moderately on small target detection tasks such as crack detection.
In order to solve the technical problems, the invention provides a crack detection method based on image processing, which comprises the following steps:
S101: acquiring a crack image to be detected;
S102: extracting features of the crack image to be detected by using the constructed improved Swin Transformer network as a backbone network, and generating a series of candidate regions, including:
S201: calculating on the crack image to be detected with the constructed multi-head self-attention mechanism module and outputting a feature map, comprising the following steps:
dividing the crack image to be detected into n small panes of the same window size to obtain a first divided feature map; performing window sliding on the small panes of the first divided feature map with the constructed Shift Window Attention mechanism module to obtain a second divided feature map; inputting the first divided feature map and the second divided feature map into the constructed multi-layer window fusion module, the multi-layer window fusion module normalizing the two feature maps, performing feature mapping between each small pane of the normalized first divided feature map and each small pane of the second divided feature map, and calculating a similarity matrix; and judging from the obtained similarity matrix whether small panes with overlapping parts need to be fused: if the similarity matrix indicates connected crack information, the multi-layer window fusion module fuses the two small panes and outputs a single window, and if the two small panes are not connected to each other, the multi-layer window fusion module outputs them as independent windows;
calculating each window output by the multi-layer window fusion module with the constructed transformation matrices to obtain the query value Q, key value K and value V of the self-attention mechanism of each window, wherein the pixels in each window are first linearly transformed by a convolution layer within the window to extract pixel data, and features are then extracted from the obtained Q, K and V of each window as follows:

Attention(Q, K, V) = SoftMax(QK^T / √d + B) · V

wherein d is the number of columns of the Q and K matrices, i.e. the vector dimension, and B represents the position offset of each window, the value of B being given by a set relative position offset parameter table;
using the relative position index constructed in the multi-head self-attention mechanism module to call parameters in the relative position coding table, and fusing the feature calculations of the windows to output the feature map of the multi-head self-attention calculation;
S202: processing the feature map output by the multi-head self-attention calculation with the constructed feature pyramid network, extracting features of different scales from it, fusing the features, and outputting the fused feature map;
S203: predicting the fused feature map with the constructed Region Proposal Network to obtain a series of candidate regions, wherein each candidate region contains a score and a position offset, the candidate regions are sorted and screened according to their scores, and a portion of high-scoring candidate regions is retained;
S103: classifying and regressing each obtained high-scoring candidate region, obtaining the final detection frame according to the position offset of each high-scoring candidate region, predicting a pixel-level mask within each detection frame, and outputting a detection result with target detection frames and image segmentation.
Further, when performing window sliding on the small panes of the first divided feature map with the constructed Shift Window Attention mechanism module to obtain the second divided feature map: when a small pane slides in the horizontal or vertical direction, the window sliding distance is smaller than the side length of the small pane, and when a small pane slides along its diagonal, the window sliding distance is smaller than the diagonal length of the small pane.
Further, processing the feature map output by the multi-head self-attention calculation with the constructed feature pyramid network comprises the following steps:
carrying out convolution operations on the input feature map to obtain a plurality of feature maps of different levels;
constructing a top-down path with lateral connections to build a feature pyramid, wherein each pyramid layer comprises an upsampling operation and a 1x1 convolution operation, used to add the feature map of the previous layer to the feature map of the current layer for feature fusion;
and smoothing each pyramid layer with a 3x3 convolution operation, and outputting the smoothed feature map of each pyramid layer.
Further, the size and number of channels of each of the plurality of different levels of feature maps are different.
Further, the classifying and regressing each candidate region, obtaining a final detection frame according to the position offset of each candidate region, predicting a pixel level mask in each detection frame, and outputting a detection result with a target detection frame and image segmentation includes:
pooling the candidate regions to the same size using the constructed RoIAlign layer;
outputting a feature vector of a fixed size of the candidate regions, representing the features of each candidate region;
classifying and regressing the feature vector of each candidate region, and outputting a class label and a position offset to represent the final detection frame of each RoI;
and carrying out masking operation on the feature vectors in the final detection frames of each RoI to predict pixel level masks in each detection frame, outputting a binary matrix to represent foreground and background pixels in each detection frame, and outputting a detection result with a target detection frame and image segmentation in combination with the original image.
Further, the RoIAlign layer pools candidate regions using bilinear interpolation.
Further, the classifying and regressing the feature vector of each candidate region specifically includes: the features of each candidate region are subjected to integrated regression by using a full connection layer, and the features of each candidate region are classified by using a softmax layer.
Further, the masking operation is performed on the feature vectors in the final detection frame of each candidate region to predict the pixel level mask in each detection frame, specifically, a convolution layer is used to perform feature extraction on the feature map in the detection frame, and a sigmoid activation function is used to binarize the extracted feature and output a binary matrix.
Further, the convolution layer in the feature extraction of the feature map in the detection frame by using one convolution layer comprises a constructed transverse edge convolution kernel and a constructed longitudinal edge convolution kernel.
The invention also provides a storage medium, on which a computer program is stored, which when being executed by a processor, implements the steps of the above-described crack detection method based on image processing.
Compared with the prior art, the technical scheme of the invention has the following advantages:
The image processing-based crack detection method and the storage medium creatively apply the Swin Mask RCNN model to the crack detection field and optimize its parameters and model structure. To better extract cracks, a multi-layer window fusion module is constructed after the sliding window operation of the Swin Transformer network, so that small panes carrying crack features are fused with each other and complete crack information is retained. In the feature extraction stage of the model, a transverse edge convolution kernel and a longitudinal edge convolution kernel are constructed, so that the information of elongated cracks can be better extracted; the amount of computation is reduced while the position detection accuracy is improved.
Drawings
In order that the invention may be more readily understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof that are illustrated in the appended drawings, in which
FIG. 1 is a flow chart of a crack detection method based on image processing according to an embodiment of the present invention;
FIG. 2 is an algorithmic workflow of an image processing-based crack detection method constructed in accordance with the present invention;
FIG. 3 is a schematic diagram of the sliding window operation in a Swin Transformer network;
FIG. 4 is a schematic diagram of the operation of a multi-layer window fusion module;
FIG. 5 is a schematic diagram of the operation of a multi-layer window fusion module;
FIG. 6 is a block diagram of the Self-Attention mechanism;
FIG. 7 is a schematic representation of a relative position encoding table constructed in accordance with the present invention;
FIG. 8 is a schematic diagram of the operation of the two-head self-attention mechanism;
FIG. 9 is a network schematic of a feature pyramid;
FIG. 10 is a network block diagram of a feature pyramid;
FIG. 11 is a block diagram of the Swin Transformer network;
FIG. 12 is a comparison of the first crack original image with the crack detection results after processing by a Mask RCNN network and by the image processing-based crack detection method provided by the invention;
FIG. 13 is a comparison of the second crack original image with the crack detection results after processing by a Mask RCNN network and by the image processing-based crack detection method provided by the invention;
FIG. 14 is a comparison of the third crack original image with the crack detection results after processing by a Mask RCNN network and by the image processing-based crack detection method provided by the invention;
FIG. 15 is a comparison of the fourth crack original image with the crack detection results after processing by a Mask RCNN network and by the image processing-based crack detection method provided by the invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and specific examples, which are not intended to be limiting, so that those skilled in the art will better understand the invention and practice it.
Example 1
Referring to fig. 1, the steps of the first embodiment of the present invention include:
S101: acquiring a crack image to be detected;
S102: extracting features of the crack image to be detected by using the constructed improved Swin Transformer network as a backbone network, and generating a series of candidate regions, including:
S201: calculating on the crack image to be detected with the constructed multi-head self-attention mechanism module and outputting a feature map, comprising the following steps:
dividing the crack image to be detected into n small panes of the same window size to obtain a first divided feature map; performing window sliding on the small panes of the first divided feature map with the constructed Shift Window Attention mechanism module to obtain a second divided feature map; inputting the first divided feature map and the second divided feature map into the constructed multi-layer window fusion module, the multi-layer window fusion module normalizing the two feature maps, performing feature mapping between each small pane of the normalized first divided feature map and each small pane of the second divided feature map, and calculating a similarity matrix; and judging from the obtained similarity matrix whether small panes with overlapping parts need to be fused: if the similarity matrix indicates connected crack information, the multi-layer window fusion module fuses the two small panes and outputs a single window, and if the two small panes are not connected to each other, the multi-layer window fusion module outputs them as independent windows;
calculating each window output by the multi-layer window fusion module with the constructed transformation matrices to obtain the query value Q, key value K and value V of the self-attention mechanism of each window, wherein the pixels in each window are first linearly transformed by a convolution layer within the window to extract pixel data, and features are then extracted from the obtained Q, K and V of each window as follows:

Attention(Q, K, V) = SoftMax(QK^T / √d + B) · V

wherein d is the number of columns of the Q and K matrices, i.e. the vector dimension, and B represents the position offset of each window, the value of B being given by a set relative position offset parameter table;
using the relative position index constructed in the multi-head self-attention mechanism module to call parameters in the relative position coding table, and fusing the feature calculations of the windows to output the feature map of the multi-head self-attention calculation;
S202: processing the feature map output by the multi-head self-attention calculation with the constructed feature pyramid network, extracting features of different scales from it, fusing the features, and outputting the fused feature map;
S203: predicting the fused feature map with the constructed Region Proposal Network to obtain a series of candidate regions, wherein each candidate region contains a score and a position offset, the candidate regions are sorted and screened according to their scores, and a portion of high-scoring candidate regions is retained;
S103: classifying and regressing each obtained high-scoring candidate region, obtaining the final detection frame according to the position offset of each high-scoring candidate region, predicting a pixel-level mask within each detection frame, and outputting a detection result with target detection frames and image segmentation.
The image processing-based crack detection method provided by the invention adopts a Swin Mask RCNN algorithm for crack detection, with a Swin Transformer as the backbone network. Compared with prior target detection models, the Swin Transformer has stronger feature extraction and expression capability, so the extracted features are richer and the accuracy of crack detection is improved. To better extract cracks, a multi-layer window fusion module is constructed after the sliding window operation of the Swin Transformer network, so that small panes carrying crack features are fused with each other, information is passed between the panes, and complete crack information is retained.
Example two
The algorithm workflow diagram of the second embodiment of the present invention is shown in FIG. 2: training is performed on collected road crack, tunnel crack and bridge crack data based on the Swin Mask RCNN deep learning network structure. The input pictures are cracks after scaling and manual annotation, and prediction is mainly divided into two stages, as follows:
The first stage: using a Swin Transformer as the backbone network, the features of the image are extracted and a series of candidate regions are generated. First, the input image is divided into a plurality of small blocks called patches, and each patch is converted into a feature vector. Second, self-attention is computed on each patch, using a sliding window and a hierarchical structure; in the attention calculation, this embodiment improves the position coding matrix to raise the model's detection efficiency and accuracy on cracks. Finally, feature maps of multiple scales are output, each containing patches of different sizes and resolutions.
Candidate region generation: on the final-scale feature map, a Region Proposal Network (RPN) is used to predict a series of candidate regions, called Regions of Interest (RoIs); each RoI contains a score and a position offset representing the confidence and position of the region. The RoIs are ranked and screened according to their scores, and a certain number of high-scoring RoIs are retained as the input of the second stage.
The second stage: each candidate region is classified and regressed to obtain a final detection frame, and a pixel-level mask is predicted within each detection frame. First, the candidate regions are aligned: for each RoI, an alignment operation is performed on the feature maps of different scales, using a RoIAlign layer to maintain the resolution and position accuracy of the feature maps; a fixed-size feature vector is then output representing the features of each RoI. The next step is detection frame prediction: classification and regression are performed on the feature vector of each RoI, using a fully connected layer and a softmax layer to predict the class and position offset of each RoI; a class label and a position offset are then output representing the final detection frame of each RoI. Finally, mask prediction is performed: a masking operation is applied to the feature vector of each RoI, using a convolution layer and a sigmoid layer improved for cracks to predict the pixel-level mask within each detection frame, and a binary matrix is output representing the foreground and background pixels within each detection frame.
The specific steps of the crack detection method based on image processing provided by the second embodiment of the invention include:
S31: acquiring a crack image to be detected and an annotated crack image;
S32: using a Swin Transformer as a backbone network, extracting features of an image and generating a series of candidate regions, comprising:
S321: the whole picture is cut and vector-embedded: the cut size is set to 4×4 pixels, the output channels after cutting are set to determine the size of the embedding vectors, and finally the H and W dimensions are flattened and moved to the first dimension. The input picture is (960, 960, 3) (three RGB channels); after cutting it becomes (240, 240, 96), where 96 is the channel output specified by the model and 240 comes from 960/4 = 240. With the window set to 7×7, a divided feature map, i.e. the first divided feature map, is obtained. Window sliding is then performed on the small panes of the first divided feature map with the constructed Shift Window Attention mechanism module, moving the window positions down and to the right by two units each, to obtain the second divided feature map. A minimal patch-embedding sketch follows.
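A minimal sketch of this patch embedding, assuming PyTorch's channel-first layout (the (960, 960, 3) picture above becomes a (1, 3, 960, 960) tensor); the 4×4 cut with stride 4 is expressed as a strided convolution:

```python
import torch
import torch.nn as nn

patch_embed = nn.Conv2d(3, 96, kernel_size=4, stride=4)  # 4x4 cut, 96 output channels
img = torch.randn(1, 3, 960, 960)                        # RGB crack image
feat = patch_embed(img)                                  # (1, 96, 240, 240); 240 = 960 / 4
tokens = feat.flatten(2).transpose(1, 2)                 # (1, 240*240, 96): H, W flattened
print(feat.shape, tokens.shape)
```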
Sliding window operation (Shifted Window Attention) in step S321: as shown in FIG. 3, the standard Transformer architecture and its adaptations to image classification all perform global self-attention, which computes the relationship (attention map) between each token and all other tokens. The Swin Transformer differs in that it uses a sliding window (Shift Window Attention) design.
S322: as shown in FIG. 4, the first divided feature map and the second divided feature map are input into the multi-layer window fusion module (Multi-window confluence, abbreviated MWC). The multi-layer window fusion module is a purely mathematical calculation module added after the original sliding window module; it adds no trainable network parameters and does not increase the network's computation. Its basic principle is as follows:
As shown in FIG. 5: the MWC takes as input the output fw of the W-MSA (the first divided feature map) and the output fsw of the SW-MSA (the second divided feature map). First, the feature maps are normalized along the channel direction by softmax, ensuring the vector modulus equals 1. The specific operation: a softmax operation is first performed, with the formula

softmax(z_i) = e^(z_i) / Σ_{c=1}^{C} e^(z_c)

where e is the base of the natural logarithm (the constant 2.71828...), z_i is the output value of the i-th node, and C is the number of output nodes. The normalized feature vector is then divided by its modulus, ensuring the vector modulus equals 1.
Then, each small pane of the normalized first divided feature map and each small pane of the second divided feature map are processed: taking each patch as a vector, the two feature maps are dot-multiplied at corresponding positions (the inputs are (240, 240, 96)), producing a feature map of the same size as the original; summing over the channel dimension yields the similarity matrix of fw and fsw. The similarity matrix is a feature matrix of the small panes; the small panes of the W-MSA and the SW-MSA have overlapping parts. If the matrix elements in the overlapping part of two small panes carry crack feature parameters, there may be crack information connecting the two overlapping panes and they are fused; if the matrix elements in the overlapping part carry no crack feature parameters, the two panes are not fused. A schematic sketch of this similarity computation follows.
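A schematic sketch of the similarity computation under the assumptions above: (H, W, C) feature maps, softmax over channels followed by scaling to unit modulus, then position-wise dot products summed over channels. The thresholding of overlap entries and the fusion decision itself are omitted; function and variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

def window_similarity(fw, fsw):
    # fw, fsw: (H, W, C) outputs of W-MSA and SW-MSA.
    fw_n = F.softmax(fw, dim=-1)
    fw_n = fw_n / fw_n.norm(dim=-1, keepdim=True)    # unit modulus per position
    fsw_n = F.softmax(fsw, dim=-1)
    fsw_n = fsw_n / fsw_n.norm(dim=-1, keepdim=True)
    return (fw_n * fsw_n).sum(dim=-1)                # (H, W) similarity matrix

sim = window_similarity(torch.randn(240, 240, 96), torch.randn(240, 240, 96))
print(sim.shape)                                     # torch.Size([240, 240])
```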
Finally, when a small pane of the first divided feature map is fused with a small pane of the second divided feature map, the feature map obtained by soft pooling and a mathematical addition transformation of fw and fsw is multiplied by the similarity matrix to obtain a feature window fc with context information. This allows multiple windows to be combined, preventing the loss of information about an elongated object such as a crack. The soft pooling operation computes the weights within each pooled region with a softmax function and takes a weighted average of the input features. Because the actual picture is large, a small example is used: assume an input feature map of size 4×4 with 1 channel,

(1 2 3 4; 5 6 7 8; 9 10 11 12; 13 14 15 16),

and a 2×2 pooling window. For each 2×2 pooled region, the weights resulting from the softmax operation are computed and the features within the region are weighted-averaged. For the first pooled region (1, 2, 5, 6), the softmax operation gives the weights (0.0474, 0.0474, 0.9526, 0.9526), and the weighted average of the features within the region is 1×0.0474 + 2×0.0474 + 5×0.9526 + 6×0.9526 = 5.9992. The softmax operation and weighted average are performed likewise for the other pooled regions, giving the final pooled feature map (5.9992 7.9992; 13.9992 15.9992). In this example the original feature map is 4×4; after the 2×2 soft pooling operation, a 2×2 pooled feature map is obtained. The mathematical addition transformation adds fw and fsw at corresponding positions.
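A sketch of the soft pooling step in the standard softmax-weighted (SoftPool) formulation, where each activation is weighted by e^x within its pooling window; the exact weights, and hence the pooled values, depend on how the softmax is scaled, so the numbers it prints need not coincide with the worked figures above.

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x, kernel_size=2, stride=2):
    # Softmax-weighted average: sum(e^x * x) / sum(e^x) over each window.
    e = torch.exp(x)
    num = F.avg_pool2d(e * x, kernel_size, stride)
    den = F.avg_pool2d(e, kernel_size, stride)
    return num / den

x = torch.arange(1.0, 17.0).reshape(1, 1, 4, 4)   # the 4x4 example above
print(soft_pool2d(x).squeeze())                   # one pooled value per 2x2 region
```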
S323: each window output by the multi-layer window fusion module is calculated with the constructed transformation matrices to obtain the query value Q, key value K and value V of each window's self-attention mechanism; features are extracted from the obtained Q, K and V of each window; the relative position index constructed in the multi-head self-attention mechanism module is used to call parameters in the relative position coding table; and the feature calculations of the windows are fused to output the feature map of the multi-head self-attention calculation;
The multi-head attention mechanism constructed by the present invention is composed of a plurality of self-attention mechanisms (Self-Attention); FIG. 6 is a structural diagram of a single self-attention mechanism:
The matrices Q (query), K (key) and V (value) are needed for the calculation. In practice, Self-Attention receives either the input (a matrix X of token representation vectors x) or the output of the previous encoding module. Q, K and V are obtained by linear transformations of the Self-Attention input: with the input represented by the matrix X, they are computed with the linear transformation matrices WQ, WK, WV as

Q = X · WQ,  K = X · WK,  V = X · WV.

After the matrices Q, K and V are obtained, the output of Self-Attention can be calculated as

Attention(Q, K, V) = SoftMax(QK^T / √d) · V

where d is the number of columns of the Q and K matrices, i.e. the vector dimension. The SoftMax here is applied to each row of the matrix, so that each row sums to 1, yielding the attention coefficients of each vector with respect to the other vectors. A sketch of this window attention calculation, including the relative position bias B used by the invention, follows.
As shown in FIG. 7, the relative position encoding of the present invention comprises two parts: the relative position index and the relative position coding table. The relative position coding table is constructed as follows: a relative position coding table is established and the data in the table are initialized; the annotated crack images are taken as sample data, and with the detection accuracy on the sample data as the criterion, the parameters in the relative position coding table are trained by a machine learning method to obtain the final table. The calculation uses the sign function sign(x), which returns the sign of x: if x is positive, sign(x) returns 1; if x is negative, sign(x) returns -1; and if x equals 0, sign(x) returns 0 (and likewise for y). Δx and Δy denote the x and y offset values.
As shown in FIG. 8, taking two-head attention as an example: the first self-attention head calculates b(i,1), and b(i,2) is obtained in the same way by the second head; the head outputs are concatenated and multiplied by a transformation matrix to calculate the output bi. In the figure, ei is a position code.
S324: processing the feature map output by the self-attention calculation with the constructed feature pyramid network, extracting features of different scales from the feature map, fusing them, and outputting the fused feature map;
the network schematic diagram of the feature pyramid constructed by the embodiment of the invention is shown in fig. 9:
a feature pyramid network (Feature Pyramid Network, FPN for short) is a generic network architecture for multi-scale object detection. FPN was originally proposed by Facebook AI Research in 2017 with the aim of using feature maps at different scales to detect targets of different sizes.
FPN exploits features at two different levels: one is the shallow, low-level features, which have high resolution but weak semantics; the other is the deep, high-level features, which have low resolution but rich semantics. Combining them forms a multi-scale feature map that can detect targets of different scales. FPN has two core steps: first, a pyramid-like structure is built bottom-up to generate initial feature maps of different resolutions; second, the initial feature maps are upsampled, fused and adjusted top-down to obtain better feature representations. Specifically, FPN upsamples the lower-resolution feature map to the same resolution as the feature map of the adjacent layer and then sums them with weights to obtain a richer feature representation.
The specific network structure diagram of the feature pyramid is shown in fig. 10:
The input image is convolved using the backbone network, with 3 input channels and multiple convolution kernels. Each convolution kernel convolves the 3 channels separately, and the 3 channel results are summed to give the convolution output, yielding a plurality of feature maps of different levels, where the size and channel number of each feature map differ.
A top-down path and lateral connections are then used to construct the feature pyramid; each pyramid layer contains an upsampling operation and a 1x1 convolution operation, used to add the feature map of the previous layer to the feature map of the current layer and thereby fuse high-level and low-level information. The upsampling operation here uses bilinear interpolation, a method that computes each pixel value in the high-resolution image by linear interpolation in two directions. Suppose a low-resolution image of size m×n needs to be upsampled to a high-resolution image of size M×N. For each pixel position (X, Y) in the high-resolution image, its corresponding value must be calculated. First, the corresponding position (x, y) in the low-resolution image is computed as

x = X · m / M,  y = Y · n / N.

Then the four pixels nearest to (x, y) in the low-resolution image are found: (x1, y1), (x2, y1), (x1, y2) and (x2, y2), where x1 = ⌊x⌋ and y1 = ⌊y⌋ are the largest integers not greater than x and y, and x2 = ⌈x⌉ and y2 = ⌈y⌉ are the smallest integers not less than x and y. Next, the pixel value at position (X, Y) in the high-resolution image is calculated by

f(X, Y) = f(x1, y1)·(x2 − x)·(y2 − y) + f(x2, y1)·(x − x1)·(y2 − y) + f(x1, y2)·(x2 − x)·(y − y1) + f(x2, y2)·(x − x1)·(y − y1),

where f(x, y) denotes the pixel value at position (x, y) in the low-resolution image. Through the above calculation, the low-resolution image can be upsampled to a high-resolution image by linear interpolation. A small numeric sketch follows.
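A small numeric sketch of the interpolation formula above; f is indexed as f[x][y], and integral coordinates are nudged so the four neighbours stay distinct.

```python
import math

def bilinear_sample(f, x, y):
    # f: 2-D grid of low-resolution pixel values; (x, y): position from the mapping above.
    x1, y1 = math.floor(x), math.floor(y)
    x2, y2 = math.ceil(x), math.ceil(y)
    if x2 == x1:
        x2 += 1
    if y2 == y1:
        y2 += 1
    return (f[x1][y1] * (x2 - x) * (y2 - y) + f[x2][y1] * (x - x1) * (y2 - y)
            + f[x1][y2] * (x2 - x) * (y - y1) + f[x2][y2] * (x - x1) * (y - y1))

low = [[0.0, 10.0], [20.0, 30.0]]
print(bilinear_sample(low, 0.5, 0.5))   # 15.0, the mean of the four neighbours
```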
Finally, each pyramid layer is smoothed with a 3x3 convolution operation, so that every pyramid layer has the same number of channels and the aliasing effects caused by upsampling are eliminated. An FPN-based target detection model thus obtains better multi-scale feature maps for detecting targets of different sizes, which effectively improves target detection accuracy. A minimal pyramid sketch follows.
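A minimal sketch of such a pyramid, assuming four input levels with the channel counts of a Swin-T-style backbone (96, 192, 384, 768); lateral 1x1 convolutions unify channels, bilinear upsampling carries semantics top-down, and 3x3 convolutions smooth the fused maps. Names and channel counts are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                       # feats: highest resolution first
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down: upsample and add
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:],
                mode="bilinear", align_corners=False)
        return [s(p) for s, p in zip(self.smooth, laterals)]

feats = [torch.randn(1, c, s, s) for c, s in zip((96, 192, 384, 768), (240, 120, 60, 30))]
print([p.shape for p in SimpleFPN()(feats)])        # all levels now have 256 channels
```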
S325: predicting the fused feature map with the constructed Region Proposal Network (RPN) layer to obtain a series of candidate regions.
S33: classifying and regressing each candidate region to obtain a final detection frame, and predicting a pixel level mask in each detection frame, including:
S331: the candidate regions are pooled with the constructed RoIAlign layer, which pools the candidate regions to the same size using bilinear interpolation. The specific operation: a fixed-size 3×3 pooling window slides over the input feature map; for each pooling window, the average of all values within the window is computed; this average is taken as the value of the corresponding position in the output feature map. This maintains the resolution and position accuracy of the feature map; the pooled RoI is output as the candidate region, and a fixed-size feature vector is output to represent the features of the RoI.
In step S331, RoIAlign accurately aligns the features of the RoI on the feature map: the RoIAlign layer divides the RoI area into small cells of fixed size and computes the pixel value of each cell on the feature map by bilinear interpolation, obtaining the feature map corresponding to the RoI. A usage sketch follows.
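A usage sketch based on torchvision's RoIAlign, which samples by bilinear interpolation; note the patent's variant pools with a sliding 3×3 averaging window, so this stand-in illustrates the alignment idea rather than the exact operation. The feature-map size and box coordinates are illustrative.

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 60, 60)                        # one feature level of a 960x960 image
rois = torch.tensor([[0.0, 100.0, 200.0, 400.0, 650.0]])  # (batch_idx, x1, y1, x2, y2)
pooled = roi_align(feat, rois, output_size=(7, 7),
                   spatial_scale=60 / 960,                # image coords -> feature-map coords
                   sampling_ratio=2, aligned=True)
print(pooled.shape)                                       # torch.Size([1, 256, 7, 7])
```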
S332: the feature vector of each RoI is classified and regressed using a fully connected layer and a softmax layer: the fully connected layer performs integrated regression on the features of each candidate region, and the softmax layer classifies the features of each candidate region to predict the class and position offset of each RoI. A class label and a position offset are then output, representing the final detection frame of each RoI;
S333: mask prediction is performed. A masking operation is applied to the feature vector of each RoI, using a convolution layer and a sigmoid layer improved for cracks: the convolution layer extracts features from the feature map within each detection frame, and a constructed transverse edge convolution kernel and longitudinal edge convolution kernel are arranged in the convolution layer to further extract the detailed parts of the crack image; the sigmoid activation function then binarizes the extracted features and outputs a binary matrix consisting of 0s and 1s, where 0 and 1 represent the two states (background and foreground) of each pixel, predicting the pixel-level mask within each detection frame. Finally, combined with the original image, a detection result with target detection frames and image segmentation is output.
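A sketch of the edge-sensitive mask step. The patent does not give the weights of its transverse and longitudinal edge convolution kernels, so Sobel-style kernels stand in for them here; the sigmoid-plus-threshold binarization follows the description above.

```python
import torch
import torch.nn.functional as F

# Sobel-style stand-ins for the transverse (kx) and longitudinal (ky) edge kernels.
kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
ky = kx.transpose(2, 3)

def crack_mask(feat, threshold=0.5):
    # feat: (N, 1, H, W) feature map inside one detection frame.
    edges = F.conv2d(feat, kx, padding=1) + F.conv2d(feat, ky, padding=1)
    prob = torch.sigmoid(edges)                # per-pixel foreground probability
    return (prob > threshold).to(torch.uint8)  # binary matrix of 0s and 1s

mask = crack_mask(torch.randn(1, 1, 28, 28))
print(mask.shape)                              # torch.Size([1, 1, 28, 28])
```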
The Swin Transformer network structure constructed by the invention is shown in FIG. 11:
The Swin Transformer network models and processes visual data using a window translation-based method. The whole is composed of window multi-head self-attention (W-MSA), shifted window multi-head self-attention (SW-MSA) and a multi-layer perceptron (MLP). Inserting LayerNorm (LN) layers in between makes training more stable, and a residual connection is used after each module. The specific formulas are as follows:

ẑ^l = W-MSA(LN(z^(l−1))) + z^(l−1)
z^l = MLP(LN(ẑ^l)) + ẑ^l
ẑ^(l+1) = SW-MSA(LN(z^l)) + z^l
z^(l+1) = MLP(LN(ẑ^(l+1))) + ẑ^(l+1)
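A minimal sketch of one such block matching the formulas above; the attention module (W-MSA or SW-MSA, including window partitioning and shifting) is passed in as `attn` and is not implemented here.

```python
import torch
import torch.nn as nn

class SwinBlock(nn.Module):
    # LN -> (S)W-MSA -> residual, then LN -> MLP -> residual.
    def __init__(self, dim, attn, mlp_ratio=4):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = attn
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        z = z + self.attn(self.ln1(z))   # z_hat^l = (S)W-MSA(LN(z^(l-1))) + z^(l-1)
        z = z + self.mlp(self.ln2(z))    # z^l     = MLP(LN(z_hat^l)) + z_hat^l
        return z

block = SwinBlock(96, attn=nn.Identity())    # identity attention just to run the block
print(block(torch.randn(1, 49, 96)).shape)   # torch.Size([1, 49, 96])
```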
The core idea of the Swin Transformer is to model the image with a hierarchical window mechanism: each window processes a portion of the image, and the window-level results are then integrated by a Transformer-based method. The Swin Transformer divides the image into a plurality of windows, each treated as a feature map; images of different sizes, after being assigned to different numbers of windows, produce a set of multi-resolution feature maps, which reduces the time complexity and enhances the feature extraction capability of the model. Meanwhile, the Swin Transformer uses cross-window local convolution to extract features from a non-local view, and absolute and relative position codes are introduced into the windows, so that the model can establish position-dependent relations in a coordinate system when processing images with spatial information.
The invention builds the Swin Mask RCNN network algorithm model using the Python-based PyTorch framework. A virtual environment suitable for training the model is set up by selecting CUDA, MMDetection, the COCO toolkits and other components compatible with the MMCV version; a suitable weight file is selected according to the number and size of the images; and 2000 pictures are divided into a training set, a test set and a validation set in a 6:2:2 ratio (a split sketch follows below). Crack positions, specific outlines, label information and so on are annotated manually. After the model parameters are modified according to the image size, the number of detection classes and the number of pictures, the training set is fed into the pre-trained model for training, and a trained weight file is obtained after 700 rounds of training. A Python program then calls the trained weight file to run the test set, i.e. a picture is input and a test result is given; the model mAP is about 41%. The comparison between the image processing-based crack detection method and Mask RCNN network extraction is shown in FIG. 12, FIG. 13, FIG. 14 and FIG. 15, which show four different crack original images and comparisons of the crack detection results processed by the Mask RCNN network and by the image processing-based crack detection method provided by the invention; in each, the left image is the original crack image, the middle image is the crack detection result extracted by the Mask RCNN network, and the right image is the crack detection result of the image processing-based crack detection method.
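A sketch of the 6:2:2 split described above; the seed, function name and path handling are illustrative, and the actual experiments used the MMDetection tooling rather than this helper.

```python
import random

def split_dataset(paths, ratios=(0.6, 0.2, 0.2), seed=0):
    # Shuffle the 2000 annotated crack images and split 6:2:2
    # into training, test and validation sets.
    paths = list(paths)
    random.Random(seed).shuffle(paths)
    n_train = int(ratios[0] * len(paths))
    n_test = int(ratios[1] * len(paths))
    return (paths[:n_train],
            paths[n_train:n_train + n_test],
            paths[n_train + n_test:])

train, test, val = split_dataset([f"crack_{i:04d}.jpg" for i in range(2000)])
print(len(train), len(test), len(val))   # 1200 400 400
```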
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and computer program products of methods and embodiments of the application. It will be understood that each of the flows in the flowchart may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.

Claims (10)

1. A crack detection method based on image processing, characterized by comprising the following steps:
S101: acquiring a crack image to be detected;
S102: extracting features of the crack image to be detected by using the constructed improved Swin Transformer network as a backbone network, and generating a series of candidate regions, including:
S201: calculating on the crack image to be detected with the constructed multi-head self-attention mechanism module and outputting a feature map, comprising the following steps:
dividing the crack image to be detected into n small panes of the same window size to obtain a first divided feature map; performing window sliding on the small panes of the first divided feature map with the constructed Shift Window Attention mechanism module to obtain a second divided feature map; inputting the first divided feature map and the second divided feature map into the constructed multi-layer window fusion module, the multi-layer window fusion module normalizing the two feature maps, performing feature mapping between each small pane of the normalized first divided feature map and each small pane of the second divided feature map, and calculating a similarity matrix; and judging from the obtained similarity matrix whether small panes with overlapping parts need to be fused: if the similarity matrix indicates connected crack information, the multi-layer window fusion module fuses the two small panes and outputs a single window, and if the two small panes are not connected to each other, the multi-layer window fusion module outputs them as independent windows;
calculating each window output by the multi-layer window fusion module with the constructed transformation matrices to obtain the query value Q, key value K and value V of the self-attention mechanism of each window, wherein the pixels in each window are first linearly transformed by a convolution layer within the window to extract pixel data, and features are then extracted from the obtained Q, K and V of each window as follows:

Attention(Q, K, V) = SoftMax(QK^T / √d + B) · V

wherein d is the number of columns of the Q and K matrices, i.e. the vector dimension, and B represents the position offset of each window, the value of B being given by a set relative position offset parameter table;
using the relative position index constructed in the multi-head self-attention mechanism module to call parameters in the relative position coding table, and fusing the feature calculations of the windows to output the feature map of the multi-head self-attention calculation;
S202: processing the feature map output by the multi-head self-attention calculation with the constructed feature pyramid network, extracting features of different scales from it, fusing the features, and outputting the fused feature map;
S203: predicting the fused feature map with the constructed Region Proposal Network to obtain a series of candidate regions, wherein each candidate region contains a score and a position offset, the candidate regions are sorted and screened according to their scores, and a portion of high-scoring candidate regions is retained;
S103: classifying and regressing each obtained high-scoring candidate region, obtaining the final detection frame according to the position offset of each high-scoring candidate region, predicting a pixel-level mask within each detection frame, and outputting a detection result with target detection frames and image segmentation.
2. The image processing-based crack detection method as claimed in claim 1, wherein: when performing window sliding on the small panes of the first divided feature map with the constructed Shift Window Attention mechanism module to obtain the second divided feature map, when a small pane slides in the horizontal or vertical direction, the window sliding distance is smaller than the side length of the small pane, and when a small pane slides along its diagonal, the window sliding distance is smaller than the diagonal length of the small pane.
3. The image processing-based crack detection method as claimed in claim 1, wherein processing the feature map output by the multi-head self-attention calculation with the constructed feature pyramid network comprises the following steps:
carrying out convolution operations on the input feature map to obtain a plurality of feature maps of different levels;
constructing a top-down path with lateral connections to build a feature pyramid, wherein each pyramid layer comprises an upsampling operation and a 1x1 convolution operation, used to add the feature map of the previous layer to the feature map of the current layer for feature fusion;
and smoothing each pyramid layer with a 3x3 convolution operation, and outputting the smoothed feature map of each pyramid layer.
4. A crack detection method based on image processing as claimed in claim 3, characterized in that: the size and number of channels of each of the plurality of different levels of feature maps are different.
5. The image processing-based crack detection method as claimed in claim 1, wherein: classifying and regressing each candidate region, obtaining a final detection frame according to the position offset of each candidate region, predicting a pixel level mask in each detection frame, and outputting a detection result with a target detection frame and image segmentation, wherein the step of outputting the detection result comprises the following steps:
pooling the candidate regions to the same size using the constructed RoIAlign layer;
outputting a feature vector of a fixed size of the candidate regions, representing the features of each candidate region;
Classifying and regressing the feature vector of each candidate region, and outputting a class label and a position offset to represent the final detection frame of each RoI;
and carrying out masking operation on the feature vectors in the final detection frames of each RoI to predict pixel level masks in each detection frame, outputting a binary matrix to represent foreground and background pixels in each detection frame, and outputting a detection result with a target detection frame and image segmentation in combination with the original image.
6. The image processing-based crack detection method as set forth in claim 5, wherein: the RoIAlign layer pools candidate regions using a bilinear interpolation method.
7. The image processing-based crack detection method as set forth in claim 5, wherein: the classifying and regressing the feature vector of each candidate region specifically comprises the following steps: the features of each candidate region are subjected to integrated regression by using a full connection layer, and the features of each candidate region are classified by using a softmax layer.
8. The image processing-based crack detection method as set forth in claim 5, wherein: the masking operation is performed on the feature vectors in the final detection frame of each candidate region to predict the pixel level mask in each detection frame, specifically, a convolution layer is used to perform feature extraction on the feature map in the detection frame, and a sigmoid activation function is used to binarize the extracted features and output a binary matrix.
9. The image processing-based crack detection method as set forth in claim 8, wherein: the convolution layer in the feature extraction of the feature map in the detection frame by using one convolution layer comprises a constructed transverse edge convolution kernel and a constructed longitudinal edge convolution kernel.
10. A storage medium, characterized by: the storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of a crack detection method based on image processing as claimed in any one of claims 1 to 9.
CN202310914403.6A 2023-07-25 2023-07-25 Crack detection method based on image processing and storage medium Active CN116645592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310914403.6A CN116645592B (en) 2023-07-25 2023-07-25 Crack detection method based on image processing and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310914403.6A CN116645592B (en) 2023-07-25 2023-07-25 Crack detection method based on image processing and storage medium

Publications (2)

Publication Number Publication Date
CN116645592A CN116645592A (en) 2023-08-25
CN116645592B true CN116645592B (en) 2023-09-29

Family

ID=87640340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310914403.6A Active CN116645592B (en) 2023-07-25 2023-07-25 Crack detection method based on image processing and storage medium

Country Status (1)

Country Link
CN (1) CN116645592B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274817B (en) * 2023-11-15 2024-03-12 深圳大学 Automatic crack identification method and device, terminal equipment and storage medium
CN117557881B (en) * 2024-01-12 2024-04-05 城云科技(中国)有限公司 Road crack detection method based on feature map alignment and image-text matching and application thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767424A (en) * 2018-12-13 2019-05-17 西安电子科技大学 Binocular vision train water filling port based on FPGA detects localization method
CN114022770A (en) * 2021-11-11 2022-02-08 中山大学 Mountain crack detection method based on improved self-attention mechanism and transfer learning
CN114494164A (en) * 2022-01-13 2022-05-13 大连嘉济自动化机电科技有限公司 Steel surface defect detection method and device and computer storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114202696B (en) * 2021-12-15 2023-01-24 安徽大学 SAR target detection method and device based on context vision and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767424A (en) * 2018-12-13 2019-05-17 西安电子科技大学 Binocular vision train water filling port based on FPGA detects localization method
CN114022770A (en) * 2021-11-11 2022-02-08 中山大学 Mountain crack detection method based on improved self-attention mechanism and transfer learning
CN114494164A (en) * 2022-01-13 2022-05-13 大连嘉济自动化机电科技有限公司 Steel surface defect detection method and device and computer storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows; Ze Liu et al.; 2021 IEEE/CVF International Conference on Computer Vision (ICCV); pp. 9992-10002 *

Also Published As

Publication number Publication date
CN116645592A (en) 2023-08-25

Similar Documents

Publication Publication Date Title
CN109859190B (en) Target area detection method based on deep learning
CN110287960B (en) Method for detecting and identifying curve characters in natural scene image
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN109299274B (en) Natural scene text detection method based on full convolution neural network
CN110428428B (en) Image semantic segmentation method, electronic equipment and readable storage medium
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN111950453B (en) Random shape text recognition method based on selective attention mechanism
CN116645592B (en) Crack detection method based on image processing and storage medium
CN113362329B (en) Method for training focus detection model and method for recognizing focus in image
CN111126472A (en) Improved target detection method based on SSD
CN112800964B (en) Remote sensing image target detection method and system based on multi-module fusion
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN113780296A (en) Remote sensing image semantic segmentation method and system based on multi-scale information fusion
CN111523553A (en) Central point network multi-target detection method based on similarity matrix
CN111368769A (en) Ship multi-target detection method based on improved anchor point frame generation model
CN114758288A (en) Power distribution network engineering safety control detection method and device
CN111274981B (en) Target detection network construction method and device and target detection method
CN112800955A (en) Remote sensing image rotating target detection method and system based on weighted bidirectional feature pyramid
CN115131797B (en) Scene text detection method based on feature enhancement pyramid network
CN116994140A (en) Cultivated land extraction method, device, equipment and medium based on remote sensing image
CN113901972A (en) Method, device and equipment for detecting remote sensing image building and storage medium
CN113239736A (en) Land cover classification annotation graph obtaining method, storage medium and system based on multi-source remote sensing data
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
Huang et al. Attention-guided label refinement network for semantic segmentation of very high resolution aerial orthoimages
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant