CN114973246A - Crack detection method using a cross-modal neural network based on optical flow alignment
- Publication number
- CN114973246A (application number CN202210643687.5A)
- Authority
- CN
- China
- Prior art keywords
- multiplied
- size
- image
- fusion
- rgb
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A10/00—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
- Y02A10/40—Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Multimedia (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Data Mining & Analysis (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a crack detection method using a cross-modal neural network based on optical flow alignment. The method, built on the optical-flow-aligned cross-modal semantic segmentation neural network FAC-Net, performs cross-modal crack detection by fusing RGB image and depth image feature information and comprises the following steps: S0, constructing a data set and training the cross-modal feature neural network FAC-Net; S1, acquiring an RGB image and a depth image of the target to be detected; S2, passing the RGB image and the depth image of the same target through the pre-trained cross-modal feature neural network FAC-Net to obtain a classification result image, in which the classification result comprises crack semantic pixels and background semantic pixels. The method fuses RGB image and depth image feature information and performs crack detection across modalities.
Description
Technical Field
The invention relates to the technical field of concrete structure crack identification, and in particular to a crack detection method using a cross-modal neural network based on optical flow alignment.
Background
Natural disasters such as floods, earthquakes and typhoons occur every year. They not only threaten people's lives but also cause irreversible damage to concrete structures such as roads, bridges and buildings, so that an originally sound house can turn into a dangerous house whose structure is seriously damaged and which carries unpredictable risks. Observing cracks in a building is a very important part of identifying a dangerous house, yet most existing dangerous-house surveys are carried out manually, which puts the surveyors' lives at great risk. If a machine is instead used to collect site images of the house, the cracks can be found through background processing and their approximate positions marked in the image, so that concrete cracks can be detected effectively under safe conditions.
With the rapid development of deep learning, and especially of image processing, object detection and computer vision, image-based non-destructive testing has become a research hotspot for defect detection both in China and abroad. Most such methods rely on digital image processing techniques and machine learning algorithms and can detect some simple structural damage.
However, existing methods still cannot effectively handle some problems that arise in real scenes; in particular, cracks cannot be detected accurately in complex environments that contain, for example, water stains or hand-applied markings that are easily confused with cracks. These methods ignore the physical characteristics that distinguish a crack from its background environment: a crack is deep, and its depth changes rapidly and with large amplitude compared with the surrounding environment.
Optical flow is an important method for analyzing moving images. The concept was first proposed by James J. Gibson in the 1940s and refers to the velocity of pattern motion in time-varying images. When an object moves, the brightness pattern of its corresponding points on the image also moves; this apparent motion of the image brightness pattern is the optical flow. Optical flow expresses the change in an image and, because it contains information about object motion, allows an observer to determine how objects move. The definition can be extended to the optical flow field, a two-dimensional (2D) instantaneous velocity field formed by all pixels of an image, in which each two-dimensional velocity vector is the projection onto the imaging plane of the three-dimensional velocity vector of a visible point in the scene. Optical flow therefore contains not only the motion information of the observed object but also rich information about the three-dimensional structure of the scene.
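In the modules described later, "feature correction through an optical flow map" amounts to warping a feature map along a predicted two-channel flow field. The following is a minimal PyTorch sketch of such a warp, given only for reference; the function name and the pixel-offset convention are assumptions for illustration, not details taken from the patent:

```python
import torch
import torch.nn.functional as F


def flow_warp(feature: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map (N, C, H, W) along a flow field (N, 2, H, W) given in pixels."""
    _, _, h, w = feature.shape
    # Base sampling grid of pixel coordinates, x first then y.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feature.device, dtype=feature.dtype),
        torch.arange(w, device=feature.device, dtype=feature.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0) + flow.permute(0, 2, 3, 1)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    gx = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(
        feature, torch.stack((gx, gy), dim=-1), mode="bilinear", align_corners=True
    )
```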
Disclosure of Invention
The invention provides a crack detection method using a cross-modal neural network based on optical flow alignment, which achieves cross-modal crack detection by fusing RGB image and depth image feature information.
The invention adopts the following technical scheme.
A crack detection method using a cross-modal neural network based on optical flow alignment is disclosed. The method is based on the optical-flow-aligned cross-modal semantic segmentation neural network FAC-Net and performs crack detection across modalities by fusing RGB image and depth image feature information. It comprises the following steps;
step S0, constructing a data set, training a cross-modal characteristic neural network FAC-Net,
s1, acquiring an RGB image and a depth image of a target to be detected;
s2, obtaining a classification result image from the RGB image and the depth image of the same target to be detected through a pre-trained cross-modal characteristic neural network FAC-Net, wherein the classification result comprises crack semantic pixels and background semantic pixels;
and identifying the crack region according to the crack semantic pixel and the background semantic pixel.
In the step S0, the specific method includes:
step S01, shooting the concrete object by using a camera with a depth perception technology and an RGB sensor, and acquiring the original image data of the concrete object to form a data set;
step S02, classifying and marking the data of the data set into a crack pixel class and a background pixel class;
step S03, dividing the data set into training set and testing set according to the proportion;
step S04, training the FAC-Net neural network by using a data set;
and step S05, arranging the trained neural network model on back-end equipment for crack detection.
In step S01, the concrete objects include the concrete structure surfaces of roads, bridges and building structures;
in step S02, the data of the crack-pixel class and the background-pixel class can be expanded by data enhancement to enlarge the data set;
in step S03, the image data set includes RGB image data, depth image data and the corresponding label data.
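For illustration, a minimal PyTorch sketch of such a paired RGB/depth/label data set is given below; the directory layout and file naming are assumptions made for the example, not part of the patent:

```python
import os
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image


class CrackRGBDDataset(Dataset):
    """Pairs each RGB image with its depth map and its pixel-level crack label."""

    def __init__(self, root: str):
        # Assumed layout: root/rgb/*.png, root/depth/*.png, root/label/*.png with matching names.
        self.root = root
        self.names = sorted(os.listdir(os.path.join(root, "rgb")))

    def __len__(self) -> int:
        return len(self.names)

    def __getitem__(self, idx: int):
        name = self.names[idx]
        rgb = np.array(Image.open(os.path.join(self.root, "rgb", name)).convert("RGB"),
                       dtype=np.float32) / 255.0
        depth = np.array(Image.open(os.path.join(self.root, "depth", name)), dtype=np.float32)
        label = np.array(Image.open(os.path.join(self.root, "label", name)), dtype=np.int64)
        rgb = torch.from_numpy(rgb).permute(2, 0, 1)       # 3 x H x W
        depth = torch.from_numpy(depth).unsqueeze(0)       # 1 x H x W
        label = torch.from_numpy(label)                    # H x W, 0 = background, 1 = crack
        return rgb, depth, label
```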
In step S1, a camera equipped with depth-sensing technology and an RGB sensor is used to acquire the RGB image and the depth image of the target to be detected; the acquisition is not restricted to a particular shooting mode, and an image is considered acceptable as long as it is clear.
In step S2, the captured RGB image and depth image are transmitted to the back-end device over a network or a wired connection, and the back-end device inputs them into the trained FAC-Net neural network to obtain the corresponding concrete crack detection result.
The FAC-Net network has an encoder-decoder structure; the encoder has 2 branches and 1 fusion area, the branches being the RGB branch and the Depth branch;
after each branch unit of the FAC-Net network, the RGB feature and the Depth feature are input into an FAA fusion module to obtain a fusion feature, which is then added back to the original branch features; after 4 units, the highest-level fusion feature is input into the feature pyramid PPM to extract features further;
the decoder part uses FA modules to make the low-resolution, high-semantic feature image flow towards the high-resolution, low-semantic feature image, and the output image of the classification result is obtained after 3 such modules;
both branches of the encoder use the classical ResNet-50 network as the backbone;
in the encoder backbone network structure, the first unit consists of a convolution layer, a max pooling layer and a comprehensive convolution layer in sequence; the second unit is a three-layer comprehensive convolution layer; the third unit is likewise a three-layer comprehensive convolution layer.
In the encoder backbone network structure, the convolution layer of the first unit comprises 64 convolution kernels, the size of the convolution kernels is 7 x 7, the step length is 2, and the padding is 3; the pooling core size of the maximum pooling layer of the first unit is 3 × 3, and the step length is 2; the integrated convolutional layers of the first unit are of a three-layer structure, wherein the first convolutional layer comprises 64 convolution kernels, the size of the convolution kernels is 1 x 1, the second convolutional layer comprises 64 convolution kernels, the size of the convolution kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 256 convolution kernels, and the size of the convolution kernels is 1 x 1;
the integrated convolutional layers of the second unit are of a three-layer structure, the first convolutional layer comprises 128 convolution kernels, the size of the convolution kernels is 1 x 1, the second convolutional layer comprises 128 convolution kernels, the size of the convolution kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 512 convolution kernels, and the size of the convolution kernels is 1 x 1;
the comprehensive convolutional layer of the third unit is of a three-layer structure, the first convolutional layer comprises 256 convolutional kernels, the size of the convolutional kernels is 1 x 1, the second convolutional layer comprises 256 convolutional kernels, the size of the convolutional kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 1024 convolutional kernels, and the size of the convolutional kernels is 1 x 1;
the integrated convolutional layer of the fourth unit has a three-layer structure, the first convolutional layer comprises 512 convolution kernels, the size of the convolution kernels is 1 x 1, the second convolutional layer comprises 512 convolution kernels, the size of the convolution kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 2048 convolution kernels, and the size of the convolution kernels is 1 x 1.
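As a reference for the unit structure just described, a sketch of the first encoder unit in PyTorch follows. Batch normalization, ReLU placement and the residual shortcut follow standard ResNet-50 practice (the shortcut is omitted here for brevity); in practice the layers of torchvision's resnet50 could be used directly as the backbone.

```python
import torch.nn as nn


def comprehensive_conv(in_ch: int, mid_ch: int, out_ch: int, stride: int = 1) -> nn.Sequential:
    """One 'comprehensive convolution layer' (ResNet bottleneck): 1x1 -> 3x3 -> 1x1 convolutions."""
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )


# First unit: 64 kernels of 7x7 with stride 2 and padding 3, a 3x3 max pooling with stride 2,
# then a 64-64-256 comprehensive convolution layer.
unit1 = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.BatchNorm2d(64), nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    comprehensive_conv(64, 64, 256),
)
```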
In the detection method, the information in the depth picture and the RGB picture is aligned using the optical flow characteristics of the FAA fusion module and then fused;
the FAA fusion module comprises an RGB branch and a Depth branch;
when the RGB branch operates, the RGB feature of the image is expressed as H × W × C, i.e. height × width × number of channels. The RGB feature first passes through the first convolution layer of the RGB branch, whose convolution kernel size is 1 × 1 with stride 1 and padding 0; after this layer the feature map size becomes H × W × 1;
when the Depth branch operates, the Depth feature of the image is likewise expressed as H × W × C. The Depth feature passes through the first convolution layer of the Depth branch, the two convolved features are concatenated along the channel direction into an H × W × 2 tensor, and a fusion convolution layer with kernel size 3 × 3, stride 1 and padding 1 then produces an H × W × 4 fused optical flow map;
two layers of the fused optical flow map form the RGB-branch optical flow map and the other two layers form the Depth-branch optical flow map. The fused optical flow map is split into two H × W × 2 optical flow maps, and the original H × W × C RGB feature and the original H × W × C Depth feature are first corrected through their corresponding optical flow maps. The corrected H × W × C RGB feature and H × W × C Depth feature are then input to the attention mechanism of a spatial attention module to further extract fusion features: the two features are average-pooled and max-pooled along the channel direction to generate four features of size H × W × 1, the four features are concatenated along the channel direction into one H × W × 4 feature, and this is passed through a convolution layer with kernel size 3 × 3, stride 1 and padding 1 followed by a Sigmoid activation to obtain an H × W × 2 spatial attention weight matrix;
finally, the H × W × 2 spatial attention weight matrix is split into two H × W × 1 spatial attention weight matrices; the original H × W × C RGB feature and the original H × W × C Depth feature are multiplied by their corresponding weight matrices, the two weighted features are added, and the sum is passed through a ReLU activation to obtain the H × W × C FAA fusion feature; the input RGB feature, the input Depth feature and the output FAA fusion feature all have the same size.
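A minimal PyTorch sketch of the FAA fusion module as described above is given below. It reuses the flow_warp helper sketched in the Background section; the patent text is ambiguous about whether the attention weights re-weight the flow-corrected or the unwarped features, and the flow-corrected features are used here as an interpretation, not a verbatim reading.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FAAFusion(nn.Module):
    """Fuses an RGB feature and a Depth feature (both N x C x H x W) after optical flow alignment."""

    def __init__(self, channels: int):
        super().__init__()
        self.reduce_rgb = nn.Conv2d(channels, 1, kernel_size=1)      # C -> 1, stride 1, padding 0
        self.reduce_depth = nn.Conv2d(channels, 1, kernel_size=1)
        self.flow_conv = nn.Conv2d(2, 4, kernel_size=3, padding=1)   # 4-layer fused optical flow map
        self.attn_conv = nn.Conv2d(4, 2, kernel_size=3, padding=1)   # spatial attention over both branches

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # Predict the fused optical flow map and split it per branch.
        flow = self.flow_conv(torch.cat((self.reduce_rgb(rgb), self.reduce_depth(depth)), dim=1))
        rgb_aligned = flow_warp(rgb, flow[:, :2])      # flow_warp: helper sketched earlier
        depth_aligned = flow_warp(depth, flow[:, 2:])
        # Channel-wise mean and max of each corrected feature -> four H x W x 1 maps.
        pooled = torch.cat(
            (rgb_aligned.mean(dim=1, keepdim=True), rgb_aligned.max(dim=1, keepdim=True).values,
             depth_aligned.mean(dim=1, keepdim=True), depth_aligned.max(dim=1, keepdim=True).values),
            dim=1)
        weights = torch.sigmoid(self.attn_conv(pooled))               # N x 2 x H x W
        fused = rgb_aligned * weights[:, :1] + depth_aligned * weights[:, 1:]
        return F.relu(fused)                                          # same size as the inputs
```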
In the detection method, the FA module uses the skip connection between the encoder and the decoder and predicts a flow field by an optical flow correction method, so that the low-resolution, high-semantic feature image flows towards the high-resolution, low-semantic feature image and the detail features of the crack are preserved as much as possible. The specific method is as follows: the channel counts of the low-resolution high-semantic feature image and the high-resolution low-semantic feature image are both compressed to 2 by a first convolution layer with kernel size 1 × 1, stride 1 and padding 0; the H/2 × W/2 × 2 low-resolution high-semantic feature image is bilinearly interpolated to the same size as the H × W × 2 high-resolution low-semantic feature image, and the two are then concatenated along the channel direction;
after concatenation, the result is processed by two fusion convolution layers, each with kernel size 3 × 3, stride 1 and padding 1, to obtain an H × W × 2 optical flow map and an H × W × 1 spatial attention weight matrix respectively; the low-semantic feature image is multiplied by the spatial attention weight matrix, the high-semantic image is corrected through the optical flow map, and the two are added to obtain the combined feature.
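A corresponding PyTorch sketch of the FA module follows. The channel projection that lets the warped high-semantic feature be added to the low-semantic feature, and the Sigmoid on the attention branch, are assumptions the patent does not spell out; flow_warp is the helper from the Background section.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FAModule(nn.Module):
    """Flow-aligns a low-resolution, high-semantic feature to a high-resolution, low-semantic one."""

    def __init__(self, high_ch: int, low_ch: int):
        super().__init__()
        self.reduce_high = nn.Conv2d(high_ch, 2, kernel_size=1)      # compress channels to 2
        self.reduce_low = nn.Conv2d(low_ch, 2, kernel_size=1)
        self.flow_conv = nn.Conv2d(4, 2, kernel_size=3, padding=1)   # H x W x 2 optical flow map
        self.attn_conv = nn.Conv2d(4, 1, kernel_size=3, padding=1)   # H x W x 1 attention weights
        self.project = nn.Conv2d(high_ch, low_ch, kernel_size=1)     # assumed channel projection

    def forward(self, high_sem: torch.Tensor, low_sem: torch.Tensor) -> torch.Tensor:
        size = low_sem.shape[2:]
        high_up = F.interpolate(self.reduce_high(high_sem), size=size,
                                mode="bilinear", align_corners=False)
        joint = torch.cat((high_up, self.reduce_low(low_sem)), dim=1)     # N x 4 x H x W
        flow = self.flow_conv(joint)
        attn = torch.sigmoid(self.attn_conv(joint))                       # Sigmoid assumed
        high_full = F.interpolate(self.project(high_sem), size=size,
                                  mode="bilinear", align_corners=False)
        # Correct the high-semantic feature through the flow map, re-weight the low-semantic one, add.
        return flow_warp(high_full, flow) + low_sem * attn
```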
The flow field predicted by the optical flow correction method makes the low-resolution, high-semantic feature image flow towards the high-resolution, low-semantic feature image. During prediction by the neural network, the feature sizes change as follows: an RGB image of size 512 × 512 × 3 and a Depth image of size 512 × 512 × 1 are input for detection; the first unit produces an image fusion feature of size 128 × 128 × 256, and adding this fusion feature to the original features gives new RGB and Depth features of size 128 × 128 × 256; the second unit produces a fusion feature of size 64 × 64 × 512, and adding it to the original features gives new RGB and Depth features of size 64 × 64 × 512; the third unit produces a fusion feature of size 32 × 32 × 1024, and adding it to the original features gives new RGB and Depth features of size 32 × 32 × 1024; the fourth unit produces a fusion feature of size 16 × 16 × 2048;
the 16 × 16 × 2048 fusion feature is then further processed by the feature pyramid to obtain a new fusion feature of size 16 × 16 × 2048; the new 16 × 16 × 2048 fusion feature and the 32 × 32 × 1024 fusion feature are input to an FA module for optical flow correction, giving an optical-flow-aligned fusion feature of size 32 × 32 × 1024; this feature and the 64 × 64 × 512 fusion feature are input to an FA module, giving a new optical-flow-aligned fusion feature of size 64 × 64 × 512; this feature and the 128 × 128 × 256 fusion feature are input to an FA module, giving a new optical-flow-aligned fusion feature of size 128 × 128 × 256; the 128 × 128 × 256 optical-flow-aligned fusion feature is enlarged to 512 × 512 × 256 by bilinear interpolation; and the 512 × 512 × 256 feature is passed through a convolution layer with kernel size 1 × 1 and stride 1 to obtain the final 512 × 512 × 1 predicted image, whose pixel semantics are divided into two classes: crack pixels and background pixels.
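Using the FAModule sketch above, the decoder chain with the sizes listed in this paragraph can be written as follows; channel counts follow the encoder units, and the 4x bilinear upsampling and final 1x1 convolution follow the text. This is a sketch under those assumptions, not the patented implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

# One FA module per decoder stage, matching the encoder channel counts.
fa3 = FAModule(high_ch=2048, low_ch=1024)   # 16x16x2048 + 32x32x1024 -> 32x32x1024
fa2 = FAModule(high_ch=1024, low_ch=512)    # -> 64x64x512
fa1 = FAModule(high_ch=512, low_ch=256)     # -> 128x128x256
head = nn.Conv2d(256, 1, kernel_size=1)     # final 1x1 convolution to the crack/background map


def decode(f1, f2, f3, f4):
    """f1..f4: fusion features of sizes 128x128x256, 64x64x512, 32x32x1024, 16x16x2048 (after PPM)."""
    x = fa3(f4, f3)
    x = fa2(x, f2)
    x = fa1(x, f1)
    x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)  # 128 -> 512
    return head(x)                           # N x 1 x 512 x 512 predicted image
```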
In step S0, a pre-prepared crack image data set is fed to the cross-modal feature neural network FAC-Net for training. The crack image data set comprises N crack RGB images and the N corresponding crack depth images and is divided into a test set and a training set in a 2:8 ratio; the training set is used to train FAC-Net, and the test set is used to monitor the accuracy with which the neural network model identifies cracks during training;
three indexes are set as benchmarks for evaluating the neural network model: precision, recall and F1 score. They are calculated as follows:
the parameter TP is the number of samples predicted positive that are actually positive;
the parameter FP is the number of samples predicted positive that are actually negative;
the parameter FN is the number of samples predicted negative that are actually positive;
the parameter TN is the number of samples predicted negative that are actually negative;
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 score = 2 × precision × recall / (precision + recall).
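For reference, a small sketch computing these indexes from a predicted binary crack mask and its label; pixel-level counting is assumed here for illustration.

```python
import torch


def precision_recall_f1(pred: torch.Tensor, target: torch.Tensor):
    """pred, target: binary masks where 1 marks a crack pixel and 0 marks a background pixel."""
    pred, target = pred.bool(), target.bool()
    tp = (pred & target).sum().item()        # predicted crack, actually crack
    fp = (pred & ~target).sum().item()       # predicted crack, actually background
    fn = (~pred & target).sum().item()       # predicted background, actually crack
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```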
In this scheme, FAA is the abbreviation of Flow Alignment Attention Module; the feature pyramid PPM is the abbreviation of Pyramid Pooling Module; and FA is the abbreviation of Flow Alignment Module.
Compared with the prior art, the invention has the following technical advantages:
1. cross-modal crack detection is achieved using RGB images and depth images as data.
Due to the different physical characteristics of a crack and a confusable background, the essential differences are that a crack is deep and that its depth changes rapidly and with large amplitude compared with the surrounding environment. The cross-modal detection method fuses the RGB image with the depth image; because the two modalities are complementary, combining the color semantic information of the RGB image with the depth semantic information of the depth image distinguishes a crack from a confusable background better than using the color semantic information of the RGB image alone.
2. It is proposed to align RGB information and depth information using an optical flow alignment method.
Because edge blurring, misalignment with the RGB picture and other objective effects may occur when the depth picture is captured, the image regions expressed by the depth picture and the RGB picture are not consistent. Simple additive fusion would therefore mislead the neural network when it extracts crack edge features. The FAA fusion module is designed around the characteristics of optical flow: it aligns the regions expressed by the depth picture and the RGB picture through optical flow alignment, so that better detail feature information is obtained when the two are fused.
The invention can mark and detect cracks in real time on the interior walls, floors and other surfaces of buildings such as factories, houses and commercial buildings, and is mainly applied to fields such as dangerous-house surveying and urban planning.
The invention realizes a real-time crack semantic segmentation algorithm based on object detection. On the basis of the YOLOv5 object detection algorithm, a real-time feature extraction module is added: when the size of a region of interest predicted by object detection is smaller than a threshold, the Otsu algorithm is used to detect the small region of interest quickly; when the size is larger than the threshold, the real-time semantic segmentation network FANet performs real-time slice-and-fuse detection on the large region of interest; and the features detected in all regions of interest are mapped back to their positions in the original image to form the final crack feature map, so that the specific position of a crack is detected while real-time performance and accuracy are guaranteed.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
FIG. 1 is a schematic diagram of the structure of a FAC-Net network;
FIG. 2 is a schematic structural diagram of a FAA fusion module;
FIG. 3 is a schematic diagram of a spatial attention module;
FIG. 4 is a schematic diagram of the FA module;
FIG. 5 is a schematic flow diagram of the present invention;
FIG. 6 is a pictorial illustration of a fracture picture dataset;
fig. 7 is a schematic diagram showing comparison between an image of an object to be detected and a detection result.
Detailed Description
As shown in the figures, the crack detection method using a cross-modal neural network based on optical flow alignment is built on the optical-flow-aligned cross-modal semantic segmentation neural network FAC-Net and performs crack detection across modalities by fusing RGB image and depth image feature information. It comprises the following steps;
step S0, constructing a data set, training a cross-modal characteristic neural network FAC-Net,
s1, acquiring an RGB image and a depth image of a target to be detected;
s2, obtaining a classification result image from the RGB image and the depth image of the same target to be detected through a pre-trained cross-modal characteristic neural network FAC-Net, wherein the classification result comprises crack semantic pixels and background semantic pixels;
in the step S0, the specific method includes:
step S01, shooting the concrete object by using a camera with a depth perception technology and an RGB sensor, and acquiring the original image data of the concrete object to form a data set;
step S02, classifying and marking the data of the data set into a crack pixel class and a background pixel class;
step S03, dividing the data set into training set and testing set according to the proportion;
step S04, training the FAC-Net neural network by using a data set;
and step S05, arranging the trained neural network model on back-end equipment for crack detection.
In step S01, the concrete objects include the concrete structure surfaces of roads, bridges and buildings;
in step S02, the data of the crack-pixel class and the background-pixel class can be expanded by data enhancement to enlarge the data set;
in step S03, the image data set includes RGB image data, depth image data and the corresponding label data.
In step S1, a camera equipped with depth-sensing technology and an RGB sensor is used to acquire the RGB image and the depth image of the target to be detected; the acquisition is not restricted to a particular shooting mode, and an image is considered acceptable as long as it is clear.
In step S2, the captured RGB image and depth image are transmitted to the back-end device over a network or a wired connection, and the back-end device inputs them into the trained FAC-Net neural network to obtain the corresponding concrete crack detection result.
As shown in fig. 1, the FAC-Net network has an encoder-decoder structure; the encoder has 2 branches and 1 fusion area, the branches being the RGB branch and the Depth branch;
after each branch unit of the FAC-Net network, the RGB feature and the Depth feature are input into an FAA fusion module to obtain a fusion feature, which is then added back to the original branch features; after 4 units, the highest-level fusion feature is input into the feature pyramid PPM to extract features further;
the decoder part uses FA modules to make the low-resolution, high-semantic feature image flow towards the high-resolution, low-semantic feature image, and the output image of the classification result is obtained after 3 such modules;
both branches of the encoder use the classical ResNet-50 network as the backbone;
in the encoder backbone network structure, the first unit consists of a convolution layer, a max pooling layer and a comprehensive convolution layer in sequence; the second unit is a three-layer comprehensive convolution layer; the third unit is likewise a three-layer comprehensive convolution layer.
In the encoder backbone network structure, the convolution layer of the first unit comprises 64 convolution kernels, the size of the convolution kernels is 7 x 7, the step length is 2, and the padding is 3; the pooling core size of the maximum pooling layer of the first unit is 3 × 3, and the step length is 2; the integrated convolutional layers of the first unit are of a three-layer structure, wherein the first convolutional layer comprises 64 convolution kernels, the size of the convolution kernels is 1 x 1, the second convolutional layer comprises 64 convolution kernels, the size of the convolution kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 256 convolution kernels, and the size of the convolution kernels is 1 x 1;
the integrated convolutional layers of the second unit are of a three-layer structure, the first convolutional layer comprises 128 convolution kernels, the size of the convolution kernels is 1 x 1, the second convolutional layer comprises 128 convolution kernels, the size of the convolution kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 512 convolution kernels, and the size of the convolution kernels is 1 x 1;
the comprehensive convolutional layer of the third unit is of a three-layer structure, the first convolutional layer comprises 256 convolutional kernels, the size of the convolutional kernels is 1 x 1, the second convolutional layer comprises 256 convolutional kernels, the size of the convolutional kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 1024 convolutional kernels, and the size of the convolutional kernels is 1 x 1;
the integrated convolutional layer of the fourth unit has a three-layer structure, the first convolutional layer comprises 512 convolution kernels, the size of the convolution kernels is 1 x 1, the second convolutional layer comprises 512 convolution kernels, the size of the convolution kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 2048 convolution kernels, and the size of the convolution kernels is 1 x 1.
As shown in fig. 2, in the detection method the information in the depth picture and the RGB picture is aligned using the optical flow characteristics of the FAA fusion module and then fused;
the FAA fusion module comprises an RGB branch and a Depth branch;
when the RGB branch operates, the RGB feature of the image is expressed as H × W × C, i.e. height × width × number of channels. The RGB feature first passes through the first convolution layer of the RGB branch, whose convolution kernel size is 1 × 1 with stride 1 and padding 0; after this layer the feature map size becomes H × W × 1;
when the Depth branch operates, the Depth feature of the image is likewise expressed as H × W × C. The Depth feature passes through the first convolution layer of the Depth branch, the two convolved features are concatenated along the channel direction into an H × W × 2 tensor, and a fusion convolution layer with kernel size 3 × 3, stride 1 and padding 1 then produces an H × W × 4 fused optical flow map;
two layers of the fused optical flow map form the RGB-branch optical flow map and the other two layers form the Depth-branch optical flow map. Then, as shown in FIG. 3, the fused optical flow map is split into two H × W × 2 optical flow maps, and the original H × W × C RGB feature and the original H × W × C Depth feature are first corrected through their corresponding optical flow maps. The corrected H × W × C RGB feature and H × W × C Depth feature are then input to the attention mechanism of a spatial attention module to further extract fusion features: the two features are average-pooled and max-pooled along the channel direction to generate four features of size H × W × 1, the four features are concatenated along the channel direction into one H × W × 4 feature, and this is passed through a convolution layer with kernel size 3 × 3, stride 1 and padding 1 followed by a Sigmoid activation to obtain an H × W × 2 spatial attention weight matrix;
finally, the H × W × 2 spatial attention weight matrix is split into two H × W × 1 spatial attention weight matrices; the original H × W × C RGB feature and the original H × W × C Depth feature are multiplied by their corresponding weight matrices, the two weighted features are added, and the sum is passed through a ReLU activation to obtain the H × W × C FAA fusion feature; the input RGB feature, the input Depth feature and the output FAA fusion feature all have the same size.
As shown in fig. 4, in the detection method the FA module uses the skip connection between the encoder and the decoder and predicts a flow field by an optical flow correction method, so that the low-resolution, high-semantic feature image flows towards the high-resolution, low-semantic feature image and the detail features of the crack are preserved as much as possible. The specific method is as follows: the channel counts of the low-resolution high-semantic feature image and the high-resolution low-semantic feature image are both compressed to 2 by a first convolution layer with kernel size 1 × 1, stride 1 and padding 0; the H/2 × W/2 × 2 low-resolution high-semantic feature image is bilinearly interpolated to the same size as the H × W × 2 high-resolution low-semantic feature image, and the two are then concatenated along the channel direction;
after concatenation, the result is processed by two fusion convolution layers, each with kernel size 3 × 3, stride 1 and padding 1, to obtain an H × W × 2 optical flow map and an H × W × 1 spatial attention weight matrix respectively; the low-semantic feature image is multiplied by the spatial attention weight matrix, the high-semantic image is corrected through the optical flow map, and the two are added to obtain the combined feature.
The flow field predicted by the optical flow correction method makes the low-resolution, high-semantic feature image flow towards the high-resolution, low-semantic feature image. During prediction by the neural network, the feature sizes change as follows: an RGB image of size 512 × 512 × 3 and a Depth image of size 512 × 512 × 1 are input for detection; the first unit produces an image fusion feature of size 128 × 128 × 256, and adding this fusion feature to the original features gives new RGB and Depth features of size 128 × 128 × 256; the second unit produces a fusion feature of size 64 × 64 × 512, and adding it to the original features gives new RGB and Depth features of size 64 × 64 × 512; the third unit produces a fusion feature of size 32 × 32 × 1024, and adding it to the original features gives new RGB and Depth features of size 32 × 32 × 1024; the fourth unit produces a fusion feature of size 16 × 16 × 2048;
the 16 × 16 × 2048 fusion feature is then further processed by the feature pyramid to obtain a new fusion feature of size 16 × 16 × 2048; the new 16 × 16 × 2048 fusion feature and the 32 × 32 × 1024 fusion feature are input to an FA module for optical flow correction, giving an optical-flow-aligned fusion feature of size 32 × 32 × 1024; this feature and the 64 × 64 × 512 fusion feature are input to an FA module, giving a new optical-flow-aligned fusion feature of size 64 × 64 × 512; this feature and the 128 × 128 × 256 fusion feature are input to an FA module, giving a new optical-flow-aligned fusion feature of size 128 × 128 × 256; the 128 × 128 × 256 optical-flow-aligned fusion feature is enlarged to 512 × 512 × 256 by bilinear interpolation; and the 512 × 512 × 256 feature is passed through a convolution layer with kernel size 1 × 1 and stride 1 to obtain the final 512 × 512 × 1 predicted image, whose pixel semantics are divided into two classes: crack pixels and background pixels.
In step S0, a pre-prepared crack image data set is fed to the cross-modal feature neural network FAC-Net for training. The crack image data set comprises N crack RGB images and the N corresponding crack depth images and is divided into a test set and a training set in a 2:8 ratio; the training set is used to train FAC-Net, and the test set is used to monitor the accuracy with which the neural network model identifies cracks during training;
three indexes are set as benchmarks for evaluating the neural network model: precision, recall and F1 score. They are calculated as follows:
the parameter TP is the number of samples predicted positive that are actually positive;
the parameter FP is the number of samples predicted positive that are actually negative;
the parameter FN is the number of samples predicted negative that are actually positive;
the parameter TN is the number of samples predicted negative that are actually negative;
precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 score = 2 × precision × recall / (precision + recall).
The following table shows the comparison result of the crack detection index of the model FAC-Net and the classical semantic segmentation network U-Net under different environmental backgrounds:
the data in the table show that the cross-mode neural network designed by the invention can better identify the characteristic difference between the crack and the confusable background under the condition of different environment backgrounds by using the cross-mode semantic segmentation neural network FAC-Net based on the alignment of the optical flow, thereby achieving a better crack detection effect and having stronger robustness.
In the middle of both the left-hand and the right-hand parts of fig. 7 there is a pixel region identified as a crack. It can be seen that FAC-Net can still exclude various confusable information and accurately detect the position of the crack even when the background colors are relatively complex.
In the above, FAA is the abbreviation of Flow Alignment Attention Module; the feature pyramid PPM is the abbreviation of Pyramid Pooling Module; and FA is the abbreviation of Flow Alignment Module.
Claims (10)
1. A crack detection method of a cross-modal neural network based on optical flow alignment, characterized in that: the method is based on the optical-flow-aligned cross-modal semantic segmentation neural network FAC-Net, performs crack detection across modalities by fusing RGB image and depth image feature information, and comprises the following steps;
step S0, constructing a data set, training a cross-modal characteristic neural network FAC-Net,
s1, acquiring an RGB image and a depth image of a target to be detected;
s2, obtaining a classification result image from the RGB image and the depth image of the same target to be detected through a pre-trained cross-modal characteristic neural network FAC-Net, wherein the classification result comprises crack semantic pixels and background semantic pixels;
in the step S0, the specific method includes:
step S01, shooting the concrete object by using a camera with a depth perception technology and an RGB sensor, and acquiring the original image data of the concrete object to form a data set;
step S02, classifying and marking the data of the data set into a crack pixel class and a background pixel class;
step S03, dividing the data set into training set and testing set according to the proportion;
step S04, training the FAC-Net neural network by using the data set;
and step S05, arranging the trained neural network model on back-end equipment for crack detection.
2. The method of claim 1, wherein the method comprises: in step S01, the concrete objects include the concrete structure surfaces of roads, bridges and buildings;
in step S02, the types of the crack pixels and the background pixels can be expanded into a data set by data enhancement;
in step S03, the image data set includes RGB image data, depth image data, and tag data corresponding thereto.
3. The method of claim 1, wherein the method comprises: in step S1, a camera having a depth sensing technology and an RGB sensor is used to acquire an RGB image and a depth image of the target to be detected, and the image acquisition is not limited to a shooting mode, so that the obtained image is clear and can be regarded as a qualified standard.
4. The method of claim 3, wherein the method comprises: in step S2, the captured RGB image and depth image are transmitted to the back-end device through a network or a wired manner, and the back-end device inputs the RGB image and depth image into the trained FAC-Net neural network to obtain a corresponding concrete crack detection result.
5. The method of claim 1, wherein the method comprises: the FAC-Net network structure is an encoder-decoder structure; the encoder is provided with 2 branches and 1 fusion area, wherein the branches are RGB branches and Depth branches respectively;
after each branch unit of the FAC-Net network, inputting the RGB characteristics and the Depth characteristics into an FAA fusion module to obtain fusion characteristics, and then adding the fusion characteristics to the original branch characteristics; after 4 units, inputting the highest level fusion features into a feature pyramid PPM to further extract features;
the decoder part uses an FA module and is used for enabling the low-resolution high-semantic feature image to flow to the high-resolution low-semantic feature image, and an output image of a classification result is obtained through 3 modules;
all branches of the encoder use a classical resnet50 network as a backbone network;
in the encoder backbone network structure, the first unit consists of a convolution layer, a max pooling layer and a comprehensive convolution layer in sequence; the second unit is a three-layer comprehensive convolution layer; the third unit is likewise a three-layer comprehensive convolution layer.
6. The method of claim 5, wherein the method comprises: in the encoder backbone network structure, the convolution layer of the first unit comprises 64 convolution kernels, the size of the convolution kernels is 7 x 7, the step length is 2, and the padding is 3; the pooling core size of the maximum pooling layer of the first unit is 3 × 3, and the step length is 2; the integrated convolutional layers of the first unit are of a three-layer structure, wherein the first convolutional layer comprises 64 convolution kernels, the size of the convolution kernels is 1 x 1, the second convolutional layer comprises 64 convolution kernels, the size of the convolution kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 256 convolution kernels, and the size of the convolution kernels is 1 x 1;
the integrated convolutional layers of the second unit are of a three-layer structure, the first convolutional layer comprises 128 convolution kernels, the size of the convolution kernels is 1 x 1, the second convolutional layer comprises 128 convolution kernels, the size of the convolution kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 512 convolution kernels, and the size of the convolution kernels is 1 x 1;
the comprehensive convolutional layer of the third unit is of a three-layer structure, the first convolutional layer comprises 256 convolutional kernels, the size of the convolutional kernels is 1 x 1, the second convolutional layer comprises 256 convolutional kernels, the size of the convolutional kernels is 3 x 3, the padding is 1, the third convolutional layer comprises 1024 convolutional kernels, and the size of the convolutional kernels is 1 x 1;
the integrated convolutional layer of the fourth unit has a three-layer structure, the first convolutional layer comprises 512 convolutional kernels, the size of the convolutional kernels is 1 × 1, the second convolutional layer comprises 512 convolutional kernels, the size of the convolutional kernels is 3 × 3, the padding is 1, the third convolutional layer comprises 2048 convolutional kernels, and the size of the convolutional kernels is 1 × 1.
7. The method of claim 5, wherein the method comprises: in the detection method, the information in the depth picture and the RGB picture is aligned using the optical flow characteristics of the FAA fusion module and then fused;
the FAA fusion module comprises an RGB branch and a Depth branch;
when the RGB branch operates, the RGB feature of the image is expressed as H × W × C, i.e. height × width × number of channels. The RGB feature first passes through the first convolution layer of the RGB branch, whose convolution kernel size is 1 × 1 with stride 1 and padding 0; after this layer the feature map size becomes H × W × 1;
when the Depth branch operates, the Depth feature of the image is likewise expressed as H × W × C. The Depth feature passes through the first convolution layer of the Depth branch, the two convolved features are concatenated along the channel direction into an H × W × 2 tensor, and a fusion convolution layer with kernel size 3 × 3, stride 1 and padding 1 then produces an H × W × 4 fused optical flow map;
two layers of the fused optical flow map form the RGB-branch optical flow map and the other two layers form the Depth-branch optical flow map. The fused optical flow map is split into two H × W × 2 optical flow maps, and the original H × W × C RGB feature and the original H × W × C Depth feature are first corrected through their corresponding optical flow maps. The corrected H × W × C RGB feature and H × W × C Depth feature are then input to the attention mechanism of a spatial attention module to further extract fusion features: the two features are average-pooled and max-pooled along the channel direction to generate four features of size H × W × 1, the four features are concatenated along the channel direction into one H × W × 4 feature, and this is passed through a convolution layer with kernel size 3 × 3, stride 1 and padding 1 followed by a Sigmoid activation to obtain an H × W × 2 spatial attention weight matrix;
finally, the H × W × 2 spatial attention weight matrix is split into two H × W × 1 spatial attention weight matrices; the original H × W × C RGB feature and the original H × W × C Depth feature are multiplied by their corresponding weight matrices, the two weighted features are added, and the sum is passed through a ReLU activation to obtain the H × W × C FAA fusion feature; the input RGB feature, the input Depth feature and the output FAA fusion feature all have the same size.
8. The method of claim 7, wherein the method comprises: in the detection method, the FA module uses the skip connection between the encoder and the decoder and predicts a flow field by an optical flow correction method, so that the low-resolution, high-semantic feature image flows towards the high-resolution, low-semantic feature image and the detail features of the crack are preserved as much as possible. The specific method is as follows: the channel counts of the low-resolution high-semantic feature image and the high-resolution low-semantic feature image are both compressed to 2 by a first convolution layer with kernel size 1 × 1, stride 1 and padding 0; the H/2 × W/2 × 2 low-resolution high-semantic feature image is bilinearly interpolated to the same size as the H × W × 2 high-resolution low-semantic feature image, and the two are then concatenated along the channel direction;
after concatenation, the result is processed by two fusion convolution layers, each with kernel size 3 × 3, stride 1 and padding 1, to obtain an H × W × 2 optical flow map and an H × W × 1 spatial attention weight matrix respectively; the low-semantic feature image is multiplied by the spatial attention weight matrix, the high-semantic image is corrected through the optical flow map, and the two are added to obtain the combined feature.
9. The method of claim 8, wherein: the flow field predicted by the optical-flow-correction method makes the low-resolution high-semantic feature image flow toward the high-resolution low-semantic feature image, and during prediction by the neural network the feature sizes change as follows: an RGB image of size 512 × 512 × 3 and a Depth image of size 512 × 512 × 1 are input for detection; the first unit produces an image fusion feature of size 128 × 128 × 256, and adding this fusion feature to the original features gives new RGB and Depth features of size 128 × 128 × 256; the second unit produces a fusion feature of size 64 × 64 × 512, and adding it to the original features gives new RGB and Depth features of size 64 × 64 × 512; the third unit produces a fusion feature of size 32 × 32 × 1024, and adding it to the original features gives new RGB and Depth features of size 32 × 32 × 1024; the fourth unit produces a fusion feature of size 16 × 16 × 2048;
the 16 × 16 × 2048 fusion feature is then further processed by a feature pyramid to obtain a new fusion feature of size 16 × 16 × 2048; the new 16 × 16 × 2048 fusion feature and the 32 × 32 × 1024 fusion feature are input into an FA module for optical flow correction to obtain an optical-flow-aligned fusion feature of size 32 × 32 × 1024; the 32 × 32 × 1024 optical-flow-aligned fusion feature and the 64 × 64 × 512 fusion feature are input into an FA module to obtain a new optical-flow-aligned fusion feature of size 64 × 64 × 512; the 64 × 64 × 512 optical-flow-aligned fusion feature and the 128 × 128 × 256 fusion feature are input into an FA module to obtain a new optical-flow-aligned fusion feature of size 128 × 128 × 256; the 128 × 128 × 256 optical-flow-aligned fusion feature is enlarged to 512 × 512 × 256 by bilinear interpolation; and the 512 × 512 × 256 optical-flow-aligned fusion feature is passed through a convolution layer with a 1 × 1 convolution kernel and stride 1 to obtain the final 512 × 512 × 1 predicted image, in which each pixel is classified as either a crack pixel or a background pixel.
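To make the last two operations of this pipeline concrete, here is a minimal PyTorch-style sketch of the prediction head. The 512 × 512 target size and the 1 × 1, stride-1 convolution are from the claim; the N×C×H×W layout, the random placeholder input and align_corners=False are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Prediction head: the 128 x 128 x 256 flow-aligned fusion feature is enlarged to
# 512 x 512 by bilinear interpolation and reduced to one channel by a 1 x 1 convolution.
head = nn.Conv2d(256, 1, kernel_size=1, stride=1)

fused = torch.randn(1, 256, 128, 128)      # placeholder for the 128 x 128 x 256 feature
upsampled = F.interpolate(fused, size=(512, 512), mode='bilinear', align_corners=False)
logits = head(upsampled)                    # 512 x 512 x 1 predicted image
print(logits.shape)                         # torch.Size([1, 1, 512, 512])
```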
10. The method of claim 1, wherein: in step S0, a pre-prepared crack image dataset is sent to the cross-modal neural network FAC-Net for training; the crack image dataset comprises N crack RGB images and N corresponding crack depth images and is divided into a test set and a training set in a 2:8 ratio, the training set being used to train FAC-Net and the test set being used to observe, during training, how accurately the neural network model identifies cracks;
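A minimal sketch of the 2:8 split described above (the random shuffle, the seed, and the per-sample tuple layout are assumptions; the claim only fixes the proportion):

```python
import random

def split_dataset(pairs, test_ratio=0.2, seed=0):
    """Split (rgb_path, depth_path, label_path) samples into a test set and a
    training set in the 2:8 proportion used for FAC-Net."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_test = int(len(pairs) * test_ratio)
    return pairs[:n_test], pairs[n_test:]   # (test set, training set)
```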
three indices are set as benchmarks for observing the neural network model, namely precision, recall and F1 score; the indices are calculated as follows:
the parameter TP is the number of samples predicted as true that are actually true (true positives);
the parameter FP is the number of samples predicted as true that are actually false (false positives);
the parameter FN is the number of samples predicted as false that are actually true (false negatives);
the parameter TN is the number of samples predicted as false that are actually false (true negatives);
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 score = 2 × Precision × Recall / (Precision + Recall).
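A short sketch of these three indices computed from pixel counts (the function name and the example numbers are purely illustrative):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute the three benchmark indices from TP, FP and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: 900 true positives, 100 false positives, 50 false negatives.
print(precision_recall_f1(900, 100, 50))   # (0.9, 0.947..., 0.923...)
```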
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210643687.5A CN114973246A (en) | 2022-06-09 | 2022-06-09 | Crack detection method of cross mode neural network based on optical flow alignment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210643687.5A CN114973246A (en) | 2022-06-09 | 2022-06-09 | Crack detection method of cross mode neural network based on optical flow alignment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114973246A true CN114973246A (en) | 2022-08-30 |
Family
ID=82961064
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210643687.5A Pending CN114973246A (en) | 2022-06-09 | 2022-06-09 | Crack detection method of cross mode neural network based on optical flow alignment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114973246A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116823819A (en) * | 2023-08-28 | 2023-09-29 | 常熟理工学院 | Weld surface defect detection method, system, electronic equipment and storage medium |
CN116823819B (en) * | 2023-08-28 | 2023-11-07 | 常熟理工学院 | Weld surface defect detection method, system, electronic equipment and storage medium |
Similar Documents
Publication | Title
---|---
CN112434796B (en) | Cross-modal pedestrian re-identification method based on local information learning
CN110246141B (en) | Vehicle image segmentation method based on joint corner pooling under complex traffic scene
WO2021056630A1 (en) | Defect detection method and device for transmission line tower structure
CN110378232B (en) | Improved test room examinee position rapid detection method of SSD dual-network
Pathak et al. | An object detection approach for detecting damages in heritage sites using 3-D point clouds and 2-D visual data
CN114743119B (en) | High-speed rail contact net hanger nut defect detection method based on unmanned aerial vehicle
CN111611861B (en) | Image change detection method based on multi-scale feature association
CN114049356B (en) | Method, device and system for detecting structure apparent crack
CN110705566B (en) | Multi-mode fusion significance detection method based on spatial pyramid pool
CN110222604A (en) | Target identification method and device based on shared convolutional neural networks
CN111666852A (en) | Micro-expression double-flow network identification method based on convolutional neural network
CN114972177A (en) | Road disease identification management method and device and intelligent terminal
CN114022761A (en) | Detection and positioning method and device for power transmission line tower based on satellite remote sensing image
CN115661505A (en) | Semantic perception image shadow detection method
CN114662605A (en) | Flame detection method based on improved YOLOv5 model
Bai et al. | Deep cascaded neural networks for automatic detection of structural damage and cracks from images
CN114973246A (en) | Crack detection method of cross mode neural network based on optical flow alignment
CN116468769A (en) | Depth information estimation method based on image
CN115564031A (en) | Detection network for glass defect detection
CN114332739A (en) | Smoke detection method based on moving target detection and deep learning technology
CN113627427A (en) | Instrument and meter reading method and system based on image detection technology
CN117392495A (en) | Video flame detection method and system based on feature fusion
CN116721300A (en) | Prefabricated part apparent disease target detection method based on improved YOLOv3
CN107463968A (en) | Smog judges the detection method of the production method of code book, generation system and smog
Galarreta | Urban Structural Damage Assessment Using Object-Oriented Analysis and Semantic Reasoning
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination