CN117746136A - Pavement disease detection method, neural network training method, device and equipment - Google Patents
- Publication number
- CN117746136A (application number CN202311767749.4A)
- Authority
- CN
- China
- Prior art keywords
- feature
- network
- prediction
- sub
- image
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The disclosure provides a pavement disease detection method, a neural network training method, and corresponding devices and equipment, relating to the field of artificial intelligence and in particular to computer vision, deep learning, and related technical fields. The neural network comprises a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network. The training method comprises the following steps: acquiring a sample image and a real label of the pavement disease in the sample image; inputting the sample image into the feature encoding sub-network to obtain sample image features; inputting the sample image features into the boundary prediction sub-network to obtain a first prediction result, which represents the boundary of the pavement disease in the sample image; inputting the sample image features into the segmentation prediction sub-network to obtain a second prediction result, which represents an instance segmentation region of the pavement disease in the sample image; and adjusting parameters of the neural network based on the first prediction result, the second prediction result, and the real label.
Description
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, deep learning and the like, and in particular relates to a training method of a neural network for detecting road surface diseases, a road surface disease detection method, a training device of the neural network for detecting road surface diseases, a road surface disease detection device, electronic equipment, a computer readable storage medium and a computer program product.
Background
Artificial intelligence is the discipline that studies how to make a computer mimic certain human mental processes and intelligent behaviors (e.g., learning, reasoning, thinking, and planning); it spans both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Road surface defects include cracks, potholes, crazing (mesh cracking), surface height differences between road sections, and the like; they reduce the load-carrying capacity of the road and increase the risk to vehicle travel. Detecting pavement diseases allows such problems to be found and repaired in time, ensuring the safety and reliability of the road.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, the problems mentioned in this section should not be considered as having been recognized in any prior art unless otherwise indicated.
Disclosure of Invention
The present disclosure provides a training method of a neural network for road surface failure detection, a road surface failure detection method, a training apparatus of a neural network for road surface failure detection, a road surface failure detection apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a training method of a neural network for road surface disease detection, the neural network including a feature coding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network, the training method including: acquiring a sample image and a real label of pavement diseases in the sample image; inputting the sample image into a feature coding sub-network to obtain sample image features; inputting the characteristics of the sample image into a boundary prediction sub-network to obtain a first prediction result, wherein the first prediction result represents the boundary of the pavement diseases in the sample image; inputting the characteristics of the sample image into a segmentation prediction sub-network to obtain a second prediction result, wherein the second prediction result represents an example segmentation area of the pavement diseases in the sample image; and adjusting parameters of the neural network based on the first prediction result, the second prediction result and the real label.
According to another aspect of the present disclosure, there is provided a pavement disease detection method using a neural network for pavement disease detection that includes a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network. The method includes: acquiring an image to be detected; inputting the image to be detected into the feature encoding sub-network (the sub-network into which sample images were input during training) to obtain features of the image to be detected; inputting the features of the image to be detected into the boundary prediction sub-network to obtain a boundary prediction result, which represents the boundary of the pavement disease in the image to be detected; inputting the features of the image to be detected into the segmentation prediction sub-network to obtain a segmentation prediction result, which represents an instance segmentation region of the pavement disease in the image to be detected; and obtaining a pavement disease detection result for the image to be detected based on the boundary prediction result and the segmentation prediction result.
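The patent does not specify how the boundary and segmentation predictions are combined into the final detection result. As a hedged illustration only (the function name, threshold, and fusion rule are all assumptions, not the patented method), the sketch below simply thresholds each head's probability map and returns the segmentation mask together with the predicted contour:

```python
import numpy as np

def detect(boundary_prob: np.ndarray, segment_prob: np.ndarray, thr: float = 0.5) -> dict:
    """Toy fusion of the two head outputs into a detection result.

    The disclosure leaves the fusion step unspecified; here we threshold
    the segmentation map to get the defect mask and threshold the
    boundary map to get its contour. Both inputs are per-pixel
    probabilities in [0, 1].
    """
    mask = (segment_prob >= thr).astype(np.uint8)      # instance segmentation region
    contour = (boundary_prob >= thr).astype(np.uint8)  # semantic boundary
    return {"mask": mask, "contour": contour}

# Tiny 2x2 example: three pixels pass the segmentation threshold, two the boundary one.
seg = np.array([[0.9, 0.8], [0.2, 0.7]])
bnd = np.array([[0.9, 0.6], [0.1, 0.3]])
result = detect(bnd, seg)
```

In a real deployment the contour could instead be used to refine or validate the mask edges; the dictionary return is purely illustrative.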
According to another aspect of the present disclosure, there is provided a training apparatus for a neural network for road surface fault detection, the neural network including a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network, the apparatus comprising: a first acquisition unit configured to acquire a sample image and a real label of a road surface disease in the sample image; a first feature encoding unit configured to input a sample image into a feature encoding sub-network to obtain a sample image feature; the first boundary prediction unit is configured to input the characteristics of the sample image into the boundary prediction sub-network to obtain a first prediction result, and the first prediction result represents the boundary of the pavement defect in the sample image; the first segmentation prediction unit is configured to input the characteristics of the sample image into a segmentation prediction sub-network to obtain a second prediction result, and the second prediction result represents an example segmentation area of the pavement disease in the sample image; and a parameter tuning unit configured to adjust parameters of the neural network based on the first prediction result, the second prediction result, and the real label.
According to another aspect of the present disclosure, there is provided a pavement disease detection device using a neural network for pavement disease detection that includes a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network. The device includes: a second acquisition unit configured to acquire an image to be detected; a second feature encoding unit configured to input the image to be detected into the feature encoding sub-network (the sub-network into which sample images were input during training) to obtain features of the image to be detected; a second boundary prediction unit configured to input the features of the image to be detected into the boundary prediction sub-network to obtain a boundary prediction result, which represents the boundary of the pavement disease in the image to be detected; a second segmentation prediction unit configured to input the features of the image to be detected into the segmentation prediction sub-network to obtain a segmentation prediction result, which represents an instance segmentation region of the pavement disease in the image to be detected; and a detection unit configured to obtain a pavement disease detection result for the image to be detected based on the boundary prediction result and the segmentation prediction result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the above-described method.
According to one or more embodiments of the present disclosure, by extracting sample image features of a sample image, obtaining a first prediction result representing a boundary of a road surface disease based on the sample image features by using a boundary prediction sub-network, and obtaining a second prediction result representing an example segmentation area of the road surface disease based on the sample image features by using a segmentation prediction sub-network, parameters of a neural network are adjusted by using the two prediction results and a real label of the road surface disease, so that the neural network can be supervised from two angles of boundary and example segmentation, and detection capability of the trained neural network is improved.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for exemplary purposes only and do not limit the scope of the claims. Throughout the drawings, identical reference numerals designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, in accordance with an embodiment of the present disclosure;
FIG. 2 illustrates a flowchart of a training method of a neural network for road surface fault detection, according to an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a neural network for road surface fault detection according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a flowchart of a process of obtaining a first prediction result according to an exemplary embodiment of the present disclosure;
FIG. 5 illustrates a flowchart of a process of obtaining a second prediction result according to an exemplary embodiment of the present disclosure;
FIG. 6 illustrates a flowchart of a process of acquiring decoded features of a plurality of feature decoding block outputs according to an exemplary embodiment of the present disclosure;
FIG. 7 illustrates a flowchart of a training method for a neural network for road surface fault detection, according to an exemplary embodiment of the present disclosure;
Fig. 8 illustrates a flowchart of a process of acquiring depth information features output by a plurality of depth information decoding blocks according to an exemplary embodiment of the present disclosure;
FIG. 9 illustrates a schematic diagram of a neural network for road surface fault detection according to an exemplary embodiment of the present disclosure;
FIG. 10 illustrates a flowchart of a process of adjusting parameters of a neural network, according to an exemplary embodiment of the present disclosure;
fig. 11 shows a flowchart of a pavement disease detection method according to an exemplary embodiment of the present disclosure;
FIG. 12 illustrates a block diagram of a training apparatus of a neural network, according to an exemplary embodiment of the present disclosure;
fig. 13 shows a block diagram of a road surface fault detection device according to an exemplary embodiment of the present disclosure; and
fig. 14 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, the use of the terms "first," "second," and the like to describe various elements is not intended to limit the positional relationship, timing relationship, or importance relationship of the elements, unless otherwise indicated, and such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, they may also refer to different instances based on the description of the context.
The terminology used in the description of the various illustrated examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, the elements may be one or more if the number of the elements is not specifically limited. Furthermore, the term "and/or" as used in this disclosure encompasses any and all possible combinations of the listed items.
In the related art, existing pavement disease detection methods suffer from poor detection performance.
To address this problem, the present disclosure extracts sample image features from a sample image, uses the boundary prediction sub-network to obtain, from those features, a first prediction result representing the boundary of the pavement disease, and uses the segmentation prediction sub-network to obtain, from the same features, a second prediction result representing the instance segmentation region of the pavement disease. The parameters of the neural network are then adjusted using the two prediction results together with the real label of the pavement disease, so that the network is supervised from the two angles of boundary and instance segmentation, improving the detection capability of the trained neural network.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented, in accordance with an embodiment of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable execution of the methods of the present disclosure.
In some embodiments, server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example offered to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof that are executable by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may in turn utilize one or more client applications to interact with server 120 to utilize the services provided by these components. It should be appreciated that a variety of different system configurations are possible, which may differ from system 100. Accordingly, FIG. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may use client devices 101, 102, 103, 104, 105, and/or 106 for human-machine interaction. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that the present disclosure may support any number of client devices.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and the like. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, windows Phone, android. Portable handheld devices may include cellular telephones, smart phones, tablet computers, personal Digital Assistants (PDAs), and the like. Wearable devices may include head mounted displays (such as smart glasses) and other devices. The gaming system may include various handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a number of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. For example only, the one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, WIFI), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture that involves virtualization (e.g., one or more flexible pools of logical storage devices that may be virtualized to maintain virtual storage devices of the server). In various embodiments, server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above as well as any commercially available server operating systems. Server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, etc.
In some implementations, server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some implementations, the server 120 may be a server of a distributed system or a server that incorporates a blockchain. The server 120 may also be a cloud server, or an intelligent cloud computing server or intelligent cloud host employing artificial intelligence technology. A cloud server is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical host and virtual private server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of databases 130 may be used to store information such as audio files and video files. Database 130 may reside in various locations. For example, the data store used by the server 120 may be local to the server 120, or may be remote from the server 120 and communicate with the server 120 via a network-based or dedicated connection. Database 130 may be of different types. In some embodiments, the database used by server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some embodiments, one or more of databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key value stores, object stores, or conventional stores supported by the file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
According to an aspect of the present disclosure, a training method of a neural network for road surface fault detection is provided. The neural network includes a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network. Fig. 2 shows a flowchart of a training method 200 of a neural network for road surface fault detection according to an exemplary embodiment of the present disclosure. The method 200 comprises the following steps: step S201, obtaining a sample image and a real label of pavement diseases in the sample image; step S202, inputting the sample image into a feature coding sub-network to obtain sample image features; step S203, inputting the characteristics of the sample image into a boundary prediction sub-network to obtain a first prediction result, wherein the first prediction result represents the boundary of the pavement defect in the sample image; step S204, inputting the characteristics of the sample image into a segmentation prediction sub-network to obtain a second prediction result, wherein the second prediction result represents an example segmentation area of the pavement diseases in the sample image; and step S205, adjusting parameters of the neural network based on the first prediction result, the second prediction result and the real label.
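Steps S201-S205 above can be sketched in a few lines of numpy. This is a minimal illustration under stated assumptions, not the patented network: the random linear projections standing in for the three sub-networks, the random labels, and the plain per-pixel binary cross-entropy losses are all placeholders chosen only to show the two-head supervision structure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(pred, target, eps=1e-7):
    # pixel-wise binary cross-entropy, averaged over the map
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean())

rng = np.random.default_rng(0)

# S201/S202: sample image -> sample image features
# (a random linear projection stands in for the feature encoding sub-network)
image = rng.random((16, 16))
w_enc = rng.standard_normal((16, 16)) * 0.1
features = image @ w_enc

# S203/S204: the two prediction heads share the same features
w_boundary = rng.standard_normal((16, 16)) * 0.1  # stand-in for boundary prediction sub-network
w_segment = rng.standard_normal((16, 16)) * 0.1   # stand-in for segmentation prediction sub-network
boundary_pred = sigmoid(features @ w_boundary)    # first prediction result (boundary)
segment_pred = sigmoid(features @ w_segment)      # second prediction result (instance region)

# S205: supervise from both angles against the real labels;
# a real implementation would backpropagate this joint loss
boundary_gt = rng.integers(0, 2, (16, 16))
segment_gt = rng.integers(0, 2, (16, 16))
loss = bce(boundary_pred, boundary_gt) + bce(segment_pred, segment_gt)
```

The key structural point is that both losses are computed from the *same* encoded features, which is what lets the boundary supervision constrain the segmentation head.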
In this way, by extracting sample image features from the sample image, obtaining with the boundary prediction sub-network a first prediction result representing the boundary of the pavement disease, obtaining with the segmentation prediction sub-network a second prediction result representing the instance segmentation region of the pavement disease, and adjusting the parameters of the neural network using the two prediction results and the real label of the pavement disease, the neural network can be supervised from the two angles of boundary and instance segmentation, improving the detection capability of the trained neural network.
In addition, this approach avoids the under-segmentation that can occur when instance segmentation alone is applied to the image; introducing the constraint of boundary information improves the accuracy of the instance segmentation.
In some embodiments, the technique adopted for pavement disease detection in the present disclosure may be instance segmentation. Instance segmentation is similar to object detection in that it segments only foreground objects while distinguishing different instances of the same class; however, whereas an object detection task typically outputs a bounding box, instance segmentation outputs the contour of the object. Instance segmentation can therefore be seen as a combination of object detection and semantic segmentation.
Instance segmentation methods fall into two main categories: top-down and bottom-up. The former first locates a region of interest and then segments within it; the latter first classifies at the pixel level and then instantiates groups of pixels as targets. However, applying either approach directly to pavement disease detection faces a series of problems: 1) pothole targets are small and take on different forms at different viewing angles; 2) cracks vary in length and extend in all directions; mesh (alligator) cracks have a complex appearance, interwoven from many fine cracks into a grid-like pattern, and can cross, intersect, and branch into a complex network structure that existing schemes cannot segment well; 3) segmentation networks generally adopt an encoder-decoder structure, which cannot meet real-time requirements on terminal devices. The neural network obtained by the training method of the present disclosure can address these three problems and achieve accurate detection of key diseases.
Fig. 3 shows a schematic diagram of a neural network for road surface disease detection according to an exemplary embodiment of the present disclosure. As shown in fig. 3, the neural network 300 includes three sub-networks: a feature encoding sub-network 310, a boundary prediction sub-network 320 and a segmentation prediction sub-network 330. The main function of the feature encoding sub-network 310 is to perform feature extraction on the input image 302 and provide semantic information for subsequent tasks; the main function of the boundary prediction sub-network 320 is to use semantic boundaries to explicitly supervise the neural network, guiding it to learn accurate semantic boundaries and giving it the capability of semantic boundary recognition and stronger inter-class discrimination; the function of the segmentation prediction sub-network 330 is to decode the boundary-supervised features to obtain more accurate instance segmentation results.
In some embodiments, the sample image acquired in step S201 may be, for example, an image containing a road surface defect. The real label (ground truth) of the road surface defect in the sample image may include a real semantic boundary and a real instance segmentation region of the defect. In one exemplary embodiment, the real semantic boundary may be obtained by applying a Canny operator to the binarized real instance segmentation region to extract its boundary.
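The step of deriving a boundary label from a binarized segmentation mask can be illustrated with a minimal NumPy sketch. Note this uses a simple morphological rule (mask minus its erosion) as a stand-in for the Canny operator named above; the patent's actual extraction may differ.

```python
import numpy as np

def mask_boundary(mask: np.ndarray) -> np.ndarray:
    """Extract the one-pixel boundary of a binarized instance mask.

    A pixel is on the boundary if it belongs to the mask and at least
    one of its 4-neighbours does not. This morphological rule is a
    simple stand-in for the Canny step described in the text.
    """
    m = (mask > 0).astype(np.uint8)
    padded = np.pad(m, 1, mode="constant")  # zero border so image edges count
    # 4-neighbour erosion: a pixel survives only if all neighbours are foreground
    eroded = np.minimum.reduce([
        padded[:-2, 1:-1],  # up
        padded[2:, 1:-1],   # down
        padded[1:-1, :-2],  # left
        padded[1:-1, 2:],   # right
        m,
    ])
    return m - eroded  # foreground pixels removed by erosion = boundary

# A 4x4 solid square in a 6x6 image: only its interior 2x2 block is
# not boundary, so 12 boundary pixels remain.
mask = np.zeros((6, 6), np.uint8)
mask[1:5, 1:5] = 1
b = mask_boundary(mask)
```

In practice a library edge detector (e.g. OpenCV's Canny) would replace this helper; the point is only that the boundary label is computed from the segmentation label, not annotated separately.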
In some embodiments, the feature encoding sub-network used in step S202 may use an existing image encoding backbone network (e.g., ResNet), for example a structure in which a plurality of residual blocks are connected in series, and may fuse features following the idea of feature pyramids, so that multi-scale feature information is combined and the resulting sample image features carry richer semantic and spatial position information. Different layers of the backbone network play different roles in recognition. Lower layers have more accurate spatial position information, but their receptive fields are small and their semantic information is weak; higher layers have large receptive fields and strong semantic information, but lose accurate spatial position information. Fusing them therefore combines the advantages of both. A high-resolution feature map can also be obtained through the feature pyramid, which is convenient for the boundary prediction task and the image segmentation task.
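The feature-pyramid fusion described above can be sketched in PyTorch as follows. This is a hedged illustration of the standard FPN-style top-down pathway; the layer count and channel widths are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """FPN-style fusion: 1x1 lateral convolutions align channel counts,
    then each higher-level map is upsampled and added into the level below."""

    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=128):
        super().__init__()
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, feats):
        # feats: low -> high semantic level, spatial size halving each step
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return laterals  # multi-scale maps, all with out_channels channels

feats = [torch.randn(1, c, 64 // 2**i, 64 // 2**i)
         for i, c in enumerate((64, 128, 256, 512))]
fused = TopDownFusion()(feats)
```

After fusion, every level carries both the spatial detail of the low layers and semantics propagated down from the high layers, and the highest-resolution map is available for the boundary and segmentation tasks.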
It will be appreciated that the feature encoding subnetwork may also be constructed in other ways, not limited herein.
According to some embodiments, the feature encoding sub-network may be obtained by unsupervised training of the feature encoding sub-network and a text encoding sub-network using image-text pair data. A large amount of image-text pair data may be obtained, each pair comprising an image and corresponding label text. The feature encoding sub-network encodes the image in each pair to obtain image features, an additional text encoding sub-network encodes the text to obtain text features, and the two sub-networks are then trained in an unsupervised manner based on a similarity measure between the text features and the image features. In this way, the image encoding capability of the feature encoding sub-network can be trained at low cost without requiring large amounts of annotation data.
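A CLIP-style symmetric contrastive objective is one plausible instantiation of the similarity measure mentioned above; the patent does not specify the exact loss, so the sketch below is an assumption.

```python
import numpy as np

def clip_style_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of image/text features.

    Matching pairs sit on the diagonal of the cosine-similarity matrix;
    the loss averages the cross-entropy of rows (image -> text) and of
    columns (text -> image).
    """
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (B, B) similarity matrix

    def xent_diag(l):
        # cross-entropy where the correct class for row i is column i
        l = l - l.max(axis=1, keepdims=True)     # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
# Perfectly aligned image/text features give a lower loss than
# mismatched (shuffled) pairings.
aligned = clip_style_loss(feats, feats)
shuffled = clip_style_loss(feats, feats[::-1])
```

Minimizing this loss pulls matching image and text features together and pushes non-matching pairs apart, which is what gives the feature encoder its text-alignable representation.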
In existing approaches, neural networks are limited to segmenting objects from a closed set of classes: whenever a new data set appears, the network requires fully supervised retraining and re-labeling of the data. Because of the complexity of road conditions and the limits of computing power, a single neural network often needs to handle detection and segmentation for multiple scenes, and frequently adding scenes makes the traditional scheme very costly. In some embodiments, when a new category is needed, text describing it is used for supervision and its similarity to the input image is computed, so that data of the category of interest can be found.
In some embodiments, the boundary prediction sub-network used in step S203 may perform a boundary prediction task based on the sample image features output by the feature encoding sub-network to obtain a prediction result of the boundary of the road surface fault in the sample image, that is, a first prediction result.
In road disease detection, the tight arrangement of mesh cracks easily leads to adhesion caused by incomplete segmentation. It is therefore necessary to obtain highly discriminative features. The semantic boundary is used to explicitly supervise the neural network and guide it to learn accurate semantic boundaries, giving the network the capability of semantic boundary recognition and stronger inter-class discrimination.
According to some embodiments, the feature encoding sub-network may include a plurality of feature encoding blocks connected in sequence, and the sample image features may include a plurality of semantic-level image features output by the plurality of feature encoding blocks. Each of the feature encoding blocks may employ a residual block structure. The boundary prediction sub-network may include a plurality of feature decoupling blocks and boundary prediction blocks connected in sequence.
Fig. 4 shows a flowchart of a process 400 of obtaining a first prediction result according to an exemplary embodiment of the present disclosure. Process 400 may be used to implement step S203 in method 200. The process 400 includes: step S401, sequentially inputting image features of a plurality of semantic levels from a low semantic level to a high semantic level into corresponding feature decoupling blocks in a plurality of feature decoupling blocks, and acquiring decoupling features output by the last feature decoupling block; and step S402, inputting the decoupling characteristic output by the last characteristic decoupling block into the boundary prediction block to obtain a first prediction result.
In order for the neural network to obtain information-rich semantic boundaries, the above manner constructs the boundary prediction sub-network from a bottom-up (low to high semantic hierarchy) combination of a plurality of feature decoupling blocks and a boundary prediction block. The sub-network thus has both the finer detail of the lower levels and the strong semantics of the higher levels, so the predicted boundary carries strong semantics. In addition, by employing bottom-up feature decoupling blocks and supervising with a boundary prediction block at the highest semantic level, the deep features in the feature encoding sub-network can be better constrained.
In some embodiments, the plurality of feature encoding blocks connected in sequence includes N feature encoding blocks, and then the features output by the feature encoding blocks may be named as first to nth image features. The first image features are image features of the lowest semantic level, and the Nth image features are image features of the highest semantic level. A first feature decoupling block of the plurality of feature decoupling blocks connected in sequence may receive the first image feature, or receive the first image feature and the second image feature, and output the decoupling feature; each feature decoupling block following the first feature decoupling block may receive image features of a corresponding semantic hierarchy and receive the decoupling features output by the previous feature decoupling block. In other words, each feature decoupling block may receive one low-level semantic feature, which may be a decoupling feature output by a previous feature decoupling block, and a high-level semantic feature, which may be an image feature of the corresponding semantic level (the first feature decoupling block may not receive the low-level semantic feature, or the first image feature may be a low-level semantic feature and the second image feature may be a high-level semantic feature).
The feature decoupling block may be composed of a plurality of residual modules. In one exemplary embodiment, the feature decoupling block may include: the first residual error module is used for strengthening the received low-level semantic features to obtain the strengthened low-level semantic features; the second residual error module is used for strengthening the received high-level semantic features to obtain the strengthened high-level semantic features; the third fusion module is used for fusing the reinforced low-level semantic features and the reinforced high-level semantic features to obtain fusion features; and a third residual error module, configured to strengthen the fusion feature to obtain a decoupling feature.
The residual module may include a plurality of convolution layers and residual connections. In one exemplary embodiment, the residual module may include: a first convolution layer, 1×1 in size, for transforming the number of channels of the feature map; a second convolution layer having a size of 3 x 3; batch normalization (Batch Normalization) layers; a first activation function (e.g., a linear rectifying unit ReLU); a third convolution layer having a size of 3 x 3; residual connection from after the first convolutional layer to after the third convolutional layer; a second activation function (e.g., linear rectifying unit ReLU).
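The residual module enumerated above can be sketched in PyTorch as follows. This is a hedged reading of the description; channel counts are illustrative and the patent does not fix them.

```python
import torch
import torch.nn as nn

class ResidualModule(nn.Module):
    """Residual module per the description: a 1x1 conv to set the channel
    count, then 3x3 conv -> BatchNorm -> ReLU -> 3x3 conv, with a residual
    connection from after the first conv to after the third conv, followed
    by a second ReLU."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # first conv
        self.body = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),  # second conv
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),                                # first activation
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),  # third conv
        )
        self.act = nn.ReLU(inplace=True)                          # second activation

    def forward(self, x):
        x = self.proj(x)                   # point the residual here
        return self.act(x + self.body(x))  # residual add, then ReLU

x = torch.randn(1, 64, 32, 32)
y = ResidualModule(64, 128)(x)
```

The 1x1 projection lets the same module be reused wherever the channel count changes between feature levels.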
In some embodiments, the first prediction result may be a boundary image. In the boundary image, each pixel indicates whether the pixel is a boundary of a road surface defect using a preset pixel value. For different road surface defects in the image input to the neural network, different pixel values may be used to represent the boundaries of the different road surface defects. The boundary prediction block may employ a convolutional neural network or other network structure to enable generation of a boundary image based on the decoupling features (e.g., feature maps).
In some embodiments, the segmentation prediction sub-network used in step S204 may perform an instance segmentation task based on the sample image features output by the feature encoding sub-network to obtain a prediction result of the instance segmentation region of the road surface defect in the sample image, that is, a second prediction result.
According to some embodiments, the split prediction sub-network may include a plurality of feature decoding blocks and split prediction blocks connected in sequence.
Fig. 5 shows a flowchart of a process 500 of obtaining a second prediction result according to an exemplary embodiment of the present disclosure. Process 500 may be used to implement step S204 in method 200. Process 500 may include: step S501, inputting image features of a plurality of semantic levels into corresponding feature decoding blocks in a plurality of feature decoding blocks in sequence from a high semantic level to a low semantic level, and acquiring decoding features output by the last feature decoding block; and step S502, inputting the decoding characteristic output by the last characteristic decoding block into the segmentation prediction block to obtain a second prediction result.
When performing a task such as segmentation, the neural network needs to focus on the small number of disease pixels. High-level and low-level features are fused step by step in a top-down (high to low semantic hierarchy) manner, so that high-level features with strongly consistent semantic information guide the low-level instance segmentation prediction, yielding high-resolution decoding features that are semantically rich and more discriminative, and finally an accurate instance segmentation prediction result.
According to some embodiments, the plurality of feature decoding blocks may correspond one-to-one to a plurality of semantic hierarchies. Fig. 6 illustrates a flowchart of a process 600 of acquiring decoding features of a plurality of feature decoding block outputs according to an exemplary embodiment of the present disclosure. Process 600 may be used to implement step S501 in process 500. Process 600 may include: step S601, inputting the image feature of the highest semantic level in the image features of the plurality of semantic levels into a first feature decoding block in a plurality of feature decoding blocks to obtain decoding features output by the first feature decoding block; and step S602, for each feature decoding block after the first feature decoding block, inputting the image feature of the semantic hierarchy corresponding to the feature decoding block and the decoding feature output by the previous feature decoding block into the feature decoding block to obtain the decoding feature output by the feature decoding block.
In some embodiments, the plurality of feature encoding blocks connected in sequence includes N feature encoding blocks, and then the features output by the feature encoding blocks may be named as first to nth image features. The first image features are image features of the lowest semantic level, and the Nth image features are image features of the highest semantic level. A first feature decoding block of the sequentially connected feature decoding blocks may receive an nth image feature and output a decoded feature; the feature decoding block following the first feature decoding block may receive image features of the corresponding semantic hierarchy and receive decoded features output by a previous feature decoding block. In one exemplary embodiment, the ith feature decoding block receives the Nth-i+1 image feature, i ε [1, N ].
Therefore, by arranging the plurality of feature decoding blocks corresponding to the plurality of semantic levels one by one, the image features of different semantic levels can be fully considered when the instance segmentation is carried out, and the accuracy of the finally obtained instance segmentation result is improved.
According to some embodiments, the neural network further comprises a global feature encoding sub-network. The method 200 may further include (not shown): and inputting the image features of the highest semantic level included in the sample image features into a global feature coding sub-network to obtain global image features.
Step S601, inputting the image feature of the highest semantic level among the image features of the plurality of semantic levels into the first feature decoding block of the plurality of feature decoding blocks to obtain the decoded feature output by the first feature decoding block may include: the image features of the highest semantic hierarchy and the global image features are input into a first feature decoding block to obtain decoding features output by the first feature decoding block.
Therefore, global image features are obtained based on the image features of the highest semantic level, and are input into the first feature decoding block, so that the prediction capability of the decoding features output by the first feature decoding block can be further improved, and the accuracy of the finally obtained example segmentation result is improved.
In some embodiments, each feature decoding block may receive one low-level semantic feature and one high-level semantic feature. For the first feature decoding block, the low-level semantic features may be image features of the highest semantic level, and the high-level semantic features may be global image features; for each feature decoding block following the first feature decoding block, the low-level semantic features may be image features of the corresponding semantic hierarchy, and the high-level semantic features may be decoding features output by the previous feature decoding block.
According to some embodiments, each of the plurality of feature decoding blocks may comprise: the first fusion module is configured to fuse low-level semantic features and high-level semantic features to obtain first fusion features, the low-level semantic features are based on image features of a semantic level corresponding to the feature decoding block, and the high-level semantic features are based on decoding features output by the global image features or the previous feature decoding block; the first attention module is configured to obtain a channel attention weight graph based on the first fusion feature, and obtain enhanced low-level semantic features based on the low-level semantic features and the channel attention weight graph; the second fusion module is configured to fuse the reinforced low-level semantic features and the reinforced high-level semantic features to obtain second fusion features; and a second attention module configured to obtain a spatial attention weight map based on the second fused feature and to obtain a third fused feature based on the second fused feature and the spatial attention weight map. The decoded features output by the feature decoding block may be based on a third fusion feature.
Therefore, through the modules, the characteristics of the high layer and the low layer can be screened from the channel dimension and the space dimension, the characteristic with stronger discrimination capability is selected, the problems of noise introduction and characteristic weakening after multi-scale characteristic fusion are solved, and the prediction capability of the neural network is improved.
In some embodiments, the low-level semantic features corresponding to the feature decoding block may directly adopt the image features of the corresponding semantic hierarchy, or may be features obtained by processing the image features. In one exemplary embodiment, the residual module described above may be utilized to process image features of a semantic hierarchy corresponding to a feature decoding block to obtain corresponding low-level semantic features of the feature decoding block.
In some embodiments, the high-level semantic features corresponding to the feature decoding block may be global image features or decoded features output by a previous feature decoding block, or may be features obtained by processing the features (refer to a manner of obtaining low-level semantic features), which is not limited herein.
In some embodiments, the first fusion module may be configured to splice the low-level semantic features and the high-level semantic features along the channel direction to obtain the first fusion features. It will be appreciated that the first fusion module may also be configured to fuse in other ways, such as direct addition, multiplication, weighted summation, processing by a small neural network, or combinations thereof, without limitation.
In some embodiments, the first attention module may further include: a first global average pooling layer (global average pooling, GAP); a fourth convolution layer having a size of 1 x 1; a third activation function (e.g., linear rectifying unit ReLU); a fifth convolution layer having a size of 1 x 1; a fourth activation function (e.g., a sigmoid function); the first fusion sub-module is used for obtaining the enhanced low-level semantic features based on the low-level semantic features and the channel attention weight graph output by the fourth activation function.
It can be understood that the above sub-modules and the network structure included in the first attention module may be implemented by other similar methods, and the first attention module may also be implemented by other structures to obtain a channel attention weight map based on the first fusion feature, and obtain the enhanced low-level semantic features based on the low-level semantic features and the channel attention weight map, which are not limited herein.
In some embodiments, the second fusion module may be configured to directly add the enhanced low-level semantic features and the high-level semantic features to obtain the second fusion feature. It will be appreciated that the second fusion module may also be configured to fuse in other ways, such as concatenation, multiplication, weighted summation, processing by a small neural network, or combinations thereof, without limitation.
In an exemplary embodiment, the first fusion feature is first reduced by global average pooling in the spatial domain to a vector of size 1×1×C1, where C1 is the number of channels. Global average pooling provides good global context priors with strong consistency, and by summarizing spatial information it makes the result more robust to spatial transformations of the input. The vector is then passed through two 1×1 convolution layers (with ReLU activation between them), and a sigmoid activation function maps the values to [0, 1], yielding the channel attention weight map. The first fusion feature is multiplied by this weight map and then added to the high-level semantic feature to obtain the second fusion feature.
In some embodiments, the second attention module may further include: a channel-domain global average pooling layer (global average pooling, GAP); a channel-domain global maximum pooling layer (global maximum pooling, GMP); a second fusion sub-module for fusing the features output by the two pooling layers; a sixth convolution layer with a size of 7×7; a fifth activation function (e.g., a sigmoid function); and a third fusion sub-module for obtaining the third fusion feature based on the second fusion feature and the spatial attention weight map output by the fifth activation function.
It will be appreciated that the above sub-modules and network structures included in the second attention module may be implemented by other similar methods, and the second attention module may also be implemented by other structures to obtain a spatial attention weight map based on the second fusion feature, and obtain a decoding feature based on the second fusion feature and the spatial attention weight map, which are not limited herein.
In an exemplary embodiment, the second fusion feature input to the second attention module is subjected to channel-domain global average pooling and channel-domain global maximum pooling to obtain two single-channel features, which are concatenated along the channel dimension, passed through a 7×7 convolution layer, and mapped to [0, 1] by a sigmoid activation function, finally yielding the spatial attention weight map. This map is multiplied with the second fusion feature to obtain a feature with optimized spatial information, that is, the third fusion feature.
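The channel-attention and spatial-attention steps described above can be sketched together in PyTorch. Kernel sizes follow the description; the channel width and reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """GAP -> 1x1 conv -> ReLU -> 1x1 conv -> sigmoid, producing a
    per-channel weight vector, as in the first attention module."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                          # spatial GAP
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                     # weights in [0, 1]
        )

    def forward(self, x):
        return x * self.net(x)   # reweight the channels of the input

class SpatialAttention(nn.Module):
    """Channel-wise average and max pooling, concatenation, 7x7 conv,
    sigmoid, producing a spatial weight map, as in the second module."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)        # channel-domain GAP
        mx, _ = x.max(dim=1, keepdim=True)       # channel-domain GMP
        weights = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * weights

f = torch.randn(1, 32, 16, 16)
out = SpatialAttention()(ChannelAttention(32)(f))
```

Both modules preserve the feature map's shape; they only rescale it, first per channel and then per spatial location, which is what lets them suppress weakly discriminative responses after fusion.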
The high and low layers of the neural network differ in feature recognition capability, and hence in prediction consistency, which weakens the expressive power of the fused features. The two attention modules enhance strongly discriminative features and suppress weakly discriminative ones, improving the expressive power and prediction consistency of the fused features and ultimately the prediction capability of the neural network.
In some embodiments, the third fused feature may be directly used as the decoded feature output by the feature decoding block, or may be further processed (e.g., enhanced using the residual module described above) to obtain the decoded feature.
In some embodiments, the second prediction result may be an instance segmentation image in which each pixel uses a preset pixel value to indicate whether it belongs to a road surface defect. Different pixel values may represent different road surface defects in the image input to the neural network. The segmentation prediction block may employ a convolutional neural network or other network structure capable of generating an instance segmentation image based on decoded features (e.g., feature maps).
After the instance segmentation image is obtained, the geographic coordinates of the defects can be determined from the intrinsic and extrinsic parameters of the camera, realizing position localization of the defects, and the actual area of the defects can likewise be computed from these parameters. For this purpose, a depth estimation branch can be added on top of the instance segmentation to estimate the depth information of the image.
According to some embodiments, the neural network may further include a depth estimation sub-network including a plurality of depth information decoding blocks and depth prediction blocks connected in sequence, the real label including a third real label corresponding to an image depth of the road surface fault in the sample image.
Fig. 7 shows a flowchart of a training method of a neural network for road surface fault detection according to an exemplary embodiment of the present disclosure. As shown in fig. 7, method 700 includes: step S706, inputting a plurality of decoding features output by a plurality of feature decoding blocks into corresponding depth information decoding blocks in a plurality of depth information decoding blocks in sequence from a shallow layer to a deep layer, and obtaining the depth information feature output by the last depth information decoding block; step S707, inputting the depth information feature output by the last depth information decoding block into a depth prediction block to obtain a third prediction result, wherein the third prediction result represents the image depth of the pavement defect in the sample image; and step S708, adjusting parameters of the neural network based on the third prediction result and the third real label.
It is to be understood that for the operations of steps S701 to S705 in the method 700, reference may be made to the description of steps S201 to S205 in the method 200, which is not repeated here.
Therefore, the training of the depth prediction capability of the neural network is realized through the mode, so that the trained neural network can output the image depth of the pavement diseases in the image.
According to some embodiments, the plurality of depth information decoding blocks are in one-to-one correspondence with the plurality of feature decoding blocks. Fig. 8 shows a flowchart of a process 800 of acquiring depth information features output by a plurality of depth information decoding blocks according to an exemplary embodiment of the present disclosure. Process 800 may be used to implement step S706 in method 700. As shown in fig. 8, process 800 includes: step S801, inputting the decoding feature output by the first feature decoding block into a first depth information decoding block in a plurality of depth information decoding blocks to obtain the depth information feature output by the first depth information decoding block; and step S802, for each depth information decoding block after the first depth information decoding block, inputting the decoding feature output by the feature decoding block corresponding to the depth information decoding block and the depth information feature output by the previous depth information decoding block into the depth information decoding block to obtain the depth information feature output by the depth information decoding block.
In some embodiments, each of the sequentially connected plurality of depth information decoding blocks may receive a decoded feature output by the corresponding feature decoding block. Specifically, the i-th depth information decoding block may receive the decoded feature output by the i-th feature decoding block, i e [1, n ]. The decoding features received by each depth information decoding block may be referred to as low-level semantic features received by the depth information decoding block. Each depth information decoding block following the first depth information decoding block may also receive a depth information feature output by a previous depth information decoding block, which may be referred to as a high-level semantic feature received by the depth information decoding block.
Therefore, by arranging the plurality of depth information decoding blocks which are in one-to-one correspondence with the plurality of feature decoding blocks, decoding features output by different feature decoding blocks can be fully considered when depth information feature extraction is carried out, and therefore the accuracy of the finally obtained image depth is improved.
The depth information decoding block may have the same structure as the feature decoupling block, i.e., be composed of a plurality of residual modules. In one exemplary embodiment, the depth information decoding block may include: a fourth residual module for strengthening the received low-level semantic features to obtain strengthened low-level semantic features; a fifth residual module for strengthening the received high-level semantic features to obtain strengthened high-level semantic features (this module may be skipped for the first depth information decoding block); a fourth fusion module for fusing the strengthened low-level and high-level semantic features to obtain a fusion feature; and a sixth residual module, configured to strengthen the fusion feature to obtain the depth information feature.
In some embodiments, the third prediction result may be a depth map. A depth map is an image that describes the image depth of different areas in the image, typically using grey values or color coding to represent the distance, i.e. the image depth, of objects in the image. The depth prediction block may employ a convolutional neural network or other network structure to enable depth map generation based on depth information features (feature maps).
In the prediction stage, the image depth may be used to calculate the actual size of the road surface fault in the image input to the neural network, as will be described below.
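How image depth converts pixel measurements into real-world size can be illustrated with a standard pinhole-camera relation; the patent does not give the exact formula, so the numbers and the fronto-parallel assumption below are illustrative.

```python
import numpy as np

def pixel_area_to_world_area(pixel_count, depth_m, fx, fy):
    """Approximate real-world area of a segmented region.

    Under a pinhole camera model, a pixel at depth Z covers roughly
    (Z / fx) x (Z / fy) metres on a fronto-parallel surface, so the
    region's area is pixel_count * Z^2 / (fx * fy). fx and fy are the
    focal lengths in pixels from the camera intrinsics; this assumes
    the road patch is roughly perpendicular to the optical axis.
    """
    return pixel_count * depth_m ** 2 / (fx * fy)

# Hypothetical numbers: a 5000-pixel defect region seen at 10 m with
# fx = fy = 1000 px covers about 0.5 square metres.
area = pixel_area_to_world_area(5000, 10.0, 1000.0, 1000.0)
```

A real deployment would also use the extrinsic parameters to account for the camera's tilt relative to the road surface, which this sketch ignores.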
Fig. 9 shows a schematic diagram of a neural network for road surface fault detection according to an exemplary embodiment of the present disclosure. As shown in fig. 9, the neural network 900 includes a feature encoding sub-network 910, a boundary prediction sub-network 920, a segmentation prediction sub-network 930, a depth estimation sub-network 940, and a global feature encoding sub-network 950.
The feature encoding subnetwork 910 further includes a plurality of feature encoding blocks 912 connected in sequence. The plurality of feature encoding blocks 912 perform feature extraction on the input image 902 to obtain a plurality of image features of a plurality of semantic hierarchies.
The global feature encoding sub-network 950 receives the image feature of the highest semantic level among the plurality of image features output by the feature encoding sub-network 910, resulting in a global image feature.
The boundary prediction subnetwork 920 further includes a plurality of feature decoupling blocks 922 and boundary prediction blocks 924 connected in series. The plurality of feature decoupling blocks 922 sequentially receive the plurality of image features from the low semantic level to the high semantic level and output the decoupled features. The boundary prediction block 924 receives the decoupling feature output by the last feature decoupling block and outputs a first prediction result 926 that characterizes the boundary of road surface defects in the input image.
The segmentation prediction sub-network 930 further includes a plurality of feature decoding blocks 932 and a segmentation prediction block 934, which are connected in sequence. The plurality of feature decoding blocks 932 sequentially receive the global image feature and the plurality of image features from the high semantic level to the low semantic level and output decoded features. The segmentation prediction block 934 receives the decoded feature output by the last feature decoding block and outputs a second prediction result 936 that characterizes an instance segmentation region of road surface defects in the input image.
The depth estimation sub-network 940 further includes a plurality of depth information decoding blocks 942 and a depth prediction block 944 that are sequentially connected. The plurality of depth information decoding blocks 942 sequentially receive, from the low semantic level to the high semantic level, the plurality of decoded features output by the plurality of feature decoding blocks 932 and output depth information features. The depth prediction block 944 receives the depth information feature output by the last depth information decoding block and outputs a third prediction result 946 that characterizes the image depth (distance) of road surface defects in the input image.
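The data flow through the sub-networks described above can be sketched structurally as follows. Every layer is replaced with a trivial placeholder operation (striding, mean pooling); only the wiring among sub-networks 910, 920, 930, and 950 is meant to be illustrative, not their real computations.

```python
import numpy as np

def encode(image, n_blocks=4):
    """Feature encoding sub-network 910: one image feature per semantic level."""
    features, x = [], image
    for _ in range(n_blocks):
        x = x[:, ::2, ::2]           # placeholder for a stride-2 encoding block
        features.append(x)
    return features                   # ordered low -> high semantic level

def boundary_branch(features):
    """Sub-network 920: decoupling blocks walk features low -> high."""
    out = 0.0
    for f in features:                # low to high semantic level
        out = out + f.mean()          # placeholder feature decoupling block
    return out                        # placeholder first prediction result 926

def segmentation_branch(features, global_feature):
    """Sub-network 930: decoding blocks walk features high -> low."""
    out = global_feature
    for f in reversed(features):      # high to low semantic level
        out = out + f.mean()          # placeholder feature decoding block
    return out                        # placeholder second prediction result 936

image = np.ones((3, 64, 64))
feats = encode(image)
global_feat = feats[-1].mean()        # global feature encoding sub-network 950
boundary = boundary_branch(feats)
segmentation = segmentation_branch(feats, global_feat)
```

The real blocks are learned layers producing feature maps, not scalars; the sketch only shows which branch consumes which features and in what order.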
According to some embodiments, the real labels may include a first real label corresponding to the boundary and a second real label corresponding to the instance segmentation area.
Fig. 10 shows a flowchart of a process 1000 of adjusting parameters of a neural network according to an exemplary embodiment of the present disclosure. Process 1000 may be used to implement step S205 in method 200. Process 1000 may include: step S1001, adjusting parameters of the feature encoding sub-network and the boundary prediction sub-network based on the first prediction result and the first real label; step S1002, freezing the parameters of the feature encoding sub-network and the boundary prediction sub-network; and step S1003, adjusting parameters of the segmentation prediction sub-network based on the second prediction result and the second real label.
In this way, the feature encoding sub-network and the boundary prediction sub-network are tuned first, their parameters are then frozen, and the segmentation prediction sub-network is tuned afterwards. This reduces the training difficulty, improves the training effect, and enables the trained neural network to output more accurate boundary prediction results and instance segmentation results.
In some embodiments, it is also possible to freeze only the parameters of the boundary prediction sub-network at step S1002 and to adjust the parameters of both the segmentation prediction sub-network and the feature encoding sub-network at step S1003. In addition, at step S1003, parameters of the global feature encoding sub-network may also be adjusted.
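Process 1000 amounts to a staged training schedule. In the sketch below, plain dictionaries stand in for real layers (in a framework such as PyTorch, freezing would typically correspond to setting `requires_grad` to false on the relevant parameters); all names and values are illustrative.

```python
def make_param(value):
    return {"value": value, "frozen": False}

params = {
    "feature_encoding": make_param(0.0),
    "boundary_prediction": make_param(0.0),
    "segmentation_prediction": make_param(0.0),
}

def sgd_step(params, grads, lr=0.1):
    """Apply one gradient step, skipping any frozen parameter."""
    for name, grad in grads.items():
        if not params[name]["frozen"]:
            params[name]["value"] -= lr * grad

# Step S1001: tune the encoder and boundary branch from the boundary loss.
sgd_step(params, {"feature_encoding": 1.0, "boundary_prediction": 1.0})
# Step S1002: freeze both of them.
params["feature_encoding"]["frozen"] = True
params["boundary_prediction"]["frozen"] = True
# Step S1003: tune only the segmentation branch; the frozen encoder's
# gradient is ignored even though the segmentation loss would produce one.
sgd_step(params, {"feature_encoding": 1.0, "segmentation_prediction": 1.0})
```

After the two stages, the encoder has only moved during step S1001, which is exactly the behaviour the freezing in step S1002 is meant to guarantee.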
According to some embodiments, step S708 of adjusting parameters of the neural network based on the third prediction result and the third real label may include: adjusting parameters of the depth estimation sub-network based on the third prediction result and the third real label, wherein parameters of the other sub-networks in the neural network are frozen before the parameters of the depth estimation sub-network are adjusted.
In this way, the feature encoding sub-network, the boundary prediction sub-network, and the segmentation prediction sub-network are tuned first, their parameters are frozen, and the depth estimation sub-network is tuned afterwards. This reduces the training difficulty, improves the training effect, and enables the trained neural network to output more accurate depth prediction results. Moreover, a single neural network can thereby predict the depth map, the instance segmentation map, and the boundaries of pavement diseases, greatly improving the prediction effect and inference performance.
According to another aspect of the present disclosure, a pavement disease detection method is provided. The neural network for road surface disease detection comprises a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network. Fig. 11 shows a flowchart of a pavement disease detection method 1100 according to an exemplary embodiment of the present disclosure. The method 1100 includes: step S1101, acquiring an image to be detected; step S1102, inputting the image to be detected into the feature encoding sub-network to obtain features of the image to be detected; step S1103, inputting the features of the image to be detected into the boundary prediction sub-network to obtain a boundary prediction result, where the boundary prediction result characterizes the boundary of pavement diseases in the image to be detected; step S1104, inputting the features of the image to be detected into the segmentation prediction sub-network to obtain a segmentation prediction result, where the segmentation prediction result characterizes an instance segmentation region of pavement diseases in the image to be detected; and step S1105, obtaining a pavement disease detection result of the image to be detected based on the boundary prediction result and the segmentation prediction result.
In some embodiments, the neural network used in method 1100 may be trained using method 200 or method 700 described above, and the structure of the neural network may be as described above. It is to be understood that the operations of step S1102-step S1104 in the method 1100 may refer to the descriptions of step S202-step S204 in the method 200, which are not described herein.
The method extracts the features of the image to be detected, uses the boundary prediction sub-network to obtain, from those features, a boundary prediction result characterizing the boundaries of road surface diseases, uses the segmentation prediction sub-network to obtain, from the same features, a segmentation prediction result characterizing instance segmentation regions of road surface diseases, and finally obtains the road surface disease detection result based on the two prediction results. In this way, road surface disease detection is performed with the two branches of boundary prediction and instance segmentation, and a more accurate road surface disease detection result can be obtained based on both.
In some embodiments, the image to be detected may be a road surface image acquired in various ways. Video frames from video shot by an inspection vehicle may be used as the images to be detected, enabling real-time detection. In an exemplary embodiment, a high-frame-rate high-definition camera is used as the data source. If the visual range of the camera is 5 m in front of the vehicle and the inspection speed on urban roads is below 70 km/h, the minimum frame rate of the camera is about 4 frames per second, so the frame rate of the input video stream may be set to 10 fps at a resolution of 1920×1080. The output of the camera may be connected directly to a detection system on which the neural network is deployed, so that road surface disease conditions can be analyzed in real time.
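The quoted minimum frame rate follows from simple arithmetic on the stated visual range and speed cap: every road point must appear in at least one frame before it leaves the 5 m field of view.

```python
import math

speed_m_per_s = 70 * 1000 / 3600                 # 70 km/h ~= 19.4 m/s
visual_range_m = 5.0                             # camera sees 5 m of road ahead
min_frame_rate = speed_m_per_s / visual_range_m  # ~3.9 frames per second
required_fps = math.ceil(min_frame_rate)         # rounds up to 4 frames/s
```

Setting the stream to 10 fps, as in the exemplary embodiment, leaves a comfortable margin above this 4 fps minimum.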
According to some embodiments, the feature encoding sub-network may include a plurality of feature encoding blocks connected in sequence, the features of the image to be detected may include image features of a plurality of semantic levels output by the plurality of feature encoding blocks, and the boundary prediction sub-network may include a plurality of feature decoupling blocks and a boundary prediction block connected in sequence. Step S1103, inputting the features of the image to be detected into the boundary prediction sub-network to obtain the boundary prediction result, may include: sequentially inputting the image features of the plurality of semantic levels, from the low semantic level to the high semantic level, into corresponding feature decoupling blocks among the plurality of feature decoupling blocks, and acquiring the decoupled feature output by the last feature decoupling block; and inputting the decoupled feature output by the last feature decoupling block into the boundary prediction block to obtain the boundary prediction result.
To enable the neural network to obtain information-rich semantic boundaries, the above approach constructs the boundary prediction sub-network as a bottom-up combination (from low to high semantic levels) of a plurality of feature decoupling blocks followed by a boundary prediction block. This sub-network retains both the finer detail of the lower layers and the strong semantic information of the higher layers, so that the predicted boundaries are semantically meaningful and accurate boundary information can ultimately be obtained.
According to some embodiments, the segmentation prediction sub-network may include a plurality of feature decoding blocks and a segmentation prediction block connected in sequence. Step S1104, inputting the features of the image to be detected into the segmentation prediction sub-network to obtain a segmentation prediction result, may include: sequentially inputting the image features of the plurality of semantic levels, from the high semantic level to the low semantic level, into corresponding feature decoding blocks among the plurality of feature decoding blocks, and acquiring the decoded feature output by the last feature decoding block; and inputting the decoded feature output by the last feature decoding block into the segmentation prediction block to obtain the segmentation prediction result.
When performing a task such as segmentation, the neural network needs to focus on the small image regions occupied by diseases. High-level and low-level features are gradually fused in a top-down manner (from high to low semantic levels), so that high-level features with strong, consistent semantic information guide the low-level instance segmentation prediction. This yields high-resolution decoded features that are semantically rich and more discriminative, and ultimately an accurate instance segmentation prediction result.
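A minimal sketch of this top-down fusion, with nearest-neighbour upsampling and elementwise addition standing in for the learned feature decoding blocks (the real blocks would be learned layers):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour upsampling, a stand-in for a learned upsampler."""
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def top_down_decode(features):
    """Fuse image features from high to low semantic level.

    Each step upsamples the running decoded feature to the next (larger)
    resolution and adds the lower-level feature, so high-level semantics
    guide the high-resolution instance segmentation prediction.
    """
    high_to_low = list(reversed(features))
    decoded = high_to_low[0]
    for f in high_to_low[1:]:
        decoded = upsample2x(decoded) + f   # placeholder feature decoding block
    return decoded

feats = [np.ones((16, 16)), np.ones((8, 8)), np.ones((4, 4))]  # low -> high
decoded = top_down_decode(feats)             # full-resolution decoded feature
```

The output has the resolution of the lowest semantic level, matching the requirement that the last feature decoding block emit a high-resolution decoded feature.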
In some embodiments, at step S1105, the boundary prediction result and the segmentation prediction result may be fused in various ways to obtain an accurate pavement disease detection result. In one exemplary embodiment, the road surface disease detection result may include the area where the road surface disease is located. An intersection operation, a union operation, or another operation may be performed on the disease area indicated by the boundary prediction result and the disease area indicated by the segmentation prediction result. Alternatively, the boundary prediction sub-network and the segmentation prediction sub-network may each output a confidence for their respective prediction results, and the two results may be fused based on these confidences to obtain the final pavement disease detection result.
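The fusion options named above can be sketched as follows; the confidence-weighted rule and the 0.5 threshold are illustrative assumptions, shown alongside plain intersection and union:

```python
import numpy as np

def fuse_predictions(boundary_mask, seg_mask, boundary_conf, seg_conf,
                     threshold=0.5):
    """Confidence-weighted fusion of two binary disease-region masks.

    One possible realization of the confidence-based fusion described
    above; the threshold and weighting scheme are illustrative.
    """
    score = (boundary_conf * boundary_mask + seg_conf * seg_mask) / (
        boundary_conf + seg_conf)
    return score >= threshold

b = np.array([[1, 1, 0], [0, 1, 0]], dtype=float)  # boundary-derived region
s = np.array([[1, 0, 0], [0, 1, 1]], dtype=float)  # segmentation region
intersection = (b > 0) & (s > 0)    # the "intersection taking operation"
union = (b > 0) | (s > 0)           # the "union taking operation"
fused = fuse_predictions(b, s, boundary_conf=0.9, seg_conf=0.6)
```

With these toy confidences, pixels supported only by the more confident boundary branch survive the threshold while pixels supported only by the segmentation branch do not.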
In some embodiments, in the prediction stage, the segmentation prediction result may be generated using only the feature encoding sub-network and the segmentation prediction sub-network, and the result may be used as the road surface disease detection result. In other words, the boundary prediction subnetwork may be used to supervise the feature encoding subnetwork during the training phase and not be used during the prediction phase.
According to some embodiments, the neural network may further comprise a depth estimation sub-network, which may comprise a plurality of depth information decoding blocks and a depth prediction block connected in sequence. The method 1100 may further include (not shown): sequentially inputting the plurality of decoded features output by the plurality of feature decoding blocks, from shallow layers to deep layers, into corresponding depth information decoding blocks among the plurality of depth information decoding blocks, and acquiring the depth information feature output by the last depth information decoding block; inputting the depth information feature output by the last depth information decoding block into the depth prediction block to obtain a depth prediction result, where the depth prediction result characterizes the image depth of pavement diseases in the image to be detected; and determining the actual size of the pavement diseases based on their image depth.
In this way, the neural network can output the image depth of road surface diseases in the image, and the actual size of the diseases can be determined based on the image depth, thereby realizing comprehensive detection of road surface diseases.
In some embodiments, after the image depth is obtained, the actual size of an object in the depth map may be calculated from the internal parameters of the camera (the camera intrinsics include the focal length, the principal point, and other parameters that determine the mapping between pixels and actual distances), for example using the following formula:

actual size per pixel = (depth value × sensor physical size) / (focal length × pixel width)
the depth value is an image depth, for example, may be a depth value of a specific pixel in the depth map. The sensor physical size may be an actual physical size of the camera sensor. The focal length may be a focal length of the camera. The pixel width may be the number of pixels in the horizontal direction in the depth map.
Pavement disease detection also faces the following problems: 1) the camera captures multiple frames of the same disease as it approaches from far to near, but only one frame needs to be uploaded when reporting, which involves deduplicating diseases across space; and 2) the same disease should not be reported repeatedly when passed at different times, which involves deduplicating diseases across time.
In some embodiments, data of the same disease under different viewing angles are collected, and the similarity between pavement diseases is computed to achieve deduplication across space and time. In an exemplary embodiment, observations of the same disease at different moments are aggregated, and features are extracted by an image feature extraction backbone network (for example, ResNet50). In the training stage, the backbone network and a fully connected layer output the identity of the disease; in practical application, the fully connected layer is removed and the feature of the preceding layer is taken as the feature representation of the road surface disease. A distance metric (for example, cosine similarity) is then computed against the feature representations from the previous frame or from historical frames to match and deduplicate road surface diseases.
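A minimal sketch of this matching step, assuming the feature representations have already been extracted by the backbone; the 0.8 similarity threshold is an illustrative assumption:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_disease(query_feat, history_feats, threshold=0.8):
    """Return the index of the best-matching historical disease, or None.

    A query that matches a historical entry is a duplicate observation and
    need not be reported again; an unmatched query is a new disease.
    """
    best_idx, best_sim = None, threshold
    for i, h in enumerate(history_feats):
        sim = cosine_similarity(query_feat, h)
        if sim >= best_sim:
            best_idx, best_sim = i, sim
    return best_idx

history = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
query = np.array([0.9, 0.1, 0.0])        # same disease seen from a new angle
match = match_disease(query, history)    # deduplicate against history
```

Because cosine similarity ignores vector magnitude, the match is robust to overall brightness or scale changes between viewing angles, which suits comparing the same disease captured from far and near.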
Because different road surface diseases differ in severity, a database may be established for the long-term observation of lighter road surface diseases. In some embodiments, road surface diseases may be graded based on their physical characteristics (length, width, area, etc.), texture characteristics, and the like.
In one exemplary embodiment, the severity of road surface diseases can be divided into ten grades following the recommendations of a professional road maintenance agency, and the diseases are graded using end-to-end regression: features are extracted by an image feature extraction backbone network (e.g., ResNet50), followed by a fully connected layer (e.g., one layer), the score is then mapped to [0, 1] via an activation function (e.g., a sigmoid function), and the loss is back-propagated using mean squared error.
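The regression head's post-processing can be sketched as follows. The uniform bucketing of the [0, 1] score into ten grades is an illustrative assumption; the disclosure itself only specifies the sigmoid mapping and the mean squared error loss.

```python
import math

def severity_score(logit):
    """Map the fully connected layer's raw output to [0, 1] via sigmoid."""
    return 1.0 / (1.0 + math.exp(-logit))

def severity_grade(score, n_grades=10):
    """Bucket the [0, 1] regression score into one of ten severity grades."""
    return min(int(score * n_grades) + 1, n_grades)

def mse_loss(pred, target):
    """Mean squared error for a single sample, as used for the gradient."""
    return (pred - target) ** 2

score = severity_score(0.0)     # a raw logit of 0 maps to a score of 0.5
grade = severity_grade(score)   # mid-range score lands in a middle grade
```

Treating severity as a single regressed scalar, rather than a ten-way classification, keeps the ordering of grades explicit in the loss.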
In some embodiments, road surface diseases across space and time can be uniquely identified according to the re-identification result, so that when the same location is passed multiple times, each road surface disease can be tracked to determine its state (repaired, continuing to deteriorate, unchanged, etc.).
According to another aspect of the present disclosure, a training device for a neural network for road surface disease detection is provided. The neural network includes a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network. Fig. 12 shows a block diagram of a training device 1200 for a neural network for road surface disease detection according to an exemplary embodiment of the present disclosure. The apparatus 1200 includes: a first acquisition unit 1210 configured to acquire a sample image and a real label of road surface diseases in the sample image; a first feature encoding unit 1220 configured to input the sample image into the feature encoding sub-network to obtain sample image features; a first boundary prediction unit 1230 configured to input the sample image features into the boundary prediction sub-network to obtain a first prediction result, the first prediction result characterizing the boundary of road surface diseases in the sample image; a first segmentation prediction unit 1240 configured to input the sample image features into the segmentation prediction sub-network to obtain a second prediction result, the second prediction result characterizing an instance segmentation region of road surface diseases in the sample image; and a parameter tuning unit 1250 configured to adjust parameters of the neural network based on the first prediction result, the second prediction result, and the real label.
It is to be understood that the operations of the units 1210-1250 in the apparatus 1200 may refer to the descriptions of the steps S201-S205 in the method 200 above, and are not described herein.
According to another aspect of the present disclosure, a road surface disease detection device is provided. The neural network for road surface disease detection comprises a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network. Fig. 13 shows a block diagram of a road surface disease detection device 1300 according to an exemplary embodiment of the present disclosure. The apparatus 1300 includes: a second acquisition unit 1310 configured to acquire an image to be detected; a second feature encoding unit 1320 configured to input the image to be detected into the feature encoding sub-network to obtain features of the image to be detected; a second boundary prediction unit 1330 configured to input the features of the image to be detected into the boundary prediction sub-network to obtain a boundary prediction result, where the boundary prediction result characterizes the boundary of pavement diseases in the image to be detected; a second segmentation prediction unit 1340 configured to input the features of the image to be detected into the segmentation prediction sub-network to obtain a segmentation prediction result, where the segmentation prediction result characterizes an instance segmentation region of pavement diseases in the image to be detected; and a detection unit 1350 configured to obtain a pavement disease detection result of the image to be detected based on the boundary prediction result and the segmentation prediction result.
It is to be understood that the operations of the units 1310-1350 in the apparatus 1300 may refer to the descriptions of the steps S1101-S1105 in the method 1100, which are not described herein.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, there is also provided an electronic device, a readable storage medium and a computer program product.
Referring to fig. 14, a block diagram of an electronic device 1400, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. Electronic devices are intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown here, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 14, the apparatus 1400 includes a computing unit 1401 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1402 or a computer program loaded from a storage unit 1408 into a Random Access Memory (RAM) 1403. In the RAM 1403, various programs and data required for the operation of the device 1400 can also be stored. The computing unit 1401, the ROM 1402, and the RAM 1403 are connected to each other through a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
Various components in the device 1400 are connected to the I/O interface 1405, including: an input unit 1406, an output unit 1407, a storage unit 1408, and a communication unit 1409. The input unit 1406 may be any type of device capable of inputting information to the device 1400; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 1407 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, video/audio output terminals, vibrators, and/or printers. The storage unit 1408 may include, but is not limited to, magnetic disks and optical disks. The communication unit 1409 allows the device 1400 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 1401 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning network algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1401 performs the respective methods and processes described above, for example, a training method of a neural network and/or a road surface disease detection method. For example, in some embodiments, the training method of the neural network and/or the road surface fault detection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 1408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1400 via the ROM 1402 and/or the communication unit 1409. When the computer program is loaded into the RAM 1403 and executed by the computing unit 1401, one or more steps of the training method of the neural network and/or the road surface damage detection method described above may be performed. Alternatively, in other embodiments, the computing unit 1401 may be configured to perform the training method of the neural network and/or the road surface fault detection method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the foregoing methods, systems, and apparatuses are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples but is defined only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced with equivalents thereof. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.
Claims (20)
1. A training method for a neural network for road surface fault detection, the neural network comprising a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network, the method comprising:
acquiring a sample image and a real label of pavement diseases in the sample image;
inputting the sample image into the feature coding sub-network to obtain sample image features;
inputting the sample image characteristics into the boundary prediction sub-network to obtain a first prediction result, wherein the first prediction result represents the boundary of the pavement diseases in the sample image;
inputting the sample image features into the segmentation prediction sub-network to obtain a second prediction result, wherein the second prediction result represents an instance segmentation area of the pavement diseases in the sample image; and
and adjusting parameters of the neural network based on the first prediction result, the second prediction result and the real label.
2. The method of claim 1, wherein the feature encoding sub-network comprises a plurality of feature encoding blocks connected in sequence, the sample image features comprise image features of a plurality of semantic levels output by the plurality of feature encoding blocks, the boundary prediction sub-network comprises a plurality of feature decoupling blocks and boundary prediction blocks connected in sequence,
wherein inputting the sample image features into the boundary prediction sub-network to obtain a first prediction result comprises:
sequentially inputting the image features of the plurality of semantic levels, from a low semantic level to a high semantic level, into the corresponding feature decoupling blocks among the plurality of feature decoupling blocks, and acquiring decoupling features output by the last feature decoupling block; and
inputting the decoupling features output by the last feature decoupling block into the boundary prediction block to obtain the first prediction result.
3. The method of claim 2, wherein the segmentation prediction sub-network comprises a plurality of feature decoding blocks and a segmentation prediction block connected in sequence, and wherein inputting the sample image features into the segmentation prediction sub-network to obtain a second prediction result comprises:
sequentially inputting the image features of the plurality of semantic levels from a high semantic level to a low semantic level into corresponding feature decoding blocks in the plurality of feature decoding blocks, and acquiring decoding features output by the last feature decoding block; and
inputting the decoding features output by the last feature decoding block into the segmentation prediction block to obtain the second prediction result.
4. The method of claim 3, wherein the plurality of feature decoding blocks are in one-to-one correspondence with the plurality of semantic levels, and wherein sequentially inputting the image features of the plurality of semantic levels from a high semantic level to a low semantic level into the corresponding feature decoding blocks and acquiring the decoding features output by the last feature decoding block comprises:
inputting the image feature of the highest semantic level among the image features of the plurality of semantic levels into a first feature decoding block of the plurality of feature decoding blocks to obtain decoding features output by the first feature decoding block; and
for each feature decoding block after the first feature decoding block, inputting the image features of the semantic level corresponding to that feature decoding block, together with the decoding features output by the previous feature decoding block, into that feature decoding block to obtain the decoding features output by that feature decoding block.
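The decoder wiring of claims 3-4 follows a familiar U-Net-like pattern: the first decoding block consumes only the highest-level feature, and each later block fuses its own-level (skip) feature with the previous block's output. A sketch under the assumption that a decoding block simply upsamples and adds — the real blocks are learned modules:

```python
import numpy as np

def decode_block(skip_feature, higher_feature):
    # Nearest-neighbour upsample of the higher-level feature to the skip
    # resolution, then additive fusion; a stand-in for a learned block.
    factor = skip_feature.shape[0] // higher_feature.shape[0]
    upsampled = np.repeat(np.repeat(higher_feature, factor, axis=0),
                          factor, axis=1)
    return skip_feature + upsampled

# Image features at three semantic levels (highest level = smallest map).
features = [np.ones((16, 16)), np.ones((8, 8)), np.ones((4, 4))]

# The first decoding block receives only the highest-level feature ...
decoded = features[-1]
# ... and every later block fuses its corresponding-level feature with the
# decoding features output by the previous block, high level to low level.
for skip in reversed(features[:-1]):
    decoded = decode_block(skip, decoded)

print(decoded.shape)  # (16, 16): the last decoding features, at full resolution
```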
5. The method of claim 4, wherein the neural network further comprises a global feature encoding sub-network, the method further comprising:
inputting the image features of the highest semantic level included in the sample image features into the global feature encoding sub-network to obtain global image features,
wherein inputting the image feature of the highest semantic level among the image features of the plurality of semantic levels into the first feature decoding block of the plurality of feature decoding blocks to obtain the decoding features output by the first feature decoding block comprises:
inputting the image features of the highest semantic level and the global image features into the first feature decoding block to obtain the decoding features output by the first feature decoding block.
6. The method of claim 1, wherein the real label comprises a first real label corresponding to the boundary and a second real label corresponding to the instance segmentation region, and wherein adjusting parameters of the neural network based on the first prediction result, the second prediction result, and the real label comprises:
adjusting parameters of the feature coding sub-network and the boundary prediction sub-network based on the first prediction result and the first real label;
freezing parameters of the feature encoding sub-network and the boundary prediction sub-network; and
adjusting parameters of the segmentation prediction sub-network based on the second prediction result and the second real label.
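The staged schedule of claim 6 can be sketched with a per-group `frozen` flag standing in for what a framework such as PyTorch would express with `requires_grad`; the parameter values, gradients, and learning rate below are purely illustrative:

```python
# Minimal sketch of the two-stage training schedule of claim 6.
params = {
    "feature_encoder":   {"w": 0.5, "frozen": False},
    "boundary_head":     {"w": 0.5, "frozen": False},
    "segmentation_head": {"w": 0.5, "frozen": False},
}

def sgd_step(group, grad, lr=0.1):
    # A frozen parameter group is simply skipped by the optimizer.
    if not group["frozen"]:
        group["w"] -= lr * grad

# Stage 1: fit the encoder and boundary head against the first real label.
sgd_step(params["feature_encoder"], grad=1.0)
sgd_step(params["boundary_head"], grad=1.0)

# Freeze them before stage 2, as the claim requires.
params["feature_encoder"]["frozen"] = True
params["boundary_head"]["frozen"] = True

# Stage 2: only the segmentation head moves against the second real label.
for group in params.values():
    sgd_step(group, grad=1.0)

print(params["feature_encoder"]["w"], params["segmentation_head"]["w"])
```

The frozen encoder guarantees that stage 2 adapts the segmentation head to fixed features, so boundary quality learned in stage 1 is not disturbed.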
7. The method of claim 3, wherein the neural network further comprises a depth estimation sub-network, the depth estimation sub-network comprising a plurality of depth information decoding blocks and a depth prediction block connected in sequence, and the real label comprises a third real label corresponding to an image depth of the road surface defect in the sample image, the method further comprising:
sequentially inputting a plurality of decoding features output by the plurality of feature decoding blocks into corresponding depth information decoding blocks in the plurality of depth information decoding blocks from a shallow layer to a deep layer, and acquiring depth information features output by the last depth information decoding block;
inputting the depth information features output by the last depth information decoding block into the depth prediction block to obtain a third prediction result, wherein the third prediction result characterizes the image depth of the road surface defect in the sample image; and
adjusting parameters of the neural network based on the third prediction result and the third real label.
8. The method of claim 7, wherein the plurality of depth information decoding blocks are in one-to-one correspondence with the plurality of feature decoding blocks, and wherein sequentially inputting the plurality of decoding features output by the plurality of feature decoding blocks into the corresponding depth information decoding blocks from a shallow layer to a deep layer and acquiring the depth information features output by the last depth information decoding block comprises:
inputting the decoding feature output by the first feature decoding block into a first depth information decoding block in the plurality of depth information decoding blocks to obtain the depth information feature output by the first depth information decoding block; and
for each depth information decoding block after the first depth information decoding block, inputting the decoding features output by the feature decoding block corresponding to that depth information decoding block, together with the depth information features output by the previous depth information decoding block, into that depth information decoding block to obtain the depth information features output by that depth information decoding block.
9. The method of claim 7, wherein adjusting parameters of the neural network based on the third prediction result and the third real label comprises:
adjusting parameters of the depth estimation sub-network based on the third prediction result and the third real label, wherein parameters of the other sub-networks in the neural network are frozen before the parameters of the depth estimation sub-network are adjusted.
10. The method of claim 5, wherein each of the plurality of feature decoding blocks comprises:
a first fusion module configured to fuse a low-level semantic feature and a high-level semantic feature to obtain a first fusion feature, wherein the low-level semantic feature is based on the image features of the semantic level corresponding to the feature decoding block, and the high-level semantic feature is based on the global image features or on the decoding features output by the previous feature decoding block;
a first attention module configured to obtain a channel attention weight map based on the first fusion feature, and to obtain an enhanced low-level semantic feature based on the low-level semantic feature and the channel attention weight map;
a second fusion module configured to fuse the enhanced low-level semantic feature and the high-level semantic feature to obtain a second fusion feature; and
a second attention module configured to obtain a spatial attention weight map based on the second fusion feature, and to obtain a third fusion feature based on the second fusion feature and the spatial attention weight map,
wherein the decoding feature output by the feature decoding block is based on the third fusion feature.
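A sketch of the fusion-and-attention pipeline of claim 10. Sum fusion and mean-pooled sigmoid attention are illustrative stand-ins for the learned fusion and attention modules; only the data flow (first fusion → channel attention → second fusion → spatial attention → third fusion) follows the claim:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_block(low, high):
    # low, high: (C, H, W) feature maps, already at the same resolution.
    first_fusion = low + high                            # first fusion feature
    channel_w = sigmoid(first_fusion.mean(axis=(1, 2)))  # (C,) channel attention weight map
    enhanced_low = low * channel_w[:, None, None]        # enhanced low-level semantic feature
    second_fusion = enhanced_low + high                  # second fusion feature
    spatial_w = sigmoid(second_fusion.mean(axis=0))      # (H, W) spatial attention weight map
    third_fusion = second_fusion * spatial_w[None, :, :] # third fusion feature
    return third_fusion  # the block's decoding features are based on this

low = np.ones((4, 8, 8))   # low-level semantic feature (skip connection)
high = np.zeros((4, 8, 8)) # high-level semantic feature (previous block / global)
out = decode_block(low, high)
print(out.shape)
```

The channel attention re-weights which low-level channels survive fusion, while the spatial attention re-weights where in the map the fused response is kept — a common decoder refinement pattern.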
11. The method according to any one of claims 1-10, wherein the feature encoding sub-network is obtained through unsupervised training using image-text data and a text encoding sub-network.
12. A road surface defect detection method, wherein a neural network for road surface defect detection comprises a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network, the method comprising:
acquiring an image to be detected;
inputting the image to be detected into the feature encoding sub-network to obtain image features to be detected;
inputting the image features to be detected into the boundary prediction sub-network to obtain a boundary prediction result, wherein the boundary prediction result characterizes a boundary of a road surface defect in the image to be detected;
inputting the image features to be detected into the segmentation prediction sub-network to obtain a segmentation prediction result, wherein the segmentation prediction result characterizes an instance segmentation region of the road surface defect in the image to be detected; and
obtaining a road surface defect detection result for the image to be detected based on the boundary prediction result and the segmentation prediction result.
13. The method of claim 12, wherein the feature encoding sub-network comprises a plurality of feature encoding blocks connected in sequence, the image features to be detected comprise image features of a plurality of semantic levels output by the plurality of feature encoding blocks, and the boundary prediction sub-network comprises a plurality of feature decoupling blocks and a boundary prediction block connected in sequence,
wherein inputting the image features to be detected into the boundary prediction sub-network to obtain a boundary prediction result comprises:
sequentially inputting the image features of the plurality of semantic levels from a low semantic level to a high semantic level into corresponding feature decoupling blocks in the plurality of feature decoupling blocks, and acquiring decoupling features output by the last feature decoupling block; and
inputting the decoupling features output by the last feature decoupling block into the boundary prediction block to obtain the boundary prediction result.
14. The method of claim 13, wherein the segmentation prediction sub-network comprises a plurality of feature decoding blocks and a segmentation prediction block connected in sequence, and wherein inputting the image features to be detected into the segmentation prediction sub-network to obtain a segmentation prediction result comprises:
sequentially inputting the image features of the plurality of semantic levels from a high semantic level to a low semantic level into corresponding feature decoding blocks in the plurality of feature decoding blocks, and acquiring decoding features output by the last feature decoding block; and
inputting the decoding features output by the last feature decoding block into the segmentation prediction block to obtain the segmentation prediction result.
15. The method of claim 14, wherein the neural network further comprises a depth estimation sub-network comprising a plurality of depth information decoding blocks and a depth prediction block connected in sequence, the method further comprising:
sequentially inputting a plurality of decoding features output by the plurality of feature decoding blocks into corresponding depth information decoding blocks in the plurality of depth information decoding blocks from a shallow layer to a deep layer, and acquiring depth information features output by the last depth information decoding block;
inputting the depth information features output by the last depth information decoding block into the depth prediction block to obtain a depth prediction result, wherein the depth prediction result characterizes an image depth of the road surface defect in the image to be detected; and
determining an actual size of the road surface defect based on the image depth of the road surface defect.
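Claim 15 ends at "determining the actual size based on the image depth" without specifying how. One common way to do this, offered purely as an assumption (the claims do not name a camera model), is the pinhole-camera relation, in which a pixel extent at depth Z maps to a metric extent of pixels × Z / focal-length-in-pixels:

```python
# Hypothetical pinhole-camera conversion from pixel size to metric size.
# The focal-length value below is illustrative, not taken from the patent.
def pixel_to_metric(extent_px: float, depth_m: float, focal_px: float) -> float:
    # Similar triangles: metric extent = pixel extent * depth / focal length.
    return extent_px * depth_m / focal_px

# A crack 120 px wide, seen at 2.0 m image depth with a 1200 px focal length:
width_m = pixel_to_metric(120, 2.0, 1200.0)
print(width_m)  # 0.2 (metres)
```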
16. A training device for a neural network for road surface defect detection, the neural network comprising a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network, the device comprising:
a first acquisition unit configured to acquire a sample image and a real label of a road surface defect in the sample image;
a first feature encoding unit configured to input the sample image into the feature encoding sub-network to obtain a sample image feature;
a first boundary prediction unit configured to input the sample image features into the boundary prediction sub-network to obtain a first prediction result, the first prediction result characterizing a boundary of a road surface defect in the sample image;
a first segmentation prediction unit configured to input the sample image features into the segmentation prediction sub-network to obtain a second prediction result, the second prediction result characterizing an instance segmentation region of the road surface defect in the sample image; and
a parameter adjustment unit configured to adjust parameters of the neural network based on the first prediction result, the second prediction result, and the real label.
17. A road surface defect detection device, wherein a neural network for road surface defect detection comprises a feature encoding sub-network, a boundary prediction sub-network, and a segmentation prediction sub-network, the device comprising:
a second acquisition unit configured to acquire an image to be detected;
a second feature encoding unit configured to input the image to be detected into the feature encoding sub-network to obtain image features to be detected;
a second boundary prediction unit configured to input the image features to be detected into the boundary prediction sub-network to obtain a boundary prediction result, the boundary prediction result characterizing a boundary of a road surface defect in the image to be detected;
a second segmentation prediction unit configured to input the image features to be detected into the segmentation prediction sub-network to obtain a segmentation prediction result, the segmentation prediction result characterizing an instance segmentation region of the road surface defect in the image to be detected; and
a detection unit configured to obtain a road surface defect detection result for the image to be detected based on the boundary prediction result and the segmentation prediction result.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
19. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-15.
20. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any of claims 1-15.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311767749.4A CN117746136A (en) | 2023-12-20 | 2023-12-20 | Pavement disease detection method, neural network training method, device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117746136A true CN117746136A (en) | 2024-03-22 |
Family
ID=90258882
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311767749.4A Pending CN117746136A (en) | 2023-12-20 | 2023-12-20 | Pavement disease detection method, neural network training method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117746136A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||