CN112395952A

CN112395952A - A unmanned aerial vehicle for rail defect detection

Info

Publication number: CN112395952A
Application number: CN202011145523.7A
Authority: CN
Inventors: 刘建虢; 尹晓雪
Original assignee: Xian Cresun Innovation Technology Co Ltd
Current assignee: Xian Cresun Innovation Technology Co Ltd
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2021-02-23

Abstract

The invention discloses an unmanned aerial vehicle for rail defect detection, which comprises: the power system is used for providing power required by the flight of the unmanned aerial vehicle; the flight control system is connected with the power system and used for controlling the attitude stability of the unmanned aerial vehicle, managing the unmanned aerial vehicle to execute tasks and processing emergency situations; the communication navigation system is connected with the flight control system and is used for information transmission in the working process of the unmanned aerial vehicle; the mission load system is connected with the flight control system, is pre-loaded with a program of a rail defect detection method, and determines the position and the defect type of a rail defect according to the method; and the launching and recovery system is used for ensuring that the unmanned aerial vehicle smoothly flies at a safe height and speed and safely falls back to the ground from the sky after the mission is finished. The unmanned aerial vehicle can detect the defects of different scales of the rail, can judge the types of the defects and solve the problem of missing detection of tiny defects; meanwhile, the detection speed and precision can be improved, and real-time detection is realized.

Description

A unmanned aerial vehicle for rail defect detection

Technical Field

The invention belongs to the field of defect detection, and particularly relates to an unmanned aerial vehicle for rail defect detection.

Background

Transportation has become a part of our lives today, particularly rail transportation. China railway transportation is in a rapid development stage, and the speed of a train is greatly improved, so that high requirements are put forward on the safety of the railway. Due to weather, heavy transport loads, etc., rails are subject to various degrees of wear over time, including geometric defects, rail component defects, rail surface defects, common rail surface defects including scars, cracks, ripping, wrinkles, flaking, wear, indentation, etc., which if not timely maintained for replacement, can develop into internal defects that affect proper rail use and threaten safe train operation.

In China, the detection of the rail defects always depends on manual inspection and visual inspection for a long time, and the efficiency is low; meanwhile, the detection result is greatly influenced by factors such as artificial subjective factors, weather illumination and the like, false detection and missed detection of some tiny defects can occur, and real-time detection cannot be achieved.

Therefore, how to realize high-precision and high-real-time detection of rail defects is a problem to be solved in the field.

Disclosure of Invention

In order to realize high-precision and high-real-time detection of rail defects, the embodiment of the invention provides an unmanned aerial vehicle for rail defect detection.

The specific technical scheme is as follows:

a drone for rail defect detection, comprising:

the power system is used for providing power required by the flight of the unmanned aerial vehicle;

the flight control system is connected with the power system and used for controlling the attitude stability of the unmanned aerial vehicle, managing the unmanned aerial vehicle to execute tasks and processing emergency situations;

the communication navigation system is connected with the flight control system and is used for information transmission in the working process of the unmanned aerial vehicle;

the mission load system is connected with the flight control system, is pre-loaded with a program of a rail defect detection method, and determines the position and the defect type of a rail defect according to the method;

and the launching and recovery system is used for ensuring that the unmanned aerial vehicle smoothly flies at a safe height and speed and safely falls back to the ground from the sky after the mission is finished.

In one embodiment of the present invention, the rail defect detecting method includes:

acquiring a target rail image to be detected;

inputting the target rail image into an improved YOLOv3 network obtained by pre-training, and performing feature extraction on the target rail image by using a backbone network to obtain x feature maps with different scales; x is a natural number of 4 or more;

carrying out feature fusion in a top-down and dense connection mode on the x feature graphs with different scales by using an improved FPN (field programmable gate array) network to obtain a prediction result corresponding to each scale;

obtaining attribute information of the target rail image based on all prediction results, wherein the attribute information comprises the position and the category of a target in the target rail image;

wherein the improved YOLOv3 network comprises a trunk network and the improved FPN network which are connected in sequence; the improved YOLOv3 network is formed by increasing a feature extraction scale, optimizing a feature fusion mode of an FPN network, pruning and combining knowledge distillation to guide network recovery processing on the basis of a YOLOv3 network; the improved YOLOv3 network is trained according to sample images and the positions and the types of the targets corresponding to the sample images.

In one embodiment of the present invention, the backbone network of the improved YOLOv3 network includes:

y residual modules connected in series; y is a natural number of 4 or more; y is greater than or equal to x;

the method for extracting features by using the backbone network to obtain x feature maps with different scales comprises the following steps:

and performing feature extraction on the target rail image by utilizing y residual modules connected in series to obtain x feature maps which are output by the x residual modules in the reverse direction along the input direction and have sequentially increased scales.

In an embodiment of the present invention, the performing, by using an improved FPN network, feature fusion in a top-down dense connection manner on the x feature maps with different scales includes:

for predicted branch Y_iAcquiring feature maps with corresponding scales from the x feature maps and performing convolution processing;

the feature map after convolution processing and the prediction branch Y_i-1～Y₁Performing cascade fusion on the feature maps subjected to the upsampling treatment respectively;

wherein the improved FPN network comprises x prediction branches Y with sequentially increased scale₁～Y_x(ii) a The prediction branch Y₁～Y_xThe scales of the x feature maps correspond to the scales of the x feature maps one by one; prediction branch Y_i-jHas an upsampling multiple of 2^j(ii) a i is 2, 3, …, x; j is a natural number smaller than i.

In an embodiment of the present invention, the performing pruning and guiding network recovery processing in combination with knowledge distillation includes:

in a network obtained by adding a feature extraction scale on the basis of the YOLOv3 network and optimizing a feature fusion mode of the FPN network, carrying out layer pruning on a residual error module of a backbone network to obtain a YOLOv3-1 network;

carrying out sparse training on the YOLOv3-1 network to obtain a YOLOv3-2 network with BN layer scaling coefficients in sparse distribution;

performing channel pruning on the YOLOv3-2 network to obtain a YOLOv3-3 network;

knowledge distillation is carried out on the YOLOv3-3 network to obtain the improved YOLOv3 network.

In an embodiment of the present invention, before training the modified YOLOv3 network, the method further includes:

determining the quantity to be clustered aiming at the size of the anchor box in the sample image;

acquiring a plurality of sample images with marked target frame sizes;

based on a plurality of sample images marked with the size of the target frame, acquiring a clustering result of the size of the anchor box in the sample images by using a K-Means clustering method;

writing the clustering result into a configuration file of the improved YOLOv3 network.

In one embodiment of the invention, the improved YOLOv3 network further includes a classification network and a non-maxima suppression module.

In an embodiment of the present invention, the obtaining attribute information of the target rail image based on all the prediction results includes:

and classifying all prediction results through the classification network, and then performing prediction frame deduplication processing through the non-maximum suppression module to obtain attribute information of the target rail image.

In one embodiment of the invention, the classification network comprises a SoftMax classifier.

In one embodiment of the invention, the loss function of sparse training is:

wherein the content of the first and second substances,

representing the loss function of the network origin, (x, y) representing input data and target data of the training process, W representing trainable weights,

regularization term added for scale factor, g (γ)) The penalty function is used for carrying out sparse training on the scale coefficient, and lambda is weight. The penalty function selects the L1 norm since the scaling factor γ is to be sparse.

The invention provides an unmanned aerial vehicle for detecting rail defects, which adopts a rail defect detection method, transmits a feature map from shallow to deep through an improved YOLOv3 network, extracts feature maps of at least four scales, enables the network to detect defects of different scales, especially tiny defects, by increasing the feature extraction scale of fine granularity, and simultaneously realizes the accurate classification of the defects.

The unmanned aerial vehicle disclosed by the invention has the advantages that the characteristic fusion mode of the FPN is changed, the characteristic fusion is carried out on the characteristic graphs extracted from the main network in a top-down and dense connection mode, the deep-layer characteristics are directly subjected to upsampling with different multiples, so that all the transmitted characteristic graphs have the same size, the characteristic graphs and the shallow-layer characteristic graphs are fused in a series connection mode, more original information can be utilized, high-dimensional semantic information participates in the shallow network, and the detection precision is improved; meanwhile, more specific characteristics can be obtained by directly receiving the characteristics of a shallower network, the loss of the characteristics can be effectively reduced, the parameter quantity needing to be calculated can be reduced, the detection speed is improved, and real-time detection is realized.

According to the unmanned aerial vehicle, the pre-trained network is subjected to layer pruning, sparse training, channel pruning and knowledge distillation processing, optimized processing parameters are selected in each processing process, the network volume can be reduced, most redundant calculation is eliminated, and the detection speed can be greatly improved under the condition that the detection accuracy is maintained.

Of course, not all of the advantages described above need to be achieved at the same time in the practice of any one product or method of the invention.

The present invention will be described in further detail with reference to the accompanying drawings and examples.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic structural diagram of an unmanned aerial vehicle for rail defect detection according to an embodiment of the present invention;

fig. 2 is a schematic flow chart of a rail defect detection method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a prior art YOLOv3 network;

fig. 4 is a schematic structural diagram of an improved YOLOv3 network according to an embodiment of the present invention;

FIG. 5-1 is a graph of weight shift for parameter set 5 selected by an embodiment of the present invention; fig. 5-2 is a weight overlap graph of a parameter combination 5 selected by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.

Referring to fig. 1, fig. 1 is a schematic structural diagram of an unmanned aerial vehicle for detecting a rail defect according to an embodiment of the present invention. As shown in fig. 1, an unmanned aerial vehicle for rail defect detection provided by an embodiment of the present invention includes:

power system 101 for provide the required power of unmanned aerial vehicle flight, make unmanned aerial vehicle can carry out each item flight activity safely.

The power system 101 includes: the battery, motor, electronic governor, screw can realize hovering, functions such as variable speed.

And the flight control system 102 is connected with the power system and used for controlling the attitude stability of the unmanned aerial vehicle, managing the unmanned aerial vehicle to execute tasks and processing emergency situations.

The flight control system 102 may be said to be the "brain" of the drone, which plays a decisive role in the flight performance of the drone.

And the communication navigation system 103 is connected with the flight control system and used for information transmission in the working process of the unmanned aerial vehicle.

The functions of the communication navigation system 103 mainly include: the remote control device is used for ensuring that the remote control command can be accurately transmitted, and the unmanned aerial vehicle can timely, reliably and accurately receive and send information so as to ensure the reliability, accuracy, instantaneity and effectiveness of information feedback.

A mission load system 104 connected to the flight control system, wherein a program of a rail defect detection method is pre-installed in the mission load system, and the position and the defect type of the rail defect are determined according to the method;

and the launching and recovery system 105 is used for ensuring that the unmanned aerial vehicle smoothly ascends to the air to achieve safe height and speed flight, and safely falls back to the ground from the sky after the mission is finished.

The following description mainly refers to a method for detecting a rail defect pre-installed in the mission load system 104, and specific structures of the remaining modules may refer to the related prior art, which is not described herein again.

Referring to fig. 2, fig. 2 is a schematic flow chart of a rail defect detection method according to an embodiment of the present invention. As shown in fig. 2, a rail defect detecting method provided by an embodiment of the present invention may include the following steps:

s1, acquiring a target rail image to be detected;

the target rail image is an image shot by the image acquisition equipment for the rail to be detected.

The image acquisition equipment is deployed in a task load system of the unmanned aerial vehicle and used for executing a detection task of rail defects.

The image acquisition device may include a camera, a video camera, a still camera, a mobile phone, etc.; in an alternative embodiment, the image capture device may be a high resolution camera.

In the embodiment of the present invention, the size of the target rail image is 416 × 416 × 3. Thus, at this step, in one embodiment, the 416 x 3 size target rail image may be obtained directly from the image capture end; in another embodiment, an image of any size sent by the image acquisition end can be obtained, and then the obtained image is subjected to certain size scaling processing to obtain a target rail image of 416 × 416 × 3 size.

In the two embodiments, the obtained image may be subjected to image enhancement operations such as cropping, stitching, smoothing, filtering, edge filling, and the like, so as to enhance features of interest in the image and expand the generalization capability of the data set.

S2, inputting the target rail image into an improved YOLOv3 network obtained by pre-training, and extracting the features of the target rail image by using a backbone network to obtain x feature maps with different scales; x is a natural number of 4 or more;

to facilitate understanding of the network structure of the improved YOLOv3 network proposed in the embodiment of the present invention, first, a network structure of a YOLOv3 network in the prior art is introduced, please refer to fig. 3, and fig. 3 is a schematic structural diagram of a YOLOv3 network in the prior art. In fig. 3, the part inside the dashed box is the YOLOv3 network. Wherein the part in the dotted line frame is a backbone (backbone) network of the YOLOv3 network, namely a darknet-53 network; the backbone network of the YOLOv3 network is formed by connecting CBL modules and 5 resn modules in series. The CBL module is a Convolutional network module, and includes a conv layer (convolutive layer, convolutive layer for short), a BN (Batch Normalization) layer and an leakage relu layer corresponding to an activation function leakage relu, which are connected in series, and the CBL represents conv + BN + leakage relu. The resn module is a residual error module, n represents a natural number, and specifically, as shown in fig. 2, res1, res2, res8, res8, and res4 are sequentially arranged along the input direction; the resn module comprises a zero padding (zero padding) layer, a CBL module and a Residual error unit group which are connected in series, the Residual error unit group is represented by Res unit n, the Residual error unit group comprises n Residual error units, each Residual error unit comprises a plurality of CBL modules which are connected in a Residual error Network (ResNet) connection mode, and the feature fusion mode adopts a parallel mode, namely an add mode.

The rest of the network outside the main network is a Feature Pyramid (FPN) network, which is divided into three prediction branches Y₁～Y₃Predicting branch Y₁～Y₃The scales of (2) are in one-to-one correspondence with the scales of the feature maps output by the 3 residual error modules res4, res8, res8 in the reverse direction of the input, respectively. The prediction results of the prediction branches are respectively represented by Y1, Y2 and Y3, and the scales of Y1, Y2 and Y3 are increased in sequence.

Each prediction branch of the FPN network includes a convolutional network module group, specifically includes 5 convolutional network modules, that is, CBL × 5 in fig. 3. In addition, the US (up sampling) module is an up sampling module; concat represents that the feature fusion adopts a cascade mode, and concat is short for concatenate.

For the specific structure of each main module in the YOLOv3 network, please refer to the schematic diagram below the dashed box in fig. 3.

In the embodiment of the invention, the improved YOLOv3 network comprises a backbone network and an improved FPN network; the improved YOLOv3 network is formed by increasing a feature extraction scale, optimizing a feature fusion mode of an FPN network, pruning and combining knowledge distillation to guide network recovery processing on the basis of a YOLOv3 network; the improved YOLOv3 network is trained according to the sample image and the position and the category of the target corresponding to the sample image. The network training process is described later.

To facilitate understanding of the present invention, the structure of the modified YOLOv3 network is described below.

For example, referring to fig. 4, in the embodiment of the present invention, the backbone network extracts at least feature maps of 4 scales to perform feature fusion of subsequent prediction branches, so that the number y of residual error modules is greater than or equal to 4, so as to correspondingly fuse feature maps output by the backbone network into each prediction branch. It can be seen that the improved YOLOv3 network obviously adds at least one finer-grained feature extraction scale in the backbone network compared with the YOLOv3 network. Please refer to fig. 3, compared with the YOLOv3 network, the feature map output by the fourth residual module in the reverse direction of the input is extracted for subsequent feature fusion. Therefore, the four densely connected modules of the backbone network along the reverse input direction respectively output corresponding feature maps, and the scales of the four feature maps are sequentially increased. Specifically, the scale of each feature map is 13 × 13 × 72, 26 × 26 × 72, 52 × 52 × 72, and 104 × 104 × 72, respectively.

Of course, in an alternative embodiment, five feature extraction scales may be set, that is, the feature map output by the fifth densely connected module with the extraction direction reversed along the input direction is added for subsequent feature fusion, and so on.

Specifically, for the step S2, obtaining x feature maps with different scales includes:

and obtaining x characteristic graphs which are output by the x dense connection modules along the input reverse direction and have sequentially increased scales.

Referring to fig. 3, feature maps respectively output by the first residual module to the fourth densely connected module in the reverse direction of the input are obtained, and the sizes of the four feature maps are sequentially increased.

The improved YOLOv3 network transmits the feature maps from shallow to deep, extracts the feature maps with at least four scales, enables the network to detect defects with different scales, especially tiny defects, by increasing the feature extraction scale with fine granularity, and simultaneously realizes accurate classification of the defects.

S3, performing feature fusion in a top-down and dense connection mode on x feature graphs of different scales by using an improved FPN (field programmable gate array) network to obtain a prediction result corresponding to each scale;

the feature fusion mode of the top-down dense connection mode is described below with reference to the structure of the improved FPN network shown in fig. 3.

The improved FPN network comprises x prediction branches Y with sequentially increased scale₁～Y_x(ii) a Prediction branch Y₁～Y_xThe scales of the feature maps are in one-to-one correspondence with the scales of the x feature maps; illustratively, the modified FPN network of FIG. 3 has 4 prediction branches Y₁～Y₄The scales of the feature maps correspond to the scales of the 4 feature maps one to one.

Aiming at the step S3, the method for performing feature fusion in a top-down and dense connection mode on x feature graphs with different scales by using an improved FPN network comprises the following steps:

wherein branch Y is predicted_i-jHas an upsampling multiple of 2^j(ii) a i is 2, 3, …, x; j is a natural number smaller than i.

Referring to fig. 4, i is 3, i.e. the prediction branch Y₃For illustration, the feature maps for performing the cascade fusion process are derived from three aspects: on the first hand, a feature map with a corresponding scale is obtained from 4 feature maps and is subjected to convolution processing, namely the feature map output by a third dense connection module along the input reverse direction is subjected to CBL module processing, and the feature map can also be understood as being subjected to 1-time upsampling and has a size of 52 × 52 × 72; the second aspect derives from predicting branch Y₂(i.e. Y)_i-1＝Y₂) I.e. the characteristic map (size 26 x 72) output from the second densely connected block in the reverse direction of the input passes through the predicted branch Y₂The CBL module of (2)¹2 times the feature map after upsampling (size 52 × 52 × 72); the third aspect derives from the predicted branch Y₁(i.e. Y)_i-2＝Y₁) I.e. the characteristic map (size 13 x 72) output by the first densely-connected module in the reverse direction along the input is predicted for branch Y₁The CBL module of (2) is then passed²4 times the feature map after upsampling (size 52 × 52 × 72); then, as will be understood by those skilled in the art, after the above-mentioned process performs upsampling processing on 3 feature maps with different scales output by the backbone network, the sizes of the 3 feature maps to be cascaded and fused can be made to be consistent, and all the sizes are 52 × 52 × 72. Thus, branch Y is predicted₃After cascade fusion, convolution and other processes can be continued to obtain a prediction result Y3, wherein the size of Y3 is 52 × 52 × 72.

About predicted branch Y₂And Y₄See the prediction branch Y₃And will not be described herein. For the predicted branch Y₁And the subsequent prediction process is carried out by the first intensive connection module after the characteristic diagram output by the first intensive connection module along the input reverse direction is obtained, and the characteristic diagrams of other prediction branches are not received to be fused with the characteristic diagrams.

In the original FPN network feature fusion method of the YOLOv3 network, a method of adding deep layer and shallow layer network features and then performing upsampling is used, and after the features are added, a feature map is extracted through a convolutional layer, which may destroy some original feature information. In the embodiment, the feature fusion is combined with the horizontal mode and the top-down dense connection mode, in this mode, the original top-down mode is changed into a mode that the feature map of the prediction branch with a smaller scale directly transmits the feature of the prediction branch with a larger scale to each prediction branch, and the feature fusion mode is changed into a dense fusion method, that is, the deep features directly perform upsampling with different multiples, so that all the transmitted feature maps have the same size. The feature maps and the shallow feature map are fused in a serial connection mode, features are extracted again from the fusion result to eliminate noise in the feature maps, main information is reserved, and then prediction is carried out, so that more original information can be utilized, and high-dimensional semantic information participates in a shallow network. Therefore, the advantage that the dense connection network reserves more original semantic features of the feature map can be exerted, but for a top-down method, the reserved original semantic is higher-dimensional semantic information, so that the object classification is facilitated. By directly receiving the characteristics of the shallower layer network, more specific characteristics can be obtained, so that the loss of the characteristics can be effectively reduced, the parameter quantity needing operation can be reduced, and the prediction process is accelerated.

In the above, the feature fusion method is mainly described, each prediction branch is mainly predicted by using some convolution operations after feature fusion, and for how to obtain each prediction result, reference is made to related prior art, and no description is made here.

In the improved YOLOv3 network of the embodiment of the invention, 4 prediction branches output feature maps of four scales in total, which are respectively 13 × 13 × 72, 26 × 26 × 72, 52 × 52 × 72 and 104 × 104 × 72, and the smallest feature map of 13 × 13 × 72 is suitable for larger target detection due to the largest receptive field; the medium 26 × 26 × 72 feature map is suitable for detecting medium-sized targets due to the medium receptive field; the larger 52X 72 characteristic map is suitable for detecting smaller targets due to the smaller receptive field; the largest 104X 72 feature map is suitable for detecting smaller targets because the feature map has a smaller receptive field. Therefore, the image is divided more finely, and the prediction result is more targeted to the object with smaller size.

The network training process is described below. The network training is completed in the server, and the network training can comprise three processes of network pre-training, network pruning and network fine-tuning. The method specifically comprises the following steps:

firstly, building a network structure; the method can be improved on the basis of a YOLOv3 network, the feature extraction scale is increased, and the feature fusion mode of the FPN network is optimized to obtain the network structure shown in the figure 3 as a built network; wherein m is 4.

And (II) obtaining a plurality of sample images and the positions and the types of the targets corresponding to the sample images. In this process, the position and the category of the target corresponding to each sample image are known, and the manner of determining the position and the category of the target corresponding to each sample image may be: by manual recognition, or by other image recognition tools, and the like. Afterwards, the sample image needs to be marked, and an artificial marking mode can be adopted, and of course, other artificial intelligence methods can also be utilized to carry out non-artificial marking, which is reasonable. The position of each sample image corresponding to the target is marked in the form of a target frame containing the target, the target frame is real and accurate, and each target frame is marked with coordinate information so as to embody the position of the target in the image.

(III) determining the size of the anchor box in the sample image; may include the steps of:

a) determining the quantity to be clustered aiming at the size of the anchor box in the sample image;

in the field of target detection, an anchor box (anchor box) is a plurality of boxes with different sizes obtained by statistics or clustering from real boxes (ground route) in a training set; the anchor box actually restrains the predicted object range and adds the prior experience of the size, thereby realizing the aim of multi-scale learning. In the embodiment of the present invention, since a finer-grained feature extraction scale is desired to be added, the sizes of the target frames (i.e., real frames) marked in the sample image need to be clustered by using a clustering method, so as to obtain a suitable anchor box size suitable for the scene of the embodiment of the present invention.

Wherein, determining the quantity to be clustered aiming at the size of the anchor box in the sample image comprises the following steps:

determining the number of types of the anchor box size corresponding to each scale; and taking the product of the number of the types of the anchor box sizes corresponding to each scale and x as the quantity to be clustered of the anchor box sizes in the sample image.

Specifically, in the implementation of the present invention, the number of types of the anchor box size corresponding to each scale is selected to be 3; there are 4 scales, and then the number of anchor box sizes to be clustered in the obtained sample image is 3 × 4 or 12.

b) Acquiring a plurality of sample images with marked target frame sizes;

this step is actually to obtain the size of each target frame in the sample image.

c) Based on a plurality of sample images marked with the size of the target frame, acquiring a clustering result of the size of the anchor box in the sample images by using a K-Means clustering method;

specifically, the size of each target frame can be clustered by using a K-Means clustering method to obtain a clustering result of the size of the anchor box; no further details regarding the clustering process are provided herein.

Wherein, the definition of the distances of different anchor boxes is the Euclidean distance of the width and the height:

wherein d is_1,2Representing the Euclidean distance, w, of the two anchor boxes₁，w₂Width, h, of the anchor box₁，h₂Representing the height of the anchor box.

For the number of clusters to be clustered being 12, the anchor box size of each predicted branch can be obtained.

d) And writing the clustering result into a configuration file of the improved YOLOv3 network.

Those skilled in the art can understand that the clustering result is written into the configuration file of each predicted branch of the improved YOLOv3 network according to the anchor box size corresponding to different predicted branches, and then the network pre-training can be performed.

And (IV) pre-training the constructed network by utilizing each sample image and the position and the category of the target corresponding to each sample image, wherein the method comprises the following steps:

1) and taking the position and the type of the target corresponding to each sample image as a true value corresponding to the sample image, and training each sample image and the corresponding true value through a built network to obtain a training result of each sample image.

2) And comparing the training result of each sample image with the true value corresponding to the sample image to obtain the output result corresponding to the sample image.

3) And calculating the loss value of the network according to the output result corresponding to each sample image.

4) And adjusting parameters of the network according to the loss value, and repeating the steps 1) -3) until the loss value of the network reaches a certain convergence condition, namely the loss value reaches the minimum value, which means that the training result of each sample image is consistent with the true value corresponding to the sample image, thereby completing the pre-training of the network and obtaining a complex network with higher accuracy.

(V) network pruning and network fine adjustment; the process is to carry out pruning and guide network recovery processing by combining knowledge distillation.

Pruning and knowledge-based distillation guided network recovery processing are performed, and the method comprises the following steps:

firstly, in a network obtained by adding a feature extraction scale on the basis of a YOLOv3 network and optimizing a feature fusion mode of an FPN network, carrying out layer pruning on a residual error module of a main network to obtain a YOLOv3-1 network;

usually, channel pruning is directly performed in the simplified processing process of the YOLOv3 network, but the inventor finds in experiments that the effect of fast speed increase is still difficult to achieve only through channel pruning. Therefore, the treatment process of layer pruning is added before channel pruning.

Specifically, the step may perform layer pruning on a residual error module of the backbone network in the improved yollov 3 network, so as to obtain a yollov 3-1 network.

Sparsifying training the YOLOv3-1 network to obtain a YOLOv3-2 network with BN layer scaling coefficients sparsely distributed;

illustratively, a YOLOv3-1 network is subjected to sparse training to obtain a YOLOv3-2 network with a BN layer scaling coefficient in sparse distribution; the method can comprise the following steps:

carrying out sparse training on a YOLOv3-1 network, adding sparse regularization for a scaling factor gamma in the training process, wherein the loss function of the sparse training is as follows:

wherein the content of the first and second substances,

and g (gamma) is a penalty function for sparse training of the scale coefficient, and lambda is weight. The penalty function selects the L1 norm since the scaling factor γ is to be sparse. Meanwhile, because the proportion of the latter term is unknown, the lambda parameter is introduced for adjustment.

Because the value of the lambda is related to the convergence rate of sparse training, the application scenario of the embodiment of the invention is a rail detection scenario, the number of the types of the targets to be detected can be set to be 13, which is far smaller than 80 types in the original YOLOv3 data set, so that the value of the lambda can be a larger lambda value, the convergence rate of sparse training is not too slow, and the convergence can be further accelerated by a method for improving the model learning rate; however, considering that the accuracy of the network model is lost due to excessive parameter selection, after the learning rate and the lambda parameter are continuously adjusted, the combination with the learning rate of 0.25 x and the lambda of 0.1 x is finally determined to be the optimal parameter combination for sparse training. The preferred combination of the learning rate and the weight in the embodiment of the invention is more favorable for the distribution of the weight after the coefficient training, and the accuracy of the network model is higher.

Thirdly, channel pruning is carried out on the YOLOv3-2 network to obtain a YOLOv3-3 network;

after the sparsification training, a network model with the BN layer scaling coefficients distributed sparsely is obtained, so that the importance of which channels is smaller can be determined conveniently. These less important channels can thus be pruned by removing incoming and outgoing connections and the corresponding weights.

Performing a channel pruning operation on the network, pruning a channel corresponding to substantially removing all incoming and outgoing connections of the channel, may directly result in a lightweight network without the use of any special sparse computation packages. In the channel pruning process, the scaling factor serves as a proxy for channel selection; because they are jointly optimized with network weights, the network can automatically identify insignificant channels that can be safely removed without greatly impacting generalization performance.

Specifically, the step may include the steps of:

setting a channel pruning proportion in all channels of all layers, then arranging all BN layer scaling factors in the YOLOv3-2 network in an ascending order, and pruning channels corresponding to the BN layer scaling factors arranged in the front according to the channel pruning proportion.

In a preferred embodiment, the channel pruning proportion may be 60%.

Through channel pruning, redundant channels can be deleted, the calculated amount is reduced, and the detection speed is accelerated.

However, after channel pruning, some precision may be reduced due to the reduction of parameters, the influence of different pruning proportions on the network precision is analyzed, if the network pruning proportion is too large, the network volume is compressed more, but the network precision is also reduced sharply, a certain loss is caused to the network accuracy, so that the balance between the network compression proportion and the compressed network precision needs to be carried out, and therefore, a knowledge distillation strategy is introduced to finely adjust the network, so that the network accuracy is improved.

Fourthly, knowledge distillation is carried out on the YOLOv3-3 network to obtain an improved YOLOv3 network.

Through pruning, a more compact Yolov3-3 network model is obtained, and then fine tuning is needed to recover the precision. The strategy of knowledge distillation is introduced here.

Specifically, knowledge distillation is introduced into a YOLOv3-3 network, the complex network is used as a teacher network, a YOLOv3-3 network is used as a student network, and the teacher network guides the student network to carry out precision recovery and adjustment, so that an improved YOLOv3 network is obtained.

As a preferred embodiment, the output result before the Softmax layer of the complex network is divided by the temperature coefficient to soften the predicted value finally output by the teacher network, and then the student network uses the softened predicted value as a label to assist in training the YOLOv3-3 network, so that the precision of the YOLOv3-3 network is finally equivalent to that of the teacher network; the temperature coefficient is a preset value and does not change along with network training.

The reason for introducing the temperature parameter T is that a trained and highly accurate network is substantially consistent with the classification result of the input data and the real label. For example, with three classes, the true known training class label is [1,0,0], the prediction result may be [0.95,0.02,0.03], and the true label value is very close. Therefore, for the student network, the classification result of the teacher network is used for assisting training and the data is directly used for training, and the difference is not great. The temperature parameter T can be used to control the softening degree of the prediction tag, i.e. the deviation of the teacher's network classification result can be increased.

The fine adjustment process added with the knowledge distillation strategy is compared with the general fine adjustment process, and the network accuracy recovered through knowledge distillation adjustment is higher.

The method comprises the steps of performing layer pruning, sparse training, channel pruning and knowledge distillation processing on a pre-trained network, selecting optimized processing parameters in each processing process to obtain a simplified network, greatly reducing the volume of the network, eliminating most redundant calculation, obtaining the network which is an improved YOLOv3 network for detecting a target rail image subsequently, greatly improving the detection speed based on the network, and maintaining the detection precision. The method can meet the requirement on high detection real-time performance, and can be completely deployed in image acquisition equipment at an image acquisition end due to small network size and low resource demand.

S4, obtaining attribute information of the target rail image based on all the prediction results, wherein the attribute information comprises the position and the type of the target in the target rail image;

the improved YOLOv3 network further includes a classification network and a non-maxima suppression module; the classification network and the non-maximum suppression module are connected in series after the FPN network.

Obtaining attribute information of the target rail image based on all prediction results, wherein the attribute information comprises:

classifying all prediction results through a classification network, and then performing prediction frame duplicate removal through a non-maximum suppression module to obtain attribute information of the target rail image;

wherein the classification network comprises a SoftMax classifier. The purpose is to realize the mutually exclusive classification of a plurality of defect categories. Optionally, the classification network may also perform classification along a logistic regression using the YOLOv3 network to achieve multiple independent two classifications.

The non-maximum suppression module is configured to perform NMS (non _ max _ suppression) processing. The method is used for repeatedly selecting a plurality of prediction boxes of the same target, and the prediction boxes with relatively low confidence coefficient are excluded.

For the content of the classification network and the non-maximum suppression module, reference may be made to the related description of the prior art, and details thereof are not repeated here.

It should be noted that fig. 3 does not show the classification module and the non-maximum suppression module for the sake of simplicity.

For each target, the detection result is in the form of a vector, including the position of the prediction box, the confidence of the defect in the prediction box, and the category of the target in the prediction box. The position of the prediction frame is used for representing the position of the target in the target rail image; specifically, the position of each prediction frame is represented by four values, bx, by, bw and bh, bx and by are used for representing the position of the center point of the prediction frame, and bw and bh are used for representing the width and height of the prediction frame.

The target in the embodiment of the invention is characterized by the defect on the railway rail, and the category of the target is the category of the defect to which the target belongs, such as scar, crack, ripple scratch, wrinkle, peeling, abrasion, indentation and the like.

The existing YOLOv3 network contains many convolutional layers because there are 80 types of targets. In the embodiment of the invention, the targets are mainly defects on the rails, and the number of the types of the targets is small, so that a large number of convolution layers are not necessary, network resources are wasted, and the processing speed is reduced.

In addition, optionally, the improved YOLOv3 network may also be obtained by adjusting the value of k in the convolutional network module group of each prediction branch in the FPN network, that is, reducing k from 5 in the original YOLOv3 network to 4 or 3, that is, changing the original CBL 5 to CBL 4 or CBL 3; therefore, the number of the convolution layers in the FPN network can be reduced, the target rail image of the embodiment of the invention can integrally realize the simplification of the number of network layers under the condition of not influencing the network precision, and the network processing speed is improved.

In the scheme provided by the embodiment of the invention, on one hand, a plurality of feature extraction scales are adopted, the feature extraction scale with fine granularity is added for the small target, and the detection precision of the small target in the target rail image can be improved, so that the rail defect can be accurately detected and classified, and the problem of missed detection of the small defect can be solved. On the other hand, the feature fusion mode of the FPN is changed, feature fusion is carried out on feature graphs extracted from a main network in a top-down dense connection mode, deep features are directly subjected to upsampling of different multiples, all transmitted feature graphs have the same size, the feature graphs and shallow feature graphs are fused in a series connection mode, more original information can be utilized, high-dimensional semantic information participates in a shallow network, and the detection precision is improved; meanwhile, more specific characteristics can be obtained by directly receiving the characteristics of a shallower network, the loss of the characteristics can be effectively reduced, the parameter quantity needing to be calculated can be reduced, the detection speed is improved, and real-time detection is realized. In another aspect, layer pruning, sparsification training, channel pruning and knowledge distillation processing are carried out on the pre-trained network, optimized processing parameters are selected in each processing process, the network volume can be reduced, most redundant calculation is eliminated, and the detection speed can be greatly improved under the condition of maintaining the detection precision.

The network improvement and the rail image detection performance of the embodiment of the invention are described in the following by combining the experimental process of the inventor, so as to facilitate the deep understanding of the performance.

The invention aims at sparse training, and the learning rate and mu can be adjusted according to the trade-off so as to ensure the convergence speed and precision. The present solution attempts different learning rates and values of μ as shown in table 2. Finally, parameter combinations 5 are selected by comparing the gamma weight distribution situation graphs. See fig. 5-1 and 5-2 for a gamma weight distribution map for parameter set 5. FIG. 5-1 is a graph of weight shift for parameter set 5; fig. 5-2 is a weight overlap graph of the parameter combination 5.

TABLE 2 different learning rates and λ combinations

Combination of	Learning rate	λ
			1	1×	1×
2	1×	0.1×
			3	0.1×	1×
4	1×	0.025×
			5	0.25×	0.1×

In fact, the initial experimental design herein does not include pruning of the network layer, and the original plan is to perform channel pruning directly. However, according to the analysis of the channel pruning result, more than half of the dense connection layers are found to have weights close to 0, so that the channels of the whole layer are pruned according to the channel pruning rule. This indicates that there are redundant units in the 4 residual error modules designed above, so before channel pruning, layer pruning may be performed to greatly reduce redundancy, and then channel pruning of relatively finer granularity is performed. Because more than half of the dense connection units are redundant units, layer pruning is performed by subjecting the residual modules to layer pruning to obtain the YOLOv3-1 network.

Then, carrying out sparse training on the YOLOv3-1 network to obtain a YOLOv3-2 network with BN layer scaling coefficients in sparse distribution;

channel pruning is carried out on the YOLOv3-2 network to obtain a YOLOv3-3 network;

the channel pruning ratio may be 60%, because a small number of target types in the target rail image to be detected are greatly affected in the network compression process, which directly affects the mAP, and therefore, the data set and the network compression ratio are considered. For processing the data set, the embodiment of the present invention selects the type of the target with a smaller number of combinations to balance the number of different types, or directly adopts the data set with more balanced type distribution, which is consistent with the application scenario of the embodiment of the present invention. In addition, the compression ratio is controlled, and the prediction accuracy of the types with small quantity is ensured not to be reduced too much.

In addition to analyzing the influence of compression from precision, the relationship between the detection time and the model compression ratio is also considered, and the simulation is carried out on the time for detecting the road image on different platforms (Tesla V100 server and Jetson TX2 edge equipment) of the network model processed with different pruning ratios, so that according to the simulation result, the influence of different network compression ratios on the detection time is very weak, the influence on the time required by NMS (non-maximum suppression) is large, the detection speed is accelerated along with the network compression before the compression ratio reaches 60%, but the detection speed is slowed down after the compression ratio exceeds 60%. Thus, the final selected channel pruning ratio is 60%.

Knowledge distillation is carried out on the YOLOv3-3 network to obtain an improved YOLOv3 network.

Meanwhile, the detection performance of the improved YOLOv3 network and the original YOLOv3 network of the invention are simulated, and the results are shown in table 2.

TABLE 2 comparison of detection Performance of the modified YOLOv3 network model and the original YOLOv3

Network	mAP	Size of model	Detection time (Tesla V100)
				YOLOv3	0.73	236M	42.8ms
Improved YOLOv3	0.852	222M	36.3ms

As can be seen from table 2, the detection accuracy of the improved YOLOv3 network formed by adding fine-grained feature extraction scale to the original YOLOv3 network and replacing the original horizontally connected FPN with densely connected FPN is improved by 16.7%, the model volume is relatively reduced, and the detection speed is improved by 15%.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are brought about in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

For the embodiments of the electronic device and the computer-readable storage medium, since the contents of the related methods are substantially similar to those of the foregoing embodiments of the methods, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the embodiments of the methods.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. An unmanned aerial vehicle for rail defect detection, comprising:

2. The drone of claim 1, wherein the rail defect detection method comprises:

acquiring a target rail image to be detected;

3. The drone of claim 2, wherein the backbone network of the modified YOLOv3 network comprises:

4. The unmanned aerial vehicle of claim 2, wherein the top-down, densely connected feature fusion of the x feature maps of different scales using the modified FPN network comprises:

wherein the improved FPN network comprises x prediction branches Y with sequentially increased scale₁～Y_x(ii) a The above-mentionedPrediction branch Y₁～Y_xThe scales of the x feature maps correspond to the scales of the x feature maps one by one; prediction branch Y_i-jHas an upsampling multiple of 2^j(ii) a i is 2, 3, …, x; j is a natural number smaller than i.

5. The drone of claim 4, wherein the pruning and knowledge-based distillation guided network recovery process includes:

sparse training is carried out on the YOLOv3-1 network to obtain a YOLOv3-2 network with BN layer scaling coefficients in sparse distribution;

6. The drone of claim 2, further comprising, prior to training the modified YOLOv3 network:

acquiring a plurality of sample images with marked target frame sizes;

7. The drone of claim 2, wherein the modified YOLOv3 network further comprises a classification network and a non-maxima suppression module.

8. The drone of claim 7, wherein the deriving attribute information for the target rail image based on all of the predictions comprises:

9. The drone of claim 8, wherein the classification network includes a SoftMax classifier.

10. A drone according to claim 9, characterised in that the loss function of the sparse training is:

wherein the content of the first and second substances,

and g (gamma) is a penalty function for sparse training of the scale coefficient, and lambda is weight. The penalty function selects the L1 norm since the scaling factor γ is to be sparse.