CN113160219B - Real-time railway scene analysis method for unmanned aerial vehicle remote sensing image - Google Patents

Real-time railway scene analysis method for unmanned aerial vehicle remote sensing image

Info

Publication number
CN113160219B
CN113160219B · CN202110518589.4A
Authority
CN
China
Prior art keywords
category
image
railway scene
railway
unmanned aerial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110518589.4A
Other languages
Chinese (zh)
Other versions
CN113160219A (en)
Inventor
王志鹏
童磊
贾利民
秦勇
耿毅轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202110518589.4A priority Critical patent/CN113160219B/en
Publication of CN113160219A publication Critical patent/CN113160219A/en
Application granted granted Critical
Publication of CN113160219B publication Critical patent/CN113160219B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/0002Inspection of images, e.g. flaw detection
    • G06T7/0004Industrial image inspection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/176Urban or other man-made structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10032Satellite or aerial image; Remote sensing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30236Traffic on road, railway or crossing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a real-time railway scene analysis method for unmanned aerial vehicle remote sensing images, which comprises the following steps: acquiring unmanned aerial vehicle remote sensing images in real time, and performing data acquisition and processing on the images to obtain a data set; constructing a railway scene analysis network model, and training and verifying it on the obtained data set to obtain the optimal line loss proportional coefficient; and testing the model on different computers with the optimal line loss proportional coefficient to obtain analysis results, which are then comprehensively evaluated. The method enables real-time, fast and efficient railway scene analysis on an unmanned aerial vehicle on-board computer with limited computing resources, so that the track area in the railway scene can be segmented with high precision.

Description

Real-time railway scene analysis method for unmanned aerial vehicle remote sensing image
Technical Field
The invention relates to the field of rail transit operation safety and guarantee, in particular to a real-time railway scene analysis method for an unmanned aerial vehicle remote sensing image.
Background
Recently, unmanned aerial vehicles (drones) have been widely used in scene parsing tasks across many fields. As an important auxiliary inspection mode alongside manual inspection and rail-inspection-vehicle inspection, automatic inspection based on unmanned aerial vehicles is an important development trend in the field of high-speed railway safety operation. Automatic UAV inspection has multiple advantages such as flexibility, high efficiency and low cost, does not affect the normal operation of trains, and can provide advanced safety guarantees for railway operation. The UAV can carry payload equipment such as a visible-light camera and, at the same time, a small on-board computer whose size and weight meet certain specification requirements; the on-board computer analyzes and processes data such as the video stream from the payload equipment, while also allowing more flexible, customized real-time flight control of the UAV according to requirements. Therefore, a UAV-based automatic railway inspection system has broad application prospects and can bring revolutionary progress to railway inspection.
In recent years, deep learning has developed rapidly, and its results are widely applied in fields such as face recognition, industrial defect detection and intelligent robotics. In the field of automated intelligent railway inspection, constructing deep learning models that effectively detect the areas or objects of interest during railway inspection is likewise an important research topic. However, to realize automatic railway inspection, effective analysis of the railway scene is the first task to be solved. A real-time convolutional neural network model built with deep learning techniques, capable of running on a UAV on-board computer, is a deep model with great potential for real-time railway scene analysis.
Therefore, a real-time railway scene analysis method for unmanned aerial vehicle remote sensing images based on deep learning is urgently needed.
Disclosure of Invention
The invention provides a real-time railway scene analysis method for an unmanned aerial vehicle remote sensing image, which aims to overcome the defects in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
The embodiment of the invention provides a real-time railway scene analysis method for an unmanned aerial vehicle remote sensing image, which comprises the following steps:
acquiring an unmanned aerial vehicle remote sensing image in real time, and acquiring and processing data of the image to obtain a data set;
constructing a railway scene analysis network model, and training and verifying the railway scene analysis network model according to the obtained data set to obtain an optimal line loss proportional coefficient;
and testing the model on different computers using the optimal line loss proportional coefficient to obtain analysis results, and comprehensively evaluating the analysis results.
Preferably, performing data acquisition and processing on the images to obtain a data set includes screening the obtained images, annotating the screened images with the labelme software, and dividing them into a training set, a verification set and a test set according to a certain proportion.
Preferably, training and verifying the railway scene analysis network model according to the obtained data set comprises: when the data set only has two semantic categories of a track area and a non-track area, training and verifying the railway scene analysis network model only by adopting a line loss function; and when other semantic categories are included, training and verifying the railway scene analysis network model by adopting the integrated loss function.
Preferably, the integration loss function is represented by the following formula (1):
L = (1-α)·L_CE + α·L_LL (1)
wherein L_CE represents the cross-entropy loss function, L_LL represents the line loss function, and α represents the proportional coefficient.
Preferably, when the railway scene analysis network model is trained and verified on the obtained data set, the track area in the unmanned aerial vehicle remote sensing image of the railway scene needs to be an elongated strip.
Preferably, the overall architecture of the constructed railway scene analysis network model is as shown in Table 1 below:
TABLE 1
[Table 1: overall FDRNet architecture; presented as an image in the original publication]
Preferably, the line loss function is as shown in formula (2) below:
[Formula (2): line loss function L_LL; presented as an image in the original publication]
wherein the pixel point sets corresponding to the track area and the non-track area in the image are P_r and P_n respectively, with |P_r| = N and |P_n| = M; for pixel points p_i ∈ P_r and p_j ∈ P_n, the degrees of membership to the track area are 1/λ_i and 1/λ_j respectively, and f_i and f_j are the probabilities that p_i and p_j are predicted as the track-area class.
Preferably, the degree of membership is calculated according to the following formula (3):
λ = d/d_0 (3)
when only a single track area exists in the image, d is the distance from pixel point p to the track centerline l, and d_0 is the distance from a point on the edge of the track area to the centerline l;
when two or more strip-shaped track areas exist in the image, d is the distance from pixel point p to the centerline l_β of the β-th track area, and d_0 is the distance from a point on the edge of the β-th track area to the centerline l_β.
Preferably, the comprehensive evaluation of the analysis results includes: computing a prediction accuracy evaluation from the analysis results obtained on the test set and the corresponding ground-truth labels, and evaluating the inference speed of the railway scene analysis network model.
Preferably, the prediction accuracy evaluation is calculated according to the following formulas (3) to (4):
IoU = TP / (TP + FP + FN) (3)
mIoU = (1/C) · Σ_c IoU_c (4)
wherein TP represents the number of pixels of a certain semantic category c that are predicted as that category; TN represents the number of pixels that are not of category c and are not predicted as that category; FP represents the number of pixels that are predicted as category c but do not in fact belong to it; FN represents the number of pixels that are not predicted as category c but in fact belong to it; IoU represents the intersection-over-union accuracy of category c; in formula (4), mIoU represents the mean intersection-over-union accuracy over all semantic categories, and C represents the number of semantic categories.
According to the technical scheme provided by the real-time railway scene analysis method for unmanned aerial vehicle remote sensing images, a deep fully decoupled residual convolutional network is designed so that real-time, efficient railway scene analysis can be carried out within the computing capacity of a UAV on-board computer, supporting UAV-based automatic railway inspection to the greatest extent. By designing a customized auxiliary loss function and using it during network training, the segmentation of track and non-track areas can be constrained simultaneously without increasing the computational complexity, so that the predicted track areas are accurately concentrated in strip-shaped regions and are prevented from appearing in other implausible local regions. As a result, real-time, fast and efficient railway scene analysis, including high-precision segmentation of the track area, can be performed on a UAV on-board computer with limited computing resources.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a real-time railway scene analysis method for remote sensing images of unmanned aerial vehicles according to an embodiment of the invention;
FIG. 2 is a schematic comparison of a fully decoupled convolution and standard convolution filter;
FIG. 3 is a schematic diagram comparing the fully decoupled residual module proposed in the present embodiment with the residual module of the prior art;
FIG. 4 is a diagram illustrating a comparison between a conventional pixel coordinate system and a normalized coordinate system;
FIG. 5 is a schematic illustration of a single track area and a dual track area and their centerlines;
FIG. 6 is a schematic diagram showing how the per-category prediction accuracy of the proposed FDRNet and of ERFNet varies with the line loss function ratio α;
FIG. 7 is a graph of FDRNet visual effects trained with an integration loss strategy.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are exemplary only for explaining the present invention and are not construed as limiting the present invention.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, but do not preclude the presence or addition of one or more other features, integers, steps, operations, and/or groups thereof. It should be understood that the term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding of the embodiments of the present invention, the following detailed description will be given by way of example with reference to the accompanying drawings, and the embodiments are not limited to the embodiments of the present invention.
Examples
Fig. 1 is a schematic flow chart of a real-time railway scene analysis method for remote sensing images of unmanned aerial vehicles according to an embodiment of the present invention, and with reference to fig. 1, the method includes:
S1, acquiring unmanned aerial vehicle remote sensing images in real time, and performing data acquisition and processing on the images to obtain a data set.
Under good weather conditions, the unmanned aerial vehicle flies over the railway line and acquires remote sensing images of the railway scene; the acquired images are screened to remove unusable ones. The screened images are annotated with the labelme software, and the semantic categories of interest in the images are labeled manually. In this embodiment the semantic categories are, illustratively, track, vegetation, bare land, road, building and background. The produced data set is divided into a training set, a verification set and a test set according to a certain proportion: the training set is used to train the network, the verification set is used to verify network performance during training, and the test set is used to test the trained network model so as to verify the performance of the constructed model; the ratio of training, verification and test sets is, schematically, 7:2:1.
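By way of illustration only, the screened and annotated images could be divided 7:2:1 as follows; the directory name, file extension and random seed are assumptions of this sketch rather than details of the embodiment:

```python
import random
from pathlib import Path

def split_dataset(image_dir, ratios=(0.7, 0.2, 0.1), seed=42):
    """Randomly split annotated images into train/val/test lists (7:2:1 by default)."""
    images = sorted(Path(image_dir).glob("*.jpg"))  # assumed file extension
    random.Random(seed).shuffle(images)
    n = len(images)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = images[:n_train]
    val = images[n_train:n_train + n_val]
    test = images[n_train + n_val:]
    return train, val, test

# assumed directory layout
train_set, val_set, test_set = split_dataset("uav_railway_images")
```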
S2, constructing a railway scene analysis network model, and training and verifying the railway scene analysis network model on the obtained data set to obtain the optimal line loss proportional coefficient.
First, a lightweight railway scene analysis network model, FDRNet (Fully Decoupled Residual ConvNet), is constructed so that the model can run at an adequate speed on an on-board computer.
Constructing a railway scene analysis network model:
the railway scene analysis network model constructed in the embodiment is a complete decoupling residual convolution network model, and is specifically constructed through processing of a deep complete decoupling convolution and a deep complete decoupling residual block.
(1) Deep fully decoupled convolution
The basic idea of fully decoupled convolution is to further decouple the correlations handled by a standard convolution, which greatly reduces the number of parameters and the amount of computation: on the basis of keeping the basic mapping relations of the convolution unchanged, the parameters can be reduced substantially, avoiding excessive computation time and resource occupation. The fully decoupled convolution proposed in this embodiment involves two kinds of correlation decoupling: (1) decoupling of cross-channel and spatial correlations; (2) decoupling of transverse and longitudinal spatial correlations.
The following assumptions are first proposed: the two coupled correlation modes in a convolutional network can be completely decoupled, namely (1) the cross-channel correlation and the spatial correlation in the feature map can be completely decoupled; and (2) furthermore, the two spatial correlations (transverse and longitudinal) in the feature map can also be completely decoupled. FIG. 2 compares the filter banks of a fully decoupled convolution and a standard convolution. Fully decoupled convolution decomposes the standard convolution into three sequential steps: a transverse 1D depthwise convolution, a longitudinal 1D depthwise convolution and a cross-channel 1x1 convolution. The M convolution kernels of the first two depthwise convolutions, in the two different spatial dimensions, correspond respectively to the M channels of the input feature map. The final 1x1 convolution is a special case of an ordinary standard convolution with size 1x1; it mainly establishes the cross-channel correlation mapping in the convolution process and can convert the number of channels of the input feature map from M to N.
Let σ(·) denote the nonlinear activation function; let b^h_m and b^v_m denote the additional offsets of the m-th filter of the transverse 1D depthwise convolution and of the longitudinal 1D depthwise convolution in the fully decoupled convolution process, and let b^p_i denote the additional offset of the i-th filter of the cross-channel 1x1 convolution. Let w^h_m and w^v_m denote the weight vectors of the m-th kernels of the transverse and longitudinal 1D depthwise convolutions, let w^p_{i,m} denote the m-th channel weight of the i-th 1x1 filter, and let a^0_m denote the m-th channel of the input feature map a^0. The i-th channel a_i of the output feature map of the fully decoupled convolution process can then be expressed in terms of the input feature map a^0 as formula (1), where * denotes the convolution operation:
[Formula (1): fully decoupled convolution output channel a_i; presented as an image in the original publication]
Since the cross-channel kernel w^p_{i,m} represents a 1x1 convolution containing only a single scalar parameter, formula (1) can be simplified into formula (2):
[Formula (2); presented as an image in the original publication]
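By way of illustration only (not the patent's reference implementation), the three-step decomposition above can be sketched in PyTorch as follows; the kernel size of 3, the dilation parameter and the placement of a ReLU after each step are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class FullyDecoupledConv2d(nn.Module):
    """Sketch of a fully decoupled convolution: transverse 1D depthwise convolution,
    longitudinal 1D depthwise convolution, then a pointwise 1x1 convolution that
    establishes cross-channel correlation (M -> N channels)."""
    def __init__(self, in_ch, out_ch, k=3, dilation=1):
        super().__init__()
        pad = dilation * (k // 2)
        # transverse (1 x k) depthwise convolution: one kernel per input channel
        self.conv_h = nn.Conv2d(in_ch, in_ch, (1, k), padding=(0, pad),
                                dilation=(1, dilation), groups=in_ch, bias=True)
        # longitudinal (k x 1) depthwise convolution
        self.conv_v = nn.Conv2d(in_ch, in_ch, (k, 1), padding=(pad, 0),
                                dilation=(dilation, 1), groups=in_ch, bias=True)
        # cross-channel 1x1 (pointwise) convolution: in_ch -> out_ch
        self.conv_p = nn.Conv2d(in_ch, out_ch, 1, bias=True)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.act(self.conv_h(x))
        x = self.act(self.conv_v(x))
        return self.act(self.conv_p(x))
```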
(2) Deep fully decoupled residual block
This embodiment further proposes a fully decoupled residual module that makes full use of the decomposed form of the fully decoupled convolution. FIG. 3 compares the fully decoupled residual module proposed in this embodiment with residual modules of the prior art. The original residual blocks (bottleneck and non-bottleneck versions) were proposed in ResNet, as shown in FIG. 3-(a) and 3-(b). Considering that the non-bottleneck design can bring higher accuracy, and noting that the bottleneck design also introduces other degradation problems, ERFNet modified the non-bottleneck residual module through one-dimensional factorization to accelerate the model while reducing the parameters of the original non-bottleneck residual module; this is called non-bottleneck-1D, as shown in FIG. 3-(c). Here, the non-bottleneck-1D residual module is further modified by replacing its convolutions with the proposed fully decoupled convolution, further reducing the number of parameters and the time cost, as shown in FIG. 3-(d). The fully decoupled residual module proposed in this embodiment, called non-bottleneck-FD, consists of two fully decoupled convolutions connected together with an identity mapping. The 1D convolutions in non-bottleneck-1D are modified into two 1D depthwise convolutions that consider only spatial correlation, and an additional 1x1 convolution (also known as pointwise convolution) is appended to realize the final cross-channel correlation of the feature map. It should also be noted that a ReLU nonlinear activation function is added after each convolution.
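An illustrative sketch of the non-bottleneck-FD block follows, reusing the FullyDecoupledConv2d sketch above; the dropout placement and applying dilation only to the second convolution are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class NonBottleneckFD(nn.Module):
    """Sketch of non-bottleneck-FD: two fully decoupled convolutions plus an
    identity shortcut (channel count unchanged), with dropout as regularization."""
    def __init__(self, channels, dilation=1, dropout=0.05):
        super().__init__()
        self.fd1 = FullyDecoupledConv2d(channels, channels, k=3, dilation=1)
        self.fd2 = FullyDecoupledConv2d(channels, channels, k=3, dilation=dilation)
        self.drop = nn.Dropout2d(dropout)  # ratio 0.05 as stated in the text

    def forward(self, x):
        out = self.fd1(x)
        out = self.fd2(out)
        out = self.drop(out)
        return torch.relu(out + x)  # identity mapping added back
```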
(3) Deep fully decoupled residual network
The architecture of the railway scene analysis network model constructed in this embodiment is shown in Table 1 below. It is a compact and effective network architecture; although the non-bottleneck-FD module greatly reduces the number of network parameters, this inevitably causes some loss of network performance, so scaling up the network to compensate for this loss is worth considering. The network scale can be expanded in two directions: deepening the network or widening it. Both directions were designed and tested, and the empirical results show that widening the network is the better direction and is more suitable for the current deep learning framework PyTorch. Analysis also showed that increasing the number of intermediate steps in the convolution slows the network down, probably because PyTorch is more sensitive to network depth than to width. It should be noted that, compared with a conventional standard convolution, the proposed fully decoupled convolution increases the number of intermediate steps (not counting the additional batch normalization and subsequent nonlinear activation layers), which in itself deepens the network and increases forward inference time. As seen in Table 1, the encoder consists of layers 1-14 and the decoder of layers 15-19. Inspired by ERFNet, a wider convolutional network architecture is designed, with the same downsampling module used in layers 1, 2 and 8. In these downsampling modules, the max-pooling result and the result of a single 3x3 convolution with stride 2 are concatenated as the final downsampling output, in order to capture richer features. The network uses dilated convolutions with dilation rates of 2, 4 and 8 to obtain more contextual and global information; a convolutional layer with a dilation rate of 16 is not used, giving up the minimal gain in contextual features it might bring while avoiding a further increase in network depth. In addition, the Dropout ratio used is 0.05, whereas the ratio in ERFNet is 0.03; Dropout is included in the architecture as a regularization measure that yields better representations. For the upsampling stage, three successive transposed-convolution upsampling modules, layers 15, 17 and 19, are used to expand the resolution of the feature map back to the original size of the input image.
TABLE 1
[Table 1: FDRNet architecture, encoder layers 1-14 and decoder layers 15-19; presented as an image in the original publication]
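By way of illustration, the downsampling module described above — concatenating a stride-2 3x3 convolution with a 2x2 max-pooling of the input — might be sketched as follows; the channel split, in which the convolution contributes out_ch - in_ch channels so the concatenation yields out_ch, follows ERFNet's convention and is an assumption here:

```python
import torch
import torch.nn as nn

class DownsamplerBlock(nn.Module):
    """Sketch of the downsampling module: a stride-2 3x3 convolution and a 2x2
    max-pooling of the input are concatenated along the channel dimension."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch - in_ch, 3, stride=2, padding=1)
        self.pool = nn.MaxPool2d(2, stride=2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(torch.cat([self.conv(x), self.pool(x)], dim=1))
```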
It should be noted that when the railway scene analysis network model is trained and verified on the obtained data set, the track area in the unmanned aerial vehicle remote sensing image of the railway scene needs to be an elongated strip.
When the data set only has two semantic categories of a track area and a non-track area, training and verifying the railway scene analysis network model by only adopting a line loss function; and when other semantic categories are included, training and verifying the railway scene analysis network model by adopting the integrated loss function.
(1) Line Loss function (LL)
The railway track area is the main region of concern for UAV-based automatic railway inspection. Accurate prediction of this area plays an important role in subsequent inspection work such as fastener inspection, rail inspection and track-slab inspection. This embodiment proposes a new line loss function to improve the accuracy of the railway category in the railway scene analysis task. The line loss function exploits an excellent characteristic of high-speed railways: they cover long distances and are comparatively straight. For highways, route designers always deliberately introduce curves to prevent driver fatigue; for railways, by contrast, the longer the straight sections the better. Thus, in terms of straightness, railways far exceed highways. At the same time, in the local railway area covered by a UAV remote sensing image, the track area is very straight in most circumstances.
The traditional Cross Entropy (CE) loss function addresses this task by classifying each discrete pixel of the whole image, ignoring the inherent relationships between pixels. It should be noted that the track area in a UAV remote sensing image is always an elongated region. Therefore, the closer a pixel is to the centerline of the strip-shaped railway area, the more likely it is to be part of the railway; the further a pixel is from the centerline, the less likely it is to belong to the railway area. To fully exploit this idea, this embodiment proposes a line loss function, which can largely correct the railway-area pixel classification errors inherent in the conventional loss function.
1) Normalized coordinate system
FIG. 4 compares a conventional pixel coordinate system with the normalized coordinate system. Conventionally, the image coordinate system is established on the pixels of the image, with one pixel as the unit length, as shown in FIG. 4(a). In such a coordinate system, when the image is scaled, the distance between two corresponding points before and after scaling does not remain constant. To solve this problem, this embodiment establishes the normalized coordinate system shown in FIG. 4(b), which takes the length and width of the entire image as unit lengths. For an image of resolution w_0 × h_0, a point with coordinates (w, h) in the conventional coordinate system has coordinates (w/w_0, h/h_0) in the normalized coordinate system. The distance d(p_1, p_2) between two pixels p_1 and p_2 can then be computed in the normalized coordinate system, where d denotes the Euclidean distance. Clearly, the normalized coordinate system has the following properties: (1) the distance between two pixels is invariant to scaling of the image; (2) the maximum distance between two pixels in the normalized coordinate system is √2.
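A small sketch of this normalized distance follows; pixel coordinates being given as (column, row) is an assumption of the sketch:

```python
import math

def normalized_distance(p1, p2, w0, h0):
    """Euclidean distance between two pixels after normalizing coordinates by the
    image width w0 and height h0; invariant to image scaling and bounded by sqrt(2)."""
    x1, y1 = p1[0] / w0, p1[1] / h0
    x2, y2 = p2[0] / w0, p2[1] / h0
    return math.hypot(x1 - x2, y1 - y2)
```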
2) Basic assumptions
Assumption 1: if a track area exists in the railway scene image acquired by the UAV, the two boundary lines of that area are parallel to each other.
Assumption 2: if two track areas exist in the railway scene image acquired by the UAV, these two track areas are also parallel to each other.
The line loss function is shown in the following formula (2):
[Formula (2): line loss function L_LL; presented as an image in the original publication]
wherein the pixel point sets corresponding to the track area and the non-track area in the image are P_r and P_n respectively, with |P_r| = N and |P_n| = M; for pixel points p_i ∈ P_r and p_j ∈ P_n, the degrees of membership (DoM) to the track area are 1/λ_i and 1/λ_j respectively, and f_i and f_j are the probabilities that p_i and p_j are predicted as the track-area class. With the help of the class-score feature map of the last layer of the CNN, the classification probability of each pixel can be calculated using the softmax function.
The degree of membership is calculated according to the following formula (3):
λ = d/d_0 (3)
FIG. 5 illustrates a single track area, a double track area and their centerlines. When only a single track area exists in the image, as shown in FIG. 5(a), d is the distance from pixel point p to the track centerline l, and d_0 is the distance from a point on the edge of the track area to the centerline l.
When two or more strip-shaped track areas exist in the image, as shown in FIG. 5(b), d is the distance from pixel point p to the centerline l_β of the β-th track area, and d_0 is the distance from a point on the edge of the β-th track area to the centerline l_β.
The degrees of membership of non-track-area pixels are all smaller than 1, and those of track-area pixels are all larger than 1.
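The closed form of formula (2) appears only as an image in the original publication. The sketch below therefore assumes one membership-weighted form consistent with the properties stated here — it is non-negative and reaches its minimum of 0 exactly when every track pixel is predicted with f_i = 1 and every non-track pixel with f_j = 0 — and should not be read as the patent's verbatim formula:

```python
import torch

def line_loss(prob_track, track_mask, inv_lambda):
    """Assumed line-loss sketch (not the patent's exact formula).
    prob_track: (..., H, W) softmax probability of the track class (f_i / f_j).
    track_mask: (..., H, W) boolean, True for ground-truth track pixels.
    inv_lambda: (..., H, W) degree of membership 1/lambda = d0/d per pixel
                (clamped in practice to avoid the singularity on the centerline).
    """
    track = track_mask
    non_track = ~track_mask
    # track pixels are pushed towards probability 1, weighted by membership
    loss_r = (inv_lambda[track] * (1.0 - prob_track[track])).mean()
    # non-track pixels are pushed towards probability 0, weighted by membership
    loss_n = (inv_lambda[non_track] * prob_track[non_track]).mean()
    return loss_r + loss_n
```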
Because the LL function can only perform a binary classification task, a network trained with the LL function alone can only distinguish track areas from non-track areas. Therefore, to enable multi-class classification, the network is trained together with the cross-entropy loss function: the CE loss function trains the network for the multi-class task, while the auxiliary line loss function LL imposes a stricter constraint on the track area to achieve more accurate track-area segmentation. The integrated loss function is shown in the following formula (4):
L = (1-α)·L_CE + α·L_LL (4)
wherein L_CE represents the cross-entropy loss function, L_LL represents the line loss function, and α represents the proportional coefficient.
Formula (4) shows that reducing the line loss requires f_i to become larger and f_j to become smaller; the line loss function attains its ideal minimum value of 0 if and only if f_i = 1 and f_j = 0. It is worth pointing out that the line loss function only applies to the binary task of separating track areas from non-track areas, so a network trained using only the line loss function can only distinguish track from non-track areas. To realize multi-class classification, this embodiment trains the model with the combination of the line loss function and the cross-entropy loss function.
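The integrated loss of formula (4) can then be sketched directly, reusing the line_loss function above; the choice α = 0.7 reflects the optimum reported later, and the tensor shapes are assumptions of the sketch:

```python
import torch.nn.functional as F

def integrated_loss(logits, target, prob_track, track_mask, inv_lambda, alpha=0.7):
    """L = (1 - alpha) * L_CE + alpha * L_LL, combining multi-class cross entropy
    with the assumed line-loss sketch above."""
    l_ce = F.cross_entropy(logits, target)          # logits: (B, C, H, W), target: (B, H, W)
    l_ll = line_loss(prob_track, track_mask, inv_lambda)
    return (1.0 - alpha) * l_ce + alpha * l_ll
```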
S3, testing the model on different computers using the optimal line loss proportional coefficient to obtain analysis results, and comprehensively evaluating the analysis results.
The railway scene analysis network model FDRNet is fully and effectively trained using the training and verification sets from step S1 and the loss function described above; after training, the network model under the optimal line loss proportional coefficient α is selected for testing. The pictures in the test set are input to the trained model and predicted one by one, and the analysis results are comprehensively evaluated, which includes: computing a prediction accuracy evaluation from the analysis results obtained on the test set and the corresponding ground-truth labels, and evaluating the inference speed of the railway scene analysis network model.
The prediction accuracy evaluation is calculated according to the following formulas (5) to (6):
IoU = TP / (TP + FP + FN) (5)
mIoU = (1/C) · Σ_c IoU_c (6)
wherein TP (true positive) represents the number of pixels of a certain semantic category c that are predicted as that category; TN (true negative) represents the number of pixels that are not of category c and are not predicted as that category; FP (false positive) represents the number of pixels that are predicted as category c but do not in fact belong to it; FN (false negative) represents the number of pixels that are not predicted as category c but in fact belong to it; IoU represents the intersection-over-union accuracy of category c; in formula (6), mIoU represents the mean intersection-over-union accuracy over all semantic categories, and C represents the number of semantic categories.
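A straightforward sketch of formulas (5)-(6), computing per-class IoU from the TP/FP/FN counts defined above and averaging over all classes:

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """Per-class IoU = TP / (TP + FP + FN) and the mean IoU over all classes.
    pred and gt are integer label maps of the same shape."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else float("nan"))
    return ious, float(np.nanmean(ious))
```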
The following is a simulation case using the method of this embodiment. Three positions A, B and C along the Jingu high-speed railway corridor were selected for acquiring railway scene pictures; the weather on the acquisition day was good, with sufficient light. Based on these data, a UAV-based railway scene analysis data set was constructed, the constructed real-time railway scene analysis network model was trained with this data set, and the evaluation test was finally completed on the UAV on-board computer.
Step one: acquire unmanned aerial vehicle remote sensing images in real time, and perform data acquisition and processing on the images to obtain a data set.
The data set has five semantic categories (i.e. background, track, building, vegetation and road), with pixel-level annotation for each image. All UAV remote sensing images in the data set were collected at three positions A, B and C along the Jinghushi high-speed railway corridor section. To demonstrate the performance of the different models, 3000 images from zones A and B were used to construct the training data (2700 images) and validation data (300 images), while 300 images from zone C were used for testing. Zone C is completely separate from zones A and B, to avoid a possible high similarity between training and test images. The proposed model is trained on the training set.
Step two: and constructing a railway scene analysis network model, and training and verifying the railway scene analysis network model according to the obtained data set to obtain the optimal line loss proportional coefficient.
The constructed lightweight railway scene analysis network model is fully trained with the integrated loss function on the data set constructed in step one.
In this simulation example, the model is trained with a batch size of 2 (batch size = 2); parameter optimization uses the Adam optimizer with a momentum parameter of 0.9, a weight decay of 2e-4 and an initial learning rate of 5e-4. All networks are trained for 100 rounds (epoch = 100). A consistent random seed is set in the code implementation so that all networks are trained with a fixed picture sequence, ensuring repeatability of the work. Meanwhile, during training the line loss proportional coefficient α in the integrated loss function is varied from 0 to 0.9 in steps of 0.1, and the best result is selected.
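A training-loop sketch reflecting the hyperparameters stated above (batch size 2, Adam with the momentum parameter mapped to beta1 = 0.9, weight decay 2e-4, initial learning rate 5e-4, 100 epochs, fixed random seed) is given below; FDRNet(), train_loader and the seed value are placeholders, and plain cross-entropy stands in where the integrated loss of formula (4) would be used:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)                        # fixed random seed (value assumed)
model = FDRNet(num_classes=5).cuda()        # placeholder for the network of Table 1
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4,
                             betas=(0.9, 0.999),   # momentum 0.9 as beta1
                             weight_decay=2e-4)

for epoch in range(100):                    # epoch = 100
    for images, labels in train_loader:     # batch size 2; loader construction omitted
        optimizer.zero_grad()
        logits = model(images.cuda())
        loss = F.cross_entropy(logits, labels.cuda())  # integrated loss used in practice
        loss.backward()
        optimizer.step()
```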
(1) Optimum ratio of line loss function
According to the experiments, the proposed line loss function is much stronger than the cross-entropy loss function in constraining the network to predict the railway area accurately, but the line loss function can only classify each pixel into two classes. Therefore, the line loss function is used to constrain the training process while the CE loss function performs the multi-class classification. How to properly combine the CE and line loss functions is thus the key issue in constraining the training process for both railway and non-railway regions. To this end, a training strategy completely different from the conventional training process is implemented, in which back-propagation is performed with the integrated loss at every training step. For this strategy, different proportions α of the line loss function are chosen, varying from 0 to 0.9 in steps of 0.1.
FIG. 6 shows how the prediction accuracy of the proposed FDRNet varies with the line loss function ratio α, i.e. the trend of the IoU and mIoU values with α. In terms of both track-area prediction accuracy and overall mIoU, the line loss function improves the accuracy of the proposed FDRNet simultaneously when a suitable α is used. The accuracy reaches its maximum at α = 0.7.
In addition, as shown in Table 2, the accuracy trend across all classes as the line loss function ratio of FDRNet varies from 0 to 0.9 is also reported, together with the maximum accuracy increment and the increment at α = 0.7. It can be concluded with confidence that the line loss function improves the prediction not only of the track region but also of the non-track regions. This is easily understood: when the prediction accuracy of the track area drops, more non-track pixels are predicted as track (FP case) or more track pixels are predicted as non-track (FN case), which eventually also reduces the prediction accuracy of the non-track areas. Table 2 lists the accuracy increase for the different classes of pixels. The maximum accuracy increase for the track category is 7.58%, owing to its relatively strong linear characteristics, while the maximum accuracy increase for the road category is 6.66%. Overall, the mIoU accuracy improves by 3.36%.
More details of the quantitative comparison between the original FDRNet and the FDRNet trained with the integrated loss strategy at α = 0.7 are listed in Table 3. The accuracy of all classes rises to a higher level: background, building, vegetation, track and road accuracies increase from 54.98% to 57.04%, 46.20% to 47.34%, 58.98% to 61.12%, 70.72% to 78.30% and 44.77% to 48.67%, respectively. Finally, the mIoU increases from 55.13% to 58.49%. In particular, the track class achieves the largest increment.
TABLE 2
[Table 2: per-class accuracy change of FDRNet as the line loss ratio α varies from 0 to 0.9; presented as an image in the original publication]
TABLE 3
[Table 3: per-class accuracy of the original FDRNet versus FDRNet trained with the integrated loss strategy (α = 0.7); presented as an image in the original publication]
In summary, the line loss function proposed in this embodiment has the following advantages: (1) the integrated loss strategy can greatly improve the accuracy of railway and non-railway areas, and ultimately the overall accuracy; (2) by the exact definition of the LL function, a model trained with the line loss function concentrates the predicted track areas in the strip-shaped regions of the image while suppressing track predictions at other implausible locations; (3) the line loss function offers better interpretability for railway region segmentation.
Step three: and testing the model by adopting different computers according to the optimal line loss proportional coefficient to obtain an analysis result, and comprehensively evaluating the analysis result.
To demonstrate the advantage of the method in inference speed, comparison experiments were carried out at different resolutions on a single NVIDIA Jetson TX2 embedded device and on a single NVIDIA GeForce RTX 2060 card. Tables 4 and 5 show the inference speed comparison on the single embedded TX2 device and on the single GeForce RTX 2060, respectively, where ms is the number of milliseconds needed to infer one picture and fps is the number of pictures that can be inferred per second.
On the TX2 device, FDRNet reaches a peak of 12.8 fps at a resolution of 512x256. On the RTX 2060 GPU card, FDRNet achieves a peak of 90.9 fps at a resolution of 512x256.
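Inference-speed figures such as these are typically obtained by timing repeated forward passes; the following generic sketch (the warm-up count, number of runs and use of CUDA synchronization are assumptions, not details from the patent) illustrates how ms-per-image and fps can be measured:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, resolution=(512, 256), runs=100, device="cuda"):
    """Average forward-pass time (ms) and frames per second at a given resolution."""
    model.eval().to(device)
    dummy = torch.randn(1, 3, resolution[1], resolution[0], device=device)
    for _ in range(10):            # warm-up iterations
        model(dummy)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(dummy)
    torch.cuda.synchronize()
    ms = (time.time() - start) / runs * 1000.0
    return ms, 1000.0 / ms
```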
TABLE 4
[Table 4: inference speed comparison on a single NVIDIA Jetson TX2 at different resolutions; presented as an image in the original publication]
TABLE 5
[Table 5: inference speed comparison on a single NVIDIA GeForce RTX 2060 at different resolutions; presented as an image in the original publication]
The results of FDRNet under different training strategies are also given in the accuracy comparison with other models, where "R.IoU" denotes the accuracy of the railway track area category, "mIoU" denotes the comprehensive accuracy over all categories, and "scratch" denotes a model trained from scratch without a pre-trained model. "with LL" means the network is trained with the integrated loss function, and "pretrained, with LL" means the model starts from a pre-trained model and is trained with the integrated loss function.
Table 6 compares the overall performance of FDRNet with other models. As can be seen from Table 6 below, when FDRNet is trained from scratch without the line loss function it cannot obtain satisfactory accuracy: the track class prediction accuracy is 70.72% and the overall prediction accuracy mIoU is 55.13%, which is not outstanding.
However, once the integrated loss back-propagation strategy is adopted, the proposed method improves rapidly and outperforms all the other algorithms, with an mIoU of 58.49%; notably, the track class accuracy increases by 7.58%. In addition, by adopting a model pre-trained on Cityscapes, the method of this embodiment finally reaches the optimal R.IoU of 80.99% and mIoU of 58.82%.
TABLE 6
[Table 6: overall performance comparison between FDRNet and other models; presented as an image in the original publication]
FIG. 7 shows a visual comparison of FDRNet and ERFNet trained with the integrated loss back-propagation strategy. As shown in FIG. 7, several examples of the track region predictions generated by FDRNet (c) demonstrate that, when trained with the integrated loss strategy, it produces more accurate segmentation results in local details and at edges. Most of the predicted track regions are constrained to the expected two elongated areas, which is exactly the purpose of the proposed customized Line Loss (LL) function. When low-luminance pictures are processed, as shown in rows 5 and 6 of FIG. 7, the architecture adapts better to different illumination conditions and can still analyze the track area accurately and effectively, so the model is more robust in different working environments and can better serve UAV-based automatic railway inspection.
In conclusion, the method achieves fast and efficient UAV-based railway scene analysis. The proposed network architecture can run in real time on a UAV on-board computer and, by exploiting the fact that the track area is relatively straight and usually concentrated in strip-shaped regions, it greatly improves the accuracy of track-area segmentation and extraction; it therefore has clear application value for UAV-based automatic railway inspection.
It will be appreciated by those skilled in the art that the foregoing types of applications are merely exemplary, and that other types of applications, whether presently existing or later to be developed, that may be suitable for use with the embodiments of the present invention, are also intended to be encompassed within the scope of the present invention and are hereby incorporated by reference.
It should be understood by those skilled in the art that the foregoing example of determining the invoking policy according to the user information is only for better explaining the technical solution of the embodiment of the present invention, and is not limited to the embodiment of the present invention. Any method of determining the invoking policy based on the user attributes is included in the scope of embodiments of the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
While the invention has been described with reference to specific preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the following claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A real-time railway scene analysis method for unmanned aerial vehicle remote sensing images is characterized by comprising the following steps:
acquiring an unmanned aerial vehicle remote sensing image in real time, and acquiring and processing data of the image to obtain a data set;
constructing a railway scene analysis network model, and training and verifying the railway scene analysis network model according to the obtained data set to obtain an optimal line loss proportional coefficient;
the overall architecture of the constructed railway scene analysis network model is shown in Table 1 below:
TABLE 1
[Table 1: FDRNet architecture; presented as an image in the original publication]
The training and verifying of the railway scene analysis network model according to the obtained data set comprises the following steps: when the data set only has two semantic categories of a track area and a non-track area, training and verifying the railway scene analysis network model only by adopting a line loss function; when other semantic categories are included, training and verifying the railway scene analysis network model by adopting an integrated loss function;
the integrated loss function is shown in the following formula (1):
L = (1-α)·L_CE + α·L_LL (1)
wherein L_CE represents the cross-entropy loss function, L_LL represents the line loss function, and α represents the proportional coefficient; the optimal line loss proportional coefficient is obtained by selecting α; the line loss function is shown in the following formula (2):
[Formula (2): line loss function L_LL; presented as an image in the original publication]
wherein the pixel point sets corresponding to the track area and the non-track area in the image are P_r and P_n respectively, with |P_r| = N and |P_n| = M; for pixel points p_i ∈ P_r and p_j ∈ P_n, the degrees of membership to the track area are 1/λ_i and 1/λ_j respectively, and f_i and f_j are the probabilities that p_i and p_j are predicted as the track-area class;
when the railway scene analysis network model is trained and verified according to the obtained data set, the track area in the unmanned aerial vehicle remote sensing image of the railway scene needs to be an elongated strip;
and testing the model on different computers using the optimal line loss proportional coefficient to obtain analysis results, and comprehensively evaluating the analysis results.
2. The method of claim 1, wherein performing data acquisition and processing on the images to obtain the data set comprises screening the acquired images, annotating the screened images with the labelme software, and dividing them into a training set, a verification set and a test set according to a certain proportion.
3. The method of claim 1, wherein the degree of membership is calculated according to the following formula (3):
λ = d/d_0 (3)
when only a single track area exists in the image, d is the distance from pixel point p to the track centerline l, and d_0 is the distance from a point on the edge of the track area to the centerline l;
when two or more strip-shaped track areas exist in the image, d is the distance from pixel point p to the centerline l_β of the β-th track area, and d_0 is the distance from a point on the edge of the β-th track area to the centerline l_β.
4. The method of claim 1, wherein the comprehensive evaluation of the analysis results comprises: computing a prediction accuracy evaluation from the analysis results obtained on the test set and the corresponding ground-truth labels, and evaluating the inference speed of the railway scene analysis network model.
5. The method according to claim 4, wherein the prediction accuracy evaluation is calculated according to the following formulas (3) to (4):
IoU = TP / (TP + FP + FN) (3)
mIoU = (1/C) · Σ_c IoU_c (4)
wherein TP represents the number of pixels of a certain semantic category c that are predicted as that category; TN represents the number of pixels that are not of category c and are not predicted as that category; FP represents the number of pixels that are predicted as category c but do not in fact belong to it; FN represents the number of pixels that are not predicted as category c but in fact belong to it; IoU represents the intersection-over-union accuracy of category c; in formula (4), mIoU represents the mean intersection-over-union accuracy over all semantic categories, and C represents the number of semantic categories.
CN202110518589.4A 2021-05-12 2021-05-12 Real-time railway scene analysis method for unmanned aerial vehicle remote sensing image Active CN113160219B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110518589.4A CN113160219B (en) 2021-05-12 2021-05-12 Real-time railway scene analysis method for unmanned aerial vehicle remote sensing image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110518589.4A CN113160219B (en) 2021-05-12 2021-05-12 Real-time railway scene analysis method for unmanned aerial vehicle remote sensing image

Publications (2)

Publication Number Publication Date
CN113160219A CN113160219A (en) 2021-07-23
CN113160219B true CN113160219B (en) 2023-02-07

Family

ID=76875162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110518589.4A Active CN113160219B (en) 2021-05-12 2021-05-12 Real-time railway scene analysis method for unmanned aerial vehicle remote sensing image

Country Status (1)

Country Link
CN (1) CN113160219B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116562653B (en) * 2023-06-28 2023-11-28 广东电网有限责任公司 Distributed energy station area line loss monitoring method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929621A (en) * 2019-11-15 2020-03-27 中国人民解放军63729部队 Road extraction method based on topology information refinement
CN111047630A (en) * 2019-11-13 2020-04-21 芯启源(上海)半导体科技有限公司 Neural network and target detection and depth prediction method based on neural network
CN111144418A (en) * 2019-12-31 2020-05-12 北京交通大学 Railway track area segmentation and extraction method
CN111582225A (en) * 2020-05-19 2020-08-25 长沙理工大学 Remote sensing image scene classification method and device
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning
CN111767810A (en) * 2020-06-18 2020-10-13 哈尔滨工程大学 Remote sensing image road extraction method based on D-LinkNet
CN111985451A (en) * 2020-09-04 2020-11-24 南京航空航天大学 Unmanned aerial vehicle scene detection method based on YOLOv4
CN112131967A (en) * 2020-09-01 2020-12-25 河海大学 Remote sensing scene classification method based on multi-classifier anti-transfer learning
CN112308860A (en) * 2020-10-28 2021-02-02 西北工业大学 Earth observation image semantic segmentation method based on self-supervision learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106204572B (en) * 2016-07-06 2020-12-04 合肥工业大学 Road target depth estimation method based on scene depth mapping
CN110084107A (en) * 2019-03-19 2019-08-02 安阳师范学院 A kind of high-resolution remote sensing image method for extracting roads and device based on improvement MRF

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111047630A (en) * 2019-11-13 2020-04-21 芯启源(上海)半导体科技有限公司 Neural network and target detection and depth prediction method based on neural network
CN110929621A (en) * 2019-11-15 2020-03-27 中国人民解放军63729部队 Road extraction method based on topology information refinement
CN111144418A (en) * 2019-12-31 2020-05-12 北京交通大学 Railway track area segmentation and extraction method
CN111582225A (en) * 2020-05-19 2020-08-25 长沙理工大学 Remote sensing image scene classification method and device
CN111723675A (en) * 2020-05-26 2020-09-29 河海大学 Remote sensing image scene classification method based on multiple similarity measurement deep learning
CN111767810A (en) * 2020-06-18 2020-10-13 哈尔滨工程大学 Remote sensing image road extraction method based on D-LinkNet
CN112131967A (en) * 2020-09-01 2020-12-25 河海大学 Remote sensing scene classification method based on multi-classifier anti-transfer learning
CN111985451A (en) * 2020-09-04 2020-11-24 南京航空航天大学 Unmanned aerial vehicle scene detection method based on YOLOv4
CN112308860A (en) * 2020-10-28 2021-02-02 西北工业大学 Earth observation image semantic segmentation method based on self-supervision learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
Deep Semantic Segmentation Neural Networks of Railway Scene; Zhengwei He et al.; 2018 37th Chinese Control Conference (CCC); 2018-10-07; full text *
Deep TEC: Deep Transfer Learning with Ensemble; J. Senthilnath et al.; Remote Sensing; 2020-01-10; full text *
Learning Disentangled Feature Representation; Xin Li et al.; arXiv:2007.11430v1; 2020-06-22; full text *
Road extraction network for remote sensing images based on SPUD-ResNet; Li Daidong et al.; Computer Engineering and Applications; 2020-11-11; vol. 57, no. 23; full text *
Road extraction from remote sensing images based on deep learning; Zhao Yang; China Master's Theses Full-text Database, Engineering Science and Technology II; 2020-02-15; full text *
Scene perception and classification detection of remote sensing roads; Yang Jun et al.; Journal of Computer-Aided Design & Computer Graphics; 2007-04-19; vol. 19, no. 3; full text *

Also Published As

Publication number Publication date
CN113160219A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
Zou et al. Robust lane detection from continuous driving scenes using deep neural networks
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN109711320B (en) Method and system for detecting violation behaviors of staff on duty
US20200302248A1 (en) Recognition system for security check and control method thereof
CN113486726B (en) Rail transit obstacle detection method based on improved convolutional neural network
CN111709416B (en) License plate positioning method, device, system and storage medium
CN110287826B (en) Video target detection method based on attention mechanism
CN111598030A (en) Method and system for detecting and segmenting vehicle in aerial image
Ma et al. A real-time crack detection algorithm for pavement based on CNN with multiple feature layers
CN111126359A (en) High-definition image small target detection method based on self-encoder and YOLO algorithm
CN112232371B (en) American license plate recognition method based on YOLOv3 and text recognition
US11244188B2 (en) Dense and discriminative neural network architectures for improved object detection and instance segmentation
Ye et al. Steering angle prediction YOLOv5-based end-to-end adaptive neural network control for autonomous vehicles
CN114782798A (en) Underwater target detection method based on attention fusion
CN115131760A (en) Lightweight vehicle tracking method based on improved feature matching strategy
CN113160219B (en) Real-time railway scene analysis method for unmanned aerial vehicle remote sensing image
CN113435370B (en) Method and device for acquiring vehicle queuing length based on image feature fusion
CN109508639B (en) Road scene semantic segmentation method based on multi-scale porous convolutional neural network
Yang et al. Robust visual tracking using adaptive local appearance model for smart transportation
CN116977712A (en) Knowledge distillation-based road scene segmentation method, system, equipment and medium
CN116994164A (en) Multi-mode aerial image fusion and target detection combined learning method
Meng et al. A modified fully convolutional network for crack damage identification compared with conventional methods
CN109583584B (en) Method and system for enabling CNN with full connection layer to accept indefinite shape input
CN116863384A (en) CNN-Transfomer-based self-supervision video segmentation method and system
CN114494893B (en) Remote sensing image feature extraction method based on semantic reuse context feature pyramid

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant