CN112949634B - Railway contact net bird nest detection method

Info

Publication number: CN112949634B (granted; earlier publication CN112949634A)
Application number: CN202110249738.1A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: picture, region, template, nest, interest
Inventors: 武斯全, 田震, 廖开沅, 赵宏伟, 许华婷, 徐嘉勃
Applicant and assignee: Beijing Jiaotong University
Legal status: Active (granted)

Classifications

    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N 3/045: Neural network architectures; Combinations of networks
    • G06V 10/751: Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • Y02T 10/40: Engine management systems


Abstract

The invention provides a method for detecting bird nests on railway contact nets. The method comprises the following steps: obtaining interest-domain pictures containing the bird nest region by reverse reasoning from the pictures that contain a bird nest, taking the interest-domain pictures as template pictures, forming a template library from all the template pictures, and training a second-stage YOLO detector with the template library; matching the pictures that contain no bird nest against each template picture in the template library in turn to obtain an interest-domain picture data set, and training a first-stage YOLO detector with the interest-domain picture data set; inputting a picture to be detected into the trained first-stage YOLO detector, which outputs interest-domain pictures, and inputting these into the trained second-stage YOLO detector, which outputs the bird nest detection result for the picture to be detected. The invention overcomes the recognition difficulty caused by the small amount of nest information in the contact net and the resulting lack of obvious features, and enables effective automatic recognition and detection of bird nests on railway contact nets.

Description

Railway contact net bird nest detection method
Technical Field
The invention relates to the technical field of railway contact net foreign matter detection, and in particular to a railway contact net bird nest detection method.
Background
Currently, research in the field of target detection has become one of the hottest topics in computer vision. Marking targets in dynamic video is mainly realized by analyzing a picture sequence acquired by an image sensor, extracting the target scene of interest from the picture sequence, marking the pixel region of the same target, and identifying information such as the position, size, and contour of the target. A typical target marking and recognition method comprises steps such as target feature description, feature information extraction, and target feature matching. Feature information such as the position, color, contour, and texture of the target is extracted; the detection target is then evaluated against this feature information to judge whether the target matches it, completing the labeling of the target.
At present, the methods for detecting foreign matter on high-speed railway contact nets in the prior art include: detection based on the Faster R-CNN detection model, detection based on relative position invariance, and detection of bird nests on railway contact nets using HOG (Histogram of Oriented Gradients) features. The detection based on the Faster R-CNN model introduces an RPN (Region Proposal Network) to generate candidate regions for the target. Faster R-CNN can be seen as consisting of an RPN that produces target candidate regions and a Fast R-CNN detector that uses these candidate regions for predictive classification. First, a picture is input and propagated forward to the last shared convolution layer; the resulting feature map is passed to the RPN on one hand and continues to propagate forward to generate a higher-dimensional feature map on the other. In the RPN, a series of processes generate multiple candidate regions and candidate region scores, and non-maximum suppression is applied to the candidate boxes to reduce their number. The candidate regions scoring above a set threshold are input, together with the previously generated high-dimensional feature map, into the RoI pooling layer to extract the features of the corresponding regions; finally, the region features are connected to a fully connected layer, which outputs the target classification with its score and the bounding-box regression values for target localization.
The detection based on relative position invariance uses machine vision processing: after a preliminary analysis of the picture's color, texture, and shape features, combined with the features of nest-building platforms, the preprocessed detection picture is processed with a Sobel horizontal edge detection operator to obtain the picture edges; the picture is then angle-corrected by probabilistic Hough transform line detection; detection of the foremost hard cross beam is realized by analyzing the line-length relations of the picture; finally, the picture is binarized, and whether a bird nest exists on the cross beam is judged by counting the white area between the hard cross beams.
The detection of bird nests on railway contact nets using HOG features roughly extracts the regions where a bird nest may appear according to prior knowledge, then computes the HOG features of the extracted regions, and finally accurately identifies the bird nest in the picture according to its HOG features with a support vector machine (SVM). Because neural networks hold advantages over other algorithms in image processing, some domestic researchers have combined neural networks with traditional image processing techniques to detect targets of interest in pictures, which can effectively improve detection speed and accuracy.
The prior-art methods for detecting foreign matter on high-speed railway contact nets have the following defects: in actual bird nest detection, because the running environment of the train varies widely and nests of various forms sit within a complex contact net, the accuracy and recall of recognition systems built on HOG and DPM models cannot reach the expected standard. This is because conventional recognition models such as HOG and DPM manually extract features as detection templates and then perform sliding-window matching detection. Such means are susceptible to shape and texture features, so it is difficult to extract standard detection features for bird nests in such an environment.
In addition, manually extracted features struggle even more to meet the requirement of accurate identification when the data volume is insufficient, leading to defects such as a narrow application range and insufficient generality, and bringing many problems to bird nest identification on high-speed railway contact nets.
Disclosure of Invention
The embodiment of the invention provides a method for detecting bird nests on a railway contact net, which is used to effectively and automatically identify and detect bird nests on the railway contact net.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A railway contact net bird nest detection method comprises the following steps:
obtaining interest-domain pictures containing the bird nest region by reverse reasoning from the pictures containing a bird nest in the railway contact net picture data set, taking the interest-domain pictures as template pictures, forming a template library from all the template pictures, and training a second-stage YOLO detector with the template library;
sequentially matching the pictures which do not contain bird nest in the railway contact net picture data set with each template picture in the template library to obtain an interest domain picture data set, and training a first-stage YOLO detector by using the interest domain picture data set;
Inputting a picture to be detected into a trained first-stage YOLO detector, outputting a region-of-interest picture by the first-stage YOLO detector, inputting the region-of-interest picture into a trained second-stage YOLO detector, and outputting a bird nest detection result of the picture to be detected by the second-stage YOLO detector.
Preferably, the obtaining of the interest-domain pictures containing the bird nest region by reverse reasoning from the pictures containing a bird nest in the railway contact net picture data set, taking the interest-domain pictures as template pictures and forming a template library from all the template pictures, comprises:
performing preliminary segmentation on the pictures containing a bird nest in the railway contact net picture data set to obtain basic regions with a set similarity; preliminarily merging the basic regions according to the differences between the regions to obtain a series of preliminary candidate regions; surrounding the preliminary candidate regions with rectangular frames and merging the preliminary candidate regions according to the similarity between the rectangular frames to obtain the final candidate regions; manually labeling the nest positions in the final candidate regions and representing the labeled nest region attributes by rectangles; taking the final candidate regions containing a nest region as the interest domains, taking the interest-domain pictures as template pictures, and forming a template library from all the template pictures.
Preferably, the preliminary segmentation of the pictures containing a bird nest in the railway contact net picture data set to obtain basic regions with a set similarity, and the preliminary merging of the basic regions according to the differences between regions to obtain a series of preliminary candidate regions, comprises:
representing a picture containing a bird nest by an undirected graph G = <V, E>, where each vertex of the graph represents a pixel of the picture and the weight of an edge e = (v_i, v_j) represents the dissimilarity of the adjacent vertex pair i, j; the color distance of the pixels, e.g. the Euclidean distance $w(e) = \sqrt{(R_i - R_j)^2 + (G_i - G_j)^2 + (B_i - B_j)^2}$, represents the dissimilarity w(e) between two pixels, and one basic region is a set of points with minimum dissimilarity;
the intra-class difference of a basic region is defined as the largest edge weight inside the region:
$$Int(C) = \max_{e \in MST(C, E)} w(e)$$
the inter-class difference between two basic regions C_1, C_2 is defined as their minimum connecting edge:
$$Diff(C_1, C_2) = \min_{v_i \in C_1,\, v_j \in C_2,\, (v_i, v_j) \in E} w((v_i, v_j))$$
if two basic regions have no connecting edge, $Diff(C_1, C_2) = \infty$;
when the condition $Diff(C_1, C_2) \le \min(Int(C_1) + \tau(C_1),\ Int(C_2) + \tau(C_2))$ is satisfied, it is judged that the two basic regions C_1, C_2 can be merged;
where τ(C) is a threshold function that weights regions of isolated points:
$$\tau(C) = k / \|C\|$$
and preliminarily merging each basic region to obtain a series of preliminary candidate regions.
Preferably, the surrounding of the preliminary candidate regions with rectangular frames and merging the preliminary candidate regions according to the similarity between the rectangular frames to obtain the final candidate regions comprises:
surrounding each preliminary candidate region with a rectangular frame, the position of a rectangular frame C being represented by the quadruple (x, y, w, h), where x, y are the coordinates of the upper-left corner of the rectangular frame and w, h are its width and height;
the color distance between the rectangular frame c_i of preliminary candidate region r_i and the rectangular frame c_j of preliminary candidate region r_j is:
$$S_{colour}(r_i, r_j) = \sum_{k=1}^{n} \min(c_i^k, c_j^k)$$
where $c_i^k$ represents the pixel proportion of the k-th bin of the color histogram;
the texture distance between the rectangular frames c_i and c_j is:
$$S_{texture}(r_i, r_j) = \sum_{k=1}^{n} \min(t_i^k, t_j^k)$$
where $t_i^k$ represents the k-th-dimension pixel proportion of the texture histogram;
for the preliminary candidate regions r_i and r_j, the size similarity is:
$$S_{size}(r_i, r_j) = 1 - \frac{size(r_i) + size(r_j)}{size(im)}$$
where size(r_i) represents the size of the rectangular frame corresponding to region r_i and size(im) represents the size of the original picture to be segmented;
the fill similarity of the preliminary candidate regions r_i and r_j is:
$$S_{fill}(r_i, r_j) = 1 - \frac{size(BB_{ij}) - size(r_i) - size(r_j)}{size(im)}$$
where size(BB_{ij}) represents the size of the circumscribed rectangle of regions r_i and r_j;
the total similarity between the preliminary candidate regions r_i and r_j is:
$$S(r_i, r_j) = a_1 S_{colour}(r_i, r_j) + a_2 S_{texture}(r_i, r_j) + a_3 S_{size}(r_i, r_j) + a_4 S_{fill}(r_i, r_j)$$
where a_1, a_2, a_3, a_4 are the corresponding weight values;
when the total similarity S(r_i, r_j) between the preliminary candidate regions r_i and r_j is greater than the set merging threshold, the preliminary candidate regions r_i and r_j are merged to obtain the final candidate region.
Preferably, the manual labeling of the nest positions in the final candidate regions, representing the labeled nest region attributes by rectangles, taking the final candidate regions containing a nest region as the interest domains, taking the interest-domain pictures as template pictures and forming a template library from all the template pictures, comprises:
representing the final candidate region C by a rectangle whose position attribute is the quadruple (x, y, w, h);
labeling the nest position in the final candidate region and representing the labeled nest region attribute by a rectangle with position attribute (bx, by, bw, bh); the interest domain is a candidate region containing the nest region, and its position coordinates satisfy:
$$x \le bx,\quad y \le by,\quad x + w \ge bx + bw,\quad y + h \ge by + bh$$
and it satisfies a threshold condition bounding how small the nest may be relative to the region (preventing excessive merging), e.g.:
$$\frac{bw \cdot bh}{w \cdot h} \ge \varepsilon$$
for a set threshold ε;
and taking the interest-domain pictures as template pictures and forming a template library from all the template pictures.
Preferably, the sequential matching of the nest-free pictures of the railway contact net picture data set against each template picture in the template library to obtain the interest-domain picture data set comprises:
sequentially matching a nest-free picture to be matched against each template picture in the template library, where the template picture is T, the nest-free picture to be matched is I, the width of the template picture is w and its height is h, and R denotes the matching result; the matching method (normalized correlation coefficient matching) is expressed as:
$$R(x, y) = \frac{\sum_{x', y'} T'(x', y') \cdot I'(x + x', y + y')}{\sqrt{\sum_{x', y'} T'(x', y')^2 \cdot \sum_{x', y'} I'(x + x', y + y')^2}}$$
where:
$$T'(x', y') = T(x', y') - \frac{1}{w \cdot h} \sum_{x'', y''} T(x'', y'')$$
$$I'(x + x', y + y') = I(x + x', y + y') - \frac{1}{w \cdot h} \sum_{x'', y''} I(x + x'', y + y'')$$
the larger the R value, the higher the similarity between the rectangular area of size (w, h) at position (x, y) of the picture to be matched and the template; the maximum similarity value is taken as the template matching result, and the template matching value is required to exceed a threshold parameter;
noting $Rs(T, I) = \max_{x, y \in I} R(x, y)$,
each template picture corresponds to an optimal matching value Rs, with corresponding rectangular matching frame position (x, y, w, h); the primary template matching results form a result set S:
$$S = \{(Rs, x, y, w, h) \mid Rs(T, I) > c,\ T \in \text{template library}\}$$
where c is the matching threshold parameter;
the result set S is arranged in descending order of Rs value; the condition for two rectangular matching frames s and t to intersect is:
max(x(s), x(t)) ≤ min(x(s) + w(s), x(t) + w(t))
max(y(s), y(t)) ≤ min(y(s) + h(s), y(t) + h(t))
and the result set S is traversed in order: if the current rectangular matching frame intersects an already labeled rectangular matching frame, the label is discarded; otherwise the current rectangular matching frame is labeled in the VOC format, and all the labeled rectangular matching frames form the interest-domain data set.
Preferably, the first-stage YOLO detector and the second-stage YOLO detector include: YOLOv3-SPP, YOLOv4, and Faster R-CNN.
Preferably, the confidence and expectation of the first-stage YOLO detector are as follows:
$$Confidence(Zone) = Pr(Zone) \times IOU^{truth}_{pred}$$
$$E(Zone) = \frac{I(Zone)}{I(image)} \cdot S^2 \cdot B \cdot \overline{IOU}$$
where Pr(Zone) is the probability that the current grid contains an object to be detected (an interest domain); during training, Pr(Zone) is 1 if the grid contains an interest domain and 0 otherwise; $IOU^{truth}_{pred}$ is the intersection-over-union of the grid's predicted annotation frame and the rectangular frame where the interest domain actually lies; B is the number of annotation frames predicted per grid; S² is the total number of grids the picture is divided into; $\overline{IOU}$ is the average of all prediction-frame IOUs over the grids where objects are located; I(Zone) is the size of the interest domain; I(image) is the size of the original picture; and E(Zone) is the sum of the total IOUs given by the picture.
The confidence of the second-stage YOLO detector is:
$$Confidence(Birdnest) = Pr(Birdnest) \times IOU^{truth}_{pred} \times P(distribution)$$
where Pr(Birdnest) is the probability that the current grid contains a bird nest: if the grid contains a nest, Pr(Birdnest) is 1, and 0 otherwise; $IOU^{truth}_{pred}$ is the intersection-over-union of the grid's predicted annotation frame and the rectangular frame where the nest actually lies; P(distribution) is the probability that the picture's nest lies in an interest domain, and since all nests lie in interest domains, this term is 1.
The expectation of the nest predictions in the interest domain is:
$$E_{ij}(birdnest) = Pr_i(birdnest) \cdot IOU_i^j \cdot Confidence(Zone)$$
where Pr_i(birdnest) is the probability that the i-th grid of the sub-picture contains a nest; $IOU_i^j$ is the intersection-over-union of the j-th rectangular frame predicted by the i-th grid of the sub-picture with the nest; Confidence(Zone) is the confidence with which one rectangular frame predicted by the original-picture grid marks the interest domain; each element of this matrix represents the degree of certainty with which a prediction frame made by the divided grids of the original picture can mark the bird nest.
The expectation of the cascaded prediction combines the two stages:
$$E_{cascade} = E(Zone) \cdot E(birdnest \mid Zone)$$
where $\overline{IOU}_{Zone}$ is the average intersection-over-union of the prediction annotation frames made by the grids containing the nest in the interest-domain sub-picture with the rectangle where the nest lies, and $\overline{IOU}$ denotes the average IOU value of the grid prediction anchor frames.
The precision of the cascaded prediction is:
$$P = F(birdnest, Zone, N) \times F(Zone, image, M) > F(birdnest, image, N).$$
According to the technical scheme provided by the embodiment of the invention, the method can effectively solve the problem of accurately and quickly identifying and tracking bird nests on the high-speed railway contact net; it can effectively overcome the recognition difficulty caused by the lack of obvious shape or texture features due to the small amount of nest information in the contact net; therefore, bird nests on the railway contact net can be effectively and automatically identified and detected.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and a person skilled in the art could obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an implementation of a method for detecting a bird nest of a railway contact net according to an embodiment of the present invention;
Fig. 2 is a process flow diagram of a method for detecting a bird nest of a railway contact net according to an embodiment of the present invention;
FIG. 3 is a flowchart of a process for initially dividing and merging pictures to generate a basic region according to an embodiment of the present invention;
fig. 4 (a) is a schematic diagram of an original picture provided by an embodiment of the present invention, and fig. 4 (b) is a schematic diagram of a series of basic regions obtained after preliminary segmentation and merging.
FIG. 5 is a flowchart of a process for merging basic regions according to inter-region differences to obtain a series of candidate regions according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of labeling a region of interest with a rectangular box according to an embodiment of the present invention;
FIG. 7 is a flowchart of a process for performing template matching between a picture to be detected and a template picture according to an embodiment of the present invention;
Fig. 8 (a) is a schematic diagram of an original picture provided by an embodiment of the present invention, and fig. 8 (b) is a schematic diagram of template matching of the original picture;
Fig. 9 is a schematic diagram of dividing a picture into S×S grids according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a single-stage network prediction according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a first-level network prediction of a cascaded network according to an embodiment of the present invention;
Fig. 12 is a schematic diagram of second-level network prediction of a cascaded network according to an embodiment of the present invention;
FIG. 13 is a schematic view of an IOU curve during YOLOv3-GIOU training according to an embodiment of the present invention;
FIG. 14 is a schematic view of an IOU curve during YOLOv4-CIOU training according to an embodiment of the present invention;
FIG. 15 is a schematic diagram comparing YOLOv3-SPP second-stage detection with direct detection IOU according to an embodiment of the present invention;
FIG. 16 is a schematic diagram comparing YOLOv4 second-stage detection with direct detection IOU according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wireless connection or coupling. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
To facilitate an understanding of the embodiments of the invention, further explanation is given below with reference to several specific embodiments illustrated in the accompanying drawings, which should in no way be taken to limit the embodiments of the invention.
The method based on the convolutional neural network uses convolution kernels as feature extractors, so the picture can be used directly as the input of the network; the nest features obtained through training avoid the complex feature extraction and data reconstruction processes of traditional recognition algorithms, so accuracy and recall can be markedly improved.
In the method, a Selective Search algorithm is used for clustering to obtain a number of candidate regions, and the interest domains are found among the candidate regions according to the position of the bird nest. Template matching with all the obtained interest domains is performed over the picture set, labeling the interest domains of all pictures. A YOLO network is built and trained to recognize the interest domains of all pictures, and another YOLO network is built and trained to recognize the bird nest within an interest domain. An unknown picture sample undergoes first-stage recognition to find the interest domains, and then second-stage recognition to find the nest within the interest domains. This cascaded recognition greatly improves recognition accuracy and computational efficiency.
Example 1
The implementation schematic diagram of the railway contact net bird nest detection method provided by the embodiment of the invention is shown in fig. 1, the processing flow is shown in fig. 2, and the method comprises the following processing steps:
Step S210: obtaining an interest domain picture containing a bird nest region through reverse reasoning according to a picture containing a bird nest in the railway contact net picture data set, taking the interest domain picture as a template picture, and forming a template library according to all the template pictures.
The interest domain is a region, and a region is a whole with relatively high internal similarity; first, all sets of regions with high similarity in the picture must be found. Searching for these regions is the process of segmenting and merging the picture.
First, the pictures containing a bird nest in the railway contact net picture data set are preliminarily segmented to obtain a large number of basic regions with similarity. The basic regions are then merged to obtain a series of candidate regions. A flowchart of the process of generating basic regions by preliminary segmentation and merging of pictures according to an embodiment of the invention is shown in fig. 3; the process is as follows: a picture can be represented by an undirected graph G = <V, E>, where each vertex of the undirected graph represents one pixel of the picture and the weight of an edge e = (v_i, v_j) represents the dissimilarity of the adjacent vertex pair i, j. A pixel attribute such as the color distance of the pixels, e.g. the Euclidean distance between their RGB values, can represent the dissimilarity w(e) between two pixels. A basic region is the point set with minimum dissimilarity, so a basic region is a minimum spanning tree containing the point set, and the preliminary segmentation of the picture finds the forest formed by the minimum spanning trees of the picture.
The differences determine whether basic regions merge. The intra-class difference of a basic region is defined as:
$$Int(C) = \max_{e \in MST(C, E)} w(e)$$
where C represents a basic region in the merging process, e represents a connecting edge inside the region, and the weight of a connecting edge represents the dissimilarity between pixels; the intra-class difference Int(C) is the maximum connecting-edge weight inside the region.
The inter-class difference is defined as the smallest connecting edge between two basic regions:
$$Diff(C_1, C_2) = \min_{v_i \in C_1,\, v_j \in C_2,\, (v_i, v_j) \in E} w((v_i, v_j))$$
where C_1 and C_2 represent two different regions and v_i and v_j represent the two vertices of a connecting edge between them; the inter-class difference Diff(C_1, C_2) is the minimum weight of the edges connecting the two regions.
In particular, if two basic regions have no connecting edge, $Diff(C_1, C_2) = \infty$.
This yields the criterion for merging two basic regions: when the condition
$$Diff(C_1, C_2) \le \min(Int(C_1) + \tau(C_1),\ Int(C_2) + \tau(C_2))$$
is satisfied, it is judged that the basic regions C_1 and C_2 can be merged,
where τ(C) is a threshold function that weights regions of isolated points:
$$\tau(C) = k / \|C\|$$
in which k is a manually set parameter and ‖C‖ is the number of vertices of the basic region C; the size of the regions produced by the segmentation can be controlled by adjusting k.
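As an illustration only, this preliminary graph-based segmentation can be sketched with scikit-image's implementation of the Felzenszwalb-Huttenlocher algorithm, where the library's `scale` parameter plays the role of the manually set k in τ(C) = k/‖C‖; the file name and parameter values here are hypothetical, not the patent's own implementation:

```python
# Minimal sketch of the preliminary segmentation step, assuming
# scikit-image is available; not the patent's own code.
import numpy as np
from skimage import io
from skimage.segmentation import felzenszwalb

image = io.imread("catenary.jpg")      # hypothetical input picture
segments = felzenszwalb(image,
                        scale=300,     # plays the role of k: larger -> larger regions
                        sigma=0.8,     # Gaussian pre-smoothing of pixel colors
                        min_size=50)   # suppress tiny isolated regions

# `segments` assigns every pixel a basic-region label.
print("number of basic regions:", len(np.unique(segments)))
```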
The picture point sets are merged to obtain a number of basic regions. Fig. 4 (a) is a schematic diagram of an original picture provided by an embodiment of the present invention, and fig. 4 (b) is a schematic diagram of the series of basic regions obtained after preliminary segmentation and merging.
FIG. 5 is a flowchart of a process for merging basic regions according to inter-region differences to obtain a series of candidate regions according to an embodiment of the present invention. The specific process is as follows: the formed basic regions are merged to infer the interest domain where the bird nest is located. The basic region and the merged result are represented by a rectangle, whose position can be represented by the quadruple (x, y, w, h), where x, y are the coordinates of the upper-left corner of the rectangular frame and w, h are its width and height.
For a region C, the position attribute of its representative rectangle is computed as the bounding box of the region's pixels:
$$x = \min_{v \in C} x_v,\quad y = \min_{v \in C} y_v,\quad w = \max_{v \in C} x_v - x,\quad h = \max_{v \in C} y_v - y$$
First, the differences between regions are calculated; they can be evaluated by four indexes.
The color distance depends on the minimum of the corresponding bins of the two regions' color histograms:
$$S_{colour}(r_i, r_j) = \sum_{k=1}^{n} \min(c_i^k, c_j^k)$$
The number of pixels in each bin of the different color channels of the rectangular image corresponding to r_i is counted to obtain an n-dimensional color histogram, where $c_i^k$ represents the pixel proportion of the k-th bin of the color histogram.
The texture distance depends on the minimum of the corresponding bins of the two regions' fast SIFT feature histograms:
$$S_{texture}(r_i, r_j) = \sum_{k=1}^{n} \min(t_i^k, t_j^k)$$
The number of pixels of each SIFT feature in each bin of the different color channels of the rectangular image corresponding to region r_i is counted to obtain an n-dimensional texture histogram, where $t_i^k$ represents the k-th-dimension pixel proportion of the texture histogram.
Merging between small regions is performed preferentially; small regions are given a higher merging weight:
$$S_{size}(r_i, r_j) = 1 - \frac{size(r_i) + size(r_j)}{size(im)}$$
where size(r_i) represents the size of the rectangular image corresponding to region r_i and size(im) represents the size of the original picture to be segmented.
Regions whose circumscribed rectangles overlap over a large area are merged preferentially:
$$S_{fill}(r_i, r_j) = 1 - \frac{size(BB_{ij}) - size(r_i) - size(r_j)}{size(im)}$$
where size(BB_{ij}) represents the size of the circumscribed rectangle of regions r_i and r_j; the remaining parameters are as above.
The above differences are weighted to obtain the total similarity between regions:
$$S(r_i, r_j) = a_1 S_{colour}(r_i, r_j) + a_2 S_{texture}(r_i, r_j) + a_3 S_{size}(r_i, r_j) + a_4 S_{fill}(r_i, r_j)$$
where a_1, a_2, a_3, a_4 are the corresponding weight values.
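For illustration, a minimal sketch of the weighted similarity S(r_i, r_j) follows; it assumes the color and texture histograms are already L1-normalised numpy arrays, and the weights a_1..a_4 are placeholder values, not the patent's:

```python
import numpy as np

def bounding_union(r, s):
    """Smallest rectangle (x, y, w, h) enclosing rectangles r and s."""
    x = min(r[0], s[0])
    y = min(r[1], s[1])
    w = max(r[0] + r[2], s[0] + s[2]) - x
    h = max(r[1] + r[3], s[1] + s[3]) - y
    return (x, y, w, h)

def similarity(r_i, r_j, im_size, a=(1.0, 1.0, 1.0, 1.0)):
    """Total similarity S(r_i, r_j) between two candidate regions.

    Each region is a dict with L1-normalised 'color' and 'texture'
    histograms (numpy arrays), a pixel count 'size', and a bounding
    'rect' (x, y, w, h). The weights a_1..a_4 are placeholders.
    """
    s_colour = np.minimum(r_i["color"], r_j["color"]).sum()
    s_texture = np.minimum(r_i["texture"], r_j["texture"]).sum()
    # Small regions are merged first.
    s_size = 1.0 - (r_i["size"] + r_j["size"]) / im_size
    # Pairs whose joint bounding box contains little empty space merge first.
    bb = bounding_union(r_i["rect"], r_j["rect"])
    s_fill = 1.0 - (bb[2] * bb[3] - r_i["size"] - r_j["size"]) / im_size
    return a[0] * s_colour + a[1] * s_texture + a[2] * s_size + a[3] * s_fill
```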
The foregoing segmentation performs a pixel-difference-based segmentation of the original picture to be segmented: it assigns each pixel to a basic region, and a basic region is a connected set of pixels of irregular shape. The segmentation based on the judgment conditions builds on the result of this first basic segmentation: in the second segmentation, the irregular point sets from the first segmentation are first enclosed in rectangular frames, yielding a series of rectangular regions whose extent is larger than the enclosed basic regions; a stricter similarity judgment is then applied to the rectangular regions again, and they are merged according to similarity to obtain the final candidate regions; the candidate regions containing a bird nest are the interest domains being sought. The first segmentation is performed only once, according to whether pixels are similar; the second segmentation, built on the first, adds considerations such as size, features, and shape, and finally yields the rectangular segmented regions containing the bird nest.
The nest positions of the input data set are labeled manually; the labeled nest region attribute is still represented by a rectangle, whose position attribute is (bx, by, bw, bh). The interest domain is a candidate region containing the nest region, and its position coordinates satisfy:
$$x \le bx,\quad y \le by,\quad x + w \ge bx + bw,\quad y + h \ge by + bh$$
At the same time, to prevent excessive merging, the interest domain also satisfies a threshold condition bounding how small the nest may be relative to the region, e.g.:
$$\frac{bw \cdot bh}{w \cdot h} \ge \varepsilon$$
for a set threshold ε.
The candidate region where the nest is located is the interest domain obtained by the reasoning; as shown in fig. 6, the labeled rectangular box is the interest domain obtained by the reasoning.
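A minimal sketch of this reverse-reasoning selection is given below: a candidate rectangle is accepted as an interest domain when it fully contains the labeled nest rectangle and the nest-to-region area ratio clears a floor (the `min_ratio` value is an illustrative assumption):

```python
def contains(region, nest):
    """True if candidate rectangle (x, y, w, h) fully encloses the nest box."""
    x, y, w, h = region
    bx, by, bw, bh = nest
    return x <= bx and y <= by and x + w >= bx + bw and y + h >= by + bh

def is_interest_region(region, nest, min_ratio=0.01):
    """Containment plus an area-ratio floor to reject over-merged regions.

    min_ratio is an illustrative guard: if the nest occupies too small a
    fraction of the candidate, the candidate is considered over-merged.
    """
    x, y, w, h = region
    bx, by, bw, bh = nest
    return contains(region, nest) and (bw * bh) / (w * h) >= min_ratio
```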
Step S220: and sequentially matching the pictures which do not contain bird nest in the railway contact net picture data set with each template picture in the template library to obtain an interest domain picture data set.
Fig. 7 is a flowchart of the process of template matching between the picture to be detected and the template pictures according to an embodiment of the present invention. The specific process is: the interest-domain pictures obtained by the above reasoning are taken as template pictures, and all template pictures form a template library. Template matching is performed with this template library on the pictures of the railway contact net picture data set that contain no bird nest; that is, each picture is traversed, and all interest domains of the picture to be detected are labeled by normalized correlation coefficient matching.
Assuming the template picture is T, the picture to be matched (containing no bird nest) is I, the width of the template picture is w and its height is h, and R denotes the matching result, the matching method can be expressed as:
$$R(x, y) = \frac{\sum_{x', y'} T'(x', y') \cdot I'(x + x', y + y')}{\sqrt{\sum_{x', y'} T'(x', y')^2 \cdot \sum_{x', y'} I'(x + x', y + y')^2}}$$
where:
$$T'(x', y') = T(x', y') - \frac{1}{w \cdot h} \sum_{x'', y''} T(x'', y'')$$
$$I'(x + x', y + y') = I(x + x', y + y') - \frac{1}{w \cdot h} \sum_{x'', y''} I(x + x'', y + y'')$$
The larger the R value, the higher the similarity between the rectangular area of size (w, h) at position (x, y) of the picture to be matched and the template; the maximum similarity value is taken as the template matching result, and the template matching value is required to exceed the threshold parameter.
Note $Rs(T, I) = \max_{x, y \in I} R(x, y)$.
Template matching is performed first: the picture to be matched is matched against each template picture in the template library in turn; each template picture corresponds to an optimal matching value Rs, and the position of the corresponding rectangular matching frame is (x, y, w, h). The primary template matching results form a result set S:
$$S = \{(Rs, x, y, w, h) \mid Rs(T, I) > c,\ T \in \text{template library}\}$$
where c is the matching threshold parameter.
The result set S is arranged in descending order of Rs values; the rectangular matching frames in the result set may intersect.
For two rectangle matching boxes s, the condition that the t rectangles intersect is:
max(x(s),x(t))≤min(x(s)+w(s),x(t)+w(t))
max(y(s),y(t))≤min(y(s)+h(s),y(t)+h(t))
The result set S is traversed in order: if the current rectangular matching frame intersects an already labeled rectangular matching frame, the label is discarded; otherwise the current rectangular matching frame is labeled in the VOC format, and all the labeled rectangular matching frames form the interest-domain data set. The best matching result given by the template matching algorithm is a rectangular frame, and the matching result is (x, y, w, h); for a single object, each row of the label.txt file has the form "0 x y w h" (class index followed by the box coordinates), one row per rectangular label box. The .txt markup is then converted to XML files in the VOC format used for deep learning training with a txt-to-xml script.
Fig. 8 (a) is a schematic diagram of an original picture provided by an embodiment of the present invention, and fig. 8 (b) is a schematic diagram of template matching of the original picture;
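As a sketch of this procedure, OpenCV's `cv2.matchTemplate` with the `TM_CCOEFF_NORMED` method computes exactly the normalized correlation coefficient R above, and the greedy rejection of intersecting matches follows the traversal rule of the preceding paragraph; the threshold value c here is illustrative:

```python
import cv2

def intersects(s, t):
    """Intersection test for two (x, y, w, h) rectangles."""
    return (max(s[0], t[0]) <= min(s[0] + s[2], t[0] + t[2]) and
            max(s[1], t[1]) <= min(s[1] + s[3], t[1] + t[3]))

def match_templates(image, templates, c=0.8):
    """Greedy interest-domain labelling by normalised-correlation matching.

    `templates` is a list of template pictures; c is the match threshold
    (illustrative value). Returns non-intersecting (x, y, w, h) boxes,
    best matches first.
    """
    candidates = []
    for t in templates:
        h, w = t.shape[:2]
        result = cv2.matchTemplate(image, t, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)   # Rs(T, I)
        if max_val > c:
            candidates.append((max_val, (max_loc[0], max_loc[1], w, h)))

    candidates.sort(reverse=True)          # descending Rs order
    kept = []
    for _, box in candidates:
        if not any(intersects(box, k) for k in kept):
            kept.append(box)               # label in VOC format downstream
    return kept
```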
Step S230: and constructing a contact net bird nest detection model based on the cascade neural network, and training the contact net bird nest detection model. The contact net nest detection model comprises a first-stage YOLO detector and a second-stage YOLO detector, the second-stage YOLO detector is trained by using the template library, and the first-stage YOLO detector is trained by using the interest domain picture data set.
The first-stage YOLO detector uses the 3900 pictures labeled by template matching against the template library; the label file of each picture is obtained during the template matching process. Training is performed with the YOLOv3-SPP neural network.
The pictures for the second-stage YOLO detector are the interest domains (containing a bird nest) obtained by the reverse reasoning algorithm; the nest position is given relative to the interest domain, and the label.txt files are obtained during the reverse reasoning. Training is performed with the YOLOv3-SPP neural network.
Step S240: inputting a picture to be detected into a trained first-stage YOLO detector, outputting a region-of-interest picture by the first-stage YOLO detector, inputting the region-of-interest picture into a trained second-stage YOLO detector, and outputting a bird nest detection result of the picture to be detected by the second-stage YOLO detector.
Contact net bird nests are small and lack obvious shape and texture features, so classifying such pictures with existing hand-designed features hardly yields ideal results. Deep learning offers a viable solution: the YOLO neural network, a common target detection network, has powerful detection capability. The invention detects contact net bird nests with a cascade of two prediction networks and adopts the YOLOv3-SPP network structure as the detector of each stage.
The YOLO neural network takes the entire picture as input, outputs a predicted bounding box and the class to which it belongs.
The algorithm first divides a picture into S×S grids, as shown in fig. 9; the grid containing the center of a labeled object is responsible for predicting that object. Each grid predicts B bounding boxes, each of which regresses its own position (x, y, w, h) and a confidence.
YOLOv3 changes the size of the propagated tensor by changing the stride of the convolution kernels, and increases the model's prediction speed by computing bounding boxes in advance. YOLOv3 determines the position of a bounding box by predicting the relative offset of its center point from the upper-left corner of the corresponding grid. t_x and t_y are normalized so that the predicted values lie between 0 and 1, which ensures that the center point of the bounding box stays within its grid:
$$b_x = \sigma(t_x) + c_x$$
$$b_y = \sigma(t_y) + c_y$$
$$b_w = p_w e^{t_w}$$
$$b_h = p_h e^{t_h}$$
t_x, t_y, t_w, t_h are the predicted outputs of the model; c_x and c_y represent the coordinates of the grid; p_w and p_h represent the size of the prior bounding box; and b_x, b_y, b_w, b_h are the center coordinates and size of the predicted bounding box.
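The decoding above can be sketched as follows; this is a plain numpy illustration of the standard YOLOv3 equations, not the patent's code:

```python
import numpy as np

def decode_box(t, grid_xy, prior_wh):
    """Decode YOLOv3 raw outputs (t_x, t_y, t_w, t_h) into a bounding box.

    grid_xy = (c_x, c_y) is the cell's upper-left corner; prior_wh = (p_w, p_h)
    is the clustered prior-box size, both in feature-map units.
    """
    tx, ty, tw, th = t
    sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))
    bx = sigmoid(tx) + grid_xy[0]   # centre stays inside the cell
    by = sigmoid(ty) + grid_xy[1]
    bw = prior_wh[0] * np.exp(tw)   # width scales the prior box
    bh = prior_wh[1] * np.exp(th)
    return bx, by, bw, bh
```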
The confidence reflects whether the grid correctly predicts the object to be detected and the deviation of the bounding box from the object's actual position. Confidence can be expressed by the following formula:
$$Confidence = Pr(Object) \times IOU^{truth}_{pred}$$
The IOU is the intersection-over-union of the bounding box and the object annotation box, computed as:
$$IOU^{truth}_{pred} = \frac{area(box_{truth} \cap box_{pred})}{area(box_{truth} \cup box_{pred})}$$
where (tx, ty, tw, th) denotes the truth annotation box position attribute (x, y, w, h) and (px, py, pw, ph) denotes the predicted bounding box position attribute (x, y, w, h).
Pr(Object) represents the probability that the grid contains an object to be predicted. During training, Pr(Object) is 0 if no object to be detected falls into the grid, and 1 otherwise.
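For reference, a minimal sketch of the IOU computation on (x, y, w, h) rectangles:

```python
def iou(truth, pred):
    """Intersection over union of two (x, y, w, h) rectangles."""
    tx, ty, tw, th = truth
    px, py, pw, ph = pred
    ix = max(0.0, min(tx + tw, px + pw) - max(tx, px))  # overlap width
    iy = max(0.0, min(ty + th, py + ph) - max(ty, py))  # overlap height
    inter = ix * iy
    union = tw * th + pw * ph - inter
    return inter / union if union > 0 else 0.0
```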
During training, each grid corresponds to a pixel matrix, and each pixel matrix is used as an input to the neural network. For each grid, the network outputs, for each bounding box, (x, y, w, h, conf, c_1, ..., c_n), where (x, y, w, h) gives the location of the bounding box, conf is the confidence of the bounding box, and c_1, ..., c_n are the class probabilities of the object.
YOLOv3 uses multi-scale features to detect targets. The feature map obtained through 32 times downsampling has larger receptive field, and is suitable for detecting targets with larger sizes in pictures; obtaining a characteristic diagram with a medium-scale receptive field through 16 times downsampling, and being suitable for detecting a medium-size target in a picture; the feature map obtained through 8 times downsampling has smaller receptive field, and is suitable for detecting targets with smaller sizes in pictures.
YOLOv3 obtains the prior box sizes by a K-means clustering method, setting 3 prior boxes of different sizes for each downsampling scale, 9 prior boxes of 9 sizes in total. When the resolution of the input picture is 416×416, the allocation of the 9 prior boxes over the feature maps of different sizes is shown in Table 1.
Table 1. Feature maps and prior boxes
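As an illustrative sketch (not the patent's code), prior boxes of this kind are commonly clustered with K-means under a 1 - IOU distance over the annotated box sizes:

```python
import numpy as np

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    """Cluster annotated box sizes into k prior boxes with a 1 - IOU distance.

    `wh` is an (N, 2) array of annotation widths and heights; this is a
    sketch of the usual YOLO anchor clustering, not the patent's exact code.
    """
    wh = np.asarray(wh, dtype=float)
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].copy()
    for _ in range(iters):
        # IOU of every box against every anchor, assuming shared upper-left corners.
        inter = (np.minimum(wh[:, None, 0], anchors[None, :, 0]) *
                 np.minimum(wh[:, None, 1], anchors[None, :, 1]))
        union = (wh[:, None, 0] * wh[:, None, 1] +
                 anchors[None, :, 0] * anchors[None, :, 1] - inter)
        assign = np.argmax(inter / union, axis=1)      # max IOU == min (1 - IOU)
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(axis=1))]   # sorted small to large
```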
Directly using a YOLO network to detect contact net bird nests is not ideal, because the nest occupies a small proportion of the picture, so a large number of grids perform invalid computation and computational resources are wasted.
The YOLO neural network uses prior anchor boxes clustered on the data set. For small-scale objects, the IOU values of the large-size and medium-size anchor boxes are not very high, so the overall accuracy is not high.
Fig. 10 is a schematic diagram of single-stage network prediction provided in an embodiment of the present invention; as shown in fig. 10, only the grids in the upper-left corner perform effective computation.
The expectation of a single network's predictions can be written (from the definitions above) as:
$$E(birdnest) = \frac{I(birdnest)}{I(image)} \cdot S^2 \cdot B \cdot \overline{IOU}$$
where $\overline{IOU}$ is the average IOU of the effective prediction boxes and $\frac{I(birdnest)}{I(image)}$ is the proportion of the picture occupied by the nest.
In the contact net environment, the distribution of bird nests, being affected by physical factors, has continuity; that is, the position of a bird nest in the contact net is bounded. The obtained interest domains are all the regions where a bird nest may appear. This can be formulated as:
P(distribution) = P(a bird nest is present in the interest domain of the picture | the picture contains a bird nest) = 1
Based on this prior condition on the nest distribution, a cascaded YOLO neural network can first identify the interest domains and then identify the nest positions from the interest-domain prediction sub-pictures given by the network.
(1) First-stage prediction:
Fig. 11 is a schematic diagram of first-stage network prediction of the cascaded network according to an embodiment of the present invention. The segmentation effect is shown in fig. 11, where the black frame is the interest domain given by template matching and the black dividing lines are the grids divided by the YOLO network. Only the four grids in the upper-left corner are responsible for predicting the interest domain and carry confidence; the confidence of the other grids is 0.
The confidence and expectation of the first-stage network detection are as follows:
$$Confidence(Zone) = Pr(Zone) \times IOU^{truth}_{pred}$$
$$E(Zone) = \frac{I(Zone)}{I(image)} \cdot S^2 \cdot B \cdot \overline{IOU}$$
where Pr(Zone) is the probability that the current grid contains an object to be detected (an interest domain); during training, Pr(Zone) is 1 if the grid contains an interest domain and 0 otherwise; $IOU^{truth}_{pred}$ is the intersection-over-union of the grid's predicted annotation frame and the rectangular frame where the interest domain actually lies. B is the number of annotation frames predicted per grid, and S² is the total number of grids the picture is divided into; $\overline{IOU}$ is the average of all prediction-frame IOUs over the grids where objects are located. I(Zone) is the size of the interest domain and I(image) is the size of the original picture. E(Zone) is the sum of the total IOUs given by the picture and represents the overall degree of certainty of correct object prediction.
(2) Second-stage prediction:
Fig. 12 is a schematic diagram of second-stage network prediction of the cascaded network according to an embodiment of the present invention. The second-stage YOLO network takes the interest-domain sub-picture set predicted by the first-stage network as input and performs YOLO cascade detection on it. This is essentially a secondary division of the grid: during training it increases the IOU of the bounding boxes as much as possible to improve training accuracy, and it performs as much effective computation as possible.
The confidence of the second-stage network is:
$$Confidence(Birdnest) = Pr(Birdnest) \times IOU^{truth}_{pred} \times P(distribution)$$
where Pr(Birdnest) is the probability that the current grid contains an object to be detected (a bird nest); during training, Pr(Birdnest) is 1 if the grid contains a nest and 0 otherwise; $IOU^{truth}_{pred}$ is the intersection-over-union of the grid's predicted annotation frame and the rectangular frame where the nest actually lies. P(distribution) is the probability that the picture's nest lies in an interest domain; since all nests lie in interest domains, this term is 1.
The expectation of the nest predictions in the interest domain is:
$$E_{ij}(birdnest) = Pr_i(birdnest) \cdot IOU_i^j \cdot Confidence(Zone)$$
where Pr_i(birdnest) is the probability that the i-th grid of the sub-picture contains a nest; $IOU_i^j$ is the intersection-over-union of the j-th rectangular frame predicted by the i-th grid of the sub-picture with the nest; and Confidence(Zone) is the confidence with which one rectangular frame predicted by the original-picture grid marks the interest domain. Each element of this matrix represents the degree of certainty with which a prediction frame made by the divided grids of the original picture can mark the bird nest.
The expectation of the cascaded prediction combines the two stages:
$$E_{cascade} = E(Zone) \cdot E(birdnest \mid Zone)$$
where $\overline{IOU}_{Zone}$ is the average intersection-over-union of the prediction annotation frames made by the grids containing the nest in the interest-domain sub-picture with the rectangle where the nest lies. (The prediction annotation frames are given by the prior clustering algorithm and are proportional to the sub-picture size, so the sub-picture prediction boxes fit the shape of the nest better than the prediction boxes in the original large picture, and the average IOU value is therefore larger.) The remaining parameters are as above.
$\overline{IOU}$ denotes the average IOU value of the grid prediction anchor boxes; since anchor box sizes are given a priori by clustering on the data set, they are positively correlated with the size of the input picture. The closer the ratio of the object to be detected to the anchor box, the larger the average IOU value; therefore both $\overline{IOU}_{Zone}$ and $\overline{IOU}_{image}$ in the formula are greater than the original $\overline{IOU}$, and the cascaded prediction accuracy is greater than the original prediction accuracy.
On the other hand, the average prediction accuracy of a single-stage neural network is positively correlated with the size of the training data set. Let F(object, Base, n) denote the average prediction accuracy of a neural network whose training samples are n pictures of Base annotated with object.
From the discussion above, the expected accuracy of the cascade is higher than that of the single-stage network:
F(birdnest, Zone, n) * F(Zone, image, n) > F(birdnest, image, n)
In the contact net nest data set, samples annotated with a nest are few, and most pictures contain no nest. However, the number of interest-domain samples is extremely large; interest domains unrelated to the nest to be detected still carry information gain about the nest distribution, and training on the interest domains can increase the accuracy of the whole detector. Let the number of data set samples be M and the number of samples containing a nest be N.
The overall accuracy of the trained detector is:
P=F(birdnest,Zone,N)*F(Zone,image,M)>F(birdnest,image,N)
The interest-domain data set is easy to obtain and M is much larger than N; when there are enough samples, the first-stage detector can distinguish all interest domains of the picture to be detected:
F(Zone,image,M)→1
Through this limiting behavior of the first-stage detector, the precision of the detector reaches its extreme value:
Pmax=F(birdnest,Zone,N)
The precision of the detector is therefore determined by the precision of the second-stage detector. The first-stage detector plays the role of amplifying the object, and the second-stage detector detects the nest to be detected within the interest domain; its average IoU value is clearly larger than that of a single-stage detector, so the detection precision of the detector is markedly improved.
Example two
Experimental environment and data
The experimental environment and system of the invention are configured as follows:
(1) Hardware configuration: Intel Core i9-10900K @ 3.70 GHz + NVIDIA GeForce RTX 3090 + 64 GB memory
(2) Operating system: Windows 10
(3) Deep learning framework: CUDA 11.0 + PyTorch 1.7.0
Reverse reasoning
The experimental data come from contact net inspection video collected by an inspection vehicle on a heavy-haul railway. The video frames were processed manually to obtain a picture set; 400 pictures containing a bird nest were selected, and 130 interest-domain nest pictures were obtained by reverse reasoning for training the second-stage deep learning detector. 58 interest-domain pictures obtained by manual checking and screening were used as a template library to perform template matching on the 3900 nest-free pictures of the video set, and the matched data set was manually checked and corrected for training the first-stage deep learning detector. For comparative training with other deep learning models, 240 of the original 400 pictures were used as the deep learning training set and 80 as the validation set; the remaining 80 pictures were expanded by a data enhancement method, and testing was performed on 126 nest-containing pictures. The experimental data organization structure is shown in Table 1:
Table 1 Experimental data organization structure
Screening the nest-containing pictures with the reverse-reasoning algorithm yields 130 pictures, each a catenary region of interest where a nest lies; the catenary regions in these pictures are structurally similar and effectively reflect the spatial characteristics of the nests. The 130 pictures serve as the training set for the second-stage detector, but some of them contain substantial environmental background whose noise interferes with template matching, so they must be manually culled. The pictures that remain after manual culling form the template library for template matching.
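As an illustrative aside rather than the patented implementation, the reverse-reasoning step can be sketched with the selective-search module of opencv-contrib-python, used here as a stand-in for the segmentation-and-merging procedure recited in the claims; the tightest-enclosing-candidate heuristic is an assumption:

    import cv2

    def region_of_interest(picture_path, nest_box):
        """Reverse-reasoning sketch: pick the candidate region that encloses
        the manually annotated nest box (bx, by, bw, bh)."""
        img = cv2.imread(picture_path)
        ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
        ss.setBaseImage(img)
        ss.switchToSelectiveSearchFast()
        candidates = ss.process()          # array of (x, y, w, h) candidate regions

        bx, by, bw, bh = nest_box
        best = None
        for (x, y, w, h) in candidates:
            # keep only candidates that fully contain the annotated nest box
            if x <= bx and y <= by and x + w >= bx + bw and y + h >= by + bh:
                # prefer the tightest enclosing candidate as the region of interest
                if best is None or w * h < best[2] * best[3]:
                    best = (int(x), int(y), int(w), int(h))
        return best                        # cropping this box yields a template picture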
Template matching
Template matching is carried out on the 3900 catenary-containing pictures in the dataset according to the template-matching algorithm, and the regions of interest are marked out. Owing to environmental error in the template library, some samples carry labeling deviations and require manual verification: wrongly matched detection frames are rejected, and pictures with too few matches are supplementarily labeled.
Likewise, owing to the limitations of template matching, matching on unknown samples is unlikely to label all regions of interest accurately. The manually verified template-matching results serve as the training set for the first-stage detector, and the deep learning algorithm learns the image features of the regions of interest so as to identify them in unknown samples; this confers strong generalization capability and can accurately identify all catenary regions along the railway where bird damage may occur.
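A minimal sketch of this labeling pass follows, assuming normalized cross-correlation (cv2.TM_CCOEFF_NORMED) as the similarity R and a hypothetical threshold c; the greedy non-overlap rule mirrors the descending-R traversal recited in claim 1:

    import cv2

    def match_templates(picture, templates, c=0.8):
        results = []
        for tpl in templates:
            h, w = tpl.shape[:2]
            r = cv2.matchTemplate(picture, tpl, cv2.TM_CCOEFF_NORMED)
            _, r_max, _, (x, y) = cv2.minMaxLoc(r)   # best match for this template
            if r_max > c:
                results.append((r_max, x, y, w, h))

        results.sort(reverse=True)                   # descending R
        kept = []
        for r_max, x, y, w, h in results:
            # discard a frame that intersects an already-labeled frame
            overlaps = any(
                max(x, kx) <= min(x + w, kx + kw) and
                max(y, ky) <= min(y + h, ky + kh)
                for (kx, ky, kw, kh) in kept
            )
            if not overlaps:
                kept.append((x, y, w, h))            # labeled in VOC format downstream
        return kept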
Detection with mainstream deep learning models
The accuracy reflects the ratio of correctly predicted samples to the total number of predicted samples; the higher the accuracy, the more precise the detection model. The detection rate is the ratio of correctly predicted samples to the total number of real samples; the higher the detection rate, the more reliable the detection model.
To realize real-time detection, the detection speed of the model should be raised as far as possible while maintaining high detection precision. The number of images processed per second (frames per second, FPS) is generally used to reflect detection speed. For a single-network detector this is the speed at which the detection network processes images; for the cascade network, the first- and second-stage detectors are computed serially, with the regions of interest of each picture detected in parallel.
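The metrics and the cascade timing model can be written down directly; the timings below are assumed figures for illustration, not measured values:

    def precision(true_positives, predicted):
        # correctly predicted samples / total predicted samples
        return true_positives / predicted

    def detection_rate(true_positives, real):
        # correctly predicted samples / total real samples
        return true_positives / real

    def cascade_fps(t_first, t_second_per_roi):
        # the two stages run serially; the regions of interest of one picture
        # are detected in parallel, so the second stage costs about one ROI's time
        return 1.0 / (t_first + t_second_per_roi)

    print(cascade_fps(0.02, 0.01))   # ~33 FPS under the assumed timings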
When the dataset is small, whether for a one-stage algorithm such as the YOLO series or a two-stage algorithm such as Faster R-CNN, the ability to recognize small targets is limited: the features of the catenary nest body are monotonous, the nest is hard to distinguish from environmental samples, training is difficult, and performance on the test set is poor.
With few training samples, the small size of the nest makes its overall features hard to learn well; expressive capability is poor, missed labels occur easily, and the detection rate of the detector suffers.
The YOLO-series networks can recognize objects at multiple scales and have detection capability above that of the Faster R-CNN network, but the nest's physical features are monotonous and easily confused with environmental samples, leading to false labels that lower the accuracy of the detector.
Recognition was performed with the YOLOv3-spp, YOLOv4 and Faster R-CNN detection models, with the following results:
As the table shows, whether for a one-stage network such as the YOLO series or a two-stage network such as Faster R-CNN, when bird-nest data are scarce, the nest's small size and monotonous physical features make the network's learning poor: the overall features of the nest are hard to learn, and the nests of the catenary along the railway cannot be labeled accurately.
Cascade detection
(1) First stage detector
The first-stage detector is trained for the region-of-interest recognition task with the Faster R-CNN, YOLOv3-spp and YOLOv4 detection models.
The domain of interest is detected using different network structures.
The YOLOv3-SPP network has 225 layers and 62.5 million parameters; the YOLOv4 network has 327 layers and 64 million parameters. YOLOv3 and YOLOv4 are substantially identical in backbone network structure, but YOLOv4 optimizes the training process: YOLOv3 uses GIOU_Loss for prediction-frame regression, which cannot distinguish the relative positions of objects, whereas YOLOv4 uses CIOU_Loss, which on top of GIOU_Loss considers dimensional information such as the center-point distance and the aspect ratio of the bounding box, greatly improving the accuracy of prediction-frame regression.
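For concreteness, a minimal sketch of the standard CIoU computation referenced above is given below, with boxes as (x1, y1, x2, y2) corners; it illustrates the additional center-distance and aspect-ratio terms and is not code from the patent:

    import math

    def ciou(box_a, box_b, eps=1e-9):
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b

        # intersection-over-union
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        iou = inter / (union + eps)

        # squared center distance over squared enclosing-box diagonal
        c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
        rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4

        # aspect-ratio consistency term
        v = (4 / math.pi ** 2) * (
            math.atan((bx2 - bx1) / (by2 - by1 + eps)) -
            math.atan((ax2 - ax1) / (ay2 - ay1 + eps))
        ) ** 2
        alpha = v / (1 - iou + v + eps)

        return iou - rho2 / (c2 + eps) - alpha * v   # CIoU_Loss = 1 - this value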
Fig. 13 shows the IOU curve during YOLOv3-GIOU training provided by the embodiment of the present invention, and Fig. 14 the IOU curve during YOLOv4-CIOU training. As Figs. 13 and 14 show, at the same training progress the CIOU of YOLOv4 is much larger than the GIOU of YOLOv3; CIOU_Loss additionally encodes the center position and region size of the region of interest, so the regional characteristics of the region of interest are learned better.
In contrast to the YOLO series, Faster R-CNN uses two-stage detection: it first generates candidate regions, producing more predicted frames than a YOLO-series network, and then classifies and regresses the candidate frames. The one-stage YOLO-series algorithms classify and regress the input picture directly, without generating candidate regions. Faster R-CNN therefore has lower false-detection and missed-detection rates, but its recognition speed is slower than that of the one-stage algorithms.
Testing was carried out on the 126 pictures of the test dataset of the catenary along the railway; the prediction results of the different network models are as follows:
Comparison of region-of-interest detection counts of the first-stage network models
(2) Second stage detector
The 130 bird-nest region-of-interest pictures are used for model training of the second-stage detector.
The detection task of the second-stage detector is to recognize the bird's nest within the region of interest. The pictures in the dataset are amplified by bilinear interpolation, so that even a small nest occupies a large area of the image; interference from environmental noise is effectively excluded, giving strong anti-interference capability. Because the nest is amplified together with the original sub-image by bilinear interpolation, the IOU value achieved in training also improves. Comparing the same detection model at the same training progress: Fig. 15 compares the IOU of YOLOv3-SPP second-stage detection against direct detection provided by the embodiment of the present invention, and Fig. 16 the IOU of YOLOv4 second-stage detection against direct detection. As Figs. 15 and 16 show, under the same YOLO-series detection model, the IOU value on nest images amplified by bilinear interpolation is higher, so the position of the nest can be located more accurately.
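The amplification step admits a short sketch; the 416 x 416 target size is an assumed value typical of YOLO inputs rather than one stated in the embodiments:

    import cv2

    def amplify_roi(img, roi, target=(416, 416)):
        x, y, w, h = roi
        crop = img[y:y + h, x:x + w]
        # cv2.INTER_LINEAR is OpenCV's bilinear interpolation
        scaled = cv2.resize(crop, target, interpolation=cv2.INTER_LINEAR)
        sx, sy = target[0] / w, target[1] / h
        # a nest box (bx, by, bw, bh) inside the ROI scales the same way:
        # ((bx - x) * sx, (by - y) * sy, bw * sx, bh * sy)
        return scaled, (sx, sy)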
The 126 pictures of the test set contain 222 bird nests in total. Different first-stage detectors are used to detect the regions of interest so as to test the detection performance of the second-stage detector; the first- and second-stage detectors are computed serially, with the second stage detecting the regions of interest in parallel.
The detection results are as follows:
Performance comparison of different target detection algorithms
As the table shows, the worst detection rate among the three cascaded detection models is 84.68%, still higher than the best direct-detection result of 77.48%. In cascade recognition, the first-stage network summarizes and identifies the prior regions of interest, and the second stage recognizes the nest to be detected within the region of interest; the distribution of nests within the region of interest is uniform in character, interference from other environmental samples is reduced, the distinguishability of the nest is improved, and training difficulty falls. The cascade network therefore depends little on the scale of the second-stage dataset: only a few pictures are needed to learn the distribution characteristics of nests within the region of interest.
Since Faster R-CNN as the first-stage detector detects the largest number of regions of interest, it lets the detector approach the extreme case of the cascade network; the detector with the optimal detection rate is therefore the Faster R-CNN cascade with the YOLOv3-SPP network. However, the detection speed of Faster R-CNN is low, and the speed of the whole detector is limited by the first-stage detector, so real-time performance cannot be met. The YOLOv4 network has higher expressive capability than YOLOv3: it excels in the region-of-interest training, where samples are plentiful and the features of the object to be detected are easily distinguished, but its accuracy is low in training with few samples and inconspicuous sample features. The cascaded YOLOv3-SPP and YOLOv4 plus YOLOv3-SPP networks achieve high detection rate and accuracy together with high FPS, and can detect video along the railway in real time.
In summary, the method provided by the embodiment of the invention for automatically recognizing and quickly tracking bird nests on a high-speed railway catenary effectively solves the problem of accurately and quickly recognizing and tracking nests on the catenary; it overcomes the recognition difficulty caused by the small information content of the nest in the catenary image and its lack of obvious shape or texture features; and by narrowing the search range when the target is far smaller than the scene, it converts the task into large-target tracking and remarkably improves accuracy. Bird nests on the railway catenary can thus be recognized and detected automatically and effectively.
The invention provides a tracking method based on a cascade neural network: the first-stage neural network delineates the regions of interest where nests occur, and these are then used to train the second-stage neural network, so that high tracking speed is maintained while accuracy is pursued.
Direct detection of bird nests with existing deep learning models is difficult and performs poorly. For the case of few data samples and small detection targets, the invention combines existing algorithms to process the dataset and provides a detection method that completes this complex detection task with excellent detection capability. The focus is the principle by which the cascade detection architecture offers advantages over traditional single-network detection and achieves its excellent detection capability.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment mainly describes its differences from the others. In particular, for apparatus or system embodiments, which are substantially similar to the method embodiments, the description is relatively brief; refer to the description of the method embodiments for the relevant parts. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. The method for detecting the bird nest of the railway contact net is characterized by comprising the following steps of:
obtaining an interest domain picture containing a bird nest region through reverse reasoning according to a picture containing a bird nest in a railway contact net picture data set, taking the interest domain picture as a template picture, forming a template library according to all the template pictures, and training a second-level YOLO detector by using the template library;
sequentially matching the pictures which do not contain bird nest in the railway contact net picture data set with each template picture in the template library to obtain an interest domain picture data set, and training a first-stage YOLO detector by using the interest domain picture data set;
Inputting a picture to be detected into a trained first-stage YOLO detector, outputting a region-of-interest picture by the first-stage YOLO detector, inputting the region-of-interest picture into a trained second-stage YOLO detector, and outputting a bird nest detection result of the picture to be detected by the second-stage YOLO detector;
The method comprises the steps that a picture containing a bird nest in a railway contact net picture data set is subjected to reverse reasoning to obtain a region-of-interest picture containing a bird nest region, the region-of-interest picture is used as a template picture, and a template library is formed according to all the template pictures, and the method comprises the following steps:
Performing preliminary segmentation on a picture containing a bird nest in a railway contact net picture dataset to obtain a basic region with set similarity, performing preliminary merging on the basic region according to the difference between the regions to obtain a series of preliminary candidate regions, surrounding the preliminary candidate regions by rectangular frames, merging the preliminary candidate regions according to the similarity between the rectangular frames to obtain a final candidate region, manually marking the bird nest position in the final candidate region, using a rectangle to represent the marked bird nest region attribute, using the final candidate region containing the bird nest region as an interest region, using the interest region picture as a template picture, and forming a template library according to all the template pictures;
The method for primarily dividing the pictures containing the bird nest in the railway contact net picture data set to obtain basic areas with set similarity, primarily combining the basic areas according to the difference between the areas to obtain a series of primary candidate areas, and comprises the following steps:
A picture containing a bird's nest is represented by an undirected graph G = <V, E>, wherein a vertex of the undirected graph represents a pixel of the picture and the weight of an edge e = (v_i, v_j) represents the dissimilarity of the adjacent vertex pair i, j; the color distance of the pixels, $w(e) = \sqrt{(R_i - R_j)^2 + (G_i - G_j)^2 + (B_i - B_j)^2}$, represents the dissimilarity w(e) between two pixels, and one basic region is a set of points with minimum dissimilarity;
the intra-class difference of a basic region is defined as: $Int(C) = \max_{e \in MST(C, E)} w(e)$;
The inter-class difference between two basic regions $C_1$, $C_2$ is defined as the minimum edge connecting the two regions: $Diff(C_1, C_2) = \min_{v_i \in C_1,\, v_j \in C_2,\, (v_i, v_j) \in E} w((v_i, v_j))$;
if the two basic regions have no connecting edge, $Diff(C_1, C_2) = \infty$;
When the condition $Diff(C_1, C_2) \le \min(Int(C_1) + \tau(C_1),\ Int(C_2) + \tau(C_2))$ is satisfied, it is judged that the two basic regions $C_1$, $C_2$ can be merged;
where $\tau(C) = k / \|C\|$ is a threshold function that weights regions of isolated points;
Preliminary merging is carried out on each basic region to obtain a series of preliminary candidate regions;
The preliminary candidate areas are surrounded by rectangular frames, the preliminary candidate areas are combined according to the similarity between the rectangular frames, and a final candidate area is obtained, and the method comprises the following steps:
Surrounding the preliminary candidate region by a rectangular frame, wherein the position of the rectangular frame C is represented by a quadruple of (x, y, w, h), wherein x, y represents the coordinate of the upper left corner of the rectangular frame, and w, h represents the width and the height of the rectangular frame;
The color distance between the rectangular box of the preliminary candidate region $r_i$ and the rectangular box of the preliminary candidate region $r_j$ is: $S_{colour}(r_i, r_j) = \sum_{k=1}^{n} \min(c_i^{(k)}, c_j^{(k)})$,
wherein $c_i^{(k)}$ represents the pixel proportion of the k-th bin of the color histogram;
The texture distance between the rectangular box of the preliminary candidate region $r_i$ and the rectangular box of the preliminary candidate region $r_j$ is: $S_{texture}(r_i, r_j) = \sum_{k=1}^{n} \min(t_i^{(k)}, t_j^{(k)})$,
wherein $t_i^{(k)}$ represents the pixel proportion of the k-th dimension of the texture histogram;
for the preliminary candidate regions $r_i$ and $r_j$: $S_{size}(r_i, r_j) = 1 - \frac{size(r_i) + size(r_j)}{size(im)}$,
wherein size(r_i) represents the size of the rectangular frame corresponding to region $r_i$, size(r_j) the size of the rectangular frame corresponding to region $r_j$, and size(im) the size of the original picture to be segmented; $S_{size}(r_i, r_j)$ is the size similarity score of the rectangular regions $r_i$ and $r_j$;
for the preliminary candidate regions $r_i$ and $r_j$: $S_{fill}(r_i, r_j) = 1 - \frac{size(BB_{ij}) - size(r_i) - size(r_j)}{size(im)}$,
wherein $size(BB_{ij})$ represents the size of the circumscribed rectangle of regions $r_i$ and $r_j$; $S_{fill}(r_i, r_j)$ is the fill similarity score of the rectangular regions $r_i$ and $r_j$, measuring the degree of intersection of the different regions;
The total similarity between the preliminary candidate regions $r_i$ and $r_j$ is:
$S(r_i, r_j) = a_1 S_{colour}(r_i, r_j) + a_2 S_{texture}(r_i, r_j) + a_3 S_{size}(r_i, r_j) + a_4 S_{fill}(r_i, r_j)$
where $a_1, a_2, a_3, a_4$ are the corresponding weights;
When the total similarity $S(r_i, r_j)$ between the preliminary candidate regions $r_i$ and $r_j$ is larger than a set merging threshold, the preliminary candidate regions $r_i$ and $r_j$ are merged to obtain a final candidate region;
The step of sequentially matching the pictures which do not contain bird nest in the railway contact net picture data set with each template picture in the template library to obtain an interest domain picture data set comprises the following steps:
A picture to be matched that does not contain a bird's nest in the railway catenary picture dataset is matched in turn with each template picture in the template library. Let the template picture be T, the picture to be matched be I, the width of the template picture be w and its height be h, and let R denote the matching result; the matching method is expressed by the following general expression:
$R(x, y) = \frac{\sum_{x', y'} T'(x', y')\, I'(x + x', y + y')}{\sqrt{\sum_{x', y'} T'(x', y')^2 \cdot \sum_{x', y'} I'(x + x', y + y')^2}}$
wherein $T'$ and $I'$ denote the template and the image patch after subtraction of their respective means;
The larger the R value, the higher the similarity between the rectangular region of size (w, h) at position (x, y) of the picture to be matched and the template; the maximum similarity value over the picture is taken as the template matching result, and the template matching value is required to be higher than a threshold parameter;
Denote $R_s = \max_{x, y} R(x, y)$; each template picture corresponds to an optimal matching value $R_s$ with an associated rectangular matching frame at (x, y, w, h), and the primary matching results of the templates form a result set S:
$S = \{(R_s, x, y, w, h) \mid R_s > c\}$
wherein c is the matching threshold parameter;
the result set S is arranged in descending order of $R_s$; two rectangular matching frames s and t intersect when:
max(x(s),x(t))≤min(x(s)+w(s),x(t)+w(t))
max(y(s),y(t))≤min(y(s)+h(s),y(t)+h(t))
The result set S is traversed in order: if the current rectangular matching frame intersects an already-labeled matching frame, the label is discarded; otherwise the current rectangular matching frame is labeled in VOC format. All labeled rectangular matching frames form the region-of-interest dataset.
2. The method according to claim 1, wherein the manually labeling the bird's nest position in the final candidate region, representing the labeled bird's nest region attribute with a rectangle, using the final candidate region including the bird's nest region as the interest region, using the interest region picture as the template picture, and constructing the template library according to all the template pictures, includes:
the final candidate region C is represented by a rectangle, and the position attribute of the rectangle is represented by a quadruple (x, y, w, h);
Labeling the nest position in the final candidate region, and representing the labeled nest region attribute by a rectangle with position attribute (bx, by, bw, bh); the interest region is a candidate region containing the nest region, and the position coordinates of the interest region satisfy:
$x \le bx,\quad y \le by,\quad x + w \ge bx + bw,\quad y + h \ge by + bh$
and satisfies a threshold condition:
And taking the interest domain picture as a template picture, and forming a template library according to all the template pictures.
3. The method according to any one of claims 1 to 2, wherein the first stage YOLO detector and the second stage YOLO detector comprise: YOLOv3-spp, YOLOv4 and Faster R-CNN.
4. The method of claim 3, wherein the confidence and expectation of the first-stage YOLO detector are given by:
$Confidence(Zone) = Pr(Zone) \times IOU^{truth}_{pred}$
wherein Pr(Zone) is the probability that the current grid contains an object to be detected (a region of interest): during training, Pr(Zone) is 1 if the grid under test contains a region of interest and 0 otherwise; $IOU^{truth}_{pred}$ is the intersection-over-union of the grid-predicted labeling frame and the rectangular frame where the region of interest actually lies; B is the number of labeling frames predicted by each grid; $S^2$ is the total number of grids into which the picture is divided; $\overline{IOU}$ is the average of the IOUs of all prediction frames made by the grids containing objects; I(Zone) is the size of the region of interest; I(image) is the size of the original picture; and E(Zone) is the sum of the total IOUs given by the picture;
The confidence of the second-stage YOLO detector is:
$Confidence(Birdnest) = Pr(Birdnest) \times IOU^{truth}_{pred} \times P(distribution)$
wherein Pr(Birdnest) is the probability that the current grid contains a bird's nest (1 if the grid under test contains a nest, 0 otherwise); $IOU^{truth}_{pred}$ is the intersection-over-union of the grid-predicted labeling frame and the rectangular frame where the nest truly lies; and P(distribution) is the probability that a nest in the picture lies within a region of interest, which is evidently 1 since all nests lie within regions of interest;
the expectation of bird-nest prediction within the region of interest is:
$E(birdnest) = Confidence(Zone) \times \sum_{i=1}^{S^2} \sum_{j=1}^{B} Pr_i(birdnest) \times IOU_{ij}$
wherein $Pr_i(birdnest)$ is the probability that the i-th grid of the sub-image contains a bird's nest; $IOU_{ij}$ is the intersection-over-union of the j-th rectangular frame predicted by the i-th grid of the sub-image with the nest picture; and Confidence(Zone) is the region-of-interest confidence of a rectangular frame predicted by the original-image grid, so that each element represents the degree of certainty with which a prediction frame made by the divided grids of the original image marks a bird's nest;
The expectation of the cascade prediction is the product of the expectations of the two stages, wherein $\overline{IOU}_{birdnest}$ denotes the average intersection-over-union of the prediction labeling frames made by the grids covered by the nest in the region-of-interest sub-image with the rectangle where the nest lies, and $\overline{IOU}$ denotes the average IOU value of the grid-predicted anchor frames;
the precision of the cascade prediction is:
P = F(birdnest, Zone, N) * F(Zone, image, M) > F(birdnest, image, N).