AU2021102397A4 - Method for target detection oriented to dam defect images

Info

Publication number: AU2021102397A4
Application number: AU2021102397A
Authority: AU
Inventors: Yi Liu; Yuan Li; Yingchi MAO; Ping PING; Jun Qian; Longbao WANG; Shufang XU
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2020-05-08
Filing date: 2021-05-07
Publication date: 2021-06-24
Anticipated expiration: 2029-05-07
Also published as: CN111597941A; CN111597941B

Abstract

The present disclosure discloses a method for target detection oriented to dam defect images, comprising: applying deformable convolution to a VGG16 network, expanding the convolutional sensing range, and capturing deformation features of dam defects by learning convolution offset so as to obtain a feature map of defect images; during the multi-scale feature map detection, modifying the size proportion of a prior box in the anchor mechanism to improve the detection accuracy of strip-shaped defect features and the generalization capability of the model; adopting an improved non-maximum suppression algorithm to screen out redundant negative samples, so that the diversity of training samples is ensured as much as possible on the premise of balancing the ratio of positive and negative samples. According to the present disclosure, effective detection is performed for the dam defect images, so as to both realize the detection of multi-deformation defect features and further improve the generalization capability of strip-shaped defect detection. The present disclosure shows a higher detection precision and a better convergence performance in the target detection for the dam defect images.

Description

METHOD FOR TARGET DETECTION ORIENTED TO DAM DEFECT IMAGES

TECHNICAL FIELD

[01] The present disclosure relates to the field of target detection for the dam defect images, and in particular to a method for the target detection oriented to the dam defect images.

BACKGROUND ART

[02] In the field of construction engineering, those inspection items or inspection points that do not meet specified requirements in engineering construction quality are defined as defects. With the long-term operation of hydro-power dams, factors such as material aging and environmental impacts result in the formation of defects to different degrees. If a defect is relatively minor, corresponding measures can be taken to deal with it in time to meet the structural bearing requirement. Once the defect is not treated and remedied in time, it will pose a great threat to the safe operation of the dam. Therefore, automatic patrol inspection equipment may be used to detect and check defects in time so as to effectively maintain the structural safety of the dam.

[03] Public data sets used for target detection have fixed object categories and features, so during feature extraction the features are usually convolved with a sensitive range (receptive field) of fixed size. Because defects form unpredictably, their geometric shapes vary with cause and environment, which makes feature extraction correspondingly more difficult. SSD extracts features with the traditional convolution method, which is effective for samples of fixed geometry but cannot adapt to the unknown geometric deformations found in defect data sets, and is therefore limited.

[04] It will be clearly understood that, if a prior art publication is referred to herein, this reference does not constitute an admission that the publication forms part of the common general knowledge in the art in Australia or in any other country.

SUMMARY

[05] To overcome problems in the prior art and/or provide a commercial choice, the present disclosure provides a method for target detection oriented to dam defect images. The target detection algorithm adopting deformable convolution to extract features may both realize efficient detection and accurately identify and detect dam defects of variable geometries.

[06] In one aspect, embodiments of the present invention provide a method for target detection oriented to dam defect images, comprising: (1) Adopting a deformable-convolution VGG16 network to extract defect features; (2) Performing multi-scale feature map detection by using a modified size proportion of a prior box; (3) In the training process, adopting an improved non-maximum suppression method to balance positive and negative samples, namely adopting a method by which only negative samples are screened out during the redundant sample screening and elimination, so as to ensure the diversity of training samples.

[07] The present disclosure also provides a method for target detection oriented to dam defect images, which includes the following steps:

[08] (1) For geometric deformation features of dam defects, applying the deformable convolution to a single-stage target detector SSD to modify the convolution in its backbone network VGG16 into the deformable convolution, expanding the convolutional sensitive range, and capturing deformation features of dam defects by learning a convolution offset;

[09] (2) In the stage of the multi-scale feature map detection, for the strip-shaped feature such as "crack" defects in the dam, modifying a size proportion of a prior box in an anchor mechanism to improve the detection accuracy of strip-shaped features and the generalization capability of a model;

[10] (3) In the training process, adopting an improved non-maximum suppression method, namely adopting a method by which only negative samples are screened out during the redundant sample screening and elimination, so as to ensure the diversity of training samples.

[11] Further, specific steps of adopting the deformable-convolution VGG16 network to extract defect features include:

[12] (1.1) Original images are input, labeled as U, wherein batch is set to b;

[13] (1.2) The original image batch undergoes a common convolution padded in the "same" way, namely the output and input are unchanged in size, and the corresponding output result is the offset of each pixel in the original image batch; in a deformable convolution, the regular sampling grid $R$ has its sensitive range enlarged by the offsets $\{\Delta p_n \mid n = 1, \dots, N\}$, wherein $N = |R|$, so a pixel value upon convolution is:

[14] $$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

[15] Here, the sampled convolution kernel consists of the irregularly offset $R$, and the offset from the center of the convolution kernel is $p_n + \Delta p_n$, with the original standard convolution process divided into two branches, the upper one of which learns the offsets to obtain $H \times W \times 2N$ output offsets, wherein $N = |R|$ represents the number of pixels in the convolution kernel, and $2N$ represents the offsets in two perpendicular directions;

[16] (1.3) The pixel index values in the images U are added to the offset output V to get the offset coordinates (namely coordinate values in the original images U); the coordinate values must be limited to the image size, and the floating-point coordinate values are used to obtain the pixels;

[17] (1.4) The calculated offset $\Delta p_n$ is usually fractional, and non-integer coordinates cannot be used directly in discrete data such as images. Simple rounding would introduce errors, so the pixel value $x(p_0 + p_n + \Delta p_n)$ is calculated by bilinear interpolation, that is, from the four pixels closest to the coordinates. Writing $p = p_0 + p_n + \Delta p_n$, $x(p)$ is simplified as:

[18] $$x(p) = \sum_{q} G(q, p) \cdot x(q)$$

[19] Wherein x(q) represents pixel values at four adjacent integer coordinates, and G(•,•) is the weight parameter corresponding to the four integer points adjacent to p:

[20] $$x(p) = \sum_{q} \max(0, 1 - |q_x - p_x|) \cdot \max(0, 1 - |q_y - p_y|) \cdot x(q)$$

[21] (1.5) After all pixels corresponding to coordinate values are calculated, a new feature map is obtained which is input into the next layer as input data.
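As a hedged illustration of steps (1.2) to (1.5), the following NumPy sketch evaluates one output pixel of a deformable convolution, reading the input at fractional positions with the bilinear weights max(0, 1 - |q_x - p_x|) · max(0, 1 - |q_y - p_y|). The function names, the 3x3 kernel, the single-channel input and the (row, column) coordinate order are assumptions made for illustration only and are not taken from the disclosure.

```python
import math
import numpy as np

def bilinear_sample(x, py, px):
    """Sample the 2-D array x at the fractional position (py, px)."""
    h, w = x.shape
    # Limit the coordinate to the image size (step 1.3).
    py = min(max(py, 0.0), h - 1.0)
    px = min(max(px, 0.0), w - 1.0)
    y0, x0 = math.floor(py), math.floor(px)
    # Sets avoid counting a border pixel twice when both neighbours coincide.
    rows = {y0, min(y0 + 1, h - 1)}
    cols = {x0, min(x0 + 1, w - 1)}
    value = 0.0
    for qy in rows:
        for qx in cols:
            # G(q, p): product of the two one-dimensional linear weights.
            g = max(0.0, 1 - abs(qy - py)) * max(0.0, 1 - abs(qx - px))
            value += g * x[qy, qx]
    return value

def deformable_conv_pixel(x, weights, offsets, p0):
    """y(p0) = sum over n of w(p_n) * x(p0 + p_n + dp_n) for a 3x3 grid R.

    weights: 9 kernel weights; offsets: 9 (dy, dx) pairs; p0: (row, col).
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]  # regular grid R
    total = 0.0
    for n, (dy, dx) in enumerate(grid):
        oy, ox = offsets[n]
        total += weights[n] * bilinear_sample(x, p0[0] + dy + oy, p0[1] + dx + ox)
    return total

image = np.arange(16, dtype=float).reshape(4, 4)
no_offsets = [(0.0, 0.0)] * 9
print(deformable_conv_pixel(image, [1.0] * 9, no_offsets, (1, 1)))
```

With all offsets equal to zero the sketch reduces to an ordinary 3x3 convolution, which gives a simple way to sanity-check it.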

[22] Further, performing multi-scale feature map detection by using the modified size proportion of the prior box specifically includes:

[23] (2.1) A prior box with a different scale is set for each pixel cell in the feature map, and with the decrease of the feature map size, the size of the prior box increases linearly as follows:

[24] $$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

[25] Wherein $m$ is the number of feature maps, and as the prior box is separately sized for the convolution layer in the backbone network, $s_{max}$ and $s_{min}$ represent the maximum value and the minimum value of the size proportion relative to the feature maps;

[26] (2.2) A prior box with a different aspect ratio is set for each pixel cell in the feature map. For strip-shaped defects such as cracks in a dam, an aspect ratio of an original prior box is insufficient to label the defects completely, so an aspect ratio of the prior box is set to:

[27] $$a_r \in \{1, 2, 5, \tfrac{1}{2}, \tfrac{1}{5}\}$$

[28] (2.3) The actual width and height of the prior box is calculated according to the following formula:

[29] $$w_k = |f_k| \cdot s_k \sqrt{a_r}, \qquad h_k = |f_k| \cdot s_k / \sqrt{a_r}$$

[30] To ensure the accuracy of the target detection and the integrity of the prior box coverage, each feature map is additionally provided with a prior box sized as $s_k' = \sqrt{s_k s_{k+1}}$ and with an aspect ratio of 1, namely each feature map is provided with two differently sized prior boxes with an aspect ratio of 1, so the aspect ratio of the prior box is actually set to $a_r \in \{1, 2, 5, \tfrac{1}{2}, \tfrac{1}{5}, 1'\}$; the center point of the prior box of each pixel cell is located at the center of the cell, namely $\left(\frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|}\right)$, $i, j \in [0, |f_k|)$, where $|f_k|$ is the size of the feature map;

[31] (2.4) Each prior box of each pixel outputs a value consisting of two parts which respectively correspond to a position of a prior box and confidence scores of various categories in the prior box. The position of the prior box contains four values (cx, cy, w, h) which respectively represent the center coordinate, the width and the height of the prior box. The confidence value represents the possibility that the target in the prior box corresponds to each category. If there are c categories in the detected target, it is necessary to predict c+1 confidence values, wherein the first confidence refers to a score without the target or belonging to the background.
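Steps (2.2) and (2.3) can be sketched in Python as below. This is a minimal illustration under stated assumptions: the aspect ratios are read as {1, 2, 5, 1/2, 1/5}, the width and height are taken as |f_k| · s_k · sqrt(a_r) and |f_k| · s_k / sqrt(a_r), the extra square box uses sqrt(s_k · s_{k+1}), and the function name `prior_boxes` and the output tuple layout are hypothetical.

```python
import math

# Aspect ratios for strip-shaped defects, as read from step (2.2).
ASPECT_RATIOS = (1.0, 2.0, 5.0, 1 / 2, 1 / 5)

def prior_boxes(f_k, s_k, s_k_next):
    """Return (cx, cy, w, h) tuples for every cell of an f_k x f_k feature map.

    Centers are normalised to [0, 1); widths and heights follow the
    formulas of step (2.3) and are therefore in feature-map units.
    """
    shapes = [(f_k * s_k * math.sqrt(a), f_k * s_k / math.sqrt(a))
              for a in ASPECT_RATIOS]
    # Additional square box with scale sqrt(s_k * s_{k+1}) and aspect ratio 1.
    s_extra = math.sqrt(s_k * s_k_next)
    shapes.append((f_k * s_extra, f_k * s_extra))

    boxes = []
    for i in range(f_k):
        for j in range(f_k):
            cx, cy = (i + 0.5) / f_k, (j + 0.5) / f_k  # center of cell (i, j)
            boxes.extend((cx, cy, w, h) for w, h in shapes)
    return boxes

boxes = prior_boxes(f_k=3, s_k=0.2, s_k_next=0.37)
print(len(boxes))  # 3 * 3 cells, six prior boxes per cell
```

The 3x3 feature map in the usage line is chosen only to keep the output small; it is not a size used in the disclosure.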

[32] Further, balancing positive and negative samples by using the improved non-maximum suppression method in the training process has the following specific steps:

[33] (3.1) For a defect data set of the present disclosure, there are very few real targets in each image, but there are many prior boxes. Therefore, starting from the prior box, if an IoU ratio between the prior box and the real target is greater than 0.5, the prior box is listed as a positive sample, otherwise a negative sample; and confidences of all negative sample prior boxes are sorted in descending order to select a negative sample having the smallest confidence;

[34] (3.2) The other negative sample prior boxes are then traversed, and a negative sample prior box is deleted if its overlap with the lowest-scored prior box is greater than a threshold value of 0.5;

[35] (3.3) One of unprocessed negative sample prior boxes with the lowest confidence is selected to repeat the above steps, and only negative samples are deleted. Compared with traditional NMS, this method increases the diversity of sample training on the premise of deleting redundant prior boxes.
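Steps (3.1) to (3.3) can be sketched as the loop below. It is a hedged illustration rather than the disclosed implementation: the overlap measure is assumed to be IoU, boxes are assumed to be given as corner coordinates (x1, y1, x2, y2), and the names `iou` and `screen_negatives` are hypothetical. Positive samples are never passed to the function, so only negative samples can be deleted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def screen_negatives(boxes, scores, threshold=0.5):
    """Return the indices of the negative prior boxes that are kept."""
    # Step (3.1): visit negatives starting from the smallest confidence.
    order = sorted(range(len(boxes)), key=lambda i: scores[i])
    removed, kept = set(), []
    for pos, idx in enumerate(order):
        if idx in removed:
            continue
        kept.append(idx)
        # Step (3.2): delete unprocessed negatives overlapping the selected box.
        for other in order[pos + 1:]:
            if other not in removed and iou(boxes[idx], boxes[other]) > threshold:
                removed.add(other)
        # Step (3.3): the outer loop moves on to the next unprocessed negative.
    return kept

negatives = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
confidences = [0.1, 0.3, 0.2]
print(screen_negatives(negatives, confidences))
```

In this toy input the second box overlaps the first by more than 0.5 and is dropped, while the distant third box survives.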

[36] According to the present disclosure, a lightweight single-stage target detection algorithm SSD is selected as a basic framework to: reasonably analyze features of dam defects, specifically improve the VGG16 network in the feature extraction stage, add an intermediate mechanism for processing geometric transformation, expand the convolution sensitive range, and capture deformation features of defects by learning the convolution offset.

[37] The present disclosure improves the feature extraction of SSD, and applies the deformable convolution to the backbone network VGG16 of SSD so as to expand the convolution sensitive range and provide a more flexible feature extraction mechanism for features with changeable geometries, thus further improving the target detection accuracy in the case of high-efficiency detection.

[38] Advantageous effects: in comparison to the prior art the present disclosure provides the following advantages:

[39] 1. Deformable convolution acts as an intermediate mechanism of the VGG16 network to deal with geometric transformation, improves the spatial information modeling ability of the model, and has better performance in target detection accuracy of dealing with defect features with unfixed geometries.

[40] 2. Strip-shaped defects can be located and detected more accurately by modifying the aspect ratio of the prior box into $a_r \in \{1, 2, 5, \tfrac{1}{2}, \tfrac{1}{5}\}$, which improves the detection accuracy of strip-shaped features and the generalization ability of the model.

[41] 3. In the training process, adopting the improved non-maximum suppression method may not only mitigate imbalance between positive and negative samples, but also eliminate negative samples only, thus increasing the sample training volume and optimizing the learning effect.

BRIEF DESCRIPTION OF THE DRAWINGS

[42] Fig. 1 is a schematic diagram of dam defects in embodiments of the present disclosure;

[43] Fig. 2 is an overall framework diagram of a defect image target detection algorithm in embodiments of the present disclosure;

[44] Fig. 3 is a framework diagram of VGG16 feature extraction network in embodiments of the present disclosure;

[45] Fig. 4 is a framework diagram of deformable convolution in embodiments of the present disclosure;

[46] Fig. 5 is a schematic diagram of expanding the sensitive range by the deformable convolution in embodiments of the present disclosure;

[47] Fig. 6 is a schematic diagram of an improved aspect ratio of the prior box;

[48] Fig. 7 is a schematic diagram of a defect image target detection result in embodiments of the present disclosure.

DETAILED DESCRIPTION OF THE EMBODIMENTS

[49] The present disclosure will be further illustrated with reference to specific embodiments below. These embodiments are only used to illustrate the present disclosure without limiting the scope of the present disclosure. After reading the present disclosure, modifications of various equivalent forms made by those skilled in the art to the present disclosure fall within the scope defined by the appended claims of the present disclosure.

[50] It is known that there are patrol inspection defect images of a power station dam project, which include four types of defects and one type of engineering features, namely cracks, alkaline precipitates, water seepage, concrete scaling, and holes, as shown in Fig. 1. The total number of defect images is 8890, including 12995 labeled examples.

[51] Fig. 2 shows an overall framework of the method for target detection oriented to dam defect images provided by the present disclosure, and introduces the main work flow according to the present disclosure, which is specifically implemented as follows:

[52] (1) For geometric deformation features of dam defects, applying the deformable convolution to a single-stage target detector SSD to modify the convolution in its backbone network VGG16 as shown in Fig. 3 into the deformable convolution, expanding the convolutional sensitive range, and capturing deformation features of dam defects by learning a convolution offset.

[53] (1.1) Original images are input, labeled as U, wherein batch is set to b;

[54] (1.2) The original image batch undergoes a common convolution padded in the "same" way, namely the output and input are unchanged in size, and the corresponding output result is the offset of each pixel in the original image batch; in a deformable convolution, the regular sampling grid $R$ has its sensitive range enlarged by the offsets $\{\Delta p_n \mid n = 1, \dots, N\}$, as shown in Fig. 5. In the offsets, $N = |R|$, so a pixel value upon convolution is:

[55] $$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$

[56] Here, the sampled convolution kernel consists of the irregularly offset $R$, and the offset from the center of the convolution kernel is $p_n + \Delta p_n$, with the original standard convolution process divided into two branches, the upper one of which learns the offsets to obtain $H \times W \times 2N$ output offsets, wherein $N = |R|$ represents the number of pixels in the convolution kernel, and $2N$ represents the offsets in two perpendicular directions, as shown in Fig. 4;

[57] (1.3) The pixel index values in the images U are added to the offset output V to get the offset coordinates (namely coordinate values in the original images U); the coordinate values must be limited to the image size, and the floating-point coordinate values are used to obtain the pixels;

[58] (1.4) The calculated offset $\Delta p_n$ is usually fractional, and non-integer coordinates cannot be used directly in discrete data such as images. Simple rounding would introduce errors, so the pixel value $x(p_0 + p_n + \Delta p_n)$ is calculated by bilinear interpolation, that is, from the four pixels closest to the coordinates. Writing $p = p_0 + p_n + \Delta p_n$, $x(p)$ is simplified as:

[59] $$x(p) = \sum_{q} G(q, p) \cdot x(q)$$

[60] Wherein x(q) represents pixel values at four adjacent integer coordinates, and G(•,•) is the weight parameter corresponding to the four integer points adjacent to p:

[61] $$x(p) = \sum_{q} \max(0, 1 - |q_x - p_x|) \cdot \max(0, 1 - |q_y - p_y|) \cdot x(q)$$

[62] (1.5) After all pixels corresponding to coordinate values are calculated, a new feature map is obtained which is input into the next layer as input data.
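Step (1.3) can be illustrated with the NumPy sketch below, which adds the pixel index and the regular kernel grid to the learned offsets and limits the result to the image size. The memory layout of the offset output (last axis ordered as dy_1, dx_1, ..., dy_N, dx_N for a single image) and the function name `sampling_coordinates` are assumptions for illustration; the disclosure does not fix a layout.

```python
import numpy as np

def sampling_coordinates(offsets, kernel_size=3):
    """Turn an (H, W, 2N) offset map into (H, W, N, 2) sampling coordinates.

    Each output entry is p0 + p_n + dp_n as a floating-point (row, col)
    pair, clipped so that it stays inside the image.
    """
    offsets = np.asarray(offsets, dtype=float)
    h, w, two_n = offsets.shape
    n = kernel_size * kernel_size
    if two_n != 2 * n:
        raise ValueError("last axis must hold 2N offsets")

    r = kernel_size // 2
    # Regular grid R of the kernel, e.g. (-1, -1) ... (1, 1) for 3x3.
    ky, kx = np.meshgrid(np.arange(-r, r + 1), np.arange(-r, r + 1), indexing="ij")
    grid = np.stack([ky.ravel(), kx.ravel()], axis=-1)        # (N, 2)

    # Pixel index p0 of every output location.
    iy, ix = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    p0 = np.stack([iy, ix], axis=-1)[:, :, None, :]           # (H, W, 1, 2)

    coords = p0 + grid[None, None] + offsets.reshape(h, w, n, 2)
    coords[..., 0] = np.clip(coords[..., 0], 0, h - 1)        # rows
    coords[..., 1] = np.clip(coords[..., 1], 0, w - 1)        # columns
    return coords

demo_offsets = np.zeros((4, 5, 18))
demo_offsets[2, 2, 8] = 0.3   # shift the kernel center at pixel (2, 2) down by 0.3
demo_coords = sampling_coordinates(demo_offsets)
print(demo_coords[2, 2, 4])
```

The resulting floating-point coordinates are what the bilinear interpolation of step (1.4) would then read from.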

[63] (2) In the stage of the multi-scale feature map detection, for the strip-shaped feature such as "crack" defects in the dam, modifying a size proportion of a prior box in an anchor mechanism to improve the detection accuracy of strip-shaped features and the generalization capability of a model.

[64] (2.1) A prior box with a different scale is set for each pixel cell in the feature map, and with the decrease of the feature map size, the size of the prior box increases linearly as follows:

[65] $$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

[66] Wherein $m$ is the number of feature maps, and as the prior box is separately sized for the convolution layer in the backbone network, $m$ is set to 5 herein; $s_{max}$ and $s_{min}$ represent the maximum value and the minimum value of the size proportion relative to the feature maps, and are set to 0.9 and 0.2 respectively; and the size proportion of the first feature map is set to $s_{min}/2 = 0.1$. For each feature map after the first layer, the size proportion of its prior box increases linearly according to the above formula, giving $s_k$ of 0.2, 0.37, 0.54, 0.71 and 0.88 respectively. These values correspond to a step of 0.17 between successive feature maps, that is, the exact step $(s_{max} - s_{min})/(m - 1) = 0.175$ rounded down to two decimal places. The proportion $s_k$ is multiplied by the size of the respective feature map to get the prior box size of each feature map.
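The listed proportions can be reproduced with the short Python sketch below. It assumes that the step between feature maps is rounded down to a whole percentage, which is an inference from the listed values 0.2, 0.37, 0.54, 0.71 and 0.88 and is not stated in the disclosure; the function names are hypothetical. The unrounded linear formula is included for comparison.

```python
def prior_box_scales(s_min=0.2, s_max=0.9, m=5):
    """Scales with the step floored to a whole percent (assumed rounding)."""
    lo, hi = round(s_min * 100), round(s_max * 100)
    step = (hi - lo) // (m - 1)          # (90 - 20) // 4 = 17
    return [(lo + step * (k - 1)) / 100 for k in range(1, m + 1)]

def exact_scales(s_min=0.2, s_max=0.9, m=5):
    """Scales from s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1)."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1) for k in range(1, m + 1)]

print(prior_box_scales())  # [0.2, 0.37, 0.54, 0.71, 0.88]
print([round(s, 3) for s in exact_scales()])  # [0.2, 0.375, 0.55, 0.725, 0.9]
```

The two lists differ by at most 0.02, so the rounding changes the prior box sizes only slightly.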

[67] (2.2) A prior box with a different aspect ratio (generally set to $a_r \in \{1, 2, 3, \tfrac{1}{2}, \tfrac{1}{3}\}$) is set for each pixel cell in the feature map. For strip-shaped defects such as cracks in a dam, an aspect ratio of an original prior box is insufficient to label the defects completely, so an aspect ratio of the prior box is set to $a_r \in \{1, 2, 5, \tfrac{1}{2}, \tfrac{1}{5}\}$, as shown in Fig. 6;

[68] (2.3) The actual width and height of the prior box is calculated according to the following formula:

[69] $$w_k = |f_k| \cdot s_k \sqrt{a_r}, \qquad h_k = |f_k| \cdot s_k / \sqrt{a_r}$$

[70] To ensure the accuracy of the target detection and the integrity of the prior box coverage, each feature map is additionally provided with a prior box sized as $s_k' = \sqrt{s_k s_{k+1}}$ and with an aspect ratio of 1, namely each feature map is provided with two differently sized prior boxes with an aspect ratio of 1, so the aspect ratio of the prior box is actually set to $a_r \in \{1, 2, 5, \tfrac{1}{2}, \tfrac{1}{5}, 1'\}$; the center point of the prior box of each pixel cell is located at the center of the cell, namely $\left(\frac{i + 0.5}{|f_k|}, \frac{j + 0.5}{|f_k|}\right)$, $i, j \in [0, |f_k|)$, where $|f_k|$ is the size of the feature map;

[71] (2.4) Each prior box of each pixel outputs a value consisting of two parts which respectively correspond to a position of a prior box and confidence scores of various categories in the prior box. The position of the prior box contains four values (cx, cy, w, h) which respectively represent the center coordinate, the width and the height of the prior box. The confidence value represents the possibility that the target in the prior box corresponds to each category. If there are c categories in the detected target, it is necessary to predict c+1 confidence values, wherein the first confidence refers to a score without the target or belonging to the background.
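As a small worked example of step (2.4), the sketch below counts the values predicted per pixel cell and per feature map. It assumes six prior boxes per cell (the five aspect ratios plus the extra square box) and c = 5 categories (the four defect types and one engineering feature of this embodiment); the 38x38 feature map size and the function names are hypothetical and serve only as an illustration.

```python
def outputs_per_cell(num_priors, num_classes):
    """Each prior box predicts (c + 1) confidences and 4 position values."""
    return num_priors * ((num_classes + 1) + 4)

def outputs_per_feature_map(f_k, num_priors, num_classes):
    """Total number of predicted values for an f_k x f_k feature map."""
    return f_k * f_k * outputs_per_cell(num_priors, num_classes)

print(outputs_per_cell(num_priors=6, num_classes=5))              # 60
print(outputs_per_feature_map(38, num_priors=6, num_classes=5))   # 86640
```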

[72] (3) In the training process, adopting an improved non-maximum suppression method, namely adopting a method by which only negative samples are screened out during the redundant sample screening and elimination, so as to ensure the diversity of training samples.

[73] (3.1) For a dam defect data set of the present disclosure, there are very few real targets in each image, but there are many prior boxes. Therefore starting from the prior box, if an IoU ratio between the prior box and the real target is greater than 0.5, the prior box is listed as a positive sample, otherwise a negative sample; and confidences of all negative sample prior boxes are sorted in descending order to select a negative sample having the smallest confidence;

[74] (3.2) The other negative sample prior boxes are then traversed, and a negative sample prior box is deleted if its overlap with the lowest-scored prior box is greater than a threshold value of 0.5;

[75] (3.3) One of unprocessed negative sample prior boxes with the lowest confidence is selected to repeat the above steps, and only negative samples are deleted. Compared with traditional NMS, this method increases the diversity of sample training on the premise of deleting redundant prior boxes. Fig. 7 shows the detection for 4 types of defects and 1 type of engineering feature in the dam defect images, and such detection maintains relatively high accuracy for most defect features.
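The positive and negative assignment of step (3.1) can be sketched as follows. This is an illustration under assumptions: boxes are given as corner coordinates (x1, y1, x2, y2), a prior box is positive when its IoU with any real target exceeds 0.5, and the names `iou` and `label_priors` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_priors(priors, targets, iou_threshold=0.5):
    """Return one flag per prior box: True for positive, False for negative."""
    return [any(iou(p, t) > iou_threshold for t in targets) for p in priors]

priors = [(0, 0, 10, 10), (50, 50, 60, 60), (6, 0, 16, 10)]
targets = [(1, 0, 11, 10)]
labels = label_priors(priors, targets)
print(labels)                                   # [True, False, False]
print(labels.count(False) / max(1, labels.count(True)))  # negatives per positive
```

Even in this toy case the negatives outnumber the positives, which is the imbalance that the screening of steps (3.2) and (3.3) is meant to reduce.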

[76] In the present specification and claims (if any), the word 'comprising' and its derivatives including 'comprises' and 'comprise' include each of the stated integers but do not exclude the inclusion of one or more further integers.

Claims

WHAT IS CLAIMED IS:

1. A method for target detection oriented to dam defect images, comprising: (1) Adopting a deformable-convolution VGG16 network to extract defect features; (2) Performing multi-scale feature map detection by using a modified size proportion of a prior box; (3) In the training process, adopting an improved non-maximum suppression method to balance positive and negative samples, namely adopting a method by which only negative samples are screened out during the redundant sample screening and elimination, so as to ensure the diversity of training samples.

2. The method for target detection oriented to the dam defect images according to claim 1, wherein in Step (1), adopting a deformable-convolution VGG16 network to extract defect features comprises: for geometric deformation features of dam defects, applying the deformable convolution to a single-stage target detector SSD to modify the convolution in a backbone network VGG16 into the deformable convolution, expanding the convolutional sensitive range, and capturing deformation features of dam defects by learning a convolution offset.

3. The method for target detection oriented to the dam defect images according to claim 1, wherein in Step (1), adopting a deformable-convolution VGG16 network to extract defect features specifically comprises: (1.1) Original images are input, labeled as U, wherein batch is set to b; (1.2) The original image batch undergoes a common convolution padded in the "same" way, namely the output and input are unchanged in size, and a corresponding output result is the offset of each pixel in the original image batch; a number of output values is $b \times H \times W \times 2N$ which is labeled as V, wherein 2N represents an x offset and a y offset in each direction; (1.3) Adding a pixel index value in the images U to V to get an offset coordinate (namely a coordinate value in the original images U), and it is necessary to limit the coordinate value within the image size and use the floating-point coordinate value to obtain a pixel; (1.4) Using a bilinear interpolation method to obtain a pixel value at a floating-point coordinate; (1.5) After all pixels corresponding to coordinate values are calculated, a new feature map is obtained which is input into the next layer as input data.

4. The method for target detection oriented to the dam defect images according to claim 1, wherein in Step (2), performing multi-scale feature map detection by using the modified size proportion of the prior box specifically comprises: (2.1) A prior box with a different scale is set for each pixel cell in the feature map, and with the decrease of the feature map size, the size of the prior box increases linearly as follows:

$$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

Wherein $m$ is a number of feature maps, and as the prior box is separately sized for the convolution layer in the backbone network, $s_{max}$ and $s_{min}$ represent the maximum value and the minimum value based on the size proportion of the feature maps; (2.2) A prior box with a different aspect ratio is set for each pixel cell in the feature map. For strip-shaped defects such as cracks in a dam, an aspect ratio of an original prior box is insufficient to label the defects completely, so an aspect ratio of the prior box is set to:

$$a_r \in \{1, 2, 5, \tfrac{1}{2}, \tfrac{1}{5}\}$$

(2.3) The actual width and height of the prior box is calculated according to the following formula:

$$w_k = |f_k| \cdot s_k \sqrt{a_r}, \qquad h_k = |f_k| \cdot s_k / \sqrt{a_r}$$

Wherein $|f_k|$ is a size of the feature map; (2.4) Each prior box of each pixel outputs a value consisting of two parts which respectively correspond to a position of a prior box and confidence scores of various categories in the prior box.

5. The method for target detection oriented to the dam defect images according to claim 1, wherein in Step (3), balancing positive and negative samples by using the improved non-maximum suppression method in the training process specifically comprises: (3.1) Confidences of all negative sample prior boxes are sorted in descending order to select a negative sample having the smallest confidence; (3.2) After the check through other negative sample prior boxes, a negative sample prior box is deleted if it has an area which is overlapped with the lowest-scored prior box and greater than a threshold value of 0.5; (3.3) One of unprocessed negative sample prior boxes with the lowest confidence is selected to repeat the above steps, and only negative samples are deleted. Compared with traditional NMS, this method increases the diversity of sample training on the premise of deleting redundant prior boxes.