CN116385477A - Tower image registration method based on image segmentation - Google Patents
- Publication number
- CN116385477A (application CN202211618506.XA)
- Authority
- CN
- China
- Prior art keywords
- image
- tower
- algorithm
- feature
- points
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
- G06T7/194—Segmentation; Edge detection involving foreground-background segmentation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/30—Determination of transform parameters for the alignment of images, i.e. image registration
- G06T7/33—Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods
- G06T7/337—Determination of transform parameters for the alignment of images, i.e. image registration using feature-based methods involving reference images or patches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/30—Subject of image; Context of image processing
- G06T2207/30181—Earth observation
- G06T2207/30188—Vegetation; Agriculture
Abstract
The invention discloses a tower image registration method based on image segmentation, comprising the following steps: reconstructing the backbone feature extraction network of U-Net with VGG16 and integrating an attention mechanism CBAM module; training the improved U-Net network with an annotated tower data set; inputting a tower image into the trained improved U-Net network to obtain a high-precision tower segmentation map; producing a standard tower image template on which the tower targets to be detected appear clearly and completely; registering the tower segmentation map against the tower template map with a SIFT-based image registration algorithm; and having the inspection robot shoot intelligently according to the registration result during inspection. By registering the tower segmentation map rather than the raw image against the tower template, the method avoids the interference of complex backgrounds with registration accuracy.
Description
Technical Field
The invention belongs to the technical field of image segmentation, and particularly relates to a tower image registration method based on image segmentation.
Background
With the development of image recognition technology and rising levels of automation, automatic inspection by unmanned aerial vehicle (UAV) or robot has gradually replaced traditional manual inspection. In response to the national plan to build Smart Grid 2.0, researchers have in recent years turned their attention to automated inspection techniques, and the State Grid has begun using UAVs and robots for automated inspection operations. Over a long inspection route a UAV collects a large amount of tower image data, and if all of this image and video data is sent back to technicians on the ground, they must still spend a great deal of time diagnosing faults.
Disclosure of Invention
In order to achieve the above purpose, the inspection robot collects tower images and, by combining deep learning with image processing technology, identifies and screens them, retaining only the recognized fault images and a small number of unrecognizable images, so that most of the inspection time is saved. With defect identification added, robot inspection ensures the safety and reliability of the line inspection process and greatly improves inspection efficiency, while offering advantages such as low cost and little susceptibility to environmental and geographic factors.
In line inspection, the robot's images often have complicated, changeable backgrounds such as towns, mountains and forests; moreover, the acquisition viewing angle, the illumination intensity and the scale of the detection target all vary, which traditional image segmentation struggles to handle. Image segmentation algorithms based on deep learning achieve good segmentation results even in complex detection scenes. In the transmission line inspection scene, extracting the tower region accurately from a complex background requires pixel-level, high-precision segmentation of the tower edges. Commonly used deep learning segmentation algorithms include U-Net and Mask R-CNN, a segmentation algorithm improved from Faster R-CNN. Extracting the power towers from complex backgrounds removes background interference, can greatly improve the accuracy of tower image registration, and enables intelligent shooting of the towers.
The U-Net network structure mainly comprises three parts: an encoder, a decoder and skip connections. Two points in the U-Net network deserve attention. First, the input and output sizes are inconsistent: U-Net is a symmetrical network whose left half downsamples for feature extraction and whose right half upsamples to restore the condensed features to an image. When the U-Net input is a 572 × 572 image, the final output is 388 × 388, smaller than the input, because U-Net uses valid (unpadded) convolution throughout, so every convolution shrinks the feature map. Second, the overlap-tile strategy: when the input picture is too large to feed into the network whole, it must be cropped into a series of small tiles. To make segmentation at the tile boundaries more accurate, the tiles are cropped with overlap (the overlap-tile strategy); the overlapping regions supply extra context to the network, minimizing the effect of splitting a large image into small ones. The 572 × 572 input size is likewise a consequence of the overlap-tile strategy.
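The 572 → 388 size reduction mentioned above can be checked with a short calculation. This is a sketch following the stage counts of the original U-Net (four encoder stages, a bottleneck, four decoder stages); the function name is illustrative:

```python
def unet_valid_conv_sizes(size: int = 572) -> int:
    """Trace the spatial size through U-Net's valid (unpadded) convolutions.

    Each encoder stage applies two 3x3 valid convolutions (each shrinks the
    map by 2 px) followed by 2x2 max pooling; the decoder mirrors this with
    2x2 up-convolutions. Returns the final output size.
    """
    # Contracting path: 4 stages of (conv, conv, pool), then the bottleneck.
    for _ in range(4):
        size = size - 2 - 2      # two 3x3 valid convolutions
        size = size // 2         # 2x2 max pooling
    size = size - 2 - 2          # bottleneck convolutions

    # Expanding path: 4 stages of (up-conv x2, conv, conv).
    for _ in range(4):
        size = size * 2          # 2x2 up-convolution
        size = size - 2 - 2      # two 3x3 valid convolutions
    return size
```

Running it with the paper's input size reproduces the 388 × 388 output, confirming that the shrinkage comes entirely from the unpadded convolutions.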
The technical scheme of the invention is as follows: a tower image registration method based on image segmentation, comprising: reconstructing the backbone feature extraction network of U-Net with VGG16 and integrating an attention mechanism CBAM module; training the improved U-Net network with an annotated tower data set; inputting the tower image into the trained improved U-Net network to obtain a high-precision tower segmentation map; marking all targets to be detected at their specific positions on a pre-acquired, standard-view tower image with clear targets, thereby producing a standard tower image template on which the targets to be detected appear clearly and completely; registering the tower segmentation map against the tower template map with a SIFT-based image registration algorithm; and having the inspection robot shoot intelligently according to the registration result during inspection.
CBAM (Convolutional Block Attention Module) is a lightweight convolutional attention module that combines a channel attention module and a spatial attention module. Given an intermediate feature map in a feed-forward neural network, the CBAM module infers attention maps sequentially along two independent dimensions (channel and spatial), then multiplies them with the input feature map for adaptive feature refinement. Because CBAM is a lightweight, general-purpose module, its overhead is negligible: it can be integrated seamlessly into any CNN architecture and trained end-to-end together with the base CNN.
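As a rough illustration of the sequential channel-then-spatial weighting described above, here is a NumPy forward-pass sketch with randomly initialised weights. It is illustrative only, not the exact module of the CBAM paper: the 7×7 convolution of the original spatial attention module is simplified to a per-pixel average/max mix, and all weight shapes are assumptions:

```python
import numpy as np

def cbam(x, reduction=4, seed=0):
    """Forward pass of a CBAM-style block on a (C, H, W) feature map.

    Channel attention: a shared two-layer MLP over the global average- and
    max-pooled descriptors, combined through a sigmoid. Spatial attention:
    channel-wise average/max maps (simplified; the paper uses a 7x7 conv).
    """
    rng = np.random.default_rng(seed)
    C, H, W = x.shape
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # --- Channel attention ---
    w1 = rng.standard_normal((C // reduction, C)) * 0.1  # shared MLP, layer 1
    w2 = rng.standard_normal((C, C // reduction)) * 0.1  # shared MLP, layer 2
    avg = x.mean(axis=(1, 2))                    # (C,) global average pooling
    mx = x.max(axis=(1, 2))                      # (C,) global max pooling
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)  # Dense -> ReLU -> Dense
    mc = sigmoid(mlp(avg) + mlp(mx))             # (C,) channel weights
    x = x * mc[:, None, None]

    # --- Spatial attention (simplified mix instead of a 7x7 convolution) ---
    avg_s = x.mean(axis=0)                       # (H, W) channel-wise mean
    max_s = x.max(axis=0)                        # (H, W) channel-wise max
    ms = sigmoid(0.5 * avg_s + 0.5 * max_s)      # (H, W) spatial weights
    return x * ms[None, :, :]
```

The output has the same shape as the input, which is what lets CBAM drop into an existing CNN without changing the surrounding layers.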
As an improvement of the invention, the method comprises, after inputting the image, performing several convolutions with a specific number of channels to obtain a preliminary effective feature layer, then performing max pooling to obtain a downsampled feature layer.
As an improvement of the invention, the method comprises SIFT feature detection: first, the original image is rescaled to different scales and each scaled image is blurred with Gaussian kernels of different widths, according to the formula:

L(x,y,σ) = G(x,y,σ) * I(x,y)

In the above formula, I(x,y) is the input image, with x the width coordinate and y the height coordinate; σ is the scale-space blur coefficient; * denotes convolution; and L(x,y,σ) is the image blurred by the Gaussian kernel G(x,y,σ). The Gaussian kernel is:

G(x,y,σ) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))
After the Gaussian pyramid is obtained, adjacent images within the same octave of the pyramid are subtracted to obtain difference images, computed as:
D(x,y,σ)=(G(x,y,kσ)-G(x,y,σ))*I(x,y)
=L(x,y,kσ)-L(x,y,σ)
in the formula, k is the amplification factor of the space fuzzy coefficient, the operation is repeated in the Gaussian pyramid, the differential Gaussian image finally obtained by all layers forms the Gaussian differential pyramid, and the construction of the scale space in the SIFT algorithm is completed.
As an improvement of the invention, the method comprises, after the SIFT algorithm has built the scale space, locating the local extreme points in that scale space.
As an improvement of the present invention, locating the local extreme points comprises comparing each pixel point in the scale space with its neighbouring pixel points at the same scale and with the corresponding points at the adjacent scales above and below.
As an improvement of the present invention, the method includes that after key-point detection (the local extreme point positioning step) is complete, the SIFT algorithm determines a direction parameter for each key point, specifically:

m(x,y) = √( (L(x+1,y) − L(x−1,y))² + (L(x,y+1) − L(x,y−1))² )

In the above formula, m(x,y) represents the modulus of the gradient at the key point, and L(x,y) is the key point's value in the scale space; L(x±1, y±1) are its neighbours, giving the horizontal and vertical gradient components;

θ(x,y) = arctan( (L(x,y+1) − L(x,y−1)) / (L(x+1,y) − L(x−1,y)) )

In the above formula, θ(x,y) represents the direction of the gradient.
The method further comprises the feature descriptor extraction of the SIFT algorithm: taking each feature point as a centre, its neighbourhood is divided into sub-neighbourhoods, gradient features along several directions are computed in each sub-neighbourhood to obtain the final feature vector, and the feature vector is then normalized.
As an improvement of the invention, the method further comprises feature matching of the key points by the SIFT algorithm, with the similarity between key points measured by the Euclidean distance:

d(X,Y) = √( Σᵢ (xᵢ − yᵢ)² )

where X and Y are the two feature descriptors and i indexes their components. The SIFT algorithm applies a nearest-neighbour search to the key points in one image: for each, the nearest and second-nearest key points are found in the other image by Euclidean distance, and when the ratio of the Euclidean distance of the nearest pair to that of the second-nearest pair is smaller than a set threshold, the nearest pair is accepted as a valid matching pair.
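The nearest/second-nearest ratio test described above can be sketched in NumPy. This is a brute-force illustration (a real implementation would use a k-d tree or approximate search); the function name and the 0.8 default threshold are assumptions:

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Lowe-style matching: for each descriptor in desc_a, find its nearest
    and second-nearest neighbours in desc_b by Euclidean distance and keep
    the pair only when d_nearest < ratio * d_second."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distances
        order = np.argsort(dists)
        nearest, second = order[0], order[1]
        if dists[nearest] < ratio * dists[second]:
            matches.append((i, int(nearest)))
    return matches
```

The ratio test discards ambiguous key points whose best match is barely better than the runner-up, which is what filters out most false correspondences before homography estimation.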
As an improvement of the invention, the SIFT algorithm next calculates the homography matrix H between the two images, whose mathematical expression is:

H = | h11 h12 h13 |
    | h21 h22 h23 |
    | h31 h32 h33 |

If the homogeneous coordinates of a pair of matching points are (x, y, 1) and (x′, y′, 1), a matrix equation can be established through the homography matrix H:

[x′, y′, 1]ᵀ ∼ H [x, y, 1]ᵀ

which, expanded, gives:

x′ = (h11·x + h12·y + h13) / (h31·x + h32·y + h33)
y′ = (h21·x + h22·y + h23) / (h31·x + h32·y + h33)

In the homogeneous coordinate system, when the number of matched key points exceeds a certain value, the RANSAC algorithm iterates to obtain the optimal model and hence the optimal homography matrix.
Compared with the prior art, the invention has the following beneficial effects: the backbone feature extraction network of U-Net is reconstructed with VGG-16 and fused with the attention mechanism CBAM module, and image registration is performed between the tower segmentation map and the tower template, avoiding the interference of complex backgrounds with registration accuracy. Extracting the power towers from complex backgrounds can greatly improve the accuracy of tower image registration and enables intelligent shooting of the towers.
Drawings
Fig. 1 is a flowchart of the registration of a tower image in the present embodiment.
Fig. 2 is a schematic diagram of the attention mechanism CBAM module in this embodiment.
Fig. 3 is a schematic diagram of a U-Net network in this embodiment.
Detailed Description
The present invention is further illustrated by the following drawings and detailed description, which are intended merely to explain the invention and not to limit its scope.
Examples: as shown in fig. 1, a tower image registration method based on image segmentation comprises: reconstructing the backbone feature extraction network of U-Net with VGG16 and integrating the attention mechanism CBAM module shown in FIG. 2 (labels: input feature map, channel attention module, spatial attention module, refined feature map); training the improved U-Net network with the annotated tower data set; inputting the tower image into the trained improved U-Net network to obtain a high-precision tower segmentation map; producing a standard tower image template on which the tower targets to be detected appear clearly and completely; registering the tower segmentation map against the tower template map with a SIFT-based image registration algorithm; and having the inspection robot shoot intelligently according to the registration result during inspection.
The U-Net network outlined above is shown in FIG. 3 (labels: input image, output segmentation map, conv: convolution, ReLU: activation function, max pool: max pooling, up-conv: upsampling, copy and crop: feature map fusion and concatenation). Its applicability to the transmission line inspection scene is analysed starting from the problems in tower image segmentation: because illumination and background are changeable, the tower boundary is hard to distinguish from the background and the gradient variation is complex, so tower image segmentation needs high-resolution information for accurate delineation; the tower skeleton structure is relatively fixed and can be described by simple semantic information, so low-resolution information suffices to identify the tower target; and transmission line inspection data sets are few, since enough tower pictures are hard to obtain for security reasons and data set production is time-consuming and labour-intensive.
Because the skip-connect operation in U-Net combines low-resolution information (providing the basis for object class identification) with high-resolution information (providing the basis for accurate segmentation and localization), U-Net segments accurately and handles details well. U-Net also suits data sets with few samples — the original authors used only 30 pictures as the training set — and it is a lightweight model, the original being only about 28 MB. Given these advantages, the invention applies the idea of U-Net to tower segmentation, but addresses its two problems: valid convolution increases the difficulty of model design and reduces generality, and the overlap-tile strategy, suited to medical images, does not fit common scenes. Since the overall structure of U-Net's backbone feature extraction part is similar to VGG-16, the invention replaces the backbone feature extraction network with VGG-16 so that pre-training weights on ImageNet can be used.
When the input image size is 3×512×512, the specific implementation manner is as follows:
1) conv1: the convolution of 64 channels of [3,3] is carried out twice to obtain a preliminary effective characteristic layer of [512,512,64], and 2×2 max pooling is carried out to obtain a characteristic layer of [256,256,64 ].
2) conv2: the convolution of 128 channels of [3,3] is carried out twice to obtain a preliminary effective characteristic layer of [256,256,128], and 2×2 max pooling is carried out to obtain a characteristic layer of [128,128,128 ].
3) conv3: 256-channel convolution of [3,3] is carried out three times to obtain a preliminary effective characteristic layer of [128,128,256], and 2×2 maximum pooling is carried out to obtain a characteristic layer of [64,64,256 ].
4) conv4: the 512-channel convolution of [3,3] is performed three times to obtain a preliminary effective feature layer of [64,64,512], and then 2×2 max pooling is performed to obtain a feature layer of [32,32,512].
5) conv5: three convolutions of 512 channels [3,3] were performed to obtain a preliminary effective feature layer of [32,32,512 ].
SIFT feature detection comprises three modules: scale space construction, key point positioning and key point direction determination. The overall pipeline is: detect key points, generate local descriptors, match according to a similarity measure, and refine the matches. To detect feature points at different scales, SIFT first rescales the original image to multiple scales and blurs each scaled image with Gaussian kernels of different widths.
L(x,y,σ)=G(x,y,σ)*I(x,y)
In the above formula, σ represents the scale-space blur coefficient, and the Gaussian kernel is:

G(x,y,σ) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))

where I(x,y) represents the input image, * denotes convolution, and L(x,y,σ) represents the image blurred by the Gaussian kernel G(x,y,σ); the set of blurred images forms the Gaussian pyramid. After the Gaussian pyramid is obtained, adjacent images within the same octave are subtracted to obtain difference images, computed as:
D(x,y,σ)=(G(x,y,kσ)-G(x,y,σ))*I(x,y)
=L(x,y,kσ)-L(x,y,σ)
Repeating this operation throughout the Gaussian pyramid, the difference-of-Gaussian images from all layers form the Gaussian difference pyramid, completing the construction of the scale space in the SIFT algorithm.
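The construction just described — L(x,y,σ) = G(x,y,σ) * I(x,y) followed by D = L(kσ) − L(σ) — can be sketched in NumPy. This is a single-octave sketch under simplifying assumptions: FFT-based circular convolution is used for brevity, and a full SIFT implementation would also downsample between octaves:

```python
import numpy as np

def gaussian_kernel(sigma):
    """2-D Gaussian kernel G(x, y, sigma), normalised to sum to 1."""
    r = max(1, int(3 * sigma))               # truncate at ~3 sigma
    ax = np.arange(-r, r + 1)
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return g / g.sum()

def dog_octave(image, sigma=1.6, k=2 ** 0.5, levels=4):
    """One octave of the difference-of-Gaussians pyramid:
    D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)."""
    def blur(img, s):
        g = gaussian_kernel(s)
        r = g.shape[0] // 2
        ker = np.zeros_like(img, dtype=float)
        ker[:g.shape[0], :g.shape[1]] = g
        ker = np.roll(ker, (-r, -r), axis=(0, 1))   # centre kernel at (0, 0)
        return np.real(np.fft.ifft2(np.fft.fft2(img) * np.fft.fft2(ker)))

    blurred = [blur(image, sigma * k**i) for i in range(levels)]  # L(., k^i sigma)
    return [blurred[i + 1] - blurred[i] for i in range(levels - 1)]
```

With `levels` blurred images, subtraction of each adjacent pair yields `levels - 1` difference-of-Gaussian images, exactly as in the pyramid construction above.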
After the scale space is built, the SIFT algorithm locates the key points in it: the local extreme points of the Gaussian difference pyramid are the key points to be positioned. To ensure that no extreme point is missed, SIFT compares each pixel in the scale space with its 8 neighbours at the same scale and the 9 corresponding points at each of the adjacent scales above and below — 26 points in total.
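The 26-neighbour comparison can be expressed directly as a 3×3×3 window check over a stack of DoG levels. A minimal sketch (boundary handling and sub-pixel refinement are omitted; the function name is illustrative):

```python
import numpy as np

def is_local_extremum(dog_stack, s, i, j):
    """Check whether pixel (i, j) on DoG level s is an extremum among its
    26 neighbours: the 3x3 window on its own level plus the 3x3 windows on
    the levels directly above and below (3*9 - 1 = 26 comparisons).

    dog_stack has shape (levels, H, W); s, i, j must be interior indices.
    """
    cube = dog_stack[s - 1:s + 2, i - 1:i + 2, j - 1:j + 2]  # 3x3x3 block
    centre = dog_stack[s, i, j]
    return centre == cube.max() or centre == cube.min()
```

A point is kept as a candidate key point only if it is strictly the maximum or minimum of its 3×3×3 neighbourhood.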
After key-point detection is complete, the SIFT algorithm also determines a direction parameter for each key point:

m(x,y) = √( (L(x+1,y) − L(x−1,y))² + (L(x,y+1) − L(x,y−1))² )

θ(x,y) = arctan( (L(x,y+1) − L(x,y−1)) / (L(x+1,y) − L(x−1,y)) )

where m(x,y) represents the modulus of the gradient at the key point and θ(x,y) represents the direction of the gradient.
The SIFT algorithm then extracts feature descriptors. Taking each feature point as a centre, its neighbourhood is divided into 4 × 4 sub-neighbourhoods, and gradient features along 8 directions are computed in each, giving each key point a feature vector of 4 × 4 × 8 = 128 dimensions. Finally the feature vector is normalized, which removes the influence of brightness changes.
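The 4 × 4 × 8 = 128-dimensional descriptor layout can be sketched as follows. This is a simplified illustration under stated assumptions: it takes an 18×18 patch (so central differences yield a 16×16 gradient field) and omits the Gaussian weighting, rotation to the dominant orientation, and contrast clipping of full SIFT:

```python
import numpy as np

def sift_descriptor(patch):
    """Build a 128-dimensional SIFT-style descriptor from an 18x18 patch
    around a keypoint: 4x4 sub-regions, 8 orientation bins each."""
    # Finite-difference gradients, as in m(x, y) and theta(x, y) above.
    dx = patch[1:-1, 2:] - patch[1:-1, :-2]          # (16, 16) horizontal
    dy = patch[2:, 1:-1] - patch[:-2, 1:-1]          # (16, 16) vertical
    mag = np.sqrt(dx**2 + dy**2)                     # gradient modulus m
    ang = np.arctan2(dy, dx) % (2 * np.pi)           # gradient direction theta

    desc = []
    for bi in range(4):                              # 4x4 grid of sub-regions
        for bj in range(4):
            m = mag[bi*4:(bi+1)*4, bj*4:(bj+1)*4].ravel()
            a = ang[bi*4:(bi+1)*4, bj*4:(bj+1)*4].ravel()
            hist, _ = np.histogram(a, bins=8, range=(0, 2*np.pi), weights=m)
            desc.append(hist)                        # 8 orientation bins
    desc = np.concatenate(desc)                      # 16 * 8 = 128 dimensions
    return desc / (np.linalg.norm(desc) + 1e-12)     # final normalisation step
```

The final division by the vector norm is the normalisation step described above, which makes the descriptor insensitive to uniform brightness changes.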
The last step of the SIFT algorithm is feature matching between key points, with the similarity between key points measured by the Euclidean distance:

d(X,Y) = √( Σᵢ (xᵢ − yᵢ)² )

where X and Y are two feature descriptors and i indexes their 128 components.
the SIFT algorithm uses a nearest neighbor algorithm [75] on the keypoints in one graph, and finds nearest neighbor and next-neighbor keypoints according to euclidean distances in the other graph, and when the ratio of the euclidean distance of the nearest neighbor keypoint pair to the euclidean distance of the next-neighbor keypoint pair is smaller than a set threshold, the nearest neighbor pair is a valid matching point. Then, calculating a homography matrix H between two images by using a SIFT algorithm, wherein the mathematical expression of the matrix H is as follows:
If the homogeneous coordinates of a pair of matching points are (x, y, 1) and (x′, y′, 1), a matrix equation can be established through the homography matrix H:

[x′, y′, 1]ᵀ ∼ H [x, y, 1]ᵀ

which, expanded, gives:

x′ = (h11·x + h12·y + h13) / (h31·x + h32·y + h33)
y′ = (h21·x + h22·y + h23) / (h31·x + h32·y + h33)
In the homogeneous coordinate system, when more than 4 key-point matches are available, the RANSAC algorithm iterates to obtain the optimal model, removing outlier matching pairs to yield the optimal homography matrix. Finally, registration model estimation is performed and the tower image is resampled to the same size as the template image.
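The homography estimation underlying the step above can be sketched with a minimal direct linear transform (DLT). This is a sketch, not the patent's implementation: production code would normalise coordinates for conditioning and wrap this solver in RANSAC, calling it on random 4-point samples and keeping the model with the most inliers:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 homography H mapping src -> dst by the direct
    linear transform: each correspondence (x, y) -> (x', y') contributes
    two rows to A h = 0, and h is the right singular vector of A with the
    smallest singular value. Needs at least 4 correspondences."""
    A = []
    for (x, y), (xp, yp) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, xp * x, xp * y, xp])
        A.append([0, 0, 0, -x, -y, -1, yp * x, yp * y, yp])
    _, _, vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]                     # fix the free scale of the solution

def apply_homography(H, pts):
    """Map 2-D points through H using homogeneous coordinates (x, y, 1)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = (H @ pts_h.T).T
    return mapped[:, :2] / mapped[:, 2:3]  # divide out the third component
```

`apply_homography` performs exactly the expansion given above: the third homogeneous component h31·x + h32·y + h33 is divided out to recover (x′, y′), which is also how the registered tower image is resampled onto the template.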
It should be noted that the foregoing merely illustrates the technical idea of the present invention and is not intended to limit the scope of the present invention, and that a person skilled in the art may make several improvements and modifications without departing from the principles of the present invention, which fall within the scope of the claims of the present invention.
Claims (10)
1. A tower image registration method based on image segmentation, the method comprising:
reconstructing a backbone feature extraction network of the U-Net by using VGG16, and integrating the backbone feature extraction network into a attention mechanism CBAM module; training an improved U-Net network by using the marked pole tower data set; inputting the tower image into a trained improved U-Net network to obtain a high-precision tower segmentation map; manufacturing a standard tower image template, wherein a clear and complete tower to-be-detected target is arranged on the template; registering the tower segmentation map and the tower template map by using an SIFT-based image registration algorithm; and the inspection robot performs intelligent shooting according to the registration result in the inspection process.
2. The method for registering a tower image based on image segmentation according to claim 1, wherein the method comprises performing convolution of specific channels several times after inputting the image to obtain a preliminary effective feature layer, and performing maximum pooling to obtain a feature layer.
3. The method for registering a tower image based on image segmentation according to claim 2, wherein the method comprises SIFT feature detection: firstly, rescaling the original image to different scales and blurring each scaled image with Gaussian kernels of different widths, according to the formula:

L(x,y,σ) = G(x,y,σ) * I(x,y)

in the above formula, I(x,y) is the input image, x is the width coordinate, y is the height coordinate, σ is the scale-space blur coefficient, * denotes convolution, and the Gaussian kernel is:

G(x,y,σ) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))
subtracting the adjacent images obtained in the same octave of the pyramid to obtain a difference image, according to the formula:
D(x,y,σ)=(G(x,y,kσ)-G(x,y,σ))*I(x,y)
=L(x,y,kσ)-L(x,y,σ)
in the above formula, k is the multiplicative factor between adjacent scale-space blur coefficients; repeating the operation throughout the Gaussian pyramid, the difference-of-Gaussian images from all layers form the Gaussian difference pyramid, completing the scale space construction in the SIFT algorithm.
4. A tower image registration method based on image segmentation according to claim 3, wherein the method comprises the following step of locating local extreme points in a scale space after the scale space is constructed by a SIFT algorithm.
5. The method of claim 4, wherein locating local extreme points comprises comparing each pixel in the scale space with its neighbouring pixels at the same scale and with the corresponding points at the adjacent scales above and below.
6. The method for registering a tower image based on image segmentation according to claim 5, wherein the method comprises determining a direction parameter of each key point by the SIFT algorithm after key point detection is completed, according to the following formulas:

m(x,y) = √( (L(x+1,y) − L(x−1,y))² + (L(x,y+1) − L(x,y−1))² )

in the above formula, m(x,y) represents the modulus of the key point gradient;

θ(x,y) = arctan( (L(x,y+1) − L(x,y−1)) / (L(x+1,y) − L(x−1,y)) )

in the above formula, θ(x,y) represents the direction of the gradient.
7. The method for registering a tower image based on image segmentation according to claim 6, further comprising extracting feature descriptors from a SIFT algorithm, wherein the SIFT algorithm uses each feature point as a center, divides a neighborhood of the SIFT algorithm into sub-neighborhoods, calculates gradient features in a plurality of directions in each sub-neighborhood to obtain final feature vectors, and finally normalizes the feature vectors.
8. The method for registering a tower image based on image segmentation according to claim 7, further comprising feature matching of key points by the SIFT algorithm, with the similarity between key points measured by the Euclidean distance:

d(X,Y) = √( Σᵢ (xᵢ − yᵢ)² )

wherein X and Y are two feature descriptors and i indexes their components.
the SIFT algorithm uses a nearest neighbor algorithm for key points in one graph, and searches nearest neighbor key points and secondary neighbor key points in the other graph according to Euclidean distances, and when the ratio of the Euclidean distance of the nearest neighbor key point pair to the Euclidean distance of the secondary neighbor key point pair is smaller than a set threshold, the nearest neighbor point pair is a pair of effective matching points.
9. The method for registering a tower image based on image segmentation according to claim 8, wherein the SIFT algorithm is used to calculate a homography matrix H between the two images, whose mathematical expression is:

H = | h11 h12 h13 |
    | h21 h22 h23 |
    | h31 h32 h33 |

and wherein, if the homogeneous coordinates of a pair of matching points are (x, y, 1) and (x′, y′, 1), a matrix equation is established through the homography matrix H:

[x′, y′, 1]ᵀ ∼ H [x, y, 1]ᵀ
10. The method for registering tower images based on image segmentation according to claim 9, wherein, in the homogeneous coordinate system, when the number of matched key points exceeds a certain value, the RANSAC algorithm iterates to obtain an optimal homography matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211618506.XA CN116385477A (en) | 2022-12-15 | 2022-12-15 | Tower image registration method based on image segmentation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116385477A true CN116385477A (en) | 2023-07-04 |
Family
ID=86960281
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211618506.XA Pending CN116385477A (en) | 2022-12-15 | 2022-12-15 | Tower image registration method based on image segmentation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116385477A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116993644A (en) * | 2023-09-27 | 2023-11-03 | 广东工业大学 | Multi-focus image fusion method and device based on image segmentation |
CN116993644B (en) * | 2023-09-27 | 2024-01-19 | 广东工业大学 | Multi-focus image fusion method and device based on image segmentation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108961235B (en) | Defective insulator identification method based on YOLOv3 network and particle filter algorithm | |
CN110084850B (en) | Dynamic scene visual positioning method based on image semantic segmentation | |
CN106683119B (en) | Moving vehicle detection method based on aerial video image | |
CN105335973A (en) | Visual processing method for strip steel processing production line | |
CN112766136B (en) | Space parking space detection method based on deep learning | |
CN112819748B (en) | Training method and device for strip steel surface defect recognition model | |
CN110245566B (en) | Infrared target remote tracking method based on background features | |
CN114913498A (en) | Parallel multi-scale feature aggregation lane line detection method based on key point estimation | |
CN111914596B (en) | Lane line detection method, device, system and storage medium | |
CN116385477A (en) | Tower image registration method based on image segmentation | |
CN114494373A (en) | High-precision rail alignment method and system based on target detection and image registration | |
CN113989604A (en) | Tire DOT information identification method based on end-to-end deep learning | |
CN106682668A (en) | Power transmission line geological disaster monitoring method using unmanned aerial vehicle to mark images | |
CN109544608B (en) | Unmanned aerial vehicle image acquisition characteristic registration method | |
KR102416714B1 (en) | System and method for city-scale tree mapping using 3-channel images and multiple deep learning | |
CN116310902A (en) | Unmanned aerial vehicle target detection method and system based on lightweight neural network | |
CN116188943A (en) | Solar radio spectrum burst information detection method and device | |
CN113723447B (en) | End-to-end template matching method for multi-mode image | |
CN114332814A (en) | Parking frame identification method and device, electronic equipment and storage medium | |
CN113642430A (en) | High-precision visual positioning method and system for underground parking lot based on VGG + NetVLAD | |
CN115019306A (en) | Embedding box label batch identification method and system based on deep learning and machine vision | |
CN115147644A (en) | Method, system, device and storage medium for training and describing image description model | |
CN115019201A (en) | Weak and small target detection method based on feature refined depth network | |
CN111046861B (en) | Method for identifying infrared image, method for constructing identification model and application | |
CN113379738A (en) | Method and system for detecting and positioning epidemic trees based on images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||