CN117218345A - Semantic segmentation method for electric power inspection image
- Publication number: CN117218345A
- Application number: CN202311183380.2A
- Authority: CN
- Prior art keywords: cross-modal, feature map
- Legal status: Pending
Abstract
The invention discloses a semantic segmentation method for electric power inspection images, comprising the following steps: S1, acquiring an RGB image, a thermodynamic diagram and a depth map from power inspection; S2, extracting features from the RGB image, the thermodynamic diagram and the depth map respectively, and performing cross-modal feature fusion on the feature maps to obtain a cross-modal feature map; S3, performing coarse segmentation of instance regions on the cross-modal feature map to obtain coarse segmentation instance regions; S4, extracting pixel-level-to-instance-level feature association information from the cross-modal feature map based on the coarse segmentation instance regions; and S5, performing pixel-level feature enhancement in the cross-modal feature map based on the feature association information to obtain the predicted semantic segmentation result map. Compared with traditional methods and CNN-based image segmentation methods, the proposed method represents the contextual relations and global semantic information among features more fully; compared with attention-based image segmentation methods, it has fewer parameters and faster inference.
Description
Technical Field
The invention belongs to the technical field of power equipment inspection, and particularly relates to a semantic segmentation method for an electric power inspection image.
Background
In power inspection tasks, image semantic segmentation is an important technology that helps automatically identify power equipment, defects and other key factors, thereby improving inspection efficiency and accuracy. Existing research methods mainly include the following: (1) Semantic segmentation based on traditional computer vision methods. Such methods rely mainly on low-level visual features of the image; common techniques include graph cut, clustering and edge detection. (2) Semantic segmentation based on convolutional neural networks (CNNs). Such models adopt an encoder-decoder architecture, in which the encoder extracts image features and the decoder maps these features back to pixel-level segmentation results. In the encoder, common network structures include VGGNet and ResNet; the decoder typically uses transposed convolution layers for up-sampling and restoring resolution. (3) Semantic segmentation methods based on attention mechanisms. Such methods let the model automatically learn contextual dependencies between different locations in the image, adaptively focus on regions relevant to the segmentation task, and dynamically assign weights to different regions according to image content.
Semantic segmentation based on traditional computer vision methods often depends on hand-designed low-level image features. It cannot characterize high-level semantic information well, cannot effectively learn contextual relations between pixels, struggles to perform semantic understanding of the power-grid panorama, and performs poorly in power inspection scenes with complex and changeable environments. Semantic segmentation based on convolutional neural networks can only extract local features; it has difficulty capturing long-range dependencies between pixels, cannot handle overlapping and occluded objects well in complex scenes, is sensitive to appearance changes of power equipment caused by factors such as environment and illumination, and generalizes poorly across scenes. Semantic segmentation methods based on attention mechanisms typically require computing correlation weights between each location and every other location; for large images or high-resolution feature maps, this significantly increases model computational complexity and thus the time cost of training and inference, making it difficult to meet the low-compute, high-real-time requirements of scenarios such as unmanned aerial vehicle inspection.
Disclosure of Invention
Aiming at the defects in the prior art, the semantic segmentation method for power inspection images provided by the invention solves the problem that existing semantic segmentation models struggle to segment semantics in complex scenes with occlusion and appearance changes when detecting power inspection defects.
In order to achieve the above aim, the invention adopts the following technical scheme. A semantic segmentation method for an electric power inspection image comprises the following steps:
S1, acquiring multi-modal image data of power inspection;
wherein the multi-modal image data includes an RGB image, a thermodynamic diagram and a depth map;
S2, extracting features from the RGB image, the thermodynamic diagram and the depth map respectively, and performing cross-modal feature fusion on the feature maps to obtain a cross-modal feature map;
S3, performing coarse segmentation of instance regions on the cross-modal feature map to obtain coarse segmentation instance regions;
S4, extracting pixel-level-to-instance-level feature association information from the cross-modal feature map based on the coarse segmentation instance regions;
and S5, performing pixel-level feature enhancement in the cross-modal feature map based on the feature association information to obtain the predicted semantic segmentation result map.
Further, in the step S2, the power-image visual features of the RGB image are extracted through a MobileNet model to obtain a visual feature map XF;
the pixel heat-intensity change features in the thermodynamic diagram are extracted through a ShuffleNet model to obtain a thermal feature map XT;
and the structural features of lines and equipment are extracted from the depth map through a PointNet model to obtain a depth feature map XD.
Further, in the step S2, a bi-directional attention mechanism is adopted to perform cross-modal feature fusion, so as to obtain a cross-modal feature map X, where the expression is as follows:
XF′=XF+Attention(XF,XD)+Attention(XF,XT)
XD′=XD+Attention(XD,XF)+Attention(XD,XT)
XT′=XT+Attention(XT,XD)+Attention(XT,XF)
X=Concat(XF′,XD′,XT′)
in the formula, XF ' is a visual feature map fused with thermal information and depth information, XD ' is a depth feature map fused with visual information and thermal information, XT ' is a thermal feature map fused with visual information and depth information, attention (·) is an Attention mechanism, and Concat (·) is a splicing operation.
Further, the step S3 specifically includes:
S31, performing convolution operations on the cross-modal feature map X using dilated (hole) convolution kernels w with different dilation rates to obtain convolved cross-modal feature maps X′;
S32, sequentially performing global average pooling, 1×1 convolution for dimension raising, and concatenation on the convolved cross-modal feature maps X′ of the different dilation rates to realize multi-scale information fusion and obtain a multi-scale cross-modal feature map X″;
S33, merging the deep semantic features of the multi-scale cross-modal feature map X″ with the shallow semantic features of the initial cross-modal feature map X through a skip connection to obtain encoded features;
S34, performing up-sampling decoding on the encoded features using transposed convolution to obtain the coarse segmentation instance regions corresponding to all instances.
Further, in the step S31, the convolved cross-modal feature map X′ at an arbitrary position i is expressed as:

$$X'_i = \sum_{k} X_{i + r \cdot k} \, w_k$$

where k denotes a position on the convolution kernel, r denotes the dilation rate of the dilated convolution, w is the dilated convolution kernel, and X is the cross-modal feature map;
in the step S34, the coarse segmentation instance region is:
M=Deconv(X+X″)
where X is a cross-modal feature map and Deconv (·) is a transpose convolution operation.
Further, the step S4 specifically includes:
S41, performing weighted summation of the pixel-level representations within the coarse segmentation instance region of the same class to obtain the instance-level representations;
S42, extracting the feature association information between each pixel-level representation and the instance-level representations using the similarity between the pixel-level representation and the instance-level representation of the same class.
Further, in the step S41, the instance-level representation f_k is:

$$f_k = \sum_{i \in I} M_{ki} \, X_i$$

where X_i is the i-th pixel-level representation, M_{ki} is the normalized probability that the i-th pixel belongs to class k, i.e. the value of M_k at the i-th position, and I is the pixel set;

in the step S42, the feature association information w_{ik} is:

$$w_{ik} = \frac{e^{\kappa(X_i,\, f_k)}}{\sum_{j=1}^{K} e^{\kappa(X_i,\, f_j)}}$$

where κ(·,·) is the relation function; the softmax over the K regions normalizes the weights.
Further, the step S5 specifically includes:
S51, using the feature association information as weights, weighting and aggregating the instance-level representations of the K regions to obtain association features;
S52, enhancing each pixel-level representation in the cross-modal feature map with the association features to obtain enhanced pixel-level feature representations;
S53, performing a transposed convolution operation on the enhanced pixel-level feature representations to obtain the final predicted semantic segmentation result map.
Further, in the step S51, the association feature Y_i obtained by using the feature association information of the i-th pixel-level representation as weights is:

$$Y_i = \rho\Big(\sum_{k=1}^{K} w_{ik}\, \delta(f_k)\Big)$$

where ρ(·) and δ(·) are both transformation functions, w_{ik} is the feature association information, and f_k is the instance-level representation;
in the step S52, the enhanced pixel level feature representation Z is:
Z=Concat(X,Y)
where Concat(·) is the concatenation operation and X is the cross-modal feature map;
in the step S53, the semantic segmentation result map is expressed as:
R=Deconv(Z)
wherein R is a semantic label represented by each pixel in the semantic segmentation result graph, and Deconv (·) is a transpose convolution operation.
The beneficial effects of the invention are as follows:
(1) The invention fully considers the multi-modal and multi-scale information in the power inspection scene. For input images of different modalities, different types of lightweight backbone networks are adopted to extract key features.
(2) The invention further extracts semantic features using multi-rate dilated convolution, and obtains well-formed coarse segmentation instance regions through transposed-convolution up-sampling.
(3) The invention aggregates the dependency information between pixels and object regions into the pixel representation using the pixel-to-instance-region feature association information, so that each pixel representation better approximates the abstract representation of the instance it belongs to, yielding a refined semantic segmentation result.
(4) Compared with traditional methods and CNN-based image segmentation methods, the invention represents the contextual relations and global semantic information among features more fully; compared with attention-based image segmentation methods, it has fewer parameters and faster inference.
Drawings
Fig. 1 is a flowchart of a semantic segmentation method of a power inspection image provided by the invention.
Fig. 2 is a schematic diagram of cross-modal feature diagram construction provided by the present invention.
FIG. 3 is a schematic diagram of pixel-level to instance-level feature association and feature enhancement provided by the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments; to those skilled in the art, all inventions that make use of the inventive concept fall under protection as long as they remain within the spirit and scope of the invention as defined and determined by the appended claims.
Example 1:
The embodiment of the invention provides a semantic segmentation method for a power inspection image, as shown in fig. 1, comprising the following steps:
S1, acquiring multi-modal image data of power inspection;
wherein the multi-modal image data includes an RGB image, a thermodynamic diagram and a depth map;
S2, extracting features from the RGB image, the thermodynamic diagram and the depth map respectively, and performing cross-modal feature fusion on the feature maps to obtain a cross-modal feature map;
S3, performing coarse segmentation of instance regions on the cross-modal feature map to obtain coarse segmentation instance regions;
S4, extracting pixel-level-to-instance-level feature association information from the cross-modal feature map based on the coarse segmentation instance regions;
and S5, performing pixel-level feature enhancement in the cross-modal feature map based on the feature association information to obtain the predicted semantic segmentation result map.
In step S1 of the embodiment of the present invention, during power inspection, multi-modal image data are collected by multiple devices such as a high-definition color camera, a thermal imager and a laser range finder, including an RGB image, a thermodynamic diagram and a depth map. The RGB high-definition images of power lines and equipment shot by the high-definition camera provide visual information; the thermodynamic diagram of the heat distribution over the surfaces of electrical facilities and their surrounding areas, detected by the thermal imager, helps discover possible overload problems of power lines and equipment; and the depth map obtained by 3D structural scanning with the laser range finder provides structural information such as the three-dimensional spatial position and size of lines and equipment.
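For concreteness, one such multi-modal sample can be held in a simple container. The following Python sketch is illustrative only: the field names and tensor shapes are assumptions, since the patent specifies only that RGB, thermodynamic and depth images are collected.

```python
# A minimal sketch of a container for one multi-modal inspection sample.
from dataclasses import dataclass
import torch

@dataclass
class InspectionSample:
    rgb: torch.Tensor      # (3, H, W) high-definition color image
    thermal: torch.Tensor  # (1, H, W) heat-distribution map from the thermal imager
    depth: torch.Tensor    # (1, H, W) depth map from the laser range finder

sample = InspectionSample(
    rgb=torch.rand(3, 480, 640),
    thermal=torch.rand(1, 480, 640),
    depth=torch.rand(1, 480, 640),
)
```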
In step S2 of the embodiment of the present invention, as shown in fig. 2, a parallel multi-branch structure is designed to learn feature maps from images of the different modalities, specifically:
the RGB image Xf mainly represents visual information such as edges, textures and the like, so that a depth separable convolution-based MobileNet model is adopted to extract the visual characteristics of the power image of the RGB image, and a visual characteristic diagram XF is obtained; the method comprises the steps of taking a MobileNet model as a light backbone network, efficiently extracting visual characteristics of an electric image, enabling depth separable convolution to be composed of two steps of depth convolution and point-by-point convolution, firstly, performing independent convolution operation on each channel of an input feature map, and then performing dimension lifting or dimension reduction by applying 1*1 convolution operation on the basis.
The thermodynamic diagram Xt mainly expresses pixel heat-intensity change information and focuses on positional relations and local patterns of the equipment, so the pixel heat-intensity change features in the thermodynamic diagram are extracted through a ShuffleNet model to obtain the thermal feature map XT. ShuffleNet serves as a lightweight backbone network that reduces computational complexity through group convolution while applying a channel-shuffle operation to improve the expressiveness of the features. This approach is well suited to characterizing the thermal-change laws and local patterns within the image.
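The channel-shuffle operation mentioned above can be sketched as follows; the group count is an illustrative assumption:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    n, c, h, w = x.shape
    # reshape to (n, groups, c // groups, h, w), swap the two channel axes,
    # then flatten back so channels from different groups are interleaved
    return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

xt = channel_shuffle(torch.rand(1, 64, 120, 160), groups=4)
```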
The depth map Xd mainly contains structural information such as the three-dimensional spatial position and size of lines and equipment, so the structural features of lines and equipment are extracted from the depth map through a PointNet model to obtain the depth feature map XD. PointNet serves as a lightweight backbone network: independent features are learned for each point through a shared multi-layer perceptron, max-pooling is then performed over all point features to take the maximum response value as the global feature, and the depth feature map XD is finally obtained.
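A minimal sketch of this shared per-point MLP followed by max-pooling; the layer widths and the conversion of the depth map into a point set are assumptions not detailed in the patent:

```python
import torch
import torch.nn as nn

class PointFeatureExtractor(nn.Module):
    def __init__(self, in_dim: int = 3, feat_dim: int = 64):
        super().__init__()
        # shared MLP applied independently to every point
        self.mlp = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, feat_dim))

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, in_dim) -> per-point features (N, feat_dim)
        feats = self.mlp(points)
        # max over all points keeps the strongest response per feature channel
        return feats.max(dim=0).values  # (feat_dim,) global structure descriptor

xd_global = PointFeatureExtractor()(torch.rand(1024, 3))
```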
In step S2 of the embodiment of the present invention, cross-modal feature fusion is performed on the extracted visual feature map XF, depth feature map XD, and thermal feature map XT by using a bidirectional attention mechanism, so as to obtain a cross-modal feature map X, where the expression is as follows:
XF′=XF+Attention(XF,XD)+Attention(XF,XT)
XD′=XD+Attention(XD,XF)+Attention(XD,XT)
XT′=XT+Attention(XT,XD)+Attention(XT,XF)
X=Concat(XF′,XD′,XT′)
in the formula, XF ' is a visual feature map fused with thermal information and depth information, XD ' is a depth feature map fused with visual information and thermal information, XT ' is a thermal feature map fused with visual information and depth information, attention (·) is an Attention mechanism, and Concat (·) is a splicing operation.
Through the above procedure, this embodiment obtains the cross-modal feature map: deep cross-modal fusion is realized while the independent information of each modality is preserved, and the relevance and complementarity among the three modal features are fully exploited.
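A minimal sketch of this bidirectional fusion, assuming Attention(A, B) is scaled dot-product cross-attention with A providing the queries and B the keys and values; the patent does not fix this choice, and the shared projection weights and dimensions below are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Scaled dot-product attention: queries from a, keys/values from b."""
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: (num_positions, dim) feature maps flattened over space
        attn = F.softmax((self.q(a) @ self.k(b).t()) * self.scale, dim=-1)
        return attn @ self.v(b)

dim, n = 64, 24 * 32  # small spatial grid keeps the attention matrix manageable
xf, xd, xt = (torch.rand(n, dim) for _ in range(3))
att = CrossAttention(dim)  # one shared module for all pairs, a simplification
xf_p = xf + att(xf, xd) + att(xf, xt)      # XF'
xd_p = xd + att(xd, xf) + att(xd, xt)      # XD'
xt_p = xt + att(xt, xd) + att(xt, xf)      # XT'
x = torch.cat([xf_p, xd_p, xt_p], dim=-1)  # cross-modal feature map X
```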
In step S3 of the embodiment of the present invention, based on the cross-modal feature map X, dilated convolution kernels w with different dilation rates are adopted to extract multi-scale context information and encode the feature map, and the encoded features are then up-sampled and decoded to obtain the final coarse segmentation instance regions. Based on this, step S3 of the embodiment of the present invention specifically includes:
S31, performing convolution operations on the cross-modal feature map X using dilated (hole) convolution kernels w with different dilation rates to obtain convolved cross-modal feature maps X′;
S32, sequentially performing global average pooling, 1×1 convolution for dimension raising, and concatenation on the convolved cross-modal feature maps X′ of the different dilation rates to realize multi-scale information fusion and obtain a multi-scale cross-modal feature map X″;
S33, merging the deep semantic features of the multi-scale cross-modal feature map X″ with the shallow semantic features of the initial cross-modal feature map X through a skip connection to obtain encoded features;
S34, performing up-sampling decoding on the encoded features using transposed convolution to obtain the coarse segmentation instance regions corresponding to all instances.
In step S31 of the present embodiment, the convolved cross-modal feature map X′ at an arbitrary position i is expressed as:

$$X'_i = \sum_{k} X_{i + r \cdot k} \, w_k$$

where k denotes a position on the convolution kernel, r denotes the dilation rate of the dilated convolution, w is the dilated convolution kernel, and X is the cross-modal feature map. Here r can be understood as the stride with which elements of X are sampled, so the receptive field size can be adjusted by adjusting the dilation rate.
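A short sketch of multi-rate dilated convolution; the channel counts and dilation rates are illustrative assumptions:

```python
import torch
import torch.nn as nn

x = torch.rand(1, 192, 120, 160)  # cross-modal feature map X (illustrative size)
for r in (1, 6, 12, 18):          # dilation ("hole") rates, illustrative choices
    conv = nn.Conv2d(192, 64, kernel_size=3, padding=r, dilation=r)
    y = conv(x)                   # padding=r keeps the spatial size unchanged
    effective = (3 - 1) * r + 1   # extent spanned by one 3x3 kernel at rate r
    print(r, tuple(y.shape), effective)
```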
In step S32 of the present embodiment, global context information is characterized by global averaging pooling.
In step S33 of this embodiment, since a great amount of detail information is lost from the feature map during convolutional feature extraction, deep semantic features with strong abstraction capability are fused with detail-rich shallow semantic features through a skip connection.
In step S34 of the present embodiment, based on the above method, the coarse segmentation instance region M obtained by supervised learning is:
M=Deconv(X+X″)
where X is a cross-modal feature map and Deconv (·) is a transpose convolution operation.
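Putting steps S31–S34 together, a minimal sketch of the coarse segmentation head under the assumptions above (channel counts, dilation rates and the number of classes K are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseSegHead(nn.Module):
    def __init__(self, in_ch: int = 192, mid_ch: int = 48, num_classes: int = 8):
        super().__init__()
        # S31: multi-rate dilated convolutions
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, mid_ch, 3, padding=r, dilation=r) for r in (1, 6, 12, 18)
        )
        self.pool_proj = nn.Conv2d(in_ch, mid_ch, 1)  # S32: 1x1 dim-raise after pooling
        self.fuse = nn.Conv2d(5 * mid_ch, in_ch, 1)   # back to in_ch for the skip sum
        self.decode = nn.ConvTranspose2d(in_ch, num_classes, 4, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [b(x) for b in self.branches]
        # S32: global average pooling branch, lifted back to (h, w)
        g = F.interpolate(self.pool_proj(F.adaptive_avg_pool2d(x, 1)), size=(h, w))
        x2 = self.fuse(torch.cat(feats + [g], dim=1))  # multi-scale X''
        # S33 skip connection + S34 transposed-convolution decoding
        return self.decode(x + x2)                     # M = Deconv(X + X'')

m = CoarseSegHead()(torch.rand(1, 192, 120, 160))  # (1, K, 240, 320) coarse regions
```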
In step S4 of the embodiment of the present invention, instance-level feature representations can be obtained from the coarse segmentation result above, from which the feature association information is derived; specifically, as shown in fig. 3, step S4 includes:
S41, performing weighted summation of the pixel-level representations within the coarse segmentation instance region of the same class to obtain the instance-level representations;
S42, extracting the feature association information between each pixel-level representation and the instance-level representations using the similarity between the pixel-level representation and the instance-level representation of the same class.
In step S41 of the present embodiment, suppose there are K−1 classes of power devices, giving K segmentation targets in total. Each coarse object region M_k is a two-dimensional map associated with class k, in which the value at each position indicates the probability that the pixel at that position belongs to class k. The instance-level representation f_k is obtained by weighted aggregation of the pixel-level representations:

$$f_k = \sum_{i \in I} M_{ki} \, X_i$$

where X_i is the i-th pixel-level representation, M_{ki} is the normalized probability that the i-th pixel belongs to class k, i.e. the value of M_k at the i-th position, and I is the pixel set.
In step S42 of this embodiment, the feature association information w_{ik} is:

$$w_{ik} = \frac{e^{\kappa(X_i,\, f_k)}}{\sum_{j=1}^{K} e^{\kappa(X_i,\, f_j)}}$$

where κ(·,·) is the relation function; the softmax over the K regions normalizes the weights.
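A minimal sketch of steps S41–S42 under the reconstructed formulas above; the spatial-softmax normalization of M_k and the dot-product relation κ are assumptions consistent with, but not fixed by, the text:

```python
import torch
import torch.nn.functional as F

n, c, k = 120 * 160, 192, 8
x = torch.rand(n, c)          # pixel-level representations X_i
m = torch.rand(k, n)          # coarse region maps M_k, one row per class

m_norm = F.softmax(m, dim=1)  # spatial softmax: normalize each M_k over pixels i
f = m_norm @ x                # f_k = sum_i M_ki X_i            -> (k, c)

kappa = x @ f.t()             # relation kappa(X_i, f_k) as a dot product -> (n, k)
w = F.softmax(kappa, dim=1)   # w_ik, normalized over the K instance regions
```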
The step S5 of the embodiment of the invention specifically comprises the following steps:
S51, using the feature association information as weights, weighting and aggregating the instance-level representations of the K regions to obtain association features;
S52, enhancing each pixel-level representation in the cross-modal feature map with the association features to obtain enhanced pixel-level feature representations;
S53, performing a transposed convolution operation on the enhanced pixel-level feature representations to obtain the final predicted semantic segmentation result map.
In step S51 of this embodiment, the association feature Y_i obtained by using the feature association information of the i-th pixel-level representation as weights is:

$$Y_i = \rho\Big(\sum_{k=1}^{K} w_{ik}\, \delta(f_k)\Big)$$

where ρ(·) and δ(·) are both transformation functions, w_{ik} is the feature association information, and f_k is the instance-level representation. Both ρ and δ can be realized by the operation 1×1 Conv → BN → ReLU, where Conv denotes a convolution, BN denotes batch normalization, and ReLU denotes the rectified linear activation function.
In step S52 of the present embodiment, the enhanced pixel level feature representation Z is:
Z=Concat(X,Y)
where Concat(·) is the concatenation operation and X is the cross-modal feature map;
in step S53 of the present embodiment, the semantic division result map is expressed as:
R=Deconv(Z)
wherein R is a semantic label represented by each pixel in the semantic segmentation result graph, and Deconv (·) is a transpose convolution operation.
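A minimal sketch of steps S51–S53, with ρ and δ realized by the 1×1 Conv → BN → ReLU recipe given above; tensor sizes are illustrative:

```python
import torch
import torch.nn as nn

c, k, h, w_hw = 192, 8, 120, 160
n = h * w_hw
x = torch.rand(n, c)                        # pixel-level representations X_i
f = torch.rand(k, c)                        # instance-level representations f_k
w = torch.softmax(torch.rand(n, k), dim=1)  # association weights w_ik

def conv_bn_relu(ch: int) -> nn.Sequential:
    # the 1x1 Conv -> BN -> ReLU recipe given above for rho and delta
    return nn.Sequential(nn.Conv2d(ch, ch, 1), nn.BatchNorm2d(ch), nn.ReLU())

delta, rho = conv_bn_relu(c), conv_bn_relu(c)

f_t = delta(f.t().reshape(1, c, k, 1)).reshape(c, k).t()     # delta(f_k) -> (k, c)
y = rho((w @ f_t).t().reshape(1, c, h, w_hw))                # Y = rho(sum_k w_ik delta(f_k))
z = torch.cat([x.t().reshape(1, c, h, w_hw), y], dim=1)      # Z = Concat(X, Y)
r = nn.ConvTranspose2d(2 * c, k, 4, stride=2, padding=1)(z)  # R = Deconv(Z)
```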
In the embodiment of the invention, parameter training is required before performing image semantic segmentation with this method. The mean intersection-over-union mIoU is adopted as the evaluation index of the supervised learning: for each class of instances, the ratio of the intersection to the union of the predicted pixels and the ground-truth labeled pixels is computed, and the mean over all classes is then taken:

$$\mathrm{mIoU} = \frac{1}{K} \sum_{i=1}^{K} \frac{p_{ii}}{\sum_{j} p_{ij} + \sum_{j} p_{ji} - p_{ii}}$$

where p_{ii} denotes the number of correctly predicted pixels of class i in R, and p_{ij} denotes the pixels in R whose true class is i but which are predicted as class j.
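A minimal sketch of this mIoU evaluation, computed from a confusion matrix whose entry p[i, j] counts pixels of true class i predicted as class j:

```python
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int) -> torch.Tensor:
    # build the confusion matrix p via bincount over combined class indices
    idx = target.flatten() * num_classes + pred.flatten()
    p = torch.bincount(idx, minlength=num_classes ** 2).view(num_classes, num_classes)
    inter = p.diag().float()                          # p_ii
    union = (p.sum(1) + p.sum(0) - p.diag()).float()  # sum_j p_ij + sum_j p_ji - p_ii
    return (inter / union.clamp(min=1)).mean()        # average IoU over classes

score = mean_iou(torch.randint(0, 8, (240, 320)), torch.randint(0, 8, (240, 320)), 8)
```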
The principles and embodiments of the present invention have been described above with reference to specific examples, which are provided only to help understand the method and core ideas of the present invention. Meanwhile, since those skilled in the art may make changes to the specific embodiments and application scope in accordance with the ideas of the present invention, the contents of this description should not be construed as limiting the present invention.
Those of ordinary skill in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the present invention, and it should be understood that the scope of the invention is not limited to such specific statements and embodiments. Those of ordinary skill in the art can make various other specific modifications and combinations based on the teachings of the present disclosure without departing from its spirit, and such modifications and combinations remain within the scope of the present disclosure.
Claims (9)
1. A semantic segmentation method for a power inspection image, characterized by comprising the following steps:
S1, acquiring multi-modal image data of power inspection;
wherein the multi-modal image data includes an RGB image, a thermodynamic diagram and a depth map;
S2, extracting features from the RGB image, the thermodynamic diagram and the depth map respectively, and performing cross-modal feature fusion on the feature maps to obtain a cross-modal feature map;
S3, performing coarse segmentation of instance regions on the cross-modal feature map to obtain coarse segmentation instance regions;
S4, extracting pixel-level-to-instance-level feature association information from the cross-modal feature map based on the coarse segmentation instance regions;
and S5, performing pixel-level feature enhancement in the cross-modal feature map based on the feature association information to obtain the predicted semantic segmentation result map.
2. The semantic segmentation method of the power inspection image according to claim 1, wherein in the step S2, the power-image visual features of the RGB image are extracted through a MobileNet model to obtain a visual feature map XF;
the pixel heat-intensity change features in the thermodynamic diagram are extracted through a ShuffleNet model to obtain a thermal feature map XT;
and the structural features of lines and equipment are extracted from the depth map through a PointNet model to obtain a depth feature map XD.
3. The semantic segmentation method of the power inspection image according to claim 2, wherein in the step S2, a bi-directional attention mechanism is adopted to perform cross-modal feature fusion, so as to obtain a cross-modal feature map X, and the expression is as follows:
XF′=XF+Attention(XF,XD)+Attention(XF,XT)
XD′=XD+Attention(XD,XF)+Attention(XD,XT)
XT′=XT+Attention(XT,XD)+Attention(XT,XF)
X=Concat(XF′,XD′,XT′)
in the formula, XF ' is a visual feature map fused with thermal information and depth information, XD ' is a depth feature map fused with visual information and thermal information, XT ' is a thermal feature map fused with visual information and depth information, attention (·) is an Attention mechanism, and Concat (·) is a splicing operation.
4. The semantic segmentation method of the power inspection image according to claim 1, wherein the step S3 specifically comprises:
S31, performing convolution operations on the cross-modal feature map X using dilated (hole) convolution kernels w with different dilation rates to obtain convolved cross-modal feature maps X′;
S32, sequentially performing global average pooling, 1×1 convolution for dimension raising, and concatenation on the convolved cross-modal feature maps X′ of the different dilation rates to realize multi-scale information fusion and obtain a multi-scale cross-modal feature map X″;
S33, merging the deep semantic features of the multi-scale cross-modal feature map X″ with the shallow semantic features of the initial cross-modal feature map X through a skip connection to obtain encoded features;
S34, performing up-sampling decoding on the encoded features using transposed convolution to obtain the coarse segmentation instance regions corresponding to all instances.
5. The method for semantic segmentation of a power inspection image according to claim 4, wherein in the step S31, the convolved cross-modal feature map X′ at an arbitrary position i is expressed as:

$$X'_i = \sum_{k} X_{i + r \cdot k} \, w_k$$

where k denotes a position on the convolution kernel, r denotes the dilation rate of the dilated convolution, w is the dilated convolution kernel, and X is the cross-modal feature map;
in the step S34, the coarse segmentation instance region is:

M=Deconv(X+X″)
where X is a cross-modal feature map and Deconv (·) is a transpose convolution operation.
6. The semantic segmentation method of the power inspection image according to claim 1, wherein the step S4 specifically comprises:
S41, performing weighted summation of the pixel-level representations within the coarse segmentation instance region of the same class to obtain the instance-level representations;
S42, extracting the feature association information between each pixel-level representation and the instance-level representations using the similarity between the pixel-level representation and the instance-level representation of the same class.
7. The method according to claim 6, wherein in the step S41, the instance-level representation f_k is:

$$f_k = \sum_{i \in I} M_{ki} \, X_i$$

where X_i is the i-th pixel-level representation, M_{ki} is the normalized probability that the i-th pixel belongs to class k, i.e. the value of M_k at the i-th position, and I is the pixel set;

in the step S42, the feature association information w_{ik} is:

$$w_{ik} = \frac{e^{\kappa(X_i,\, f_k)}}{\sum_{j=1}^{K} e^{\kappa(X_i,\, f_j)}}$$

where κ(·,·) is the relation function; the softmax over the K regions normalizes the weights.
8. The semantic segmentation method of the power inspection image according to claim 6, wherein the step S5 specifically comprises:
S51, using the feature association information as weights, weighting and aggregating the instance-level representations of the K regions to obtain association features;
S52, enhancing each pixel-level representation in the cross-modal feature map with the association features to obtain enhanced pixel-level feature representations;
S53, performing a transposed convolution operation on the enhanced pixel-level feature representations to obtain the final predicted semantic segmentation result map.
9. The method for semantic segmentation of the power inspection image according to claim 8, wherein in the step S51, the association feature Y_i obtained by using the feature association information of the i-th pixel-level representation as weights is:

$$Y_i = \rho\Big(\sum_{k=1}^{K} w_{ik}\, \delta(f_k)\Big)$$

where ρ(·) and δ(·) are both transformation functions, w_{ik} is the feature association information, and f_k is the instance-level representation;
in the step S52, the enhanced pixel level feature representation Z is:
Z=Concat(X,Y)
where Concat(·) is the concatenation operation and X is the cross-modal feature map;
in the step S53, the semantic segmentation result map is expressed as:
R=Deconv(Z)
wherein R is a semantic label represented by each pixel in the semantic segmentation result graph, and Deconv (·) is a transpose convolution operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311183380.2A | 2023-09-13 | 2023-09-13 | Semantic segmentation method for electric power inspection image
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311183380.2A | 2023-09-13 | 2023-09-13 | Semantic segmentation method for electric power inspection image
Publications (1)
Publication Number | Publication Date |
---|---|
CN117218345A true CN117218345A (en) | 2023-12-12 |
Family
ID=89036578
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202311183380.2A (pending) | Semantic segmentation method for electric power inspection image | 2023-09-13 | 2023-09-13
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117218345A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118052977A (en) * | 2024-02-02 | 2024-05-17 | 北京中成康富科技股份有限公司 | Antenna system and method for millimeter wave therapeutic apparatus |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |