CN117132914B

CN117132914B - Method and system for identifying large model of universal power equipment

Info

Publication number: CN117132914B
Application number: CN202311403372.4A
Authority: CN
Inventors: 杨必胜; 陈驰; 付晶; 邵瑰玮; 严正斐; 邹勤; 金昂; 王治邺; 吴少龙; 孙上哲
Original assignee: Wuhan University WHU
Current assignee: Wuhan University WHU
Priority date: 2023-10-27
Filing date: 2023-10-27
Publication date: 2024-01-30
Anticipated expiration: 2043-10-27
Also published as: CN117132914A

Abstract

The invention provides a universal power equipment recognition large model method and system that take the oblique images acquired by an unmanned aerial vehicle inspection system as the research object. Based on the characteristics of these data, a single-stage target detector is trained to recognize target bounding boxes in an image as initial prompt information, and the intermediate-layer features of an image encoder are used to generate prompts containing semantic category information. Fusing the two kinds of prompt information yields a category-aware universal power equipment segmentation model. The method addresses the large data requirements and the small number of recognizable categories of models trained for power scenes, and provides basic data for subsequent defect diagnosis and three-dimensional modeling of power equipment.

Description

Method and system for identifying large model of universal power equipment

Technical Field

The invention belongs to the automatic recognition of power equipment in inspection images for unmanned aerial vehicle power inspection, and provides a universal power equipment recognition large model.

Background

As the mileage of transmission lines increases year by year, guaranteeing their safe and stable operation becomes increasingly challenging. With the rapid development of unmanned aerial vehicle, computer vision and embedded technologies, power inspection is gradually shifting from the traditional manual mode to fine-grained unmanned aerial vehicle inspection. During inspection, a sensor pod carried on the unmanned aerial vehicle collects oblique images along the power corridor, and a target detection algorithm deployed on the unmanned aerial vehicle or in the back end locates and recognizes the power equipment in order to diagnose hidden hazards. This new inspection mode, which combines an unmanned aerial vehicle with a visual recognition system, is becoming the mainstream inspection mode because of its low cost and high efficiency.

In the field of computer vision, the Segment Anything Model (SAM) achieves high-performance zero-shot results and excellent segmentation performance in general scenes. In the field of power inspection, unmanned aerial vehicle images are complex and target sizes differ greatly; conventional detection algorithms have low universality, limited parameter counts and poor generalization capability, so a universal large model for power equipment is difficult to realize, and detection performance still lacks robustness on actual power transmission lines. On this basis, adopting SAM in unmanned aerial vehicle power inspection can effectively improve detection performance. However, SAM is a class-agnostic instance segmentation method that relies heavily on prior manual prompts, including points, boxes and coarse masks; these limitations make SAM unsuitable for fully automatic interpretation of power inspection images.

Disclosure of Invention

Aiming at the problems of poor generality and few recognizable categories of existing algorithms in power scenes, the invention takes the processing of inspection data in power scenes as its research object and designs a universal power equipment recognition large model method and system with wide application scenarios and complete recognition categories.

The invention relates to a method for identifying a large model of universal power equipment, which is characterized by comprising the following steps:

step 1, acquiring power inspection image data and constructing a power transmission line data set;

step 2, training a single-stage target detector, and detecting target bounding boxes from the data set images as explicit prompts;

step 3, processing the image data in the step 1 by using an image encoder in the large model, and generating implicit prompts containing semantic category information by using a prompter of the large model; fusing the two forms of prompts from the steps 2 and 3, and passing the fused semantic category information into the large model to obtain a universal power recognition result;

the fusion mode is that the explicit prompts generated by the single-stage detector are aligned with the implicit prompt features generated by the intermediate layers of the large model, and the position information and the category information of the two kinds of prompts are fused by calculating the mapping relation between the explicit prompts and the implicit prompt feature map.
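The data flow of steps 2 and 3 can be sketched as a short orchestration function. This is only an illustration under stated assumptions: the function name `recognize` and the five callables it receives are hypothetical placeholders for the trained detector, the image encoder, the prompter, the fusion module and the decoder; none of them is an API of YOLOv8 or SAM.

```python
from typing import Any, Callable, List


def recognize(image: Any,
              detect: Callable[[Any], List[Any]],
              encode: Callable[[Any], Any],
              prompt: Callable[[Any], List[Any]],
              fuse: Callable[[List[Any], List[Any]], List[Any]],
              decode: Callable[[Any, List[Any]], Any]) -> Any:
    """Hypothetical wiring of the pipeline; every callable is a stand-in."""
    explicit = detect(image)          # step 2: bounding boxes as explicit prompts
    features = encode(image)          # step 3: features from the image encoder
    implicit = prompt(features)       # step 3: implicit prompts with class semantics
    fused = fuse(explicit, implicit)  # fuse position and category information
    return decode(features, fused)    # large model produces the recognition result
```

Passing the stages in as callables keeps the sketch independent of any particular detector or segmentation library.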

In a preferred manner, the following steps are specifically implemented in step 1:

step 1.1, screening and cleaning acquired inspection images of various scenes;

step 1.2, labeling inspection images by using labelImg, wherein the transmission line scenes comprise cities, greenhouses, farmlands, shrubs, barren lands, lakes and the like, and the categories of power equipment and external intruding objects cover transmission towers, insulators, equalizing rings, damper blocks, spacers, insulator burst sheets and hanging objects;

step 1.3, inputting the processed oblique image to a single-stage object detector.

Further, the single-stage object detector employs YOLOv8.

In a preferred manner, the specific process of step 2 is as follows:

step 2.1, the original image is subjected to scale change and filling;

step 2.2, the image processed in the step 2.1 is subjected to data enhancement and preprocessing and then is input into a backbone network of the single-stage detector;

step 2.3, carrying out multi-scale feature fusion on the features extracted by the backbone network;

step 2.4, inputting the fused features to the detection head of the single-stage target detector, and acquiring the target categories and coarse detection boxes contained in the image.

Further, the data enhancement and preprocessing in step 2.2 include: horizontal and vertical flipping, contrast adjustment, rotation, mosaic enhancement, adaptive anchor box calculation and adaptive gray filling.
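Geometric enhancement such as flipping must be applied to the labeled boxes as well as to the pixels. The following minimal sketch shows the coordinate update for horizontal and vertical flips; the function name `flip_boxes` and the `(x_min, y_min, x_max, y_max)` pixel box layout are assumptions made for illustration and are not taken from the patent or from YOLOv8.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def flip_boxes(boxes: List[Box], width: float, height: float,
               horizontal: bool = False, vertical: bool = False) -> List[Box]:
    """Mirror boxes in the same way the image is mirrored."""
    flipped = []
    for x1, y1, x2, y2 in boxes:
        if horizontal:
            # Mirroring left-right swaps the roles of the left and right edges.
            x1, x2 = width - x2, width - x1
        if vertical:
            # Mirroring top-bottom swaps the roles of the top and bottom edges.
            y1, y2 = height - y2, height - y1
        flipped.append((x1, y1, x2, y2))
    return flipped
```

Applying the same flip twice returns the original boxes, which is a convenient sanity check for an augmentation pipeline.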

In a preferred mode, a SAM large model is adopted in the step 3, and the specific process is as follows:

step 3.1, inputting an original image, and generating intermediate feature maps by passing it through a pre-trained ViT backbone network;

step 3.2, inputting the intermediate features obtained from the ViT backbone network in the previous step into a lightweight feature aggregation module to obtain fused semantic features;

step 3.3, after the fused semantic features are obtained, generating implicit prompt embeddings for the SAM mask decoder by using a prompter;

step 3.4, aligning the explicit prompts generated by the single-stage detector with the implicit prompt features generated by the SAM intermediate layers, and then carrying out prompt fusion to extract rich semantic information.
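As a small worked example of the encoder arithmetic, the sketch below assumes the values given in the detailed embodiment (input scaled to 1024, patch convolution with kernel size 16 and stride 16). The function name `patch_grid` is hypothetical and is not part of SAM.

```python
from typing import Tuple


def patch_grid(image_size: int = 1024, patch: int = 16) -> Tuple[int, int]:
    """Return (tokens per side, total tokens) for a non-overlapping patch convolution."""
    if image_size % patch != 0:
        raise ValueError("image size must be a multiple of the patch size")
    side = image_size // patch
    return side, side * side
```

With the assumed defaults this gives 64 tokens per side, which matches the spatial size of the feature map described in the embodiment.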

Based on the same inventive concept, the scheme also designs a system for realizing the large model identification method of the universal power equipment:

the power transmission line data acquisition module acquires power inspection image data and constructs a power transmission line data set;

the single-stage target detector module takes the detected target bounding boxes as explicit prompts;

the universal power recognition module takes the intermediate-layer features of the image encoder in the large model as the input of the prompter to generate implicit prompts containing semantic category information; it fuses the explicit prompts and the implicit prompts, and passes the fused semantic category information into the large model to obtain a universal power recognition result;

the fusion mode is that an explicit prompt generated by a single-stage detector is aligned with an implicit prompt feature generated by a SAM middle layer, and the position information and the category information of the two prompts are fused by calculating the mapping relation between the explicit prompt and the implicit prompt feature map.

Based on the same inventive concept, the scheme also designs electronic equipment, which comprises:

one or more processors;

a storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the universal power equipment identification large model method.

Based on the same inventive concept, the present solution further provides a computer readable medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the universal power equipment identification large model method.

Compared with the prior art, the invention has the following advantages and beneficial effects:

The recognized categories and application scenarios of power equipment are greatly expanded, so that automatic processing of multi-period, multi-type scene inspection data, such as cities, farmlands, lakes, forests, grasslands and barren lands, can be supported; the overall detection precision is significantly improved while a high recall rate is ensured; and the method is robust and generalizes well in actual power transmission line scenes.

The oblique images acquired by the unmanned aerial vehicle inspection system are taken as the research object. Based on the characteristics of these data, a single-stage target detector identifies the target bounding boxes in an image as initial prompt information, and the intermediate-layer features of the image encoder are used to generate prompts containing semantic category information. Fusing the two kinds of prompt information completes the category-aware universal power equipment segmentation model. The method addresses the small number of recognizable categories in massive inspection data from power scenes, and provides basic data for subsequent defect diagnosis and three-dimensional modeling of power equipment.

Drawings

FIG. 1 is a flow chart of an embodiment of the present invention.

FIG. 2 is a block diagram of a single-stage detector in an embodiment of the invention.

FIG. 3 is a schematic diagram of a single-stage detector in accordance with an embodiment of the present invention.

FIG. 4 is a block diagram of a decoder in an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is described below with reference to the accompanying drawings and examples.

Example 1

The invention provides a universal power equipment large model recognition method, which is explained in detail using inspection images acquired by an unmanned aerial vehicle inspection system. The flow of the method is shown in FIG. 1 and comprises the following steps:

Step 1, acquiring power inspection image data and constructing a power transmission line data set.

Step 2, training a single-stage target detector and taking the target bounding boxes detected in the image data as explicit prompts.

Step 3, taking the intermediate-layer features of the image encoder as the input of a prompter, and generating prompts containing semantic category information. The two forms of prompts are fused and the semantic category information is passed into SAM, thereby obtaining the universal power recognition result. The large model preferably adopts SAM; other image segmentation models are also possible, but SAM is optimal in this embodiment.

Further, the specific implementation of the step 1 includes the following sub-steps:

Step 1.1, screening and cleaning the collected inspection images of various scenes.

Step 1.2, labeling the inspection images using labelImg. The transmission line scenes include cities, greenhouses, farmlands, shrubs, barren lands, lakes and the like, and the categories of power equipment and external intruding objects cover transmission towers, insulators, equalizing rings, damper blocks, spacers, insulator burst sheets, hanging objects and the like.

Step 1.3, inputting the processed oblique images to the single-stage target detector YOLOv8.

Further, the specific implementation of the step 2 includes the following sub-steps:

Step 2.1, scaling the original image to 640 × 640 through scale change and filling.

Step 2.2, performing data enhancement and preprocessing on the scaled image. The enhancement means include horizontal and vertical flipping, contrast adjustment, rotation, mosaic enhancement and the like. The preprocessing includes adaptive anchor box calculation and adaptive gray filling.

Step 2.3, inputting the enhanced image into the backbone network of the single-stage detector. The image first passes through a CBS convolution module with a kernel size of 6 × 6 and a stride of 2, and then through a combination of 4 CBS modules (kernel size 3 × 3, stride 2) and C2f modules. The CBS module consists of one two-dimensional convolution, one two-dimensional batch normalization and a SiLU (sigmoid linear unit) activation function. The C2f module learns residual features; it adds more skip connections and additional split operations, obtaining richer gradient flow information while remaining lightweight.

Step 2.4, performing multi-scale feature fusion on the features extracted by the backbone network through a PAN-FPN multi-scale feature pyramid that incorporates C2f, and outputting three feature maps with scales of 80 × 80, 40 × 40 and 20 × 20.

Step 2.5, inputting the fused features to the YOLO detection head. YOLOv8 employs a decoupled head that separates the classification branch from the regression branch. The loss calculation process mainly comprises a positive and negative sample assignment strategy and the loss computation. YOLOv8 adopts the task-aligned assignment principle and selects positive samples according to a weighted combination of the classification and regression scores. The loss comprises a classification branch and a regression branch: the classification branch uses binary cross-entropy loss, and the regression branch uses Distribution Focal Loss and the CIoU loss. The output is the target categories and coarse detection boxes contained in the image.
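The scaling of step 2.1 and the feature map sizes of step 2.4 follow from simple arithmetic, sketched below. The function names `letterbox_params` and `grid_sizes` are hypothetical, and the strides 8, 16 and 32 are an assumption inferred from the stated 80 × 80, 40 × 40 and 20 × 20 outputs for a 640 × 640 input; the sketch is not the detector's own preprocessing code.

```python
from typing import List, Sequence, Tuple


def letterbox_params(width: int, height: int,
                     target: int = 640) -> Tuple[float, int, int, int, int]:
    """Return (scale, new_w, new_h, pad_x, pad_y) for an aspect-preserving resize.

    pad_x and pad_y are the total gray padding needed along each axis to
    reach a target x target canvas.
    """
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    return scale, new_w, new_h, target - new_w, target - new_h


def grid_sizes(target: int = 640, strides: Sequence[int] = (8, 16, 32)) -> List[int]:
    """Side length of each output feature map for the assumed strides."""
    return [target // s for s in strides]
```

For a 4000 × 3000 image the resized content is 640 × 480, so 160 rows of gray filling complete the square canvas.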

Further, step 3 adopts the SAM large model, which comprises an encoder, a prompter, a fusion module and a decoder. The specific implementation comprises the following sub-steps:

Step 3.1, inputting the original unmanned aerial vehicle inspection image data into the SAM encoder and generating intermediate feature maps through the pre-trained ViT encoder backbone network. The ViT backbone, pre-trained as a masked autoencoder, encodes the original image into intermediate features. The original image is scaled to 1024 × 1024, and a convolution with a kernel size of 16 and a stride of 16 discretizes it into vectors of size 64 × 64 × 768. The vectors are flattened in sequence along the width and channel dimensions of the feature map and then enter the multi-layer ViT backbone network. The output of the ViT backbone passes through two convolution layers with kernel sizes of 1 and 3, respectively, giving a feature dimension of 256.

Step 3.2, inputting the intermediate features obtained from the ViT backbone network in the previous step into the lightweight feature aggregation module of the large model prompter to generate fused semantic features, from which the implicit prompts for the mask decoder are generated. The module learns to represent semantic features from the various intermediate feature layers of the ViT without increasing the computational complexity of the prompter.

The aggregation operates on two sets of features: the features of the SAM backbone and the downsampled features produced from them by a downsampling convolution. The downsampling first uses a 1 × 1 convolution layer to reduce the number of channels from c to 1/16 of the original dimension, and then uses a 3 × 3 convolution layer with a stride of 2 to reduce the spatial dimensions. A 3 × 3 convolution layer is applied during aggregation, and a final fusion convolution, consisting of two 3 × 3 convolution layers and one 1 × 1 convolution layer, restores the original channel size of the SAM mask decoder.

The semantic features obtained above are then input to the prompter, which generates prompts for the SAM mask decoder. An anchor-based region proposal network (RPN) first generates candidate target boxes; this RPN is lightweight. The visual feature representation of each individual object is obtained from the position-encoded feature map by RoI pooling. Three perception heads are derived from the visual features: a semantic head, a positioning head and a prompt head. The semantic head determines the specific target category; the positioning head establishes the matching criterion between the generated prompts and the target instance masks, namely position-based greedy matching; and the prompt head generates the prompt embeddings required by the SAM mask decoder. Because the RoI pooling operation may cause the subsequent prompt generation to lose position information relative to the entire image, the position encoding (PE) is incorporated into the original fused features (Fagg).

The loss of the model includes the binary classification loss and localization loss of the RPN, the classification loss of the semantic head, the regression loss of the positioning head, and the segmentation loss of the frozen SAM mask decoder. The total loss combines these terms.

Step 3.3, aligning the explicit prompts generated by the single-stage detector with the implicit prompt features generated by the SAM intermediate layers, and fusing the position information and category information of the two kinds of prompts by calculating the mapping relation between the explicit prompts and the implicit prompt feature map, thereby providing more accurate localization and classification.

Step 3.4, the mask decoder integrates the two embeddings output by the image encoder and the prompt encoder, respectively, and decodes the final segmentation mask. A Transformer learns the image embedding aligned with the prompts and the embeddings of 4 additional tokens. The 4 token embeddings are one IoU token embedding and 3 segmentation result token embeddings; they are obtained through Transformer learning, and the target result is obtained through the final task head.
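The text of step 3.3 does not spell out how the mapping relation is computed. One possible reading is sketched below under explicit assumptions: detector boxes and prompter boxes are matched by intersection over union (IoU), the detector box supplies the position, and the class label comes from whichever source is more confident when the two disagree. The classes `ExplicitPrompt`, `ImplicitPrompt` and `FusedPrompt`, the functions `to_feature_grid` and `fuse_prompts`, the IoU threshold of 0.5 and the stride of 16 are all hypothetical choices for illustration and should not be read as the patented fusion rule.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


@dataclass
class ExplicitPrompt:          # hypothetical: one detection from the single-stage detector
    box: Box
    label: str
    score: float


@dataclass
class ImplicitPrompt:          # hypothetical: one candidate from the prompter
    box: Box                   # box associated with the positioning head
    label: str                 # category from the semantic head
    score: float
    embedding: List[float]     # prompt embedding for the mask decoder


@dataclass
class FusedPrompt:
    box: Box
    label: str
    score: float
    embedding: Optional[List[float]]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def to_feature_grid(box: Box, stride: int = 16) -> Tuple[float, ...]:
    """Map pixel coordinates to feature map cells (assumed stride 16: 1024 -> 64)."""
    return tuple(v / stride for v in box)


def fuse_prompts(explicit: List[ExplicitPrompt], implicit: List[ImplicitPrompt],
                 iou_threshold: float = 0.5) -> List[FusedPrompt]:
    fused = []
    for e in explicit:
        best = max(implicit, key=lambda p: iou(e.box, p.box), default=None)
        if best is not None and iou(e.box, best.box) >= iou_threshold:
            # Matched: keep the detector position, take the more confident label.
            source = e if (e.label == best.label or e.score >= best.score) else best
            fused.append(FusedPrompt(e.box, source.label,
                                     max(e.score, best.score), best.embedding))
        else:
            # Unmatched: fall back to the explicit prompt alone.
            fused.append(FusedPrompt(e.box, e.label, e.score, None))
    return fused
```

Keeping unmatched detections instead of discarding them is a design choice of this sketch; it favors recall, which the description names as a goal.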

The technical scheme and the beneficial effects of the invention are further described below in connection with specific example applications.

After multiple oblique image data sets acquired by unmanned aerial vehicles were processed by the method, more than 12 categories of power scene components and more than 20 categories of defects were recognized, power equipment segmentation was evaluated by mIoU, and the running speed reached up to 60 FPS. The invention can thus maintain processing efficiency while producing universal power equipment segmentation results with more categories and higher precision.
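The two metrics named above can be computed as in the following minimal sketch. The function names `mean_iou` and `frames_per_second` and the per-class pixel-count inputs are assumptions made for illustration; the numbers in the usage note are made up and are not results of the invention.

```python
from typing import Sequence


def mean_iou(intersections: Sequence[float], unions: Sequence[float]) -> float:
    """Mean of per-class IoU values; classes with an empty union are skipped."""
    ious = [i / u for i, u in zip(intersections, unions) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


def frames_per_second(num_images: int, seconds: float) -> float:
    """Throughput of the pipeline over a timed batch of images."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return num_images / seconds
```

For example, `mean_iou([50, 30], [100, 60])` averages two per-class IoU values of 0.5, and `frames_per_second(90, 3.0)` corresponds to 30 images per second.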

Example 2

Based on the same inventive concept, the scheme also designs a universal power equipment identification large model system, which comprises:

the data acquisition module, which acquires power inspection image data and constructs a power transmission line data set;

the single-stage target detector module, which takes the detected target bounding boxes as explicit prompts;

the universal power recognition module, which processes the image data from the data acquisition module by using the image encoder in the large model and generates implicit prompts containing semantic category information through the prompter; it fuses the explicit prompts and the implicit prompts, and passes the fused semantic category information into the large model to obtain a universal power recognition result.

Because the system described in the second embodiment of the invention implements the universal power equipment identification large model method of the first embodiment, a person skilled in the art can understand the specific structure and variations of the system based on the method described in the first embodiment, and the details are therefore not repeated here.

Example 3

Based on the same inventive concept, the invention also provides an electronic device comprising one or more processors; a storage means for storing one or more programs; the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in embodiment one.

Because the device described in the third embodiment of the invention is an electronic device used to implement the universal power equipment identification large model method of the first embodiment, a person skilled in the art can understand the specific structure and variations of the electronic device based on the method described in the first embodiment, and the description is therefore omitted here. All electronic devices used by the method of the embodiments of the invention fall within the intended scope of protection.

Example 4

Based on the same inventive concept, the present invention also provides a computer readable medium having stored thereon a computer program which, when executed by a processor, implements the method described in embodiment one.

Because the device described in the fourth embodiment of the invention is a computer readable medium used to implement the universal power equipment identification large model method of the first embodiment, a person skilled in the art can understand the specific structure and variations of the computer readable medium based on the method described in the first embodiment, and the detailed description is therefore omitted here. All computer readable media used by the method of the embodiments of the invention fall within the intended scope of protection.

The foregoing is a further detailed description of the invention in connection with specific embodiments, and it is not intended that the invention be limited to such description. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims

1. A method for identifying a large model of a universal power device, comprising the steps of:

step 1, acquiring power inspection image data and constructing a power transmission line data set;

step 2, training a single-stage target detector, and taking the detected image data target bounding box as an explicit prompt;

step 3, adopting a SAM large model, wherein the large model comprises a VIT encoder, a prompter, a fusion module and a decoder, processing the image data in the step 1 by using the VIT encoder in the large model, and generating an implicit prompt containing semantic category information by using the prompter of the large model; the fusion module fuses the prompts in the two forms in the steps 2 and 3, and transmits the fused semantic category information into the large model to obtain a universal power recognition result;

the fusion mode is that an explicit prompt generated by a single-stage target detector is aligned with an implicit prompt feature generated by a large model middle layer, and the position information and the category information of the two prompts are fused by calculating the mapping relation between the explicit prompt and the implicit prompt feature map.

2. The universal power device identification large model method of claim 1, wherein: the step 1 is specifically implemented as follows:

step 1.1, screening and cleaning acquired inspection images of various scenes;

step 1.2, labeling inspection images by using labelImg, wherein a transmission line scene comprises cities, greenhouses, farmlands, bushes, barren lands and lakes, and power equipment and external invaded object types cover transmission towers, insulators, equalizing rings, damper blocks, spacers, insulator burst sheets and hanging objects;

3. The universal power device identification large model method of claim 2, wherein: the single-stage object detector employs YOLOv8.

4. The universal power device identification large model method of claim 1, wherein: the specific process of the step 2 is as follows:

step 2.1, performing scale change and filling on the original image;

step 2.2, the image processed in the step 2.1 is subjected to data enhancement and pretreatment and then is input into a backbone network of the single-stage target detector;

5. The universal power device identification large model method of claim 4, wherein:

the data enhancement and preprocessing in step 2.2 comprises: horizontal and vertical overturn, contrast adjustment, rotation, mosaic enhancement, adaptive anchor frame calculation and adaptive gray filling.

6. The universal power device identification large model method of claim 1, wherein:

the specific implementation process of the SAM big model in the step 3 is as follows:

step 3.1, inputting original image data, and generating an intermediate feature map through a pre-training VIT encoder;

step 3.2, the lightweight feature aggregation module of the prompter generates fusion semantic features from the acquired intermediate feature map, and then the prompter is utilized to generate implicit prompt for the SAM mask decoder;

step 3.3, the fusion module aligns the explicit prompt generated by the single-stage target detector with the implicit prompt feature generated by the SAM middle layer, and then carries out prompt fusion to extract semantic information;

and 3.4, integrating two embeddings output by the VIT encoder and the prompter by the decoder, and decoding a final segmentation mask.

7. The universal power device identification large model method of claim 6, wherein: the original image input in step 3.1 is scaled to 1024 scales.

8. A system for implementing the universal power device identification large model method of claim 1, characterized in that:

the universal power recognition module processes the image data in the data acquisition module by utilizing an image encoder in the large model, and generates an implicit prompt containing semantic category information by a large model prompter; fusing the explicit prompt and the implicit prompt, and transmitting the fused semantic category information into a large model to obtain a general electric power recognition result;

the fusion mode is that an explicit prompt generated by a single-stage target detector is aligned with an implicit prompt feature generated by a SAM middle layer, and the position information and the category information of the two prompts are fused by calculating the mapping relation between the explicit prompt and the implicit prompt feature map.

9. An electronic device, comprising:

one or more processors;

a storage means for storing one or more programs;

the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.

10. A computer readable medium having a computer program stored thereon, characterized by: the program, when executed by a processor, implements the method of any of claims 1-7.