CN115761453A

CN115761453A - Power inspection scene-oriented light-weight single-sample target detection method based on feature matching

Info

Publication number: CN115761453A
Application number: CN202211285599.9A
Authority: CN
Inventors: 宋明黎; 申成吉; 黄启涵; 张皓飞; 宋杰
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2022-10-20
Filing date: 2022-10-20
Publication date: 2023-03-07
Anticipated expiration: 2042-10-20
Also published as: CN115761453B

Abstract

A lightweight single-sample target detection method based on feature matching for power inspection scenes comprises the following steps: 1) training a mask-based base-class target detection model; 2) augmenting the labeled sample of each new category; 3) extracting features from the new-category labeled sample; 4) performing preliminary inference on the test data with the model; 5) correcting the inference results using traditional features. Through these steps, the method achieves target detection when labeled samples are extremely scarce (only 1 per class), removes the reliance of traditional single-sample target detection on prior knowledge of the class present in the test image, and uses neural architecture search to reduce the parameter count of the model while preserving its performance as far as possible.

Description

Power inspection scene-oriented light-weight single-sample target detection method based on feature matching

Technical Field

The invention belongs to the field of single-sample target detection in power inspection scenes. In such scenes, some substations are remote and located in harsh geographic conditions, so training images of sufficient quality and quantity are difficult to acquire; in addition, traditional target detection models are too large and therefore costly to deploy. To address these problems, the invention provides a lightweight single-sample target detection method based on feature matching for power inspection scenes.

Background

Power inspection is the routine inspection of power equipment such as transformer substations; it is a high-intensity task with a large workload. Conventional power inspection relies mainly on regular manual inspection by inspection personnel. Constrained by factors such as weather, environment, and the condition of the personnel, neither inspection quality nor the on-site arrival rate can be guaranteed, and labor costs are high. Computer vision technology centered on target detection has therefore been introduced to detect potential safety hazards and make power inspection automatic and efficient.

Target detection is a fundamental but challenging task in computer vision that aims to locate objects of certain categories in an image. It is widely used in many computer vision tasks, such as object tracking, autonomous driving, and scene graph generation.

The general process of target detection is to predict classes for a set of bounding boxes (hypothetical reference rectangles in an image). Most conventional methods generate bounding boxes by brute force, sliding a window across the image, which results in slow inference. With the development of deep learning, target detection models based on deep neural networks have greatly improved detection performance and have been adopted by mainstream detection methods. Existing deep detection models can generally be divided into two types: two-stage models and single-stage models. A two-stage model first generates a large number of candidate bounding boxes from an image and then refines the position of, and classifies, each candidate. A single-stage model merges the two stages to reduce the computational and storage costs incurred by generating candidate bounding boxes. In addition, the Transformer-based detection model DETR and its variants have attracted attention in recent years: besides enabling end-to-end detection, DETR-based detection models achieve the best detection results on the MS COCO dataset.

However, these detection models rely on large amounts of training data; when training data is scarce, they are prone to overfitting, which results in poor detection. In power inspection scenes, some substations are located in places with harsh natural conditions and some safety problems occur only rarely, so it is difficult to acquire enough labeled data for such targets. The ability to learn to detect a target class from a small amount of class knowledge is therefore a capability that future power inspection models need.

Single-sample target detection (One-Shot Object Detection) is the extreme case of label scarcity: it refers to the detection problem in which only a single labeled sample is available for a target class. More specifically, single-sample target detection divides all target classes into base classes, each of which has a large number of labeled samples, and new classes, each of which has only one labeled sample. A detection model pre-trained on the base classes must be transferred to the new classes, and new-class targets must be detected in test images using the single labeled sample of each new class. Traditional single-sample target detection methods generally assume that the target class in the test image is known; that is, they detect only the position of the bounding box for that class and omit the classification of the bounding box. In power inspection scenes, however, the class of the targets in a test image is not known a priori.

In addition, traditional deep target detection models suffer from slow inference: deep detection models, two-stage models in particular, generally have a large number of parameters and require long inference times. In power inspection this problem is more pronounced, because some scenes with harsh natural conditions do not permit carrying high-performance computing equipment.

Disclosure of Invention

The invention provides a light-weight single-sample target detection method based on feature matching and oriented to a power inspection scene, aiming at overcoming the defects in the prior art.

The invention realizes lightweight multi-class target detection in power inspection scenes when labeled samples of the target classes are scarce (only 1 per class). It thereby addresses three problems: the excessive dependence of power inspection target detection models on the amount of labeled data, the dependence of traditional single-sample target detection methods on prior knowledge of the target class in the test image, and the excessive parameter count of deep detection models.

The invention relates to a light single-sample target detection method based on feature matching and oriented to a power inspection scene, which comprises the following steps of:

1) Training a mask-based base class target detection model;

Before detecting new categories that have only a single labeled sample, the method first pre-trains a Faster R-CNN target detection model on base-category data with a large number of labeled samples. Unlike previous single-sample detection methods that rely on prior knowledge of the target class, the pre-trained model uses a randomly initialized feature extractor that is not pre-trained on large datasets such as ImageNet. To improve the model's ability to recognize shallow visual features such as target edges and corners, the method first detects the edges of each training image with the Canny edge detection method and uses the edge map as a fourth channel of the training image. In addition, to reduce the parameter count of the model while preserving detection performance, the invention prunes the convolution layers of the feature extractor with a set of trainable binary masks: when the mask value of a channel of a convolution layer is 1, the output of that channel is retained, and when the mask value is 0, the output of that channel is set to 0. To train these binary masks, the invention uses a loss function that penalizes mask values that are too high (that is, a parameter count that is too large). Let M_i denote the mask of the i-th layer of the network and S_i the parameter count of the i-th layer; the mask loss function of the method is then:

$$L_{\text{model\_size}} = \sum_i \mathrm{ratio}(M_i) \cdot S_i \qquad (1)$$

where ratio(M_i) denotes the proportion of entries of M_i whose mask value is 1.
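As a rough illustration of equation (1), the following minimal sketch computes the model-size penalty for a list of per-layer binary masks. The function names and the example values are hypothetical, and the sketch does not address how gradients are propagated through the binary masks during training, which the text does not specify.

```python
def mask_ratio(mask):
    """Fraction of entries in a binary channel mask that equal 1."""
    return sum(mask) / len(mask)


def model_size_loss(masks, layer_param_counts):
    """Equation (1): sum over layers i of ratio(M_i) * S_i."""
    return sum(mask_ratio(m) * s for m, s in zip(masks, layer_param_counts))


# Two hypothetical layers: 4 channels (3 kept) and 2 channels (1 kept).
masks = [[1, 1, 0, 1], [1, 0]]
layer_param_counts = [1000, 400]  # S_i, illustrative values only
loss = model_size_loss(masks, layer_param_counts)
print(loss)  # 0.75 * 1000 + 0.5 * 400 = 950.0
```

The penalty shrinks as more channels are masked out, and layers with more parameters contribute more to it.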

2) Enhancing the data of the new category label sample;

The invention augments the single labeled sample of each new category, expanding it into multiple different samples to strengthen the representation of the target category's knowledge. Three methods are used: traditional data augmentation, domain alignment, and bounding box correction. Traditional data augmentation includes saturation transformation, brightness transformation, image rotation, and the like; bounding box correction segments the labeled sample to remove the influence of the background within it; and domain alignment pastes the labeled sample onto a background image and adaptively adjusts the size of the pasted sample according to the proportions of the labeled sample.

3) Extracting the characteristics of the new category labeled sample;

Because the labeled samples of a new category are pasted onto background images at different sizes, extracting features directly from the whole image would let background features interfere with the features of the new category. The method therefore first uses the Faster R-CNN pre-trained on the base classes to extract the region where the new-class target is located, and then takes the features output by the RoI Pooling layer as the features of the category.

4) The model carries out preliminary reasoning on the test data;

The invention uses Faster R-CNN to perform preliminary inference on the test data. First, an input test image is processed by the normal Faster R-CNN pipeline to obtain candidate bounding boxes and the corresponding features output by the RoI layer. Next, the cosine similarity between the features of each candidate bounding box and the features of each category extracted in step 3) is computed, and the probability that each candidate bounding box belongs to each category is obtained through a Softmax layer. Let the total number of classes be K, let c_i denote the mean feature of class i, and let b_j denote the RoI feature extracted from bounding box j; the probability that bounding box j belongs to class k is then the Softmax, taken over the K classes, of the cosine similarity between b_j and c_k.

5) Correcting the reasoning result by using the traditional characteristics;

In addition to target detection with deep features as described above, the invention uses traditional SIFT features to extract key points and further match images. Specifically, the single labeled image of a new category is matched against each test image judged to belong to that category in the preceding step, and images that fail to match are eliminated. The method first uses the SIFT feature extraction algorithm to extract the key points and feature descriptors of the labeled image and of the test image, and then uses a K-nearest-neighbor matching algorithm to compute the matched key points between the two images. If the number of matched key-point pairs is 0, the test image is judged not to belong to the category; if the number is greater than 0, the minimum matching distance is computed, and if it is smaller than a given threshold, the test image is judged to belong to the category.
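The decision rule at the end of step 5) can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list of descriptor distances for the matched key-point pairs (for example, as produced by some K-nearest-neighbor matcher), the function name and threshold value are hypothetical, and a test image whose minimum distance is not below the threshold is assumed to be rejected, which the text implies but does not state.

```python
def belongs_to_class(match_distances, distance_threshold):
    """Decide whether a test image is kept for a class after SIFT matching.

    match_distances: descriptor distances of the matched key-point pairs
        (hypothetical input; one value per matched pair).
    distance_threshold: hypothetical threshold on the minimum distance.
    """
    if len(match_distances) == 0:
        # No matched key-point pairs: the image does not belong to the class.
        return False
    # At least one pair: accept only if the best match is close enough.
    # Rejecting otherwise is an assumption, not stated in the text.
    return min(match_distances) < distance_threshold


print(belongs_to_class([], 100.0))             # False
print(belongs_to_class([250.0, 80.0], 100.0))  # True
print(belongs_to_class([250.0], 100.0))        # False
```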

The method is a lightweight single-sample target detection method based on feature matching and neural architecture search for power inspection scenes, and it addresses lightweight multi-class target detection when labeled samples are scarce (only 1 per class). The method first pre-trains a Faster R-CNN detection model with added binary masks on base-class data that has a large number of labels; it then applies several kinds of data augmentation (traditional data augmentation, domain alignment, and bounding box correction) to the single labeled sample of each new class; next, it uses the trained Faster R-CNN model to extract the features of the new class from the augmented labeled samples; it then performs preliminary inference on the test image using cosine similarity; and finally it corrects the preliminary inference result using traditional SIFT features, further improving performance.

The invention modifies the widely used Faster R-CNN detection framework. Faster R-CNN is a representative two-stage deep detection model that consists mainly of 4 components: a feature extractor, a candidate region generation network, a candidate region classifier, and a candidate region position regressor. The feature extractor, usually a deep convolutional neural network, extracts features from the input image; the candidate region generation network generates, from the extracted features, candidate regions that may contain a target object; and the candidate region classifier and candidate region position regressor predict the target class and the final position of each candidate region, respectively. The invention removes the candidate region classifier and the candidate region position regressor of Faster R-CNN and instead extracts a feature prototype for each new category from that category's single labeled sample. Before extracting the feature prototype, the invention augments the labeled sample using several techniques, including traditional data augmentation, domain alignment, and bounding box correction. Besides using deep features as feature prototypes, the invention corrects the predictions of the deep model using traditional visual feature methods, including SIFT feature point matching and Canny edge contour detection.

The invention also introduces neural architecture search, and improves the effect of the model on the premise of limiting the parameter quantity of the model.

The beneficial effects of the invention are as follows: it achieves target detection in power inspection scenes when labeled samples are extremely scarce (only 1 per class), removes the reliance of traditional single-sample target detection on prior knowledge of the test image's class, and uses neural architecture search to reduce the parameter count of the model while preserving its performance as far as possible.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 shows 4 examples of edge features extracted from images with the Canny edge detection method according to the invention.

FIG. 3 is a framework diagram of the method of the present invention for extracting target features and preliminary reasoning.

FIGS. 4a and 4b show one example of correcting the preliminary inference result with SIFT features, where FIG. 4a is a case in which SIFT feature point matching succeeds and FIG. 4b is a case in which it fails.

FIG. 5 is an example of the inference results on a dataset image of the present invention.

Detailed Description

The technical scheme of the invention is further explained by combining the attached drawings.

The invention relates to a light-weight single-sample target detection method based on feature matching and neural architecture search and oriented to a power patrol scene, which comprises the following steps of:

1) Training a base class target detection model based on a mask;

Before detecting new categories that have only a single labeled sample, the method first pre-trains a Faster R-CNN target detection model on base-category data with a large number of labeled samples. To reduce the parameter count of the feature extractor, the invention assigns a trainable binary mask value to each channel of each convolution layer of ResNet18: when the mask value is 1 the channel is retained, and when it is 0 the channel is removed. Let M_i denote the mask of the i-th layer of the network and S_i the parameter count of the i-th layer; the mask loss function of the method is then:

$$L_{\text{model\_size}} = \sum_i \mathrm{ratio}(M_i) \cdot S_i \qquad (1)$$
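The per-channel masking described above can be sketched in plain Python as follows. The representation (nested lists for feature maps) and the function name are illustrative assumptions; an actual implementation would operate on tensors inside the ResNet18 convolution layers.

```python
def apply_channel_mask(feature_maps, mask):
    """Zero out whole channels whose mask value is 0.

    feature_maps: list of C channels, each a 2-D list (H x W).
    mask: list of C binary values (1 keeps the channel, 0 removes it).
    """
    return [
        [[value * m for value in row] for row in channel]
        for channel, m in zip(feature_maps, mask)
    ]


# Two hypothetical 2x2 channels; the second one is pruned.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
masked = apply_channel_mask(features, [1, 0])
print(masked[0])  # [[1.0, 2.0], [3.0, 4.0]]
print(masked[1])  # [[0.0, 0.0], [0.0, 0.0]]
```

Once training is finished, a channel whose mask is 0 always outputs zeros, so it can be dropped from the layer entirely, which is where the reduction in parameters comes from.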

The input to the model is base-class data that has undergone various image augmentations, including translation, rotation, cropping, and changes in contrast and brightness. In addition, to improve the model's ability to recognize shallow visual features such as target edges and corners, the invention uses the Canny edge detection method to extract an edge map from each training image and concatenates the edge map with the corresponding training image as a fourth channel of the input image, as shown in FIG. 2.
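The fourth-channel construction can be illustrated with the sketch below. It assumes the edge map has already been computed (for example by a Canny edge detector) and has the same height and width as the image; the nested-list image representation and the function name are hypothetical choices for a dependency-free example.

```python
def add_edge_channel(rgb_image, edge_map):
    """Append an edge map to an H x W x 3 image, giving H x W x 4.

    rgb_image: list of rows, each row a list of (r, g, b) pixels.
    edge_map: list of rows of edge values (e.g. 0 or 255), same H and W.
    """
    if len(rgb_image) != len(edge_map):
        raise ValueError("image and edge map heights differ")
    out = []
    for rgb_row, edge_row in zip(rgb_image, edge_map):
        if len(rgb_row) != len(edge_row):
            raise ValueError("image and edge map widths differ")
        out.append([list(pixel) + [edge] for pixel, edge in zip(rgb_row, edge_row)])
    return out


# A hypothetical 1x2 image and its precomputed edge map.
rgb = [[(10, 20, 30), (40, 50, 60)]]
edges = [[0, 255]]
four_channel = add_edge_channel(rgb, edges)
print(four_channel)  # [[[10, 20, 30, 0], [40, 50, 60, 255]]]
```

With a four-channel input, the first convolution layer of the feature extractor has to accept 4 input channels rather than 3.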

Training the model involves 3 loss functions: the bounding box classification loss, the bounding box regression loss, and the mask loss (which limits the model's parameter count). The model is trained with an SGD optimizer with a learning rate of 0.02, momentum of 0.9, and weight decay of 0.0001, for 30 epochs.
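A minimal sketch of how the three losses and the stated optimizer settings might fit together is given below. The weighting factor `size_weight` is hypothetical, since the text does not say how the three terms are weighted, and the update rule shown is one common formulation of SGD with momentum and weight decay, not necessarily the exact one used.

```python
def total_loss(cls_loss, reg_loss, size_loss, size_weight=1.0):
    """Sum of classification, regression, and mask (model-size) losses.

    size_weight is a hypothetical balancing factor not given in the text.
    """
    return cls_loss + reg_loss + size_weight * size_loss


def sgd_step(weights, grads, velocity, lr=0.02, momentum=0.9, weight_decay=0.0001):
    """One SGD update with momentum and L2 weight decay.

    Weight decay is added to the gradient, the velocity accumulates the
    decayed gradient, and the weights move against the velocity.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w
        v = momentum * v + g
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


w, v = sgd_step([1.0], [0.5], [0.0])
print(w, v)
```

In practice this loop would run over all model parameters for 30 epochs of the base-class data.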

2) Enhancing the data of the new category label sample;

The invention augments the single labeled sample of each new category, expanding it into multiple different samples to strengthen the representation of the target category's knowledge. Three methods are used: traditional data augmentation, domain alignment, and bounding box correction. Traditional data augmentation includes saturation transformation, brightness transformation, image rotation, and the like; bounding box correction segments the labeled sample to remove the influence of the background within it; and domain alignment pastes the labeled sample onto a background image and adaptively adjusts the size of the pasted sample according to the proportions of the labeled sample. For the labeled sample of each new class, the invention generates 16 augmented images.
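The paste operation used in domain alignment can be sketched as below. The text does not give the rule for choosing the pasted size, so the target size is simply passed in as a parameter here; the nearest-neighbor resize, the single-channel nested-list images, and all names are illustrative assumptions.

```python
def resize_nearest(patch, new_h, new_w):
    """Nearest-neighbor resize of a 2-D list to new_h x new_w."""
    old_h, old_w = len(patch), len(patch[0])
    return [
        [patch[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def paste(background, patch, top, left):
    """Paste patch onto a copy of background; return image and new box.

    The returned box is (x1, y1, x2, y2) of the pasted sample, which
    serves as the bounding box label of the augmented image.
    """
    out = [row[:] for row in background]
    for r, patch_row in enumerate(patch):
        for c, value in enumerate(patch_row):
            out[top + r][left + c] = value
    box = (left, top, left + len(patch[0]), top + len(patch))
    return out, box


background = [[0] * 4 for _ in range(4)]
sample = [[1, 2], [3, 4]]
image, box = paste(background, sample, top=1, left=2)
print(image)
print(box)  # (2, 1, 4, 3)
```

Repeating this with different sizes, positions, and color transformations is one way the single labeled sample could be expanded into the 16 augmented images.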

3) Extracting the characteristics of the new category label sample;

After the data augmentation of the previous step, multiple different augmented samples are obtained from the labeled sample of each new class, and the method uses the Faster R-CNN model pre-trained on the base classes to extract features from them. For an input image, Faster R-CNN first extracts several candidate regions and their corresponding RoI features; the RoI features of the candidate regions whose IoU with the labeled bounding box is greater than or equal to 0.75 are then averaged to obtain the feature of the class in that augmented sample. To extract more accurate features, the labeled bounding box is added to the candidate regions in advance, so that its RoI feature is extracted as well. Finally, the mean of the features obtained on all augmented images of a new class is used as the prototype feature of that class.
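The prototype computation described above can be sketched as follows, assuming the candidate boxes and their RoI features have already been produced by the detector. Boxes are taken to be (x1, y1, x2, y2) tuples and features plain lists of floats; these representations and the function names are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_vector(vectors):
    """Element-wise mean of equally sized feature vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def sample_feature(proposals, roi_features, gt_box, gt_feature, iou_threshold=0.75):
    """Class feature for one augmented sample.

    The labeled box's own RoI feature is always included; proposals are
    included when their IoU with the labeled box is >= iou_threshold.
    """
    selected = [gt_feature]
    for box, feature in zip(proposals, roi_features):
        if iou(box, gt_box) >= iou_threshold:
            selected.append(feature)
    return mean_vector(selected)


def class_prototype(sample_features):
    """Prototype of a class: mean of its per-sample features."""
    return mean_vector(sample_features)


gt_box = (0, 0, 10, 10)
proposals = [(0, 0, 10, 9), (5, 5, 15, 15)]  # IoU 0.9 and about 0.14
roi_features = [[2.0, 4.0], [100.0, 100.0]]
feature = sample_feature(proposals, roi_features, gt_box, [4.0, 2.0])
print(feature)  # [3.0, 3.0]
```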

4) The model carries out preliminary reasoning on the test data;

As shown in FIG. 3, the invention uses Faster R-CNN to perform preliminary inference on the test data. First, an input test image is processed by the normal Faster R-CNN pipeline to obtain candidate bounding boxes and their corresponding RoI features. Then, the cosine similarity between the feature of each candidate bounding box and the feature of each category extracted in step 3) is computed, and the probability that each candidate bounding box belongs to each category is obtained through a Softmax layer. Let the total number of classes be K, let c_i denote the mean feature of class i, and let b_j denote the RoI feature extracted from bounding box j; the probability that bounding box j belongs to class k is then the Softmax, taken over the K classes, of the cosine similarity between b_j and c_k.
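The preliminary inference step can be sketched as follows: cosine similarities between a box feature b_j and the K class prototypes c_1, ..., c_K are passed through a Softmax. The sketch assumes a plain Softmax with no temperature or scaling factor, since none is mentioned in the text, and assumes all feature vectors are non-zero; the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def class_probabilities(box_feature, prototypes):
    """Softmax over the cosine similarities to each class prototype."""
    sims = [cosine_similarity(box_feature, c) for c in prototypes]
    shift = max(sims)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Two hypothetical class prototypes and one box feature.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
probs = class_probabilities([2.0, 0.0], prototypes)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

Because cosine similarities lie in [-1, 1], an unscaled Softmax over them is fairly flat; the class with the highest probability is still the one whose prototype is most similar to the box feature.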

5) Correcting the reasoning result by using the traditional characteristics;

As shown in FIGS. 4a and 4b, in addition to target detection with deep features as described above, the invention uses traditional SIFT features to extract key points for further image matching. Specifically, the single labeled image of a new category is matched against each test image judged to belong to that category in the preceding step, and images that fail to match are eliminated. The method first uses the SIFT feature extraction algorithm to extract the key points and feature descriptors of the labeled image and of the test image, and then uses a K-nearest-neighbor matching algorithm to compute the matched key points between the two images. If the number of matched key-point pairs is 0, the test image is judged not to belong to the category; if the number is greater than 0, the minimum matching distance is computed, and if it is smaller than a given threshold, the test image is judged to belong to the category.

Through the above steps, a lightweight Faster R-CNN model is pre-trained on base-class data using a model architecture search method based on binary mask values; the single labeled sample of each new class is augmented into multiple samples, from which the pre-trained Faster R-CNN extracts features; preliminary inference is then performed by feature matching; and finally the prediction results are corrected using traditional SIFT features. This realizes lightweight multi-class target detection in power inspection scenes when labeled samples of the target classes are scarce (only 1 per class).

Experiments were conducted on a safety detection dataset for power inspection scenes. The dataset contains 20 categories: 10 are base categories with a large number of training samples, and the other 10 are new categories with only a single training sample each. The method achieves an accuracy of 64.52 on the new-category test set, a level preliminarily suitable for application in real scenes. The accuracy and the inference results are shown in Table 1 and FIG. 5; Table 1 shows the performance of the invention on the safety detection dataset for power inspection scenes.

TABLE 1

The embodiments described in this specification are merely illustrative of implementations of the inventive concept and the scope of the present invention should not be considered limited to the specific forms set forth in the embodiments but rather by the equivalents thereof as may occur to those skilled in the art upon consideration of the present inventive concept.

Claims

1. A lightweight single-sample target detection method based on feature matching for power inspection scenes, comprising the following steps:

1) Training a mask-based base class target detection model;

before detecting new categories that have only a single labeled sample, first pre-training a Faster R-CNN target detection model on base-category data with a large number of labeled samples; the pre-trained model uses a randomly initialized feature extractor that is not pre-trained on a large dataset such as ImageNet; in order to improve the model's ability to recognize shallow visual features such as target edges and corners, the Canny edge detection method is first used to detect the edges of each training image, and the edge map is used as a fourth channel of the training image; in addition, in order to reduce the parameter count of the model while preserving the performance of the detection model, a plurality of trainable binary masks are used to prune the convolution layers of the feature extractor, wherein when the mask value of a channel of a convolution layer is 1, the output of the channel is retained, and when the mask value is 0, the output of the channel is set to 0; in order to train the binary masks, a loss function is designed that penalizes mask values that are too high (that is, a parameter count that is too large); let M_i denote the mask of the i-th layer of the network and S_i the parameter count of the i-th layer; the mask loss function of the method is then:

$$L_{\text{model\_size}} = \sum_i \mathrm{ratio}(M_i) \cdot S_i \qquad (1)$$

2) Enhancing the data of the new category label sample;

augmenting the single labeled sample of each new category, expanding it into a plurality of different samples so as to strengthen the representation of the target category's knowledge; augmenting the single labeled sample by three methods: traditional data augmentation, domain alignment, and bounding box correction; the traditional data augmentation comprises saturation transformation, brightness transformation, image rotation, and the like; the bounding box correction segments the labeled sample to remove the influence of the background within it; the domain alignment pastes the labeled sample onto a background image and adaptively adjusts the size of the pasted sample according to the proportions of the labeled sample;

3) Extracting the characteristics of the new category label sample;

because the labeled samples of a new category are pasted onto background images at different sizes, extracting features directly from the whole image would let background features interfere with the features of the new category; therefore, the region where the new-class target is located is first extracted using the Faster R-CNN pre-trained on the base classes, and the features output by the RoI Pooling layer are then taken as the features of the category;

4) The model carries out preliminary reasoning on the test data;

performing preliminary inference on the test data using Faster R-CNN; first, processing an input test image by the normal Faster R-CNN pipeline to obtain candidate bounding boxes and the corresponding features output by the RoI layer; second, computing the cosine similarity between the features of each candidate bounding box and the features of each category extracted in step 3), and obtaining the probability that each candidate bounding box belongs to each category through a Softmax layer; let the total number of classes be K, let c_i denote the mean feature of class i, and let b_j denote the RoI feature extracted from bounding box j; the probability that bounding box j belongs to class k is then the Softmax, taken over the K classes, of the cosine similarity between b_j and c_k;

5) Using the traditional characteristics to correct the reasoning result;

in addition to target detection with deep features, extracting key points using traditional SIFT features and further matching images; specifically, matching the single labeled image of a new category against each test image judged to belong to that category in the preceding step, and removing images that fail to match; first, extracting the key points and feature descriptors of the labeled image and of the test image using the SIFT feature extraction algorithm, and then computing the matched key points between the two images using a K-nearest-neighbor matching algorithm; if the number of matched key-point pairs is 0, judging that the test image does not belong to the category; if the number of matched key-point pairs is greater than 0, computing the minimum matching distance, and if the minimum distance is smaller than a given threshold, judging that the test image belongs to the category.