WO2023077821A1 - Multi-resolution ensemble self-training-based target detection method for small-sample low-quality image - Google Patents


Info

Publication number
WO2023077821A1
Authority
WO
WIPO (PCT)
Prior art keywords
training
data
target detection
model
quality image
Application number
PCT/CN2022/099827
Other languages
French (fr)
Chinese (zh)
Inventor
王鹏
邓玉岩
林蔚东
Original Assignee
西北工业大学
Filing date
Publication date
Application filed by 西北工业大学 (Northwestern Polytechnical University)
Publication of WO2023077821A1 publication Critical patent/WO2023077821A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

A multi-resolution ensemble self-training-based target detection method for small-sample low-quality images, in the technical field of image processing. First, a target detection model is preliminarily trained on labeled data; the trained model then predicts labels for unlabeled data; the newly pseudo-labeled data are added to the original data and the detection model is retrained, and this process is iterated to obtain the final detection model. In addition, a multi-resolution ensemble self-training mode is used each time the latest model predicts labels for the unlabeled data. The method effectively combines labeled and unlabeled low-quality image data and improves the precision of target detection on small-sample low-quality images.

Description

Small-sample low-quality image target detection method based on multi-resolution ensemble self-training

Technical field
The invention belongs to the technical field of image processing, and in particular relates to a small-sample low-quality image target detection method based on multi-resolution ensemble self-training.
Background art
With the advancement of science and technology and the rapid development of digital information technology, digital devices are not only widely used across industries but have also become an indispensable part of daily life. Since the middle of the last century, computer image target detection technology has developed vigorously and is widely applied in aerospace, terrain exploration, traffic monitoring, and other fields. In the era of rapidly developing information technology, and especially with the popularization of electronic products such as digital cameras and hand-held cameras, target detection technology is used in many aspects of everyday life.
In practical applications, imaging conditions vary widely, and the acquired images are often of poor quality. For example, images captured in extreme weather such as rain, snow, or fog suffer from reduced contrast, blurred details, and severe quality degradation, which greatly limits subsequent computer-vision applications, especially outdoor navigation, traffic monitoring, and target recognition. Video images on the Internet often lose clarity and information through frequent copying, transmission, and format conversion; slight movement of the capture device causes image shake and blur; and images collected at night, affected by insufficient and single-source lighting and by the capture device itself, generally exhibit low contrast, high noise, and color distortion, greatly reducing their usability. In summary, low-quality images are unclear images caused by poor imaging conditions in the captured scene or by unstable capture equipment, and studying target detection under degraded image quality is of great significance both for computer-vision research and for practical applications.
On the other hand, compared with target detection in conventional scenes, acquiring low-quality scene images in extreme weather is often more difficult, and manually annotating data is costly. It is therefore necessary to study a target detection method that works with very little annotated low-quality image data.
Summary of the invention
Technical problem to be solved
To avoid the shortcomings of the prior art, the present invention proposes a small-sample low-quality image target detection method based on multi-resolution ensemble self-training. Using the proposed detection framework, the data can be fully exploited for low-quality image target detection tasks under limited annotated data and computing resources; the framework can be adapted to different target detection models and balances efficiency against detection accuracy.
Technical solution
A small-sample low-quality image target detection method based on multi-resolution ensemble self-training, characterized by the following steps:
Step 1: Assume the input image data consist of labeled data pairs (X₁, Y₁) and unlabeled data X₂. First use the labeled data to perform preliminary training of the target detection model Faster-R-CNN, with the optimization objective:
MIN Loss(Y₁, F₁(X₁))
After obtaining the first trained model F₁, use it to predict labels for the unlabeled data, i.e.:
Y₂ = F₁(X₂)
Step 2: Add (X₂, Y₂) to the original data as labeled data to obtain the augmented data set D₁ = (X₁, Y₁, X₂, Y₂). Retrain the target detection model Faster-R-CNN on D₁ to obtain F₂, then use F₂ to predict better labels Y₃ for X₂.
Step 3: Continuing in this way, the final detection model Fₙ is obtained through repeated iterative updating.
Step 4: Use the final detection model Fₙ to detect targets in the image data to be detected, obtaining the final image targets.
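The iterative procedure of steps 1 to 4 can be sketched as follows; `train` and `predict` are hypothetical stand-ins for Faster-R-CNN training and inference, since the patent does not fix a programming interface:

```python
def self_train(labeled, unlabeled, train, predict, iterations=6):
    """Steps 1-4: train on labeled data, pseudo-label the unlabeled data,
    retrain on the union, and iterate to obtain the final detector F_n."""
    X1, Y1 = labeled
    model = train(X1, Y1)                      # step 1: preliminary training
    for _ in range(iterations):                # steps 2-3: iterative updating
        Y_pseudo = predict(model, unlabeled)   # predict labels for X2
        model = train(X1 + unlabeled, Y1 + Y_pseudo)  # retrain on D1
    return model                               # step 4: final detector
```

The method later fixes the number of iterations at 6; the two callables are placeholders for whichever detector is plugged into the framework.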
A further aspect of the invention: after each round of training yields the latest model F, it is used to predict labels for the unlabeled data as follows. First, a dark-channel dehazing model is applied to the original low-quality hazy images, and its window parameter is varied to generate dehazed images I₁, I₂, …, I_k at k different levels of clarity, where k is the total number of clarity levels. These k versions are then fed to F for prediction, producing k groups of (x, y, w, h, c) quintuple predictions, in which the first four numbers give the predicted position and the last number c is the confidence of the predicted category. When integrating the k groups of quintuple predictions, the confidence c is used as follows: if c exceeds the threshold 0.8, the current prediction is kept in the final result set; if c is below the threshold 0.3, the current prediction is added to the set awaiting manual correction; the remaining predictions are fused within each class according to their intersection-over-union, that is, for any two same-class prediction boxes whose intersection-over-union exceeds the threshold 0.7, the box with the larger c is kept and added to the final result set. Finally, the "wrong" predictions accumulated in the manual-correction set are corrected and then added to the final result set.
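A minimal sketch of the integration rule just described, assuming each quintuple is extended with a class index so intra-class fusion can group boxes, and assuming (x, y) is the box's top-left corner (the patent does not fix either convention):

```python
HIGH, LOW, IOU_THR = 0.8, 0.3, 0.7  # thresholds stated in the method

def iou_xywh(a, b):
    """IoU for (x, y, w, h, ...) boxes, (x, y) taken as the top-left corner."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def integrate_predictions(preds):
    """Triage (x, y, w, h, c, cls) predictions from the k clarity levels:
    keep high-confidence boxes, flag low-confidence ones for manual
    correction, and fuse the remainder within each class by IoU."""
    final, manual, middle = [], [], []
    for p in preds:
        if p[4] > HIGH:
            final.append(p)
        elif p[4] < LOW:
            manual.append(p)
        else:
            middle.append(p)
    # Of two same-class boxes with IoU > 0.7, keep the higher-confidence one.
    middle.sort(key=lambda p: -p[4])
    kept = []
    for p in middle:
        if all(q[5] != p[5] or iou_xywh(p, q) <= IOU_THR for q in kept):
            kept.append(p)
    final.extend(kept)
    return final, manual
```

The manual-correction set returned here corresponds to the samples later handed to experts in the active-learning step.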
A further aspect of the invention: the number of iterations in step 3 is 6.
Beneficial effects
The small-sample low-quality image target detection method based on multi-resolution ensemble self-training proposed by the present invention improves on the self-training method from semi-supervised learning by adding active learning to make targeted use of the data mislabeled during self-training. For low-quality image scenes it further proposes a multi-resolution ensemble self-training scheme that effectively combines labeled and unlabeled low-quality image data, improving the accuracy of small-sample low-quality image target detection.
Brief description of the drawings
The drawings are provided only to illustrate specific embodiments and are not to be regarded as limiting the invention; throughout the drawings, the same reference symbols denote the same parts.
Figure 1: Schematic diagram of the overall model structure;

Figure 2: Main flow of the semi-supervised low-quality image target detection algorithm;

Figure 3: Multi-resolution ensemble self-training method.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the invention, not to limit it. In addition, the technical features involved in the embodiments described below can be combined with one another as long as they do not conflict.
The technical solution of the present invention is described in two parts: the overall model structure and multi-resolution ensemble self-training. The overall model is built on the existing two-stage target detection framework Faster-R-CNN. First, the model is trained on the provided labeled data; the trained model then predicts labels for the unlabeled samples; finally, these newly "labeled" data are added to the original training data and the model is retrained to obtain the final model for testing.
1. Overall model structure
As shown in Figure 1, the model of the present invention is built on the Faster-R-CNN framework. Assume the input data consist of labeled data pairs (X₁, Y₁) and unlabeled data X₂. The first step is to use the labeled data for preliminary training of the model, with the optimization objective:
MIN Loss(Y₁, F₁(X₁))
After obtaining the first trained model F₁, use it to predict labels for the unlabeled data, i.e.:
Y₂ = F₁(X₂)
Then (X₂, Y₂) is added to the original data as labeled data to obtain the augmented data set D₁ = (X₁, Y₁, X₂, Y₂). The labels Y₂ contain some noise; their reliability can be increased by raising the final confidence threshold. Meanwhile, to improve the prediction quality of the labels, the above process can be repeated: retrain the model on D₁ to obtain F₂, then use F₂ to predict better labels Y₃ for X₂, and so on, iterating until the data are well labeled. The increase in the amount of data undoubtedly brings a considerable improvement in model performance.
After the multiple rounds of iterative training described above, the final detection model Fₙ is obtained. This model is used for detection on the test data set, which carries no annotations, yielding the final low-quality image detection results. The main steps of the whole algorithm are shown in Figure 2.
2. Multi-resolution ensemble self-training
Deep learning technology owes its wide applicability to two factors: large amounts of data and advances in hardware. Data in deep learning are generally divided into three types: training data, validation data, and test data. The training data must be annotated according to the task type, which is very time-consuming and costly; moreover, in low-quality scenes it is difficult to acquire sufficient image data because of weather and other constraints, so self-training can be used in this scenario to increase the amount of labeled data. However, generic self-training does not consider low-quality image scenarios and fails to exploit the "image quality" factor effectively. The present method improves on generic self-training for low-quality image scenes in two respects. First, for low-quality images, multiple levels of clarity are used jointly to predict annotations, increasing the model's ability to fit images of different clarity. Second, an ensemble method combining active learning and confidence voting fuses the prediction results obtained from the low-quality images at the different levels of clarity. Based on these two points, the present invention proposes the following multi-resolution ensemble self-training strategy: apply a dehazing algorithm to the low-quality images to generate images at different levels of clarity for label prediction, then use active learning to fuse the resulting label information into the final annotations.
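As an illustration of generating the images I₁ … I_k, the following is a minimal single-image dark-channel-prior dehazing sketch in which the min-filter window size plays the role of the clarity-controlling window parameter. The patent does not specify the dehazing implementation, so the details here (omega, the transmission floor t0, the atmospheric-light estimate, and the naive NumPy-only min filter) are all assumptions:

```python
import numpy as np

def min_filter(ch, size):
    """Naive square minimum filter with edge padding."""
    pad = size // 2
    padded = np.pad(ch, pad, mode="edge")
    out = np.empty_like(ch)
    for i in range(ch.shape[0]):
        for j in range(ch.shape[1]):
            out[i, j] = padded[i:i + size, j:j + size].min()
    return out

def dark_channel_dehaze(img, window, omega=0.95, t0=0.1):
    """Dark-channel-prior dehazing of an HxWx3 float image in [0, 1];
    `window` controls how aggressively haze is removed."""
    dark = min_filter(img.min(axis=2), window)
    # Atmospheric light: mean colour of the brightest dark-channel pixels.
    n = max(1, dark.size // 1000)
    idx = np.unravel_index(np.argsort(dark, axis=None)[-n:], dark.shape)
    A = img[idx].mean(axis=0).clip(min=1e-6)
    # Transmission map, floored at t0, then scene-radiance recovery.
    t = 1.0 - omega * min_filter((img / A).min(axis=2), window)
    t = np.maximum(t, t0)[..., None]
    return np.clip((img - A) / t + A, 0.0, 1.0)

def multi_clarity_versions(img, windows=(5, 15, 45)):
    """One dehazed copy per window size: the images I1 ... Ik."""
    return [dark_channel_dehaze(img, w) for w in windows]
```

Each of the k dehazed copies would then be passed through the current detector F, as described above.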
In ensemble self-training by voting, samples near class boundaries may be mislabeled when detector performance is low. Selecting samples close to class centers by confidence reduces the mislabeling rate, but even for samples that every ensemble detector judges with high confidence, heterogeneous detectors may still predict inconsistent labels. Moreover, if only high-confidence samples are selected and added to the training set during iteration, those samples are highly similar to the existing training samples, and adding them may not improve the algorithm's performance. Conversely, low-confidence samples have low similarity to the training samples; labeling them correctly and then adding them to the training set can greatly improve classifier performance. Active learning asks experts to annotate a small number of unlabeled samples, thereby obtaining correct annotations for them. Combining the two in an ensemble self-training algorithm that uses both active learning and confidence voting effectively addresses the above problems of ensemble self-training.
Finally, after relatively accurate data labels are obtained, they are added to the original labeled data set and fed to the target detection model Faster-R-CNN for joint training, which amounts to optimizing the final model with two different portions of data (one accurately labeled, one approximately labeled). The final optimization objective can thus be written as:
MIN Σ_{i=1..n} [Loss1(Yᵢ, f(Xᵢ)) + Loss2(Yᵢ, f(Xᵢ))] + Σ_{i>n} [Loss1(Y′ᵢ, f(Xᵢ)) + Loss2(Y′ᵢ, f(Xᵢ))]
where Yᵢ is the true label, f is the model output, Loss1 and Loss2 are the respective loss functions of the target detection model, and n is the number of labeled training samples; the second sum is the loss over the unlabeled data, where Y′ᵢ is the relatively accurate pseudo-label obtained by the optimized ensemble self-training method described above.
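The combined objective can be read as the following sketch, where `model`, `loss1`, and `loss2` are hypothetical callables standing for the detector's forward pass and its two loss terms:

```python
def joint_objective(model, loss1, loss2, labeled, pseudo_labeled):
    """Sum of both loss terms over the n labeled pairs (Y_i) and over the
    pseudo-labeled pairs (Y'_i) produced by ensemble self-training."""
    total = 0.0
    for X, Y in list(labeled) + list(pseudo_labeled):
        out = model(X)
        total += loss1(Y, out) + loss2(Y, out)
    return total
```

In training, the pseudo-labeled portion simply joins the labeled portion; only the origin of its targets differs.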
To enable those skilled in the art to better understand the present invention, the invention is described in detail below with reference to specific embodiments.
The overall method flow is divided into three parts: the initial network training phase, the ensemble self-training phase, and the testing phase. The overall framework is an improvement of the two-stage target detection framework Faster-R-CNN.
1) Initial network training phase:
The present invention uses the Faster-R-CNN model in the initial network training phase. To increase the generalization performance of the model, it is first pre-trained on the Pascal-VOC and COCO data sets. Since the former has 20 categories and the latter 80, they cannot be merged directly; the data set labels were therefore cleaned, similar labels merged, and all of them finally mapped into COCO's 80 categories, for a total of 135,412 pictures. Pre-training ran for 12 epochs with an SGD optimizer, a learning rate of 0.001, and a batch size of 16; the learning rate was halved at the ninth and eleventh epochs, reaching a good convergence condition. The trained weights were then used as initialization parameters of the network, the model's output categories were adjusted to the 5 categories of the RTTS data set, and training was repeated. Since this run is fine-tuning, only 6 epochs were trained in total, with an initial learning rate of 0.0005; all other settings were kept consistent with the previous ones.
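The pre-training schedule (learning rate 0.001, halved at the ninth and eleventh epoch) can be expressed as a small helper; reading "halved" as halving the current rate at each milestone is an interpretation of the text:

```python
def lr_schedule(epoch, base_lr=0.001, milestones=(9, 11), gamma=0.5):
    """Learning rate for a 1-indexed epoch under a step decay that halves
    the rate at each milestone epoch."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```

Over the 12 pre-training epochs this yields 0.001 for epochs 1 to 8, 0.0005 for epochs 9 and 10, and 0.00025 for epochs 11 and 12.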
The data used in the initial training phase include two general-purpose target detection data sets with a large amount of data and the small RTTS training data set, which contains a total of 500 pictures; this training data carries manual annotation information.
2) Multi-resolution ensemble self-training phase:
In the multi-resolution ensemble self-training phase, the low-quality images are first dehazed to obtain images of different clarity, and the model obtained in the first training step is then used to annotate the unlabeled data at each level of clarity; a total of 4,000 unlabeled images are used. This step uses three different confidence-threshold scores, 0.3, 0.5, and 0.7, giving three models in total. At the final integration stage, the NMS algorithm is first applied for non-maximum-suppression deduplication, and the four coordinates produced by the model under the three parameter settings are weighted and fused to obtain the final coordinates (x₁, y₁, x₂, y₂). The label y_c with the highest probability is selected as the final category label, so that y′ = (x₁, y₁, x₂, y₂) + y_c is added to the training set as the final pseudo-label.
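The final integration step, deduplication across the three threshold models followed by score-weighted coordinate fusion, might be sketched as follows. The greedy clustering and the 0.5 IoU threshold are assumptions, since the text only names NMS and weighted fusion:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def fuse_detections(dets, iou_thr=0.5):
    """Greedily cluster detections (x1, y1, x2, y2, score, label) across
    the three threshold models, average coordinates weighted by score,
    and keep the highest-scoring member's label as the pseudo-label y_c."""
    dets = sorted(dets, key=lambda d: -d[4])
    fused, used = [], [False] * len(dets)
    for i, d in enumerate(dets):
        if used[i]:
            continue
        cluster, used[i] = [d], True
        for j in range(i + 1, len(dets)):
            if not used[j] and iou(d[:4], dets[j][:4]) >= iou_thr:
                cluster.append(dets[j])
                used[j] = True
        w = sum(c[4] for c in cluster)
        coords = tuple(sum(c[k] * c[4] for c in cluster) / w for k in range(4))
        fused.append(coords + (cluster[0][5],))  # highest-score label wins
    return fused
```

Each fused tuple corresponds to one pseudo-label y′ = (x₁, y₁, x₂, y₂) + y_c added to the training set.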
The above self-training process is iterated for 6 rounds in total. Its main purpose is to obtain more accurate annotation information for the unlabeled data and increase the amount of effective information in the dataset; finally, all data are trained together to obtain the final test model parameters.
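The iterative procedure — train, pseudo-label, merge, retrain — can be outlined as below. `train_detector` and `pseudo_label` are caller-supplied stand-ins for the Faster R-CNN training step and the multi-resolution ensemble labelling step, not the patent's actual implementation:

```python
def self_train(labeled, unlabeled, rounds=6, train_detector=None, pseudo_label=None):
    """Iterative self-training: retrain, re-label, merge, repeat.

    labeled: list of (image, annotation) pairs; unlabeled: list of images.
    Each round re-annotates the unlabeled pool with the latest model and
    retrains on the union, mirroring the 6-round loop described above.
    """
    model = train_detector(labeled)  # initial model F1
    for _ in range(rounds):
        pseudo = [(img, pseudo_label(model, img)) for img in unlabeled]
        model = train_detector(labeled + pseudo)  # Fn on the augmented set Dn
    return model
```

Plugging in trivial stubs (e.g. a "model" that is just the training-set size) is enough to check the control flow without any detection framework.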
3) Testing stage:
The testing stage serves to finally verify the effectiveness of the multi-resolution ensemble self-training scheme described above; the parameter settings in this part remain consistent with frameworks such as Faster R-CNN.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and such modifications or replacements shall all fall within the protection scope of the present invention.

Claims (6)

  1. A multi-resolution ensemble self-training-based target detection method for small-sample low-quality images, characterized by the following steps:
    Step 1: Assume the input image data consists of labeled data pairs (X1, Y1) and unlabeled data X2. First use the labeled data to perform preliminary training of the Faster R-CNN target detection model, with the optimization objective:
    min Loss(Y1, F1(X1))
    After obtaining the first trained model F1, use it to predict on the unlabeled data, i.e.:
    Y2 = F1(X2)
    Step 2: Then add (X2, Y2) to the original data as labeled data to obtain the augmented dataset D1 = (X1, Y1, X2, Y2); retrain the Faster R-CNN target detection model on D1 to obtain F2, then use F2 to predict better annotation information for X2, obtaining Y3;
    Step 3: Continue in this manner, iteratively updating until the final detection model Fn is obtained;
    Step 4: Use the final detection model Fn to detect the image data to be tested, obtaining the final image targets.
  2. 根据权利要求1所述的一种基于多清晰度集成自训练的小样本低质量图像目标检测方法,其特征在于在每次训练得到最新的模型F后,用它对无标注的数据进行预测:首先针对原始的低质量带雾图片使用暗通道去雾模型进行清晰化处理,通过控制其窗口参数来产生不同清晰程度的去雾图片I 1I 2…I k,其中k表示一共有k种清晰度,然后将上述k种清晰程度的图片分别输入给F进行预测,会产生k组(x,y,w,h,c)五元组预测结果,其中前四个数字预测位置,最后一个数字c预测属于当前类别的置信度;在对k组五元组预测结果进行集成时,根据置信度c的大小:当c大于第一给定阈值时,则保留当前预测结果到最终结果集合中;当c小于第二给定阈值时,则将当前预测结果添加到待人工纠正集合中;对于剩下的预测结果,则按照交并比大小进行类内的融合,即对于同类中的任意两个预测框交并比大于第三给定阈值的坐标框,保留c值较大的那个加入到最终结果集中;最后将上述过 程中产生的待人工纠正集合中的“错误”预测结果进行纠正后加入最终结果集。 A kind of small-sample low-quality image target detection method based on multi-definition integrated self-training according to claim 1, characterized in that after each training obtains the latest model F, use it to predict the data without labels: Firstly, the dark channel dehazing model is used to clear the original low-quality foggy pictures, and the dehazing pictures I 1 I 2 ...I k with different degrees of clarity can be generated by controlling its window parameters, where k means that there are k types of clearness in total degrees, and then input the pictures of the above k levels of clarity to F for prediction, which will generate k sets of (x, y, w, h, c) quintuple prediction results, in which the first four numbers predict the position, and the last number c predicts the confidence degree belonging to the current category; when integrating the prediction results of k groups of quintuples, according to the size of the confidence degree c: when c is greater than the first given threshold, the current prediction result is kept in the final result set; When c is less than the second given threshold, the current prediction result is added to the set to be manually corrected; for the remaining prediction results, the intra-class fusion is performed according to the intersection ratio, that is, for any two of the same class For the coordinate frame whose intersection ratio of prediction frame is greater than the third given threshold, keep the one with larger c value and add it to the final result set; finally, correct the 
"wrong" prediction result in the set to be manually corrected generated in the above process and add it final result set.
  3. The multi-resolution ensemble self-training-based target detection method for small-sample low-quality images according to claim 2, characterized in that the first given threshold is 0.8.
  4. The multi-resolution ensemble self-training-based target detection method for small-sample low-quality images according to claim 2, characterized in that the second given threshold is 0.3.
  5. The multi-resolution ensemble self-training-based target detection method for small-sample low-quality images according to claim 2, characterized in that the third given threshold is 0.7.
  6. The multi-resolution ensemble self-training-based target detection method for small-sample low-quality images according to claim 1, characterized in that the number of iterations in step 3 is 6.
PCT/CN2022/099827 2021-11-07 2022-06-20 Multi-resolution ensemble self-training-based target detection method for small-sample low-quality image WO2023077821A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111309737.8A CN114067173A (en) 2021-11-07 2021-11-07 Small sample low-quality image target detection method based on multi-definition integrated self-training
CN202111309737.8 2021-11-07

Publications (1)

Publication Number Publication Date
WO2023077821A1 true WO2023077821A1 (en) 2023-05-11

Family

ID=80274201

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099827 WO2023077821A1 (en) 2021-11-07 2022-06-20 Multi-resolution ensemble self-training-based target detection method for small-sample low-quality image

Country Status (2)

Country Link
CN (1) CN114067173A (en)
WO (1) WO2023077821A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596161A (en) * 2023-07-04 2023-08-15 江南大学 Target prediction model construction method and prediction method under multi-center small sample scene
CN116776154A (en) * 2023-07-06 2023-09-19 华中师范大学 AI man-machine cooperation data labeling method and system
CN116912798A (en) * 2023-09-14 2023-10-20 南京航空航天大学 Cross-modal noise perception-based automatic driving event camera target detection method
CN117041625A (en) * 2023-08-02 2023-11-10 成都梵辰科技有限公司 Method and system for constructing ultra-high definition video image quality detection network
CN117496118A (en) * 2023-10-23 2024-02-02 浙江大学 Method and system for analyzing steal vulnerability of target detection model
CN117496191A (en) * 2024-01-03 2024-02-02 南京航空航天大学 Data weighted learning method based on model collaboration
CN117041625B (en) * 2023-08-02 2024-04-19 成都梵辰科技有限公司 Method and system for constructing ultra-high definition video image quality detection network

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114067173A (en) * 2021-11-07 2022-02-18 西北工业大学 Small sample low-quality image target detection method based on multi-definition integrated self-training
CN114882344A (en) * 2022-05-23 2022-08-09 海南大学 Small-sample underwater fish body tracking method based on semi-supervision and attention mechanism

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107527029A (en) * 2017-08-18 2017-12-29 卫晨 A kind of improved Faster R CNN method for detecting human face
CN108764281A (en) * 2018-04-18 2018-11-06 华南理工大学 A kind of image classification method learning across task depth network based on semi-supervised step certainly
CN111401418A (en) * 2020-03-05 2020-07-10 浙江理工大学桐乡研究院有限公司 Employee dressing specification detection method based on improved Faster r-cnn
CN111553414A (en) * 2020-04-27 2020-08-18 东华大学 In-vehicle lost object detection method based on improved Faster R-CNN
CN112232416A (en) * 2020-10-16 2021-01-15 浙江大学 Semi-supervised learning method based on pseudo label weighting
US20210064934A1 (en) * 2019-08-30 2021-03-04 Adobe Inc. Selecting logo images using machine-learning-logo classifiers
CN113052789A (en) * 2020-11-03 2021-06-29 哈尔滨市科佳通用机电股份有限公司 Vehicle bottom plate foreign body hitting fault detection method based on deep learning
CN114067173A (en) * 2021-11-07 2022-02-18 西北工业大学 Small sample low-quality image target detection method based on multi-definition integrated self-training


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596161A (en) * 2023-07-04 2023-08-15 江南大学 Target prediction model construction method and prediction method under multi-center small sample scene
CN116596161B (en) * 2023-07-04 2023-10-13 江南大学 Target prediction model construction method and prediction method under multi-center small sample scene
CN116776154A (en) * 2023-07-06 2023-09-19 华中师范大学 AI man-machine cooperation data labeling method and system
CN116776154B (en) * 2023-07-06 2024-04-09 华中师范大学 AI man-machine cooperation data labeling method and system
CN117041625A (en) * 2023-08-02 2023-11-10 成都梵辰科技有限公司 Method and system for constructing ultra-high definition video image quality detection network
CN117041625B (en) * 2023-08-02 2024-04-19 成都梵辰科技有限公司 Method and system for constructing ultra-high definition video image quality detection network
CN116912798A (en) * 2023-09-14 2023-10-20 南京航空航天大学 Cross-modal noise perception-based automatic driving event camera target detection method
CN116912798B (en) * 2023-09-14 2023-12-19 南京航空航天大学 Cross-modal noise perception-based automatic driving event camera target detection method
CN117496118A (en) * 2023-10-23 2024-02-02 浙江大学 Method and system for analyzing steal vulnerability of target detection model
CN117496191A (en) * 2024-01-03 2024-02-02 南京航空航天大学 Data weighted learning method based on model collaboration
CN117496191B (en) * 2024-01-03 2024-03-29 南京航空航天大学 Data weighted learning method based on model collaboration

Also Published As

Publication number Publication date
CN114067173A (en) 2022-02-18

Similar Documents

Publication Publication Date Title
WO2023077821A1 (en) Multi-resolution ensemble self-training-based target detection method for small-sample low-quality image
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
Tan et al. Night-time scene parsing with a large real dataset
CN112597883B (en) Human skeleton action recognition method based on generalized graph convolution and reinforcement learning
CN110781262B (en) Semantic map construction method based on visual SLAM
US11568637B2 (en) UAV video aesthetic quality evaluation method based on multi-modal deep learning
US9230159B1 (en) Action recognition and detection on videos
Yang et al. St3d++: Denoised self-training for unsupervised domain adaptation on 3d object detection
US10929676B2 (en) Video recognition using multiple modalities
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN115393687A (en) RGB image semi-supervised target detection method based on double pseudo-label optimization learning
WO2023040510A1 (en) Image anomaly detection model training method and apparatus, and image anomaly detection method and apparatus
WO2023109361A1 (en) Video processing method and system, device, medium and product
Yun et al. Panoramic vision transformer for saliency detection in 360∘ videos
WO2022148248A1 (en) Image processing model training method, image processing method and apparatus, electronic device, and computer program product
CN113793359B (en) Target tracking method integrating twin network and related filtering
Zhang et al. EventMD: High-speed moving object detection based on event-based video frames
CN111723934B (en) Image processing method and system, electronic device and storage medium
JP6600288B2 (en) Integrated apparatus and program
WO2023092582A1 (en) A scene adaptive target detection method based on motion foreground
He et al. CPSPNet: Crowd counting via semantic segmentation framework
CN113627240B (en) Unmanned aerial vehicle tree species identification method based on improved SSD learning model
CN114882346A (en) Underwater robot target autonomous identification method based on vision
CN112634331A (en) Optical flow prediction method and device
Sun et al. Study of UAV tracking based on CNN in noisy environment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22888857

Country of ref document: EP

Kind code of ref document: A1