CN116977710A

CN116977710A - Remote sensing image long tail distribution target semi-supervised detection method

Info

Publication number: CN116977710A
Application number: CN202310710247.1A
Authority: CN
Inventors: 张浩鹏; 姚黎帆; 王毓浩; 张信耶; 宋佳芸; 张芳芳
Original assignee: Qingdao Research Institute Of Beihang University; Beihang University
Current assignee: Qingdao Research Institute Of Beihang University; Beihang University
Priority date: 2023-06-15
Filing date: 2023-06-15
Publication date: 2023-10-31

Abstract

The invention discloses a semi-supervised detection method for long-tail distributed targets in remote sensing images. The teacher-student learning framework is extended into an iterative pseudo-label framework by means of an active sampling strategy, so that limited label information is used to the fullest and pseudo-label quality is improved. In this process, the samples to be labeled are selected with a combined metric of difficulty, information quantity, and diversity, in order to find the most suitable data for semi-supervised target detection. During training, a limited set of labeled data is partially initialized and gradually enlarged. After each iteration, the importance of the unlabeled data is assessed with metrics computed by the trained teacher model, and active data augmentation is performed based on these metrics. A transfer-learning-based method addresses training data imbalance by transferring features learned from head classes, which are rich in training examples, to under-represented tail classes.

Description

Remote sensing image long tail distribution target semi-supervised detection method

Technical Field

The invention relates to the technical field of digital image processing, in particular to a semi-supervised remote sensing image long tail target detection method based on active learning.

Background

Semi-supervised target detection is a research topic in the field of computer vision, aimed at improving the performance of target detection by combining annotated and non-annotated data. In the traditional target detection algorithm, the supervised method requires a large amount of marked data for training, but the cost for acquiring the marked data is high, and the marked data often has the problems of inaccuracy, incompleteness and the like, so that the performance and the application range of the algorithm are limited.

The semi-supervised learning method combines the labeled data and the unlabeled data to train the model. Specifically, the target detection performance is improved by performing supervised learning by using a small amount of marked data and performing self-supervised learning by combining a large amount of unmarked data. The non-labeling data can be a picture or video downloaded from the internet or data obtained from other application scenes. By means of semi-supervised learning, limited marked data and a large amount of unmarked data can be effectively utilized, and therefore target detection performance and reliability are improved.

However, in the task of detecting the target of the remote sensing image, the situation that the number of samples in some categories is smaller and the number of samples in other categories is larger often occurs, so that long tail distribution is formed. Because the model is easier to learn the category with higher occurrence frequency in the training process, and the category with lower occurrence frequency is ignored, the long tail distribution can have a larger influence on the performance of the model.

Traditional algorithms such as SVM detect targets from image features; the features must be screened manually, which is laborious, and such methods struggle with the long-tail distribution of imbalanced categories. Deep-learning-based remote sensing target detection mostly extracts image features with a deep convolutional neural network (CNN), separates foreground regions from background regions on the resulting feature map, classifies the foreground regions and regresses the target detection frames, and finally obtains the detection result. However, this approach handles poorly the long-tail distribution problem in which some categories have few samples and others have many, because during training the model learns frequently occurring categories more easily and neglects infrequent ones. The model therefore learns a large amount of biased information, which degrades the detection accuracy of the network. In addition, some remote sensing images have complex backgrounds, many target types, and many densely distributed targets, while others contain few, sparsely distributed targets; in other words, sample difficulty varies greatly. This places high demands on the selection of supervised samples for semi-supervised target detection, yet traditional semi-supervised target detection ignores this characteristic, so its detection results are often unsatisfactory. Traditional random selection may also miss a large number of valuable samples, leaving the model with insufficient knowledge.

Therefore, how to provide a semi-supervised detection method aiming at a long-tail distribution target of a remote sensing image and capable of improving the detection performance of the remote sensing image target is a technical problem to be solved by a person skilled in the art.

Disclosure of Invention

Aiming at the current research situation and the existing problems, the invention provides a remote sensing image long tail distribution target semi-supervised detection method, which improves the quality of pseudo tag generation by expanding a teacher-student learning network frame into an iteration frame based on active learning; the problem of class imbalance in the remote sensing dataset is alleviated by adjusting the classifier using the balanced class dataset.

The invention provides a remote sensing image long tail distribution target semi-supervised detection method, which comprises the following steps:

S1: Building a teacher-student learning model: labeled data are screened from a public remote sensing image data set and used to train a reference target detector, and the training process of the reference target detector is optimized with a two-stage training strategy:

the feature extraction part of the reference target detector is trained with the labeled data of the head categories of the long-tail data set, and the network parameters of the feature extraction part are then fixed, yielding a first-stage reference target detector;

a given number of labeled samples is drawn equally from every category of the long-tail data set to obtain a balanced category data set, the classification and regression part of the first-stage reference target detector is trained with the balanced category data set to obtain an optimized reference target detector, and the teacher model and the student model both adopt the network structure of the optimized reference target detector;

S2: Active sampling by the teacher model: unlabeled data are screened from the public data set, the teacher model actively samples the unlabeled data according to a set metric, the unlabeled data meeting the metric requirement are manually annotated to obtain new labeled data, the new labeled data are merged with the labeled data screened in S1, and the remaining unlabeled data in the remote sensing image public data set are predicted to generate pseudo labels, the prediction results serving as pseudo-label data;

S3: Semi-supervised learning of the student model: the student model is trained with the labeled data and the pseudo-label data of the remote sensing image public data set;

S4: Pseudo-label screening: after the current training round of the student model is finished, the pseudo labels are screened according to the similarity between the category detected by the student model when predicting on the unlabeled data and the category of the pseudo label, and the screened pseudo-label data participate in the next round of training;

S5: Iterative training: steps S2-S4 are repeated until a preset number of training rounds is reached or the training performance of the teacher-student learning model meets the preset requirement.

Preferably, the two-stage training strategy specifically includes:

the feature extraction part of the reference target detector is trained with the labeled data of the head categories of the long-tail data set whose sample counts meet the requirement, and the network parameters of the feature extraction part are then fixed, yielding a first-stage reference target detector;

taking as reference the number N of labeled samples in the tail category with the fewest samples in the long-tail data set, where N is greater than or equal to 1, N labeled samples are drawn from every category of the long-tail data set to obtain a balanced category data set, and the classification and regression part of the first-stage reference target detector is trained with the balanced category data set to obtain the optimized reference target detector.

Preferably, the metric set in S2 includes a difficulty index, an information quantity index, and a diversity index, which are combined with the L1 norm to obtain the metric.

Preferably, S2 further includes: setting a threshold; the teacher model predicts on the unlabeled data meeting the metric requirement to obtain prediction frames, and when the confidence of a prediction frame is higher than the threshold, the prediction frame is determined to be a positive sample and a pseudo label is generated.

Preferably, S2 further includes: after the screened unlabeled data have been manually annotated, removing them from the long-tail data set and adding them to the labeled data of the remote sensing image public data set.

Preferably, the S4 includes:

after the current training round of the student model is finished, predicting the non-labeling data to generate a first detection frame, and predicting the labeling data by the student model to generate a second detection frame;

the pseudo tag meeting the similarity requirement with the first detection frame is reserved, and corresponding pseudo tag data participate in the next round of training of the student model;

the unsupervised loss function is the loss computed from the student model and the pseudo-label data, and the supervised loss function is the loss computed from the student model and the labeled data; the loss value is propagated from the output layer to the input layer by the back-propagation algorithm, the contribution of each parameter of the student model to the loss is computed, and the parameter values of the student model are updated.

Preferably, the step S5 includes:

and repeating the steps S2-S4 until a preset training round is reached or the performance index of the student model meets the preset value requirement.

Preferably, the S4 further includes:

judging whether the similarity between a first detection frame generated by predicting the non-labeling data by the student model and the corresponding pseudo tag meets the requirement or not;

if yes, the pseudo tag data participate in the training of the next round;

if not, determining that the pseudo tag is a noise tag, and removing data corresponding to the noise tag from training data.

Preferably, the labeled data screened in step S1 and the unlabeled data screened in step S2 are small images obtained by randomly segmenting the remote sensing images at a set proportion, with a set overlap rate between the small images, and these small images are used in the training steps S1-S5.

Preferably, the reference target detector includes a Faster R-CNN, SSD, or YOLO network.

Compared with the prior art, the invention has the following beneficial effects:

1. The invention applies semi-supervised target detection to remote sensing data and extends the teacher-student learning framework into an iterative framework in which the labeled data set is partially initialized, augmented, and updated through an active sampling strategy. The network can make maximum use of limited label information and improve pseudo-label quality, and it assists the annotation of remote sensing image data.

2. The invention decomposes the learning process into a representation learning stage and a detection learning stage of the distributed data so as to relieve the problem of unbalanced category in the remote sensing data set and further improve the quality of pseudo tag generation.

3. Selecting the samples to label with a combined metric of difficulty, information quantity, and diversity helps obtain the most suitable data for semi-supervised target detection.

4. The invention selects the real satellite remote sensing data set to test the model, and the experimental result proves the effectiveness of the method in qualitative and quantitative aspects.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is apparent that the drawings in the following description are only embodiments of the present invention, and that other drawings may be obtained from the provided drawings without inventive labor for those skilled in the art.

FIG. 1 is a schematic diagram of a remote sensing image long tail distribution target semi-supervised detection method provided by an embodiment of the invention;

FIG. 2 is a visual comparison of the detection results of the method provided by the embodiment of the invention and of prior-art methods.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The principle of application of the invention is described in detail below with reference to the accompanying drawings.

The long-tail distribution target semi-supervised detection method for the remote sensing image is applied to real remote sensing data.

First, the teacher-student learning framework is trained on a data set with a small number of labels; in this way the student model can progressively learn more knowledge from the unlabeled data and gradually approach the performance level of the reference target detector.

Second, the teacher-student learning framework is extended into an iterative pseudo-label framework in which the labeled data set is partially initialized, augmented, and updated through the active sampling strategy of the teacher model. The network can thus make maximum use of limited label information and improve pseudo-label quality. In this process, the samples to label are selected with a combined metric of difficulty, information quantity, and diversity, in order to find the most suitable data for semi-supervised target detection. During training, a limited set of labeled data is partially initialized and gradually enlarged.

Third, after each iteration, the importance of the unlabeled data is assessed with the metrics computed by the trained teacher model, and active data augmentation is performed based on these metrics.

Finally, for long tail problems, the invention uses a method based on transfer learning to solve the problem of unbalanced training data by transferring the features learned from the head class rich in training examples to the tail class with low representation.

As shown in FIG. 1, the specific implementation flow of this embodiment is detailed below:

S2: Active sampling by the teacher model: unlabeled data are screened from the remote sensing image public data set, the teacher model actively samples the unlabeled data according to a set metric, the unlabeled data meeting the metric requirement are manually annotated to obtain new labeled data, the new labeled data are merged with the labeled data screened in S1, and the remaining unlabeled data in the remote sensing image public data set are predicted to generate pseudo labels, the prediction results serving as pseudo-label data; a pseudo label is a detection frame carrying a predicted label;

S4: Pseudo-label screening: after the current training round of the student model is finished, the pseudo labels are screened according to the similarity between the category of the first detection frame, generated by the student model when predicting on the unlabeled data, and the category of the pseudo label, and the screened pseudo-label data participate in the next round of training;

In this embodiment, a two-stage training strategy of the reference target detector is proposed:

the two-stage training strategy specifically includes:

For the category imbalance problem, strong long-tail recognition capability can be obtained in the above optimization process by learning high-quality feature information from the long-tail dataset and then adjusting only the classifier on a balanced category dataset. The two-stage training strategy decomposes learning into a representation learning stage on the long-tail distributed data and a detector learning stage. In the representation learning stage, the neural network of the reference target detector is trained on the head-class data of the long-tail dataset. In the detector learning stage, the backbone parameters (the network parameters of the feature extraction part of the neural network) are frozen and only the detector parameters are retrained on the balanced category dataset. The balanced category dataset is constructed from the labeled data of the given long-tail dataset, without additional images.
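The construction of the balanced category dataset can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `build_balanced_subset`, the `(sample_id, class_name)` representation, and sampling per target instance rather than per image are all assumptions made for the example.

```python
import random
from collections import defaultdict


def build_balanced_subset(annotations, seed=0):
    """Draw N labeled samples per class, N being the size of the rarest class.

    annotations: list of (sample_id, class_name) pairs (hypothetical format).
    Returns a list of (sample_id, class_name) pairs with equal class counts.
    """
    by_class = defaultdict(list)
    for sample_id, class_name in annotations:
        by_class[class_name].append(sample_id)
    if not by_class:
        return []
    # N is the sample count of the tail class with the fewest samples (N >= 1).
    n = min(len(ids) for ids in by_class.values())
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    balanced = []
    for class_name in sorted(by_class):
        for sample_id in rng.sample(by_class[class_name], n):
            balanced.append((sample_id, class_name))
    return balanced
```

In a real detector an image usually contains targets of several classes, so balancing at image level would need an additional rule that the source does not specify.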

In this embodiment, an active sampling strategy of the teacher model is also proposed:

The background of remote sensing images is complex, and the number and scale of targets differ greatly between categories, which makes pseudo-label generation very challenging. Ground-truth label information plays a key role in the training phase of the student model and determines the quality of the pseudo labels and the performance of the offline teacher model. For this reason, the traditional teacher-student learning framework is extended into an active-learning-based iterative framework in which the labeled data are not generated in one step but are built up through iteration. During training, a limited set of labeled data is partially initialized and gradually enlarged.

In one embodiment, the reference target detector includes a Faster R-CNN, SSD, or YOLO network. The teacher model and the student model both adopt a copy of the network structure of the optimized reference target detector; they may also adopt a network structure similar to that of the optimized reference target detector, retaining the relevant structures and parameters of the backbone network and of the classification and regression network.

In one embodiment, after each iteration, the importance of the unlabeled data is assessed with the teacher model, and active data screening is performed according to the set metric. Pseudo labels are generated by the teacher model in every round: in each round the teacher model generates pseudo labels for a limited number of unlabeled samples, and in the next iteration a limited number of samples is selected again from the remaining unlabeled data. Valuable samples are screened out for annotation, and the iteration continues until the desired number of labels has been annotated.
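One round of this selection can be sketched as below. The sketch assumes a hypothetical callable `score_fn` that returns the teacher-derived metric of a sample (higher meaning more valuable to annotate) and a per-round annotation budget `budget`; neither name comes from the source.

```python
def active_sampling_round(unlabeled_ids, labeled_ids, score_fn, budget):
    """Move the `budget` highest-scoring unlabeled samples to the labeled pool.

    Returns (selected, remaining_unlabeled, new_labeled). The selected samples
    are assumed to be manually annotated before they are used as labeled data.
    """
    ranked = sorted(unlabeled_ids, key=score_fn, reverse=True)
    selected = ranked[:budget]
    chosen = set(selected)
    # Selected samples leave the unlabeled pool and join the labeled pool.
    remaining = [s for s in unlabeled_ids if s not in chosen]
    new_labeled = list(labeled_ids) + selected
    return selected, remaining, new_labeled
```

Calling the function once per iteration, with the teacher model retrained in between, reproduces the gradual growth of the labeled set described above.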

The network can make maximum use of limited label information and improve pseudo-label quality. The metrics set in S2 include a difficulty index, an information quantity index, and a diversity index, which are combined with the L1 norm to obtain the overall metric. This combined metric determines which samples are selected for labeling; the index formula is as follows:

[Formula not reproduced in the source text.]

Here the three quantities of the formula respectively represent difficulty, information quantity, and diversity. A further symbol (not reproduced in the source text) represents the number of detection frames in the image; i denotes the i-th detection frame and b the b-th image; N_c represents the number of categories; and p(c_k; b_j, θ_t) is the prediction probability of the teacher model for the k-th category c_k, where b_j is the input unlabeled data and θ_t denotes the parameters of the teacher network. confidence(b_j, θ_t) is the highest confidence score in the j-th bounding box predicted by the teacher model, c_j is the category predicted for the j-th bounding box, and |·| represents the cardinality (number of elements) of a set.

The three indices are combined using the L1 norm. The screened unlabeled data are sorted by the combined metric, and a set number of unlabeled samples is retained.
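Because the formula itself is not available, the following sketch shows only one plausible reading of the definitions above and should not be taken as the formula of the embodiment. It assumes that difficulty is the mean entropy of the class probabilities over the predicted boxes, that information quantity is the sum of the highest confidence scores, that diversity is the number of distinct predicted categories, and that each index is min-max normalized over the pool before the L1 norm is taken. The names `image_metrics` and `combined_scores` are hypothetical.

```python
import math


def image_metrics(boxes):
    """Return (difficulty, information, diversity) for one image.

    boxes: list of dicts, each with 'probs', the class-probability list of
    one box predicted by the teacher model (hypothetical format).
    """
    if not boxes:
        return 0.0, 0.0, 0
    entropy_sum = 0.0
    confidence_sum = 0.0
    classes = set()
    for box in boxes:
        probs = box["probs"]
        entropy_sum += -sum(p * math.log(p) for p in probs if p > 0)
        best = max(range(len(probs)), key=probs.__getitem__)
        confidence_sum += probs[best]  # highest confidence score of the box
        classes.add(best)              # category predicted for the box
    difficulty = entropy_sum / len(boxes)  # assumed: mean entropy per box
    return difficulty, confidence_sum, len(classes)


def combined_scores(metrics):
    """Min-max normalize each index over the pool, then take the L1 norm."""
    normalized = []
    for col in zip(*metrics):
        lo, hi = min(col), max(col)
        span = hi - lo
        normalized.append([(v - lo) / span if span else 0.0 for v in col])
    return [sum(abs(v) for v in row) for row in zip(*normalized)]
```

Sorting the unlabeled images by the returned scores in descending order and keeping the first entries corresponds to retaining a set number of samples.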

In this embodiment, S2 further includes: setting a threshold; the teacher model predicts on the unlabeled data meeting the metric requirement to obtain prediction frames, and when the confidence of a prediction frame is higher than the threshold, the prediction frame is determined to be a positive sample and a pseudo label is generated.

In this embodiment, S2 further includes: after the screened unlabeled data have been manually annotated, removing them from the long-tail data set and adding them to the labeled data of the remote sensing image public data set.
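The thresholding step can be illustrated with a short sketch. The tuple format of the predictions and the default threshold of 0.7 are assumptions for the example; the source does not state a threshold value.

```python
def make_pseudo_labels(predictions, threshold=0.7):
    """Keep teacher predictions whose confidence exceeds the threshold.

    predictions: list of (box, class_name, confidence) tuples (hypothetical
    format). The 0.7 default is illustrative, not taken from the source.
    Returns a list of (box, class_name) pseudo labels.
    """
    return [(box, cls) for box, cls, conf in predictions if conf > threshold]
```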

In one embodiment, the teacher model receives strongly augmented unlabeled data in order to obtain higher-quality pseudo labels, and the student model receives weakly augmented unlabeled data annotated with the pseudo labels in order to improve its generalization performance.

In one embodiment, S4 comprises:

In general, only pseudo labels that closely match the output of the student model are retained; that is, the pseudo labels meeting the similarity requirement with the detection frame are kept, and the corresponding pseudo-label data participate in the next round of training of the student model;

during training, the model predicts the input data and compares it to the actual tag. The loss function measures the degree of difference between the model predictive result and the actual label. By minimizing the loss function, the model can be more similar to the real situation, and the prediction accuracy is improved.

The unsupervised loss function in this embodiment refers to the loss computed from the student model and the pseudo-label data, and the supervised loss function refers to the loss computed from the student model and the labeled data.

The loss value is propagated from the output layer to the input layer by the back-propagation algorithm, the contribution of each parameter to the loss is computed, and the parameter values are updated. This process uses gradient descent to minimize the loss function: based on the gradients computed by back-propagation, the model parameters are updated with an optimization algorithm (e.g., stochastic gradient descent). By iterating this process, the parameters of the model are gradually adjusted so that the loss function approaches a minimum.
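A common way to write this update, given here only as an illustrative assumption because the source does not state how the two losses are combined, is a weighted sum followed by a gradient step on the student parameters. The weight λ, the learning rate η, and the symbol θ_s for the student parameters are introduced for this sketch.

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{sup}} + \lambda\,\mathcal{L}_{\mathrm{unsup}},
\qquad
\theta_s \leftarrow \theta_s - \eta\,\nabla_{\theta_s}\mathcal{L}
```

Here the supervised term is computed on the labeled data and the unsupervised term on the pseudo-label data, as defined above.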

In this embodiment, S5 includes:

Steps S2-S4 are repeated until a preset number of training rounds is reached or the performance index of the student model meets the preset requirement. Whether the requirement is met may be determined by monitoring performance metrics on the validation set.

Common performance metrics include accuracy, precision, recall, and the F1 score. The loss function measures the difference between the model's predictions and the actual labels and is the objective of model optimization. As the loss gradually decreases, the difference between the model's predictions and the actual labels shrinks and model performance improves.

The validation set is used to evaluate the performance of the model on unseen data. During training, performance indicators on the validation set may be monitored and compared as training proceeds. When the performance index is not improved or begins to decline, the performance of the model can be considered to reach a bottleneck, namely, when the performance of the student model is not improved, training can be stopped, and the influence on the generalization capability of the model caused by continuous fitting of training data is avoided.
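The stopping rule described above can be sketched as a simple patience check. The function name `should_stop` and the parameters `patience` and `min_delta` are assumptions for the example; the source only says that training may stop once validation performance no longer improves.

```python
def should_stop(val_history, patience=3, min_delta=0.0):
    """Return True when the validation metric has stopped improving.

    val_history: validation scores, one per training round, higher is better.
    Training stops if the best score of the last `patience` rounds does not
    exceed the best earlier score by more than `min_delta`.
    """
    if len(val_history) <= patience:
        return False  # not enough rounds to judge
    best_before = max(val_history[:-patience])
    recent_best = max(val_history[-patience:])
    return recent_best <= best_before + min_delta
```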

In one embodiment, S4 further comprises:

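if the similarity between the first detection frame, generated by the student model when predicting on the unlabeled data, and the corresponding pseudo label meets the requirement, the pseudo-label data participate in the next round of training;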

if not, determining the pseudo tag as a noise tag, removing data corresponding to the noise tag from training data, avoiding influence of noise on the training process, and preventing the model from learning wrong information.
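The screening rule can be sketched as follows. The source bases the similarity on the category of the student's detection frame and the category of the pseudo label; the additional intersection-over-union (IoU) check used here to pair boxes, its 0.5 default, and the function names are assumptions added so that the example is complete.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def screen_pseudo_labels(pseudo_labels, student_detections, min_iou=0.5):
    """Split pseudo labels into kept labels and noise labels.

    Both arguments are lists of (box, class_name). A pseudo label is kept when
    some student detection has the same class and overlaps it sufficiently.
    """
    kept, noisy = [], []
    for box, cls in pseudo_labels:
        match = any(d_cls == cls and iou(box, d_box) >= min_iou
                    for d_box, d_cls in student_detections)
        (kept if match else noisy).append((box, cls))
    return kept, noisy
```

The labels returned in `noisy` correspond to the noise labels whose data are removed from the training data.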

The training test results of the teacher-student learning model provided by the invention are described below based on a specific data set:

In this embodiment, DOTAv1.0 (The Dataset for Object Detection in Aerial Images) is selected as the data set for the aerial image target detection task. The dataset mainly contains 15,000 aerial images, including 1,411 training images, 458 validation images, and 937 test images. The images are captured by aircraft and unmanned aerial vehicles, with resolutions ranging from 800 × 800 to 4000 × 4000. The target categories in the dataset include plane, ship, storage tank, basketball court, harbor, bridge, runway, and others, for a total of 15 categories; the number of targets in each category is shown in Table 1.

Each target in the DOTAv1.0 dataset is annotated with a bounding box and a class label, making the dataset an important resource for training and evaluating aerial image target detection algorithms. The DOTAv1.0 dataset has been widely used in research and applications in the relevant fields. In this example, 2.5%, 5%, 10%, and 20% of the labeled data in the training set were retained, respectively, and the labels of the remaining data were removed to produce unlabeled data. During training, the proportion of labeled data is raised step by step to 5%, 10%, 20%, and 40% through the active sampling strategy of S2, and the amount of unlabeled data decreases accordingly.

TABLE 1 Number of targets per category in the DOTAv1.0 dataset

Category	Target quantity	Category	Target quantity
Plane	7908	Ship	27485
Baseball-diamond	407	Tennis-court	2280
Bridge	1881	Basketball-court	463
Ground-track-field	289	Storage-tank	4024
Small-vehicle	23480	Soccer-ball-field	206
Large-vehicle	16456	Roundabout	393
Harbor	5889	Swimming-pool	1709
Helicopter	626

When S1 is executed, each remote sensing image is randomly segmented into 1024 × 1024 small images with an overlap rate of 10%; remote sensing images whose height or width is smaller than 1024 are zero-padded, and the segmented images are used for training.
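The tiling arithmetic can be sketched as below. Note the assumption: the embodiment describes random segmentation, whereas this sketch uses a regular sliding grid with at least 10% overlap, which is a common alternative and is named here only for illustration. The function names are hypothetical.

```python
def tile_origins(length, tile=1024, overlap=0.10):
    """Start offsets of tiles along one axis so that the whole axis is covered."""
    if length <= tile:
        return [0]                      # single tile, zero-padded up to `tile`
    stride = int(tile * (1 - overlap))  # 921 pixels for the default values
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)       # last tile is flush with the border
    return origins


def tile_boxes(width, height, tile=1024, overlap=0.10):
    """Return (x, y, pad_right, pad_bottom) for every tile of an image."""
    pad_r = max(0, tile - width)        # zero padding needed on the right
    pad_b = max(0, tile - height)       # zero padding needed at the bottom
    return [(x, y, pad_r, pad_b)
            for y in tile_origins(height, tile, overlap)
            for x in tile_origins(width, tile, overlap)]
```

With this grid the final tile along each axis may overlap its neighbor by more than 10%, since it is shifted to end exactly at the image border.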

The experiments are based on the real remote sensing data of DOTAv1.0, which has 15 categories. Mean average precision (mAP) is selected as the evaluation metric; the higher the mAP, the higher the target detection accuracy.

The experiment was trained using 5%,10%,20% and 40% of the labeled samples in the training dataset. All models were tested on the test set using the conventional evaluation index mAP in object detection as an evaluation index, and the results are shown in Table 2.

TABLE 2 mAP indicators of different tag numbers for the method of this example compared with other methods

With 5%, 10%, 20%, and 40% of the samples labeled for training, the mAP of the method of this example is 9.18%, 7.49%, 6.34%, and 5.87% higher, respectively, than that of the Unbiased Teacher model.

FIG. 2(a) shows the target detection results of the method of this embodiment on the long-tail dataset, and FIG. 2(b) shows the target detection results of a prior-art method on the long-tail dataset. Compared with the prior-art method, the method of this embodiment detects more targets in the same remote sensing image while maintaining the accuracy of target detection.

The semi-supervised detection method for long-tail distributed targets in remote sensing images provided by the invention has been described in detail above. Specific examples are used herein to illustrate the principles and implementation of the invention, and the description of the above examples is only intended to help in understanding the method and its core ideas. Those skilled in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application; in summary, the content of this description should not be construed as limiting the invention.

In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A semi-supervised detection method for long-tail distributed targets in remote sensing images, characterized by comprising the following steps:

S2: Active sampling by the teacher model: unlabeled data are screened from the remote sensing image public data set, the teacher model actively samples the unlabeled data according to a set metric, the unlabeled data meeting the metric requirement are manually annotated to obtain new labeled data, the new labeled data are merged with the labeled data screened in S1, and the remaining unlabeled data in the remote sensing image public data set are predicted to generate pseudo labels, the prediction results serving as pseudo-label data;

S4: Pseudo-label screening: after the current training round of the student model is finished, the pseudo labels are screened according to the similarity between the category of the first detection frame, generated by the student model when predicting on the unlabeled data, and the category of the pseudo label, and the screened pseudo-label data participate in the next round of training;

2. The method for semi-supervised detection of long-tail distributed targets in remote sensing images according to claim 1, wherein the two-stage training strategy specifically comprises:

3. The method for semi-supervised detection of long tail distribution targets in remote sensing images according to claim 1, wherein the metric set in S2 includes a difficulty index, an information quantity index, and a diversity index, which are combined with the L1 norm to obtain the metric.

4. The method for semi-supervised detection of long tail distribution targets in remote sensing images according to claim 1, wherein the step S2 further comprises: setting a threshold; the teacher model predicts on the unlabeled data meeting the metric requirement to obtain prediction frames, and when the confidence of a prediction frame is higher than the threshold, the prediction frame is determined to be a positive sample and a pseudo label is generated.

5. The method for semi-supervised detection of long tail distribution targets in remote sensing images according to claim 1, wherein the step S2 further comprises: after the screened unlabeled data have been manually annotated, removing them from the long-tail data set and adding them to the labeled data of the remote sensing image public data set.

6. The method for semi-supervised detection of long tail distribution targets in remote sensing images according to claim 1, wherein the step S4 comprises:

7. The method for semi-supervised detection of long tail distribution targets in remote sensing images according to claim 6, wherein the step S5 comprises:

8. The method for semi-supervised detection of long tail distribution targets in remote sensing images according to claim 1, wherein the step S4 further comprises:

if yes, the pseudo tag data participate in the training of the next round;

9. The method for semi-supervised detection of long-tail distribution targets of remote sensing images according to claim 1, wherein the labeled data screened in the step S1 and the unlabeled data screened in the step S2 are small images obtained by randomly segmenting the remote sensing images at a set proportion, with a set overlap rate between the small images, and these small images are used in the training steps S1-S5.

10. The method for semi-supervised detection of long tail distribution targets of remote sensing images according to claim 1, wherein the reference target detector comprises a Faster R-CNN, SSD, or YOLO network.