CN116524258A - Landslide detection method and system based on multi-label classification - Google Patents
- Publication number: CN116524258A
- Application number: CN202310451861A
- Authority
- CN
- China
- Prior art keywords
- category
- swin
- recall
- score
- accuracy
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/764—Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06N3/08—Learning methods
- G06V10/267—Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06V10/82—Image or video recognition or understanding using neural networks
- G06V20/13—Satellite images
- Y02A50/00—Technologies for adaptation to climate change in human health protection, e.g. against extreme weather
Abstract
The invention relates to a landslide detection method and system based on multi-label classification. The method comprises the following steps: inputting the image into a Swin Transformer for feature extraction; inputting the feature map into an ML-decoder; and mapping the predicted values through a sigmoid function to obtain a probability value for each category and predict the dependency relationships among multiple labels. By exploiting the feature extraction capability of the Swin Transformer and the scalability of the ML-decoder, the invention shows excellent performance in multi-label classification tasks and provides powerful technical support for landslide detection in remote sensing images and for multi-label classification problems in other fields.
Description
Technical Field
The invention relates to the technical field of remote sensing, in particular to a landslide detection method and system based on multi-label classification.
Background
With the rapid development of remote sensing technology, landslide detection has become a key task in geological disaster monitoring. Compared with other data sources, optical remote sensing offers the highest resolution, down to the sub-meter scale, and can therefore provide more detailed surface information. Studies have shown that the reliability of detection results depends heavily on the quality of the dataset and the triggering factors under consideration. In addition, high-resolution satellite data can provide daily and even hourly updates, which offers important support for real-time monitoring and the handling of emergencies such as disasters. Therefore, optical remote sensing images are widely applied in fields such as geological exploration, environmental protection, and earthquake disaster monitoring.
Building on this, object detection and semantic segmentation have achieved high landslide detection accuracy. However, a practical problem is that annotating data for object detection and semantic segmentation requires substantial expertise and manpower, and related datasets are difficult to obtain. For this reason, researchers have turned to landslide detection based on image classification.
In the prior art, images are divided into landslide and non-landslide classes to detect landslides, achieving good landslide classification accuracy. Meanwhile, the prior art uses Grad-CAM to visualize the landslide category area, so that landslide detection based on image classification can indicate the specific position of a landslide. Numerous experiments in the prior art have also demonstrated the effectiveness of visualizing landslide areas through Grad-CAM and Score-CAM after classifying landslides with convolutional neural networks (CNNs). However, many related studies assume that an image is labeled with only one semantic category. This allows a macro-level description of the image content but ignores other information contained in the image, which is inconsistent with reality.
Multi-label classification (MLC) methods in the remote sensing field mainly fall into three categories: (1) converting the problem into multiple binary classification problems; (2) combining a convolutional neural network (CNN) with a recurrent neural network (RNN); (3) combining a CNN with a graph neural network (GNN). These methods belong to the MLC branch of the computer vision field. GNNs are a highly active research topic and have achieved good results in MLC tasks. For example, ML-GCN proposes an end-to-end multi-label image recognition framework that uses a GCN to map label representations to inter-dependent object classifiers. However, its word vectors are trained with the GloVe method, and the mismatch between the content feature space and the label space may affect model performance. To address this problem, BiLSTM may be a better solution: BiLSTM has proven effective in many natural language processing tasks, so researchers have extended it to multi-label remote sensing image classification. The prior art proposes a fusion model whose network structure comprises three parts: a CNN feature extraction module, a class attention learning layer, and a BiLSTM relation-dependency learning module. The model achieves good results.
However, most related studies combine CNNs with modules that learn dependencies between labels, such as classification heads, GNNs, and RNNs. Moreover, when BiLSTM is used to learn dependencies between classes, a class error can cause a chain reaction that degrades model performance. Convolutional neural networks (CNNs) have also had some success in multi-label classification tasks, but challenges and drawbacks remain, such as label imbalance, insufficient capture of long-range dependencies, and limitations in modeling correlations between labels.
Disclosure of Invention
In order to solve the problems, the application provides a landslide detection method and system based on multi-label classification, wherein the method is a multi-label classification method based on optical remote sensing images, and the data used are optical remote sensing data so as to realize more accurate landslide detection.
Landslide detection is defined herein as a multi-label classification (MLC) task that attempts to further mine information from optical remote sensing images. An MLC task predicts multiple labels for a single image.
The system is a new model, SwinML-Detect, which contains two main parts: a Swin Transformer as the feature extractor and an ML-decoder as the classification head. The Swin Transformer is responsible for capturing long-range dependencies and global context information, thereby effectively extracting class features from images. After extracting the feature sequences, the present application feeds them into the ML-decoder. The ML-decoder serves as a powerful classification head that can meet the challenges of large-scale multi-label classification tasks. It exploits the advantages of the Transformer decoder to efficiently learn the relevance between labels, and further captures correlations among input features through a self-attention mechanism. The ML-decoder is specially designed for multi-label classification tasks and achieves efficient computation while maintaining high performance.
The SwinML-Detect model of the present application aims to address the key challenges of the multi-label classification task by combining the Swin Transformer and the ML-decoder. This combination enables SwinML-Detect to take full advantage of the powerful feature extraction capability of the Swin Transformer, as well as the scalability and flexibility of the ML-decoder in handling large-scale multi-label classification problems. In addition, SwinML-Detect can better capture the relevance between labels, thereby achieving higher performance in multi-label classification tasks.
The technical scheme of the invention is as follows:
a landslide detection method based on multi-label classification comprises the following steps:
step (1): inputting the image into a Swin Transformer for feature extraction;
the model divides the image into 4×4 patches, each with 4×4×3 channels; the channel dimension is mapped to 96; the patches are fed into the Swin Transformer blocks;
step (2) inputting the feature map into an ML-decoder;
ML-decoder predicts the multi-label category; each category has a predictive value;
mapping the predicted value through a sigmoid function to obtain a probability value of each category, and predicting the dependency relationship among a plurality of labels;
for the prediction vector:
p = {p_1, p_2, ..., p_C}
where C is the number of categories, and the probability value of each category is calculated as P_i = sigmoid(p_i), where i = 1, 2, ..., C;
setting a threshold t; the threshold t is determined according to the actual problem and performance index; if P_i > t, class i is positive, otherwise class i is negative; further, the binary label vector is as follows:
B = {b_1, b_2, ..., b_C}
where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
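The sigmoid-and-threshold step above can be sketched in plain Python (an illustrative sketch, not the patent's actual implementation; the function names are hypothetical):

```python
import math

def sigmoid(x):
    # Map a raw prediction p_i to a probability P_i in (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def binarize(predictions, t=0.5):
    # Convert the prediction vector p = {p_1, ..., p_C} into the binary
    # label vector B = {b_1, ..., b_C}: b_i = 1 iff sigmoid(p_i) > t
    return [1 if sigmoid(p) > t else 0 for p in predictions]

# Example: three raw scores against threshold t = 0.5
print(binarize([2.0, -1.5, 0.1]))  # sigmoid(0.1) ≈ 0.525 > 0.5, so class 3 is positive
```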
Further, in step (1), the size of the input image is 224×224, the type is RGB, and the number of channels is 3.
Further, in step (1), the Swin Transformer consists of 4 stages, each consisting of several Swin Transformer blocks and one patch merging layer;
each Swin Transformer block comprises a multi-headed self-attention layer and a position-wise feed-forward network layer; at the end of each stage, a patch merging layer is added, and the feature size of the final stage output is 7×7×768.
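The stage geometry stated above (a 4×4 patch embedding to dimension 96, then a patch merging step per stage that halves the spatial size and doubles the channels, ending at 7×7×768) can be verified with a short calculation; the helper name is hypothetical:

```python
def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    # Stage 1 operates on the patch grid; each subsequent stage applies a
    # patch merging layer that halves the spatial side and doubles channels.
    side = image_size // patch_size  # 224 / 4 = 56
    dim = embed_dim
    shapes = [(side, side, dim)]
    for _ in range(num_stages - 1):
        side //= 2
        dim *= 2
        shapes.append((side, side, dim))
    return shapes

print(swin_stage_shapes())  # final stage: (7, 7, 768)
```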
Further, the construction of the data set is further included before the step (1): the method is constructed based on an original Bijie landslide data set and is used for landslide remote sensing multi-label classification.
Further, for data enhancement, random vertical flipping and random horizontal flipping are used to enhance the generalization capability of the model, with a flip probability of 0.5;
the optimizer uses Adam, the learning-rate schedule uses cosine decay, the initial learning rate lr = 0.001, the regularization coefficient wd = 0.001, and batch = 32; the loss function uses BCE loss.
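The cosine learning-rate decay can be sketched as follows (a minimal sketch assuming decay from lr = 0.001 to zero over the schedule; frameworks such as PyTorch provide built-in equivalents):

```python
import math

def cosine_lr(step, total_steps, lr_init=0.001, lr_min=0.0):
    # Cosine decay from lr_init at step 0 down to lr_min at total_steps
    cos = 0.5 * (1 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_init - lr_min) * cos

# Starts at the initial rate, decays smoothly, reaches the minimum at the end
print(cosine_lr(0, 100), cosine_lr(50, 100), cosine_lr(100, 100))
```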
Further, the step (3) further comprises model evaluation:
category-based accuracy, recall, and F1 score: for each class i, its true positives, false positives, and false negatives are first calculated; then, the accuracy, recall, and F1 score of category i are calculated;
calculating average accuracy, recall and F1 score of all categories;
based on example accuracy, recall, and F1 score:
for each example j, its true positives, false positives, and false negatives are first calculated; then, the accuracy, recall, and F1 score of example j are calculated.
The invention also relates to a landslide detection system based on multi-label classification, which comprises a collector and a processor; the collector collects landslide data; the processor comprises a data construction module, a feature extraction module and a relation dependent learning module;
the data construction module constructs a dataset based on the collected landslide data, and the feature extraction module divides the image into 4×4 patches, each with 4×4×3 channels; the channel dimension is mapped to 96; the patches are fed into the Swin Transformer blocks;
the relation dependence learning module decodes the image and predicts the multi-label category; each category has a predictive value; mapping the predicted value through a sigmoid function to obtain a probability value of each category, and predicting the dependency relationship among a plurality of labels;
for the prediction vector:
p = {p_1, p_2, ..., p_C}
where C is the number of categories, and the probability value of each category is calculated as P_i = sigmoid(p_i), where i = 1, 2, ..., C;
setting a threshold t; the threshold t is determined according to the actual problem and performance index; if P_i > t, class i is positive, otherwise class i is negative; further, the binary label vector is as follows:
B = {b_1, b_2, ..., b_C}
where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
Further, in the feature extraction module, the Swin Transformer consists of 4 stages, each consisting of several Swin Transformer blocks and one patch merging layer;
each Swin Transformer block comprises a multi-headed self-attention layer and a position-wise feed-forward network layer; at the end of each stage, a patch merging layer is added, and the feature size of the final stage output is 7×7×768.
Further, the method also comprises a model evaluation module for evaluating the model:
category-based accuracy, recall, and F1 score: for each class i, its true positives, false positives, and false negatives are first calculated; then, the accuracy, recall, and F1 score of category i are calculated;
calculating average accuracy, recall and F1 score of all categories;
based on example accuracy, recall, and F1 score:
for each example j, its true positives, false positives, and false negatives are first calculated; then, the accuracy, recall, and F1 score of example j are calculated.
The model provided by the invention is effective for landslide detection in optical remote sensing images, which makes it possible to provide decision support for disaster emergency rescue. SwinML-Detect can help researchers and decision makers identify landslide areas more quickly and accurately, providing powerful support for disaster prevention and mitigation. In addition, SwinML-Detect is not limited to landslide detection; it can also be applied to other remote sensing multi-label classification tasks in fields such as urban planning, agricultural monitoring, and environmental protection. In summary, the SwinML-Detect model of the present invention provides a novel and effective solution to the remote sensing multi-label classification task. By exploiting the feature extraction capability of the Swin Transformer and the scalability of the ML-decoder, SwinML-Detect shows excellent performance in multi-label classification tasks and provides powerful technical support for landslide detection in remote sensing images and for multi-label classification problems in other fields.
Drawings
FIG. 1 is a sample of the Bijie multi-label landslide dataset acquired in the method of the present embodiment;
fig. 2 is a block diagram of the system of the present embodiment;
FIG. 3 is a performance evaluation of ResNet50, ViT-T, Swin-T, and SwinML-Detect-T in terms of recall and F1 score for each class; wherein FIG. 3(a) is a performance evaluation of the frequent classes of the dataset; FIG. 3(b) is a performance evaluation of the rare classes (bare land, court, etc.);
FIG. 4 is a class activation map comparison of Swin-Tiny and SwinML-Detect for the landslide category; wherein (a)-(f) represent six different scenarios, namely 6 samples picked from the Bijie dataset.
FIG. 5 is a class activation map comparison of Swin-Tiny and SwinML-Detect for the road category; wherein (a)-(f) represent six different scenarios, namely 6 samples picked from the Bijie dataset.
Detailed Description
The technical solutions in this embodiment will be clearly and completely described in conjunction with the embodiment of the present invention, and it is obvious that the described embodiment is only a part of examples of the present invention, not all examples. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
The embodiment relates to a landslide detection method based on multi-label classification, which comprises the following steps:
swin transducer is a new type of visual transducer, and as a hierarchical transducer, the hierarchical structure can be expressed as:
H={H 1 ,H 2 ,...,H L } (1)
where H is the hierarchical structure of the entire network, H i The i-th layer is represented, and L represents the total number of layers.
It can be used as a general-purpose backbone network in the field of computer vision. Its main advantages are the ability to process images of different sizes with high accuracy and low computational cost. The network architecture of the Swin Transformer is based on a method called "shifted window":
SW(I, P) = {W_1, W_2, ..., W_N} (2)
where SW denotes the shifted-window method that divides an input image I into multiple sub-images, P denotes the partition size, W_i denotes the i-th sub-image, and N denotes the total number of sub-images.
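The window partition count can be illustrated with a toy function (hypothetical name) that enumerates the N non-overlapping P×P windows of a feature map; this sketches the regular (unshifted) partition, while the shifted variant offsets the same grid between consecutive blocks:

```python
def partition_windows(height, width, window):
    # Split an H x W feature map into non-overlapping P x P windows;
    # returns the (row, col) origin of each window W_1 ... W_N.
    assert height % window == 0 and width % window == 0
    return [(r, c)
            for r in range(0, height, window)
            for c in range(0, width, window)]

# A 56 x 56 patch grid with 7 x 7 windows yields N = 64 windows
windows = partition_windows(56, 56, 7)
print(len(windows))  # 64
```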
This method reduces the amount of computation while maintaining high accuracy. However, the disadvantage of the Swin Transformer is that it requires more computational resources and longer training time. The Swin Transformer concept is derived from the Vision Transformer (ViT), also a vision Transformer, a neural network based on the self-attention mechanism that can handle images of different sizes. The main advantage of ViT is high accuracy and low computational cost when processing images of different sizes; its disadvantage is that it requires more computing resources and longer training time. The network structure of the Swin Transformer adopts the shifted-window method, which divides an input image into multiple sub-images, processes each sub-image separately, and finally integrates the results of all sub-images. This effectively reduces the amount of computation while maintaining high accuracy.
ML-Decoder is an innovative classification head; its attention-based neural network design can effectively predict the presence of class labels. It makes better use of spatial data than other methods and excels in computational efficiency and cost. The ML-Decoder adopts a query-based network architecture and can generalize to unseen categories.
Q = {q_1, q_2, ..., q_K} (3)
where Q denotes the query set, q_i denotes the i-th query, and K denotes the total number of queries.
However, its disadvantage is that it requires more computing resources and training time. The development of the ML-decoder stems from the Vision Transformer (ViT). As an innovative vision Transformer, ViT is a neural network based on the self-attention mechanism that is capable of processing images of different sizes.
The attention mechanism may be expressed as
A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (4)
where A denotes the attention function, Q denotes the query, K denotes the key, V denotes the value, and d_k denotes the dimension of the key. This formula describes how the similarity between each query and key is calculated and used to weight the values.
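Scaled dot-product attention can be sketched in pure Python on toy inputs (an illustrative sketch; real implementations batch this computation with tensor libraries):

```python
import math

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors
    d_k = len(K[0])
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
              for q in Q]
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]  # numerically stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of the value rows
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# One query attending over two key/value pairs; the output lies between the values
print(attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]]))
```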
ViT has the advantage of high accuracy and low computational cost when processing images of different sizes, but it also requires more computational resources and training time. In the ML-decoder, the query-based network architecture can generalize to unseen categories. Its queries employ a novel attention mechanism for predicting the presence of class labels. These queries may be learnable or fixed, and different queries may be used during training and inference to achieve generalization to unseen categories.
The application provides a new model SwinML-Detect, which mainly comprises a feature extraction module and a relation dependence learning module.
First, an image is input into the Swin Transformer for feature extraction. The input image has a size of 224×224, RGB type, and 3 channels. The model divides the image into 4×4 patches, 56×56 patches in total, each with 4×4×3 channels. To match the multi-headed self-attention input, the channel dimension is mapped to 96. These patches are then fed into the Swin Transformer blocks.
The Swin Transformer consists of 4 stages, each consisting of several Swin Transformer blocks and one patch merging layer. Each Swin Transformer block comprises a multi-headed self-attention (MHA) layer and a position-wise feed-forward network (FFN) layer. At the end of each stage, the present application adds a patch merging layer to reduce the feature size and increase the channel count. This makes the structure of SwinML-Detect similar to ResNet50. The feature size of the last stage output is 7×7×768.
Next, the feature map is input into the relation-dependency learning module, i.e., the ML-decoder, which decodes the image features. In this process, the ML-decoder predicts the multi-label classes, with one predicted value per category. Finally, the predicted values are mapped through a sigmoid function to obtain the probability value of each category. Specifically, for the prediction vector
p = {p_1, p_2, ..., p_C} (5)
where C is the number of categories, the present application calculates the probability value of each category as P_i = sigmoid(p_i), where i = 1, 2, ..., C.
In order to determine the final predicted label vector, a threshold t is set in the present application. The threshold may be determined based on the actual problem and performance metrics; for example, cross-validation may be used to find the optimal threshold. The probability values are then converted to category labels according to the threshold: if P_i > t, class i is positive; otherwise, class i is negative. Thus, the application obtains a binary label vector
B = {b_1, b_2, ..., b_C} (6)
where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
By this method, the SwinML-Detect model can achieve multi-label classification with high accuracy and efficiency. The feature extraction module (namely the Swin Transformer) is combined with the relation dependency learning module (namely the ML-decoder), so that the model can effectively learn the features of the input image and predict the dependency relation among a plurality of labels.
As shown in fig. 1, this embodiment uses the Bijie multi-label landslide dataset, which is constructed based on the original Bijie landslide dataset and is designed specifically for landslide remote sensing multi-label classification. The original dataset contains 2773 RGB images in two categories, 770 landslide samples and 2003 non-landslide samples, with a spatial resolution of 0.8 m. According to the ecosystem types in Bijie and the specific ground-feature types in the images, 11 categories are set: bare land, buildings, courts, farmlands, grasses, greenhouses, landslide, road surfaces, semi-arid grasslands, trees, and water.
Table 1. Bijie landslide multi-label dataset categories
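A multi-label target for this dataset is a multi-hot vector over the 11 categories; a minimal sketch (the class-name strings below paraphrase the category list above and are illustrative):

```python
CLASSES = ["bare land", "building", "court", "farmland", "grass",
           "greenhouse", "landslide", "road", "semi-arid grassland",
           "tree", "water"]

def encode_labels(present):
    # Multi-hot vector b over the 11 categories: b_i = 1 if class i is present
    return [1 if c in present else 0 for c in CLASSES]

# An image containing a landslide, roads, and trees
print(encode_labels({"landslide", "road", "tree"}))
```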
In this embodiment, the evaluation indexes of the model include two types: category-based evaluation methods and example-based evaluation methods. In the multi-label classification task, an example-based evaluation method is calculated for each data sample. In this approach, the present application focuses on the predicted performance of each sample, and then calculates the average performance of all samples. The main advantage of this assessment method is that it can reflect the performance of the model at the overall sample level. Category-based evaluation methods are calculated for each category. In this approach, the present application focuses on the predicted performance of each category, and then calculates the average performance for all categories. The main advantage of this assessment method is that it can reflect the performance differences of the model between different classes, as well as the predictive power of the model on a certain class.
Category-based accuracy (P), recall (R), and F1 score (F1):
for each class i, the present application first calculates its true positives (TP_i), false positives (FP_i), and false negatives (FN_i). The accuracy, recall, and F1 score of class i can then be calculated:
accuracy of class i: P_i = TP_i / (TP_i + FP_i)
recall of class i: R_i = TP_i / (TP_i + FN_i)
F1 score of class i: F1_i = 2 · P_i · R_i / (P_i + R_i)
The average accuracy, recall, and F1 score for all categories may then be calculated.
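The category-based (macro-averaged) metrics can be sketched as follows (an illustrative helper with a hypothetical name; y_true and y_pred are lists of binary label vectors, one per sample):

```python
def per_class_prf(y_true, y_pred, num_classes):
    # Per-class precision, recall, F1 from binary label vectors,
    # plus the macro averages over all classes.
    stats = []
    for i in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[i] == 1 and p[i] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[i] == 0 and p[i] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[i] == 1 and p[i] == 0)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        stats.append((prec, rec, f1))
    macro = tuple(sum(s[k] for s in stats) / num_classes for k in range(3))
    return stats, macro

# Two samples, two classes: class 0 is missed once, class 1 is predicted perfectly
stats, macro = per_class_prf([[1, 0], [1, 1]], [[1, 0], [0, 1]], 2)
print(stats[0])  # class 0: precision 1.0, recall 0.5
```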
Example-based accuracy (P), recall (R), and F1 score (F1):
For each example (sample) j, its true positives (TP_j), false positives (FP_j), and false negatives (FN_j) are first calculated. The accuracy, recall, and F1 score for example j may then be calculated:

Accuracy of example j (P_j):

P_j = TP_j / (TP_j + FP_j)

Recall of example j (R_j):

R_j = TP_j / (TP_j + FN_j)

F1 score of example j (F1_j):

F1_j = 2 · P_j · R_j / (P_j + R_j)
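The category-based and example-based metrics described above can be sketched in plain Python (an illustrative computation from 0/1 label vectors, not the embodiment's code; function and variable names are assumptions):

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from raw counts (0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def category_based(y_true, y_pred):
    """Average P/R/F1 over categories; y_true/y_pred are lists of 0/1 vectors."""
    n_classes = len(y_true[0])
    scores = []
    for i in range(n_classes):
        # Count TP_i, FP_i, FN_i for category i across all samples
        tp = sum(t[i] and p[i] for t, p in zip(y_true, y_pred))
        fp = sum((not t[i]) and p[i] for t, p in zip(y_true, y_pred))
        fn = sum(t[i] and (not p[i]) for t, p in zip(y_true, y_pred))
        scores.append(prf1(tp, fp, fn))
    n = len(scores)
    return tuple(sum(s[k] for s in scores) / n for k in range(3))

def example_based(y_true, y_pred):
    """Average P/R/F1 over samples."""
    scores = []
    for t, p_vec in zip(y_true, y_pred):
        # Count TP_j, FP_j, FN_j within one sample's label vector
        tp = sum(a and b for a, b in zip(t, p_vec))
        fp = sum((not a) and b for a, b in zip(t, p_vec))
        fn = sum(a and (not b) for a, b in zip(t, p_vec))
        scores.append(prf1(tp, fp, fn))
    n = len(scores)
    return tuple(sum(s[k] for s in scores) / n for k in range(3))
```

The two averages differ: the category-based form exposes per-class weaknesses (e.g., rare categories), while the example-based form reflects overall per-sample quality.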
As shown in fig. 2, the present embodiment further relates to a landslide detection system based on multi-label classification, which includes a collector and a processor; the collector collects landslide data; the processor comprises a data construction module, a feature extraction module and a relation dependent learning module.
For comparison with the original paper, the dataset is randomly split into training and test sets at a ratio of 2:1. Test-time augmentation is not used to boost performance; augmentation is applied only during training to strengthen the generalization ability of the model. The proposed method is implemented in PyTorch. The images of all datasets are resized to 224×224. For data augmentation, random vertical flipping and random horizontal flipping are used to enhance the generalization ability of the model, each applied with probability 0.5. The optimizer is Adam, the learning-rate schedule is cosine decay, the initial learning rate lr=0.001, the regularization coefficient wd=0.001, and the batch size is 32. The loss function is BCELoss. All experiments were performed on an NVIDIA GeForce RTX 3080 Ti GPU.
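The cosine learning-rate decay used above can be sketched as follows (a minimal illustration, not the patent's implementation; the epoch-wise form without warmup or restarts is an assumption):

```python
import math

def cosine_lr(epoch, total_epochs, lr0=0.001):
    """Cosine decay: starts at lr0 (the embodiment's initial lr) and falls to 0 at the final epoch."""
    return lr0 * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))
```

In PyTorch this corresponds to pairing `torch.optim.Adam(params, lr=0.001, weight_decay=0.001)` with `torch.optim.lr_scheduler.CosineAnnealingLR` (eta_min=0); the formula above is the schedule that scheduler implements without warm restarts.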
The processor may be a general-purpose processor, including a central processing unit, a network processor, etc.; but also digital signal processors, application specific integrated circuits, field programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
Optionally, the embodiment of the present application further provides a storage medium in which instructions are stored; when the instructions are executed on a computer, they cause the computer to perform the method of the foregoing embodiment.
Optionally, the embodiment of the present application further provides a chip for executing the instruction, where the chip is used to perform the method of the foregoing embodiment.
The present application also provides a program product, which comprises a computer program stored in a storage medium, from which at least one processor can read the computer program, and the method of the above embodiment can be implemented when the at least one processor executes the computer program.
Experimental results
The proposed SwinML-Detect model was compared with conventional convolutional neural networks and Vision Transformer-based models. These are widely used models for remote sensing landslide detection, or serve as baselines for evaluating various Vision Transformer structures.
Table 2 shows the performance of the proposed model on unseen (test) data. To verify the performance of the self-attention model, the present application evaluates the model on data entirely external to the training phase. The test set is evaluated using the accuracy criteria described above, computed for each category.
For fairness, Swin-Tiny, ViT-Tiny, and Resnet50 were used for comparison. It can be seen that SwinML-Detect achieves the strongest performance.
Resnet50 has lower classification performance (CP, CR, and CF1) but performs better in terms of example-based accuracy (EP and ER). Its parameter count is small (23.53M), meaning the model is relatively compact and efficient. Its weaker classification performance is likely because Resnet50 is better suited to single-label tasks, and its performance is limited on multi-label tasks. ViT-T shows a large improvement in classification performance over Resnet50, but only a slight increase in accuracy. Its parameter count rises substantially to 55.26M, making the model more complex and computationally demanding. As a Transformer-based vision model, ViT-T captures global features better and therefore performs better on multi-label classification tasks. Compared with both, Swin-T shows clearly improved classification performance as well as improved accuracy. Its parameter count is 27.51M, slightly higher than Resnet50 but much lower than ViT-T. Swin-T introduces a hierarchical structure that makes the model better at capturing image features, thereby achieving better performance on multi-label classification tasks.
The Resnet50+ML decoder adds a multi-label decoder on top of Resnet50, which greatly improves classification performance and slightly improves accuracy.
Its parameter count increases to 30.61M, indicating that the decoder brings additional computation. This improvement highlights the importance of the multi-label decoder in handling multi-label tasks. Compared with ViT-T, the ViT+ML decoder shows clearly improved classification performance as well as improved accuracy. Its parameter count increases to 61.38M, making the model more complex. The ViT model combined with the multi-label decoder performs better in the multi-label classification task.
SwinML-Detect has the highest classification performance and accuracy among all models. Its parameter count is 33.63M, relatively low compared with the other two decoder-equipped models. This further demonstrates that the efficient combination of the Swin architecture with a multi-label decoder achieves excellent performance in multi-label classification tasks.
The recall and F1 scores for each class are shown in fig. 3. For the more frequent categories, such as buildings, farmland, trees, and landslides, there is no substantial difference between SwinML-Detect and the traditional convolutional and self-attention-based Vision Transformer models. However, fig. 3 (b) shows the performance indicators for categories containing unusual samples (rare events), such as bare land, grass, and water. In fig. 3 (b), the SwinML-Detect method overall outperforms the other candidates on rare categories. This shows that the proposed method effectively improves the detection of rare categories while maintaining good performance on high-frequency categories.
TABLE 2 Performance comparison of Resnet50, ViT-T, Swin-T, ML-Resnet50, ML-ViT and SwinML-Detect-T
FIGS. 3 (a) and 3 (b) are performance evaluations of Resnet50, ViT-T, Swin-T and SwinML-Detect-T in terms of recall and F1 score for each class.
SwinML-Detect can be seen to have the best performance among all models, with the highest precision (P), recall (R) and F1-score. The behavior of the individual models is now analyzed:
single+resnet50: this is a Resnet50 model based on single tag classification. Although its accuracy is relatively high, the recall and F1-score are relatively low, probably because in multi-label classification tasks, the single-label classification method does not adequately capture correlations between samples.
Single+resnet50+dem: this model adds DEM (Deep Embedding Model) on the basis of single+Resnet 50. DEM can boost model recall and F1-score because it can learn more complex relationships between samples.
Multi+Resnet50: a Resnet50 model based on multi-label classification. Multi-label classification captures the correlations between labels better than single-label classification, thus improving both recall and F1-score.
Multi+ViT-T: this model uses the Vision Transformer (ViT) as its basic architecture. Although ViT performs strongly on image classification tasks, its performance here is not much different from Multi+Resnet50, probably because neither architecture can fully exploit the higher-order features in a multi-label classification task.
Multi+Swin-T: this model uses the Swin Transformer as its basic architecture. The Swin Transformer has stronger local perception than ViT and therefore performs better in the multi-label classification task, with higher recall and F1-score.
SwinML-Detect: this model combines a Swin Transformer with a multi-label decoder; it has strong local perception and handles the correlations between labels well. It therefore performs best among all models, with the highest accuracy, recall and F1-score.
In summary, the SwinML-Detect model performs best, mainly because it combines the strong local perception of the Swin Transformer with the label-correlation modeling of the multi-label decoder. The other models either use only a single-label classification method and cannot fully capture the correlations between labels, or use Resnet50 or ViT as the basic architecture, whose local perception is relatively weak, resulting in limited performance.
Single-label classification methods (e.g., Single+Resnet50 and Single+Resnet50+DEM) are limited in the multi-label classification task because they do not adequately capture the correlations between labels. Using multi-label classification methods and introducing a multi-label decoder (e.g., Multi+Resnet50 and Multi+Resnet50+ML decoder) enhances performance in multi-label classification tasks, but these models still suffer from the limitations of their basic architectures. The Swin Transformer architecture (e.g., Multi+Swin-T and SwinML-Detect) performs better in the multi-label classification task because of its stronger local perception. The SwinML-Detect model combines the local perception of the Swin Transformer with the multi-label decoder's handling of label correlations, and therefore performs best among all models, with the highest accuracy, recall and F1-score.
TABLE 3 Performance evaluation of Resnet50, ViT-T, Swin-T and SwinML-Detect-T on the landslide category
FIGS. 4 and 5 show Grad-CAM (Gradient-weighted Class Activation Mapping) visualizations for the landslide and road-surface categories, respectively. FIGS. 4 (a)-(f) and 5 (a)-(f) show 6 samples from the Bijie landslide dataset.
As shown in figs. 4 and 5, the class activation maps of several important categories (e.g., landslide and road surface) are compared. It can be seen that the method of this embodiment identifies landslides more accurately, with a relatively concentrated attention region, and it also shows encouraging performance on loess landslides, which are difficult to identify.
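The Grad-CAM computation underlying figs. 4 and 5 can be sketched in plain Python (an illustrative reimplementation under assumptions: per-channel H×W activations and gradients are supplied as nested lists; the actual method operates on network tensors):

```python
def grad_cam(activations, gradients):
    """Grad-CAM: weight each feature map by the mean of its gradients, sum, then ReLU."""
    # activations, gradients: K feature maps of the target layer, each an HxW list of lists
    cam = None
    for act, grad in zip(activations, gradients):
        h, w = len(grad), len(grad[0])
        alpha = sum(sum(row) for row in grad) / (h * w)   # channel importance weight
        if cam is None:
            cam = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                cam[y][x] += alpha * act[y][x]
    return [[max(0.0, v) for v in row] for row in cam]    # ReLU keeps positive evidence
```

The resulting low-resolution map is upsampled to the input size and overlaid on the image, highlighting the regions (e.g., the landslide scar) the model attends to.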
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (9)
1. A landslide detection method based on multi-label classification, characterized by comprising the following steps:
step (1): inputting the image into a Swin Transformer for feature extraction;
the model divides the image into patches of size 4×4, each patch containing 4×4×3 values; the channel dimension is mapped to 96; the patches are fed into Swin Transformer blocks;
step (2) inputting the feature map into an ML-decoder;
ML-decoder predicts the multi-label category; each category has a predictive value;
mapping the predicted value through a sigmoid function to obtain a probability value of each category, and predicting the dependency relationship among a plurality of labels;
for the prediction vector:

p = {p_1, p_2, ..., p_C}

where C is the number of categories, the probability value of each category is calculated as P_i = sigmoid(p_i), where i = 1, 2, ..., C;

setting a threshold t, which is determined according to the actual problem and performance index; if P_i > t, category i is positive; otherwise, category i is negative; the binary label vector is then:

B = {b_1, b_2, ..., b_C}

where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
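The sigmoid-and-threshold mapping of claim 1 can be illustrated with a short sketch (the default threshold value 0.5 is an assumption; the claim leaves t problem-dependent):

```python
import math

def predict_labels(logits, t=0.5):
    """Map raw per-category scores to probabilities via sigmoid, then threshold to a 0/1 label vector."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]  # P_i = sigmoid(p_i)
    return [1 if p > t else 0 for p in probs]             # b_i = 1 iff P_i > t
```

For example, logits [2.0, -3.0, 0.1] with t = 0.5 yield the binary label vector [1, 0, 1].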
2. The method according to claim 1, characterized in that: in step (1), the size of the input image is 224×224, the type is RGB, and the number of channels is 3.
3. The method according to claim 1, characterized in that: in step (1), the Swin Transformer consists of 4 stages, each consisting of several Swin Transformer blocks and a patch merging layer;
each Swin Transformer block comprises a multi-headed self-attention layer and a position-wise feed-forward network layer; at the end of each stage, a patch merging layer is added, and the feature size output by the final stage is 7×7×768.
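The feature sizes stated in claims 2-3 can be checked with a short sketch, assuming the standard Swin-T layout (4×4 patch embedding to 96 channels, then 2× spatial downsampling and 2× channel widening at each patch merging); the function name is hypothetical:

```python
def swin_stage_shapes(img_size=224, patch=4, dim0=96, stages=4):
    """Spatial size and channel dim after each of the 4 Swin stages."""
    shapes = []
    size, dim = img_size // patch, dim0   # 224/4 = 56 tokens per side, 96 channels after embedding
    for s in range(stages):
        shapes.append((size, size, dim))
        if s < stages - 1:                # patch merging between stages: halve size, double channels
            size //= 2
            dim *= 2
    return shapes
```

This reproduces the claimed final feature map of 7×7×768 from a 224×224 RGB input.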
4. The method according to claim 1, characterized in that: the method further comprises, before step (1), constructing a data set: the data set is constructed based on the original Bijie landslide data set and is used for landslide remote sensing multi-label classification.
5. The method according to claim 4, wherein: for data augmentation, random vertical flipping and random horizontal flipping are used to enhance the generalization capability of the model, each applied with probability 0.5;
the optimizer uses Adam, the learning-rate schedule uses cosine decay, the initial learning rate lr=0.001, the regularization coefficient wd=0.001, and batch=32; the loss function uses BCELoss.
6. The method according to claim 1, characterized in that: the step (3) further comprises the step of evaluating a model:
category-based accuracy, recall, and F1 score: for each category i, its true positives, false positives, and false negatives are first calculated; then, the accuracy, recall, and F1 score of category i are calculated;
calculating average accuracy, recall and F1 score of all categories;
example-based accuracy, recall, and F1 score:
for each example j, its true positives, false positives, and false negatives are first calculated; then, the accuracy, recall, and F1 score of example j are calculated.
7. A landslide detection system based on multi-label classification, characterized in that it comprises a collector and a processor; the collector collects landslide data; the processor comprises a data construction module, a feature extraction module and a relation-dependent learning module;
a data set is constructed based on the collected landslide data, and the feature extraction module divides the image into patches of size 4×4, each patch containing 4×4×3 values; the channel dimension is mapped to 96; the patches are fed into Swin Transformer blocks;
the relation dependence learning module decodes the image and predicts the multi-label category; each category has a predictive value; mapping the predicted value through a sigmoid function to obtain a probability value of each category, and predicting the dependency relationship among a plurality of labels;
for the prediction vector:

p = {p_1, p_2, ..., p_C}

where C is the number of categories, the probability value of each category is calculated as P_i = sigmoid(p_i), where i = 1, 2, ..., C;

setting a threshold t, which is determined according to the actual problem and performance index; if P_i > t, category i is positive; otherwise, category i is negative; the binary label vector is then:

B = {b_1, b_2, ..., b_C}

where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
8. The system according to claim 7, wherein: in the feature extraction module, the Swin Transformer consists of 4 stages, each consisting of several Swin Transformer blocks and a patch merging layer;
each Swin Transformer block comprises a multi-headed self-attention layer and a position-wise feed-forward network layer; at the end of each stage, a patch merging layer is added, and the feature size output by the final stage is 7×7×768.
9. The system according to claim 7, wherein: the system also comprises a model evaluation module, wherein the model evaluation module is used for evaluating:
category-based accuracy, recall, and F1 score: for each category i, its true positives, false positives, and false negatives are first calculated; then, the accuracy, recall, and F1 score of category i are calculated;
calculating average accuracy, recall and F1 score of all categories;
example-based accuracy, recall, and F1 score:
for each example j, its true positives, false positives, and false negatives are first calculated; then, the accuracy, recall, and F1 score of example j are calculated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310451861.0A CN116524258A (en) | 2023-04-25 | 2023-04-25 | Landslide detection method and system based on multi-label classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310451861.0A CN116524258A (en) | 2023-04-25 | 2023-04-25 | Landslide detection method and system based on multi-label classification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116524258A true CN116524258A (en) | 2023-08-01 |
Family
ID=87407648
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310451861.0A Pending CN116524258A (en) | 2023-04-25 | 2023-04-25 | Landslide detection method and system based on multi-label classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116524258A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117274823A (en) * | 2023-11-21 | 2023-12-22 | 成都理工大学 | Visual transducer landslide identification method based on DEM feature enhancement |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111343730A (en) * | 2020-04-15 | 2020-06-26 | 上海交通大学 | Large-scale MIMO passive random access method under space correlation channel |
CN113837154A (en) * | 2021-11-25 | 2021-12-24 | 之江实验室 | Open set filtering system and method based on multitask assistance |
CN114842351A (en) * | 2022-04-11 | 2022-08-02 | 中国人民解放军战略支援部队航天工程大学 | Remote sensing image semantic change detection method based on twin transforms |
CN114937202A (en) * | 2022-04-11 | 2022-08-23 | 青岛理工大学 | Double-current Swin transform remote sensing scene classification method |
CN115019123A (en) * | 2022-05-20 | 2022-09-06 | 中南大学 | Self-distillation contrast learning method for remote sensing image scene classification |
CN115424059A (en) * | 2022-08-24 | 2022-12-02 | 珠江水利委员会珠江水利科学研究院 | Remote sensing land use classification method based on pixel level comparison learning |
CN115588217A (en) * | 2022-06-23 | 2023-01-10 | 西安电子科技大学 | Face attribute detection method based on deep self-attention network |
CN115601584A (en) * | 2022-09-14 | 2023-01-13 | 北京联合大学(Cn) | Remote sensing scene image multi-label classification method and device and storage medium |
CN115908946A (en) * | 2022-12-21 | 2023-04-04 | 南京信息工程大学 | Land use classification method based on multiple attention semantic segmentation |
-
2023
- 2023-04-25 CN CN202310451861.0A patent/CN116524258A/en active Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111343730A (en) * | 2020-04-15 | 2020-06-26 | 上海交通大学 | Large-scale MIMO passive random access method under space correlation channel |
CN113837154A (en) * | 2021-11-25 | 2021-12-24 | 之江实验室 | Open set filtering system and method based on multitask assistance |
CN114842351A (en) * | 2022-04-11 | 2022-08-02 | 中国人民解放军战略支援部队航天工程大学 | Remote sensing image semantic change detection method based on twin transforms |
CN114937202A (en) * | 2022-04-11 | 2022-08-23 | 青岛理工大学 | Double-current Swin transform remote sensing scene classification method |
CN115019123A (en) * | 2022-05-20 | 2022-09-06 | 中南大学 | Self-distillation contrast learning method for remote sensing image scene classification |
CN115588217A (en) * | 2022-06-23 | 2023-01-10 | 西安电子科技大学 | Face attribute detection method based on deep self-attention network |
CN115424059A (en) * | 2022-08-24 | 2022-12-02 | 珠江水利委员会珠江水利科学研究院 | Remote sensing land use classification method based on pixel level comparison learning |
CN115601584A (en) * | 2022-09-14 | 2023-01-13 | 北京联合大学(Cn) | Remote sensing scene image multi-label classification method and device and storage medium |
CN115908946A (en) * | 2022-12-21 | 2023-04-04 | 南京信息工程大学 | Land use classification method based on multiple attention semantic segmentation |
Non-Patent Citations (2)
Title |
---|
TAL RIDNIK et al.: "ML-Decoder: Scalable and Versatile Classification Head", ARXIV, pages 1 - 14 *
ZE LIU et al.: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ARXIV, pages 1 - 14 *
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117274823A (en) * | 2023-11-21 | 2023-12-22 | 成都理工大学 | Visual transducer landslide identification method based on DEM feature enhancement |
CN117274823B (en) * | 2023-11-21 | 2024-01-26 | 成都理工大学 | Visual transducer landslide identification method based on DEM feature enhancement |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114067160B (en) | Small sample remote sensing image scene classification method based on embedded smooth graph neural network | |
CN107885764B (en) | Rapid Hash vehicle retrieval method based on multitask deep learning | |
CN111259786B (en) | Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video | |
CN113780149B (en) | Remote sensing image building target efficient extraction method based on attention mechanism | |
CN112668494A (en) | Small sample change detection method based on multi-scale feature extraction | |
Xia et al. | A deep Siamese postclassification fusion network for semantic change detection | |
CN114926746A (en) | SAR image change detection method based on multi-scale differential feature attention mechanism | |
Li et al. | A review of deep learning methods for pixel-level crack detection | |
Liu et al. | Survey of road extraction methods in remote sensing images based on deep learning | |
Zhao et al. | Mine diversified contents of multispectral cloud images along with geographical information for multilabel classification | |
CN116524258A (en) | Landslide detection method and system based on multi-label classification | |
CN115830379A (en) | Zero-sample building image classification method based on double-attention machine system | |
CN116524189A (en) | High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization | |
CN115457332A (en) | Image multi-label classification method based on graph convolution neural network and class activation mapping | |
CN111898418A (en) | Human body abnormal behavior detection method based on T-TINY-YOLO network | |
CN115239765A (en) | Infrared image target tracking system and method based on multi-scale deformable attention | |
López-Cifuentes et al. | Attention-based knowledge distillation in scene recognition: the impact of a dct-driven loss | |
Cao et al. | Face detection for rail transit passengers based on single shot detector and active learning | |
CN117809198A (en) | Remote sensing image significance detection method based on multi-scale feature aggregation network | |
Luo et al. | Infrared Road Object Detection Based on Improved YOLOv8. | |
Zhang et al. | Scale-wised feature enhancement network for change captioning of remote sensing images | |
CN115331254A (en) | Anchor frame-free example portrait semantic analysis method | |
CN115063831A (en) | High-performance pedestrian retrieval and re-identification method and device | |
Li et al. | Research on efficient detection network method for remote sensing images based on self attention mechanism | |
CN118470608B (en) | Weak supervision video anomaly detection method and system based on feature enhancement and fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||