CN116524258A - Landslide detection method and system based on multi-label classification - Google Patents

Landslide detection method and system based on multi-label classification

Info

Publication number
CN116524258A
CN116524258A (Application CN202310451861.0A)
Authority
CN
China
Prior art keywords
category
swin
recall
score
accuracy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310451861.0A
Other languages
Chinese (zh)
Inventor
辛志慧
李永鑫
袁梦婷
邹蕊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan Normal University
Original Assignee
Yunnan Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan Normal University filed Critical Yunnan Normal University
Priority to CN202310451861.0A priority Critical patent/CN116524258A/en
Publication of CN116524258A publication Critical patent/CN116524258A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A50/00TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE in human health protection, e.g. against extreme weather

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Remote Sensing (AREA)
  • Astronomy & Astrophysics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a landslide detection method and system based on multi-label classification. The method comprises the following steps: inputting the image into a Swin Transformer for feature extraction; inputting the feature map into an ML-Decoder; and mapping the predicted values through a sigmoid function to obtain a probability value for each category, thereby predicting the dependency relationships among multiple labels. By exploiting the feature extraction capability of the Swin Transformer and the scalability of the ML-Decoder, the invention shows excellent performance on multi-label classification tasks and provides strong technical support for landslide detection in remote sensing images and for multi-label classification problems in other fields.

Description

Landslide detection method and system based on multi-label classification
Technical Field
The invention relates to the technical field of remote sensing, in particular to a landslide detection method and system based on multi-label classification.
Background
With the rapid development of remote sensing technology, landslide detection has become a key task in geological disaster monitoring. Compared with other data sources, optical remote sensing offers the highest resolution, at the sub-meter scale, and can therefore provide more detailed surface information. Studies have shown that the reliability of detection results depends heavily on the quality of the data set and the triggering factors considered. In addition, high-resolution satellite data can provide daily or even hourly updates, which offers important support for real-time monitoring and for handling emergencies such as disasters. Optical remote sensing images are therefore widely applied in fields such as geological exploration, environmental protection, and earthquake disaster monitoring.
On this basis, object detection and semantic segmentation have achieved high landslide detection accuracy. A practical problem, however, is that annotating data for object detection and semantic segmentation requires substantial expertise and manpower, and suitable data sets are difficult to obtain. For this reason, researchers have turned to landslide detection based on image classification.
In the prior art, images are divided into landslide and non-landslide classes and landslides are detected, yielding good landslide classification accuracy. Meanwhile, the prior art uses Grad-CAM to visualize the landslide-category region, so that landslide detection based on image classification can indicate the specific location of a landslide. Numerous prior-art experiments likewise demonstrate the effectiveness of visualizing landslide regions with Grad-CAM and Score-CAM after classifying landslides with convolutional neural networks (CNNs). However, many related studies assume that an image is labeled with only one semantic category. This permits a macro-level description of the image content but ignores the other information the image contains, which is inconsistent with reality.
Multi-label classification (MLC) methods in the remote sensing field mainly fall into three major categories: (1) transforming the problem into multiple binary classification problems; (2) combining a convolutional neural network (CNN) with a recurrent neural network (RNN); (3) combining a CNN with a graph neural network (GNN). These methods belong to the MLC branch of computer vision. GNNs are a very active research topic and have achieved good results on MLC tasks. For example, ML-GCN proposes an end-to-end multi-label image recognition framework that uses a GCN to map label representations to inter-dependent object classifiers. However, its word vectors are trained with the GloVe method, and the mismatch between the content feature space and the label space may degrade model performance. To address this problem, BiLSTM may be a better solution. BiLSTM has been shown to work well on many natural language processing tasks, so researchers have extended it to multi-label remote sensing image classification. The prior art proposes a fusion model whose network structure comprises three parts: a CNN feature extraction module, a class attention learning layer, and a BiLSTM relation-dependency learning module. This model achieves good results.
However, most related studies combine CNNs with modules that learn the dependencies between labels, such as classification heads, GNNs, and RNNs. Moreover, when BiLSTM is used to learn the dependencies between classes, a classification error can trigger a chain reaction that degrades model performance. Convolutional neural networks (CNNs) have also had some success on multi-label classification tasks, but challenges and drawbacks remain, such as label imbalance, insufficient capture of long-range dependencies, and limitations in modeling the correlation between labels.
Disclosure of Invention
In order to solve the problems, the application provides a landslide detection method and system based on multi-label classification, wherein the method is a multi-label classification method based on optical remote sensing images, and the data used are optical remote sensing data so as to realize more accurate landslide detection.
Landslide detection is defined herein as a multi-label classification (MLC) task that attempts to mine further information from optical remote sensing images. An MLC task predicts multiple labels for a single image.
The system is a new model, SwinML-Detect. SwinML-Detect contains two main parts: a Swin Transformer as the feature extractor and an ML-Decoder as the classification head. The Swin Transformer is responsible for capturing long-range dependencies and global context information, thereby effectively extracting class features from the image. After extracting the feature sequences, the present application feeds them into the ML-Decoder. The ML-Decoder serves as a powerful classification head that can meet the challenges of large-scale multi-label classification tasks. It exploits the Transformer decoder to efficiently learn the relevance between labels and further captures the correlations between input features through an attention mechanism. The ML-Decoder is designed specifically for multi-label classification tasks and achieves efficient computation while maintaining high performance.
The SwinML-Detect model of the present application aims to address the key challenges of the multi-label classification task by combining a Swin Transformer and an ML-Decoder. This combination enables SwinML-Detect to take full advantage of the powerful feature extraction capability of the Swin Transformer as well as the scalability and flexibility of the ML-Decoder in handling the large-scale multi-label classification problem. In addition, SwinML-Detect can better capture the relevance between labels, thereby achieving higher performance on multi-label classification tasks.
The technical scheme of the invention is as follows:
a landslide detection method based on multi-label classification comprises the following steps:
step (1): inputting the image into a Swin Transformer for feature extraction;
the model divides the image into patches of size 4×4, each carrying 4×4×3 values; the channel dimension is mapped to 96; the patches are fed into Swin Transformer blocks;
step (2): inputting the feature map into an ML-Decoder;
the ML-Decoder predicts the multi-label categories; each category has one predicted value;
step (3): mapping the predicted values through a sigmoid function to obtain a probability value for each category, and predicting the dependency relationships among multiple labels;
for the prediction vector
p = {p_1, p_2, ..., p_C}
where C is the number of categories, the probability value of each category is calculated as P_i = sigmoid(p_i), where i = 1, 2, ..., C;
setting a threshold t, determined according to the actual problem and performance indicators: if P_i > t, category i is positive; otherwise category i is negative. Further, the binary label vector is:
B = {b_1, b_2, ..., b_C}
where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
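As an illustrative sketch (not part of the claimed method; the function names are ours), the sigmoid mapping and thresholding described above can be written in Python as:

```python
import math

def sigmoid(x):
    # Map a raw prediction p_i (a logit) to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def binarize(logits, t=0.5):
    # Convert per-category predictions p = {p_1, ..., p_C} into the
    # binary label vector B = {b_1, ..., b_C} described above:
    # b_i = 1 if sigmoid(p_i) > t, else b_i = 0.
    return [1 if sigmoid(p) > t else 0 for p in logits]

logits = [2.0, -1.5, 0.1]       # hypothetical predictions for C = 3 categories
print(binarize(logits, t=0.5))  # -> [1, 0, 1]
```

In practice t need not be 0.5; as stated above, it is chosen according to the actual problem and performance indicators.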
Further, in step (1), the size of the input image is 224×224, the type is RGB, and the number of channels is 3.
Further, in step (1), the Swin Transformer consists of 4 stages, each consisting of several Swin Transformer blocks and a patch merging layer;
each Swin Transformer block comprises a multi-head self-attention layer and a position-wise feed-forward network layer; at the end of each stage a patch merging layer is added, and the feature size output by the final stage is 7×7×768.
Further, a data set is constructed before step (1): it is built on the original Bijie landslide data set and used for landslide remote sensing multi-label classification.
Further, for data augmentation, random vertical flipping and random horizontal flipping are used to enhance the generalization ability of the model, each applied with probability 0.5;
the optimizer is Adam, the learning-rate schedule is cosine decay, the initial learning rate lr = 0.001, the regularization coefficient wd = 0.001, and the batch size is 32; the loss function is BCELoss.
Further, step (3) also comprises model evaluation:
category-based precision, recall, and F1 score: for each category i, its true positives, false positives, and false negatives are first counted; the precision, recall, and F1 score of category i are then calculated;
the average precision, recall, and F1 score over all categories are calculated;
example-based precision, recall, and F1 score:
for each example j, its true positives, false positives, and false negatives are first counted; the precision, recall, and F1 score of example j are then calculated.
The invention also relates to a landslide detection system based on multi-label classification, which comprises a collector and a processor; the collector collects landslide data; the processor comprises a data construction module, a feature extraction module and a relation dependent learning module;
a data set is constructed from the collected landslide data, and the feature extraction module divides the image into 4×4 patches, each carrying 4×4×3 values; the channel dimension is mapped to 96; the patches are fed into Swin Transformer blocks;
the relation-dependency learning module decodes the image features and predicts the multi-label categories; each category has one predicted value; the predicted values are mapped through a sigmoid function to obtain a probability value for each category, and the dependency relationships among multiple labels are predicted;
for the prediction vector
p = {p_1, p_2, ..., p_C}
where C is the number of categories, the probability value of each category is calculated as P_i = sigmoid(p_i), where i = 1, 2, ..., C;
setting a threshold t, determined according to the actual problem and performance indicators: if P_i > t, category i is positive; otherwise category i is negative. Further, the binary label vector is:
B = {b_1, b_2, ..., b_C}
where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
Further, in the feature extraction module, the Swin Transformer consists of 4 stages, each consisting of several Swin Transformer blocks and a patch merging layer;
each Swin Transformer block comprises a multi-head self-attention layer and a position-wise feed-forward network layer; at the end of each stage a patch merging layer is added, and the feature size output by the final stage is 7×7×768.
Further, a model evaluation module is included for evaluating the model:
category-based precision, recall, and F1 score: for each category i, its true positives, false positives, and false negatives are first counted; the precision, recall, and F1 score of category i are then calculated;
the average precision, recall, and F1 score over all categories are calculated;
example-based precision, recall, and F1 score:
for each example j, its true positives, false positives, and false negatives are first counted; the precision, recall, and F1 score of example j are then calculated.
The model provided by the invention is effective for landslide detection in optical remote sensing images, and can therefore provide decision support for disaster emergency rescue. SwinML-Detect can help researchers and decision makers identify landslide areas more quickly and accurately, providing strong support for disaster prevention and mitigation. In addition, SwinML-Detect is not limited to landslide detection; it can be generalized to other remote sensing multi-label classification tasks in fields such as urban planning, agricultural monitoring, and environmental protection. In summary, the SwinML-Detect model of the present invention provides a novel and effective solution to the remote sensing multi-label classification task. By exploiting the feature extraction capability of the Swin Transformer and the scalability of the ML-Decoder, SwinML-Detect shows excellent performance on multi-label classification tasks and provides strong technical support for landslide detection in remote sensing images and for multi-label classification problems in other fields.
Drawings
FIG. 1 is a sample of the Bijie landslide multi-label data set used in the method of this embodiment;
fig. 2 is a block diagram of the system of the present embodiment;
FIG. 3 is a performance evaluation of ResNet50, ViT-T, Swin-T, and SwinML-Detect-T in terms of recall and F1 score for each class; FIG. 3(a) is the performance evaluation of the data set's frequent classes; FIG. 3(b) is the performance evaluation of the rare classes (bare land, court, etc.);
FIG. 4 is a comparison of the class activation maps of Swin-Tiny and SwinML-Detect for the landslide category; (a)-(f) represent six different scenarios, i.e., 6 samples picked from the Bijie data set.
FIG. 5 is a comparison of the class activation maps of Swin-Tiny and SwinML-Detect for the road category; (a)-(f) represent six different scenarios, i.e., 6 samples picked from the Bijie data set.
Detailed Description
The technical solutions in this embodiment will be clearly and completely described in conjunction with the embodiment of the present invention, and it is obvious that the described embodiment is only a part of examples of the present invention, not all examples. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
The embodiment relates to a landslide detection method based on multi-label classification, which comprises the following steps:
swin transducer is a new type of visual transducer, and as a hierarchical transducer, the hierarchical structure can be expressed as:
H={H 1 ,H 2 ,...,H L } (1)
where H is the hierarchical structure of the entire network, H i The i-th layer is represented, and L represents the total number of layers.
It can serve as a general-purpose backbone network in the field of computer vision. Its main advantages are the ability to process images of different sizes, high accuracy, and relatively low computational cost. The network architecture of the Swin Transformer is based on a method called the "shifted window":
SW(I, P) = {W_1, W_2, ..., W_N} (2)
where SW denotes the shifted-window method that divides an input image I into multiple sub-images, P denotes the partition size, W_i denotes the i-th sub-image, and N denotes the total number of sub-images.
This method reduces the amount of computation while maintaining high accuracy. A disadvantage of the Swin Transformer, however, is that it requires more computational resources and longer training time. The Swin Transformer concept derives from the Vision Transformer (ViT), likewise a vision Transformer: a neural network based on the self-attention mechanism that can handle images of different sizes. ViT's main advantages are high accuracy and low computational cost when processing images of different sizes; its disadvantage is that it requires more computing resources and longer training time. The network structure of the Swin Transformer adopts the "shifted window" method, which divides the input image into multiple sub-images, processes each sub-image separately, and finally integrates the results of all sub-images. This effectively reduces the amount of computation while maintaining high accuracy.
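The window partition of Eq. (2) can be illustrated with a minimal sketch (our own illustration, not the patent's implementation), enumerating the non-overlapping windows W_1, ..., W_N:

```python
def window_partition(h, w, p):
    # Split an h x w token grid into non-overlapping p x p windows,
    # as in SW(I, P) = {W_1, ..., W_N}.  Returns the top-left (row, col)
    # of each window; N = (h // p) * (w // p).
    return [(r, c) for r in range(0, h, p) for c in range(0, w, p)]

windows = window_partition(56, 56, 7)  # Swin-T stage 1: 56x56 tokens, 7x7 windows
print(len(windows))                    # 8 * 8 = 64 windows
```

Self-attention is then computed inside each window rather than globally, which is what keeps the computation linear in image size.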
The ML-Decoder is an innovative classification head whose attention-based neural network design can effectively predict the presence of class labels. It makes better use of spatial data than other methods and excels in computational efficiency and cost. The ML-Decoder adopts a query-based network architecture and has the ability to generalize to unseen categories.
Q = {q_1, q_2, ..., q_K} (3)
where Q denotes the query set, q_i denotes the i-th query, and K denotes the total number of queries.
Its disadvantage, however, is that it requires more computing resources and training time. The development background of the ML-Decoder stems from the Vision Transformer (ViT). ViT, as an innovative vision Transformer, is a neural network based on the self-attention mechanism that can process images of different sizes.
The attention mechanism may be expressed as
A(Q, K, V) = softmax(QK^T / √d_k) V (4)
where A denotes the attention function, Q denotes the queries, K denotes the keys, V denotes the values, and d_k denotes the dimension of the keys. This formula describes how the similarity between each query and each key is calculated and used to weight the values.
ViT has the advantages of high accuracy and low computational cost when processing images of different sizes, but it also requires more computational resources and training time. In the ML-Decoder, the query-based network architecture can generalize to unseen categories. Its queries employ a novel attention mechanism for predicting the presence of class labels. These queries may be learnable or fixed, and different queries may be used during training and inference to achieve generalization to unseen categories.
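As a hedged illustration of the attention computation described above (a single-query toy version in pure Python; the names are ours, and real implementations batch this as matrix operations on tensors):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(q, keys, values, d_k):
    # Scaled dot-product attention for one query vector q:
    # score_j = (q . k_j) / sqrt(d_k); the softmaxed scores weight the values,
    # mirroring A(Q, K, V) = softmax(QK^T / sqrt(d_k)) V for a single row of Q.
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# The query aligns with the first key, so the output leans toward the first value.
out = attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [0.0]], d_k=2)
```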
The application provides a new model SwinML-Detect, which mainly comprises a feature extraction module and a relation dependence learning module.
First, the image is input to the Swin Transformer for feature extraction. The input image has size 224×224, RGB type, and 3 channels. The model divides the image into 4×4 patches, 56×56 patches in total, each carrying 4×4×3 values. To match the multi-head self-attention input, the channel dimension is mapped to 96. These patches are then fed into Swin Transformer blocks.
The Swin Transformer consists of 4 stages, each consisting of several Swin Transformer blocks and a patch merging layer. Each Swin Transformer block comprises a multi-head self-attention (MHA) layer and a position-wise feed-forward network (FFN) layer. At the end of each stage, the present application adds a patch merging layer to reduce the feature size and increase the channel count. This makes the structure of SwinML-Detect similar to ResNet50. The feature size output by the last stage is 7×7×768.
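The stage-by-stage feature shapes implied by the description above (4×4 patch embedding to 96 channels, then each patch merging layer halving the spatial resolution and doubling the channels until 7×7×768) can be checked with a small sketch; the function name is ours:

```python
def swin_stage_shapes(img=224, patch=4, dim=96, stages=4):
    # Track (tokens per side, channels) through the 4 stages: each patch
    # merging layer halves the spatial resolution and doubles the channel
    # count, ending at 7 x 7 x 768 as stated in the description.
    side, c = img // patch, dim          # after 4x4 patch embedding: 56 x 56 x 96
    shapes = [(side, c)]
    for _ in range(stages - 1):
        side, c = side // 2, c * 2
        shapes.append((side, c))
    return shapes

print(swin_stage_shapes())  # [(56, 96), (28, 192), (14, 384), (7, 768)]
```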
Next, the feature map is input into the relation-dependency learning module, i.e., the ML-Decoder, which decodes the image features. In this process, the ML-Decoder predicts the multi-label categories, with one predicted value per category. Finally, the predicted values are mapped through a sigmoid function to obtain the probability value of each category. Specifically, for the prediction vector
p = {p_1, p_2, ..., p_C} (5)
where C is the number of categories, the present application calculates the probability value of each category as P_i = sigmoid(p_i), where i = 1, 2, ..., C.
To determine the final predicted label vector, a threshold t is set in the present application. The threshold may be determined from the actual problem and performance metrics; for example, cross-validation may be used to find the optimal threshold. The probability values are then converted to category labels according to the threshold: if P_i > t, category i is positive; otherwise category i is negative. Thus, the application obtains a binary label vector
B = {b_1, b_2, ..., b_C} (6)
where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
By this method, the SwinML-Detect model can achieve multi-label classification with high accuracy and efficiency. The feature extraction module (namely the Swin Transformer) is combined with the relation dependency learning module (namely the ML-decoder), so that the model can effectively learn the features of the input image and predict the dependency relation among a plurality of labels.
As shown in fig. 1, this embodiment uses the Bijie multi-label landslide data set, which is constructed from the original Bijie landslide data set specifically for landslide remote sensing multi-label classification. The original data set contains 2773 RGB images in two categories (770 landslide images and 2003 non-landslide images), with a spatial resolution of 0.8 m. According to the ecosystem types in Bijie and the specific ground-feature types in the images, 11 categories are defined: bare land, building, court, farmland, grass, greenhouse, landslide, road, semi-arid grassland, tree, and water.
Table 1 Categories of the Bijie landslide multi-label data set
In this embodiment, the evaluation indexes of the model include two types: category-based evaluation methods and example-based evaluation methods. In the multi-label classification task, an example-based evaluation method is calculated for each data sample. In this approach, the present application focuses on the predicted performance of each sample, and then calculates the average performance of all samples. The main advantage of this assessment method is that it can reflect the performance of the model at the overall sample level. Category-based evaluation methods are calculated for each category. In this approach, the present application focuses on the predicted performance of each category, and then calculates the average performance for all categories. The main advantage of this assessment method is that it can reflect the performance differences of the model between different classes, as well as the predictive power of the model on a certain class.
Category-based precision (P), recall (R), and F1 score (F1):
for each category i, the present application first counts its true positives (TP_i), false positives (FP_i), and false negatives (FN_i). The precision, recall, and F1 score of category i can then be calculated:
precision of category i: P_i = TP_i / (TP_i + FP_i)
recall of category i: R_i = TP_i / (TP_i + FN_i)
F1 score of category i: F1_i = 2 × P_i × R_i / (P_i + R_i)
The average precision, recall, and F1 score over all categories can then be calculated.
Example-based precision (P), recall (R), and F1 score (F1):
for each example (sample) j, its true positives (TP_j), false positives (FP_j), and false negatives (FN_j) are first counted. The precision, recall, and F1 score of example j can then be calculated in the same way:
P_j = TP_j / (TP_j + FP_j), R_j = TP_j / (TP_j + FN_j), F1_j = 2 × P_j × R_j / (P_j + R_j).
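A minimal Python sketch of the category-based metrics above (our own illustration; real evaluation code would typically use a library such as scikit-learn):

```python
def prf1(tp, fp, fn):
    # Precision, recall, and F1 from true positives, false positives,
    # and false negatives, with zero-division guarded to 0.0.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def class_based_scores(y_true, y_pred):
    # y_true, y_pred: lists of binary label vectors, one per sample.
    # For each category i, count TP_i, FP_i, FN_i over all samples,
    # then average P_i, R_i, F1_i over the C categories.
    c = len(y_true[0])
    per_class = []
    for i in range(c):
        tp = sum(t[i] and p[i] for t, p in zip(y_true, y_pred))
        fp = sum((not t[i]) and p[i] for t, p in zip(y_true, y_pred))
        fn = sum(t[i] and (not p[i]) for t, p in zip(y_true, y_pred))
        per_class.append(prf1(tp, fp, fn))
    return tuple(sum(s[k] for s in per_class) / c for k in range(3))
```

The example-based metrics are computed the same way, only looping over samples j instead of categories i before averaging.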
As shown in fig. 2, the present embodiment further relates to a landslide detection system based on multi-label classification, which includes a collector and a processor; the collector collects landslide data; the processor comprises a data construction module, a feature extraction module and a relation dependent learning module.
For comparison with the original paper, the data are randomly divided into training and test sets at a ratio of 2:1. Test-time augmentation is not used to boost performance; augmentation serves only to strengthen the generalization ability of the model. The proposed method is implemented in PyTorch. The images of all data sets are resized to 224×224. For data augmentation, random vertical flipping and random horizontal flipping are used, each with probability 0.5, to enhance the generalization ability of the model. The optimizer is Adam with a cosine learning-rate decay schedule, initial learning rate lr = 0.001, regularization coefficient wd = 0.001, and batch size 32. The loss function is BCELoss. All experiments were performed on an NVIDIA GeForce RTX 3080 Ti GPU.
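The cosine learning-rate decay used above can be sketched as follows (an illustration under the stated initial rate lr = 0.001; the experiments do not specify the schedule beyond a cosine strategy, so the exact form is an assumption):

```python
import math

def cosine_lr(step, total_steps, lr0=0.001):
    # Cosine decay from the initial rate lr0 down to 0 over total_steps,
    # matching the initial learning rate lr = 0.001 given in the text.
    return 0.5 * lr0 * (1 + math.cos(math.pi * step / total_steps))

print(cosine_lr(0, 100))    # 0.001 at the start
print(cosine_lr(100, 100))  # ~0.0 at the end
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.CosineAnnealingLR` attached to the Adam optimizer.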
The processor may be a general-purpose processor, including a central processing unit, a network processor, etc.; but also digital signal processors, application specific integrated circuits, field programmable gate arrays or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
Optionally, the embodiment of the present application further provides a storage medium storing instructions that, when executed on a computer, cause the computer to perform the method of the embodiment described above.
Optionally, the embodiment of the present application further provides a chip for executing the instruction, where the chip is used to perform the method of the foregoing embodiment.
The present application also provides a program product, which comprises a computer program stored in a storage medium, from which at least one processor can read the computer program, and the method of the above embodiment can be implemented when the at least one processor executes the computer program.
Experimental results
The proposed SwinML-Detect model was compared with conventional convolutional neural networks and vision-Transformer-based models. These are widely used models for remote sensing landslide detection, or serve as baselines for evaluating various vision Transformer architectures.
Table 2 shows the performance of the proposed model on unseen (test) data. To verify the performance of the self-attention model, the present application uses the unseen (test) dataset to evaluate the model on data entirely external to the training phase. The test set is evaluated using per-category accuracy criteria, one for each category.
For a fair comparison, Swin-Tiny, ViT-Tiny and ResNet50 were used as baselines. It can be seen that SwinML-Detect achieves high performance.
ResNet50 has lower classification performance (CP, CR and CF1) but performs better on the example-based measures (EP and ER). Its parameter count is small (23.53M), meaning the model is relatively compact and efficient. Its weaker classification performance may stem from ResNet50 being better suited to single-label tasks, with limited performance on multi-label tasks. ViT-T shows a large improvement in classification performance over ResNet50 but only a slight increase in accuracy. Its parameter count rises substantially to 55.26M, making the model more complex and more demanding of computing resources. As a Transformer-based vision model, ViT-T captures global features better and therefore performs better on multi-label classification tasks. Compared with both, Swin-T clearly improves classification performance as well as accuracy. Its parameter count is 27.51M, slightly higher than ResNet50 but much lower than ViT-T. Swin-T introduces a hierarchical structure that makes the model better suited to capturing image features, thereby achieving better performance on multi-label classification tasks.
The ResNet50+ML-Decoder model adds a multi-label decoder on top of ResNet50, greatly improving classification performance and slightly improving accuracy.
The parameter count increases to 30.61M, indicating that the decoder adds computation. This improvement highlights the importance of the multi-label decoder in handling multi-label tasks. Compared with ViT-T, the ViT+ML-Decoder model shows clearly improved classification performance and improved accuracy. Its parameter count increases to 61.38M, making the model more complex. The ViT model combined with the multi-label decoder performs better in the multi-label classification task.
SwinML-Detect achieves the highest classification performance and accuracy among all models. Its parameter count is 33.63M, which is the lowest computational complexity among the three decoder-equipped models. This further demonstrates that the efficient combination of the Swin architecture with a multi-label decoder can achieve excellent performance in multi-label classification tasks.
The recall and F1 scores for each class are shown in Fig. 3. For categories that occur frequently, such as buildings, farmland, trees and landslides, there is no substantial difference between SwinML-Detect and traditional convolutional or self-attention-based vision Transformer models. However, Fig. 3(b) shows the performance indicators for categories containing unusual samples (rare events), such as bare land, grass and water. In Fig. 3(b), the SwinML-Detect method overall performs better than the other candidates on rare categories. This shows that the proposed method effectively improves the detection of rare categories while maintaining good performance on high-frequency categories.
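The per-class recall and F1 scores discussed above can be computed with a short stdlib-only sketch (not the patent's code; the label matrices below are illustrative):

```python
def per_category_metrics(y_true, y_pred):
    """Precision, recall and F1 per category, the ingredients of the
    category-based CP/CR/CF1 measures.
    y_true, y_pred: lists of binary label vectors, one per image."""
    C = len(y_true[0])
    metrics = []
    for i in range(C):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[i] == 1 and p[i] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[i] == 0 and p[i] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[i] == 1 and p[i] == 0)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        metrics.append((prec, rec, f1))
    return metrics

# Three images, two categories: category 1 has a missed detection
m = per_category_metrics([[1, 0], [1, 1], [0, 1]], [[1, 0], [1, 0], [0, 1]])
```

Averaging the per-category values over all C categories yields the CP, CR and CF1 figures reported in Table 2.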
TABLE 2 Performance comparison of ResNet50, ViT-T, Swin-T, ML-ResNet50, ML-ViT and SwinML-Detect-T
FIGS. 3(a) and 3(b) show performance evaluations of ResNet50, ViT-T, Swin-T and SwinML-Detect-T on per-class recall and F1 scores.
It can be seen that SwinML-Detect has the best performance among all models, with the highest precision (P), recall (R) and F1-score. The behavior of the individual models is analyzed below:
Single+ResNet50: a ResNet50 model based on single-label classification. Although its precision is relatively high, its recall and F1-score are relatively low, probably because in multi-label classification tasks a single-label method cannot adequately capture the correlations between samples.
Single+ResNet50+DEM: this model adds DEM (Digital Elevation Model) data on the basis of Single+ResNet50. The DEM input can boost recall and F1-score because it allows the model to learn more complex relationships among samples.
Multi+ResNet50: a ResNet50 model based on multi-label classification. Multi-label classification captures the correlations between samples better than single-label classification, thus improving both recall and F1-score.
Multi+ViT-T: this model uses the vision Transformer (ViT) as its backbone. Although ViT performs strongly on image classification tasks, its performance here differs little from Multi+ResNet50, probably because neither architecture can fully exploit the higher-order features in multi-label classification tasks.
Multi+Swin-T: this model uses the Swin Transformer as its backbone. The Swin Transformer has stronger local perception than ViT and therefore performs better in multi-label classification tasks, with higher recall and F1-score.
SwinML-Detect: this model combines a Swin Transformer with a multi-label decoder; it has strong local perception and handles the correlations among samples well. It therefore performs best among all models, with the highest precision, recall and F1-score.
In summary, the SwinML-Detect model performs best, mainly because it combines the strong local perception of the Swin Transformer with the multi-label decoder's handling of sample correlations. The other models either use only a single-label classification method and cannot fully capture the correlations between samples, or use ResNet50 or ViT as the backbone, whose relatively weak local perception limits performance.
Single-label classification methods (e.g., Single+ResNet50 and Single+ResNet50+DEM) are limited in multi-label classification tasks because they do not adequately capture inter-sample correlations. Using multi-label classification and introducing a multi-label decoder (e.g., Multi+ResNet50 and Multi+ResNet50+ML-Decoder) enhances performance in multi-label classification tasks, but such models still suffer from the limitations of their backbone architectures. The Swin Transformer architecture (e.g., Multi+Swin-T and SwinML-Detect) performs better in multi-label classification because of its stronger local perception. The SwinML-Detect model combines the local perception of the Swin Transformer with the multi-label decoder's handling of sample correlations, and therefore performs best among all models, with the highest precision, recall and F1-score.
TABLE 3 Performance evaluation of ResNet50, ViT-T, Swin-T and SwinML-Detect-T on the landslide category
FIGS. 4 and 5 show Grad-CAM (Gradient-weighted Class Activation Mapping) visualizations for the landslide and road-surface classes, respectively. FIGS. 4(a)-(f) and 5(a)-(f) are six samples from the Bijie landslide dataset.
As shown in FIGS. 4 and 5, class activation maps are compared for several important classes (e.g., landslide and road surface). It can be seen that the method of this embodiment identifies landslides more accurately, with a relatively concentrated region of attention. It also shows encouraging performance on loess landslides, which are difficult to identify.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (9)

1. A landslide detection method based on multi-label classification, characterized by comprising the following steps:
step (1), inputting an image into a Swin Transformer for feature extraction;
the model divides the image into patches of size 4×4, so that each patch carries 4×4×3 channel values; the channel dimension is then mapped to 96; the patches are fed into Swin Transformer blocks;
step (2), inputting the feature map into an ML-Decoder;
the ML-Decoder predicts the multi-label categories; each category has a prediction value;
mapping the prediction values through a sigmoid function to obtain a probability value for each category, and predicting the dependency relationships among the multiple labels;
for the prediction vector:
p = {p_1, p_2, ..., p_C}
where C is the number of categories, the probability value of each category is calculated as P_i = sigmoid(p_i), where i = 1, 2, ..., C;
setting a threshold t, determined according to the actual problem and performance requirements; if P_i > t, category i is positive, otherwise category i is negative; the resulting binary label vector is:
B = {b_1, b_2, ..., b_C}
where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
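The sigmoid-and-threshold step above can be written as a few lines of plain Python (a minimal sketch for illustration, not the patent's implementation; the logit values are made up):

```python
import math

def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(logits, t=0.5):
    """Per-category scores p_i -> probabilities P_i = sigmoid(p_i),
    then binary labels b_i = 1 if P_i > t else 0, as described above."""
    probs = [sigmoid(p) for p in logits]
    return [1 if P > t else 0 for P in probs]

# Three illustrative category logits: strong positive, negative, weak positive
labels = binarize([2.0, -1.0, 0.3])
```

With the default t = 0.5, this is equivalent to thresholding the raw logit at 0; other values of t shift the precision/recall trade-off per category.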
2. The method according to claim 1, characterized in that: in step (1), the size of the input image is 224×224, the type is RGB, and the number of channels is 3.
3. The method according to claim 1, characterized in that: in step (1), the Swin Transformer consists of 4 stages, each stage consisting of several Swin Transformer blocks and a patch merging layer;
each Swin Transformer block comprises a multi-head self-attention layer and a position-wise feed-forward network layer; a patch merging layer is added at the end of each stage, and the feature size output by the final stage is 7×7×768.
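The stage-by-stage feature sizes implied above (224×224 input, 4×4 patches, embedding width 96, merging halving resolution and doubling channels) can be checked with a small arithmetic sketch (an illustration of the standard Swin-Tiny layout, not code from the patent):

```python
def swin_stage_shapes(img=224, patch=4, embed=96, stages=4):
    """Spatial resolution and channel width after each Swin stage:
    patch partition yields an (img/patch) x (img/patch) token grid with
    `embed` channels; each patch-merging step between stages halves the
    resolution and doubles the channel count."""
    res, dim, shapes = img // patch, embed, []
    for s in range(stages):
        shapes.append((res, res, dim))
        if s < stages - 1:  # patch merging between consecutive stages
            res //= 2
            dim *= 2
    return shapes

# For a 224x224 image the final stage output is 7x7x768
shapes = swin_stage_shapes()
```

This reproduces the 7×7×768 final feature size stated in the claim.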
4. The method according to claim 1, characterized in that: before step (1), the method further comprises constructing a dataset: the dataset is constructed based on the original Bijie landslide dataset and is used for landslide remote sensing multi-label classification.
5. The method according to claim 4, characterized in that: for data augmentation, random vertical flipping and random horizontal flipping are used to enhance the generalization capability of the model, each applied with probability 0.5;
the optimizer uses Adam, the learning-rate schedule uses cosine decay, the initial learning rate lr = 0.001, the regularization coefficient wd = 0.001, and the batch size is 32; the loss function uses BCELoss.
6. The method according to claim 1, characterized in that: step (3) further comprises evaluating the model:
category-based precision, recall and F1 score: for each category i, first its true positives, false positives and false negatives are counted; then the precision, recall and F1 score of category i are calculated;
the average precision, recall and F1 score over all categories are then calculated;
example-based precision, recall and F1 score:
for each example j, its true positives, false positives and false negatives are first counted; then the precision, recall and F1 score for example j are calculated.
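The example-based counterpart of the category-based metrics (counts taken over the labels of one sample rather than one category) can be sketched in plain Python; this is an illustration of the evaluation described above, not the patent's code, and the label vectors are made up:

```python
def per_example_metrics(y_true, y_pred):
    """Precision, recall and F1 per example j: true positives, false
    positives and false negatives are counted over the label vector of a
    single sample, then combined into the usual P/R/F1 formulas."""
    out = []
    for t, p in zip(y_true, y_pred):
        tp = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        fp = sum(1 for a, b in zip(t, p) if a == 0 and b == 1)
        fn = sum(1 for a, b in zip(t, p) if a == 1 and b == 0)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out.append((prec, rec, f1))
    return out

# One example with labels {1, 1, 0}, prediction {1, 0, 0}: one label missed
scores = per_example_metrics([[1, 1, 0]], [[1, 0, 0]])
```

Averaging these per-example scores over the test set gives the example-based EP/ER/EF1 figures.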
7. A landslide detection system based on multi-label classification, characterized in that: it comprises a collector and a processor; the collector collects landslide data; the processor comprises a data construction module, a feature extraction module and a relation-dependency learning module;
the data construction module constructs a dataset based on the collected landslide data; the feature extraction module divides an image into patches of size 4×4, each carrying 4×4×3 channel values; the channel dimension is mapped to 96; the patches are fed into Swin Transformer blocks;
the relation-dependency learning module decodes the features and predicts the multi-label categories; each category has a prediction value; the prediction values are mapped through a sigmoid function to obtain a probability value for each category, and the dependency relationships among the multiple labels are predicted;
for the prediction vector:
p = {p_1, p_2, ..., p_C}
where C is the number of categories, the probability value of each category is calculated as P_i = sigmoid(p_i), where i = 1, 2, ..., C;
a threshold t is set, determined according to the actual problem and performance requirements; if P_i > t, category i is positive, otherwise category i is negative; the resulting binary label vector is:
B = {b_1, b_2, ..., b_C}
where b_i = 1 if P_i > t, and b_i = 0 if P_i ≤ t.
8. The system according to claim 7, characterized in that: in the feature extraction module, the Swin Transformer consists of 4 stages, each stage consisting of several Swin Transformer blocks and a patch merging layer;
each Swin Transformer block comprises a multi-head self-attention layer and a position-wise feed-forward network layer; a patch merging layer is added at the end of each stage, and the feature size output by the final stage is 7×7×768.
9. The system according to claim 7, characterized in that: the system further comprises a model evaluation module, configured to evaluate:
category-based precision, recall and F1 score: for each category i, first its true positives, false positives and false negatives are counted; then the precision, recall and F1 score of category i are calculated;
the average precision, recall and F1 score over all categories are then calculated;
example-based precision, recall and F1 score:
for each example j, its true positives, false positives and false negatives are first counted; then the precision, recall and F1 score for example j are calculated.
CN202310451861.0A 2023-04-25 2023-04-25 Landslide detection method and system based on multi-label classification Pending CN116524258A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310451861.0A CN116524258A (en) 2023-04-25 2023-04-25 Landslide detection method and system based on multi-label classification

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310451861.0A CN116524258A (en) 2023-04-25 2023-04-25 Landslide detection method and system based on multi-label classification

Publications (1)

Publication Number Publication Date
CN116524258A true CN116524258A (en) 2023-08-01

Family

ID=87407648

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310451861.0A Pending CN116524258A (en) 2023-04-25 2023-04-25 Landslide detection method and system based on multi-label classification

Country Status (1)

Country Link
CN (1) CN116524258A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274823A (en) * 2023-11-21 2023-12-22 成都理工大学 Visual transducer landslide identification method based on DEM feature enhancement

Citations (9)

Publication number Priority date Publication date Assignee Title
CN111343730A (en) * 2020-04-15 2020-06-26 上海交通大学 Large-scale MIMO passive random access method under space correlation channel
CN113837154A (en) * 2021-11-25 2021-12-24 之江实验室 Open set filtering system and method based on multitask assistance
CN114842351A (en) * 2022-04-11 2022-08-02 中国人民解放军战略支援部队航天工程大学 Remote sensing image semantic change detection method based on twin transforms
CN114937202A (en) * 2022-04-11 2022-08-23 青岛理工大学 Double-current Swin transform remote sensing scene classification method
CN115019123A (en) * 2022-05-20 2022-09-06 中南大学 Self-distillation contrast learning method for remote sensing image scene classification
CN115424059A (en) * 2022-08-24 2022-12-02 珠江水利委员会珠江水利科学研究院 Remote sensing land use classification method based on pixel level comparison learning
CN115588217A (en) * 2022-06-23 2023-01-10 西安电子科技大学 Face attribute detection method based on deep self-attention network
CN115601584A (en) * 2022-09-14 2023-01-13 北京联合大学(Cn) Remote sensing scene image multi-label classification method and device and storage medium
CN115908946A (en) * 2022-12-21 2023-04-04 南京信息工程大学 Land use classification method based on multiple attention semantic segmentation

Non-Patent Citations (2)

Title
TAL RIDNIK et al.: "ML-Decoder: Scalable and Versatile Classification Head", ARXIV, pages 1-14 *
ZE LIU et al.: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ARXIV, pages 1-14 *

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN117274823A (en) * 2023-11-21 2023-12-22 成都理工大学 Visual transducer landslide identification method based on DEM feature enhancement
CN117274823B (en) * 2023-11-21 2024-01-26 成都理工大学 Visual transducer landslide identification method based on DEM feature enhancement

Similar Documents

Publication Publication Date Title
CN114067160B (en) Small sample remote sensing image scene classification method based on embedded smooth graph neural network
CN107885764B (en) Rapid Hash vehicle retrieval method based on multitask deep learning
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN113780149B (en) Remote sensing image building target efficient extraction method based on attention mechanism
CN112668494A (en) Small sample change detection method based on multi-scale feature extraction
Xia et al. A deep Siamese postclassification fusion network for semantic change detection
CN114926746A (en) SAR image change detection method based on multi-scale differential feature attention mechanism
Li et al. A review of deep learning methods for pixel-level crack detection
Liu et al. Survey of road extraction methods in remote sensing images based on deep learning
Zhao et al. Mine diversified contents of multispectral cloud images along with geographical information for multilabel classification
CN116524258A (en) Landslide detection method and system based on multi-label classification
CN115830379A (en) Zero-sample building image classification method based on double-attention machine system
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN115457332A (en) Image multi-label classification method based on graph convolution neural network and class activation mapping
CN111898418A (en) Human body abnormal behavior detection method based on T-TINY-YOLO network
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
López-Cifuentes et al. Attention-based knowledge distillation in scene recognition: the impact of a dct-driven loss
Cao et al. Face detection for rail transit passengers based on single shot detector and active learning
CN117809198A (en) Remote sensing image significance detection method based on multi-scale feature aggregation network
Luo et al. Infrared Road Object Detection Based on Improved YOLOv8.
Zhang et al. Scale-wised feature enhancement network for change captioning of remote sensing images
CN115331254A (en) Anchor frame-free example portrait semantic analysis method
CN115063831A (en) High-performance pedestrian retrieval and re-identification method and device
Li et al. Research on efficient detection network method for remote sensing images based on self attention mechanism
CN118470608B (en) Weak supervision video anomaly detection method and system based on feature enhancement and fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination