CN106991382A - Remote sensing scene classification method - Google Patents

Remote sensing scene classification method

Info

Publication number
CN106991382A
CN106991382A (application CN201710147637.7A)
Authority
CN
China
Prior art keywords
image
remote sensing
classification
scale
pyramid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710147637.7A
Other languages
Chinese (zh)
Inventor
刘青山
杭仁龙
葛玲玲
宋慧慧
孙玉宝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN201710147637.7A priority Critical patent/CN106991382A/en
Publication of CN106991382A publication Critical patent/CN106991382A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • G06F18/2193Validation; Performance evaluation; Active pattern learning techniques based on specific statistical tests
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/259Fusion by voting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10032Satellite or aerial image; Remote sensing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20016Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Astronomy & Astrophysics (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing scene classification method comprising the following steps: generating multi-scale images; extracting multi-scale depth features; fusing convolutional features; and aggregating multi-scale classification results. The invention proposes an adaptive deep pyramid matching (ADPM) model: the multi-scale images are fed into a convolutional neural network with spatial pyramid pooling to extract depth features; the depth features extracted from all convolutional layers are fused and sent to an SVM classifier to obtain classification results; and the multi-scale results are aggregated to provide more information for remote sensing scene classification. Compared with the spatial relationship pyramid (PSR), local detector (Partlets) and semi-supervised projection (SSEP) methods under identical experimental conditions, the method of the invention improves remote sensing scene classification performance and yields more accurate classification results.

Description

Remote sensing scene classification method
Technical Field
The invention belongs to the technical field of image information processing, and relates to a remote sensing scene classification method.
Background
With the development of remote sensing technology, large numbers of high-resolution earth observation images are acquired from satellites and airplanes. Unlike other images, remote sensing scenes exhibit some special characteristics; for example, the objects in a scene vary in size, color, and orientation. In applications such as land resource management and urban planning, remote sensing scene classification is a fundamental task and an important research topic, and automatically and accurately interpreting such large image libraries has become an urgent need.
Over the past few years, a number of feature representation models have been proposed for scene classification. One of the most common is the bag of visual words, which generally includes the following three steps: 1) extracting bottom-level visual features of the image, such as Scale-Invariant Feature Transform (SIFT) descriptors and Histograms of Oriented Gradients (HOG); 2) forming a visual vocabulary by clustering the features with k-means or other methods; 3) mapping each visual feature to its nearest word and generating a mid-level feature representation through a word histogram. This model and its variants have been extensively studied in the remote sensing field.
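As an illustration of these three steps, the following Python sketch builds bag-of-visual-words histograms with scikit-learn's k-means; the descriptor dimensions, image count, and vocabulary size are hypothetical placeholders rather than values from the invention.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Step 1: one array of local descriptors (e.g., SIFT-like) per image.
descriptors = [rng.normal(size=(200, 128)) for _ in range(10)]

# Step 2: cluster all descriptors into a visual vocabulary.
vocab = KMeans(n_clusters=50, n_init=10, random_state=0).fit(np.vstack(descriptors))

# Step 3: map each descriptor to its nearest word and histogram the words.
def bow_histogram(desc):
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / hist.sum()

mid_level = np.array([bow_histogram(d) for d in descriptors])
```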
Although the visual bag of words is somewhat effective in remote sensing scene classification, it provides an unordered set of local descriptors and does not take into account spatial information. To overcome this drawback, a spatial pyramid matching model is developed. The model first segments the original image into different levels of resolution. Second, for each level of resolution, a histogram of local features is extracted from each space. Finally, the spatial histogram is represented by a weighted pyramid matching kernel. Since remote sensing images generally do not have an absolute frame of reference, the relative spatial arrangement of image elements becomes important. Therefore, it is proposed to represent photometric and geometric information of an image with a spatial pyramid co-occurrence model. Unlike segmenting images into uniform cells, the spatial pyramid co-occurrence model uses random spatial segmentation to describe various image layouts.
All of the above methods are based on manually extracted features, which rely heavily on expert experience and domain knowledge. Furthermore, such features make it difficult to achieve an optimal balance between discriminability and robustness, mainly because the details of the real data are not taken into account. Deep learning algorithms, especially convolutional neural networks, have shown great potential in solving this problem, because high-level semantic features can be learned automatically and hierarchically from the original images; this has attracted more and more attention in the remote sensing community.
However, it is difficult to apply convolutional neural networks directly to remote sensing scene image classification, because millions of parameters must be trained while the available training samples are few. Many related studies have shown that features extracted from convolutional neural networks can be used as generic descriptors. Thus, image representations learned by neural networks from large-scale annotated data such as ImageNet can be effectively transferred to a broad range of visual recognition tasks with a limited amount of training data. With this in mind, relevant studies have validated the feasibility of remote sensing scene classification using ImageNet-pre-trained convolutional neural networks; adopting a pre-trained convolutional neural network and fine-tuning it on remote sensing scene data yields impressive classification performance. At present, the generalization capability of features extracted from the fully-connected layers of convolutional neural networks has been evaluated on remote sensing scene classification, with state-of-the-art results reported on public remote sensing scene data sets.
Although the problem of overfitting can be alleviated by transfer learning, some problems remain in remote sensing scene classification based on convolutional neural networks. First, most approaches utilize only the last fully-connected layer as the feature for subsequent classification. It is not reasonable to simply discard the features of the preceding convolutional layers, as these may be beneficial to the classification goal. In fact, features extracted from convolutional layers are more generic than those extracted from fully-connected layers, and may therefore be more suitable for transfer learning. In addition, convolutional-layer features contain more spatial information than fully-connected-layer activations, which facilitates image classification. Recently, the importance of convolutional-layer features has been recognized, but existing methods use only the last convolutional layer and ignore the others.
It is also a notable problem that objects of interest often have different scales in different remote sensing scenes, and even a single scene may contain objects of different sizes. However, the most popular convolutional neural networks require a fixed-size input image (e.g., 227 × 227 pixels). A common solution is to warp or crop the original remote sensing image to a predefined size, which inevitably results in a loss of valid discriminative information.
Inspired by the spatial pyramid model, we regard the features of all convolutional layers as a multi-resolution representation of the input image, and the pyramid matching kernel is then used to integrate them into a unified representation. Unlike the spatial pyramid model, we use deep features instead of low-level descriptors, and the optimal fusion weights between different convolutional layers are learned from the data itself rather than predefined. Feeding multi-scale images into the convolutional neural network reduces the information loss caused by a fixed input size and allows complementary information to be learned from different scales. Considering the computational cost of learning multi-scale depth features, we select a convolutional neural network with spatial pyramid pooling as our underlying deep network: adding a spatial pyramid pooling layer before the fully-connected layers allows the input image to be of arbitrary size. A trained spatial pyramid pooling network can therefore extract multi-scale features from multi-scale input images, facilitating the classification of remote sensing scenes.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a remote sensing scene classification method that fully utilizes the advantages of multi-scale depth feature extraction and the adaptive deep pyramid matching model, classifies remote sensing scenes better, and achieves better classification performance and classification accuracy.
The remote sensing scene classification method of the invention comprises the following steps:
step 1), generating images of different scales N×N from the remote sensing image to be classified by a deformation method, wherein N can take several values according to the size of the image;
step 2), sending the multi-scale images into a convolutional neural network with spatial pyramid pooling for training, so as to extract multi-scale depth features;
step 3), for the input image of each scale, applying the adaptive deep pyramid matching model to fuse the feature representations extracted from all convolutional layers;
step 4), sending the feature representations learned from each scale image into a classifier to obtain classification results, and then integrating the multiple results of all scales using a majority voting strategy, thereby correctly classifying the remote sensing image scene.
In order to avoid the loss of effective discriminative information, the invention further adopts the following improved scheme: the remote sensing scene image to be classified in step 1) is generated at different scales, such as 128 × 128, 192 × 192, 227 × 227, 256 × 256 and 384 × 384, by the deformation method.
Advantageous effects
First, under the same experimental conditions, the classification accuracy of the method is higher than that of the spatial relationship pyramid (PSR), local detector (Partlets) and semi-supervised projection (SSEP) methods;
second, integrating the multiple results of all scales with a majority voting strategy provides more discriminative information and improves the classification accuracy.
Drawings
FIG. 1 is a basic flow chart of the remote sensing scene classification method of the present invention;
FIG. 2 is a system architecture of the multi-scale depth feature extraction process in the remote sensing image classification method of the present invention;
FIG. 3 is a flow chart of the adaptive deep pyramid matching method of the present invention;
FIG. 4 is a histogram of the per-class accuracy of the method of the present invention versus the spatial relationship pyramid (PSR) and local detector (Partlets) methods;
FIG. 5 is a histogram of the per-class accuracy of the method of the present invention versus the semi-supervised projection (SSEP) method.
Detailed Description
The technical scheme of the invention is explained in detail below with reference to the accompanying drawings:
the idea of the invention is to fully utilize the advantages of multi-scale depth feature extraction and a self-adaptive depth pyramid matching model, fully mine all convolutional layer feature information of a convolutional neural network, and integrate a plurality of results of all scales by adopting a majority voting strategy, so that the remote sensing scene can be better classified, the classification performance is better, and the classification accuracy is improved.
The basic flow of the method of the invention is shown in fig. 1, and specifically comprises the following steps:
Step 1), generating multi-scale images: the remote sensing scene image to be classified is transformed by the deformation method into a plurality of images of different scales N×N, yielding a multi-scale image set for the image.
The value of N can be determined according to factors such as the spatial resolution of the sensor and the size of the target objects in the remote sensing image. In a specific implementation, to avoid losing discriminative information, the original image is retained and the image is additionally warped into several multi-scale versions, forming a multi-scale image set. Taking a 256 × 256 original image as an example, N may be 128, 192, 227 and 256, respectively.
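A minimal Python sketch of this step is given below, assuming the Pillow library is available; the scale list mirrors the 256 × 256 example above, and generate_multiscale is an illustrative helper name rather than part of the invention.

```python
# Minimal sketch of step 1: warp one remote sensing image to several scales.
from PIL import Image

def generate_multiscale(path, scales=(128, 192, 227, 256)):
    """Return a dict mapping N to the image warped (resized) to N x N."""
    image = Image.open(path).convert("RGB")
    return {n: image.resize((n, n), Image.BILINEAR) for n in scales}
```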
Step 2), extracting multi-scale depth features: the multi-scale images are sent into the convolutional neural network with spatial pyramid pooling for training, so as to extract the multi-scale depth features.
The architecture for extracting multi-scale depth features, shown in fig. 2, comprises five convolutional layers, a spatial pyramid pooling layer, two fully-connected layers, and a softmax layer. Similar to spatial pyramid matching, the features are mapped into increasingly finer sub-regions, and the features within each sub-region are pooled by max pooling. Assume each feature map after the last convolutional layer is of size a × a and is divided into n × n sub-regions; spatial pyramid pooling can then be regarded as sliding-window pooling with a window size of a/n and a stride of a/n. Here we choose a three-level spatial pyramid pooling configuration, with n × n equal to 1 × 1, 2 × 2 and 4 × 4 respectively. The final output of spatial pyramid pooling concatenates the pooling results of the three levels into one vector, producing a fixed-length representation regardless of the size of the input image. The input images of different scales share one spatial pyramid pooling network.
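The pooling scheme can be sketched in a few lines of numpy; here the sub-region boundaries are computed with rounded bin edges (an assumption, so that a need not be divisible by n), and the output length depends only on the channel count:

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a (channels, a, a) feature map over a 3-level pyramid."""
    c, a, _ = fmap.shape
    pooled = []
    for n in levels:
        edges = np.linspace(0, a, n + 1).astype(int)  # n x n sub-regions
        for i in range(n):
            for j in range(n):
                region = fmap[:, edges[i]:edges[i + 1], edges[j]:edges[j + 1]]
                pooled.append(region.max(axis=(1, 2)))  # max pooling per region
    return np.concatenate(pooled)

# Fixed-length output (c * 21) regardless of the spatial size a:
print(spatial_pyramid_pool(np.random.rand(256, 13, 13)).shape)  # (5376,)
```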
The key to multi-scale depth feature extraction is the training of the network. To ensure the effectiveness of training, the network is pre-trained on the ImageNet 2012 data set, the weight parameters of the first five convolutional layers are transferred and fixed, and the spatial pyramid pooling network is then fine-tuned with remote sensing scene training samples.
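The freeze-then-fine-tune recipe can be sketched in PyTorch on a toy stand-in network; the layer sizes and class count below are illustrative assumptions, not the architecture of fig. 2.

```python
import torch
import torch.nn as nn

class TinySPPNet(nn.Module):
    """Toy stand-in: a convolutional trunk plus a fully-connected head."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                      nn.AdaptiveMaxPool2d(1), nn.Flatten())
        self.classifier = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                                        nn.Linear(64, num_classes))

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinySPPNet()
for p in model.features.parameters():  # transfer and fix the convolutional trunk
    p.requires_grad = False
optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                            lr=1e-3, momentum=0.9)  # fine-tune the head only
```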
Step 3), fusing convolutional features: for the input image of each scale, the feature representations extracted from all convolutional layers are fused using the adaptive deep pyramid matching model.
The feature representations extracted from all convolutional layers are fused using the adaptive deep pyramid matching model. As shown in the adaptive deep pyramid matching flow chart of fig. 3, the feature representations extracted from all convolutional layers are formed into histogram representations through the bag of visual words; the optimal fusion weights between all convolutional layers are learned from the data itself, rather than defined in advance; and weighting with these optimal fusion weights yields the histogram of the fused features extracted from all convolutional layers.
The fusion of convolutional features uses the adaptive deep pyramid matching model of the method. Assume a three-dimensional matrix F_{1,l} ∈ R^{n_l×n_l×p} represents the l-th layer feature maps of an image I_1; then, at each coordinate (i,j), 1 ≤ i ≤ n_l, 1 ≤ j ≤ n_l, f_{1,l}^{(i,j)} denotes the p-dimensional feature of the corresponding local block of image I_1. In this way we obtain the n_l × n_l local feature vectors of the l-th layer of image I_1. We use the k-means method to cluster all features into a vocabulary containing D centers, C = {c_1, …, c_D}, and assign each feature f_{1,l}^{(i,j)} to its nearest visual word c_d ∈ C. Then F_{1,l} can be expressed as a histogram representation H_{1,l} = [h_{1,l}(1), …, h_{1,l}(D)] as follows:

h_{1,l}(d) = Σ_{i,j} δ(f_{1,l}^{(i,j)}, c_d)    (1)

where δ(f_{1,l}^{(i,j)}, c_d) = 1 indicates that the nearest visual word of feature f_{1,l}^{(i,j)} is c_d, and δ(f_{1,l}^{(i,j)}, c_d) = 0 indicates that c_d is not the nearest visual word of feature f_{1,l}^{(i,j)}. Finally, the deep pyramid matching kernel of images I_1 and I_2 is as follows:

K(I_1, I_2) = Σ_{l=1}^{L} ω_l κ(H_{1,l}, H_{2,l}), with κ(H_{1,l}, H_{2,l}) = Σ_{d=1}^{D} min(h_{1,l}(d), h_{2,l}(d))    (2)

where L represents the total number of convolutional layers and ω_l is the fusion weight of the l-th layer, with ω_l ≥ 0 and Σ_{l=1}^{L} ω_l = 1.
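Under the assumption, written into formula (2) above, that histogram intersection is the per-layer matching function, the fused kernel is a few lines of Python:

```python
import numpy as np

def intersection(h1, h2):
    """Histogram intersection kernel between two BoW histograms."""
    return np.minimum(h1, h2).sum()

def deep_pyramid_kernel(hists1, hists2, weights):
    """Weighted fusion over L convolutional layers (weights >= 0, sum to 1)."""
    return sum(w * intersection(h1, h2)
               for w, h1, h2 in zip(weights, hists1, hists2))
```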
for remote sensing scene classification, the method also needsLabel information of the training image is taken into account. Thus, instead of using predefined values, the optimal weights ω are adaptively learned from the training data itselfl. The kernel matrix K of the training data should be close to the ideal matrix Y. Element K in the kernel matrix Ki,jIs defined as an image I1And I2The deep pyramid matches the kernel. Element Yi,jImage label y is represented by 1i=yjOn the contrary, element Yi,j0 denotes the image label yi≠yj
The objective function of the adaptive deep pyramid matching model in the invention is as follows:

min_ω ||K − Y||_F² + λ Σ_{l=1}^{L} ω_l²,  s.t. ω_l ≥ 0, Σ_{l=1}^{L} ω_l = 1    (3)

where ||K − Y||_F² = tr((K − Y)^T (K − Y)) represents the sum of squared distances between the matrices K and Y. The regularization term, composed of all the weights ω_l, prevents overfitting.
By using K = Σ_{l=1}^{L} ω_l K_l instead of K, where K_l is the kernel matrix of the l-th layer, one can deduce ||K − Y||_F² = ω^T A ω − 2 b^T ω + c, where the element of the matrix A is A_{ij} = tr(K_i^T K_j), the element of the vector b is b_j = tr(Y^T K_j), and c = tr(Y^T Y). Then the objective function can be transformed into a typical quadratic programming problem:

min_ω ω^T (A + λI) ω − 2 b^T ω,  s.t. ω_l ≥ 0, Σ_{l=1}^{L} ω_l = 1    (4)
after the quadratic programming optimal solution omega is obtained, the depth pyramid matching kernel matrix K of the training data can be calculated.
Step 4), aggregating multi-scale classification results: the fused features learned from each scale image are sent into a classifier to obtain classification results, and the multiple results of all scales are then integrated using a majority voting strategy, giving the correct classification of the remote sensing scene image.
The invention adopts the following specific implementation: the fused features learned from each scale image are represented by the deep pyramid matching kernel matrix K; the matrix K is sent into a support vector machine classifier for classification, and the classification results from all scales are integrated by majority voting to obtain the final classification result.
The classifier in the method of the invention is a support vector machine, and the following briefly describes the classification model of the support vector machine.
First, the basic principle and training process of the two-class SVM classifier are briefly described. Given an annotated set {(x_i, y_i)}_{i=1}^{N}, where x_i ∈ R^d and y_i ∈ {−1, 1}: x_i is the underlying visual feature vector of a sample, y_i is its class label (positive samples are labeled 1 and negative samples −1), and R^d is the d-dimensional vector space over the real field R. The samples are mapped into a high-dimensional space using a nonlinear mapping, as follows:
Φ: R^d → F, x → Φ(x)    (5)
where F is the mapped high dimensional space and Φ is the corresponding mapping function. The decision function is represented in the form:
g(x)=w·Φ(x)+b (6)
accordingly, the support vector machine classification surface can be written as:
w·Φ(x)+b=0 (7)
where w is the weight vector and b is the offset constant.
Points falling on the two hyperplanes w·Φ(x) + b = ±1 are called support vectors; the distance from a support vector to the classification surface is called the classification margin, and its size is 1/||w||. The size of the classification margin reflects the generalization capability of the classifier, so we want to maximize the margin of the classifier:

min_{w,b} (1/2)||w||²    (8)
s.t. y_i(w·Φ(x_i) + b) ≥ 1, i = 1, …, N
The classification surface of the support vector machine is obtained from the solution of the above formula. Solving the quadratic programming problem in the above formula by the Lagrange multiplier method gives:

w = Σ_{i=1}^{N} α_i y_i Φ(x_i)    (9)

where x_i is a support vector, and y_i and α_i are respectively the class label and the Lagrange coefficient corresponding to that support vector. The output of a sample x from the two-class SVM classifier is then:

f(x) = w·Φ(x) + b = Σ_{i=1}^{N} α_i y_i Φ(x_i)^T Φ(x) + b    (10)
the kernel function is utilized to avoid the display expression of the nonlinear mapping, and the output of the image sample obtained by the two-class SVM classifier can be rewritten as follows:
wherein K (·) is a kernel function, and K (x)i,x)=Φ(xi)TΦ (x), superscript T denotes the transpose matrix. According to the above formula, for any one of the samples that is standard, if the value of f (x) is greater than 0, the class of the sample is labeled 1, and if the value of f (x) is less than 0, the class is labeled-1.
Each two-class classifier generates a classification hyperplane; the distance from the fused features learned for each scale image to each classification hyperplane is calculated, and the image is assigned to the class with the largest distance. The multiple results of all scales are then integrated using the majority voting strategy, correctly classifying the remote sensing image scene.
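The per-scale aggregation itself reduces to a majority vote; a minimal sketch:

```python
from collections import Counter

def majority_vote(per_scale_labels):
    """Return the class predicted most often across the scales of one image."""
    return Counter(per_scale_labels).most_common(1)[0][0]

print(majority_vote(["harbor", "harbor", "beach", "harbor"]))  # harbor
```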
To facilitate understanding of the technical solution of the present invention, two specific examples are given below.
The first embodiment applies the technical scheme provided by the invention to the classification of the 21-Class Land-Use remote sensing data set. The data set was manually extracted from aerial orthoimagery downloaded from the United States Geological Survey (USGS) National Map. It includes 21 land-use and land-cover classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium-density residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks and tennis court. Each class contains 100 RGB images with a spatial resolution of one foot (about 0.3 m) and an image size of 256 × 256 pixels. Using the remote sensing scene classification method based on adaptive deep pyramid matching of the invention, the depth features of the multi-scale images extracted from the convolutional neural network are fused and sent to the classifier, yielding the classification of the remote sensing scene images.
In this embodiment, a support vector machine (SVM) is selected as the classification model, and to verify the effectiveness of the invention, the classification results are compared with the spatial relationship pyramid (PSR) and local detector (Partlets) methods, respectively. N×N images of different scales are generated from the remote sensing scene image to be classified by the deformation method; the multi-scale images are sent into a convolutional neural network with spatial pyramid pooling for training to extract multi-scale depth features; for the input image of each scale, the adaptive deep pyramid matching model fuses the feature representations extracted from all convolutional layers; the feature representations learned from each scale image are sent into the classifier to obtain classification results; and the multiple results of all scales are integrated using a majority voting strategy, correctly classifying the remote sensing scene image.
The classification process of this embodiment is specifically as follows:
1. generating a multi-scale image:
and reserving 256 × 256 original images, generating images with the dimensions of 128 × 128, 192 × 192 and 227 × 227 by the remote sensing scene image to be classified through a deformation method, and forming a group of multi-scale image sets of the images.
2. Extracting multi-scale depth features:
To ensure the effectiveness of training, the network is pre-trained using 227 × 227 remote sensing scene images as input. The data set is randomly divided into a training set and a test set: the training set is used to fine-tune the fully-connected layers of the spatial pyramid pooling network, and the test set is used to evaluate the performance of the classifier. To reduce the influence of random selection, each algorithm is repeated on ten different training/test splits of the data set. Similar to spatial pyramid matching, the spatial pyramid pooling network maps the features into increasingly finer sub-regions and pools the features in each sub-region by max pooling. Assume each feature map after the last convolutional layer is of size a × a and is divided into n × n sub-regions; spatial pyramid pooling can then be regarded as sliding-window pooling with a window size of a/n and a stride of a/n. Here a three-level spatial pyramid pooling configuration is chosen, with n × n equal to 1 × 1, 2 × 2 and 4 × 4 respectively. The final output of spatial pyramid pooling concatenates the pooling results of the three levels into one vector, producing a fixed-length representation regardless of the size of the input image. The input images of different scales share one spatial pyramid pooling network. The multi-scale images are then sent into the convolutional neural network with spatial pyramid pooling for training, so as to extract the multi-scale depth features.
3. Fusing convolutional features:
For the input image of each scale, feature representations are extracted from all convolutional layers, and a visual codebook is formed with k-means over the features at every pixel of the convolutional-layer feature maps. f_{1,l}^{(i,j)} denotes the p-dimensional feature of a local block of the l-th layer of the image, c_d denotes the visual word nearest to f_{1,l}^{(i,j)}, and the l-th-layer features of image I_1 are mapped to the histogram H_{1,l}. The feature representations extracted from all convolutional layers are thus formed into histogram representations through the bag of visual words.
The feature representations extracted from all convolutional layers are fused using the adaptive deep pyramid matching model. The deep pyramid matching kernel of images I_1 and I_2 is K(I_1, I_2) = Σ_{l=1}^{L} ω_l κ(H_{1,l}, H_{2,l}), where L represents the total number of convolutional layers and ω_l is the fusion weight of the l-th layer, with ω_l ≥ 0 and Σ_{l=1}^{L} ω_l = 1. The regularization parameter λ in objective function (3) of the adaptive deep pyramid matching model, which prevents overfitting, is empirically set to 0.5. The optimal fusion weights among all convolutional layers are learned from the data itself, and weighting with them yields the histograms of the fused features extracted from all convolutional layers.
4. Aggregating the multi-scale classification results:
The deep features of the multi-scale images, with the histogram intersection kernel, are fed into a classifier to obtain classification results; this can be implemented using the LIBSVM software package. The multiple results of all scales are integrated using the majority voting strategy, finally completing the classification of the remote sensing scene images.
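The embodiment uses the LIBSVM package; as an equivalent stand-in, the sketch below feeds a precomputed intersection kernel to scikit-learn's SVC. The toy histograms, labels, and split are fabricated for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
hists = rng.random((60, 50))            # toy per-image fused histograms
y = np.repeat([0, 1, 2], 20)            # toy scene labels
K = np.minimum(hists[:, None, :], hists[None, :, :]).sum(-1)  # intersection kernel

idx = rng.permutation(60)
train, test = idx[:40], idx[40:]
svm = SVC(kernel="precomputed").fit(K[np.ix_(train, train)], y[train])
pred = svm.predict(K[np.ix_(test, train)])  # test rows vs. training columns
```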
To verify the effect of the method, the remote sensing scene classification method based on adaptive deep pyramid matching provided by the invention is compared with the spatial relationship pyramid (PSR) method and the local detector (Partlets) method.
FIG. 4 shows histograms of the per-class accuracy of the method of the present invention and of the spatial relationship pyramid (PSR) and local detector (Partlets) methods. As can be seen from the figure, compared with the other two methods, the classification method of the invention achieves the highest accuracy on 15 of the 21 classes, showing that the method of the present invention achieves higher classification accuracy.
Table 1 shows a comparison of the classification accuracy of the three classification methods.

TABLE 1 Classification accuracy comparison

Method              Classification accuracy (%)
PSR                 89.10
Partlets            91.33
ADPM-192            92.67
ADPM-227            92.04
ADPM-256            93.52
Multi-scale ADPM    94.86
As can be seen from Table 1, the classification accuracy of the method of the invention is significantly higher than that of the other two classification methods; in particular, the multi-scale method that fuses the classification results improves accuracy by nearly 4% over the other methods. In addition, the results at different scales differ, and the classification accuracy of the multi-scale method with fused classification results is clearly higher than that of any single-scale method.
In conclusion, compared with the spatial relationship pyramid (PSR) and local detector (Partlets) methods, the method of the invention has obvious advantages in both classification performance and classification accuracy.
The second embodiment applies the technical scheme provided by the invention to the classification of the 19-Class Satellite Scene remote sensing data set. This data set consists of 19 scene classes: airport, beach, bridge, commercial area, desert, farmland, football field, forest, industrial area, meadow, mountain, park, parking lot, pond, port, railway station, residential area, river and viaduct. Each class has 50 images of size 600 × 600 pixels, extracted from larger satellite images using Google Earth software. Using the remote sensing scene classification method based on adaptive deep pyramid matching of the invention, the depth features of the multi-scale images extracted from the convolutional neural network are fused and sent to the classifier, yielding the classification of the remote sensing scene images.
In this embodiment, a support vector machine (SVM) is selected as the classification model, and to verify the effectiveness of the invention, the classification results are compared with the semi-supervised projection (SSEP) method. N×N images of different scales are generated from the remote sensing scene image to be classified by the deformation method; the multi-scale images are sent into a convolutional neural network with spatial pyramid pooling for training to extract multi-scale depth features; for the input image of each scale, the adaptive deep pyramid matching model fuses the feature representations extracted from all convolutional layers; the feature representations learned from each scale image are sent into the classifier to obtain classification results; and the multiple results of all scales are integrated using a majority voting strategy, correctly classifying the remote sensing scene image.
The classification process of this embodiment is specifically as follows:
1. generating a multi-scale image:
and (2) reserving an original image with the size of 600 × 600, generating images with the scale sizes of 128 × 128, 192 × 192, 227 × 227, 256 × 256 and 384 × 384 by the remote sensing scene image to be classified through a deformation method, and forming a group of multi-scale image sets of the images.
2. Extracting multi-scale depth features:
To ensure the effectiveness of training, the network is pre-trained using the remote sensing scene images as input. The data set is randomly divided into a training set and a test set: the training set is used to fine-tune the fully-connected layers of the spatial pyramid pooling network, and the test set is used to evaluate the performance of the classifier. To reduce the influence of random selection, each algorithm is repeated on ten different training/test splits of the data set. Similar to spatial pyramid matching, the spatial pyramid pooling network maps the features into increasingly finer sub-regions and pools the features in each sub-region by max pooling. Assume each feature map after the last convolutional layer is of size a × a and is divided into n × n sub-regions; spatial pyramid pooling can then be regarded as sliding-window pooling with a window size of a/n and a stride of a/n. Here a three-level spatial pyramid pooling configuration is chosen, with n × n equal to 1 × 1, 2 × 2 and 4 × 4 respectively. The final output of spatial pyramid pooling concatenates the pooling results of the three levels into one vector, producing a fixed-length representation regardless of the size of the input image. The input images of different scales share one spatial pyramid pooling network. The multi-scale images are then sent into the convolutional neural network with spatial pyramid pooling for training, so as to extract the multi-scale depth features.
3. Fusing convolutional features:
For the input image of each scale, feature representations are extracted from all convolutional layers, and a visual codebook is formed with k-means over the features at every pixel of the convolutional-layer feature maps. f_{1,l}^{(i,j)} denotes the p-dimensional feature of a local block of the l-th layer of the image, c_d denotes the visual word nearest to f_{1,l}^{(i,j)}, and the l-th-layer features of image I_1 are mapped to the histogram H_{1,l}. The feature representations extracted from all convolutional layers are thus formed into histogram representations through the bag of visual words.
The feature representations extracted from all convolutional layers are fused using the adaptive deep pyramid matching model. The deep pyramid matching kernel of images I_1 and I_2 is K(I_1, I_2) = Σ_{l=1}^{L} ω_l κ(H_{1,l}, H_{2,l}), where L represents the total number of convolutional layers and ω_l is the fusion weight of the l-th layer, with ω_l ≥ 0 and Σ_{l=1}^{L} ω_l = 1. The regularization parameter λ in objective function (3) of the adaptive deep pyramid matching model, which prevents overfitting, is empirically set to 0.5. The optimal fusion weights among all convolutional layers are learned from the data itself, and weighting with them yields the histograms of the fused features extracted from all convolutional layers.
4. Aggregating the multi-scale classification results:
The deep features of the multi-scale images, with the histogram intersection kernel, are fed into a classifier to obtain classification results; this can be implemented using the LIBSVM software package. The multiple results of all scales are integrated using the majority voting strategy, finally completing the classification of the remote sensing scene images.
To verify the effect of the method, the remote sensing scene classification method based on adaptive deep pyramid matching of the invention is compared with the semi-supervised projection (SSEP) method.
FIG. 5 shows histograms of the per-class accuracy of the method of the present invention and of the semi-supervised projection (SSEP) method. As can be seen from the figure, compared with the semi-supervised projection (SSEP) method, the classification method of the invention achieves the highest accuracy on 14 of the 19 classes, showing that the method of the present invention achieves higher classification accuracy.
Table 2 shows a comparison of the classification accuracy of the classification methods.

TABLE 2 Classification accuracy comparison

Method              Classification accuracy (%)
SSEP                73.82
ADPM-227            82.14
ADPM-256            83.71
ADPM-384            81.91
Multi-scale ADPM    84.67
As can be seen from Table 2, the classification accuracy of the method of the invention is significantly higher than that of the semi-supervised projection (SSEP) method; in particular, the multi-scale method that fuses the classification results improves accuracy by nearly 8% over the other methods. In addition, the results at different scales differ, and the classification accuracy of the multi-scale method with fused classification results is clearly higher than that of any single-scale method.
In conclusion, compared with the semi-supervised ensemble projection (SSEP) method, the method of the invention has obvious advantages in both classification performance and classification accuracy.

Claims (2)

1. A remote sensing scene classification method, characterized by comprising the following steps:
step 1), generating images of different scales N×N from the remote sensing scene image to be classified by a deformation method, wherein N can take several values according to the size of the image;
step 2), sending the multi-scale images into a convolutional neural network with spatial pyramid pooling for training, so as to extract multi-scale depth features;
step 3), for the input image of each scale, fusing the feature representations extracted from all convolutional layers by using an adaptive deep pyramid matching model;
step 4), sending the feature representations learned from each scale image into a classifier to obtain classification results, and then integrating the multiple results of all scales using a majority voting strategy, thereby correctly classifying the remote sensing scene image.
2. The remote sensing scene classification method according to claim 1, characterized in that the remote sensing scene image to be classified in step 1) is generated at different scales, such as 128 × 128, 192 × 192, 227 × 227, 256 × 256 and 384 × 384, by the deformation method.
CN201710147637.7A 2017-03-13 2017-03-13 Remote sensing scene classification method Pending CN106991382A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710147637.7A CN106991382A (en) 2017-03-13 2017-03-13 Remote sensing scene classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710147637.7A CN106991382A (en) 2017-03-13 2017-03-13 Remote sensing scene classification method

Publications (1)

Publication Number Publication Date
CN106991382A true CN106991382A (en) 2017-07-28

Family

ID=59412104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710147637.7A Pending CN106991382A (en) Remote sensing scene classification method

Country Status (1)

Country Link
CN (1) CN106991382A (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578003A (en) * 2017-08-29 2018-01-12 中国科学院遥感与数字地球研究所 A kind of remote sensing images transfer learning method based on GEOGRAPHICAL INDICATION image
CN107644426A (en) * 2017-10-12 2018-01-30 中国科学技术大学 Image, semantic dividing method based on pyramid pond encoding and decoding structure
CN107958219A (en) * 2017-12-06 2018-04-24 电子科技大学 Image scene classification method based on multi-model and Analysis On Multi-scale Features
CN108491757A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Remote sensing image object detection method based on Analysis On Multi-scale Features study
CN108537121A (en) * 2018-03-07 2018-09-14 中国科学院西安光学精密机械研究所 Self-adaptive remote sensing scene classification method based on meteorological environment parameter and image information fusion
CN109508582A (en) * 2017-09-15 2019-03-22 中国公路工程咨询集团有限公司 The recognition methods of remote sensing image and device
CN109508580A (en) * 2017-09-15 2019-03-22 百度在线网络技术(北京)有限公司 Traffic lights recognition methods and device
CN109978071A (en) * 2019-04-03 2019-07-05 西北工业大学 Hyperspectral image classification method based on data augmentation and Multiple Classifier Fusion
CN110287962A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Remote Sensing Target extracting method, device and medium based on superobject information
CN110321866A (en) * 2019-07-09 2019-10-11 西北工业大学 Remote sensing images scene classification method based on depth characteristic Sparse Least
CN110555461A (en) * 2019-07-31 2019-12-10 中国地质大学(武汉) scene classification method and system based on multi-structure convolutional neural network feature fusion
CN110633640A (en) * 2019-08-13 2019-12-31 杭州电子科技大学 Method for identifying complex scene by optimizing PointNet
CN110717553A (en) * 2019-06-20 2020-01-21 江苏德劭信息科技有限公司 Traffic contraband identification method based on self-attenuation weight and multiple local constraints
CN110781926A (en) * 2019-09-29 2020-02-11 武汉大学 Support vector machine multi-spectral-band image analysis method based on robust auxiliary information reconstruction
CN111340750A (en) * 2018-12-18 2020-06-26 詹宝珠 Convolutional neural network analysis method and electronic device
CN111639672A (en) * 2020-04-23 2020-09-08 中国科学院空天信息创新研究院 Deep learning city functional area classification method based on majority voting
CN111860207A (en) * 2020-06-29 2020-10-30 中山大学 Multi-scale remote sensing image ground object classification method, system, device and medium
CN112016596A (en) * 2020-08-10 2020-12-01 西安科技大学 Evaluation method for farmland soil fertility based on convolutional neural network
CN112149582A (en) * 2020-09-27 2020-12-29 中国科学院空天信息创新研究院 Hyperspectral image material identification method and system
CN112766083A (en) * 2020-12-30 2021-05-07 中南民族大学 Remote sensing scene classification method and system based on multi-scale feature fusion
CN113724381A (en) * 2021-07-23 2021-11-30 广州市城市规划勘测设计研究院 Dynamic three-dimensional scene rapid reconstruction method based on high-resolution remote sensing image
CN114638272A (en) * 2022-05-19 2022-06-17 之江实验室 Identity recognition method and device based on fingertip pulse wave signals
CN114664048A (en) * 2022-05-26 2022-06-24 环球数科集团有限公司 Fire monitoring and fire early warning method based on satellite remote sensing monitoring

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258214A (en) * 2013-04-26 2013-08-21 南京信息工程大学 Remote sensing image classification method based on image block active learning
CN103413142A (en) * 2013-07-22 2013-11-27 中国科学院遥感与数字地球研究所 Remote sensing image land utilization scene classification method based on two-dimension wavelet decomposition and visual sense bag-of-word model
CN105069481A (en) * 2015-08-19 2015-11-18 西安电子科技大学 Multi-label natural scene classification method based on spatial pyramid and sparse coding
CN105956532A (en) * 2016-04-25 2016-09-21 大连理工大学 Traffic scene classification method based on multi-scale convolution neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103258214A (en) * 2013-04-26 2013-08-21 南京信息工程大学 Remote sensing image classification method based on image block active learning
CN103413142A (en) * 2013-07-22 2013-11-27 中国科学院遥感与数字地球研究所 Remote sensing image land utilization scene classification method based on two-dimension wavelet decomposition and visual sense bag-of-word model
CN105069481A (en) * 2015-08-19 2015-11-18 西安电子科技大学 Multi-label natural scene classification method based on spatial pyramid and sparse coding
CN105956532A (en) * 2016-04-25 2016-09-21 大连理工大学 Traffic scene classification method based on multi-scale convolution neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QINGSHAN LIU ET AL.: "Adaptive Deep Pyramid Matching for Remote Sensing Scene Classification", IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING *

Cited By (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107578003B (en) * 2017-08-29 2020-04-14 中国科学院遥感与数字地球研究所 Remote sensing image transfer learning method based on geographic marking image
CN107578003A (en) * 2017-08-29 2018-01-12 中国科学院遥感与数字地球研究所 A kind of remote sensing images transfer learning method based on GEOGRAPHICAL INDICATION image
US11037005B2 (en) 2017-09-15 2021-06-15 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for identifying traffic light
CN109508582A (en) * 2017-09-15 2019-03-22 中国公路工程咨询集团有限公司 The recognition methods of remote sensing image and device
CN109508580A (en) * 2017-09-15 2019-03-22 百度在线网络技术(北京)有限公司 Traffic lights recognition methods and device
CN109508580B (en) * 2017-09-15 2022-02-25 阿波罗智能技术(北京)有限公司 Traffic signal lamp identification method and device
CN107644426A (en) * 2017-10-12 2018-01-30 中国科学技术大学 Image, semantic dividing method based on pyramid pond encoding and decoding structure
CN107958219A (en) * 2017-12-06 2018-04-24 电子科技大学 Image scene classification method based on multi-model and Analysis On Multi-scale Features
CN108491757A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Remote sensing image object detection method based on Analysis On Multi-scale Features study
CN108491757B (en) * 2018-02-05 2020-06-16 西安电子科技大学 Optical remote sensing image target detection method based on multi-scale feature learning
CN108537121A (en) * 2018-03-07 2018-09-14 中国科学院西安光学精密机械研究所 Self-adaptive remote sensing scene classification method based on meteorological environment parameter and image information fusion
CN108537121B (en) * 2018-03-07 2020-11-03 中国科学院西安光学精密机械研究所 Self-adaptive remote sensing scene classification method based on meteorological environment parameter and image information fusion
CN111340750B (en) * 2018-12-18 2023-08-08 詹宝珠 Convolutional neural network analysis method and electronic device
CN111340750A (en) * 2018-12-18 2020-06-26 詹宝珠 Convolutional neural network analysis method and electronic device
CN109978071A (en) * 2019-04-03 2019-07-05 西北工业大学 Hyperspectral image classification method based on data augmentation and Multiple Classifier Fusion
WO2020232905A1 (en) * 2019-05-20 2020-11-26 平安科技(深圳)有限公司 Superobject information-based remote sensing image target extraction method, device, electronic apparatus, and medium
CN110287962A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Remote Sensing Target extracting method, device and medium based on superobject information
CN110287962B (en) * 2019-05-20 2023-10-27 平安科技(深圳)有限公司 Remote sensing image target extraction method, device and medium based on super object information
CN110717553A (en) * 2019-06-20 2020-01-21 江苏德劭信息科技有限公司 Traffic contraband identification method based on self-attenuation weight and multiple local constraints
CN110717553B (en) * 2019-06-20 2023-08-04 江苏德劭信息科技有限公司 Traffic contraband identification method based on self-attenuation weight and multiple local constraints
CN110321866A (en) * 2019-07-09 2019-10-11 西北工业大学 Remote sensing images scene classification method based on depth characteristic Sparse Least
CN110321866B (en) * 2019-07-09 2023-03-24 西北工业大学 Remote sensing image scene classification method based on depth feature sparsification algorithm
CN110555461A (en) * 2019-07-31 2019-12-10 中国地质大学(武汉) scene classification method and system based on multi-structure convolutional neural network feature fusion
CN110633640A (en) * 2019-08-13 2019-12-31 杭州电子科技大学 Method for identifying complex scene by optimizing PointNet
CN110781926B (en) * 2019-09-29 2023-09-19 武汉大学 Multi-spectral band image analysis method of support vector machine based on robust auxiliary information reconstruction
CN110781926A (en) * 2019-09-29 2020-02-11 武汉大学 Support vector machine multi-spectral-band image analysis method based on robust auxiliary information reconstruction
CN111639672B (en) * 2020-04-23 2023-12-19 中国科学院空天信息创新研究院 Deep learning city function classification method based on majority voting
CN111639672A (en) * 2020-04-23 2020-09-08 中国科学院空天信息创新研究院 Deep learning city functional area classification method based on majority voting
CN111860207A (en) * 2020-06-29 2020-10-30 中山大学 Multi-scale remote sensing image ground object classification method, system, device and medium
CN111860207B (en) * 2020-06-29 2023-10-24 中山大学 Multi-scale remote sensing image ground object classification method, system, device and medium
CN112016596A (en) * 2020-08-10 2020-12-01 西安科技大学 Evaluation method for farmland soil fertility based on convolutional neural network
CN112016596B (en) * 2020-08-10 2024-04-09 西安科技大学 Farmland soil fertility evaluation method based on convolutional neural network
CN112149582A (en) * 2020-09-27 2020-12-29 中国科学院空天信息创新研究院 Hyperspectral image material identification method and system
CN112766083A (en) * 2020-12-30 2021-05-07 中南民族大学 Remote sensing scene classification method and system based on multi-scale feature fusion
CN112766083B (en) * 2020-12-30 2023-10-27 中南民族大学 Remote sensing scene classification method and system based on multi-scale feature fusion
CN113724381B (en) * 2021-07-23 2022-06-28 广州市城市规划勘测设计研究院 Dynamic three-dimensional scene rapid reconstruction method based on high-resolution remote sensing image
CN113724381A (en) * 2021-07-23 2021-11-30 广州市城市规划勘测设计研究院 Dynamic three-dimensional scene rapid reconstruction method based on high-resolution remote sensing image
CN114638272A (en) * 2022-05-19 2022-06-17 之江实验室 Identity recognition method and device based on fingertip pulse wave signals
CN114664048A (en) * 2022-05-26 2022-06-24 环球数科集团有限公司 Fire monitoring and fire early warning method based on satellite remote sensing monitoring

Similar Documents

Publication Publication Date Title
CN106991382A (en) Remote sensing scene classification method
US10984532B2 (en) Joint deep learning for land cover and land use classification
Hua et al. Recurrently exploring class-wise attention in a hybrid convolutional and bidirectional LSTM network for multi-label aerial image classification
Yao et al. Semantic annotation of high-resolution satellite images via weakly supervised learning
Lian et al. Road extraction methods in high-resolution remote sensing images: A comprehensive review
Gong et al. Superpixel-based difference representation learning for change detection in multispectral remote sensing images
US10922589B2 (en) Object-based convolutional neural network for land use classification
Xia et al. Spectral–spatial classification for hyperspectral data using rotation forests with local feature extraction and Markov random fields
Kavzoglu Object-oriented random forest for high resolution land cover mapping using quickbird-2 imagery
Wu et al. A scene change detection framework for multi-temporal very high resolution remote sensing images
Cheng et al. Multi-class geospatial object detection and geographic image classification based on collection of part detectors
Zhao et al. Dirichlet-derived multiple topic scene classification model for high spatial resolution remote sensing imagery
Tang et al. Improving image classification with location context
Li et al. A new accuracy assessment method for one-class remote sensing classification
CN110728295B (en) Semi-supervised landform classification model training and landform graph construction method
CN104408483B (en) SAR texture image classification methods based on deep neural network
Tao et al. Scene context-driven vehicle detection in high-resolution aerial images
Abid et al. UCL: Unsupervised Curriculum Learning for water body classification from remote sensing imagery
Yee et al. DeepScene: Scene classification via convolutional neural network with spatial pyramid pooling
CN113223042B (en) Intelligent acquisition method and equipment for remote sensing image deep learning sample
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
Liu et al. Learning multi-scale deep features for high-resolution satellite image classification
Alhichri et al. Tile‐Based Semisupervised Classification of Large‐Scale VHR Remote Sensing Images
Han et al. The edge-preservation multi-classifier relearning framework for the classification of high-resolution remotely sensed imagery
Singh et al. Semantically guided geo-location and modeling in urban environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170728