WO2024061050A1

WO2024061050A1 - Remote-sensing sample labeling method based on geoscientific information and active learning

Info

Publication number: WO2024061050A1
Application number: PCT/CN2023/118178
Authority: WO
Inventors: 陈婷; 段红伟; 李洁; 董铱斐; 邹圣兵
Original assignee: 北京数慧时空信息技术有限公司
Priority date: 2022-09-19
Filing date: 2023-09-12
Publication date: 2024-03-28
Also published as: CN115272870A

Abstract

The present invention belongs to the field of classification of remote-sensing images. Disclosed is a remote-sensing sample labeling method based on geoscientific information and active learning. The method comprises: acquiring remote-sensing sample sets; performing geoscientific calculation on the remote-sensing sample sets, so as to obtain geoscientific information; clustering the remote-sensing sample sets according to the geoscientific information; obtaining a set of labeled samples and a set of unlabeled samples in combination with an active learning method; performing model training on a first classifier model by means of the set of labeled samples, inputting the set of unlabeled samples into the first classifier model for prediction, and performing screening in combination with the geoscientific information and a sample query strategy, so as to obtain a value sample set; after the labeling of the value sample set is performed by an expert, adding, into the set of labeled samples, the value sample set, the labeling of which is performed by the expert; and performing labeling on the set of unlabeled samples by means of the first classifier model, so as to obtain a labeling result. The labeling method in the present invention can improve the accuracy of labeling.

Description

Remote sensing sample labeling method based on geoscience information and active learning

Technical field

The invention relates to the field of remote sensing image classification, and specifically relates to a remote sensing sample labeling method based on geoscience information and active learning.

Background art

This invention is directed at labeling remote sensing samples in large-area scenarios. Traditional supervised learning requires every sample to be labeled, which makes it difficult to apply in practice to large-area scenarios. Active learning maintains the accuracy of sample labeling while reducing its cost. Traditional supervised learning methods require experts to label samples, and in practice experts usually label training samples based on the visual characteristics of the scene. If samples are handed to experts without screening, the experts spend a great deal of valuable time fully labeling samples that carry similar amounts of information. This wastes manual effort and makes the training set highly redundant; the redundancy greatly reduces training speed and may even cause overfitting. For satellite remote sensing images, an automatic process for defining an effective training set is therefore needed: the training set should contain as few samples as possible while still effectively improving the accuracy of the classification model. Active learning was developed for this purpose. It first trains the classifier on a very small number of labeled samples, far fewer than would be needed to train the classifier fully; a screening strategy then selects a specific number of samples from the samples still to be labeled, and these are labeled manually; finally, the newly labeled samples are used for incremental training of the classifier.

However, at a large regional or global scale, even when active learning screening strategies are used to reduce the number of samples to be labeled, the number of samples that require manual labeling remains relatively large. This results in very high labor costs and a large data processing volume, and the trained classifier model has low accuracy, making it difficult to complete sample labeling at a large regional or global scale. The main reason is that existing active learning methods cannot fully utilize the information of remote sensing samples.

Summary of the invention

The technical problem to be solved by this invention is to comprehensively utilize the spatial characteristics and statistical characteristics of remote sensing samples, organically combine geoscience information and data mining methods, and improve the accuracy of sample labeling.

In order to achieve the above-mentioned object of the invention, the present invention provides a remote sensing sample labeling method based on geoscience information and active learning, including:

S1 obtains a remote sensing sample set, which consists of multiple remote sensing samples. The remote sensing samples are divided into unlabeled samples and labeled samples;

S2 performs geoscience calculations on the remote sensing sample set to obtain geoscience information, where the geoscience information includes elevation information, spectral information, texture information, shape information, and statistical measurement information;

S3 clusters the remote sensing sample set according to the geoscientific information to obtain k clusters and k cluster centers, where each cluster includes a cluster center, k ≥ 1;

S4 calculates the distance between each cluster center and the remote sensing samples in the corresponding cluster. Each cluster selects the remote sensing sample closest to the cluster center and the remote sensing sample farthest away, resulting in 2k remote sensing samples;

S5 hands the unlabeled samples among the 2k remote sensing samples to experts for labeling, combines the expert labeling results with the labeled samples in the remote sensing sample set to form a labeled sample set, and divides the remote sensing sample set into a labeled sample set and an unlabeled sample set;

S6 performs model training on the first classifier model through the labeled sample set, and determines whether the conditions for terminating the training of the first classifier model are met:

If satisfied, end training and execute step S9;

If not satisfied, execute step S7;

S7 inputs the unlabeled sample set into the first classifier model for prediction, and combines geoscience information and sample query strategies for screening to obtain a valuable sample set;

S8 After handing over the value sample set to experts for labeling, add the value sample set labeled by the experts to the labeled sample set, update the unlabeled sample set and return to step S6;

S9 uses the first classifier model to label the unlabeled sample set to obtain the labeling result.
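The loop formed by steps S6 to S9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: `train`, `select_valuable`, `ask_expert`, and `should_stop` are hypothetical callables standing in for the first classifier model, the query strategy of step S7, the expert, and the training termination condition, none of which are specified in code form here.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Model = Callable[[int], Hashable]  # maps a sample id to a predicted label


def active_learning_loop(
    labeled: Dict[int, Hashable],
    unlabeled: Sequence[int],
    train: Callable[[Dict[int, Hashable]], Model],
    select_valuable: Callable[[Model, List[int]], List[int]],
    ask_expert: Callable[[int], Hashable],
    should_stop: Callable[[int], bool],
) -> Dict[int, Hashable]:
    """Return a label for every sample id (expert labels plus model predictions)."""
    labeled = dict(labeled)
    pool = list(unlabeled)
    rounds = 0
    while True:
        model = train(labeled)                 # S6: train on the labeled set
        rounds += 1
        if not pool or should_stop(rounds):    # S6: termination check
            break
        picked = select_valuable(model, pool)  # S7: screen the unlabeled set
        if not picked:
            break  # assumption: nothing left worth asking the expert about
        for sid in picked:                     # S8: expert labels the value samples
            labeled[sid] = ask_expert(sid)
        chosen = set(picked)
        pool = [s for s in pool if s not in chosen]
    result = dict(labeled)
    for sid in pool:                           # S9: the model labels the rest
        result[sid] = model(sid)
    return result


# Toy usage: sample ids 0..9, true label is 0 below 5 and 1 otherwise.
def train_nearest(lab: Dict[int, Hashable]) -> Model:
    """Hypothetical stand-in classifier: copy the label of the nearest labeled id."""
    known = dict(lab)
    return lambda s: known[min(known, key=lambda t: abs(t - s))]


result = active_learning_loop(
    labeled={0: 0, 9: 1},
    unlabeled=range(1, 9),
    train=train_nearest,
    select_valuable=lambda model, pool: pool[:1],  # placeholder query strategy
    ask_expert=lambda s: 0 if s < 5 else 1,
    should_stop=lambda rounds: rounds >= 3,
)
```

In the toy run the expert labels two samples before the stop condition fires, and the remaining six are labeled by the stand-in model.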

In a specific embodiment of the present invention, step S3 includes:

S31 obtains the location information of each remote sensing sample and constructs a distance calculation strategy based on geoscience information. The distance calculation strategy includes the spatial distance method and the characteristic distance method;

S32 obtains k initial clustering centers based on the distance calculation strategy;

S33 combines the location information and distance calculation strategy of remote sensing samples to iteratively optimize the k initial clustering centers to obtain k clusters and k clustering centers.

In a specific embodiment of the present invention, step S32 includes:

S321 randomly selects a remote sensing sample from the remote sensing sample set, uses the remote sensing sample as the initial clustering center, and adds it to the initial clustering center set;

S322 calculates the distance between a single remote sensing sample and all initial cluster centers based on the distance calculation strategy, uses the maximum distance as the first distance of the remote sensing sample, and sorts the first distances of all remote sensing samples from large to small. Select the remote sensing sample with the largest first distance as the new initial cluster center and add it to the initial cluster center set;

S323 Repeat step S322 until the number of initial clustering centers in the initial clustering center set reaches k.

In a specific embodiment of the present invention, step S33 includes:

S331 obtains the coordinate value of the remote sensing sample according to the position information of the remote sensing sample;

S332 calculates the distance between a single remote sensing sample and the k initial cluster centers based on the distance calculation strategy, and uses the smallest distance as the second distance of the remote sensing sample;

S333 forms an initial cluster from a single initial cluster center and the remote sensing samples whose distance from that initial cluster center is their second distance, and uses the initial cluster center as the cluster center of that cluster, obtaining the initial k clusters and the initial k cluster centers;

S334 calculates the average value of the coordinate values of all remote sensing samples in the current single cluster, and calculates the difference between the coordinate value of each remote sensing sample and the average value, and uses the remote sensing sample corresponding to the coordinate value with the smallest difference as a new cluster center to obtain new k cluster centers;

S335 forms a new cluster from a single new cluster center and remote sensing samples whose distance from the cluster center is its second distance, and obtains new k clusters;

S336 calculates the distance between each remote sensing sample and the corresponding new cluster center according to the distance calculation strategy, and calculates the sum of squares of all distances to obtain the sum of squares of errors of the new k clusters;

S337 iteratively executes steps S334-S336. Each iteration obtains k clusters, their k cluster centers, and the sum of squared errors of the k clusters. A change value is calculated from the sums of squared errors of two adjacent iterations, and it is determined whether the change value meets the iteration stop condition; if so, the iteration stops and the final k clusters and k cluster centers are obtained.

In a specific embodiment of the present invention, the distance calculation strategy is:

Select two remote sensing samples to be calculated as the first sample and the second sample;

The spatial distance d_s between the first sample and the second sample is obtained according to the spatial distance method;

According to the characteristic distance method, the characteristic distance d_Eu between the first sample and the second sample is obtained;

Normalize d_s and d_Eu to obtain the normalized results d'_s and d'_Eu, where the ranges of d'_s and d'_Eu are both [0,1];

Calculate the sum of d'_s and d'_Eu as the distance between the first sample and the second sample.
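The normalize-then-sum rule can be illustrated with a short sketch. The text only requires that both normalized distances lie in [0,1]; the min-max normalization over a batch of pairwise distances used below is an assumption, and the function names are hypothetical.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale values into [0, 1]; a constant list maps to all zeros (assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_distances(d_s: List[float], d_eu: List[float]) -> List[float]:
    """Sum of normalized spatial and feature distances for the same sample pairs."""
    if len(d_s) != len(d_eu):
        raise ValueError("both lists must describe the same sample pairs")
    norm_s = min_max_normalize(d_s)
    norm_eu = min_max_normalize(d_eu)
    return [a + b for a, b in zip(norm_s, norm_eu)]


# Three sample pairs: spatial distances in metres, feature distances unitless.
spatial = [100.0, 300.0, 500.0]
feature = [0.2, 0.8, 0.5]
combined = combined_distances(spatial, feature)
```

Because each term is in [0,1], the combined distance lies in [0,2], and neither the spatial term nor the feature term dominates merely because of its units.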

In a specific embodiment of the present invention, the spatial distance method is:

Construct a Delaunay triangulation {Del} according to the location information of the remote sensing samples, {Del} includes multiple Delaunay triangles, and each Delaunay triangle includes three vertices and adjacent edges;

Obtain the Delaunay triangles Del¹ and Del² of the first sample and the second sample in the Delaunay triangle network {Del};

Get the vertex set {Node1} of Del¹ on its adjacent edges, and get the vertex set {Node2} of Del² on its adjacent edges;

According to the coordinates of each vertex in {Node1} and {Node2}, the two farthest vertices Node₁ and Node₂ are obtained;

Calculate the distance between Node₁ and Node₂ according to the spatial topological relationship, as the spatial distance d_s between the first sample and the second sample.

In a specific embodiment of the present invention, the adjacent sides of a Delaunay triangle are the sides that the Delaunay triangle shares with other Delaunay triangles, and the number of adjacent sides may differ from one Delaunay triangle to another.

In a specific embodiment of the present invention, the feature distance method is:

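Obtain the geoscience information vectors f¹ and f² of the first sample and the second sample according to the geoscience information;

Calculate the Euclidean distance between f¹ and f² as the characteristic distance d_Eu between the first sample and the second sample:

$$d_{Eu} = \lVert f^{1} - f^{2} \rVert_2 = \sqrt{\sum_{i} \left(f^{1}_{i} - f^{2}_{i}\right)^{2}}$$

where the sum runs over the components of the two vectors.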
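As a minimal sketch of the characteristic distance, the Euclidean distance between two geoscience information vectors can be computed as below. The function name and the example vectors are hypothetical; how the components are scaled before comparison is not stated in the text and is left to the caller.

```python
import math
from typing import Sequence


def feature_distance(f1: Sequence[float], f2: Sequence[float]) -> float:
    """Euclidean distance between two geoscience information vectors."""
    if len(f1) != len(f2):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


# Two illustrative vectors with two components each.
d_eu = feature_distance([0.0, 0.0], [3.0, 4.0])
```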

In a specific embodiment of the present invention, step S7 includes:

S71 calculates the information entropy and probability density of each unlabeled sample in the unlabeled sample set, and calculates the product of the information entropy and probability density of each unlabeled sample, and combines the product and difference constraints to screen the unlabeled samples to obtain key samples;

S72 obtains the labeled samples in the same cluster as the key samples as important samples;

S73 calculates the characteristic distance between each key sample and its corresponding important sample as the third distance, and adds the key samples whose third distance is greater than the distance threshold to the value sample set.

In a specific embodiment of the present invention, it is characterized in that:

The elevation information includes DEM information, ground slope information, and terrain roughness information;

The spectral information includes normalized vegetation index and enhanced vegetation index;

The texture information includes gray level co-occurrence matrix information, gray level run length matrix information, and neighborhood gray level difference matrix information;

The shape information includes rectangularity, elongation, major axis length, and longest diameter;

The statistical measurement information includes maximum value, minimum value, range, and skewness.

The present invention provides a remote sensing sample labeling method based on geoscience information and active learning. In summary, due to the adoption of the above technical solution, the beneficial effects of the present invention are:

(1) This invention performs sample clustering based on geoscientific information and can comprehensively utilize the spatial characteristics and statistical characteristics of remote sensing samples to obtain clusters with feature continuity and spatial continuity. Initial sample selection and labeling are performed from these clusters, which, compared with existing active learning methods, better ensures the diversity of the samples.

(2) The present invention can reduce the cost of sample labeling and quickly improve the classification effect of the classifier model.

(3) The present invention uses a sample query strategy combined with geoscience information to screen unlabeled samples and obtain a value sample set, which can obtain value samples that are both representative and informative.

Description of drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference characters designate the same components. In the drawings:

Figure 1 is a method flow chart of an embodiment of the present invention.

Detailed description

Specific implementations of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are used to illustrate the invention but are not intended to limit the scope of the invention.

As shown in Figure 1, this embodiment provides a remote sensing sample labeling method based on geoscience information and active learning, including:

S1 obtains a remote sensing sample set. The remote sensing sample set consists of multiple remote sensing samples. The remote sensing samples are divided into unlabeled samples and labeled samples.

First, multiple remote sensing samples are obtained, including unlabeled samples and labeled samples, to form a remote sensing sample set. Among them, the number of unlabeled samples is much larger than the number of labeled samples.

S2 performs geoscience calculations on the remote sensing sample set to obtain geoscience information, where the geoscience information includes elevation information, spectral information, texture information, shape information, and statistical measurement information.

Wherein, the elevation information includes DEM information, ground slope information, and terrain roughness information; the spectral information includes normalized vegetation index and enhanced vegetation index; the texture information includes gray level co-occurrence matrix information, gray level run length matrix information, and neighborhood gray level difference matrix information; the shape information includes rectangularity, elongation, major axis length, and longest diameter; the statistical measurement information includes maximum value, minimum value, range, and skewness.

Specifically, geoscientific information is used to reflect geographical information such as spatial location distribution characteristics of ground object entities in remote sensing samples, attributes of ground object entities, etc. Geoscientific information of remote sensing samples can be obtained through geoscientific calculation methods, such as geoscientific data extraction and analysis methods.
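Two of the listed quantities are simple enough to sketch directly: the normalized vegetation index, computed with its usual definition (NIR - Red) / (NIR + Red), and the statistical measurements. The function names are hypothetical, the population (biased) skewness estimator is an assumption since the text does not say which estimator is used, and returning 0 for a zero denominator is likewise an assumption.

```python
from typing import Dict, Sequence


def ndvi(nir: float, red: float) -> float:
    """Normalized vegetation index from near-infrared and red reflectance."""
    denom = nir + red
    # Assumption: report 0 when both bands are zero instead of dividing by zero.
    return 0.0 if denom == 0 else (nir - red) / denom


def statistical_measures(values: Sequence[float]) -> Dict[str, float]:
    """Maximum, minimum, range, and skewness of the pixel values of one sample."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    skewness = 0.0 if m2 == 0 else m3 / m2 ** 1.5
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "skewness": skewness,
    }


index = ndvi(nir=0.5, red=0.1)
stats = statistical_measures([1.0, 1.0, 4.0])
```

Values such as these can then be concatenated into the geoscience information vector of a sample.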

S3 clusters the remote sensing sample set based on geoscientific information and obtains k clusters and k cluster centers, where each cluster includes a cluster center, k ≥ 1.

Specifically, in an embodiment of the present invention, step S3 includes:

S31 obtains the location information of each remote sensing sample and constructs a distance calculation strategy based on geoscience information. The distance calculation strategy includes the spatial distance method and the characteristic distance method.

In an embodiment of the present invention, the distance calculation strategy is:

Select two remote sensing samples to be calculated as the first sample and the second sample.

The spatial distance d_s between the first sample and the second sample is obtained according to the spatial distance method.

Specifically, the spatial distance method is:

The Delaunay triangulation network {Del} is constructed based on the position information of the remote sensing samples. {Del} includes multiple Delaunay triangles, and each Delaunay triangle includes three vertices and adjacent edges.

It should be noted that the Delaunay triangle network is a set of connected but non-overlapping Delaunay triangles, and the circumcircles of these Delaunay triangles do not contain any other points in the area. When constructing the Delaunay triangulation network from the location information of the remote sensing samples, the geographical location of each remote sensing sample at the time of imaging, such as its spatial coordinates or longitude and latitude, is used. In the Delaunay triangulation network, each remote sensing sample falls inside the corresponding Delaunay triangle.

Each Delaunay triangle has three vertices and three sides. When a Delaunay triangle is connected to another Delaunay triangle, the two triangles share a side. The sides that a Delaunay triangle shares with other Delaunay triangles are regarded as its adjacent sides. Several cases arise: a Delaunay triangle connected to one other Delaunay triangle has one adjacent side, a triangle connected to two others has two, and a triangle connected to three others has three, so the number of adjacent sides varies from triangle to triangle.

Obtain the Delaunay triangles Del¹ and Del² of the first sample and the second sample in the Delaunay triangle network {Del}.

Get the vertex set {Node1} of Del¹ on its adjacent edges, and get the vertex set {Node2} of Del² on its adjacent edges.

According to the coordinates of each vertex in {Node1} and {Node2}, the two farthest vertices Node₁ and Node₂ are obtained.

Specifically, the spatial position between each two vertices is obtained according to the position of the coordinates of each vertex in the spatial coordinate system.

Specifically, the distance between Node₁ and Node₂ is a spatial distance, which cannot be calculated with a two-dimensional planar method. This embodiment therefore adopts a spatial topological calculation method and uses the adjacent edges of the Delaunay triangles to obtain the distance between the two points. For example, suppose there are two Delaunay triangles, recorded as Del³ and Del⁴, between Del¹ where Node₁ is located and Del² where Node₂ is located: Del¹ is connected to Del³, Del³ is connected to Del⁴, and Del⁴ is connected to Del². Starting from Node₁ and travelling along the adjacent edges of Del¹, Del³, Del⁴, and Del² to Node₂, the shortest spatial path between the two points is obtained, and the distance between the two points is obtained through topological calculation.
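One way to picture a path distance along triangle edges is a shortest-path search over the edge graph of the triangulation. The sketch below is an illustration under assumptions, not the topological calculation of the embodiment: it takes an already computed triangulation as a list of vertex-index triples, weights every triangle edge by its planar Euclidean length, and runs Dijkstra's algorithm. The names `edge_graph` and `path_distance` and the example mesh are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]
Graph = Dict[int, Dict[int, float]]


def edge_graph(points: Sequence[Point], triangles: Sequence[Tuple[int, int, int]]) -> Graph:
    """Build an undirected graph whose edges are the sides of the triangles."""
    graph: Graph = {}
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            w = math.dist(points[u], points[v])
            graph.setdefault(u, {})[v] = w
            graph.setdefault(v, {})[u] = w
    return graph


def path_distance(graph: Graph, start: int, goal: int) -> float:
    """Length of the shortest path from start to goal along triangle edges."""
    best = {start: 0.0}
    heap: List[Tuple[float, int]] = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            return d
        if d > best.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < best.get(v, math.inf):
                best[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf  # goal not reachable from start


# Illustrative mesh of four triangles over six points on a 2 x 1 grid.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 0.0), (2.0, 1.0)]
tris = [(0, 1, 2), (1, 3, 2), (1, 4, 3), (4, 5, 3)]
d_s = path_distance(edge_graph(pts, tris), start=0, goal=5)
```

In this mesh the path along edges from vertex 0 to vertex 5 has length 3, which is longer than the straight-line distance between the two points.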

The characteristic distance d_Eu between the first sample and the second sample is obtained according to the characteristic distance method.

Specifically, the feature distance method is:

Obtain the geoscience information vectors f¹ and f² of the first sample and the second sample according to the geoscience information.

Calculate the Euclidean distance between f¹ and f² as the characteristic distance d_Eu between the first sample and the second sample.

The geoscience information vector is extracted and calculated from the geoscience information. It can be one or more of the following: an elevation information vector, a spectral information vector, a texture information vector, a shape information vector, and a statistical measurement information vector. When there are multiple types, the vectors can be spliced or fused to obtain the geoscience information vector.

Normalize d_s and d_Eu to obtain the normalized results d'_s and d'_Eu, where the ranges of d'_s and d'_Eu are both [0,1].

S32 obtains k initial clustering centers based on the distance calculation strategy.

Specifically, step S32 may include:

S321 randomly selects a remote sensing sample from the remote sensing sample set, uses the remote sensing sample as the initial clustering center, and adds it to the initial clustering center set.

S322 calculates the distance between a single remote sensing sample and all initial cluster centers based on the distance calculation strategy, uses the maximum distance as the first distance of the remote sensing sample, and sorts the first distances of all remote sensing samples from large to small. Select the remote sensing sample with the largest first distance as the new initial cluster center and add it to the initial cluster center set.

Specifically, step S32 is described using an embodiment:

Record the remote sensing sample set as {X₁, X₂, ..., X_n}, and suppose the randomly selected initial clustering center is X_i. The distances between the remaining n-1 remote sensing samples {X₁, X₂, ..., X_{i-1}, X_{i+1}, ..., X_n} and X_i are calculated according to the distance calculation strategy and taken as their respective first distances. The first distances of {X₁, X₂, ..., X_{i-1}, X_{i+1}, ..., X_n} are sorted from large to small, and the remote sensing sample ranked first is screened out. Assuming that this remote sensing sample is X₁, both X₁ and X_i are used as initial clustering centers, and an initial clustering center set is constructed.

Calculate the distances between each of the remaining n-2 remote sensing samples {X₂, ..., X_{i-1}, X_{i+1}, ..., X_n} and X_i and X₁ respectively, and take the largest distance as the first distance of that remote sensing sample. For example, if the distance between X₂ and X_i is greater than the distance between X₂ and X₁, then the first distance of X₂ is its distance from X_i. The first distances of {X₂, ..., X_{i-1}, X_{i+1}, ..., X_n} are then sorted from large to small, and the remote sensing sample ranked first is selected as a new initial clustering center and added to the initial clustering center set.

Initial clustering centers are selected sequentially according to the rules described above until the number of initial clustering centers in the initial clustering center set reaches k. In this embodiment, k can be 6.
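The selection rule of steps S321 to S323 can be sketched as follows. The sketch follows the rule as written, taking the largest distance to any existing center as the first distance of a sample. The `distance` argument is a hypothetical stand-in for the distance calculation strategy, and the one-dimensional samples are purely illustrative.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def initial_centers(
    samples: Sequence[T],
    k: int,
    distance: Callable[[T, T], float],
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Return the indices of k initial cluster centers."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    rng = rng or random.Random()
    centers = [rng.randrange(len(samples))]  # S321: random first center
    while len(centers) < k:                  # S323: repeat until k centers
        remaining = [i for i in range(len(samples)) if i not in centers]
        # S322: first distance = largest distance to any current center.
        first = {
            i: max(distance(samples[i], samples[c]) for c in centers)
            for i in remaining
        }
        # The sample ranked first when sorting from large to small.
        centers.append(max(remaining, key=lambda i: first[i]))
    return centers


# Toy usage with scalar samples and absolute difference as the distance.
values = [0.0, 1.0, 4.0, 10.0]
centers = initial_centers(values, k=2, distance=lambda a, b: abs(a - b))
```

Whatever the random first pick, the second center in this toy set is one of the two extreme values, which shows how the rule spreads the initial centers apart.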

In one embodiment of the present invention, step S33 includes:

S331 obtains the coordinate value of the remote sensing sample based on the location information of the remote sensing sample.

Specifically, the location information of a remote sensing sample can be obtained from its metadata, which is the data recorded when the remote sensing sample is imaged. The location information refers to the actual geographical location of the remote sensing sample during imaging, and from it the coordinate value of the remote sensing sample in the global geographical coordinate system can be obtained.

S332 calculates the distance between a single remote sensing sample and the k initial cluster centers based on the distance calculation strategy, and uses the smallest distance as the second distance of the remote sensing sample.

Specifically, the distance between each remote sensing sample and k initial cluster centers is calculated, that is, k distances can be obtained for each remote sensing sample, and the smallest of these k distances is used as the second distance of the corresponding remote sensing sample.

S333 forms an initial cluster from a single initial clustering center and the remote sensing samples whose distance from that initial clustering center is their second distance, and uses the initial clustering center as the clustering center of that cluster, obtaining the initial k clusters and the initial k clustering centers.

Specifically, in an initial clustering cluster, there is an initial clustering center and multiple remote sensing samples. In the initial clustering cluster, the distance between each remote sensing sample and the initial clustering center is its second distance. The initial clustering center is recorded as the initial clustering center of the initial clustering cluster. Finally, the initial k clusters and the initial k clustering centers are obtained.

S334 calculates the average of the coordinate values of all remote sensing samples within the current single cluster, calculates the difference between the coordinate value of each remote sensing sample and the average value, and uses the remote sensing sample whose coordinate value has the smallest difference as the new clustering center, obtaining new k clustering centers.

Specifically, taking the current cluster as the target, calculate the average of the coordinate values of all remote sensing samples in the single cluster. It should be noted that "all remote sensing samples" here refers to all remote sensing samples in the cluster other than the current cluster center. Then calculate the difference between the coordinate value of each remote sensing sample and the average value, and use the remote sensing sample with the smallest difference as the new cluster center, that is, replace the cluster center. Following these steps, all current cluster centers are replaced and new k clustering centers are obtained.

S335 forms a new cluster from a single new cluster center and the remote sensing samples whose distance from the cluster center is its second distance, and obtains new k clusters.

Specifically, after obtaining the new k clustering centers, new k clustering clusters are formed around the new k clustering centers according to the second distance to complete the update of the clustering clusters.

S336 calculates the distance between each remote sensing sample and the corresponding new cluster center according to the distance calculation strategy, and calculates the sum of squares of all distances to obtain the sum of squares of errors of the new k clusters.

It can be understood that, for each new cluster, the distance between each remote sensing sample and the corresponding new cluster center, that is, the second distance of the remote sensing sample, is calculated. The second distances of the remote sensing samples of all new clusters are then squared and summed to obtain the sum of squared errors of the new k clusters. That is, the sum of squared errors of the new k clusters is a single value. The calculation formula is as follows:

$$SSE = \sum_{i=1}^{k} \sum_{j=1}^{m_i} \lVert X_{ij} - \mu_i \rVert^{2}$$

Among them, SSE represents the sum of squared errors, k is the number of clusters, m_i is the number of remote sensing samples in the i-th cluster, μ_i is the cluster center of the i-th cluster, X_ij is the j-th remote sensing sample in the i-th cluster, and ||X_ij - μ_i|| is the distance between that remote sensing sample and the cluster center.

Specifically, the iteration stop condition may be that the change value between the sums of squared errors obtained in two adjacent iterations is 0, that is, the sum of squared errors has been minimized. Alternatively, the iteration stop condition may be that the maximum number of iterations is reached; for example, if the maximum number of iterations is 6, the iteration stops after 6 iterations. Alternatively, the iteration stop condition may be that the change value reaches a threshold, which can be set to 0.2.
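Steps S332 to S337 can be sketched as a small refinement loop. This is a simplified illustration under assumptions: samples are referred to by index, `dist` is a hypothetical stand-in for the distance calculation strategy, the new center is chosen among all members of a cluster, and "the change value reaches a threshold" is interpreted as the absolute change in the sum of squared errors falling to the threshold or below.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Coord = Tuple[float, float]
Dist = Callable[[int, int], float]  # distance between two samples, by index


def assign(n: int, centers: Sequence[int], dist: Dist) -> Dict[int, List[int]]:
    """S333/S335: attach every sample to its nearest center (its second distance)."""
    clusters: Dict[int, List[int]] = {c: [] for c in centers}
    for i in range(n):
        clusters[min(centers, key=lambda c: dist(i, c))].append(i)
    return clusters


def new_center(members: List[int], center: int, coords: Sequence[Coord]) -> int:
    """S334: the member whose coordinates lie closest to the cluster mean."""
    others = [i for i in members if i != center]  # the mean excludes the current center
    if not others:
        return center
    mean = (sum(coords[i][0] for i in others) / len(others),
            sum(coords[i][1] for i in others) / len(others))
    return min(members, key=lambda i: math.dist(coords[i], mean))


def sse(clusters: Dict[int, List[int]], dist: Dist) -> float:
    """S336: sum of squared distances between samples and their cluster center."""
    return sum(dist(i, c) ** 2 for c, members in clusters.items() for i in members)


def refine(coords: Sequence[Coord], centers: Sequence[int], dist: Dist,
           tol: float = 0.2, max_iter: int = 6) -> Dict[int, List[int]]:
    """S337: iterate until the SSE change is at most tol or max_iter is reached."""
    clusters = assign(len(coords), centers, dist)
    previous = sse(clusters, dist)
    for _ in range(max_iter):
        centers = [new_center(m, c, coords) for c, m in clusters.items()]
        clusters = assign(len(coords), centers, dist)
        current = sse(clusters, dist)
        if abs(previous - current) <= tol:
            break
        previous = current
    return clusters


# Toy usage: two well-separated groups; planar distance stands in for the strategy.
xy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
planar: Dist = lambda i, j: math.dist(xy[i], xy[j])
final = refine(xy, centers=[1, 4], dist=planar)
```

In the toy run the centers move from samples 1 and 4 to samples 0 and 3, the sum of squared errors drops from 6 to 4, and the loop stops once a further iteration leaves it unchanged.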

S4 calculates the distance between each cluster center and the remote sensing samples in the corresponding cluster. Each cluster selects the remote sensing sample closest to the cluster center and the farthest remote sensing sample to obtain 2k remote sensing samples.

Specifically, for each single cluster, the distance between each remote sensing sample in the cluster and the cluster center is calculated, still according to the distance calculation strategy. The distances are sorted from large to small, and the first-ranked and last-ranked remote sensing samples are selected. In this way, 2k remote sensing samples are selected from the k clusters.
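The per-cluster selection of step S4 can be sketched as below. Excluding the cluster center itself from the candidates is an assumption (otherwise the nearest sample would always be the center at distance 0), and `dist` is again a hypothetical stand-in for the distance calculation strategy.

```python
from typing import Callable, Dict, List


def seed_samples(
    clusters: Dict[int, List[int]],
    dist: Callable[[int, int], float],
) -> List[int]:
    """Pick the nearest and the farthest member of every cluster (by index)."""
    picked: List[int] = []
    for center, members in clusters.items():
        others = [i for i in members if i != center]
        if not others:
            continue  # a cluster holding only its center contributes nothing
        nearest = min(others, key=lambda i: dist(i, center))
        farthest = max(others, key=lambda i: dist(i, center))
        picked.append(nearest)
        if farthest != nearest:
            picked.append(farthest)
    return picked


# Toy usage: scalar positions, two clusters centered on samples 0 and 3.
positions = [0.0, 1.0, 3.0, 10.0, 12.0, 17.0]
toy_clusters = {0: [0, 1, 2], 3: [3, 4, 5]}
seeds = seed_samples(toy_clusters, lambda i, j: abs(positions[i] - positions[j]))
```

With two members besides the center in every cluster, the sketch returns exactly 2k samples; very small clusters yield fewer.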

S5 hands the unlabeled samples among the 2k remote sensing samples to experts for labeling, combines the expert labeling results with the labeled samples in the remote sensing sample set to form a labeled sample set, and divides the remote sensing sample set into a labeled sample set and an unlabeled sample set.

Specifically, if the selected 2k remote sensing samples include unlabeled samples, they will first be handed over to experts for labeling and converted into labeled samples. Then all remote sensing samples will be re-divided according to whether they are labeled or not, obtaining labeled sample sets and unlabeled sample sets.

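S6 performs model training on the first classifier model through the labeled sample set, and determines whether the conditions for terminating the training of the first classifier model are met:

If satisfied, end training and execute step S9;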

If not satisfied, execute step S7.

S7 inputs the unlabeled sample set into the first classifier model for prediction, and combines geoscience information and sample query strategies for screening to obtain a valuable sample set.

Specifically, step S7 includes:

S71 calculates the information entropy and probability density of each unlabeled sample in the unlabeled sample set, and calculates the product of the information entropy and probability density of each unlabeled sample. It combines the product and difference constraints to screen the unlabeled samples to obtain key samples.

S72 obtains labeled samples in the same cluster as the key samples as important samples.

Specifically, step S7 uses active learning to query samples. This embodiment uses information entropy to measure the informativeness of unlabeled samples, which is defined as follows:

$$H(x) = -\sum_{j} P(y_j \mid x; \theta) \log P(y_j \mid x; \theta)$$

Among them, P(y_j | x; θ) represents the probability that the unlabeled sample x belongs to the j-th category.
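As a minimal sketch, the information entropy of one unlabeled sample can be computed from the class probabilities predicted by the classifier. The function name is hypothetical, and the natural logarithm is an assumption, since the base is not stated.

```python
import math
from typing import Sequence


def information_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a predicted class distribution (natural logarithm)."""
    # Terms with zero probability contribute nothing and are skipped.
    return -sum(p * math.log(p) for p in probs if p > 0)


uncertain = information_entropy([0.5, 0.5])   # the classifier cannot decide
confident = information_entropy([1.0, 0.0])   # the classifier is certain
```

A sample on which the classifier is undecided has high entropy and is therefore more informative to label than one it already classifies with certainty.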

In addition, this embodiment uses probability density to estimate the representativeness of unlabeled samples, which is defined as follows:

$$p(x) = \frac{1}{m} \sum_{i=1}^{m} K(x, x_i)$$

Among them, m is the number of unlabeled samples, x_i is the i-th unlabeled sample, and K is the Gaussian kernel function.
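A density estimate built from a Gaussian kernel can be sketched as follows. The `bandwidth` parameter, the unnormalized kernel, and the function names are assumptions introduced for illustration; the text states only that a Gaussian kernel function is averaged over the m unlabeled samples.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def gaussian_kernel(x: Vector, y: Vector, bandwidth: float = 1.0) -> float:
    """Unnormalized Gaussian kernel; equals 1 when x and y coincide."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * bandwidth ** 2))


def density(x: Vector, pool: Sequence[Vector], bandwidth: float = 1.0) -> float:
    """Average kernel value between x and the m samples of the unlabeled pool."""
    return sum(gaussian_kernel(x, y, bandwidth) for y in pool) / len(pool)


# Toy pool: two coincident samples and one isolated sample.
pool = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
dense = density((0.0, 0.0), pool)
sparse = density((3.0, 4.0), pool)
```

A sample lying in a crowded region of feature space receives a higher density, and is in that sense more representative, than an isolated one.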

The product of information entropy and probability density is calculated for each unlabeled sample, and the products are sorted from small to large. The first unlabeled sample is directly selected as a key sample, and the remaining unlabeled samples must satisfy the difference constraint. The difference constraint concerns the difference between the currently queried unlabeled sample and the existing key samples. This difference is measured using the products of information entropy and probability density: the maximum of the differences between the product of the currently queried unlabeled sample and the product of each existing key sample is taken as the difference of that unlabeled sample. The difference must be lower than the difference threshold, which can be set to 0.1.
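The screening rule can be sketched as below, following the wording of this paragraph literally: products are sorted in ascending order, the first sample is always kept, and each later sample is kept only if its maximum difference to the already selected key samples is below the threshold. The function name is hypothetical, and taking the absolute value of the difference is an assumption.

```python
def screen_key_samples(products, diff_threshold=0.1):
    """Return indices of key samples chosen under the difference constraint.

    products[i] is the product of information entropy and probability
    density for unlabeled sample i.
    """
    # Indices sorted by product, from small to large.
    order = sorted(range(len(products)), key=lambda i: products[i])
    if not order:
        return []
    key_samples = [order[0]]  # the first sample is selected directly
    for i in order[1:]:
        # Difference of sample i: maximum gap to any existing key sample.
        diff = max(abs(products[i] - products[k]) for k in key_samples)
        if diff < diff_threshold:
            key_samples.append(i)
    return key_samples

keys = screen_key_samples([0.30, 0.05, 0.12, 0.50, 0.09])
```

With the default threshold of 0.1, the samples with products 0.05, 0.09 and 0.12 are kept in this example, while 0.30 and 0.50 are rejected.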

After the key samples are obtained from the query, the corresponding labeled samples are obtained according to the cluster where each key sample is located, and these labeled samples are used as important samples corresponding to the key samples.

The geoscience information vectors of the key samples and of the important samples are obtained from the geoscience information. The characteristic distance between a single key sample and each of its corresponding important samples is then calculated with the characteristic distance method, and the largest characteristic distance is taken as the third distance of that key sample. The third distances of all key samples are compared with the distance threshold, and the key samples whose third distance is greater than the distance threshold are added to the valuable sample set. The distance threshold can be set to 0.5.
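The third-distance filter can be illustrated with the following sketch. It assumes that the characteristic distance is the Euclidean distance between geoscience information vectors, and that the vectors are scaled so that a threshold of 0.5 is meaningful; the function name and the handling of key samples without any important sample are hypothetical.

```python
import numpy as np

def select_valuable_samples(key_vectors, important_vectors, distance_threshold=0.5):
    """Return indices of key samples whose third distance exceeds the threshold.

    key_vectors[i]       : geoscience information vector of key sample i.
    important_vectors[i] : vectors of the labeled samples in the same cluster.
    """
    valuable = []
    for idx, (key_vec, imp_vecs) in enumerate(zip(key_vectors, important_vectors)):
        imp = np.asarray(imp_vecs, dtype=float)
        if imp.size == 0:
            continue  # assumption: skip key samples that have no important sample
        # Characteristic (Euclidean) distance to every important sample.
        dists = np.linalg.norm(imp - np.asarray(key_vec, dtype=float), axis=1)
        third_distance = dists.max()
        if third_distance > distance_threshold:
            valuable.append(idx)
    return valuable

valuable = select_valuable_samples(
    [[0.0, 0.0], [1.0, 1.0]],
    [[[0.1, 0.0], [0.6, 0.0]], [[1.0, 1.2], [0.9, 1.0]]],
)
```

In this example only the first key sample is kept, because its largest distance to a labeled sample in its cluster (0.6) exceeds the threshold.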

S8 hands the valuable sample set to experts for labeling, adds the expert-labeled valuable sample set to the labeled sample set, updates the unlabeled sample set, and then returns to step S6.
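The iteration over steps S6 to S8 can be summarized by the schematic loop below. It is only a sketch: `train`, `should_stop`, `query` and `ask_expert` are hypothetical callables standing in for the first classifier model, its termination condition, the screening of step S7 and the expert, and the `max_rounds` cap and the early exit on an empty query are added safeguards that the text does not mention.

```python
def active_learning_loop(labeled, unlabeled, train, should_stop, query,
                         ask_expert, max_rounds=100):
    """Run the S6-S8 cycle and return (model, labeled, unlabeled).

    labeled   : dict mapping sample id -> label.
    unlabeled : iterable of sample ids without a label.
    """
    labeled = dict(labeled)
    unlabeled = set(unlabeled)
    model = None
    for _ in range(max_rounds):
        model = train(labeled)                       # S6: train the classifier
        if should_stop(model, labeled) or not unlabeled:
            break                                    # condition met: go to S9
        valuable = query(model, unlabeled, labeled)  # S7: screen valuable samples
        if not valuable:
            break                                    # assumption: nothing left to ask
        for sample_id in valuable:                   # S8: expert labeling
            labeled[sample_id] = ask_expert(sample_id)
            unlabeled.discard(sample_id)
    return model, labeled, unlabeled

# Toy usage with stand-in callables.
model, labeled, unlabeled = active_learning_loop(
    {0: 0, 1: 1}, {2, 3, 4, 5},
    train=lambda lab: len(lab),
    should_stop=lambda m, lab: m >= 4,
    query=lambda m, unl, lab: [min(unl)],
    ask_expert=lambda s: s % 2,
)
```

The returned model is the one that would then be used to label the remaining unlabeled samples.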

The above embodiments are only used to illustrate the present invention and are not intended to limit it. Those of ordinary skill in the relevant technical fields can make various changes and modifications without departing from the spirit and scope of the present invention, so all equivalent technical solutions also fall within the scope of the present invention, and the scope of patent protection of the present invention should be defined by the claims.

Claims

A remote sensing sample labeling method based on geoscience information and active learning, which is characterized by including the following steps:

S1 obtains a remote sensing sample set, which consists of multiple remote sensing samples. The remote sensing samples are divided into unlabeled samples and labeled samples;

S2 performs geoscience calculations on the remote sensing sample set to obtain geoscience information, where the geoscience information includes elevation information, spectral information, texture information, shape information, and statistical measurement information;

S3 clusters the remote sensing sample set based on geoscience information and obtains k clusters and k cluster centers, where each cluster includes a cluster center, k ≥ 1;

S4 calculates the distance between each cluster center and the remote sensing samples in the corresponding cluster. Each cluster selects the remote sensing sample closest to the cluster center and the farthest remote sensing sample to obtain 2k remote sensing samples;

S5 hands the unlabeled samples among the 2k remote sensing samples to experts for labeling, combines the expert labeling results with the labeled samples in the remote sensing sample set to form a labeled sample set, and thereby divides the remote sensing sample set into a labeled sample set and an unlabeled sample set;

S6 performs model training on the first classifier model through the labeled sample set, and determines whether the conditions for terminating the training of the first classifier model are met:

If satisfied, end training and execute step S9;

If not satisfied, execute step S7;

S7 inputs the unlabeled sample set into the first classifier model for prediction, and screens it in combination with the geoscientific information and the sample query strategy to obtain a valuable sample set;

S8 After handing over the value sample set to experts for labeling, add the value sample set labeled by the experts to the labeled sample set, update the unlabeled sample set and return to step S6;

S9 uses the first classifier model to label the unlabeled sample set and obtains the labeling result.
A remote sensing sample labeling method based on geoscience information and active learning as claimed in claim 1, characterized in that step S3 includes:

S31 obtains the location information of each remote sensing sample and constructs a distance calculation strategy based on geoscience information. The distance calculation strategy includes the spatial distance method and the characteristic distance method;

S32 obtains k initial clustering centers based on the distance calculation strategy;

S33 combines the location information and distance calculation strategy of remote sensing samples to iteratively optimize the k initial clustering centers to obtain k clusters and k clustering centers.
A remote sensing sample labeling method based on geoscience information and active learning as claimed in claim 2, characterized in that step S32 includes:

S321 randomly selects a remote sensing sample from the remote sensing sample set, uses the remote sensing sample as the initial clustering center, and adds it to the initial clustering center set;

S322 calculates the distance between a single remote sensing sample and all initial cluster centers based on the distance calculation strategy, uses the maximum distance as the first distance of the remote sensing sample, sorts the first distances of all remote sensing samples from large to small, selects the remote sensing sample with the largest first distance as a new initial cluster center, and adds it to the initial cluster center set;

S323 Repeat step S322 until the number of initial clustering centers in the initial clustering center set reaches k.
A remote sensing sample labeling method based on geoscience information and active learning as claimed in claim 3, characterized in that step S33 includes:

S331 obtains the coordinate value of the remote sensing sample based on the location information of the remote sensing sample;

S332 calculates the distance between a single remote sensing sample and each of the k initial clustering centers based on the distance calculation strategy, and takes the smallest distance as the second distance of the remote sensing sample;

S333 forms an initial cluster from a single initial cluster center and the remote sensing samples whose distance from that initial cluster center is their second distance, and uses the initial cluster center as the cluster center of that cluster, thereby obtaining k initial clusters and k initial cluster centers;

S334 averages the coordinate values of all remote sensing samples within the current single cluster, calculates the difference between the coordinate value of each remote sensing sample and the average value, and uses the remote sensing sample whose coordinate value has the smallest difference as the new cluster center, thereby obtaining k new cluster centers;

S335 forms a new cluster from a single new cluster center and the remote sensing samples whose distance from that cluster center is their second distance, thereby obtaining k new clusters;

S336 calculates the distance between each remote sensing sample and the corresponding new cluster center according to the distance calculation strategy, and calculates the sum of squares of all distances to obtain the sum of squares of errors of the new k clusters;

S337 iteratively executes steps S334-S336. Each iteration obtains k clusters, their k cluster centers, and the sum of squared errors of the k clusters. A change value is calculated from the sums of squared errors of two adjacent iterations, and it is determined whether the change value meets the iteration stop condition; if so, the iteration stops and the final k clusters and k cluster centers are obtained.
A remote sensing sample annotation method based on geoscientific information and active learning as claimed in claim 2, characterized in that the distance calculation strategy is:

Select two remote sensing samples to be calculated as the first sample and the second sample;

The spatial distance d_s between the first sample and the second sample is obtained according to the spatial distance method;

The characteristic distance d_Eu between the first sample and the second sample is obtained according to the characteristic distance method;

Normalize d_s and d_Eu to obtain the normalized results d'_s and d'_Eu, where the ranges of d'_s and d'_Eu are both [0,1];

Calculate the sum of d'_s and d'_Eu as the distance between the first sample and the second sample.
A remote sensing sample labeling method based on geoscience information and active learning as claimed in claim 5, characterized in that the spatial distance method is:

Construct a Delaunay triangulation network {Del} based on the position information of the remote sensing samples, where {Del} includes multiple Delaunay triangles and each Delaunay triangle includes three vertices and adjacent edges; obtain the Delaunay triangles Del_1 and Del_2 in {Del} in which the first sample and the second sample are respectively located;

Get the vertex set {Node1} of Del_1 on its adjacent edges, and get the vertex set {Node2} of Del_2 on its adjacent edges;

According to the coordinates of each vertex in {Node1} and {Node2}, get the two vertices Node_1 and Node_2 that are farthest apart in spatial position;

Calculate the distance between Node_1 and Node_2 according to the spatial topological relationship, as the spatial distance d_s between the first sample and the second sample.
A remote sensing sample labeling method based on geoscience information and active learning as claimed in claim 6, characterized in that the adjacent edges of a Delaunay triangle are the edges shared by that Delaunay triangle with other Delaunay triangles, and the number of adjacent edges varies from one Delaunay triangle to another.
A remote sensing sample annotation method based on geoscientific information and active learning as claimed in claim 5, characterized in that the feature distance method is:

Obtain the geoscience information vectors f_1 and f_2 of the first sample and the second sample according to the geoscience information;

Calculate the Euclidean distance between f_1 and f_2 as the characteristic distance d_Eu between the first sample and the second sample:

$$d_{Eu} = \lVert f_1 - f_2 \rVert_2 = \sqrt{\sum_{i} \left(f_{1,i} - f_{2,i}\right)^2}$$
A remote sensing sample labeling method based on geoscience information and active learning as claimed in claim 8, characterized in that step S7 includes:

S71 calculates the information entropy and probability density of each unlabeled sample in the unlabeled sample set, and calculates the product of the information entropy and probability density of each unlabeled sample, and combines the product and difference constraints to screen the unlabeled samples to obtain key samples;

S72 obtains the labeled samples in the same cluster as the key samples as important samples;

S73 calculates the characteristic distance between each key sample and its corresponding important sample as the third distance, and adds the key samples whose third distance is greater than the distance threshold to the value sample set.
A remote sensing sample labeling method based on geoscience information and active learning as claimed in claim 1, characterized by:

The elevation information includes DEM information, ground slope information, and terrain roughness information;

The spectral information includes normalized vegetation index and enhanced vegetation index;

The texture information includes gray level co-occurrence matrix information, gray level run length matrix information, and neighborhood gray level difference matrix information;

The shape information includes rectangularity, elongation, major axis length, and longest diameter;

The statistical measurement information includes maximum value, minimum value, range, and skewness.