WO2024061050A1 - Remote-sensing sample labeling method based on geoscientific information and active learning - Google Patents

Remote-sensing sample labeling method based on geoscientific information and active learning

Info

Publication number
WO2024061050A1
Authority
WO
WIPO (PCT)
Prior art keywords
sample
remote sensing
distance
information
samples
Prior art date
Application number
PCT/CN2023/118178
Other languages
French (fr)
Chinese (zh)
Inventor
陈婷
段红伟
李洁
董铱斐
邹圣兵
Original Assignee
北京数慧时空信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京数慧时空信息技术有限公司
Publication of WO2024061050A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations

Definitions

  • the invention relates to the field of remote sensing image classification, and specifically relates to a remote sensing sample labeling method based on geoscience information and active learning.
  • This invention is oriented to labeling remote sensing samples in large-area scenarios.
  • The traditional supervised learning method requires every sample to be labeled, so it is difficult to apply in practice to large-area scenarios.
  • Active learning is a method that reduces the cost of sample labeling while maintaining labeling accuracy.
  • Traditional supervised learning methods require experts to label samples, and experts usually label training samples based on the visual characteristics of the scene. If samples are handed to experts without prior screening, the experts spend a great deal of valuable time fully labeling samples that carry similar amounts of information. This wastes manual effort and makes the training set highly redundant, which greatly reduces training speed.
  • the technical problem to be solved by this invention is to comprehensively utilize the spatial characteristics and statistical characteristics of remote sensing samples, organically combine geoscience information with data mining methods, and increase the accuracy of sample labeling.
  • the present invention provides a remote sensing sample labeling method based on geoscience information and active learning, including:
  • S1 obtains a remote sensing sample set, which consists of multiple remote sensing samples.
  • the remote sensing samples are divided into unlabeled samples and labeled samples;
  • S2 performs geoscience calculations on the remote sensing sample set to obtain geoscience information, where the geoscience information includes elevation information, spectral information, texture information, shape information, and statistical measurement information;
  • S3 clusters the remote sensing sample set according to the geoscientific information to obtain k clusters and k cluster centers, where each cluster includes a cluster center, k ≥ 1;
  • S4 calculates the distance between each cluster center and the remote sensing samples in the corresponding cluster.
  • Each cluster selects the remote sensing sample closest to the cluster center and the remote sensing sample farthest away, resulting in 2k remote sensing samples;
  • S5 hands the unlabeled samples among the 2k remote sensing samples to experts for labeling, combines the expert labeling results with the labeled samples in the remote sensing sample set to form a labeled sample set, and divides the remote sensing sample set into a labeled sample set and an unlabeled sample set;
  • S6 performs model training on the first classifier model through the labeled sample set, and determines whether the conditions for terminating the training of the first classifier model are met:
  • S7 inputs the unlabeled sample set into the first classifier model for prediction, and combines geoscience information and sample query strategies for screening to obtain a valuable sample set;
  • S9 uses the first classifier model to label the unlabeled sample set to obtain the labeling result.
  • step S3 includes:
  • the distance calculation strategy includes the spatial distance method and the characteristic distance method
  • S33 combines the location information and distance calculation strategy of remote sensing samples to iteratively optimize the k initial clustering centers to obtain k clusters and k clustering centers.
  • step S32 includes:
  • S321 randomly selects a remote sensing sample from the remote sensing sample set, uses the remote sensing sample as the initial clustering center, and adds it to the initial clustering center set;
  • S322 calculates the distance between a single remote sensing sample and all initial cluster centers based on the distance calculation strategy, uses the maximum distance as the first distance of the remote sensing sample, and sorts the first distances of all remote sensing samples from large to small. Select the remote sensing sample with the largest first distance as the new initial cluster center and add it to the initial cluster center set;
  • S323 repeats step S322 until the number of initial clustering centers in the initial clustering center set reaches k.
  • step S33 includes:
  • S331 obtains the coordinate value of the remote sensing sample according to the position information of the remote sensing sample
  • S332 calculates the distance between a single remote sensing sample and the k initial cluster centers based on the distance calculation strategy, and uses the smallest distance as the second distance of the remote sensing sample;
  • S333 forms an initial cluster from a single initial clustering center and the remote sensing samples whose distance from that initial clustering center is their second distance, and uses that center as the cluster's initial clustering center, obtaining the initial k clusters and the initial k cluster centers;
  • S334 calculates the average value of the coordinate values of all remote sensing samples in the current single cluster, and calculates the difference between the coordinate value of each remote sensing sample and the average value, and uses the remote sensing sample corresponding to the coordinate value with the smallest difference as a new cluster center to obtain new k cluster centers;
  • S335 forms a new cluster from a single new cluster center and remote sensing samples whose distance from the cluster center is its second distance, and obtains new k clusters;
  • S336 calculates the distance between each remote sensing sample and the corresponding new cluster center according to the distance calculation strategy, and calculates the sum of squares of all distances to obtain the sum of squares of errors of the new k clusters;
  • S337 iteratively executes steps S334-S336.
  • Each iteration obtains k clusters, their k cluster centers, and the sum of squared errors of the k clusters.
  • A change value is calculated from the sums of squared errors of two adjacent iterations; if the change value meets the iteration stop condition, the iteration stops and the final k clusters and k cluster centers are obtained.
  • the distance calculation strategy is:
  • the spatial distance d s between the first sample and the second sample is obtained according to the spatial distance method
  • the characteristic distance d Eu between the first sample and the second sample is obtained
  • the spatial distance method is:
  • Construct a Delaunay triangulation Del according to the location information of the remote sensing samples; Del includes multiple Delaunay triangles, and each Delaunay triangle includes three vertices and adjacent edges;
  • the adjacent sides of a Delaunay triangle are the sides it shares with other Delaunay triangles, and the number of adjacent sides can differ from one Delaunay triangle to another.
  • the feature distance method is:
  • step S7 includes:
  • S71 calculates the information entropy and probability density of each unlabeled sample in the unlabeled sample set, and calculates the product of the information entropy and probability density of each unlabeled sample, and combines the product and difference constraints to screen the unlabeled samples to obtain key samples;
  • S73 calculates the characteristic distance between each key sample and its corresponding important sample as the third distance, and adds the key samples whose third distance is greater than the distance threshold to the value sample set.
  • the elevation information includes DEM information, ground slope information, and terrain roughness information
  • the spectral information includes normalized vegetation index and enhanced vegetation index
  • the texture information includes gray level co-occurrence matrix information, gray level run length matrix information, and neighborhood gray level difference matrix information;
  • the shape information includes rectangularity, elongation, major axis length, and longest diameter
  • the statistical measurement information includes maximum value, minimum value, range, and skewness.
  • the present invention provides a remote sensing sample labeling method based on geoscience information and active learning.
  • the beneficial effects of the present invention are:
  • This invention clusters samples based on geoscientific information. It comprehensively utilizes the spatial and statistical characteristics of remote sensing samples to obtain clusters that are continuous both in their characteristics and in space, and it selects and labels the initial samples from those clusters. Compared with existing active learning methods, it better ensures the diversity of the samples.
  • the present invention can reduce the cost of sample labeling and quickly improve the classification effect of the classifier model.
  • the present invention uses a sample query strategy combined with geoscience information to screen unlabeled samples and obtain a value sample set, which can obtain value samples that are both representative and informative.
  • Figure 1 is a method flow chart of an embodiment of the present invention.
  • this embodiment provides a remote sensing sample labeling method based on geoscience information and active learning, including:
  • the remote sensing sample set consists of multiple remote sensing samples.
  • the remote sensing samples are divided into unlabeled samples and labeled samples.
  • multiple remote sensing samples are obtained, including unlabeled samples and labeled samples, to form a remote sensing sample set.
  • the number of unlabeled samples is much larger than the number of labeled samples.
  • S2 performs geoscience calculations on the remote sensing sample set to obtain geoscience information, where the geoscience information includes elevation information, spectral information, texture information, shape information, and statistical measurement information.
  • the elevation information includes DEM information, ground slope information, and terrain roughness information
  • the spectral information includes normalized vegetation index and enhanced vegetation index
  • the texture information includes gray level co-occurrence matrix information, gray level run length matrix information, and neighborhood gray level difference matrix information
  • the shape information includes rectangularity, elongation, major axis length, and longest diameter
  • the statistical measurement information includes maximum value, minimum value, range, and skewness.
  • geoscientific information is used to reflect geographical information such as spatial location distribution characteristics of ground object entities in remote sensing samples, attributes of ground object entities, etc.
  • Geoscientific information of remote sensing samples can be obtained through geoscientific calculation methods, such as geoscientific data extraction and analysis methods.
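As a small illustration of the geoscience calculation in S2, the sketch below computes one spectral feature (the normalized difference vegetation index, using its standard definition (NIR - Red) / (NIR + Red)) and the four statistical measures listed above. The band values, the function names, and the choice of population skewness are assumptions made for this example; the source does not specify how each quantity is computed.

```python
import math


def ndvi(nir, red):
    """Normalized difference vegetation index for a single pixel."""
    total = nir + red
    # Guard against division by zero for pixels with no signal in either band.
    return 0.0 if total == 0 else (nir - red) / total


def statistical_measures(values):
    """Maximum, minimum, range and population skewness of a list of values."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    # Skewness is undefined for constant data; report 0.0 in that case.
    skew = 0.0 if std == 0 else sum((v - mean) ** 3 for v in values) / (n * std ** 3)
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "skewness": skew,
    }


# Hypothetical reflectance values for the pixels of one remote sensing sample.
nir_band = [0.60, 0.55, 0.70, 0.20]
red_band = [0.10, 0.15, 0.10, 0.18]

ndvi_values = [ndvi(n, r) for n, r in zip(nir_band, red_band)]
stats = statistical_measures(ndvi_values)
```

The resulting numbers could be collected into the spectral and statistical parts of a sample's geoscience information.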
  • S3 clusters the remote sensing sample set based on geoscientific information and obtains k clusters and k cluster centers, where each cluster includes a cluster center, k ≥ 1.
  • step S3 includes:
  • the distance calculation strategy includes the spatial distance method and the characteristic distance method.
  • the distance calculation strategy is:
  • the spatial distance d s between the first sample and the second sample is obtained according to the spatial distance method.
  • the spatial distance method is:
  • the Delaunay triangulation network Del is constructed based on the position information of the remote sensing samples.
  • Del includes multiple Delaunay triangles, and each Delaunay triangle includes three vertices and adjacent edges.
  • the Delaunay triangle network is a set of connected but non-overlapping Delaunay triangles, and the circumcircles of these Delaunay triangles do not include any other points in this area.
  • the position information is the geographical location of the remote sensing samples during imaging, such as spatial coordinates or longitude and latitude.
  • each remote sensing sample falls inside the corresponding Delaunay triangle.
  • each Delaunay triangle has three vertices and three sides.
  • When a Delaunay triangle is connected to another Delaunay triangle, the two Delaunay triangles share the same side.
  • the sides that a Delaunay triangle shares with other Delaunay triangles are regarded as the adjacent sides of that Delaunay triangle.
  • the spatial position between each two vertices is obtained according to the position of the coordinates of each vertex in the spatial coordinate system.
  • the distance between Node 1 and Node 2 is a spatial distance, which cannot be calculated according to a two-dimensional plane method. Therefore, this embodiment adopts a spatial topological calculation method and uses the adjacent edges of the Delaunay triangles to obtain the distance between the two points. For example, there are two Delaunay triangles between Del 1, where Node 1 is located, and Del 2, where Node 2 is located; they are recorded as Del 3 and Del 4. Del 1 is connected to Del 3, Del 3 is connected to Del 4, and Del 4 is connected to Del 2. Starting from Node 1 and moving along the adjacent edges of Del 1, Del 3, Del 4, and Del 2 to Node 2, the shortest spatial path between the two points is obtained, and the distance between the two points is obtained through topological calculation.
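The triangle-to-triangle walk in the example above can be sketched as a breadth-first search over triangles that share an edge. This is only an illustration: the triangles are supplied by hand as vertex-index triples rather than computed by a Delaunay routine, the function names are hypothetical, and the source does not say how the final distance is derived from the path, so the sketch stops at the chain of triangles.

```python
from collections import deque


def triangle_adjacency(triangles):
    """Map each triangle index to the triangles that share an edge with it."""
    edge_owners = {}
    for t, (a, b, c) in enumerate(triangles):
        for edge in ((a, b), (b, c), (a, c)):
            edge_owners.setdefault(frozenset(edge), []).append(t)
    neighbors = {t: set() for t in range(len(triangles))}
    for owners in edge_owners.values():
        for t in owners:
            neighbors[t].update(o for o in owners if o != t)
    return neighbors


def triangle_path(neighbors, start, goal):
    """Shortest chain of edge-adjacent triangles from start to goal (BFS)."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for nxt in sorted(neighbors[current]):
            if nxt not in previous:
                previous[nxt] = current
                queue.append(nxt)
    return None  # the two triangles are not connected


# A hand-made strip of four triangles over vertices 0..5 (illustrative only).
# Index 0 plays the role of Del 1, indices 1 and 2 of Del 3 and Del 4,
# and index 3 of Del 2.
triangles = [(0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5)]
neighbors = triangle_adjacency(triangles)
path = triangle_path(neighbors, start=0, goal=3)
shared_edges_crossed = len(path) - 1
```

In this strip the path passes through all four triangles and crosses three shared edges, matching the Del 1, Del 3, Del 4, Del 2 chain described in the text.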
  • the characteristic distance d Eu between the first sample and the second sample is obtained according to the characteristic distance method.
  • the feature distance method is:
  • the geoscience information vector is extracted and calculated based on the geoscience information. Specifically, it can be one or more of the following: elevation information vector, spectral information vector, texture information vector, shape information vector, and statistical measurement information vector. When there are multiple types, the vectors can be spliced or fused to obtain the geoscience information vector.
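A minimal sketch of the feature distance follows, assuming that splicing means simple concatenation and that the subscript in d Eu denotes a Euclidean distance between geoscience information vectors. The formula itself does not survive in this text, so both points are assumptions, as are the function names and the feature values.

```python
import math


def splice(*vectors):
    """Concatenate several per-category feature vectors into one vector."""
    spliced = []
    for vector in vectors:
        spliced.extend(vector)
    return spliced


def feature_distance(a, b):
    """Euclidean distance between two geoscience information vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical elevation, spectral and statistical features for two samples.
first = splice([120.0, 3.5], [0.71], [0.9, 0.1])
second = splice([118.0, 4.0], [0.65], [0.8, 0.2])
d_eu = feature_distance(first, second)
```

In practice the features would likely need to be scaled to comparable ranges first, since elevation in meters would otherwise dominate an index such as NDVI.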
  • S32 obtains k initial clustering centers based on the distance calculation strategy.
  • step S32 may include:
  • S321 randomly selects a remote sensing sample from the remote sensing sample set, uses the remote sensing sample as the initial clustering center, and adds it to the initial clustering center set.
  • S322 calculates the distance between a single remote sensing sample and all initial cluster centers based on the distance calculation strategy, uses the maximum distance as the first distance of the remote sensing sample, and sorts the first distances of all remote sensing samples from large to small. Select the remote sensing sample with the largest first distance as the new initial cluster center and add it to the initial cluster center set.
  • S323 repeats step S322 until the number of initial clustering centers in the initial clustering center set reaches k.
  • step S32 is described using an embodiment:
  • Denote the remote sensing sample set as {X_1, X_2, ..., X_n} and suppose X_i is randomly selected as the first initial clustering center. The distances between the remaining n-1 remote sensing samples {X_1, X_2, ..., X_{i-1}, X_{i+1}, ..., X_n} and X_i are calculated as their respective first distances.
  • the first distances of {X_1, X_2, ..., X_{i-1}, X_{i+1}, ..., X_n} are sorted from large to small, and the remote sensing sample ranked first is selected.
  • Assuming that this remote sensing sample is X_1, both X_1 and X_i are used as initial clustering centers, and an initial clustering center set is constructed.
  • Initial clustering centers are selected sequentially according to the rules described above until the number of initial clustering centers in the initial clustering center set reaches k.
  • k can be 6.
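Steps S321 to S323 can be sketched as follows. Plain Euclidean distance between coordinate tuples stands in for the distance calculation strategy, and already chosen centers are skipped when searching for the next one; both are assumptions made for this example, as are the function name and the `seed` parameter.

```python
import math
import random


def initial_centers(samples, k, seed=0):
    """Pick k initial cluster centers (as sample indices) following S321-S323."""
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    rng = random.Random(seed)
    centers = [rng.randrange(len(samples))]  # S321: random first center
    while len(centers) < k:  # S323: repeat until k centers exist
        best_index, best_first = None, -1.0
        for i, sample in enumerate(samples):
            if i in centers:
                continue  # assumption: a chosen center is not picked again
            # S322: the first distance is the maximum distance to any
            # existing center, as the source describes it.
            first = max(math.dist(sample, samples[c]) for c in centers)
            if first > best_first:
                best_index, best_first = i, first
        centers.append(best_index)
    return centers


# Hypothetical sample coordinates.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.0, 10.0), (5.0, 5.0)]
centers = initial_centers(points, k=3)
```

Spreading the initial centers far apart in this way tends to avoid starting with several centers inside the same dense region.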
  • S33 combines the location information and distance calculation strategy of remote sensing samples to iteratively optimize the k initial clustering centers to obtain k clusters and k clustering centers.
  • step S33 includes:
  • S331 obtains the coordinate value of the remote sensing sample based on the location information of the remote sensing sample.
  • the location information of the remote sensing sample can be obtained based on the metadata of the remote sensing sample, which is the data obtained when the remote sensing sample is imaged. It refers to the actual geographical location information of the remote sensing sample during imaging.
  • the coordinate value of the remote sensing sample can be obtained based on the location information.
  • S332 calculates the distance between a single remote sensing sample and the k initial cluster centers based on the distance calculation strategy, and uses the smallest distance as the second distance of the remote sensing sample.
  • the distance between each remote sensing sample and k initial cluster centers is calculated, that is, k distances can be obtained for each remote sensing sample, and the smallest of these k distances is used as the second distance of the corresponding remote sensing sample.
  • S333 forms an initial cluster from a single initial clustering center and the remote sensing samples whose distance from that initial clustering center is their second distance, and uses that center as the cluster's initial clustering center, obtaining the initial k clusters and the initial k cluster centers.
  • In an initial cluster there is one initial clustering center and multiple remote sensing samples.
  • the distance between each remote sensing sample and the initial clustering center is its second distance.
  • the initial clustering center is recorded as the initial clustering center of the initial clustering cluster.
  • the initial k clusters and the initial k clustering centers are obtained.
  • S334 averages the coordinate values of all remote sensing samples within the current single cluster, calculates the difference between the coordinate value of each remote sensing sample and the average value, and uses the remote sensing sample whose coordinate value has the smallest difference as the new clustering center, obtaining new k clustering centers.
  • S335 forms a new cluster from a single new cluster center and the remote sensing samples whose distance from the cluster center is its second distance, and obtains new k clusters.
  • new k clustering clusters are formed around the new k clustering centers according to the second distance to complete the update of the clustering clusters.
  • S336 calculates the distance between each remote sensing sample and the corresponding new cluster center according to the distance calculation strategy, and calculates the sum of squares of all distances to obtain the sum of squares of errors of the new k clusters.
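  • The sum of squared errors is $SSE = \sum_{i=1}^{k} \sum_{j=1}^{m_i} d_{ij}^{2}$, where SSE represents the sum of squared errors, k is the number of clusters, $m_i$ is the number of remote sensing samples in the i-th cluster, and $d_{ij}$ is the distance between the j-th remote sensing sample in the i-th cluster and that cluster's center.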
  • S337 iteratively executes steps S334-S336.
  • Each iteration obtains k clusters, their k cluster centers, and the sum of squared errors of the k clusters.
  • A change value is calculated from the sums of squared errors of two adjacent iterations; if the change value meets the iteration stop condition, the iteration stops and the final k clusters and k cluster centers are obtained.
  • the iteration stop condition may be that the change value between the sums of squared errors obtained in two adjacent iterations is 0, that is, the sum of squared errors has been minimized. Alternatively, the iteration stop condition may be that the maximum number of iterations is reached; for example, if the maximum number of iterations is 6, the iteration stops after 6 iterations. The iteration stop condition may also be that the change value reaches a threshold, which can be set to 0.2.
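The refinement loop of S332 to S337 can be sketched as below. Euclidean distance on sample coordinates stands in for the distance calculation strategy, samples are assumed to have distinct coordinates, and the stop rule used is "change in the sum of squared errors no greater than a tolerance, or maximum iterations reached". The function names and the `tolerance` parameter are hypothetical.

```python
import math


def assign(samples, centers):
    """S332/S333/S335: group every sample with its nearest center."""
    clusters = {c: [] for c in centers}
    for i, sample in enumerate(samples):
        nearest = min(centers, key=lambda c: math.dist(sample, samples[c]))
        clusters[nearest].append(i)
    return clusters


def update_centers(samples, clusters):
    """S334: the member closest to the cluster's mean coordinate becomes the center."""
    new_centers = []
    for members in clusters.values():
        dims = range(len(samples[members[0]]))
        mean = [sum(samples[i][d] for i in members) / len(members) for d in dims]
        new_centers.append(min(members, key=lambda i: math.dist(samples[i], mean)))
    return new_centers


def sum_squared_errors(samples, clusters):
    """S336: sum of squared sample-to-center distances over all clusters."""
    return sum(
        math.dist(samples[i], samples[c]) ** 2
        for c, members in clusters.items()
        for i in members
    )


def refine(samples, centers, max_iterations=6, tolerance=0.0):
    """S337: iterate S334-S336 until the SSE change meets the stop condition."""
    clusters = assign(samples, centers)
    error = sum_squared_errors(samples, clusters)
    for _ in range(max_iterations):
        centers = update_centers(samples, clusters)
        clusters = assign(samples, centers)
        new_error = sum_squared_errors(samples, clusters)
        change = abs(error - new_error)
        error = new_error
        if change <= tolerance:
            break
    return centers, clusters, error


# Hypothetical coordinates forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
final_centers, final_clusters, final_error = refine(points, centers=[1, 4])
```

Because each center is always one of the samples, this behaves like a medoid-style variant of k-means rather than standard k-means with free-floating centroids.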
  • S4 calculates the distance between each cluster center and the remote sensing samples in the corresponding cluster.
  • Each cluster selects the remote sensing sample closest to the cluster center and the farthest remote sensing sample to obtain 2k remote sensing samples.
  • the distance between each remote sensing sample in the cluster and the cluster center is calculated.
  • the distance is still calculated according to the distance calculation strategy.
  • the distances are sorted from large to small, and the remote sensing samples ranked first and last are selected.
  • selecting the first and the last remote sensing sample in this way from each of the k clusters finally yields 2k remote sensing samples.
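The selection in S4 can be sketched as follows, again with Euclidean distance standing in for the distance calculation strategy. Each cluster is given as a center point plus its other member points, which assumes the center itself is not counted among the candidates; the source does not state this, and the function name is hypothetical.

```python
import math


def select_initial_samples(clusters):
    """For each (center, members) pair, return the closest and farthest member."""
    selected = []
    for center, members in clusters:
        # Sort from large to small, as in the text: first is farthest, last is closest.
        ranked = sorted(members, key=lambda m: math.dist(m, center), reverse=True)
        selected.append(ranked[-1])  # closest to the center
        selected.append(ranked[0])   # farthest from the center
    return selected


# Hypothetical clusters with k = 2.
clusters = [
    ((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0), (0.0, 2.0)]),
    ((10.0, 10.0), [(10.0, 11.0), (14.0, 10.0)]),
]
selected = select_initial_samples(clusters)
```

With k = 2 clusters this yields 2k = 4 samples: the closest sample is typical of its cluster, while the farthest one lies near the cluster boundary.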
  • S5 hands the unlabeled samples among the 2k remote sensing samples to experts for labeling, combines the expert labeling results with the labeled samples in the remote sensing sample set to form a labeled sample set, and divides the remote sensing sample set into a labeled sample set and an unlabeled sample set.
  • If the selected 2k remote sensing samples include unlabeled samples, those samples are first handed over to experts for labeling and thereby converted into labeled samples. All remote sensing samples are then re-divided according to whether they are labeled, yielding the labeled sample set and the unlabeled sample set.
  • S6 performs model training on the first classifier model through the labeled sample set, and determines whether the conditions for terminating the training of the first classifier model are met:
  • S7 inputs the unlabeled sample set into the first classifier model for prediction, and combines geoscience information and sample query strategies for screening to obtain a valuable sample set.
  • step S7 includes:
  • S71 calculates the information entropy and probability density of each unlabeled sample in the unlabeled sample set, and calculates the product of the information entropy and probability density of each unlabeled sample. It combines the product and difference constraints to screen the unlabeled samples to obtain key samples.
  • S72 obtains labeled samples in the same cluster as the key samples as important samples.
  • S73 calculates the characteristic distance between each key sample and its corresponding important sample as the third distance, and adds the key samples whose third distance is greater than the distance threshold to the value sample set.
  • step S7 uses active learning to query samples.
  • This embodiment chooses to use information entropy to measure the informativeness of unlabeled samples, which is defined as follows: $H(x) = -\sum_{j} P(y_j \mid x; \theta) \log P(y_j \mid x; \theta)$,
  • where $P(y_j \mid x; \theta)$ represents the probability that the unlabeled sample x belongs to the j-th category.
  • this embodiment chooses to use probability density to estimate the representativeness of unlabeled samples, which is defined as follows:
  • where m is the number of unlabeled samples and the kernel is the Gaussian kernel function.
  • the difference constraint refers to the difference between the currently queried unlabeled sample and the existing key samples.
  • specifically, the difference can be measured using the product of the information entropy and the probability density: the maximum of the differences between the product of the currently queried unlabeled sample and the product of each existing key sample is used as the difference of the unlabeled sample.
  • the difference needs to be lower than the difference threshold, which can be set to 0.1.
  • the corresponding labeled samples are obtained according to the cluster where each key sample is located, and these labeled samples are used as important samples corresponding to the key samples.
  • the distance threshold can be set to 0.5.
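The scoring in S71 and the distance filter in S73 can be sketched as below. The sketch assumes Shannon entropy over the classifier's predicted class probabilities and an unnormalized Gaussian kernel averaged over all m unlabeled samples as the density estimate. It leaves out the difference constraint and simply keeps the highest-scoring samples as key samples. The function names, the `bandwidth` parameter, and the one-important-sample-per-key mapping are all hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy of one sample's predicted class probabilities."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def gaussian_density(index, features, bandwidth=1.0):
    """Average Gaussian kernel value between one sample and all m unlabeled samples."""
    x = features[index]
    kernel_sum = sum(
        math.exp(-math.dist(x, other) ** 2 / (2 * bandwidth ** 2))
        for other in features
    )
    return kernel_sum / len(features)


def key_samples(probabilities, features, count):
    """S71 (simplified): rank by entropy * density and keep the top `count` indices."""
    scores = [
        entropy(p) * gaussian_density(i, features)
        for i, p in enumerate(probabilities)
    ]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:count], scores


def value_samples(keys, features, important, distance_threshold=0.5):
    """S73: keep key samples whose third distance exceeds the threshold."""
    return [
        i for i in keys
        if math.dist(features[i], important[i]) > distance_threshold
    ]


# Hypothetical classifier outputs and feature vectors for three unlabeled samples.
probabilities = [[0.5, 0.5], [0.9, 0.1], [1.0, 0.0]]
features = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
keys, scores = key_samples(probabilities, features, count=2)

# Hypothetical important (labeled) sample for each key sample's cluster (S72).
important = {0: (0.2, 0.0), 1: (2.0, 0.0)}
valuable = value_samples(keys, features, important)
```

Multiplying entropy by density favors samples that are both uncertain for the classifier and located in a dense region, while the distance filter drops key samples that lie too close to an already labeled sample.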
  • S9 uses the first classifier model to label the unlabeled sample set to obtain the labeling result.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention belongs to the field of classification of remote-sensing images. Disclosed is a remote-sensing sample labeling method based on geoscientific information and active learning. The method comprises: acquiring remote-sensing sample sets; performing geoscientific calculation on the remote-sensing sample sets, so as to obtain geoscientific information; clustering the remote-sensing sample sets according to the geoscientific information; obtaining a set of labeled samples and a set of unlabeled samples in combination with an active learning method; performing model training on a first classifier model by means of the set of labeled samples, inputting the set of unlabeled samples into the first classifier model for prediction, and performing screening in combination with the geoscientific information and a sample query strategy, so as to obtain a value sample set; after the labeling of the value sample set is performed by an expert, adding, into the set of labeled samples, the value sample set, the labeling of which is performed by the expert; and performing labeling on the set of unlabeled samples by means of the first classifier model, so as to obtain a labeling result. The labeling method in the present invention can improve the accuracy of labeling.

Description

Remote sensing sample labeling method based on geoscience information and active learning
Technical field
The invention relates to the field of remote sensing image classification, and specifically to a remote sensing sample labeling method based on geoscience information and active learning.
Background art
This invention is directed at labeling remote sensing samples in large-area scenarios. Traditional supervised learning methods require every sample to be labeled, so they are difficult to apply in practice at that scale, whereas active learning reduces labeling cost while maintaining labeling accuracy. Traditional supervised learning methods rely on experts to label samples, and experts usually label training samples based on the visual characteristics of the scene. If samples are handed to experts without prior screening, the experts spend a great deal of valuable time fully labeling samples that carry similar amounts of information. This wastes manual effort and makes the training set highly redundant, which greatly reduces training speed and can even cause overfitting. For satellite remote sensing images, an automatic process for defining an effective training set is therefore needed: the training set should contain as few samples as possible while still effectively improving the accuracy of the classification model. Active learning arose to meet this need. Active learning requires only a very small number of labeled samples for the initial training of the classifier, far fewer than would be needed to fully train a classifier. A specific screening strategy then selects a specific number of samples from the samples still to be labeled, and these selected samples are labeled manually. Finally, the newly labeled samples are used for incremental training of the classifier.
However, at a large regional or global scale, even when an active learning screening strategy is used to reduce the number of samples to label, the number of samples requiring manual labeling remains relatively large. The result is very high labor cost, a large data processing volume, and a trained classifier model with low accuracy, which makes it difficult to complete sample labeling at a large regional or global scale. The main reason is that existing active learning methods cannot make full use of the information in remote sensing samples.
Summary of the Invention
The technical problem to be solved by the present invention is to make combined use of the spatial and statistical features of remote sensing samples, integrating geoscientific information with data mining methods, so as to increase the accuracy of sample labeling.
To achieve the above object, the present invention provides a remote sensing sample labeling method based on geoscientific information and active learning, including:
S1: Obtain a remote sensing sample set consisting of a plurality of remote sensing samples, the remote sensing samples being divided into unlabeled samples and labeled samples;
S2: Perform geoscientific computation on the remote sensing sample set to obtain geoscientific information, where the geoscientific information includes elevation information, spectral information, texture information, shape information, and statistical measurement information;
S3: Cluster the remote sensing sample set according to the geoscientific information to obtain k clusters and k cluster centers, where each cluster includes one cluster center and k ≥ 1;
S4: Compute the distance between each cluster center and the remote sensing samples in its cluster, and from each cluster select the sample nearest to the cluster center and the sample farthest from it, giving 2k remote sensing samples;
S5: Hand the unlabeled samples among the 2k remote sensing samples to experts for labeling, combine the expert-labeled results with the labeled samples already in the remote sensing sample set to form a labeled sample set, and divide the remote sensing sample set into the labeled sample set and an unlabeled sample set;
S6: Train a first classifier model on the labeled sample set, and determine whether the termination condition for training the first classifier model is met:
if it is met, end training and go to step S9;
if it is not met, go to step S7;
S7: Input the unlabeled sample set into the first classifier model for prediction, and screen the samples using the geoscientific information together with a sample query strategy to obtain a value sample set;
S8: After the value sample set has been labeled by experts, add the expert-labeled value sample set to the labeled sample set, update the unlabeled sample set, and return to step S6;
S9: Label the unlabeled sample set with the first classifier model to obtain the labeling result.
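The control flow of steps S6 to S8 can be sketched as a generic loop. This is a minimal illustration rather than the claimed implementation: the callables `train`, `should_stop`, `query`, and `oracle` are hypothetical placeholders for the classifier training, the termination test, the screening of step S7, and the expert labeling, and `max_rounds` is a safety cap that the source does not mention.

```python
from typing import Callable, Dict, Hashable, List, Set


def active_learning_loop(
    labeled: Dict[Hashable, object],      # sample id -> label (result of S5)
    unlabeled: Set[Hashable],             # ids that still lack a label
    train: Callable[[dict], object],      # S6: fit a classifier on labeled data
    should_stop: Callable[[object, dict], bool],  # S6: termination test
    query: Callable[[object, set], List[Hashable]],  # S7: pick value samples
    oracle: Callable[[Hashable], object],  # S8: expert labeling
    max_rounds: int = 100,                # assumption: safety cap, not in source
):
    model = train(labeled)
    for _ in range(max_rounds):
        if should_stop(model, labeled) or not unlabeled:
            break
        picked = query(model, unlabeled)
        if not picked:
            break
        for sample_id in picked:
            labeled[sample_id] = oracle(sample_id)  # expert provides the label
            unlabeled.discard(sample_id)            # update the unlabeled set
        model = train(labeled)                      # back to S6
    return model, labeled, unlabeled
```

Step S9 would then apply the returned model to the remaining unlabeled ids.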
In a specific embodiment of the present invention, step S3 includes:
S31: Obtain the location information of each remote sensing sample and construct a distance calculation strategy based on the geoscientific information, the strategy comprising a spatial distance method and a feature distance method;
S32: Obtain k initial cluster centers based on the distance calculation strategy;
S33: Iteratively optimize the k initial cluster centers using the location information of the remote sensing samples and the distance calculation strategy, to obtain k clusters and k cluster centers.
In a specific embodiment of the present invention, step S32 includes:
S321: Randomly select one remote sensing sample from the remote sensing sample set, take it as an initial cluster center, and add it to the initial cluster center set;
S322: Using the distance calculation strategy, compute the distance between each remote sensing sample and every initial cluster center, and take the largest of these distances as the first distance of that sample; sort the first distances of all remote sensing samples in descending order, select the sample with the largest first distance as a new initial cluster center, and add it to the initial cluster center set;
S323: Repeat step S322 until the initial cluster center set contains k initial cluster centers.
In a specific embodiment of the present invention, step S33 includes:
S331: Obtain the coordinate values of each remote sensing sample from its location information;
S332: Using the distance calculation strategy, compute the distance between each remote sensing sample and each of the k initial cluster centers, and take the smallest distance as the second distance of that sample;
S333: Group each initial cluster center with the remote sensing samples whose distance to that center equals their second distance to form an initial cluster, with that initial cluster center as the cluster's initial cluster center, giving k initial clusters and k initial cluster centers;
S334: Within each current cluster, average the coordinate values of all remote sensing samples, compute the difference between each sample's coordinate values and the average, and take the sample with the smallest difference as the new cluster center, giving k new cluster centers;
S335: Group each new cluster center with the remote sensing samples whose distance to that center equals their second distance to form a new cluster, giving k new clusters;
S336: Using the distance calculation strategy, compute the distance between each remote sensing sample and its new cluster center, and sum the squares of all these distances to obtain the sum of squared errors of the k new clusters;
S337: Iterate steps S334 to S336. Each iteration yields k clusters, their k cluster centers, and the sum of squared errors of the k clusters. Compute the change in the sum of squared errors between two consecutive iterations and check whether it satisfies the iteration stop condition; if so, stop iterating and take the final k clusters and k cluster centers.
In a specific embodiment of the present invention, the distance calculation strategy is:
Select the two remote sensing samples whose distance is to be computed as the first sample and the second sample;
Obtain the spatial distance $d_s$ between the first sample and the second sample using the spatial distance method;
Obtain the feature distance $d_{Eu}$ between the first sample and the second sample using the feature distance method;
Normalize $d_s$ and $d_{Eu}$ to obtain $d'_s$ and $d'_{Eu}$, each in the range [0, 1];
Compute the sum of $d'_s$ and $d'_{Eu}$ as the distance between the first sample and the second sample.
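The normalization and summation above can be sketched as follows. The source does not say how the two distances are normalized; this sketch assumes min-max scaling over the set of sample pairs being compared, and the function names are illustrative only.

```python
def min_max(values):
    """Scale values to [0, 1]; assumption: min-max scaling, not stated in source."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_distances(spatial, feature):
    """spatial[i] and feature[i] are d_s and d_Eu for the i-th sample pair.

    Returns d'_s + d'_Eu per pair, so every result lies in [0, 2].
    """
    if len(spatial) != len(feature):
        raise ValueError("both lists must describe the same sample pairs")
    ds_norm = min_max(spatial)
    de_norm = min_max(feature)
    return [a + b for a, b in zip(ds_norm, de_norm)]
```

Because both terms are scaled to the same range before they are added, neither the spatial nor the feature component dominates merely because of its units.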
In a specific embodiment of the present invention, the spatial distance method is:
Construct a Delaunay triangulation {Del} from the location information of the remote sensing samples; {Del} comprises a plurality of Delaunay triangles, each having three vertices and adjacent edges;
Obtain the Delaunay triangles $Del_1$ and $Del_2$ in {Del} that contain the first sample and the second sample, respectively;
Obtain the set {Node1} of vertices on the adjacent edges of $Del_1$, and the set {Node2} of vertices on the adjacent edges of $Del_2$;
From the coordinates of the vertices in {Node1} and {Node2}, obtain the two vertices $Node_1$ and $Node_2$ that are spatially farthest apart;
Compute the distance between $Node_1$ and $Node_2$ according to the spatial topological relationship, and take it as the spatial distance $d_s$ between the first sample and the second sample.
In a specific embodiment of the present invention, the adjacent edges of a Delaunay triangle are the edges it shares with other Delaunay triangles, and the number of adjacent edges is not necessarily the same for every Delaunay triangle.
In a specific embodiment of the present invention, the feature distance method is:
Obtain the geoscientific information vectors $f_1$ and $f_2$ of the first sample and the second sample from the geoscientific information;
Compute the Euclidean distance between $f_1$ and $f_2$ as the feature distance $d_{Eu}$ between the first sample and the second sample:
$$d_{Eu} = \lVert f_1 - f_2 \rVert_2 = \sqrt{\sum_{j} \left(f_{1,j} - f_{2,j}\right)^2}$$
In a specific embodiment of the present invention, step S7 includes:
S71: Compute the information entropy and the probability density of each unlabeled sample in the unlabeled sample set, compute the product of the two for each sample, and screen the unlabeled samples using this product together with a diversity constraint to obtain key samples;
S72: Obtain the labeled samples that lie in the same cluster as a key sample, as important samples;
S73: Compute the feature distance between each key sample and its corresponding important samples as the third distance, and add the key samples whose third distance exceeds a distance threshold to the value sample set.
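Steps S71 to S73 can be sketched as below, under several stated assumptions: entropy is the Shannon entropy of the classifier's predicted class probabilities, the probability density of each sample is supplied as an input, the diversity constraint of S71 is replaced by a simple top-n cut, and the third distance is taken as the distance to the nearest important sample in the same cluster. The source fixes none of these details, and all names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_value_samples(candidates, important, top_n, threshold):
    """candidates: dicts with 'id', 'probs', 'density', 'cluster', 'features'.
    important: cluster id -> feature vectors of labeled samples in that cluster.
    """
    # S71 (simplified): rank by entropy * density and keep the top_n as key samples.
    ranked = sorted(
        candidates,
        key=lambda c: entropy(c["probs"]) * c["density"],
        reverse=True,
    )
    key_samples = ranked[:top_n]

    value_ids = []
    for c in key_samples:
        refs = important.get(c["cluster"], [])  # S72: same-cluster labeled samples
        # S73 (assumption): third distance = distance to the nearest important sample.
        third = min((euclidean(c["features"], r) for r in refs), default=math.inf)
        if third > threshold:
            value_ids.append(c["id"])
    return value_ids
```

A key sample that sits close to an already labeled sample of its cluster is dropped, which keeps the expert from labeling near-duplicates.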
In a specific embodiment of the present invention:
The elevation information includes DEM information, ground slope information, and terrain roughness information;
The spectral information includes the normalized difference vegetation index and the enhanced vegetation index;
The texture information includes gray-level co-occurrence matrix information, gray-level run-length matrix information, and neighborhood gray-level difference matrix information;
The shape information includes rectangularity, elongation, major axis length, and longest diameter;
The statistical measurement information includes maximum, minimum, range, and skewness.
The present invention provides a remote sensing sample labeling method based on geoscientific information and active learning. In summary, by adopting the above technical solution, the present invention has the following beneficial effects:
(1) The present invention clusters samples based on geoscientific information. It can make combined use of the spatial and statistical features of remote sensing samples to obtain clusters that are continuous both in feature space and in geographic space, and it selects and labels the initial samples from these clusters. Compared with existing active learning methods, this better preserves the diversity of the samples.
(2) The present invention can reduce the cost of sample labeling and quickly improve the classification performance of the classifier model.
(3) The present invention screens unlabeled samples with a sample query strategy that incorporates geoscientific information to obtain the value sample set, yielding value samples that are both representative and informative.
Brief Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Figure 1 is a flow chart of the method according to an embodiment of the present invention.
Detailed Description
Specific implementations of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present invention but do not limit its scope.
As shown in Figure 1, this embodiment provides a remote sensing sample labeling method based on geoscientific information and active learning, including:
S1: Obtain a remote sensing sample set consisting of a plurality of remote sensing samples, the remote sensing samples being divided into unlabeled samples and labeled samples.
First, a plurality of remote sensing samples, including unlabeled samples and labeled samples, are obtained to form the remote sensing sample set; the number of unlabeled samples is much larger than the number of labeled samples.
S2: Perform geoscientific computation on the remote sensing sample set to obtain geoscientific information, where the geoscientific information includes elevation information, spectral information, texture information, shape information, and statistical measurement information.
Here, the elevation information includes DEM information, ground slope information, and terrain roughness information; the spectral information includes the normalized difference vegetation index and the enhanced vegetation index; the texture information includes gray-level co-occurrence matrix information, gray-level run-length matrix information, and neighborhood gray-level difference matrix information; the shape information includes rectangularity, elongation, major axis length, and longest diameter; the statistical measurement information includes maximum, minimum, range, and skewness.
Specifically, geoscientific information is geographic information that reflects the spatial distribution of ground-object entities in the remote sensing samples, the attributes of those entities, and the like. The geoscientific information of the remote sensing samples can be obtained through geoscientific computation methods, such as methods for extracting and analyzing geoscientific data.
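Two of the listed quantities are simple enough to sketch directly. The example below computes the normalized difference vegetation index from near-infrared and red reflectance using its standard definition, and the four statistical measures from a list of pixel values. The source gives no formulas, so the skewness definition (third central moment divided by the 1.5 power of the second) and the function names are assumptions made for illustration.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    denom = nir + red
    return 0.0 if denom == 0 else (nir - red) / denom


def basic_stats(values):
    """Maximum, minimum, range, and population skewness of pixel values."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    skew = 0.0 if m2 == 0 else m3 / m2 ** 1.5
    return {
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "skewness": skew,
    }
```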
S3: Cluster the remote sensing sample set according to the geoscientific information to obtain k clusters and k cluster centers, where each cluster includes one cluster center and k ≥ 1.
Specifically, in an embodiment of the present invention, step S3 includes:
S31: Obtain the location information of each remote sensing sample and construct a distance calculation strategy based on the geoscientific information, the strategy comprising a spatial distance method and a feature distance method.
In an embodiment of the present invention, the distance calculation strategy is:
Select the two remote sensing samples whose distance is to be computed as the first sample and the second sample.
Obtain the spatial distance $d_s$ between the first sample and the second sample using the spatial distance method.
Specifically, the spatial distance method is:
Construct a Delaunay triangulation {Del} from the location information of the remote sensing samples; {Del} comprises a plurality of Delaunay triangles, each having three vertices and adjacent edges.
It should be noted that a Delaunay triangulation is a set of connected but non-overlapping Delaunay triangles, and the circumcircle of each of these triangles contains no other point of the region. When the Delaunay triangulation is constructed from the location information of the remote sensing samples, the geographic position of each sample at imaging time, such as its spatial coordinates or its longitude and latitude, is used. In the Delaunay triangulation, each remote sensing sample falls inside its corresponding Delaunay triangle.
Each Delaunay triangle has three vertices and three edges. When one Delaunay triangle is connected to another, the two triangles share an edge, and the edges that a Delaunay triangle shares with other Delaunay triangles are its adjacent edges. Several cases arise: a triangle connected to one other Delaunay triangle has one adjacent edge, a triangle connected to two others has two, and a triangle connected to three others has three. The number of adjacent edges is therefore not necessarily the same for every Delaunay triangle.
Obtain the Delaunay triangles $Del_1$ and $Del_2$ in {Del} that contain the first sample and the second sample, respectively.
Obtain the set {Node1} of vertices on the adjacent edges of $Del_1$, and the set {Node2} of vertices on the adjacent edges of $Del_2$.
From the coordinates of the vertices in {Node1} and {Node2}, obtain the two vertices $Node_1$ and $Node_2$ that are spatially farthest apart.
Specifically, the spatial relationship between each pair of vertices is obtained from the positions of their coordinates in the spatial coordinate system.
Compute the distance between $Node_1$ and $Node_2$ according to the spatial topological relationship, and take it as the spatial distance $d_s$ between the first sample and the second sample.
Specifically, the distance between $Node_1$ and $Node_2$ is a spatial distance and cannot be computed with a two-dimensional planar method. This embodiment therefore uses a spatial topological calculation that follows the adjacent edges of the Delaunay triangles to obtain the distance between the two points. For example, suppose two Delaunay triangles, denoted $Del_3$ and $Del_4$, lie between $Del_1$, which contains $Node_1$, and $Del_2$, which contains $Node_2$, with $Del_1$ connected to $Del_3$, $Del_3$ connected to $Del_4$, and $Del_4$ connected to $Del_2$. Starting from $Node_1$ and proceeding along the adjacent edges of $Del_1$, $Del_3$, $Del_4$, and $Del_2$ until $Node_2$ is reached gives the shortest spatial path between the two points, and the distance between them is obtained by topological calculation.
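One way to read the "topological calculation" is as a shortest path over the edges of the triangulation. The sketch below takes that reading as an assumption: it builds a graph from triangle vertex triples, weights each edge by its planar length, and runs Dijkstra's algorithm between two vertices. For simplicity it walks all triangle edges rather than only shared ones, and it takes the triangles as given instead of constructing the triangulation.

```python
import heapq
import math


def edge_graph(points, triangles):
    """Vertex adjacency weighted by edge length, from triangle index triples."""
    graph = {i: {} for i in range(len(points))}
    for a, b, c in triangles:
        for u, v in ((a, b), (b, c), (c, a)):
            w = math.dist(points[u], points[v])
            graph[u][v] = w
            graph[v][u] = w
    return graph


def shortest_edge_path(graph, start, goal):
    """Length of the shortest path along edges (Dijkstra); inf if unreachable."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            return d
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return math.inf
```

With `start` and `goal` set to the indices of $Node_1$ and $Node_2$, the returned length would play the role of $d_s$ under this interpretation.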
Obtain the feature distance $d_{Eu}$ between the first sample and the second sample using the feature distance method.
Specifically, the feature distance method is:
Obtain the geoscientific information vectors $f_1$ and $f_2$ of the first sample and the second sample from the geoscientific information;
Compute the Euclidean distance between $f_1$ and $f_2$ as the feature distance $d_{Eu}$ between the first sample and the second sample:
$$d_{Eu} = \lVert f_1 - f_2 \rVert_2 = \sqrt{\sum_{j} \left(f_{1,j} - f_{2,j}\right)^2}$$
Here, the geoscientific information vector is extracted and computed from the geoscientific information. It may be one or more of an elevation information vector, a spectral information vector, a texture information vector, a shape information vector, and a statistical measurement information vector; when more than one is used, the vectors can be concatenated or fused to obtain the geoscientific information vector.
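The concatenation option and the Euclidean feature distance can be sketched as follows. The helper names are illustrative, and the sketch assumes plain concatenation with no per-feature scaling, which the source does not specify.

```python
import math


def build_feature_vector(*groups):
    """Concatenate per-category vectors (elevation, spectral, ...) into one."""
    vec = []
    for group in groups:
        vec.extend(float(v) for v in group)
    return vec


def feature_distance(f1, f2):
    """Euclidean distance d_Eu between two geoscientific information vectors."""
    if len(f1) != len(f2):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))
```

In practice, features on very different scales (for example elevation in meters and a vegetation index in [-1, 1]) would likely need rescaling before concatenation so that one category does not dominate the distance.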
Normalize $d_s$ and $d_{Eu}$ to obtain $d'_s$ and $d'_{Eu}$, each in the range [0, 1].
Compute the sum of $d'_s$ and $d'_{Eu}$ as the distance between the first sample and the second sample.
S32: Obtain k initial cluster centers based on the distance calculation strategy.
Specifically, step S32 may include:
S321: Randomly select one remote sensing sample from the remote sensing sample set, take it as an initial cluster center, and add it to the initial cluster center set.
S322: Using the distance calculation strategy, compute the distance between each remote sensing sample and every initial cluster center, and take the largest of these distances as the first distance of that sample; sort the first distances of all remote sensing samples in descending order, select the sample with the largest first distance as a new initial cluster center, and add it to the initial cluster center set.
S323: Repeat step S322 until the initial cluster center set contains k initial cluster centers.
Specifically, step S32 is illustrated with an example:
Denote the remote sensing sample set as $X = \{X_1, X_2, \ldots, X_n\}$, where n is the number of remote sensing samples in the set. A sample $X_i$ is selected at random from X, and the distance between $X_i$ and each of the remaining n-1 samples $\{X_1, X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n\}$ is computed; this distance is the first distance of each of those samples. The first distances are sorted in descending order and the top-ranked sample is picked out. Assuming this sample is $X_1$, both $X_1$ and $X_i$ become initial cluster centers and form the initial cluster center set.
The distances between each of the remaining n-2 samples $\{X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n\}$ and both $X_i$ and $X_1$ are then computed, and the larger distance is taken as the first distance of the corresponding sample. For example, if the distance between $X_2$ and $X_i$ is greater than the distance between $X_2$ and $X_1$, the first distance of $X_2$ is its distance to $X_i$. The first distances of $\{X_2, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n\}$ are again sorted in descending order, and the top-ranked sample is picked out as a new initial cluster center and added to the initial cluster center set.
Initial cluster centers are selected in turn following the procedure described above until the initial cluster center set contains k initial cluster centers. In this embodiment, k may be 6.
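The selection procedure of step S32 can be sketched as below. The code follows the source literally in taking the largest distance to the current centers as a sample's first distance; conventional farthest-first seeding uses the smallest distance instead, so this is a sketch of the rule as written, not of that variant. The `distance` callable and the optional `first_index` argument (used here to make the run reproducible) are assumptions.

```python
import random


def pick_initial_centers(samples, k, distance, first_index=None, seed=None):
    """Return indices of k initial cluster centers chosen as in S321 to S323."""
    rng = random.Random(seed)
    start = rng.randrange(len(samples)) if first_index is None else first_index
    centers = [start]                       # S321: first center
    while len(centers) < k:                 # S323: repeat until k centers
        best_idx, best_first = None, -1.0
        for i in range(len(samples)):
            if i in centers:
                continue
            # S322: first distance = largest distance to any current center.
            first_dist = max(distance(samples[i], samples[c]) for c in centers)
            if first_dist > best_first:
                best_idx, best_first = i, first_dist
        if best_idx is None:                # fewer samples than k
            break
        centers.append(best_idx)
    return centers
```

For the one-dimensional samples `[0, 4, 5, 10]` with absolute difference as the distance and index 0 as the first center, the second center is index 3, and the third is index 1 because its first distance is max(4, 6) = 6 while index 2 has max(5, 5) = 5.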
S33: Iteratively optimize the k initial cluster centers using the location information of the remote sensing samples and the distance calculation strategy, to obtain k clusters and k cluster centers.
In an embodiment of the present invention, step S33 includes:
S331: Obtain the coordinate values of each remote sensing sample from its location information.
Specifically, the location information of a remote sensing sample can be obtained from its metadata, which is recorded when the sample is imaged and refers to the actual geographic position of the sample at imaging time. The coordinate values of the sample in the global geographic coordinate system follow from this location information.
S332: Using the distance calculation strategy, compute the distance between each remote sensing sample and each of the k initial cluster centers, and take the smallest distance as the second distance of that sample.
Specifically, the distance between each remote sensing sample and each of the k initial cluster centers is computed, so that every sample has k distances; the smallest of these k distances is the second distance of that sample.
S333: Group each initial cluster center with the remote sensing samples whose distance to that center equals their second distance to form an initial cluster, with that initial cluster center as the cluster's initial cluster center, giving k initial clusters and k initial cluster centers.
Specifically, an initial cluster contains one initial cluster center and a plurality of remote sensing samples. Within the initial cluster, the distance between each remote sensing sample and the initial cluster center is that sample's second distance, and the initial cluster center is recorded as the initial cluster center of that cluster. The result is k initial clusters and k initial cluster centers.
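Steps S332 and S333 amount to assigning every sample to its nearest center. A minimal sketch, assuming samples are addressed by index and `distance` is any callable implementing the distance calculation strategy:

```python
def assign_to_centers(samples, center_ids, distance):
    """Map each center index to the indices of the samples nearest to it."""
    clusters = {c: [c] for c in center_ids}  # each center belongs to its own cluster
    for i in range(len(samples)):
        if i in clusters:
            continue
        # The second distance is the smallest distance to any center;
        # the sample joins the cluster of the center that attains it.
        nearest = min(center_ids, key=lambda c: distance(samples[i], samples[c]))
        clusters[nearest].append(i)
    return clusters
```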
S334: Within each current cluster, average the coordinate values of all remote sensing samples, compute the difference between each sample's coordinate values and the average, and take the sample with the smallest difference as the new cluster center, giving k new cluster centers.
Specifically, for each current cluster, the average of the coordinate values of all remote sensing samples in that cluster is computed. Note that "all remote sensing samples" here means the samples other than the current cluster center. The difference between each sample's coordinate values and the average is then computed, and the sample with the smallest difference becomes the new cluster center, replacing the old one. Applying these steps to every current cluster center gives k new cluster centers.
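The center update of step S334 can be sketched as follows. Two points are assumptions: the "difference" between a sample's coordinates and the average is measured as planar Euclidean distance, and the candidates for the new center are the non-center samples over which the average was taken.

```python
import math


def update_center(coords, member_ids, current_center):
    """Return the index of the member closest to the mean coordinate."""
    others = [i for i in member_ids if i != current_center]
    if not others:
        return current_center  # nothing to average; keep the old center
    mean_x = sum(coords[i][0] for i in others) / len(others)
    mean_y = sum(coords[i][1] for i in others) / len(others)
    return min(others, key=lambda i: math.dist(coords[i], (mean_x, mean_y)))
```

Because the new center is always an actual sample rather than the mean itself, the update resembles a medoid-style step more than the classical k-means centroid update.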
S335: Group each new cluster center with the remote sensing samples whose distance to that center equals their second distance to form a new cluster, giving k new clusters.
Specifically, once the k new cluster centers have been obtained, k new clusters are formed around them, again according to the second distance, which completes the cluster update.
S336: Using the distance calculation strategy, compute the distance between each remote sensing sample and its new cluster center, and sum the squares of all these distances to obtain the sum of squared errors of the k new clusters.
It can be understood that, for each new cluster, the distance between each remote sensing sample and its new cluster center, that is, the sample's second distance, is computed. The squares of the second distances of the samples in all new clusters are summed together to obtain the sum of squared errors of the k new clusters, which is a single value. It is computed as follows:
$$SSE = \sum_{i=1}^{k} \sum_{j=1}^{m_i} \lVert X_{ij} - \mu_i \rVert^2$$
where SSE is the sum of squared errors, k is the number of clusters, $m_i$ is the number of remote sensing samples in the i-th cluster, and $\lVert X_{ij} - \mu_i \rVert$ is the distance between the j-th remote sensing sample $X_{ij}$ of the i-th cluster and that cluster's center $\mu_i$.
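The sum of squared errors can be computed directly from a cluster assignment. This sketch assumes clusters are stored as a mapping from a center index to its member indices, and that `distance` implements the distance calculation strategy.

```python
def sum_squared_error(samples, clusters, distance):
    """SSE over all clusters: squared distance of every member to its center."""
    return sum(
        distance(samples[i], samples[center]) ** 2
        for center, members in clusters.items()
        for i in members
    )
```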
S337: Steps S334 to S336 are executed iteratively. Each iteration yields k clusters, their k cluster centers, and the error sum of squares of the k clusters. A change value is computed from the error sums of squares of two adjacent iterations, and it is determined whether the change value satisfies the iteration stop condition. If it does, iteration stops and the final k clusters and k cluster centers are obtained.
Specifically, the iteration stop condition may be that the change between the error sums of squares of two adjacent iterations is 0, that is, the error sum of squares has reached its minimum. Alternatively, the stop condition may be that the maximum number of iterations has been reached; for example, with a maximum of 6, iteration stops after 6 iterations. As a further alternative, the stop condition may be that the change value reaches a threshold, which can be set to 0.2.
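As a non-limiting illustration, the stop check of step S337 can be sketched as follows. The embodiment presents the three conditions as alternatives; combining them with a logical OR, and reading "reaches the threshold" as "is less than or equal to the threshold", are assumptions of this sketch. The name `should_stop` is hypothetical; the default values 6 and 0.2 are the example values given above.

```python
def should_stop(prev_sse, curr_sse, iteration, max_iter=6, threshold=0.2):
    """Decide whether the cluster iteration should stop.

    prev_sse, curr_sse: error sums of squares of two adjacent iterations.
    iteration: number of iterations completed so far.
    """
    change = abs(prev_sse - curr_sse)
    if change == 0:            # SSE no longer changes
        return True
    if iteration >= max_iter:  # maximum number of iterations reached
        return True
    return change <= threshold  # assumption: "reaches" means <= threshold
```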
S4: The distance between each cluster center and the remote sensing samples in its cluster is computed. In each cluster, the remote sensing sample nearest to the cluster center and the one farthest from it are selected, giving 2k remote sensing samples.
Specifically, for each cluster, the distance between every remote sensing sample in the cluster and the cluster center is computed, again according to the distance calculation strategy. The distances are sorted in descending order, and the first and last remote sensing samples (the farthest and the nearest, respectively) are selected. Over the k clusters, 2k remote sensing samples are thus selected.
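As a non-limiting illustration, the per-cluster selection of step S4 can be sketched as follows. The function name `pick_nearest_and_farthest` is hypothetical, `math.dist` stands in for the distance calculation strategy, and it is assumed that `members` does not contain the cluster center itself (otherwise the center, at distance 0, would always be the nearest sample).

```python
import math

def pick_nearest_and_farthest(members, center, dist=math.dist):
    """Return (nearest, farthest) sample of one cluster relative to its center.

    members: non-empty list of samples in the cluster, excluding the center.
    """
    # Sort by distance in descending order, as described in the embodiment.
    ranked = sorted(members, key=lambda s: dist(s, center), reverse=True)
    farthest, nearest = ranked[0], ranked[-1]
    return nearest, farthest
```

Applying this to each of the k clusters yields the 2k samples; a cluster with a single member contributes the same sample twice, a case the embodiment does not address.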
S5: The unlabeled samples among the 2k remote sensing samples are handed to experts for labeling. The expert-labeled results and the labeled samples already in the remote sensing sample set together form the labeled sample set, and the remote sensing sample set is divided into a labeled sample set and an unlabeled sample set.
Specifically, if the 2k selected remote sensing samples include unlabeled samples, these are first handed to experts for labeling and thereby become labeled samples. All remote sensing samples are then divided again according to whether they are labeled, giving the labeled sample set and the unlabeled sample set.
S6: The first classifier model is trained on the labeled sample set, and it is determined whether the condition for terminating the training of the first classifier model is satisfied:
if it is satisfied, training ends and step S9 is executed;
if it is not satisfied, step S7 is executed.
S7: The unlabeled sample set is input to the first classifier model for prediction, and the samples are screened using the geoscience information and a sample query strategy to obtain a value sample set.
Specifically, step S7 includes:
S71: The information entropy and the probability density of each unlabeled sample in the unlabeled sample set are computed, together with the product of the two. The unlabeled samples are screened using this product and a difference constraint to obtain key samples.
S72: The labeled samples in the same cluster as a key sample are obtained as important samples.
S73: The feature distance between each key sample and its corresponding important samples is computed as the third distance, and the key samples whose third distance is greater than a distance threshold are added to the value sample set.
Specifically, step S7 queries samples by active learning. This embodiment uses information entropy to measure the informativeness of an unlabeled sample, defined as follows:
$$H(x) = -\sum_{j} P(y_j \mid x; \theta)\,\log P(y_j \mid x; \theta)$$
where $P(y_j \mid x; \theta)$ is the probability that the unlabeled sample x belongs to the j-th category.
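As a non-limiting illustration, the information entropy of one unlabeled sample can be computed from the class probabilities predicted by the first classifier model, as sketched below. The function name `information_entropy` and the use of the natural logarithm are assumptions; zero probabilities are skipped because their contribution to the sum is zero in the limit.

```python
import math

def information_entropy(probs):
    """Shannon entropy of a predicted class-probability vector.

    probs: probabilities P(y_j | x; theta) over all classes, summing to 1.
    """
    return -sum(p * math.log(p) for p in probs if p > 0)
```

A sample predicted with probabilities `[0.5, 0.5]` has the highest two-class entropy, log 2, while a confidently predicted sample such as `[1.0, 0.0]` has entropy 0 and is therefore the least informative.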
In addition, this embodiment uses probability density to estimate the representativeness of an unlabeled sample, defined as a kernel density estimate over the unlabeled samples:
$$p(x) = \frac{1}{m} \sum_{i=1}^{m} K(x, x_i)$$
where m is the number of unlabeled samples, $x_i$ is the i-th unlabeled sample, and K is the Gaussian kernel function.
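As a non-limiting illustration, the probability density of an unlabeled sample can be estimated with a Gaussian kernel as sketched below. The function name `probability_density`, the `bandwidth` parameter, and the omission of the kernel's normalizing constant are assumptions of this sketch; the embodiment only states that a Gaussian kernel is used.

```python
import math

def probability_density(x, samples, bandwidth=1.0):
    """Average Gaussian-kernel similarity between x and the unlabeled samples.

    x: feature vector of the sample being scored.
    samples: non-empty list of feature vectors of the unlabeled samples.
    bandwidth: kernel width (hypothetical parameter, not given in the source).
    """
    def gaussian_kernel(a, b):
        squared = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return math.exp(-squared / (2 * bandwidth ** 2))

    return sum(gaussian_kernel(x, s) for s in samples) / len(samples)
```

A sample lying in a dense region of the unlabeled set receives a higher value than an isolated one, which is the sense in which the density measures representativeness.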
The product of information entropy and probability density is computed for each unlabeled sample, and the samples are sorted by this product in ascending order. The first unlabeled sample is selected directly as a key sample; each remaining unlabeled sample must satisfy the difference constraint. The difference constraint concerns the difference between the unlabeled sample currently being queried and the key samples already selected. It is measured by the largest difference in the entropy-density product: the difference between the product of the current sample and the product of each existing key sample is computed, and the largest of these values is taken as the difference of that unlabeled sample. This difference must be below the difference threshold, which can be set to 0.1.
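As a non-limiting illustration, the key-sample query described above can be sketched as follows. The function name `select_key_samples` is hypothetical, and taking the absolute value of the difference between products is an assumption; the ascending sort order and the example threshold 0.1 follow the text.

```python
def select_key_samples(scores, diff_threshold=0.1):
    """Return the indices of the key samples.

    scores: entropy-density product of each unlabeled sample.
    """
    if not scores:
        return []
    # Sort sample indices by product, ascending, as in the embodiment.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    keys = [order[0]]  # the first sample is selected directly
    for i in order[1:]:
        # Largest difference to any already selected key sample.
        diff = max(abs(scores[i] - scores[j]) for j in keys)
        if diff < diff_threshold:
            keys.append(i)
    return keys
```

With products `[0.30, 0.02, 0.05, 0.11, 0.50]`, the sample at index 1 is selected first, indices 2 and 3 pass the constraint (largest differences 0.03 and 0.09), and indices 0 and 4 are rejected.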
After the key samples are obtained by the query, the labeled samples in the cluster to which each key sample belongs are obtained and used as the important samples corresponding to that key sample.
Geoscience information vectors are obtained for the key samples and the important samples from the geoscience information. The feature distance between each key sample and each of its corresponding important samples is then computed with the feature distance method, and the largest of these feature distances is taken as the third distance of that key sample. The third distance of every key sample is compared with the distance threshold, and the key samples whose third distance is greater than the threshold are added to the value sample set. The distance threshold can be set to 0.5.
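As a non-limiting illustration, the screening by third distance can be sketched as follows. The function name `build_value_set` and the dictionary layout are hypothetical; the feature distance is taken as the Euclidean distance between geoscience information vectors, and skipping key samples that have no important samples is an assumption, since the embodiment does not address that case.

```python
import math

def build_value_set(key_vectors, important_vectors, distance_threshold=0.5):
    """Return the ids of key samples that enter the value sample set.

    key_vectors: dict mapping key-sample id to its geoscience vector.
    important_vectors: dict mapping key-sample id to the list of vectors of
        the labeled samples in the same cluster (its important samples).
    """
    value_set = []
    for sample_id, vector in key_vectors.items():
        neighbors = important_vectors.get(sample_id, [])
        if not neighbors:
            continue  # assumption: no important samples, nothing to compare
        # Third distance: the largest feature distance to an important sample.
        third_distance = max(math.dist(vector, n) for n in neighbors)
        if third_distance > distance_threshold:
            value_set.append(sample_id)
    return value_set
```

Keeping only key samples that lie far from the already labeled samples of their cluster avoids asking the expert to label samples that closely resemble existing labeled ones.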
S8: The value sample set is handed to experts for labeling, the expert-labeled value sample set is added to the labeled sample set, the unlabeled sample set is updated, and the method returns to step S6.
S9: The unlabeled sample set is labeled by the first classifier model to obtain the labeling result.
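As a non-limiting illustration, the loop formed by steps S6 to S9 can be sketched as follows. All names (`active_labeling_loop`, `train`, `is_done`, `query`, `ask_expert`, `max_rounds`) are hypothetical placeholders supplied by the caller; the embodiment does not prescribe a classifier, a termination condition, or a round limit.

```python
def active_labeling_loop(labeled, unlabeled, train, is_done, query,
                         ask_expert, max_rounds=100):
    """Run S6-S9 and return (predicted labels, final labeled set).

    labeled: dict mapping sample to label; unlabeled: iterable of samples.
    train(labeled) returns a model, itself a callable sample -> label.
    is_done(model, labeled) is the training termination condition of S6.
    query(model, unlabeled) returns the value sample set of S7.
    ask_expert(sample) returns the expert label used in S8.
    """
    labeled, unlabeled = dict(labeled), set(unlabeled)
    model = train(labeled)                          # S6: train
    rounds = 0
    while not is_done(model, labeled) and unlabeled and rounds < max_rounds:
        value_set = query(model, unlabeled)         # S7: query value samples
        if not value_set:
            break                                   # nothing left worth asking
        for sample in value_set:                    # S8: expert labeling
            labeled[sample] = ask_expert(sample)
            unlabeled.discard(sample)
        model = train(labeled)                      # back to S6
        rounds += 1
    predictions = {s: model(s) for s in unlabeled}  # S9: label the rest
    return predictions, labeled
```

The `max_rounds` guard is an addition of this sketch so that the loop ends even if the termination condition is never met.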
The above embodiments are intended only to illustrate the present invention, not to limit it. A person of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention. All equivalent technical solutions therefore also fall within the scope of the present invention, and the scope of patent protection of the present invention shall be defined by the claims.

Claims (10)

  1. A remote sensing sample labeling method based on geoscience information and active learning, characterized by comprising the following steps:
    S1: obtaining a remote sensing sample set, the remote sensing sample set consisting of a plurality of remote sensing samples, the remote sensing samples being divided into unlabeled samples and labeled samples;
    S2: performing geoscience calculations on the remote sensing sample set to obtain geoscience information, wherein the geoscience information includes elevation information, spectral information, texture information, shape information, and statistical measurement information;
    S3: clustering the remote sensing sample set according to the geoscience information to obtain k clusters and k cluster centers, wherein each cluster includes one cluster center and k ≥ 1;
    S4: computing the distance between each cluster center and the remote sensing samples in the corresponding cluster, and selecting in each cluster the remote sensing sample nearest to the cluster center and the remote sensing sample farthest from it, to obtain 2k remote sensing samples;
    S5: handing the unlabeled samples among the 2k remote sensing samples to experts for labeling, forming a labeled sample set from the expert-labeled results and the labeled samples in the remote sensing sample set, and dividing the remote sensing sample set into the labeled sample set and an unlabeled sample set;
    S6: training a first classifier model on the labeled sample set, and determining whether the condition for terminating the training of the first classifier model is satisfied:
    if it is satisfied, ending the training and executing step S9;
    if it is not satisfied, executing step S7;
    S7: inputting the unlabeled sample set to the first classifier model for prediction, and screening using the geoscience information and a sample query strategy to obtain a value sample set;
    S8: handing the value sample set to experts for labeling, adding the expert-labeled value sample set to the labeled sample set, updating the unlabeled sample set, and returning to step S6;
    S9: labeling the unlabeled sample set by the first classifier model to obtain a labeling result.
  2. The remote sensing sample labeling method based on geoscience information and active learning according to claim 1, characterized in that step S3 includes:
    S31: obtaining the position information of each remote sensing sample, and constructing a distance calculation strategy according to the geoscience information, the distance calculation strategy including a spatial distance method and a feature distance method;
    S32: obtaining k initial cluster centers based on the distance calculation strategy;
    S33: iteratively optimizing the k initial cluster centers using the position information of the remote sensing samples and the distance calculation strategy, to obtain the k clusters and the k cluster centers.
  3. The remote sensing sample labeling method based on geoscience information and active learning according to claim 2, characterized in that step S32 includes:
    S321: randomly selecting one remote sensing sample from the remote sensing sample set, taking that remote sensing sample as an initial cluster center, and adding it to an initial cluster center set;
    S322: computing, based on the distance calculation strategy, the distance between a single remote sensing sample and each of the initial cluster centers, taking the largest distance as the first distance of that remote sensing sample, sorting the first distances of all remote sensing samples in descending order, selecting the remote sensing sample with the largest first distance as a new initial cluster center, and adding it to the initial cluster center set;
    S323: repeating step S322 until the number of initial cluster centers in the initial cluster center set reaches k.
  4. The remote sensing sample labeling method based on geoscience information and active learning according to claim 3, characterized in that step S33 includes:
    S331: obtaining the coordinate values of the remote sensing samples from the position information of the remote sensing samples;
    S332: computing, based on the distance calculation strategy, the distance between a single remote sensing sample and each of the k initial cluster centers, and taking the smallest distance as the second distance of that remote sensing sample;
    S333: forming an initial cluster from a single initial cluster center and the remote sensing samples whose second distance is their distance to that initial cluster center, and taking that initial cluster center as the initial cluster center of the cluster, to obtain k initial clusters and k initial cluster centers;
    S334: within each current cluster, averaging the coordinate values of all remote sensing samples, computing the difference between the coordinate value of each remote sensing sample and the average, and taking the remote sensing sample whose coordinate value has the smallest difference as the new cluster center, to obtain k new cluster centers;
    S335: forming a new cluster from a single new cluster center and the remote sensing samples whose second distance is their distance to that cluster center, to obtain k new clusters;
    S336: computing the distance between each remote sensing sample and its corresponding new cluster center according to the distance calculation strategy, and summing the squares of all the distances to obtain the error sum of squares of the k new clusters;
    S337: iteratively executing steps S334 to S336, each iteration yielding k clusters, their k cluster centers, and the error sum of squares of the k clusters; computing a change value from the error sums of squares of two adjacent iterations; determining whether the change value satisfies an iteration stop condition; and, if it does, stopping the iteration to obtain the final k clusters and k cluster centers.
  5. The remote sensing sample labeling method based on geoscience information and active learning according to claim 2, characterized in that the distance calculation strategy is:
    selecting two remote sensing samples to be calculated as a first sample and a second sample;
    obtaining the spatial distance $d_s$ between the first sample and the second sample according to the spatial distance method;
    obtaining the feature distance $d_{Eu}$ between the first sample and the second sample according to the feature distance method;
    normalizing $d_s$ and $d_{Eu}$ to obtain the normalized results $d'_s$ and $d'_{Eu}$, wherein $d'_s$ and $d'_{Eu}$ both lie in the range [0, 1];
    computing the sum of $d'_s$ and $d'_{Eu}$ as the distance between the first sample and the second sample.
  6. The remote sensing sample labeling method based on geoscience information and active learning according to claim 5, characterized in that the spatial distance method is:
    constructing a Delaunay triangulation network {Del} from the position information of the remote sensing samples, {Del} including a plurality of Delaunay triangles, each Delaunay triangle including three vertices and adjacent edges; obtaining the Delaunay triangles $Del_1$ and $Del_2$ in which the first sample and the second sample lie in the Delaunay triangulation network {Del};
    obtaining the vertex set {Node1} of $Del_1$ on its adjacent edges, and obtaining the vertex set {Node2} of $Del_2$ on its adjacent edges;
    obtaining, from the coordinates of each vertex in {Node1} and {Node2}, the two vertices $Node_1$ and $Node_2$ that are spatially farthest apart;
    computing the distance between $Node_1$ and $Node_2$ according to the spatial topological relationship, as the spatial distance $d_s$ between the first sample and the second sample.
  7. The remote sensing sample labeling method based on geoscience information and active learning according to claim 6, characterized in that the adjacent edges of a Delaunay triangle are the edges that the Delaunay triangle shares with other Delaunay triangles, and the number of adjacent edges is not necessarily the same for every Delaunay triangle.
  8. The remote sensing sample labeling method based on geoscience information and active learning according to claim 5, characterized in that the feature distance method is:
    obtaining the geoscience information vectors $f_1$ and $f_2$ of the first sample and the second sample according to the geoscience information;
    computing the Euclidean distance between $f_1$ and $f_2$ as the feature distance $d_{Eu}$ between the first sample and the second sample:
    $$d_{Eu} = \lVert f_1 - f_2 \rVert_2 = \sqrt{\sum_{n} (f_{1,n} - f_{2,n})^2}$$
  9. The remote sensing sample labeling method based on geoscience information and active learning according to claim 8, characterized in that step S7 includes:
    S71: computing the information entropy and the probability density of each unlabeled sample in the unlabeled sample set, computing the product of the information entropy and the probability density of each unlabeled sample, and screening the unlabeled samples using the product and a difference constraint to obtain key samples;
    S72: obtaining the labeled samples in the same cluster as a key sample, as important samples;
    S73: computing the feature distance between each key sample and its corresponding important samples as a third distance, and adding the key samples whose third distance is greater than a distance threshold to the value sample set.
  10. The remote sensing sample labeling method based on geoscience information and active learning according to claim 1, characterized in that:
    the elevation information includes DEM information, ground slope information, and terrain roughness information;
    the spectral information includes the normalized difference vegetation index and the enhanced vegetation index;
    the texture information includes gray-level co-occurrence matrix information, gray-level run-length matrix information, and neighborhood gray-level difference matrix information;
    the shape information includes rectangularity, elongation, major axis length, and longest diameter;
    the statistical measurement information includes maximum value, minimum value, range, and skewness.
PCT/CN2023/118178 2022-09-19 2023-09-12 Remote-sensing sample labeling method based on geoscientific information and active learning WO2024061050A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211135861.1A CN115272870A (en) 2022-09-19 2022-09-19 Remote sensing sample labeling method based on geological information and active learning
CN202211135861.1 2022-09-19

Publications (1)

Publication Number Publication Date
WO2024061050A1 true WO2024061050A1 (en) 2024-03-28

Family

ID=83757662

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/118178 WO2024061050A1 (en) 2022-09-19 2023-09-12 Remote-sensing sample labeling method based on geoscientific information and active learning

Country Status (2)

Country Link
CN (1) CN115272870A (en)
WO (1) WO2024061050A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115272870A (en) * 2022-09-19 2022-11-01 北京数慧时空信息技术有限公司 Remote sensing sample labeling method based on geological information and active learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710894A (en) * 2018-04-17 2018-10-26 中国科学院软件研究所 A kind of Active Learning mask method and device based on cluster representative point
WO2020202594A1 (en) * 2019-04-04 2020-10-08 Nec Corporation Learning system, method and program
CN110210534A (en) * 2019-05-21 2019-09-06 河海大学 High score remote sensing images scene multi-tag classification method based on more packet fusions
US20220036128A1 (en) * 2020-08-03 2022-02-03 International Business Machines Corporation Training machine learning models to exclude ambiguous data samples
CN114627390A (en) * 2022-05-12 2022-06-14 北京数慧时空信息技术有限公司 Improved active learning remote sensing sample marking method
CN115272870A (en) * 2022-09-19 2022-11-01 北京数慧时空信息技术有限公司 Remote sensing sample labeling method based on geological information and active learning

Also Published As

Publication number Publication date
CN115272870A (en) 2022-11-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23867339

Country of ref document: EP

Kind code of ref document: A1