CN110704624A - A multi-level and multi-label classification method for geographic information service metadata text - Google Patents

A multi-level and multi-label classification method for geographic information service metadata text

Info

Publication number
CN110704624A
CN110704624A
Authority
CN
China
Prior art keywords
text
topic
classification
feature
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910942287.2A
Other languages
Chinese (zh)
Other versions
CN110704624B (en)
Inventor
桂志鹏
张敏
彭德华
吴华意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910942287.2A priority Critical patent/CN110704624B/en
Publication of CN110704624A publication Critical patent/CN110704624A/en
Application granted granted Critical
Publication of CN110704624B publication Critical patent/CN110704624B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/387 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-level and multi-label classification method for geographic information service metadata text, comprising: 1) acquiring a geographic information service metadata text set, performing text preprocessing, and segmenting each data sample into a combination of text feature words; 2) setting a first-level classification catalog and generating a typical-word vocabulary semantically associated with the classification categories; 3) screening the text feature words against the typical-word vocabulary; 4) selecting ML-KNN as one base model for co-training; 5) building the topic prediction model ML-CSW as the other base model for co-training; 6) designing a collaboration mechanism that matches multi-label topics to the metadata text as the first-level coarse-grained topic classification result; 7) selecting the metadata text corresponding to a given classification label to obtain fine-grained topic category catalogs at different levels. The method takes into account the domain characteristics and text semantics of geographic information service metadata, relies on only a small number of labeled samples, and its classification results outperform traditional multi-label classification methods overall.

Description

A multi-level and multi-label classification method for geographic information service metadata text

Technical Field

The invention relates to natural language processing technology, and in particular to a multi-level and multi-label classification method for geographic information service metadata text.

Background Art

As an important means of data analysis, accurate text classification is key to improving the retrieval quality of geographic information resources and has a wide range of application scenarios. Most traditional classification methods are designed for binary or single-label scenarios and rely heavily on large numbers of labeled samples to train the classification model, which limits the accuracy and comprehensiveness of text classification as well as the scenarios in which the model can be applied. For geographic information service metadata in particular, sample datasets with annotated topics are usually lacking, and the text content is heterogeneous: the mixture of geoscience terminology and general vocabulary complicates the feature vocabulary, and the overlap and subordination relationships among topics give the metadata text topics multi-granularity and multi-category characteristics, further increasing the difficulty of topic classification. To address the shortage of training samples and the need for multi-category matching, some researchers have proposed semi-supervised and weakly supervised mechanisms to reduce the classifier's dependence on training samples, while others have achieved multi-label text classification with methods such as ML-KNN, BR-KNN and TSVM. However, these methods usually do not incorporate domain characteristics or consider the semantics of technical terms in the text, and therefore cannot effectively fit the textual characteristics of geographic information service metadata.

Summary of the Invention

The technical problem to be solved by the present invention is to provide a multi-level and multi-label classification method for geographic information service metadata text, in view of the defects of the prior art.

The technical solution adopted by the present invention to solve this technical problem is a multi-level and multi-label classification method for geographic information service metadata text, comprising the following steps:

1) Acquire a geographic information service metadata text set containing unlabeled and labeled samples, perform text preprocessing, and segment each data sample into a combination of text feature words;

2) Define a first-level classification catalog based on the domain application topic categories of geographic information resources, and generate a typical-word vocabulary closely associated with the semantics of the classification categories (hereinafter referred to as "topics");

3) Screen the text feature words against the typical-word vocabulary, filter out features whose distance to the typical words exceeds a threshold, and obtain a feature subset screened for topic classification;

4) Select the classic multi-label classification algorithm ML-KNN (Multi-label K Nearest Neighbors) as one base model $H_1$ for co-training;

5) Compute the semantic distance from features to topics on the basis of a corpus, build the topic prediction model ML-CSW (Multi-label Classification based on SWEET & WordNet), and use this model as the other base model $H_2$ for co-training;

6) Design a collaboration mechanism based on the above two base models to match multi-label topics to the metadata text as the first-level coarse-grained topic classification result;

7) According to the first-level coarse-grained topic classification result, select the metadata text corresponding to a given classification label and extract the text topics as the fine-grained topics of the next level, thereby also obtaining the matching relationship between the metadata text and the two-level topic catalog;

8) Repeat step 7) to obtain fine-grained topic category catalogs at different levels, as well as the matching relationships between the metadata text and the topic catalogs.

According to the above scheme, in step 2) the first-level classification catalog defined from the domain application topic categories of geographic information resources is obtained by extending the societal benefit areas (SBAs) proposed by the Group on Earth Observations for the geoscience domain.

According to the above scheme, the typical-word vocabulary in step 2) is generated as follows:

Taking the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic defined in SWEET and WordNet are extracted as typical words semantically related to that topic, and the typical-word vocabulary is generated.

According to the above scheme, in step 3) the text feature words are screened against the typical-word vocabulary as follows:

S31. Represent the typical words and text feature words as word vectors in a two-dimensional space based on the Word2vec algorithm;

S32. Compute the cosine distance between the typical-word vectors and the text feature word vectors;

S33. Set a distance threshold T and filter out text feature words whose cosine distance to the typical words is greater than T.

According to the above scheme, the topic model in step 5) is built as follows:

According to the network definitions of the SWEET ontology library and the WordNet English lexical network, compute the semantic distance $d_{p_i}$ between a text feature $f$ and each topic $p_i$;

Take the minimum of the semantic distances $d_{p_i}$ between feature $f$ and all topics $p_i$, and take its reciprocal as the maximum semantic relevance $s_f$ between the text feature $f$ and the topic set $P$, where $P$ is the set of all topics;

Define feature weights based on the shortest distance between text features and topics, build the topic prediction model, and predict multi-label topics for the unlabeled samples;

Assuming the training set contains $n$ text features in total, the vector of maximum semantic relevances from all features in the training set to all topics, $S=[s_1,s_2,\dots,s_n]$, can be computed. The weight $w(x)$ of a single data item $x$ is defined as a $1\times n$ vector whose entries correspond to the weights of the $n$ text features: if feature $f$ appears in sample $x$, the corresponding entry is $s_f$, otherwise it is 0;

Build the topic prediction model $Y$, where $F$ is the adjustment vector of the features and $\alpha$ is a smoothing parameter. Based on the labeled sample data, a BP neural network is used to iteratively optimize the training of model $Y$, the optimal solution of $F$ and $\alpha$ under minimum loss is computed to obtain the final model, and the category set of an unlabeled sample $t$ is predicted with the model;

$$Y = w(x) \cdot F + \alpha$$

According to the above scheme, the collaboration mechanism of step 6), which matches multi-label topics to the metadata text as the first-level coarse-grained topic classification result, is designed as follows:

S61. Generate two subsets $L_1$ and $L_2$ from the labeled samples in the geographic information service metadata text set, used respectively as the training sets of the co-training base models $H_1$ and $H_2$;

S62. Train the base models $H_1$ and $H_2$ with their training sets, and use the trained base models to predict the category vectors of the unlabeled samples;

S63. From the unlabeled samples, select those for which classifiers $H_1$ and $H_2$ produce the same prediction and assign them pseudo-labels, add the pseudo-labeled samples to the two training subsets $L_1$ and $L_2$ respectively, update the training sets, and repeat steps S62-S63 until the classification results of the two classifiers no longer change appreciably, obtaining the category sets of all unlabeled samples and the last updated training sets;

S64. Train classifier $H_1$ on all labeled samples and match topic category sets to the test samples.

According to the above scheme, in step 4) the classic multi-label classification algorithm ML-KNN is selected as one base model for co-training, as follows:

S41. Select the ML-KNN algorithm as the co-training base model $H_1$ and specify the number of nearest neighbors $k$. Let $N(x)$ denote the set of $k$ nearest neighbors of sample $x$ in the training set; count the number $c[j]$ of samples in $N(x)$ that belong to topic category $l$ and the number $c'[j]$ of samples in $N(x)$ that do not belong to topic category $l$. In the following formulas, when sample $x$ belongs to topic category $l$, $\vec{y}_x(l)=1$ and $\neg\vec{y}_x(l)=0$; otherwise $\vec{y}_x(l)=0$ and $\neg\vec{y}_x(l)=1$. The membership count of category $l$ among the neighbors of $x$ is

$$\vec{C}_x(l) = \sum_{a \in N(x)} \vec{y}_a(l);$$

S42. Compute the prior probability $P(H_1^l)$ that an unlabeled sample $t$ belongs to topic category $l$ and the posterior probabilities $P(E_j^l \mid H_b^l)$, where $b$ takes the values 0 and 1, $H_1^l$ denotes the event that sample $t$ belongs to topic category $l$, $H_0^l$ denotes the event that sample $t$ does not belong to topic category $l$, $s$ is a smoothing parameter, $m$ is the number of training samples, and $E_j^l$ denotes the event that exactly $j$ of the $k$ nearest neighbors of sample $t$ belong to category $l$:

$$P(H_1^l) = \frac{s + \sum_{i=1}^{m} \vec{y}_{x_i}(l)}{2s + m}, \qquad P(H_0^l) = 1 - P(H_1^l)$$

$$P(E_j^l \mid H_1^l) = \frac{s + c[j]}{s(k+1) + \sum_{p=0}^{k} c[p]}, \qquad P(E_j^l \mid H_0^l) = \frac{s + c'[j]}{s(k+1) + \sum_{p=0}^{k} c'[p]}$$

S43. Predict the category set $\vec{y}_t$ of the unlabeled sample $t$ according to the maximum a posteriori probability and the Bayesian principle:

$$\vec{y}_t(l) = \arg\max_{b \in \{0,1\}} P(H_b^l)\, P(E_{\vec{C}_t(l)}^l \mid H_b^l)$$

According to the above scheme, in step 7) the text topics are extracted based on the Latent Dirichlet Allocation (LDA) algorithm.

The beneficial effects of the present invention are as follows: the invention proposes a new multi-level and multi-label classification workflow for the metadata text of the OGC Web Map Service (WMS) and other geographic information web resources. The workflow introduces the geoscience ontology library SWEET and the general English lexical network WordNet into the classification process, and combines the traditional classification algorithm ML-KNN with ML-CSW, a classification algorithm that closely fits the domain characteristics and text semantics, for co-training, so as to obtain the matching relationship between geographic information service metadata text and a multi-level topic catalog. The method takes into account the domain characteristics and text semantics of geographic information service metadata and relies on only a small number of labeled samples; at the same time, compared with traditional multi-label classification algorithms such as classifier chains and voting classifiers, its classification results perform better overall.

Description of the Drawings

The present invention will be further described below with reference to the accompanying drawings and embodiments, in which:

Figure 1 is a flowchart of the method according to an embodiment of the present invention;

Figure 2 is a flowchart of the method according to an embodiment of the present invention;

Figure 3 shows examples of typical words according to an embodiment of the present invention;

Figure 4 illustrates the computation of the shortest distance between text features and topics in the ML-CSW algorithm according to an embodiment of the present invention;

Figure 5 shows the classification result of an example text according to an embodiment of the present invention;

Figure 6 compares the classification results of different classification algorithms according to an embodiment of the present invention;

Figure 7 compares the classification results based on different feature selection algorithms according to an embodiment of the present invention.

Detailed Description of the Embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.

There are 46,000 Web Map Service (WMS) text records available, of which 400 are annotated with SBA topics, and the topics are evenly distributed. The text content comes from the URL, Abstract, Keywords and Title fields inside the Service tag of the WMS GetCapabilities document. Because the text content is heterogeneous, the documents vary in length, a single record corresponds to multiple topic categories, and the amount of topic-annotated sample data is small, traditional multi-label classification algorithms can hardly classify such data accurately and comprehensively, nor can they produce multi-level topic matching results.

Building on the theoretical basis of co-training in semi-supervised learning, the present invention introduces a geoscience ontology library and a general English lexical network to design a base classification model that fits the characteristics of the geoscience domain. During classification it is co-trained with the widely used classic multi-label classification model, and multi-level fine-grained topics are extracted so as to match multi-level multi-label topics to the WMS metadata text.

The algorithm workflow of the present invention is described in detail below with reference to the accompanying drawings, as follows:

As shown in Figures 1 and 2, a multi-level and multi-label classification method for geographic information service metadata text comprises the following steps:

1) Perform text preprocessing on all WMS metadata, including the three steps of tokenization, stop-word removal and lemmatization, and segment each text into a combination of text feature words.
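
The following is a minimal, illustrative sketch of this preprocessing step using NLTK; the example record and the choice of the NLTK stop-word list are assumptions for illustration, not part of the patent.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Assumes the NLTK resources 'punkt', 'stopwords' and 'wordnet' have been downloaded.
_stop = set(stopwords.words("english"))
_lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    """Tokenize, drop stop words and non-alphabetic tokens, then lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in _stop]
    return [_lemmatizer.lemmatize(t) for t in tokens]

# Example: one WMS metadata record assembled from the URL, Abstract, Keywords and Title fields.
record = "Global precipitation and rainfall observations for flood monitoring"
print(preprocess(record))  # e.g. ['global', 'precipitation', 'rainfall', 'observation', 'flood', 'monitoring']
```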

2) The first-level classification is obtained by extending the societal benefit areas (SBAs) proposed by the Group on Earth Observations (GEO) for the geoscience domain. The SBAs comprise nine topics of interest: Agriculture, Biodiversity, Climate, Disaster, Ecosystem, Energy, Health, Water and Weather. The topic classification catalog of this embodiment extends the SBAs by adding Geology as a tenth topic, so all references in this embodiment to the topic classification catalog or the first-level topic classification catalog refer to these ten topics.

Taking the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic defined in SWEET and WordNet are extracted as typical words semantically related to that topic, and the typical-word vocabulary is generated. Figure 3(a) shows examples of typical words for the topic "Agriculture" extracted from SWEET, and Figure 3(b) shows examples of typical words for the same topic extracted from WordNet, with different colors representing different sets of word senses.
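
As an illustration of how the WordNet part of this vocabulary could be assembled, the sketch below collects synonyms, hypernyms and hyponyms of a topic word with NLTK's WordNet interface; SWEET has no comparably standard Python API, so it is omitted here, and the helper name is an assumption.

```python
from nltk.corpus import wordnet as wn

def typical_words(topic: str) -> set[str]:
    """Collect synonyms, hypernyms and hyponyms of a topic word from WordNet."""
    words = set()
    for synset in wn.synsets(topic):
        words.update(lemma.name().lower() for lemma in synset.lemmas())        # synonyms
        for hyper in synset.hypernyms():
            words.update(lemma.name().lower() for lemma in hyper.lemmas())     # hypernyms
        for hypo in synset.hyponyms():
            words.update(lemma.name().lower() for lemma in hypo.lemmas())      # hyponyms
    return words

# Typical words for the SBA topic "Agriculture" (cf. Figure 3(b)).
print(sorted(typical_words("agriculture"))[:10])
```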

3) The CBOW model of the Word2vec algorithm represents the typical words and text feature words as word vectors in a two-dimensional space, and the cosine distance between the typical-word vectors and the text feature word vectors is computed;

4) Set a distance threshold and screen the text feature words against it, filtering out features whose distance to the typical words exceeds the threshold, thereby obtaining a feature subset that contributes more to topic classification as the model input of the classification algorithm.
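
A minimal sketch of steps 3)-4) using gensim is given below; the toy corpus, the CBOW hyperparameters and the threshold value are placeholders rather than values prescribed by the patent.

```python
from gensim.models import Word2Vec
from numpy import dot
from numpy.linalg import norm

# Train a CBOW model (sg=0) on the preprocessed metadata corpus (a list of token lists).
corpus = [["glacier", "ice", "water"], ["crop", "farm", "agriculture"]]  # placeholder corpus
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=0)

def cosine_distance(a, b):
    return 1.0 - dot(a, b) / (norm(a) * norm(b))

def keep_feature(feature, typical_words, threshold=0.6):
    """Keep a feature word only if it is close to at least one typical word."""
    if feature not in model.wv:
        return False
    return any(
        w in model.wv and cosine_distance(model.wv[feature], model.wv[w]) <= threshold
        for w in typical_words
    )

print(keep_feature("glacier", {"water", "ice"}))
```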

5) Design ML-CSW, a multi-label classification algorithm that fits the characteristics of the WMS domain and considers text semantics, as the co-training base model $H_1$; the semantic relevance between text features and topics, computed from the corpus, serves as the feature weights used to train the topic prediction model:

5.1) Compute the shortest semantic distance between text features and topics, relying primarily on the network definition of SWEET, with WordNet as a supplement.

If a text feature word is included in SWEET, the shortest distance between the feature word and the topic is obtained from the SWEET network definition; as shown in Figure 4(a), the distance from the feature "Glacier" to the topic "Water" is 3.

If a text feature is not included in SWEET, its hypernyms are looked up level by level in WordNet as substitute words for the feature until a substitute word included in SWEET is found, and the shortest distance $D_1$ from the feature to the substitute word in the WordNet definition is computed. As shown in Figure 4(b), the substitute word for the feature "Neve" is "Ice", with a shortest distance of 1. Then, according to the SWEET network definition, the shortest distance $D_2$ from the substitute word to the topic is computed with the Dijkstra algorithm; in Figure 4(b) the shortest distance from the substitute word "Ice" to the topic "Water" is 2. The final distance between the text feature and the topic is the sum of the distance from the feature to the substitute word and the distance from the substitute word to the topic, i.e. $D = D_1 + D_2$; in Figure 4(b) the shortest distance from the feature "Neve" to the topic "Water" is 3.
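
The sketch below illustrates this two-stage distance computation under the assumption that the SWEET concept network has already been loaded into a NetworkX graph; the toy graph, the choice of the first WordNet sense and the helper name are assumptions for illustration.

```python
import networkx as nx
from nltk.corpus import wordnet as wn

# sweet: an undirected graph whose nodes are SWEET concept labels (lower-case), stub edges only.
sweet = nx.Graph()
sweet.add_edges_from([("glacier", "ice"), ("ice", "snow"), ("snow", "water"),
                      ("ice", "water_substance"), ("water_substance", "water")])

def feature_topic_distance(feature: str, topic: str) -> float:
    """Shortest semantic distance D = D1 (WordNet hypernym hops) + D2 (shortest path on SWEET)."""
    d1, substitute = 0, feature
    while substitute not in sweet:                       # climb WordNet hypernyms until SWEET covers the word
        synsets = wn.synsets(substitute)
        if not synsets or not synsets[0].hypernyms():
            return float("inf")                          # no usable substitute found
        substitute = synsets[0].hypernyms()[0].lemma_names()[0].lower()
        d1 += 1
    try:
        d2 = nx.shortest_path_length(sweet, substitute, topic)  # unweighted shortest path on the SWEET network
    except nx.NetworkXNoPath:
        return float("inf")
    return d1 + d2

print(feature_topic_distance("glacier", "water"))   # 3 with the toy graph above
```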

5.2) Define feature weights based on the shortest distance between text features and topics, build the topic prediction model, and predict multi-label topics for the unlabeled samples;

a) From step 5.1), the semantic distance $d_{p_i}$ between a text feature $f$ and each topic $p_i$ can be computed. The reciprocal of the shortest of these distances is taken as the maximum semantic relevance $s_f$ between the text feature $f$ and the topic set $P$, where $P$ is the set of all topics:

$$s_f = \frac{1}{\min_{p_i \in P} d_{p_i}}$$

b) If all texts contain $n$ text features in total, the vector of maximum semantic relevances from all features in the training set to all topics, $S=[s_1,s_2,\dots,s_n]$, can be computed. The weight $w(x)$ of a single data item $x$ is defined as a $1\times n$ vector whose entries correspond to the weights of the $n$ text features: if feature $f$ appears in sample $x$, the corresponding entry is $s_f$, otherwise it is 0.

c) Build the topic prediction model $Y$, where $F$ is the adjustment vector of the features and $\alpha$ is a smoothing parameter. Based on the labeled sample data, a BP neural network is used to iteratively optimize the training of the topic prediction model, and the optimal solution of $F$ and $\alpha$ under minimum loss is computed to obtain the final model, with which the category set of an unlabeled sample $t$ can be predicted;

$$Y = w(x) \cdot F + \alpha$$
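
If the model $Y = w(x)F + \alpha$ is read as a single linear layer that maps the weighted feature vector $w(x)$ to topic scores, it can be trained by back-propagation as sketched below; the loss function, the dimensions and all hyperparameters are assumptions for illustration, not values given in the patent.

```python
import torch
import torch.nn as nn

n_features, n_topics = 500, 10          # assumed sizes: feature vocabulary and SBA topics

# Y = w(x) * F + alpha: a linear layer whose weight plays the role of F and whose bias plays the role of alpha.
model = nn.Linear(n_features, n_topics)
criterion = nn.BCEWithLogitsLoss()      # multi-label loss (one sigmoid per topic)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train(W, Y_true, epochs=200):
    """W: (m, n_features) matrix of weighted feature vectors w(x); Y_true: (m, n_topics) 0/1 label matrix."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(W), Y_true)
        loss.backward()                  # back-propagation, iteratively adjusting F and alpha
        optimizer.step()

def predict(w_t, threshold=0.5):
    """Predict the multi-label topic set of an unlabeled sample t from its weight vector w(t)."""
    with torch.no_grad():
        probs = torch.sigmoid(model(w_t))
    return (probs >= threshold).int()
```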

6) Select the widely used classic multi-label classification algorithm ML-KNN as the co-training base model $H_2$:

Specify the number of nearest neighbors $k$ and let $N(x)$ denote the set of $k$ nearest neighbors of sample $x$ in the training set $L_1$; count the number $c[j]$ of samples in $N(x)$ that belong to topic category $l$ and the number $c'[j]$ of samples in $N(x)$ that do not belong to topic category $l$. In the following formulas, when sample $x$ belongs to topic category $l$, $\vec{y}_x(l)=1$ and $\neg\vec{y}_x(l)=0$; when sample $x$ does not belong to topic category $l$, $\vec{y}_x(l)=0$ and $\neg\vec{y}_x(l)=1$. The membership count of category $l$ among the neighbors of $x$ is

$$\vec{C}_x(l) = \sum_{a \in N(x)} \vec{y}_a(l)$$

Compute the prior probability $P(H_1^l)$ that an unlabeled sample $t$ belongs to topic category $l$ and the posterior probabilities $P(E_j^l \mid H_b^l)$, where $s$ is a smoothing parameter, $m$ is the number of training samples, $H_1^l$ denotes the event that sample $t$ belongs to topic category $l$, $H_0^l$ denotes the event that sample $t$ does not belong to topic category $l$, and $E_j^l$ denotes the event that exactly $j$ of the $k$ nearest neighbors of sample $t$ belong to category $l$:

$$P(H_1^l) = \frac{s + \sum_{i=1}^{m} \vec{y}_{x_i}(l)}{2s + m}$$

$$P(E_j^l \mid H_1^l) = \frac{s + c[j]}{s(k+1) + \sum_{p=0}^{k} c[p]}$$

$$P(E_j^l \mid H_0^l) = \frac{s + c'[j]}{s(k+1) + \sum_{p=0}^{k} c'[p]}$$

Predict the category set $\vec{y}_t$ of the unlabeled sample $t$ according to the maximum a posteriori probability and the Bayesian principle:

$$\vec{y}_t(l) = \arg\max_{b \in \{0,1\}} P(H_b^l)\, P(E_{\vec{C}_t(l)}^l \mid H_b^l)$$
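
A compact, illustrative implementation of these ML-KNN estimates is sketched below (Euclidean nearest neighbors over a numeric feature matrix); it follows the standard ML-KNN formulation and its function name, defaults and neighbor search are assumptions, not details fixed by the patent.

```python
import numpy as np

def ml_knn_fit_predict(X_train, Y_train, X_test, k=10, s=1.0):
    """X_train: (m, n) features; Y_train: (m, q) 0/1 labels; returns (len(X_test), q) 0/1 predictions."""
    m, q = Y_train.shape

    def neighbors(x, X):
        return np.argsort(np.linalg.norm(X - x, axis=1))[:k]

    # Prior probabilities P(H_1^l) with smoothing parameter s.
    prior1 = (s + Y_train.sum(axis=0)) / (2 * s + m)

    # c[l][j] / c'[l][j]: how often a training sample with / without label l has exactly j neighbors with label l.
    c = np.zeros((q, k + 1)); c_bar = np.zeros((q, k + 1))
    for i in range(m):
        idx = neighbors(X_train[i], np.delete(X_train, i, axis=0))
        delta = np.delete(Y_train, i, axis=0)[idx].sum(axis=0).astype(int)   # label counts among neighbors
        for l in range(q):
            (c if Y_train[i, l] == 1 else c_bar)[l, delta[l]] += 1

    post1 = (s + c) / (s * (k + 1) + c.sum(axis=1, keepdims=True))
    post0 = (s + c_bar) / (s * (k + 1) + c_bar.sum(axis=1, keepdims=True))

    # MAP prediction for each test sample and each label.
    preds = np.zeros((len(X_test), q), dtype=int)
    for t, x in enumerate(X_test):
        delta = Y_train[neighbors(x, X_train)].sum(axis=0).astype(int)
        for l in range(q):
            p1 = prior1[l] * post1[l, delta[l]]
            p0 = (1 - prior1[l]) * post0[l, delta[l]]
            preds[t, l] = int(p1 >= p0)
    return preds
```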

7) Split 80% of all labeled samples, by repeated random sampling, into two subsets $L_1$ and $L_2$, used respectively as the training sets of classifiers $H_1$ and $H_2$, and use the two classifiers to predict the category sets of all unlabeled samples;

8) Select the unlabeled samples for which classifiers $H_1$ and $H_2$ produce the same prediction and assign them pseudo-labels, add the pseudo-labeled samples to the two training subsets $L_1$ and $L_2$ respectively, update the training sets, and repeat step 7) until the classification results of the two classifiers no longer change appreciably, obtaining the category sets of the unlabeled samples.
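
The co-training loop of steps 7)-8) could be organised as in the sketch below; the classifier interface, the agreement test and the stopping criterion are illustrative assumptions.

```python
import numpy as np

def co_train(h1, h2, L1, L2, unlabeled, max_rounds=20):
    """h1/h2: objects with fit(X, Y) and predict(X); L1/L2: (X, Y) tuples; unlabeled: feature matrix."""
    prev = None
    for _ in range(max_rounds):
        h1.fit(*L1)
        h2.fit(*L2)
        p1, p2 = h1.predict(unlabeled), h2.predict(unlabeled)

        agree = np.all(p1 == p2, axis=1)               # samples on which both base models agree
        if prev is not None and np.array_equal(p1, prev):
            break                                      # predictions no longer change appreciably
        prev = p1

        # Pseudo-label the agreed samples and grow both training subsets.
        X_new, Y_new = unlabeled[agree], p1[agree]
        L1 = (np.vstack([L1[0], X_new]), np.vstack([L1[1], Y_new]))
        L2 = (np.vstack([L2[0], X_new]), np.vstack([L2[1], Y_new]))
    return p1, L1, L2
```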

9) Take 10% of all labeled samples as test samples and use the trained classifier to match topic category sets to them; for the example text in Figure 5, the SBA category labels include Biodiversity, Climate, Disaster, Ecosystem, Water and Weather.

10) Specify the number of topic levels N. For each level, select the metadata text of a single topic category and extract fine-grained text topics based on the Latent Dirichlet Allocation (LDA) algorithm, until an N-level topic catalog has been generated and N levels of topics have been matched to the WMS metadata text. In Figure 5, the second-level topics corresponding to Biodiversity are wildlife, specie and diversity; those corresponding to Climate are forest and meteorology; the second-level topic corresponding to Disaster is pollution; those corresponding to Ecosystem are habitat, resource and conserve; the second-level topic corresponding to Water is rain; and the second-level topic corresponding to Weather is meteorology.
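
A minimal sketch of the per-category LDA topic extraction with gensim is given below; the number of topics, the number of top words per topic and the toy documents are placeholders.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def fine_grained_topics(docs, num_topics=3, topn=3):
    """docs: preprocessed token lists of the metadata texts under one coarse-grained label."""
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10, random_state=0)
    # Return the top words of each latent topic as the next-level fine-grained topic labels.
    return [[w for w, _ in lda.show_topic(t, topn=topn)] for t in range(num_topics)]

# Example: texts already assigned the first-level label "Biodiversity".
docs = [["wildlife", "specie", "habitat"], ["diversity", "specie", "conservation"]]
print(fine_grained_topics(docs))
```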

The present invention takes into account the domain characteristics and text semantics of geographic information service metadata and relies on only a small number of labeled samples. As shown in Figure 6, compared with traditional multi-label classification algorithms such as classifier chains and voting classifiers, the classification results of the method perform better overall.

As shown in Figure 7, compared with the chi-square test and the WordNet-based feature selection method, the text feature selection workflow of the present invention is better able to filter out features that contribute nothing to the classification result. The method can be applied to geographic information portals and data catalog services to assist the retrieval and discovery of all kinds of geographic information resources.

It should be understood that those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims (8)

1. A multi-level and multi-label classification method for geographic information service metadata text, characterized in that it comprises the following steps: 1) acquiring a geographic information service metadata text set containing unlabeled and labeled samples, performing text preprocessing, and segmenting each data sample into a combination of text feature words; 2) setting a first-level classification catalog based on the domain application topic categories of geographic information resources to obtain the classification categories, i.e. the topics, and then generating a typical-word vocabulary semantically associated with the classification categories; 3) screening the text feature words against the typical-word vocabulary, filtering out features whose distance to the typical words exceeds a threshold, and obtaining a feature subset screened for topic classification; 4) selecting the classic multi-label classification algorithm ML-KNN as one base model for co-training, denoted $H_1$; 5) computing the semantic distance from features to topics on the basis of a corpus, building the topic prediction model ML-CSW, and using this model as the other base model for co-training, denoted $H_2$; 6) designing a collaboration mechanism based on the above two base models to match multi-label topics to the metadata text as the first-level coarse-grained topic classification result; 7) selecting the metadata text corresponding to a given classification label and extracting the text topics as the fine-grained topics of the next level, thereby also obtaining the matching relationship between the metadata text and the two-level topic catalog; 8) repeating step 7) to obtain fine-grained topic category catalogs at different levels, as well as the matching relationships between the metadata text and the topic catalogs.

2. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that in step 2) the first-level classification catalog defined from the domain application topic categories of geographic information resources is obtained by extending the societal benefit areas (SBAs) proposed by the Group on Earth Observations for the geoscience domain.

3. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that the typical-word vocabulary in step 2) is generated as follows: taking the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic defined in SWEET and WordNet are extracted as typical words semantically related to that topic, and the typical-word vocabulary is generated.
4. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that in step 3) the text feature words are screened against the typical-word vocabulary as follows: S31, representing the typical words and text feature words as word vectors in a two-dimensional space based on the Word2vec algorithm; S32, computing the cosine distance between the typical-word vectors and the text feature word vectors; S33, setting a distance threshold T and filtering out text feature words whose cosine distance to the typical words is greater than T.

5. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that the topic model in step 5) is built as follows:

S51. According to the network definitions of the SWEET ontology library and the WordNet English lexical network, compute the semantic distance $d_{p_i}$ between a text feature $f$ and each topic $p_i$: if the feature $f$ is included in SWEET, the semantic distance between $f$ and each topic $p_i$ is obtained directly from the SWEET network with the Dijkstra algorithm; if the feature $f$ is not included in SWEET, its hypernyms are looked up level by level until a hypernym included in SWEET is found as a substitute word for the feature $f$, and the distance between $f$ and the substitute word in WordNet is summed with the distance between the substitute word and each topic $p_i$ in SWEET as the semantic distance $d_{p_i}$ between the feature $f$ and that topic;

S52. Compute the minimum of the semantic distances $d_{p_i}$ between the feature $f$ and all topics $p_i$, and take its reciprocal as the maximum semantic relevance $s_f$ between the text feature $f$ and the topic set $P$, where $P$ is the set of all topics:

$$s_f = \frac{1}{\min_{p_i \in P} d_{p_i}}$$

S53. Define feature weights based on the shortest distance between text features and topics, build the topic prediction model, and predict multi-label topics for the unlabeled samples;

S54. Assuming the training set contains $n$ text features in total, compute the vector of maximum semantic relevances from all features in the training set to all topics, $S=[s_1,s_2,\dots,s_n]$; define the weight $w(x)$ of a single data item $x$ as a $1\times n$ vector whose entries correspond to the weights of the $n$ text features: if feature $f$ appears in sample $x$, the corresponding entry is $s_f$, otherwise it is 0;

S55. Build the topic prediction model $Y$, where $F$ is the adjustment vector of the features and $\alpha$ is a smoothing parameter; based on the labeled sample data, use a BP neural network to iteratively optimize the training of model $Y$, compute the optimal solution of $F$ and $\alpha$ under minimum loss to obtain the final model, and predict the category set of an unlabeled sample $t$ with the model;

$$Y = w(x) \cdot F + \alpha$$
6. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that the collaboration mechanism of step 6), which matches multi-label topics to the metadata text as the first-level coarse-grained topic classification result, is designed as follows:

S61. Generate two subsets $L_1$ and $L_2$ from the labeled samples in the geographic information service metadata text set, used respectively as the training sets of the co-training base models $H_1$ and $H_2$;

S62. Train the base models $H_1$ and $H_2$ with their training sets, and use the trained base models to predict the category vectors of the unlabeled samples;

S63. From the unlabeled samples, select those for which classifiers $H_1$ and $H_2$ produce the same prediction and assign them pseudo-labels, add the pseudo-labeled samples to the two training subsets $L_1$ and $L_2$ respectively, update the training sets, and repeat steps S62-S63 until the classification results of the two classifiers no longer change appreciably, obtaining the category sets of all unlabeled samples;

S64. Train classifier $H_1$ on all labeled samples and match topic category sets to the test samples.

7. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that in step 4) the classic multi-label classification algorithm ML-KNN is selected as one base model for co-training, as follows:

S41. Specify the number of nearest neighbors $k$ and let $N(x)$ denote the set of $k$ nearest neighbors of sample $x$ in the training set; count the number $c[j]$ of samples in $N(x)$ that belong to topic category $l$ and the number $c'[j]$ of samples in $N(x)$ that do not belong to topic category $l$; in the following formulas, when sample $x$ belongs to topic category $l$, $\vec{y}_x(l)=1$ and $\neg\vec{y}_x(l)=0$; otherwise $\vec{y}_x(l)=0$ and $\neg\vec{y}_x(l)=1$; the membership count of category $l$ among the neighbors of $x$ is

$$\vec{C}_x(l) = \sum_{a \in N(x)} \vec{y}_a(l);$$

S42. Compute the prior probability $P(H_1^l)$ that an unlabeled sample $t$ belongs to topic category $l$ and the posterior probabilities $P(E_j^l \mid H_b^l)$, where $b$ takes the values 0 and 1, $H_1^l$ denotes the event that sample $t$ belongs to topic category $l$, $H_0^l$ denotes the event that sample $t$ does not belong to topic category $l$, $s$ is a smoothing parameter, $m$ is the number of training samples, and $E_j^l$ denotes the event that exactly $j$ of the $k$ nearest neighbors of sample $t$ belong to category $l$:

$$P(H_1^l) = \frac{s + \sum_{i=1}^{m} \vec{y}_{x_i}(l)}{2s + m}, \qquad P(H_0^l) = 1 - P(H_1^l)$$

$$P(E_j^l \mid H_1^l) = \frac{s + c[j]}{s(k+1) + \sum_{p=0}^{k} c[p]}, \qquad P(E_j^l \mid H_0^l) = \frac{s + c'[j]}{s(k+1) + \sum_{p=0}^{k} c'[p]}$$

S43. Predict the category set $\vec{y}_t$ of the unlabeled sample $t$ according to the maximum a posteriori probability and the Bayesian principle:

$$\vec{y}_t(l) = \arg\max_{b \in \{0,1\}} P(H_b^l)\, P(E_{\vec{C}_t(l)}^l \mid H_b^l)$$
8. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that in step 7) the text topics are extracted based on the Latent Dirichlet Allocation algorithm.
CN201910942287.2A 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method Active CN110704624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910942287.2A CN110704624B (en) 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910942287.2A CN110704624B (en) 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method

Publications (2)

Publication Number Publication Date
CN110704624A true CN110704624A (en) 2020-01-17
CN110704624B CN110704624B (en) 2021-08-10

Family

ID=69197772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910942287.2A Active CN110704624B (en) 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method

Country Status (1)

Country Link
CN (1) CN110704624B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7975039B2 (en) * 2003-12-01 2011-07-05 International Business Machines Corporation Method and apparatus to support application and network awareness of collaborative applications using multi-attribute clustering
CN101283353A (en) * 2005-08-03 2008-10-08 温克科技公司 Systems for and methods of finding relevant documents by analyzing tags
US7958068B2 (en) * 2007-12-12 2011-06-07 International Business Machines Corporation Method and apparatus for model-shared subspace boosting for multi-label classification
US8340405B2 (en) * 2009-01-13 2012-12-25 Fuji Xerox Co., Ltd. Systems and methods for scalable media categorization
CN102129470A (en) * 2011-03-28 2011-07-20 中国科学技术大学 Tag clustering method and system
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104850650A (en) * 2015-05-29 2015-08-19 清华大学 Short-text expanding method based on similar-label relation
CN104951554A (en) * 2015-06-29 2015-09-30 浙江大学 Method for matching landscape with verses according with artistic conception of landscape
CN104991974A (en) * 2015-07-31 2015-10-21 中国地质大学(武汉) Particle swarm algorithm-based multi-label classification method
CN105354593A (en) * 2015-10-22 2016-02-24 南京大学 NMF (Non-negative Matrix Factorization)-based three-dimensional model classification method
CN105868905A (en) * 2016-03-28 2016-08-17 国网天津市电力公司 Managing and control system based on sensitive content perception
CN105868415A (en) * 2016-05-06 2016-08-17 黑龙江工程学院 Microblog real-time filtering model based on historical microblogs
US20180089540A1 (en) * 2016-09-23 2018-03-29 International Business Machines Corporation Image classification utilizing semantic relationships in a classification hierarchy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DJAVAN DE CLERCQ ET AL.: "Multi-label classification and interactive NLP-based visualization of electric", https://doi.org/10.1016/j.wpi.2019.101903 *
LIU PEIQI (刘培奇): "Label propagation algorithm based on LDA topic model" (基于LDA主题模型的标签传递算法), Journal of Computer Applications (计算机应用) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460097B (en) * 2020-03-26 2024-06-07 华泰证券股份有限公司 TPN-based small sample text classification method
CN111460097A (en) * 2020-03-26 2020-07-28 华泰证券股份有限公司 Small sample text classification method based on TPN
CN111611801A (en) * 2020-06-02 2020-09-01 腾讯科技(深圳)有限公司 Method, device, server and storage medium for identifying text region attribute
CN112464010A (en) * 2020-12-17 2021-03-09 中国矿业大学(北京) Automatic image labeling method based on Bayesian network and classifier chain
CN112464010B (en) * 2020-12-17 2021-08-27 中国矿业大学(北京) Automatic image labeling method based on Bayesian network and classifier chain
CN112256938A (en) * 2020-12-23 2021-01-22 畅捷通信息技术股份有限公司 Message metadata processing method, device and medium
CN112465075A (en) * 2020-12-31 2021-03-09 杭银消费金融股份有限公司 Metadata management method and system
CN112465075B (en) * 2020-12-31 2021-05-25 杭银消费金融股份有限公司 Metadata management method and system
CN113792081A (en) * 2021-08-31 2021-12-14 吉林银行股份有限公司 Method and system for automatically checking data assets
CN114358208A (en) * 2022-01-13 2022-04-15 辽宁工程技术大学 Science and collaboration activity text title recognition method based on deep learning
CN115408525B (en) * 2022-09-29 2023-07-04 中电科新型智慧城市研究院有限公司 Method, device, equipment and medium for classifying petition texts based on multi-level tags
CN116343104B (en) * 2023-02-03 2023-09-15 中国矿业大学 Map scene recognition method and system for visual feature and vector semantic space coupling
CN116343104A (en) * 2023-02-03 2023-06-27 中国矿业大学 Method and system for map scene recognition based on coupling of visual features and vector semantic space
CN116541752A (en) * 2023-07-06 2023-08-04 杭州美创科技股份有限公司 Metadata management method, device, computer equipment and storage medium
CN116541752B (en) * 2023-07-06 2023-09-15 杭州美创科技股份有限公司 Metadata management method, device, computer equipment and storage medium
CN118114060A (en) * 2024-02-01 2024-05-31 郑州大学 Disaster metadata automatic matching method and system based on word2vec model
CN118114060B (en) * 2024-02-01 2025-05-23 郑州大学 Disaster metadata automatic matching method and system based on word2vec model

Also Published As

Publication number Publication date
CN110704624B (en) 2021-08-10

Similar Documents

Publication Publication Date Title
CN110704624A (en) A multi-level and multi-label classification method for geographic information service metadata text
Gao et al. Visual-textual joint relevance learning for tag-based social image search
CN106649434B (en) Cross-domain knowledge migration label embedding method and device
RU2711125C2 (en) System and method of forming training set for machine learning algorithm
Miura et al. A simple scalable neural networks based model for geolocation prediction in twitter
CN104834747A (en) Short text classification method based on convolution neutral network
CN111832289A (en) A service discovery method based on clustering and Gaussian LDA
CN109885675B (en) Text subtopic discovery method based on improved LDA
US20230074771A1 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Prabowo et al. Hierarchical multi-label classification to identify hate speech and abusive language on Indonesian twitter
US20240168999A1 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Sarwar et al. A scalable framework for stylometric analysis of multi-author documents
Saikia et al. Modelling social context for fake news detection: a graph neural network based approach
Codina et al. Semantically-enhanced pre-filtering for context-aware recommender systems
Zhang et al. Multidimensional mining of massive text data
Eswaraiah et al. A Hybrid Deep Learning GRU based Approach for Text Classification using Word Embedding.
Agarwal et al. WGSDMM+ GA: A genetic algorithm-based service clustering methodology assimilating dirichlet multinomial mixture model with word embedding
Li et al. bi-hptm: An effective semantic matchmaking model for web service discovery
Eom et al. Multi-task learning for spatial events prediction from social data
Sun et al. Identifying regional characteristics of transportation research with Transport Research International Documentation (TRID) data
Khatun et al. Deep-KeywordNet: automated english keyword extraction in documents using deep keyword network based ranking
Xiao et al. Web services clustering based on HDP and SOM neural network
Vijaya Shetty et al. Graph-based keyword extraction for twitter data
Sun et al. Analysis of English writing text features based on random forest and Logistic regression classification algorithm
Qi et al. Big data prediction in location-aware wireless caching: A machine learning approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant