CN110704624A - A multi-level and multi-label classification method for geographic information service metadata text - Google Patents

A multi-level and multi-label classification method for geographic information service metadata text

Info

Publication number
CN110704624A
CN110704624A
Authority
CN
China
Prior art keywords
text
topic
classification
feature
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910942287.2A
Other languages
Chinese (zh)
Other versions
CN110704624B (en)
Inventor
桂志鹏
张敏
彭德华
吴华意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN201910942287.2A priority Critical patent/CN110704624B/en
Publication of CN110704624A publication Critical patent/CN110704624A/en
Application granted granted Critical
Publication of CN110704624B publication Critical patent/CN110704624B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/387 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-level and multi-label classification method for geographic information service metadata text, comprising: 1) acquiring a geographic information service metadata text set, performing text preprocessing, and segmenting each data sample into a combination of text feature words; 2) setting a first-level classification catalog and generating a typical-word vocabulary semantically associated with the classification categories; 3) screening the text feature words against the typical-word vocabulary; 4) selecting ML-KNN as one base model for co-training; 5) building the topic prediction model ML-CSW as the other base model for co-training; 6) designing a collaboration mechanism that matches multi-label topics to the metadata text as the first-level coarse-grained topic classification result; 7) selecting the metadata text corresponding to a given classification label to obtain fine-grained topic category catalogs at different levels. The method takes into account the domain characteristics and text semantics of geographic information service metadata, relies on only a small number of labeled samples, and its classification results outperform traditional multi-label classification methods overall.

Description

A multi-level and multi-label classification method for geographic information service metadata text

Technical Field

The invention relates to natural language processing technology, and in particular to a multi-level and multi-label classification method for geographic information service metadata text.

Background Art

As an important means of data analysis, accurate text classification is key to improving the retrieval quality of geographic information resources and has a wide range of application scenarios. Most traditional classification methods are designed for binary or single-label scenarios and rely heavily on large numbers of labeled samples to train the classification model, which limits the accuracy and comprehensiveness of text classification as well as the scenarios in which the model can be applied. For geographic information service metadata in particular, sample datasets with annotated topics are usually lacking, and the text content is heterogeneous: the mixture of geoscience terminology and general vocabulary complicates the feature vocabulary, and the overlap and subordination relationships among topics give the metadata text topics multi-granularity and multi-category characteristics, further increasing the difficulty of topic classification. To address the shortage of training samples and the need for multi-category matching, some researchers have proposed semi-supervised and weakly supervised mechanisms to reduce the classifier's dependence on training samples, while others have achieved multi-label text classification with methods such as ML-KNN, BR-KNN and TSVM. However, these methods usually do not incorporate domain characteristics or consider the semantics of technical terms in the text, and therefore cannot effectively fit the textual characteristics of geographic information service metadata.

Summary of the Invention

The technical problem to be solved by the present invention is to provide a multi-level and multi-label classification method for geographic information service metadata text, in view of the defects of the prior art.

The technical solution adopted by the present invention to solve this technical problem is a multi-level and multi-label classification method for geographic information service metadata text, comprising the following steps:

1) Acquire a geographic information service metadata text set containing unlabeled and labeled samples, perform text preprocessing, and segment each data sample into a combination of text feature words;

2) Define a first-level classification catalog based on the domain application topic categories of geographic information resources, and generate a typical-word vocabulary closely associated with the semantics of the classification categories (hereinafter referred to as "topics");

3) Screen the text feature words against the typical-word vocabulary, filter out features whose distance to the typical words exceeds a threshold, and obtain a feature subset screened for topic classification;

4) Select the classic multi-label classification algorithm ML-KNN (Multi-label K Nearest Neighbors) as one base model $H_1$ for co-training;

5) Compute the semantic distance from features to topics on the basis of a corpus, build the topic prediction model ML-CSW (Multi-label Classification based on SWEET & WordNet), and use this model as the other base model $H_2$ for co-training;

6) Design a collaboration mechanism based on the above two base models to match multi-label topics to the metadata text as the first-level coarse-grained topic classification result;

7) According to the first-level coarse-grained topic classification result, select the metadata text corresponding to a given classification label and extract the text topics as the fine-grained topics of the next level, thereby also obtaining the matching relationship between the metadata text and the two-level topic catalog;

8) Repeat step 7) to obtain fine-grained topic category catalogs at different levels, as well as the matching relationships between the metadata text and the topic catalogs.

According to the above scheme, in step 2) the first-level classification catalog defined from the domain application topic categories of geographic information resources is obtained by extending the societal benefit areas (SBAs) proposed by the Group on Earth Observations for the geoscience domain.

According to the above scheme, the typical-word vocabulary in step 2) is generated as follows:

Taking the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic defined in SWEET and WordNet are extracted as typical words semantically related to that topic, and the typical-word vocabulary is generated.

According to the above scheme, in step 3) the text feature words are screened against the typical-word vocabulary as follows:

S31. Represent the typical words and text feature words as word vectors in a two-dimensional space based on the Word2vec algorithm;

S32. Compute the cosine distance between the typical-word vectors and the text feature word vectors;

S33. Set a distance threshold T and filter out text feature words whose cosine distance to the typical words is greater than T.

According to the above scheme, the topic model in step 5) is built as follows:

According to the network definitions of the SWEET ontology library and the WordNet English lexical network, compute the semantic distance $d_{p_i}$ between a text feature $f$ and each topic $p_i$;

Take the minimum of the semantic distances $d_{p_i}$ between feature $f$ and all topics $p_i$, and take its reciprocal as the maximum semantic relevance $s_f$ between the text feature $f$ and the topic set $P$, where $P$ is the set of all topics;

Define feature weights based on the shortest distance between text features and topics, build the topic prediction model, and predict multi-label topics for the unlabeled samples;

Assuming the training set contains $n$ text features in total, the vector of maximum semantic relevances from all features in the training set to all topics, $S=[s_1,s_2,\dots,s_n]$, can be computed. The weight $w(x)$ of a single data item $x$ is defined as a $1\times n$ vector whose entries correspond to the weights of the $n$ text features: if feature $f$ appears in sample $x$, the corresponding entry is $s_f$, otherwise it is 0;

Build the topic prediction model $Y$, where $F$ is the adjustment vector of the features and $\alpha$ is a smoothing parameter. Based on the labeled sample data, a BP neural network is used to iteratively optimize the training of model $Y$, the optimal solution of $F$ and $\alpha$ under minimum loss is computed to obtain the final model, and the category set of an unlabeled sample $t$ is predicted with the model;

$$Y = w(x) \cdot F + \alpha$$

According to the above scheme, the collaboration mechanism of step 6), which matches multi-label topics to the metadata text as the first-level coarse-grained topic classification result, is designed as follows:

S61. Generate two subsets $L_1$ and $L_2$ from the labeled samples in the geographic information service metadata text set, used respectively as the training sets of the co-training base models $H_1$ and $H_2$;

S62. Train the base models $H_1$ and $H_2$ with their training sets, and use the trained base models to predict the category vectors of the unlabeled samples;

S63. From the unlabeled samples, select those for which classifiers $H_1$ and $H_2$ produce the same prediction and assign them pseudo-labels, add the pseudo-labeled samples to the two training subsets $L_1$ and $L_2$ respectively, update the training sets, and repeat steps S62-S63 until the classification results of the two classifiers no longer change appreciably, obtaining the category sets of all unlabeled samples and the last updated training sets;

S64. Train classifier $H_1$ on all labeled samples and match topic category sets to the test samples.

According to the above scheme, in step 4) the classic multi-label classification algorithm ML-KNN is selected as one base model for co-training, as follows:

S41. Select the ML-KNN algorithm as the co-training base model $H_1$ and specify the number of nearest neighbors $k$. Let $N(x)$ denote the set of $k$ nearest neighbors of sample $x$ in the training set; count the number $c[j]$ of samples in $N(x)$ that belong to topic category $l$ and the number $c'[j]$ of samples in $N(x)$ that do not belong to topic category $l$. In the following formulas, when sample $x$ belongs to topic category $l$, $\vec{y}_x(l)=1$ and $\neg\vec{y}_x(l)=0$; otherwise $\vec{y}_x(l)=0$ and $\neg\vec{y}_x(l)=1$. The membership count of category $l$ among the neighbors of $x$ is

$$\vec{C}_x(l) = \sum_{a \in N(x)} \vec{y}_a(l);$$

S42. Compute the prior probability $P(H_1^l)$ that an unlabeled sample $t$ belongs to topic category $l$ and the posterior probabilities $P(E_j^l \mid H_b^l)$, where $b$ takes the values 0 and 1, $H_1^l$ denotes the event that sample $t$ belongs to topic category $l$, $H_0^l$ denotes the event that sample $t$ does not belong to topic category $l$, $s$ is a smoothing parameter, $m$ is the number of training samples, and $E_j^l$ denotes the event that exactly $j$ of the $k$ nearest neighbors of sample $t$ belong to category $l$:

$$P(H_1^l) = \frac{s + \sum_{i=1}^{m} \vec{y}_{x_i}(l)}{2s + m}, \qquad P(H_0^l) = 1 - P(H_1^l)$$

$$P(E_j^l \mid H_1^l) = \frac{s + c[j]}{s(k+1) + \sum_{p=0}^{k} c[p]}, \qquad P(E_j^l \mid H_0^l) = \frac{s + c'[j]}{s(k+1) + \sum_{p=0}^{k} c'[p]}$$

S43. Predict the category set $\vec{y}_t$ of the unlabeled sample $t$ according to the maximum a posteriori probability and the Bayesian principle:

$$\vec{y}_t(l) = \arg\max_{b \in \{0,1\}} P(H_b^l)\, P(E_{\vec{C}_t(l)}^l \mid H_b^l)$$

According to the above scheme, in step 7) the text topics are extracted based on the Latent Dirichlet Allocation (LDA) algorithm.

The beneficial effects of the present invention are as follows: the invention proposes a new multi-level and multi-label classification workflow for the metadata text of the OGC Web Map Service (WMS) and other geographic information web resources. The workflow introduces the geoscience ontology library SWEET and the general English lexical network WordNet into the classification process, and combines the traditional classification algorithm ML-KNN with ML-CSW, a classification algorithm that closely fits the domain characteristics and text semantics, for co-training, so as to obtain the matching relationship between geographic information service metadata text and a multi-level topic catalog. The method takes into account the domain characteristics and text semantics of geographic information service metadata and relies on only a small number of labeled samples; at the same time, compared with traditional multi-label classification algorithms such as classifier chains and voting classifiers, its classification results perform better overall.

Description of the Drawings

The present invention will be further described below with reference to the accompanying drawings and embodiments, in which:

Figure 1 is a flowchart of the method according to an embodiment of the present invention;

Figure 2 is a flowchart of the method according to an embodiment of the present invention;

Figure 3 shows examples of typical words according to an embodiment of the present invention;

Figure 4 illustrates the computation of the shortest distance between text features and topics in the ML-CSW algorithm according to an embodiment of the present invention;

Figure 5 shows the classification result of an example text according to an embodiment of the present invention;

Figure 6 compares the classification results of different classification algorithms according to an embodiment of the present invention;

Figure 7 compares the classification results based on different feature selection algorithms according to an embodiment of the present invention.

Detailed Description of the Embodiments

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.

There are 46,000 Web Map Service (WMS) text records available, of which 400 are annotated with SBA topics, and the topics are evenly distributed. The text content comes from the URL, Abstract, Keywords and Title fields inside the Service tag of the WMS GetCapabilities document. Because the text content is heterogeneous, the documents vary in length, a single record corresponds to multiple topic categories, and the amount of topic-annotated sample data is small, traditional multi-label classification algorithms can hardly classify such data accurately and comprehensively, nor can they produce multi-level topic matching results.

Building on the theoretical basis of co-training in semi-supervised learning, the present invention introduces a geoscience ontology library and a general English lexical network to design a base classification model that fits the characteristics of the geoscience domain. During classification it is co-trained with the widely used classic multi-label classification model, and multi-level fine-grained topics are extracted so as to match multi-level multi-label topics to the WMS metadata text.

The algorithm workflow of the present invention is described in detail below with reference to the accompanying drawings, as follows:

As shown in Figures 1 and 2, a multi-level and multi-label classification method for geographic information service metadata text comprises the following steps:

1) Perform text preprocessing on all WMS metadata, including the three steps of tokenization, stop-word removal and lemmatization, and segment each text into a combination of text feature words.
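
The following is a minimal, illustrative sketch of this preprocessing step using NLTK; the example record and the choice of the NLTK stop-word list are assumptions for illustration, not part of the patent.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Assumes the NLTK resources 'punkt', 'stopwords' and 'wordnet' have been downloaded.
_stop = set(stopwords.words("english"))
_lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list[str]:
    """Tokenize, drop stop words and non-alphabetic tokens, then lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in _stop]
    return [_lemmatizer.lemmatize(t) for t in tokens]

# Example: one WMS metadata record assembled from the URL, Abstract, Keywords and Title fields.
record = "Global precipitation and rainfall observations for flood monitoring"
print(preprocess(record))  # e.g. ['global', 'precipitation', 'rainfall', 'observation', 'flood', 'monitoring']
```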

2) The first-level classification is obtained by extending the societal benefit areas (SBAs) proposed by the Group on Earth Observations (GEO) for the geoscience domain. The SBAs comprise nine topics of interest: Agriculture, Biodiversity, Climate, Disaster, Ecosystem, Energy, Health, Water and Weather. The topic classification catalog of this embodiment extends the SBAs by adding Geology as a tenth topic, so all references in this embodiment to the topic classification catalog or the first-level topic classification catalog refer to these ten topics.

Taking the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic defined in SWEET and WordNet are extracted as typical words semantically related to that topic, and the typical-word vocabulary is generated. Figure 3(a) shows examples of typical words for the topic "Agriculture" extracted from SWEET, and Figure 3(b) shows examples of typical words for the same topic extracted from WordNet, with different colors representing different sets of word senses.
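
As an illustration of how the WordNet part of this vocabulary could be assembled, the sketch below collects synonyms, hypernyms and hyponyms of a topic word with NLTK's WordNet interface; SWEET has no comparably standard Python API, so it is omitted here, and the helper name is an assumption.

```python
from nltk.corpus import wordnet as wn

def typical_words(topic: str) -> set[str]:
    """Collect synonyms, hypernyms and hyponyms of a topic word from WordNet."""
    words = set()
    for synset in wn.synsets(topic):
        words.update(lemma.name().lower() for lemma in synset.lemmas())        # synonyms
        for hyper in synset.hypernyms():
            words.update(lemma.name().lower() for lemma in hyper.lemmas())     # hypernyms
        for hypo in synset.hyponyms():
            words.update(lemma.name().lower() for lemma in hypo.lemmas())      # hyponyms
    return words

# Typical words for the SBA topic "Agriculture" (cf. Figure 3(b)).
print(sorted(typical_words("agriculture"))[:10])
```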

3) The CBOW model of the Word2vec algorithm represents the typical words and text feature words as word vectors in a two-dimensional space, and the cosine distance between the typical-word vectors and the text feature word vectors is computed;

4) Set a distance threshold and screen the text feature words against it, filtering out features whose distance to the typical words exceeds the threshold, thereby obtaining a feature subset that contributes more to topic classification as the model input of the classification algorithm.
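
A minimal sketch of steps 3)-4) using gensim is given below; the toy corpus, the CBOW hyperparameters and the threshold value are placeholders rather than values prescribed by the patent.

```python
from gensim.models import Word2Vec
from numpy import dot
from numpy.linalg import norm

# Train a CBOW model (sg=0) on the preprocessed metadata corpus (a list of token lists).
corpus = [["glacier", "ice", "water"], ["crop", "farm", "agriculture"]]  # placeholder corpus
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=0)

def cosine_distance(a, b):
    return 1.0 - dot(a, b) / (norm(a) * norm(b))

def keep_feature(feature, typical_words, threshold=0.6):
    """Keep a feature word only if it is close to at least one typical word."""
    if feature not in model.wv:
        return False
    return any(
        w in model.wv and cosine_distance(model.wv[feature], model.wv[w]) <= threshold
        for w in typical_words
    )

print(keep_feature("glacier", {"water", "ice"}))
```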

5) Design ML-CSW, a multi-label classification algorithm that fits the characteristics of the WMS domain and considers text semantics, as the co-training base model $H_1$; the semantic relevance between text features and topics, computed from the corpus, serves as the feature weights used to train the topic prediction model:

5.1) Compute the shortest semantic distance between text features and topics, relying primarily on the network definition of SWEET, with WordNet as a supplement.

If a text feature word is included in SWEET, the shortest distance between the feature word and the topic is obtained from the SWEET network definition; as shown in Figure 4(a), the distance from the feature "Glacier" to the topic "Water" is 3.

If a text feature is not included in SWEET, its hypernyms are looked up level by level in WordNet as substitute words for the feature until a substitute word included in SWEET is found, and the shortest distance $D_1$ from the feature to the substitute word in the WordNet definition is computed. As shown in Figure 4(b), the substitute word for the feature "Neve" is "Ice", with a shortest distance of 1. Then, according to the SWEET network definition, the shortest distance $D_2$ from the substitute word to the topic is computed with the Dijkstra algorithm; in Figure 4(b) the shortest distance from the substitute word "Ice" to the topic "Water" is 2. The final distance between the text feature and the topic is the sum of the distance from the feature to the substitute word and the distance from the substitute word to the topic, i.e. $D = D_1 + D_2$; in Figure 4(b) the shortest distance from the feature "Neve" to the topic "Water" is 3.
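
The sketch below illustrates this two-stage distance computation under the assumption that the SWEET concept network has already been loaded into a NetworkX graph; the toy graph, the choice of the first WordNet sense and the helper name are assumptions for illustration.

```python
import networkx as nx
from nltk.corpus import wordnet as wn

# sweet: an undirected graph whose nodes are SWEET concept labels (lower-case), stub edges only.
sweet = nx.Graph()
sweet.add_edges_from([("glacier", "ice"), ("ice", "snow"), ("snow", "water"),
                      ("ice", "water_substance"), ("water_substance", "water")])

def feature_topic_distance(feature: str, topic: str) -> float:
    """Shortest semantic distance D = D1 (WordNet hypernym hops) + D2 (shortest path on SWEET)."""
    d1, substitute = 0, feature
    while substitute not in sweet:                       # climb WordNet hypernyms until SWEET covers the word
        synsets = wn.synsets(substitute)
        if not synsets or not synsets[0].hypernyms():
            return float("inf")                          # no usable substitute found
        substitute = synsets[0].hypernyms()[0].lemma_names()[0].lower()
        d1 += 1
    try:
        d2 = nx.shortest_path_length(sweet, substitute, topic)  # unweighted shortest path on the SWEET network
    except nx.NetworkXNoPath:
        return float("inf")
    return d1 + d2

print(feature_topic_distance("glacier", "water"))   # 3 with the toy graph above
```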

5.2) Define feature weights based on the shortest distance between text features and topics, build the topic prediction model, and predict multi-label topics for the unlabeled samples;

a) From step 5.1), the semantic distance $d_{p_i}$ between a text feature $f$ and each topic $p_i$ can be computed. The reciprocal of the shortest of these distances is taken as the maximum semantic relevance $s_f$ between the text feature $f$ and the topic set $P$, where $P$ is the set of all topics:

$$s_f = \frac{1}{\min_{p_i \in P} d_{p_i}}$$

b) If all texts contain $n$ text features in total, the vector of maximum semantic relevances from all features in the training set to all topics, $S=[s_1,s_2,\dots,s_n]$, can be computed. The weight $w(x)$ of a single data item $x$ is defined as a $1\times n$ vector whose entries correspond to the weights of the $n$ text features: if feature $f$ appears in sample $x$, the corresponding entry is $s_f$, otherwise it is 0.

c) Build the topic prediction model $Y$, where $F$ is the adjustment vector of the features and $\alpha$ is a smoothing parameter. Based on the labeled sample data, a BP neural network is used to iteratively optimize the training of the topic prediction model, and the optimal solution of $F$ and $\alpha$ under minimum loss is computed to obtain the final model, with which the category set of an unlabeled sample $t$ can be predicted;

$$Y = w(x) \cdot F + \alpha$$
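
If the model $Y = w(x)F + \alpha$ is read as a single linear layer that maps the weighted feature vector $w(x)$ to topic scores, it can be trained by back-propagation as sketched below; the loss function, the dimensions and all hyperparameters are assumptions for illustration, not values given in the patent.

```python
import torch
import torch.nn as nn

n_features, n_topics = 500, 10          # assumed sizes: feature vocabulary and SBA topics

# Y = w(x) * F + alpha: a linear layer whose weight plays the role of F and whose bias plays the role of alpha.
model = nn.Linear(n_features, n_topics)
criterion = nn.BCEWithLogitsLoss()      # multi-label loss (one sigmoid per topic)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train(W, Y_true, epochs=200):
    """W: (m, n_features) matrix of weighted feature vectors w(x); Y_true: (m, n_topics) 0/1 label matrix."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(W), Y_true)
        loss.backward()                  # back-propagation, iteratively adjusting F and alpha
        optimizer.step()

def predict(w_t, threshold=0.5):
    """Predict the multi-label topic set of an unlabeled sample t from its weight vector w(t)."""
    with torch.no_grad():
        probs = torch.sigmoid(model(w_t))
    return (probs >= threshold).int()
```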

6) Select the widely used classic multi-label classification algorithm ML-KNN as the co-training base model $H_2$:

Specify the number of nearest neighbors $k$ and let $N(x)$ denote the set of $k$ nearest neighbors of sample $x$ in the training set $L_1$; count the number $c[j]$ of samples in $N(x)$ that belong to topic category $l$ and the number $c'[j]$ of samples in $N(x)$ that do not belong to topic category $l$. In the following formulas, when sample $x$ belongs to topic category $l$, $\vec{y}_x(l)=1$ and $\neg\vec{y}_x(l)=0$; when sample $x$ does not belong to topic category $l$, $\vec{y}_x(l)=0$ and $\neg\vec{y}_x(l)=1$. The membership count of category $l$ among the neighbors of $x$ is

$$\vec{C}_x(l) = \sum_{a \in N(x)} \vec{y}_a(l)$$

Compute the prior probability $P(H_1^l)$ that an unlabeled sample $t$ belongs to topic category $l$ and the posterior probabilities $P(E_j^l \mid H_b^l)$, where $s$ is a smoothing parameter, $m$ is the number of training samples, $H_1^l$ denotes the event that sample $t$ belongs to topic category $l$, $H_0^l$ denotes the event that sample $t$ does not belong to topic category $l$, and $E_j^l$ denotes the event that exactly $j$ of the $k$ nearest neighbors of sample $t$ belong to category $l$:

$$P(H_1^l) = \frac{s + \sum_{i=1}^{m} \vec{y}_{x_i}(l)}{2s + m}$$

$$P(E_j^l \mid H_1^l) = \frac{s + c[j]}{s(k+1) + \sum_{p=0}^{k} c[p]}$$

$$P(E_j^l \mid H_0^l) = \frac{s + c'[j]}{s(k+1) + \sum_{p=0}^{k} c'[p]}$$

Predict the category set $\vec{y}_t$ of the unlabeled sample $t$ according to the maximum a posteriori probability and the Bayesian principle:

$$\vec{y}_t(l) = \arg\max_{b \in \{0,1\}} P(H_b^l)\, P(E_{\vec{C}_t(l)}^l \mid H_b^l)$$
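
A compact, illustrative implementation of these ML-KNN estimates is sketched below (Euclidean nearest neighbors over a numeric feature matrix); it follows the standard ML-KNN formulation and its function name, defaults and neighbor search are assumptions, not details fixed by the patent.

```python
import numpy as np

def ml_knn_fit_predict(X_train, Y_train, X_test, k=10, s=1.0):
    """X_train: (m, n) features; Y_train: (m, q) 0/1 labels; returns (len(X_test), q) 0/1 predictions."""
    m, q = Y_train.shape

    def neighbors(x, X):
        return np.argsort(np.linalg.norm(X - x, axis=1))[:k]

    # Prior probabilities P(H_1^l) with smoothing parameter s.
    prior1 = (s + Y_train.sum(axis=0)) / (2 * s + m)

    # c[l][j] / c'[l][j]: how often a training sample with / without label l has exactly j neighbors with label l.
    c = np.zeros((q, k + 1)); c_bar = np.zeros((q, k + 1))
    for i in range(m):
        idx = neighbors(X_train[i], np.delete(X_train, i, axis=0))
        delta = np.delete(Y_train, i, axis=0)[idx].sum(axis=0).astype(int)   # label counts among neighbors
        for l in range(q):
            (c if Y_train[i, l] == 1 else c_bar)[l, delta[l]] += 1

    post1 = (s + c) / (s * (k + 1) + c.sum(axis=1, keepdims=True))
    post0 = (s + c_bar) / (s * (k + 1) + c_bar.sum(axis=1, keepdims=True))

    # MAP prediction for each test sample and each label.
    preds = np.zeros((len(X_test), q), dtype=int)
    for t, x in enumerate(X_test):
        delta = Y_train[neighbors(x, X_train)].sum(axis=0).astype(int)
        for l in range(q):
            p1 = prior1[l] * post1[l, delta[l]]
            p0 = (1 - prior1[l]) * post0[l, delta[l]]
            preds[t, l] = int(p1 >= p0)
    return preds
```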

7) Split 80% of all labeled samples, by repeated random sampling, into two subsets $L_1$ and $L_2$, used respectively as the training sets of classifiers $H_1$ and $H_2$, and use the two classifiers to predict the category sets of all unlabeled samples;

8) Select the unlabeled samples for which classifiers $H_1$ and $H_2$ produce the same prediction and assign them pseudo-labels, add the pseudo-labeled samples to the two training subsets $L_1$ and $L_2$ respectively, update the training sets, and repeat step 7) until the classification results of the two classifiers no longer change appreciably, obtaining the category sets of the unlabeled samples.
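
The co-training loop of steps 7)-8) could be organised as in the sketch below; the classifier interface, the agreement test and the stopping criterion are illustrative assumptions.

```python
import numpy as np

def co_train(h1, h2, L1, L2, unlabeled, max_rounds=20):
    """h1/h2: objects with fit(X, Y) and predict(X); L1/L2: (X, Y) tuples; unlabeled: feature matrix."""
    prev = None
    for _ in range(max_rounds):
        h1.fit(*L1)
        h2.fit(*L2)
        p1, p2 = h1.predict(unlabeled), h2.predict(unlabeled)

        agree = np.all(p1 == p2, axis=1)               # samples on which both base models agree
        if prev is not None and np.array_equal(p1, prev):
            break                                      # predictions no longer change appreciably
        prev = p1

        # Pseudo-label the agreed samples and grow both training subsets.
        X_new, Y_new = unlabeled[agree], p1[agree]
        L1 = (np.vstack([L1[0], X_new]), np.vstack([L1[1], Y_new]))
        L2 = (np.vstack([L2[0], X_new]), np.vstack([L2[1], Y_new]))
    return p1, L1, L2
```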

9) Take 10% of all labeled samples as test samples and use the trained classifier to match topic category sets to them; for the example text in Figure 5, the SBA category labels include Biodiversity, Climate, Disaster, Ecosystem, Water and Weather.

10) Specify the number of topic levels N. For each level, select the metadata text of a single topic category and extract fine-grained text topics based on the Latent Dirichlet Allocation (LDA) algorithm, until an N-level topic catalog has been generated and N levels of topics have been matched to the WMS metadata text. In Figure 5, the second-level topics corresponding to Biodiversity are wildlife, specie and diversity; those corresponding to Climate are forest and meteorology; the second-level topic corresponding to Disaster is pollution; those corresponding to Ecosystem are habitat, resource and conserve; the second-level topic corresponding to Water is rain; and the second-level topic corresponding to Weather is meteorology.
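
A minimal sketch of the per-category LDA topic extraction with gensim is given below; the number of topics, the number of top words per topic and the toy documents are placeholders.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def fine_grained_topics(docs, num_topics=3, topn=3):
    """docs: preprocessed token lists of the metadata texts under one coarse-grained label."""
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics, passes=10, random_state=0)
    # Return the top words of each latent topic as the next-level fine-grained topic labels.
    return [[w for w, _ in lda.show_topic(t, topn=topn)] for t in range(num_topics)]

# Example: texts already assigned the first-level label "Biodiversity".
docs = [["wildlife", "specie", "habitat"], ["diversity", "specie", "conservation"]]
print(fine_grained_topics(docs))
```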

The present invention takes into account the domain characteristics and text semantics of geographic information service metadata and relies on only a small number of labeled samples. As shown in Figure 6, compared with traditional multi-label classification algorithms such as classifier chains and voting classifiers, the classification results of the method perform better overall.

As shown in Figure 7, compared with the chi-square test and the WordNet-based feature selection method, the text feature selection workflow of the present invention is better able to filter out features that contribute nothing to the classification result. The method can be applied to geographic information portals and data catalog services to assist the retrieval and discovery of all kinds of geographic information resources.

It should be understood that those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims (8)

1. A multi-level and multi-label classification method for geographic information service metadata text, characterized in that it comprises the following steps: 1) acquiring a geographic information service metadata text set containing unlabeled and labeled samples, performing text preprocessing, and segmenting each data sample into a combination of text feature words; 2) setting a first-level classification catalog based on the domain application topic categories of geographic information resources to obtain the classification categories, i.e. the topics, and then generating a typical-word vocabulary semantically associated with the classification categories; 3) screening the text feature words against the typical-word vocabulary, filtering out features whose distance to the typical words exceeds a threshold, and obtaining a feature subset screened for topic classification; 4) selecting the classic multi-label classification algorithm ML-KNN as one base model for co-training, denoted $H_1$; 5) computing the semantic distance from features to topics on the basis of a corpus, building the topic prediction model ML-CSW, and using this model as the other base model for co-training, denoted $H_2$; 6) designing a collaboration mechanism based on the above two base models to match multi-label topics to the metadata text as the first-level coarse-grained topic classification result; 7) selecting the metadata text corresponding to a given classification label and extracting the text topics as the fine-grained topics of the next level, thereby also obtaining the matching relationship between the metadata text and the two-level topic catalog; 8) repeating step 7) to obtain fine-grained topic category catalogs at different levels, as well as the matching relationships between the metadata text and the topic catalogs.

2. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that in step 2) the first-level classification catalog defined from the domain application topic categories of geographic information resources is obtained by extending the societal benefit areas (SBAs) proposed by the Group on Earth Observations for the geoscience domain.

3. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that the typical-word vocabulary in step 2) is generated as follows: taking the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic defined in SWEET and WordNet are extracted as typical words semantically related to that topic, and the typical-word vocabulary is generated.
4. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that in step 3) the text feature words are screened against the typical-word vocabulary as follows: S31, representing the typical words and text feature words as word vectors in a two-dimensional space based on the Word2vec algorithm; S32, computing the cosine distance between the typical-word vectors and the text feature word vectors; S33, setting a distance threshold T and filtering out text feature words whose cosine distance to the typical words is greater than T.

5. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that the topic model in step 5) is built as follows:

S51. According to the network definitions of the SWEET ontology library and the WordNet English lexical network, compute the semantic distance $d_{p_i}$ between a text feature $f$ and each topic $p_i$: if the feature $f$ is included in SWEET, the semantic distance between $f$ and each topic $p_i$ is obtained directly from the SWEET network with the Dijkstra algorithm; if the feature $f$ is not included in SWEET, its hypernyms are looked up level by level until a hypernym included in SWEET is found as a substitute word for the feature $f$, and the distance between $f$ and the substitute word in WordNet is summed with the distance between the substitute word and each topic $p_i$ in SWEET as the semantic distance $d_{p_i}$ between the feature $f$ and that topic;

S52. Compute the minimum of the semantic distances $d_{p_i}$ between the feature $f$ and all topics $p_i$, and take its reciprocal as the maximum semantic relevance $s_f$ between the text feature $f$ and the topic set $P$, where $P$ is the set of all topics:

$$s_f = \frac{1}{\min_{p_i \in P} d_{p_i}}$$

S53. Define feature weights based on the shortest distance between text features and topics, build the topic prediction model, and predict multi-label topics for the unlabeled samples;

S54. Assuming the training set contains $n$ text features in total, compute the vector of maximum semantic relevances from all features in the training set to all topics, $S=[s_1,s_2,\dots,s_n]$; define the weight $w(x)$ of a single data item $x$ as a $1\times n$ vector whose entries correspond to the weights of the $n$ text features: if feature $f$ appears in sample $x$, the corresponding entry is $s_f$, otherwise it is 0;

S55. Build the topic prediction model $Y$, where $F$ is the adjustment vector of the features and $\alpha$ is a smoothing parameter; based on the labeled sample data, use a BP neural network to iteratively optimize the training of model $Y$, compute the optimal solution of $F$ and $\alpha$ under minimum loss to obtain the final model, and predict the category set of an unlabeled sample $t$ with the model;

$$Y = w(x) \cdot F + \alpha$$
6. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that the collaboration mechanism of step 6), which matches multi-label topics to the metadata text as the first-level coarse-grained topic classification result, is designed as follows:

S61. Generate two subsets $L_1$ and $L_2$ from the labeled samples in the geographic information service metadata text set, used respectively as the training sets of the co-training base models $H_1$ and $H_2$;

S62. Train the base models $H_1$ and $H_2$ with their training sets, and use the trained base models to predict the category vectors of the unlabeled samples;

S63. From the unlabeled samples, select those for which classifiers $H_1$ and $H_2$ produce the same prediction and assign them pseudo-labels, add the pseudo-labeled samples to the two training subsets $L_1$ and $L_2$ respectively, update the training sets, and repeat steps S62-S63 until the classification results of the two classifiers no longer change appreciably, obtaining the category sets of all unlabeled samples;

S64. Train classifier $H_1$ on all labeled samples and match topic category sets to the test samples.

7. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that in step 4) the classic multi-label classification algorithm ML-KNN is selected as one base model for co-training, as follows:

S41. Specify the number of nearest neighbors $k$ and let $N(x)$ denote the set of $k$ nearest neighbors of sample $x$ in the training set; count the number $c[j]$ of samples in $N(x)$ that belong to topic category $l$ and the number $c'[j]$ of samples in $N(x)$ that do not belong to topic category $l$; in the following formulas, when sample $x$ belongs to topic category $l$, $\vec{y}_x(l)=1$ and $\neg\vec{y}_x(l)=0$; otherwise $\vec{y}_x(l)=0$ and $\neg\vec{y}_x(l)=1$; the membership count of category $l$ among the neighbors of $x$ is

$$\vec{C}_x(l) = \sum_{a \in N(x)} \vec{y}_a(l);$$

S42. Compute the prior probability $P(H_1^l)$ that an unlabeled sample $t$ belongs to topic category $l$ and the posterior probabilities $P(E_j^l \mid H_b^l)$, where $b$ takes the values 0 and 1, $H_1^l$ denotes the event that sample $t$ belongs to topic category $l$, $H_0^l$ denotes the event that sample $t$ does not belong to topic category $l$, $s$ is a smoothing parameter, $m$ is the number of training samples, and $E_j^l$ denotes the event that exactly $j$ of the $k$ nearest neighbors of sample $t$ belong to category $l$:

$$P(H_1^l) = \frac{s + \sum_{i=1}^{m} \vec{y}_{x_i}(l)}{2s + m}, \qquad P(H_0^l) = 1 - P(H_1^l)$$

$$P(E_j^l \mid H_1^l) = \frac{s + c[j]}{s(k+1) + \sum_{p=0}^{k} c[p]}, \qquad P(E_j^l \mid H_0^l) = \frac{s + c'[j]}{s(k+1) + \sum_{p=0}^{k} c'[p]}$$

S43. Predict the category set $\vec{y}_t$ of the unlabeled sample $t$ according to the maximum a posteriori probability and the Bayesian principle:

$$\vec{y}_t(l) = \arg\max_{b \in \{0,1\}} P(H_b^l)\, P(E_{\vec{C}_t(l)}^l \mid H_b^l)$$
8. The multi-level and multi-label classification method for geographic information service metadata text according to claim 1, characterized in that in step 7) the text topics are extracted based on the Latent Dirichlet Allocation algorithm.
CN201910942287.2A 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method Active CN110704624B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910942287.2A CN110704624B (en) 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910942287.2A CN110704624B (en) 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method

Publications (2)

Publication Number Publication Date
CN110704624A true CN110704624A (en) 2020-01-17
CN110704624B CN110704624B (en) 2021-08-10

Family

ID=69197772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910942287.2A Active CN110704624B (en) 2019-09-30 2019-09-30 Geographic information service metadata text multi-level multi-label classification method

Country Status (1)

Country Link
CN (1) CN110704624B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7975039B2 (en) * 2003-12-01 2011-07-05 International Business Machines Corporation Method and apparatus to support application and network awareness of collaborative applications using multi-attribute clustering
CN101283353A (en) * 2005-08-03 2008-10-08 温克科技公司 Systems for and methods of finding relevant documents by analyzing tags
US7958068B2 (en) * 2007-12-12 2011-06-07 International Business Machines Corporation Method and apparatus for model-shared subspace boosting for multi-label classification
US8340405B2 (en) * 2009-01-13 2012-12-25 Fuji Xerox Co., Ltd. Systems and methods for scalable media categorization
CN102129470A (en) * 2011-03-28 2011-07-20 中国科学技术大学 Tag clustering method and system
CN104408153A (en) * 2014-12-03 2015-03-11 中国科学院自动化研究所 Short text hash learning method based on multi-granularity topic models
CN104850650A (en) * 2015-05-29 2015-08-19 清华大学 Short-text expanding method based on similar-label relation
CN104951554A (en) * 2015-06-29 2015-09-30 浙江大学 Method for matching landscape with verses according with artistic conception of landscape
CN104991974A (en) * 2015-07-31 2015-10-21 中国地质大学(武汉) Particle swarm algorithm-based multi-label classification method
CN105354593A (en) * 2015-10-22 2016-02-24 南京大学 NMF (Non-negative Matrix Factorization)-based three-dimensional model classification method
CN105868905A (en) * 2016-03-28 2016-08-17 国网天津市电力公司 Managing and control system based on sensitive content perception
CN105868415A (en) * 2016-05-06 2016-08-17 黑龙江工程学院 Microblog real-time filtering model based on historical microblogs
US20180089540A1 (en) * 2016-09-23 2018-03-29 International Business Machines Corporation Image classification utilizing semantic relationships in a classification hierarchy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DJAVAN DE CLERCQ ET AL.: "Multi-label classification and interactive NLP-based visualization of electric", https://doi.org/10.1016/j.wpi.2019.101903 *
LIU PEIQI (刘培奇): "Label propagation algorithm based on LDA topic model" (基于LDA主题模型的标签传递算法), Journal of Computer Applications (计算机应用) *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460097B (en) * 2020-03-26 2024-06-07 华泰证券股份有限公司 TPN-based small sample text classification method
CN111460097A (en) * 2020-03-26 2020-07-28 华泰证券股份有限公司 Small sample text classification method based on TPN
CN111611801A (en) * 2020-06-02 2020-09-01 腾讯科技(深圳)有限公司 Method, device, server and storage medium for identifying text region attribute
CN112464010A (en) * 2020-12-17 2021-03-09 中国矿业大学(北京) Automatic image labeling method based on Bayesian network and classifier chain
CN112464010B (en) * 2020-12-17 2021-08-27 中国矿业大学(北京) Automatic image labeling method based on Bayesian network and classifier chain
CN112256938A (en) * 2020-12-23 2021-01-22 畅捷通信息技术股份有限公司 Message metadata processing method, device and medium
CN112465075A (en) * 2020-12-31 2021-03-09 杭银消费金融股份有限公司 Metadata management method and system
CN112465075B (en) * 2020-12-31 2021-05-25 杭银消费金融股份有限公司 Metadata management method and system
CN113792081A (en) * 2021-08-31 2021-12-14 吉林银行股份有限公司 Method and system for automatically checking data assets
CN114358208A (en) * 2022-01-13 2022-04-15 辽宁工程技术大学 Science and collaboration activity text title recognition method based on deep learning
CN115408525B (en) * 2022-09-29 2023-07-04 中电科新型智慧城市研究院有限公司 Method, device, equipment and medium for classifying petition texts based on multi-level tags
CN116343104B (en) * 2023-02-03 2023-09-15 中国矿业大学 Map scene recognition method and system for visual feature and vector semantic space coupling
CN116343104A (en) * 2023-02-03 2023-06-27 中国矿业大学 Method and system for map scene recognition based on coupling of visual features and vector semantic space
CN116541752A (en) * 2023-07-06 2023-08-04 杭州美创科技股份有限公司 Metadata management method, device, computer equipment and storage medium
CN116541752B (en) * 2023-07-06 2023-09-15 杭州美创科技股份有限公司 Metadata management method, device, computer equipment and storage medium
CN118114060A (en) * 2024-02-01 2024-05-31 郑州大学 Disaster metadata automatic matching method and system based on word2vec model
CN118114060B (en) * 2024-02-01 2025-05-23 郑州大学 Disaster metadata automatic matching method and system based on word2vec model

Also Published As

Publication number Publication date
CN110704624B (en) 2021-08-10

Similar Documents

Publication Publication Date Title
CN110704624A (en) A multi-level and multi-label classification method for geographic information service metadata text
Gao et al. Visual-textual joint relevance learning for tag-based social image search
CN106649434B (en) Cross-domain knowledge migration label embedding method and device
RU2711125C2 (en) System and method of forming training set for machine learning algorithm
Miura et al. A simple scalable neural networks based model for geolocation prediction in twitter
CN104834747A (en) Short text classification method based on convolution neutral network
CN111832289A (en) A service discovery method based on clustering and Gaussian LDA
CN109885675B (en) Text subtopic discovery method based on improved LDA
US20230074771A1 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Prabowo et al. Hierarchical multi-label classification to identify hate speech and abusive language on Indonesian twitter
US20240168999A1 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Sarwar et al. A scalable framework for stylometric analysis of multi-author documents
Saikia et al. Modelling social context for fake news detection: a graph neural network based approach
Codina et al. Semantically-enhanced pre-filtering for context-aware recommender systems
Zhang et al. Multidimensional mining of massive text data
Eswaraiah et al. A Hybrid Deep Learning GRU based Approach for Text Classification using Word Embedding.
Agarwal et al. WGSDMM+ GA: A genetic algorithm-based service clustering methodology assimilating dirichlet multinomial mixture model with word embedding
Li et al. bi-hptm: An effective semantic matchmaking model for web service discovery
Eom et al. Multi-task learning for spatial events prediction from social data
Sun et al. Identifying regional characteristics of transportation research with Transport Research International Documentation (TRID) data
Khatun et al. Deep-KeywordNet: automated english keyword extraction in documents using deep keyword network based ranking
Xiao et al. Web services clustering based on HDP and SOM neural network
Vijaya Shetty et al. Graph-based keyword extraction for twitter data
Sun et al. Analysis of English writing text features based on random forest and Logistic regression classification algorithm
Qi et al. Big data prediction in location-aware wireless caching: A machine learning approach

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant