CN110704624A - A multi-level and multi-label classification method for geographic information service metadata text - Google Patents
- Publication number
- CN110704624A CN201910942287.2A
- Authority
- CN
- China
- Prior art keywords
- text
- topic
- classification
- feature
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/38—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/387—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using geographical or spatial information, e.g. location
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Library & Information Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a multi-level, multi-label classification method for geographic information service metadata text, comprising: 1) acquiring a geographic information service metadata text set, preprocessing the text, and splitting each data sample into a combination of text feature words; 2) setting a first-level classification catalog and generating a typical-word list semantically associated with the classification categories; 3) screening the text feature words against the typical-word list; 4) selecting ML-KNN as one base model for co-training; 5) building the topic prediction model ML-CSW as the other base model for co-training; 6) designing a collaboration mechanism that matches multi-label topics to the metadata text as the first-level, coarse-grained topic classification result; 7) selecting the metadata text corresponding to a given classification label and deriving fine-grained topic category catalogs at different levels. The method takes into account the domain characteristics and text semantics of geographic information service metadata, relies on only a small number of labeled samples, and performs better overall than traditional multi-label classification methods.
Description
Technical Field
The invention relates to natural language processing technology, and in particular to a multi-level, multi-label classification method for geographic information service metadata text.
Background
As an important means of data analysis, accurate text classification is key to improving the retrieval quality of geographic information resources and has a wide range of application scenarios. Most traditional classification methods suit binary or single-label scenarios and depend heavily on large numbers of labeled samples to train the classification model, which limits the accuracy and comprehensiveness of text classification and the scenarios in which the model applies. Geographic information service metadata in particular usually lacks sample datasets annotated with topics, and its text content is heterogeneous: geoscience terms mixed with general vocabulary complicate the feature vocabulary, and the overlap and subordination among topics give metadata text topics multi-granularity and multi-category characteristics, which further increases the difficulty of topic classification. To address the lack of training samples and the need for multi-category matching, some researchers have proposed semi-supervised and weakly supervised mechanisms that reduce a classifier's dependence on training samples, and others have achieved multi-label text classification with methods such as ML-KNN, BR-KNN and TSVM. However, these methods usually do not incorporate domain characteristics or consider the semantics of technical terms in the text, so they do not fit the textual characteristics of geographic information service metadata well.
Summary of the Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a multi-level, multi-label classification method for geographic information service metadata text.
The technical solution adopted by the invention is a multi-level, multi-label classification method for geographic information service metadata text, comprising the following steps:
1) Acquire a geographic information service metadata text set containing unlabeled and labeled samples, preprocess the text, and split each data sample into a combination of text feature words;
2) Define a first-level classification catalog based on the domain application topic categories of geographic information resources, and generate a typical-word list closely related in meaning to the classification categories (hereinafter "topics");
3) Screen the text feature words against the typical-word list, filtering out features whose distance to the typical words exceeds a threshold, to obtain a feature subset selected according to the topic classification;
4) Select the classic multi-label classification algorithm ML-KNN (Multi-label K Nearest Neighbors) as one base model, $H_1$, for co-training;
5) Compute the semantic distance from features to topics using the corpus, and build the topic prediction model ML-CSW (Multi-label Classification based on SWEET & WordNet) as the other base model, $H_2$, for co-training;
6) Design a collaboration mechanism based on the two base models to match multi-label topics to the metadata text as the first-level, coarse-grained topic classification result;
7) From the first-level coarse-grained result, select the metadata text corresponding to a given classification label and extract its text topics as the fine-grained topics of the next level, while obtaining the matching relationship between the metadata text and the two-level topic catalog;
8) Repeat step 7) to obtain fine-grained topic category catalogs at different levels and the matching relationship between the metadata text and the topic catalogs.
In the above scheme, the first-level classification catalog in step 2) is obtained by extending the Societal Benefit Areas (SBAs) that the Group on Earth Observations proposed for the geoscience domain.
In the above scheme, the typical-word list in step 2) is generated as follows:
With the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic in the SWEET and WordNet definitions are extracted as typical words semantically related to the topic, yielding the typical-word list.
In the above scheme, the text feature words in step 3) are screened against the typical-word list as follows:
S31. Represent the typical words and the text feature words as two-dimensional word vectors using the Word2vec algorithm;
S32. Compute the cosine distance between the typical-word vectors and the text feature word vectors;
S33. Set a distance threshold T and filter out the text feature words whose cosine distance to the typical words is greater than T.
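Steps S31 to S33 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the two-dimensional vectors are made-up stand-ins for Word2vec output, the function names are hypothetical, and the sketch assumes that "distance to the typical words" means the distance to the nearest typical word, which the text does not state explicitly.

```python
import math

def cosine_distance(u, v):
    """Cosine distance, defined here as 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: a zero vector is treated as unrelated
    return 1.0 - dot / (norm_u * norm_v)

def filter_features(feature_vecs, typical_vecs, threshold):
    """Keep features whose distance to the nearest typical word is <= threshold."""
    kept = []
    for word, vec in feature_vecs.items():
        nearest = min(cosine_distance(vec, t) for t in typical_vecs.values())
        if nearest <= threshold:
            kept.append(word)
    return kept

# Hypothetical 2-D embeddings standing in for Word2vec vectors.
typical = {"agriculture": (1.0, 0.1), "water": (0.1, 1.0)}
features = {"crop": (0.9, 0.2), "river": (0.2, 0.9), "login": (-1.0, -1.0)}

selected = filter_features(features, typical, threshold=0.3)
print(selected)
```

In this toy data, "login" points away from both typical words and is dropped, while "crop" and "river" are kept.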
In the above scheme, the topic model in step 5) is built as follows:
Using the network definitions of the SWEET ontology and the WordNet English lexical network, compute the semantic distance $d_{p_i}$ between a text feature $f$ and each topic $p_i$;
Take the minimum of the semantic distances $d_{p_i}$ between feature $f$ and each topic $p_i$, and take its reciprocal as the maximum semantic relevance $s_f$ between the text feature $f$ and all topics $P$, where $P$ is the set of all topics; that is, $s_f = 1 / \min_{p_i \in P} d_{p_i}$;
Define the feature weights from the shortest distance between text features and topics, build the topic prediction model, and predict multi-label topics for the unlabeled samples;
Suppose the training set contains n text features in total. The vector of the maximum semantic relevance of all features to all topics, $S = [s_1, s_2, \ldots, s_n]$, can then be computed. The weight $w(x)$ of a single data item $x$ is defined as a 1×n vector whose entries correspond to the n text features: the entry for feature $f$ is $s_f$ if $f$ occurs in sample $x$, and 0 otherwise;
Build the topic prediction model $Y$, where $F$ is the feature adjustment vector and $\alpha$ is a smoothing parameter. On the labeled sample data, a BP neural network iteratively optimizes the model $Y$; the values of $F$ and $\alpha$ that minimize the loss give the final model, which predicts the category set of an unlabeled sample $t$:
$Y = w(x) \cdot F + \alpha$.
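The weight definition and the linear form of the model can be illustrated with a short sketch. It covers only the forward computation, not the BP training of F and α. Several details are assumptions made for illustration: F is treated as an n×q matrix with one column per topic so that Y holds one score per topic, the distances other than the two taken from the Fig. 4 example (Glacier to Water and Neve to Water, both 3) are invented, and all function names are hypothetical.

```python
def max_relevance(distances):
    """s_f = 1 / min_i d_{p_i}; `distances` maps topic -> semantic distance."""
    shortest = min(distances.values())
    if shortest <= 0:
        # Assumption: distances are positive; the source does not cover d = 0.
        raise ValueError("semantic distance must be positive")
    return 1.0 / shortest

def weight_vector(sample_words, vocabulary, relevance):
    """w(x): s_f where feature f occurs in the sample, 0 elsewhere."""
    return [relevance[f] if f in sample_words else 0.0 for f in vocabulary]

def topic_scores(w, F, alpha):
    """Y = w(x) * F + alpha, with F taken as an n x q matrix (assumption)."""
    n_topics = len(F[0])
    return [sum(w[i] * F[i][j] for i in range(len(w))) + alpha
            for j in range(n_topics)]

vocabulary = ["glacier", "neve", "crop"]
distances = {  # only the Water distances for glacier and neve come from Fig. 4
    "glacier": {"Water": 3, "Agriculture": 6},
    "neve": {"Water": 3, "Agriculture": 7},
    "crop": {"Water": 5, "Agriculture": 2},
}
relevance = {f: max_relevance(d) for f, d in distances.items()}

w = weight_vector({"glacier", "crop"}, vocabulary, relevance)
F = [[0.0, 3.0], [0.0, 3.0], [2.0, 0.0]]  # toy values; columns: Agriculture, Water
scores = topic_scores(w, F, alpha=0.1)
print(w, scores)
```

How the scores in Y are turned into a label set (for example, by thresholding) is not specified in the text and is left out of the sketch.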
In the above scheme, the collaboration mechanism of step 6), which matches multi-label topics to the metadata text as the first-level coarse-grained topic classification result, is as follows:
S61. From the labeled samples in the geographic information service metadata text set, generate two subsets $L_1$ and $L_2$ as the training sets of the co-training base models $H_1$ and $H_2$, respectively;
S62. Train the base models $H_1$ and $H_2$ on their training sets, and use the trained base models to predict the category vectors of the unlabeled samples;
S63. From the unlabeled samples, select those for which classifiers $H_1$ and $H_2$ give the same prediction and assign them pseudo-labels; add the pseudo-labeled samples to both training subsets $L_1$ and $L_2$ to update the training sets; repeat steps S62 and S63 until the classification results of the two classifiers no longer change appreciably, yielding the category sets of all unlabeled samples and the final updated training set;
S64. Train classifier $H_1$ on all labeled samples and match a set of topic categories to each test sample.
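The loop in S61 to S63 can be sketched as below. This is an illustrative outline under stated assumptions, not the patented procedure: the base learner is a toy 1-nearest-neighbor labeler standing in for both ML-KNN and ML-CSW, the stopping rule "no appreciable change" is approximated by "no new agreed samples in a round", and every name in the code is hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

class NearestNeighborLabeler:
    """Toy multi-label base model: copies the label set of the most similar
    training sample. A stand-in only, not ML-KNN or ML-CSW."""
    def fit(self, data):
        self.data = list(data)
        return self

    def predict(self, words):
        best = max(self.data, key=lambda item: jaccard(words, item[0]))
        return best[1]

def co_train(h1, h2, l1, l2, unlabeled, max_rounds=10):
    """Grow both training sets with samples on which h1 and h2 agree."""
    l1, l2, pool = list(l1), list(l2), list(unlabeled)
    for _ in range(max_rounds):
        h1.fit(l1)
        h2.fit(l2)
        agreed, remaining = [], []
        for x in pool:
            y1, y2 = h1.predict(x), h2.predict(x)
            if y1 == y2:
                agreed.append((x, y1))   # pseudo-label
            else:
                remaining.append(x)
        if not agreed:                   # assumed stopping rule
            break
        l1.extend(agreed)
        l2.extend(agreed)
        pool = remaining
    h1.fit(l1)
    h2.fit(l2)
    return l1, l2, pool

l1 = [(frozenset({"rain", "river"}), frozenset({"Water"})),
      (frozenset({"crop", "farm"}), frozenset({"Agriculture"}))]
l2 = [(frozenset({"river", "lake"}), frozenset({"Water"})),
      (frozenset({"farm", "soil"}), frozenset({"Agriculture"}))]
unlabeled = [frozenset({"river", "flood"}), frozenset({"farm", "wheat"})]

l1_final, l2_final, leftover = co_train(
    NearestNeighborLabeler(), NearestNeighborLabeler(), l1, l2, unlabeled)
print(len(l1_final), len(l2_final), leftover)
```

Samples on which the two models disagree stay in the pool and are revisited in the next round, after both training sets have grown.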
In the above scheme, the classic multi-label classification algorithm ML-KNN is used in step 4) as one base model for co-training, as follows:
S41. Use the ML-KNN algorithm as the co-training base model $H_1$ and specify the number of nearest neighbors k. Let $N(x)$ denote the set of the k nearest neighbors of sample $x$ in the training set; count the number $c[j]$ of samples in $N(x)$ that belong to topic category $l$ and the number $c'[j]$ of samples in $N(x)$ that do not. In the formulas, the membership indicator of sample $x$ for topic category $l$ is 1 and its complement is 0 when $x$ belongs to $l$; otherwise the indicator is 0 and its complement is 1;
S42. Compute the prior probability $P(H_b^l)$ and the posterior probability $P(E_j^l \mid H_b^l)$ for the unlabeled sample $t$ and topic category $l$, where $b$ takes the values 0 and 1, $H_1^l$ denotes the event that sample $t$ belongs to topic category $l$, $H_0^l$ denotes the event that it does not, $s$ is the smoothing parameter, $m$ is the number of training samples, and $E_j^l$ denotes the event that $j$ of the k nearest neighbors of sample $t$ belong to category $l$;
S43. Predict the category set of the unlabeled sample $t$ by maximizing the posterior probability under Bayes' rule.
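The formulas referenced in steps S41 to S43 are not reproduced in the text. For orientation, the standard ML-KNN formulation uses the same quantities (smoothing parameter s, m training samples, k neighbors, counts c[j] and c′[j]) and can be written as below. The symbols $y_{x_i}(l)$ (1 if training sample $x_i$ has label $l$, else 0) and $C_t(l)$ (the number of the k nearest neighbors of $t$ that have label $l$) follow common ML-KNN notation and are assumptions, not the patent's own typesetting. In the standard algorithm, $c[j]$ counts the training samples that have label $l$ and exactly $j$ neighbors with label $l$, and $c'[j]$ counts those that lack label $l$ and have exactly $j$ such neighbors.

```latex
\begin{align}
P(H_1^l) &= \frac{s + \sum_{i=1}^{m} y_{x_i}(l)}{2s + m},
  & P(H_0^l) &= 1 - P(H_1^l), \\
P(E_j^l \mid H_1^l) &= \frac{s + c[j]}{s\,(k+1) + \sum_{p=0}^{k} c[p]},
  & P(E_j^l \mid H_0^l) &= \frac{s + c'[j]}{s\,(k+1) + \sum_{p=0}^{k} c'[p]}, \\
y_t(l) &= \arg\max_{b \in \{0,1\}} P(H_b^l)\, P\!\left(E_{C_t(l)}^l \mid H_b^l\right).
\end{align}
```

The predicted category set of $t$ is then the set of all categories $l$ with $y_t(l) = 1$.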
In the above scheme, the text topics in step 7) are extracted using the Latent Dirichlet Allocation (LDA) algorithm.
The beneficial effects of the invention are as follows. The invention proposes a new multi-level, multi-label classification workflow for the metadata text of OGC Web Map Services (WMS) and other geographic information web resources. The workflow brings the geoscience ontology SWEET and the general English lexical network WordNet into the classification process and co-trains the traditional classification algorithm ML-KNN with ML-CSW, a classification algorithm closely fitted to domain characteristics and text semantics, to obtain the matching relationship between geographic information service metadata text and a multi-level topic catalog. The method takes into account the domain characteristics and text semantics of geographic information service metadata and relies on only a small number of labeled samples; in addition, its classification results are better overall than those of traditional multi-label classification algorithms such as classifier chains and voting classifiers.
Brief Description of the Drawings
The invention is further described below with reference to the drawings and embodiments, in which:
Fig. 1 is a flowchart of the method according to an embodiment of the invention;
Fig. 2 is a flowchart of the method according to an embodiment of the invention;
Fig. 3 shows example typical words according to an embodiment of the invention;
Fig. 4 shows an example of computing the shortest distance between a text feature and a topic in the ML-CSW algorithm according to an embodiment of the invention;
Fig. 5 shows the classification result for an example text according to an embodiment of the invention;
Fig. 6 compares the classification results of different classification algorithms according to an embodiment of the invention;
Fig. 7 compares the classification results obtained with different feature selection algorithms according to an embodiment of the invention.
Detailed Description
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to embodiments. The specific embodiments described here only explain the invention and do not limit it.
The dataset consists of 46,000 Web Map Service (WMS) text records, 400 of which are labeled with SBAs topics, with the topics evenly distributed. The text comes from the URL, Abstract, Keywords and Title fields inside the Service tag of the WMS GetCapabilities document. Because the text content is heterogeneous and varies in length, a single record corresponds to multiple topic categories, and few samples are labeled with topics, traditional multi-label classification algorithms struggle to classify the records accurately and comprehensively and cannot produce multi-level topic matches.
Building on the theory of co-training in semi-supervised learning, the invention introduces a geoscience ontology and a general English lexical network to design a base classification model that fits the characteristics of the geoscience domain. During classification, this model is co-trained with a widely used classic multi-label classification model, and multi-level fine-grained topics are extracted so that multi-level, multi-label topics are matched to the WMS metadata text.
The algorithm of the invention is described in detail below with reference to the drawings:
As shown in Fig. 1 and Fig. 2, a multi-level, multi-label classification method for geographic information service metadata text comprises the following steps:
1) Preprocess all WMS metadata text in three steps (word segmentation, stop-word removal and lemmatization), splitting each text into a combination of text feature words;
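The three preprocessing steps can be sketched with the standard library alone. The stop-word list and the suffix-stripping rule below are deliberately tiny, hypothetical stand-ins; the source does not say which tokenizer, stop-word list or lemmatizer was used, and a real pipeline would normally rely on an NLP library for these.

```python
import re

# Toy stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "for", "this", "is"}

def lemmatize(token):
    """Very rough stand-in for lemmatization: undo simple English plurals."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]
    return token

def preprocess(text):
    """Tokenize, drop stop words, and lemmatize one metadata text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(t) for t in tokens if t not in STOP_WORDS]

features = preprocess("This service maps the glaciers and rivers of Alaska")
print(features)
```

The returned list is the "combination of text feature words" that the later steps screen and weight.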
2) The first-level classification is obtained by extending the Societal Benefit Areas (SBAs) proposed by the Group on Earth Observations (GEO) for the geoscience domain. The SBAs comprise nine topics of interest: Agriculture, Biodiversity, Climate, Disaster, Ecosystem, Energy, Health, Water and Weather. The topic classification catalog of this embodiment extends the SBAs by adding Geology as a tenth topic, so every topic classification catalog and first-level topic classification catalog mentioned in this embodiment refers to these ten topics.
With the SBAs as the topic classification catalog, the hypernyms, hyponyms and synonyms of each topic in the SWEET and WordNet definitions are extracted as typical words semantically related to the topic, yielding the typical-word list. Fig. 3(a) shows example typical words for the topic "Agriculture" extracted from SWEET, and Fig. 3(b) shows example typical words for the same topic extracted from WordNet, where different colors denote different sets of word senses;
3) Using the CBOW model of the Word2vec algorithm, represent the typical words and the text feature words as two-dimensional word vectors, and compute the cosine distance between the typical-word vectors and the text feature word vectors;
4) Set a distance threshold and screen the text feature words against it, filtering out features whose distance to the typical words exceeds the threshold. This yields the feature subset that contributes most to topic classification, which serves as the model input of the classification algorithm;
5) Design ML-CSW, a multi-label classification algorithm that fits the characteristics of the WMS domain and takes text semantics into account, as the co-training base model $H_1$. The semantic relevance between text features and topics, computed from the corpus, serves as the feature weights for training the topic prediction model:
5.1) Compute the shortest semantic distance between a text feature and a topic, relying primarily on the network definition of SWEET and secondarily on WordNet;
If the text feature word is included in SWEET, obtain the shortest distance between the feature word and the topic by traversing SWEET's network definition. As shown in Fig. 4(a), the distance from the feature "Glacier" to the topic "Water" is 3;
If the text feature is not included in SWEET, search upward level by level through its hypernyms in WordNet for a substitute word until a substitute included in SWEET is found, and compute the shortest distance $D_1$ from the feature to the substitute in the WordNet definition. As shown in Fig. 4(b), the substitute for the feature "Neve" (firn) is "Ice", at a shortest distance of 1. Then, following SWEET's network definition, compute the shortest distance $D_2$ from the substitute to the topic with Dijkstra's algorithm; in Fig. 4(b), the shortest distance from the substitute "Ice" to the topic "Water" is 2. The final distance between the text feature and the topic is the sum of the two, $D = D_1 + D_2$; in Fig. 4(b), the shortest distance from the feature "Neve" to the topic "Water" is therefore 3.
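The distance computation in step 5.1 can be sketched as follows. The graph is a hypothetical toy network built only to reproduce the distances quoted from Fig. 4: the intermediate node "Cryosphere" is invented for illustration and is not a claim about the actual SWEET structure. The hypernym table is likewise a stand-in for WordNet that assumes a single hypernym per word.

```python
import heapq

def dijkstra(graph, source, target):
    """Length of the shortest path from source to target, or None."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nxt, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None

def feature_topic_distance(feature, topic, sweet, hypernym):
    """D = D1 + D2: climb hypernyms until a SWEET term is found (D1),
    then take the shortest SWEET path to the topic (D2)."""
    d1, term = 0, feature
    while term not in sweet:
        if term not in hypernym:
            return None  # no substitute word available
        term = hypernym[term]
        d1 += 1
    d2 = dijkstra(sweet, term, topic)
    return None if d2 is None else d1 + d2

def add_edge(graph, a, b, weight=1):
    """Insert an undirected edge."""
    graph.setdefault(a, {})[b] = weight
    graph.setdefault(b, {})[a] = weight

sweet = {}  # toy stand-in for the SWEET network
add_edge(sweet, "Water", "Cryosphere")   # hypothetical intermediate node
add_edge(sweet, "Cryosphere", "Ice")
add_edge(sweet, "Ice", "Glacier")
hypernym = {"Neve": "Ice"}               # toy stand-in for WordNet

d_glacier = feature_topic_distance("Glacier", "Water", sweet, hypernym)
d_neve = feature_topic_distance("Neve", "Water", sweet, hypernym)
print(d_glacier, d_neve)
```

With these toy edges, "Glacier" is found directly in the graph, while "Neve" is reached through its substitute "Ice", so both features end up at distance 3 from "Water", matching the two cases described above.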
5.2) Define the feature weights from the shortest distance between text features and topics, build the topic prediction model, and predict multi-label topics for the unlabeled samples;
a) Following step 5.1), compute the semantic distance between the text feature $f$ and each topic $p_i$, and take the reciprocal of the shortest distance as the maximum semantic relevance $s_f$ between $f$ and all topics $P$, where $P$ is the set of all topics;
b) If all texts together contain n text features, the vector of the maximum semantic relevance of all features in the training set to all topics, $S = [s_1, s_2, \ldots, s_n]$, can be computed. The weight $w(x)$ of a single data item $x$ is defined as a 1×n vector whose entries correspond to the n text features: the entry for feature $f$ is $s_f$ if $f$ occurs in sample $x$, and 0 otherwise.
c) Build the topic prediction model $Y$, where $F$ is the feature adjustment vector and $\alpha$ is a smoothing parameter. On the labeled sample data, a BP neural network iteratively optimizes the topic prediction model; the values of $F$ and $\alpha$ that minimize the loss give the final model, which can predict the category set of an unlabeled sample $t$:
$Y = w(x) \cdot F + \alpha$
6)选取应用较为广泛的经典多标签分类算法ML-KNN作为协同训练基模型H2:6) Select the widely used classical multi-label classification algorithm ML-KNN as the collaborative training base model H 2 :
指定近邻样本个数k,以N(x)表示训练集L1中样本x的k个近邻样本集合,统计N(x)中属于主题类别l的样本数量c[j],统计N(x)中不属于主题类别l的样本数量c′[j]。下列公式中,当样本x属于主题类别l时,为1,为0,当样本x不属于主题类别l时,为0,为1;Specify the number of neighbor samples k, and use N(x) to represent the k neighbor sample set of the sample x in the training set L 1 , count the number of samples c[j] belonging to the topic category l in N(x), and count N(x) The number of samples c'[j] that do not belong to topic category l in . In the following formula, when the sample x belongs to the subject category l, is 1, is 0, when the sample x does not belong to the subject category l, is 0, is 1;
Compute the prior probability and the posterior probability that the unlabeled sample t belongs to topic category l, where s is the smoothing parameter and m is the number of training samples. The three event symbols denote, respectively, that sample t belongs to topic category l, that sample t does not belong to topic category l, and that j of the k nearest neighbors of sample t belong to category l;
Predict the category set of the unlabeled sample t by maximizing the posterior probability according to Bayes' rule.
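The ML-KNN procedure described above can be illustrated for a single topic category as follows. The sketch follows the commonly published form of ML-KNN (smoothed prior, smoothed neighbor-count likelihoods, maximum a posteriori decision). The squared Euclidean distance, the function names, and the toy points are assumptions, since the formulas themselves are not reproduced in this text.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def neighbours(X, query, k, skip=None):
    """Indices of the k training points closest to query (optionally skipping one)."""
    idx = [i for i in range(len(X)) if i != skip]
    idx.sort(key=lambda i: sq_dist(X[i], query))
    return idx[:k]


def fit_mlknn(X, y, k=3, s=1.0):
    """Estimate the prior and likelihoods for one category; y[i] is 1 or 0."""
    m = len(X)
    prior1 = (s + sum(y)) / (2 * s + m)
    c = [0] * (k + 1)      # c[j]: positive samples with j positive neighbours
    c_not = [0] * (k + 1)  # c'[j]: negative samples with j positive neighbours
    for i in range(m):
        j = sum(y[n] for n in neighbours(X, X[i], k, skip=i))
        if y[i]:
            c[j] += 1
        else:
            c_not[j] += 1
    like1 = [(s + c[j]) / (s * (k + 1) + sum(c)) for j in range(k + 1)]
    like0 = [(s + c_not[j]) / (s * (k + 1) + sum(c_not)) for j in range(k + 1)]
    return {"X": X, "y": y, "k": k, "prior1": prior1, "like1": like1, "like0": like0}


def predict_mlknn(model, t):
    """Return True if the MAP decision assigns the category to sample t."""
    j = sum(model["y"][n] for n in neighbours(model["X"], t, model["k"]))
    p1 = model["prior1"] * model["like1"][j]
    p0 = (1 - model["prior1"]) * model["like0"][j]
    return p1 > p0


# Hypothetical toy data: two well-separated clusters.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
model = fit_mlknn(X, y, k=3, s=1.0)
```

Running the same fit and prediction once per topic category yields the full category set for a sample.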
7) Draw 80% of all labeled samples by repeated random sampling and divide them into two subsets, L_1 and L_2, which serve as the training sets of classifiers H_1 and H_2, respectively. Use the two classifiers to predict the category sets of all unlabeled samples;
8) Select the samples for which classifiers H_1 and H_2 give the same prediction and assign them pseudo-labels. Add the pseudo-labeled samples to both training subsets L_1 and L_2, update the training sets, and repeat step 7) until the classification results of the two classifiers no longer change noticeably. This yields the category sets of the unlabeled samples.
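The co-training loop of steps 7) and 8) can be sketched as below. The two base learners here are simple nearest-neighbor stand-ins with different distance functions, not the topic prediction model H_1 and ML-KNN H_2 of the method. The way the 80% pool is split in half, the stopping test (identical predictions in two consecutive rounds), and the toy data are likewise assumptions made for illustration.

```python
import random


def euclid(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))


def nearest_label(train, x, dist):
    """Stand-in base learner: label set of the closest training sample."""
    return min(train, key=lambda item: dist(item[0], x))[1]


def co_train(labeled, unlabeled, max_rounds=10, seed=0):
    """Return the two classifiers' final predictions for the unlabeled samples."""
    rng = random.Random(seed)
    pool = rng.sample(labeled, max(2, int(0.8 * len(labeled))))
    half = len(pool) // 2
    L1, L2 = pool[:half], pool[half:]  # training subsets for H1 and H2
    previous = None
    p1, p2 = [], []
    for _ in range(max_rounds):
        p1 = [nearest_label(L1, x, euclid) for x in unlabeled]
        p2 = [nearest_label(L2, x, manhattan) for x in unlabeled]
        if (p1, p2) == previous:  # results stopped changing
            break
        previous = (p1, p2)
        for x, a, b in zip(unlabeled, p1, p2):
            if a == b:  # both classifiers agree: assign a pseudo-label
                if (x, a) not in L1:
                    L1.append((x, a))
                if (x, a) not in L2:
                    L2.append((x, a))
    return p1, p2


# Hypothetical toy data: feature vectors paired with multi-label category sets.
WATER = frozenset({"Water"})
CLIMATE = frozenset({"Climate", "Weather"})
labeled = [((i % 3, i // 3), WATER) for i in range(9)]
labeled += [((10 + i % 3, 10 + i // 3), CLIMATE) for i in range(9)]
unlabeled = [(1, 1), (9, 9), (0, 2), (12, 11)]
pred_h1, pred_h2 = co_train(labeled, unlabeled)
```

Samples on which the two prediction lists agree are the ones that would keep their pseudo-labels as final category sets.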
9) Use 10% of all labeled samples as test samples, and use the trained classifiers to match a set of topic categories to each test sample. In the example text of Fig. 5, the SBAs category labels include Biodiversity, Climate, Disaster, Ecosystem, Water and Weather.
10) Specify the number of topic levels N. At each level, select the metadata texts of a single topic category and extract fine-grained topics from them with the Latent Dirichlet Allocation (LDA) algorithm, until an N-level topic catalog has been generated and the WMS metadata texts have been matched to N levels of topics. In the example of Fig. 5, the second-level topics are wildlife, specie and diversity for Biodiversity; forest and meteorology for Climate; pollution for Disaster; habitat, resource and conserve for Ecosystem; rain for Water; and meteorology for Weather.
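One way to picture step 10) for N = 2 is to run LDA separately on the texts of each first-level category and read off the top words of each topic as second-level topics. The sketch below uses a small collapsed Gibbs sampler written in pure Python. The text does not say which LDA inference method or hyperparameters are used, so the sampler, the values of alpha and beta, and the tokenized toy documents are all assumptions.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [Counter() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assign = []
    for d, doc in enumerate(docs):  # random initial topic assignments
        zs = [rng.randrange(n_topics) for _ in doc]
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assign[d][i]  # remove the token, then resample its topic
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Most frequent words of each topic, ignoring zero counts."""
    return [[w for w, c in tw.most_common(n) if c > 0] for tw in topic_word]


def second_level_topics(docs_by_category, n_topics=2):
    """Run LDA within each first-level category to get second-level topics."""
    return {
        cat: top_words(lda_gibbs(docs, n_topics)[1])
        for cat, docs in docs_by_category.items()
    }


# Hypothetical tokenized metadata texts grouped by first-level category.
docs_by_category = {
    "Biodiversity": [["wildlife", "specie", "habitat"], ["diversity", "specie", "wildlife"]],
    "Water": [["rain", "river", "rain"], ["rain", "flood", "river"]],
}
catalog = second_level_topics(docs_by_category)
```

For N greater than 2, the same function could be applied again to the documents assigned to each second-level topic.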
The invention takes into account the domain characteristics and text semantics of geographic information service metadata and relies on only a small number of labeled samples. As shown in Fig. 6, its classification results are better overall than those of traditional multi-label classification algorithms such as classifier chains and voting classifiers.
As shown in Fig. 7, the text feature selection process of the invention filters out features that do not contribute to the classification result more effectively than the chi-square test and WordNet-based feature selection methods. The method can be applied to geographic information portals and data catalog services to assist the retrieval and discovery of all kinds of geographic information resources.
It should be understood that those of ordinary skill in the art may make improvements or changes based on the above description, and all such improvements and changes fall within the scope of protection of the appended claims of the invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910942287.2A CN110704624B (en) | 2019-09-30 | 2019-09-30 | Geographic information service metadata text multi-level multi-label classification method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110704624A (en) | 2020-01-17 |
CN110704624B (en) | 2021-08-10 |
Family
ID=69197772
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910942287.2A (Active, CN110704624B) | Geographic information service metadata text multi-level multi-label classification method | 2019-09-30 | 2019-09-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110704624B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460097A (en) * | 2020-03-26 | 2020-07-28 | 华泰证券股份有限公司 | Small sample text classification method based on TPN |
CN111611801A (en) * | 2020-06-02 | 2020-09-01 | 腾讯科技(深圳)有限公司 | Method, device, server and storage medium for identifying text region attribute |
CN112256938A (en) * | 2020-12-23 | 2021-01-22 | 畅捷通信息技术股份有限公司 | Message metadata processing method, device and medium |
CN112465075A (en) * | 2020-12-31 | 2021-03-09 | 杭银消费金融股份有限公司 | Metadata management method and system |
CN112464010A (en) * | 2020-12-17 | 2021-03-09 | 中国矿业大学(北京) | Automatic image labeling method based on Bayesian network and classifier chain |
CN113792081A (en) * | 2021-08-31 | 2021-12-14 | 吉林银行股份有限公司 | Method and system for automatically checking data assets |
CN114358208A (en) * | 2022-01-13 | 2022-04-15 | 辽宁工程技术大学 | Science and collaboration activity text title recognition method based on deep learning |
CN116343104A (en) * | 2023-02-03 | 2023-06-27 | 中国矿业大学 | Method and system for map scene recognition based on coupling of visual features and vector semantic space |
CN115408525B (en) * | 2022-09-29 | 2023-07-04 | 中电科新型智慧城市研究院有限公司 | Method, device, equipment and medium for classifying petition texts based on multi-level tags |
CN116541752A (en) * | 2023-07-06 | 2023-08-04 | 杭州美创科技股份有限公司 | Metadata management method, device, computer equipment and storage medium |
CN118114060A (en) * | 2024-02-01 | 2024-05-31 | 郑州大学 | Disaster metadata automatic matching method and system based on word2vec model |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101283353A (en) * | 2005-08-03 | 2008-10-08 | 温克科技公司 | Systems for and methods of finding relevant documents by analyzing tags |
US7958068B2 (en) * | 2007-12-12 | 2011-06-07 | International Business Machines Corporation | Method and apparatus for model-shared subspace boosting for multi-label classification |
US7975039B2 (en) * | 2003-12-01 | 2011-07-05 | International Business Machines Corporation | Method and apparatus to support application and network awareness of collaborative applications using multi-attribute clustering |
CN102129470A (en) * | 2011-03-28 | 2011-07-20 | 中国科学技术大学 | Tag clustering method and system |
US8340405B2 (en) * | 2009-01-13 | 2012-12-25 | Fuji Xerox Co., Ltd. | Systems and methods for scalable media categorization |
CN104408153A (en) * | 2014-12-03 | 2015-03-11 | 中国科学院自动化研究所 | Short text hash learning method based on multi-granularity topic models |
CN104850650A (en) * | 2015-05-29 | 2015-08-19 | 清华大学 | Short-text expanding method based on similar-label relation |
CN104951554A (en) * | 2015-06-29 | 2015-09-30 | 浙江大学 | Method for matching landscape with verses according with artistic conception of landscape |
CN104991974A (en) * | 2015-07-31 | 2015-10-21 | 中国地质大学(武汉) | Particle swarm algorithm-based multi-label classification method |
CN105354593A (en) * | 2015-10-22 | 2016-02-24 | 南京大学 | NMF (Non-negative Matrix Factorization)-based three-dimensional model classification method |
CN105868415A (en) * | 2016-05-06 | 2016-08-17 | 黑龙江工程学院 | Microblog real-time filtering model based on historical microblogs |
CN105868905A (en) * | 2016-03-28 | 2016-08-17 | 国网天津市电力公司 | Managing and control system based on sensitive content perception |
US20180089540A1 (en) * | 2016-09-23 | 2018-03-29 | International Business Machines Corporation | Image classification utilizing semantic relationships in a classification hierarchy |
Non-Patent Citations (2)
Title |
---|
DJAVAN DE CLERCQA ET.AL: ""Multi-label classification and interactive NLP-based visualization of electric"", 《HTTPS://DOI.ORG/10.1016/J.WPI.2019.101903》 * |
LIU Peiqi: "Label propagation algorithm based on the LDA topic model", Journal of Computer Applications (《计算机应用》) * |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111460097B (en) * | 2020-03-26 | 2024-06-07 | 华泰证券股份有限公司 | TPN-based small sample text classification method |
CN111460097A (en) * | 2020-03-26 | 2020-07-28 | 华泰证券股份有限公司 | Small sample text classification method based on TPN |
CN111611801A (en) * | 2020-06-02 | 2020-09-01 | 腾讯科技(深圳)有限公司 | Method, device, server and storage medium for identifying text region attribute |
CN112464010A (en) * | 2020-12-17 | 2021-03-09 | 中国矿业大学(北京) | Automatic image labeling method based on Bayesian network and classifier chain |
CN112464010B (en) * | 2020-12-17 | 2021-08-27 | 中国矿业大学(北京) | Automatic image labeling method based on Bayesian network and classifier chain |
CN112256938A (en) * | 2020-12-23 | 2021-01-22 | 畅捷通信息技术股份有限公司 | Message metadata processing method, device and medium |
CN112465075A (en) * | 2020-12-31 | 2021-03-09 | 杭银消费金融股份有限公司 | Metadata management method and system |
CN112465075B (en) * | 2020-12-31 | 2021-05-25 | 杭银消费金融股份有限公司 | Metadata management method and system |
CN113792081A (en) * | 2021-08-31 | 2021-12-14 | 吉林银行股份有限公司 | Method and system for automatically checking data assets |
CN114358208A (en) * | 2022-01-13 | 2022-04-15 | 辽宁工程技术大学 | Science and collaboration activity text title recognition method based on deep learning |
CN115408525B (en) * | 2022-09-29 | 2023-07-04 | 中电科新型智慧城市研究院有限公司 | Method, device, equipment and medium for classifying petition texts based on multi-level tags |
CN116343104B (en) * | 2023-02-03 | 2023-09-15 | 中国矿业大学 | Map scene recognition method and system for visual feature and vector semantic space coupling |
CN116343104A (en) * | 2023-02-03 | 2023-06-27 | 中国矿业大学 | Method and system for map scene recognition based on coupling of visual features and vector semantic space |
CN116541752A (en) * | 2023-07-06 | 2023-08-04 | 杭州美创科技股份有限公司 | Metadata management method, device, computer equipment and storage medium |
CN116541752B (en) * | 2023-07-06 | 2023-09-15 | 杭州美创科技股份有限公司 | Metadata management method, device, computer equipment and storage medium |
CN118114060A (en) * | 2024-02-01 | 2024-05-31 | 郑州大学 | Disaster metadata automatic matching method and system based on word2vec model |
CN118114060B (en) * | 2024-02-01 | 2025-05-23 | 郑州大学 | Disaster metadata automatic matching method and system based on word2vec model |
Also Published As
Publication number | Publication date |
---|---|
CN110704624B (en) | 2021-08-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110704624A (en) | A multi-level and multi-label classification method for geographic information service metadata text | |
Gao et al. | Visual-textual joint relevance learning for tag-based social image search | |
CN106649434B (en) | Cross-domain knowledge migration label embedding method and device | |
RU2711125C2 (en) | System and method of forming training set for machine learning algorithm | |
Miura et al. | A simple scalable neural networks based model for geolocation prediction in twitter | |
CN104834747A (en) | Short text classification method based on convolution neutral network | |
CN111832289A (en) | A service discovery method based on clustering and Gaussian LDA | |
CN109885675B (en) | Text subtopic discovery method based on improved LDA | |
US20230074771A1 (en) | Hierarchical clustering on graphs for taxonomy extraction and applications thereof | |
Prabowo et al. | Hierarchical multi-label classification to identify hate speech and abusive language on Indonesian twitter | |
US20240168999A1 (en) | Hierarchical clustering on graphs for taxonomy extraction and applications thereof | |
Sarwar et al. | A scalable framework for stylometric analysis of multi-author documents | |
Saikia et al. | Modelling social context for fake news detection: a graph neural network based approach | |
Codina et al. | Semantically-enhanced pre-filtering for context-aware recommender systems | |
Zhang et al. | Multidimensional mining of massive text data | |
Eswaraiah et al. | A Hybrid Deep Learning GRU based Approach for Text Classification using Word Embedding. | |
Agarwal et al. | WGSDMM+ GA: A genetic algorithm-based service clustering methodology assimilating dirichlet multinomial mixture model with word embedding | |
Li et al. | bi-hptm: An effective semantic matchmaking model for web service discovery | |
Eom et al. | Multi-task learning for spatial events prediction from social data | |
Sun et al. | Identifying regional characteristics of transportation research with Transport Research International Documentation (TRID) data | |
Khatun et al. | Deep-KeywordNet: automated english keyword extraction in documents using deep keyword network based ranking | |
Xiao et al. | Web services clustering based on HDP and SOM neural network | |
Vijaya Shetty et al. | Graph-based keyword extraction for twitter data | |
Sun et al. | Analysis of English writing text features based on random forest and Logistic regression classification algorithm | |
Qi et al. | Big data prediction in location-aware wireless caching: A machine learning approach |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |