CN110704638A - A Construction Method of Electric Power Text Dictionary Based on Clustering Algorithm - Google Patents

Info

Publication number
CN110704638A
Authority
CN
China
Prior art keywords
text
word
dictionary
clustering
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910940220.5A
Other languages
Chinese (zh)
Inventor
邓松
徐雨楠
朱博宇
付雄
岳东
吴新新
袁新雅
陈福林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post and Telecommunication University
Priority to CN201910940220.5A
Publication of CN110704638A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for constructing an electric power text dictionary based on a clustering algorithm. The equipment it uses comprises four parts: a data classification preprocessor, a data word segmentation processor, a clustering processor, and a data processing operation core. It is a strategic method, mainly used to construct the dictionary in the text classification process of the electric power field. With the model of the invention, the key phrases that represent text categories in electric power texts can be found more accurately and used to construct the dictionary.

Description

A Method for Constructing an Electric Power Text Dictionary Based on a Clustering Algorithm

Technical Field

The invention relates to the field of power system data processing, and in particular to a method for constructing an electric power text dictionary based on a clustering algorithm, mainly used for processing text data in the electric power field.

Background

Power grid enterprises are asset-intensive, and managing the health of power equipment is their core task, so managing it scientifically with big data is an inevitable trend. However, power grid data is generally considered hard to exploit because of its large volume, many types, low value density, and rapid change. Low value density means that the vast majority of the data describes normal grid operation and only a very small fraction is abnormal. This severe skew degrades mining based on artificial intelligence methods such as machine learning and deep learning. Fortunately, among the many types of power data, text data has a high value density, because "important things are often recorded", and therefore good mining prospects. Electric power text mining is thus one of the key technologies for power equipment health management. Existing data mining work for the power grid studies and applies structured data, whereas text within the grid's unstructured data has rarely been studied; to date there are almost no research reports on processing Chinese power grid text. There is currently no technical approach or solution for extracting electric power text information, so a detailed electric power corpus cannot be built. Constructing a dictionary for the power grid domain is therefore necessary.

During equipment operation and maintenance management, power grid enterprises record information such as equipment faults, defects, overhauls, and defect elimination in Chinese. This information is stored as text in the information management system. It not only reflects the health history of individual pieces of power equipment but also contains rich reliability information about equipment of the same type. Chinese text classification has long been regarded as an important and difficult technology, and it is harder still when applied to a specialized field, where it must be closely combined with domain knowledge. Every field is developing rapidly, and new words, concepts, and relationships keep emerging; traditional lexical analysis alone falls far short of what is needed. Domain dictionaries can largely solve this problem: by building a dictionary that collects the latest concepts and their interrelationships, research within a specific field becomes much easier.

Dictionary construction must address two main problems: (1) the wording of power grid text is highly specialized, which makes dictionary construction difficult; (2) many texts in the electric power field do not strictly follow Chinese grammar and contain irregular formats, which complicates text processing and semantic parsing in the field.

Summary of the Invention

To solve the above technical problems, the invention provides a method for constructing an electric power text dictionary based on a clustering algorithm. It is a strategic method: using it makes the power system text dictionary more complete and accurate and improves subsequent text processing.

The technical solution adopted by the method is as follows. The equipment used by the method comprises a data classification preprocessor, a data word segmentation processor, a clustering processor, and a data processing operation core.

The electric power text dictionary is constructed in the following steps:

Step 1: Create the electric power corpus to be processed from documents related to the electric power field, prepare to process the texts in the corpus, and go to Step 2.

Step 2: Preprocess the text to be processed by deleting, according to the stop word list, words that do not affect its semantics, and go to Step 3.

Step 3: Segment the text preprocessed in Step 2 with a general-purpose dictionary to obtain a batch of segmented words, and go to Step 4.

Step 4: Apply the tf-idf algorithm to the text segmented in Step 3 to find keywords that represent the text, and go to Step 5.

Step 5: Construct word vectors for the keywords obtained in Step 4 with the word2vec model, and go to Step 6.

Step 6: Cluster the constructed word vectors with the k-means clustering algorithm, and go to Step 7.

Step 7: Select k of the word vectors constructed by the word2vec model as the cluster centers (μ_1, μ_2, ..., μ_{k-1}, μ_k), and go to Step 8.

Step 8: Compute the cosine distance from each word vector to each of the k cluster centers, and go to Step 9.

Step 9: Assign each word vector to the one of the k clusters whose center has the smallest cosine distance, compute the mean of the data points in each resulting cluster, and use that mean as the new cluster center.

Step 10: If the cluster centers no longer change or the maximum number of iterations is reached, the algorithm stops; go to Step 11.

Step 11: Check whether the keywords obtained by clustering reach a preset threshold; keep the words that reach it as keywords, discard those that do not, and go to Step 12.

Step 12: Construct the dictionary from the keywords obtained in Step 4 and Step 11, and go to Step 13.

Step 13: End.
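Steps 7 to 10 describe k-means clustering with cosine distance. The following is a minimal sketch of those steps in plain Python, not the patent's implementation. The function names, the random choice of initial centers, the handling of empty clusters, and the toy two-dimensional vectors are all assumptions made for illustration.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cos(u, v); zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def kmeans_cosine(vectors, k, max_iter=100, seed=0):
    """Cluster vectors following Steps 7-10; returns (labels, centers)."""
    rng = random.Random(seed)
    # Step 7: pick k of the word vectors as the initial cluster centers.
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(max_iter):
        # Steps 8-9: assign each vector to the center at the smallest cosine distance.
        labels = [
            min(range(k), key=lambda j: cosine_distance(v, centers[j]))
            for v in vectors
        ]
        # Step 9: recompute each center as the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if not members:
                new_centers.append(centers[j])  # assumption: an empty cluster keeps its center
                continue
            dim = len(members[0])
            new_centers.append(
                [sum(m[d] for m in members) / len(members) for d in range(dim)]
            )
        # Step 10: stop once the centers no longer change (or max_iter is reached).
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Toy vectors standing in for word2vec output (hypothetical values).
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels, centers = kmeans_cosine(vectors, k=2)
```

In this toy run the first two vectors point in nearly the same direction and end up in one cluster, and the last two end up in the other. Cosine distance ignores vector length, which suits word vectors whose norms vary with word frequency.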

Further, the data classification preprocessor preprocesses the test text to be classified according to the electric power corpus and the stop word list, removing meaningless words, numbers, and symbols from the text.

Further, the stop word list contains words, numbers, and symbols that occur frequently in the text but carry no real meaning.

Further, the stop word list is built as follows: a rule base of data statistics knowledge is established, a threshold is set for deciding whether a number or symbol enters the stop word list, and each number or symbol in the text is compared against this threshold to decide whether to add it to the list.
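The source does not say which statistic is compared against the threshold. The sketch below assumes one plausible choice, document frequency: a token made only of digits or symbols joins the stop word list when it appears in at least a given fraction of the documents. The function name, the regular expression, the default threshold, and the sample tokens are all hypothetical.

```python
import re
from collections import Counter

# Tokens made only of digits and punctuation/symbols (no letters, no CJK characters).
NUMERIC_OR_SYMBOL = re.compile(r"^[\W\d_]+$")


def build_stop_list(tokenized_docs, base_stop_words, threshold=0.8):
    """Extend a base stop list with numbers/symbols whose document frequency >= threshold."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    n_docs = len(tokenized_docs)
    stop_list = set(base_stop_words)
    for token, freq in doc_freq.items():
        if NUMERIC_OR_SYMBOL.match(token) and freq / n_docs >= threshold:
            stop_list.add(token)
    return stop_list


# Hypothetical, already segmented maintenance records.
docs = [
    ["1", "号", "主变", "跳闸", "。"],
    ["2", "号", "主变", "漏油", "。"],
    ["主变", "温度", "85", "。"],
]
stop = build_stop_list(docs, base_stop_words={"的", "啊"}, threshold=0.8)
```

Here the full stop occurs in every document and is added to the list, while the numbers "1", "2", and "85" each occur in only one document and are kept as potentially meaningful. The frequent domain word "主变" is never a candidate because it is neither a number nor a symbol.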

Further, the data word segmentation processor segments the preprocessed text as follows:

(1) Segment the preprocessed text with a general-purpose dictionary, then represent each resulting word as a vector;

(2) Select features from the large set of word vectors using the tf-idf algorithm:

$$\mathrm{tf} = \frac{a}{b}, \qquad \mathrm{idf} = \log\frac{c}{e + 1}$$

where a is the number of times the word occurs in the text, b is the total number of words in the text, c is the total number of documents in the electric power corpus, and e is the number of documents containing the word; 1 is added to the denominator so that it is never 0. Compute tf × idf for each word and select the words with the largest values as keywords;
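The tf × idf score can be computed directly from the four counts a, b, c, and e. The sketch below is one way to do so in plain Python. The function name, the `top_n` parameter, the natural logarithm, and the toy corpus are assumptions, since the source does not fix the logarithm base or the number of keywords to keep.

```python
import math
from collections import Counter


def tf_idf_keywords(doc_tokens, corpus, top_n=3):
    """Score each word of doc_tokens with (a / b) * log(c / (e + 1)); return the top_n."""
    counts = Counter(doc_tokens)
    b = len(doc_tokens)              # total number of words in the text
    c = len(corpus)                  # total number of documents in the corpus
    scores = {}
    for word, a in counts.items():   # a: occurrences of the word in the text
        e = sum(1 for doc in corpus if word in doc)  # documents containing the word
        scores[word] = (a / b) * math.log(c / (e + 1))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]


# Hypothetical segmented corpus of short maintenance records.
corpus = [
    ["主变", "漏油", "处理"],
    ["线路", "跳闸", "处理"],
    ["开关", "拒动", "处理"],
    ["线路", "巡视", "处理"],
]
keywords = tf_idf_keywords(corpus[0], corpus, top_n=2)
```

With this formula a word that occurs in every document, such as "处理" here, gets a negative score because c / (e + 1) is below 1, so it never outranks a word that is specific to a few documents.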

(3) Use the word2vec model to compute word vectors for the keywords obtained in (2).

Further, the word2vec model used in step (3) is the skip-gram model.

Further, the clustering processor clusters the word vectors produced by word2vec with the k-means algorithm to obtain a batch of new keywords, removes unreasonable keywords with a preset threshold, and constructs the dictionary from the clustered keywords that are above the threshold together with the keywords originally obtained by tf-idf.

Further, the data processing operation core comprises all the specific operations required for data processing after feature selection.

The beneficial effects of the invention are as follows. The invention proposes a method for constructing an electric power text dictionary based on a clustering algorithm. It is a strategic method, mainly used to construct the dictionary during text classification in the electric power field. The clustering-based approach addresses two problems caused by the highly specialized nature of electric power text: word segmentation is difficult, and the segmentation results alone do not yield a good domain dictionary. With the model of the invention, the key phrases that represent text categories in electric power texts can be found more accurately and used to construct the dictionary.

Brief Description of the Drawings

To make the content of the invention easier to understand, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

Figure 1 is a schematic diagram of the system structure.

Figure 2 is a schematic flow chart of the method of the invention.

Detailed Description

As shown in Figures 1 and 2, the method of the invention for constructing an electric power text dictionary based on a clustering algorithm is characterized in that the equipment it uses comprises a data classification preprocessor, a data word segmentation processor, a clustering processor, and a data processing operation core.

The electric power text dictionary is constructed in the following steps:

Step 1: Create the electric power corpus to be processed from documents related to the electric power field, prepare to process the texts in the corpus, and go to Step 2.

Step 2: Preprocess the text to be processed by deleting, according to the stop word list, words that do not affect its semantics, and go to Step 3.

Step 3: Segment the text preprocessed in Step 2 with a general-purpose dictionary to obtain a batch of segmented words, and go to Step 4.

Step 4: Apply the tf-idf algorithm to the text segmented in Step 3 to find keywords that represent the text, and go to Step 5.

Step 5: Construct word vectors for the keywords obtained in Step 4 with the word2vec model, and go to Step 6.

Step 6: Cluster the constructed word vectors with the k-means clustering algorithm, and go to Step 7.

Step 7: Select k of the word vectors constructed by the word2vec model as the cluster centers (μ_1, μ_2, ..., μ_{k-1}, μ_k), and go to Step 8.

Step 8: Compute the cosine distance from each word vector to each of the k cluster centers, and go to Step 9.

Step 9: Assign each word vector to the one of the k clusters whose center has the smallest cosine distance, compute the mean of the data points in each resulting cluster, and use that mean as the new cluster center.

Step 10: If the cluster centers no longer change or the maximum number of iterations is reached, the algorithm stops; go to Step 11.

Step 11: Check whether the keywords obtained by clustering reach a preset threshold; keep the words that reach it as keywords, discard those that do not, and go to Step 12.

Step 12: Construct the dictionary from the keywords obtained in Step 4 and Step 11, and go to Step 13.

Step 13: End.

The data classification preprocessor is mainly used to preprocess the data and the training data set during text classification. Text preprocessing is a necessary stage in converting semi-structured or unstructured text into a suitable text representation. Usually, special characters, punctuation marks, numbers, and other characters that carry no information are deleted first. However, electric power texts typically contain a large number of numbers and symbols, so preprocessing must treat them specially and retain the meaningful numbers and symbols in the text.

Text classification requires removing common words from the text. Common words are those that occur frequently, such as 'a' and 'the' in English or '的' and '啊' in Chinese, along with some numbers and symbols. They do not help classification and are collected into a set called the "stop word list"; the stop words contained in a text should be deleted during preprocessing. Electric power texts, however, inevitably contain many numbers and symbols, and the appropriate stop words vary with the application environment and are not limited to those in a standard stop word list. Because this method targets electric power texts, it establishes a rule base of data statistics knowledge and sets a threshold for deciding whether a number or symbol enters the stop word list; each number or symbol in the text is compared against this threshold to decide whether to add it to the "stop word list". Removing stop words can substantially improve text classification performance.

Most documents in the electric power field describe equipment status and equipment maintenance, so they are mostly short. After preprocessing, the text must be segmented into words. Because texts in this field are inevitably highly specialized, the data word segmentation processor is used to segment them and cope with that specialization.

Word segmentation is an important part of text classification. It applies an existing word segmentation tool to the text to produce a series of segmented words, which we call the segmented word set.
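The source does not name the segmentation tool or algorithm. As an illustration of dictionary-based segmentation, the sketch below uses forward maximum matching, a classic technique chosen here as an assumption: at each position it takes the longest word found in a general-purpose dictionary and falls back to a single character. The function name, `max_len`, and the tiny dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by repeatedly taking the longest dictionary word at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing in the dictionary matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical general-purpose dictionary and maintenance record.
dictionary = {"主变", "变压器", "主变压器", "油温", "过高"}
tokens = forward_max_match("主变压器油温过高", dictionary)
```

For this input the segmenter prefers the four-character entry "主变压器" over the shorter "主变", then finds "油温" and "过高". A term missing from the dictionary would be split into single characters, which is the kind of gap that the later clustering stage is meant to compensate for.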

This application first uses the data word segmentation processor to segment the preprocessed short texts, which yields a large number of words. The same processor then performs a first round of feature selection with a statistical model (the tf-idf algorithm), producing words that represent the text, namely the keywords. Because of the specialized nature of the electric power field, however, words with the same meaning as a keyword may be missed, so this application uses the word2vec algorithm to compute word vectors for the keywords.

The data word segmentation processor segments the preprocessed text as follows:

(1) Segment the preprocessed text with a general-purpose dictionary, then represent each resulting word as a vector;

(2) Select features from the large set of word vectors using the tf-idf algorithm:

$$\mathrm{tf} = \frac{a}{b}, \qquad \mathrm{idf} = \log\frac{c}{e + 1}$$

where a is the number of times the word occurs in the text, b is the total number of words in the text, c is the total number of documents in the electric power corpus, and e is the number of documents containing the word; 1 is added to the denominator so that it is never 0. Compute tf × idf for each word and select the words with the largest values as keywords;

(3) Use the word2vec model to compute word vectors for the keywords obtained in (2). word2vec is an algorithm that converts words into vectors and uses similarity in the vector space to represent semantic similarity between texts. This embodiment uses the skip-gram model of word2vec, which takes one word as input and predicts its surrounding context. In essence, the model computes u_x^T v_c (the similarity of two words), where v_c is the word vector of the target word and u_x is the word vector of the x-th word other than the target word. Here v_c = W w_c, where W is the d × V matrix of word vectors, V is the total number of words, d is the dimension of the target word's vector, and w_c is the one-hot vector of the target word.
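The relation v_c = W w_c and the score u_x^T v_c can be checked on a toy example. In the sketch below the matrices W and U hold made-up values rather than trained embeddings. The softmax that turns scores into context probabilities comes from the standard skip-gram formulation and is not stated in the source. All function names are hypothetical.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def context_probabilities(W, U, center_index):
    """Return softmax(u_x^T v_c) over all words x for the given center word."""
    vocab_size = len(W[0])
    w_c = [1.0 if i == center_index else 0.0 for i in range(vocab_size)]  # one-hot vector
    v_c = mat_vec(W, w_c)        # v_c = W w_c selects column c of W
    scores = mat_vec(U, v_c)     # scores[x] = u_x^T v_c
    peak = max(scores)           # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy values: d = 2 dimensions, V = 3 words.
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]            # d x V matrix of target-word vectors
U = [[2.0, 0.0],
     [0.0, 2.0],
     [1.0, 1.0]]                 # V x d matrix, one row u_x per context word
probs = context_probabilities(W, U, center_index=0)
```

Multiplying W by a one-hot vector simply reads out one column, so in practice the product is implemented as a table lookup. Training adjusts W and U so that words seen in the context of the target word receive high probability.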

The data word segmentation processor may ensure that the words it produces are reasonably specialized, but the results of word segmentation are limited. The word vectors are therefore clustered to obtain more specialized vocabulary in preparation for constructing the dictionary.

After the data word segmentation processor has produced a series of keywords and word2vec has produced their word vectors, the words are clustered by applying the k-means algorithm to the word vectors. This yields a series of new keywords. A preset threshold removes the unreasonable keywords obtained by clustering, and the dictionary is constructed from the clustered keywords above the threshold together with the keywords originally obtained by tf-idf.
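The source does not define what quantity the preset threshold is applied to. The sketch below assumes it is the cosine similarity between a candidate word's vector and the nearest cluster center: candidates at or above the threshold are kept and merged with the tf-idf keywords. The function names, the default threshold of 0.9, and the sample words and vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Return cos(u, v), or 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_dictionary(tfidf_keywords, candidates, centers, threshold=0.9):
    """Union the tf-idf keywords with candidates close enough to some cluster center."""
    accepted = {
        word
        for word, vector in candidates.items()
        if max(cosine_similarity(vector, center) for center in centers) >= threshold
    }
    return sorted(set(tfidf_keywords) | accepted)


# Hypothetical tf-idf keywords, clustered candidates, and one cluster center.
tfidf_keywords = ["主变", "漏油"]
candidates = {"变压器": [0.95, 0.1], "天气": [-0.2, 1.0]}
centers = [[1.0, 0.0]]
dictionary_words = build_dictionary(tfidf_keywords, candidates, centers)
```

In this toy case "变压器" lies close to the cluster center and joins the dictionary, while "天气" falls below the threshold and is discarded. The tf-idf keywords are always retained, matching the text's statement that both sources feed the dictionary.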

The data processing operation core comprises all the specific operations required for data processing after feature selection. The other parts added by the invention do not affect the processing of the data itself; they only ensure that data processing proceeds more smoothly and effectively.

For ease of description, consider the following application example:

An electric power company wants to analyze a series of previously recorded texts about customer complaints and customer repairs in order to discover user needs, improve users' evaluation of the company, and improve the user experience. Mining these complaint and repair texts with traditional classification is unlikely to give satisfactory results.

In this case, the method proposed in this patent can first be used to construct a dictionary for the company's complaint and repair texts, and that dictionary can then be used to mine the texts.

The specific implementation is as follows:

(1) Preprocess the text to be processed by removing stop words, then segment it into words.

(2) Apply tf-idf feature selection to the preprocessed and segmented words to select the keywords of the text.

(3) Construct word vectors for the keywords from (2) with the word2vec algorithm.

(4) Cluster the word vectors constructed in (3) with the k-means algorithm to obtain a series of new keywords.

(5) Use the keywords obtained in (2) and (4) as root words to construct the dictionary.

The above is only a preferred embodiment of the invention and does not further limit it; all equivalent changes made using the description and drawings of the invention fall within its scope of protection.

Claims (8)

1. A method for constructing an electric power text dictionary based on a clustering algorithm, characterized in that the equipment used by the method comprises a data classification preprocessor, a data word segmentation processor, a clustering processor, and a data processing operation core;

the electric power text dictionary is constructed in the following steps:

Step 1: Create the electric power corpus to be processed from documents related to the electric power field, prepare to process the texts in the corpus, and go to Step 2;

Step 2: Preprocess the text to be processed by deleting, according to the stop word list, words that do not affect its semantics, and go to Step 3;

Step 3: Segment the text preprocessed in Step 2 with a general-purpose dictionary to obtain a batch of segmented words, and go to Step 4;

Step 4: Apply the tf-idf algorithm to the text segmented in Step 3 to find keywords that represent the text, and go to Step 5;

Step 5: Construct word vectors for the keywords obtained in Step 4 with the word2vec model, and go to Step 6;

Step 6: Cluster the constructed word vectors with the k-means clustering algorithm, and go to Step 7;

Step 7: Select k of the word vectors constructed by the word2vec model as the cluster centers (μ_1, μ_2, ..., μ_{k-1}, μ_k), and go to Step 8;

Step 8: Compute the cosine distance from each word vector to each of the k cluster centers, and go to Step 9;

Step 9: Assign each word vector to the one of the k clusters whose center has the smallest cosine distance, compute the mean of the data points in each resulting cluster, and use that mean as the new cluster center;

Step 10: If the cluster centers no longer change or the maximum number of iterations is reached, the algorithm stops; go to Step 11;

Step 11: Check whether the keywords obtained by clustering reach a preset threshold; keep the words that reach it as keywords, discard those that do not, and go to Step 12;

Step 12: Construct the dictionary from the keywords obtained in Step 4 and Step 11, and go to Step 13;

Step 13: End.

2. The method for constructing an electric power text dictionary based on a clustering algorithm according to claim 1, characterized in that the data classification preprocessor preprocesses the test text to be classified according to the electric power corpus and the stop word list, removing meaningless words, numbers, and symbols from the text.

3. The method for constructing an electric power text dictionary based on a clustering algorithm according to claim 1, characterized in that the stop word list contains words, numbers, and symbols that occur frequently in the text but carry no real meaning.

4. The method for constructing an electric power text dictionary based on a clustering algorithm according to claim 1, characterized in that the stop word list is built as follows: a rule base of data statistics knowledge is established, a threshold is set for deciding whether a number or symbol enters the stop word list, and each number or symbol in the text is compared against this threshold to decide whether to add it to the list.

5. The method for constructing an electric power text dictionary based on a clustering algorithm according to claim 1, characterized in that the data word segmentation processor segments the preprocessed text as follows:

(1) Segment the preprocessed text with a general-purpose dictionary, then represent each resulting word as a vector;

(2) Select features from the large set of word vectors using the tf-idf algorithm,
(2) Perform feature selection on the large number of word vectors using the tf-idf algorithm,
Figure FDA0002222673750000021
where a is the number of times the word appears in the text, b is the total number of words in the text, c is the total number of documents in the electric power domain corpus, and e is the number of documents containing the word; 1 is added to the denominator so that it cannot be 0. Calculate the value of tf × idf for each word and select the words with the largest values as keywords;
(3) Use the word2vec model to compute the word vectors of the keywords obtained in (2).
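The keyword scoring in item (2) of claim 5 can be sketched as follows. The formula itself appears only as an image in the claim, so this sketch assumes the conventional form suggested by the variable definitions, tf = a / b and idf = log(c / (e + 1)). The function name `tfidf_keywords`, the `top_n` parameter, and the natural logarithm are illustrative assumptions, not details taken from the patent.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus_tokens, top_n=5):
    """Rank the words of one segmented text by tf * idf.

    Assumed formula (the claim shows it only as an image):
        tf  = a / b             a: occurrences of the word in the text
                                b: total number of words in the text
        idf = log(c / (e + 1))  c: number of documents in the corpus
                                e: number of documents containing the word
    Returns (top_n words, dict of all scores).
    """
    b = len(doc_tokens)
    c = len(corpus_tokens)
    if b == 0:
        return [], {}
    if c == 0:
        raise ValueError("corpus must contain at least one document")

    counts = Counter(doc_tokens)
    doc_sets = [set(doc) for doc in corpus_tokens]

    scores = {}
    for word, a in counts.items():
        e = sum(1 for doc in doc_sets if word in doc)
        # The +1 keeps the denominator non-zero even for unseen words.
        scores[word] = (a / b) * math.log(c / (e + 1))

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n], scores


if __name__ == "__main__":
    # Hypothetical, already segmented and stop-word-filtered documents.
    corpus = [
        ["transformer", "transformer", "fault", "the"],
        ["the", "line", "load"],
        ["the", "relay", "trip"],
        ["the", "load", "flow"],
    ]
    keywords, _ = tfidf_keywords(corpus[0], corpus, top_n=2)
    print(keywords)
```

Under the assumed formula, a word that occurs in every document receives a negative score because c / (e + 1) falls below 1, which pushes it to the bottom of the ranking.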
6. The method for constructing an electric power text dictionary based on a clustering algorithm according to claim 5, characterized in that the word2vec model used in (3) is the skip-gram model.
7. The method for constructing an electric power text dictionary based on a clustering algorithm according to claim 1, characterized in that the clustering processor clusters the word vectors obtained by the word2vec algorithm using the k-means algorithm to obtain a batch of new keywords, removes unreasonable keywords obtained by clustering using a preset threshold, and constructs the dictionary from the clustered keywords above the threshold together with the keywords originally obtained by the tf-idf algorithm.
8. The method for constructing an electric power text dictionary based on a clustering algorithm according to claim 1, characterized in that the data processing operation core comprises all specific operations required for data processing after feature selection has been performed on the data.
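Steps 7 to 12 of claim 1 (restated in claim 7) can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the applicant's implementation: the claims do not say how the initial k vectors are chosen, so the first k vectors are used as seeds; they do not say what quantity is compared with the threshold in Step 11, so cosine similarity to the word's own cluster center is assumed; and the function names `cosine_distance`, `kmeans_cosine`, `filter_by_threshold`, and `build_dictionary` are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def _assign(vectors, centers):
    """Step 8/9: index of the nearest center (cosine distance) per vector."""
    return [
        min(range(len(centers)), key=lambda j: cosine_distance(v, centers[j]))
        for v in vectors
    ]


def kmeans_cosine(vectors, k, max_iter=100):
    """k-means with cosine distance; returns (labels, centers)."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    # Step 7 (assumption): seed the centers with the first k word vectors.
    centers = [list(v) for v in vectors[:k]]
    for _ in range(max_iter):
        labels = _assign(vectors, centers)
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                # Step 9: the new center is the mean of the cluster members.
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # Step 10: centers no longer change
            break
        centers = new_centers
    return _assign(vectors, centers), centers


def filter_by_threshold(words, vectors, labels, centers, threshold):
    """Step 11 (assumed criterion): keep words whose cosine similarity to
    their own cluster center is at least `threshold`."""
    return [
        w
        for w, v, lab in zip(words, vectors, labels)
        if 1.0 - cosine_distance(v, centers[lab]) >= threshold
    ]


def build_dictionary(tfidf_keywords, cluster_keywords):
    """Step 12: merge both keyword sources into one sorted word list."""
    return sorted(set(tfidf_keywords) | set(cluster_keywords))
```

In practice the word vectors would come from a trained word2vec model; the sketch accepts any list of equal-length numeric vectors so that the clustering logic can be checked independently of the embedding step.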
CN201910940220.5A 2019-09-30 2019-09-30 A Construction Method of Electric Power Text Dictionary Based on Clustering Algorithm Pending CN110704638A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910940220.5A CN110704638A (en) 2019-09-30 2019-09-30 A Construction Method of Electric Power Text Dictionary Based on Clustering Algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910940220.5A CN110704638A (en) 2019-09-30 2019-09-30 A Construction Method of Electric Power Text Dictionary Based on Clustering Algorithm

Publications (1)

Publication Number Publication Date
CN110704638A true CN110704638A (en) 2020-01-17

Family

ID=69197391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910940220.5A Pending CN110704638A (en) 2019-09-30 2019-09-30 A Construction Method of Electric Power Text Dictionary Based on Clustering Algorithm

Country Status (1)

Country Link
CN (1) CN110704638A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107436875A (en) * 2016-05-25 2017-12-05 华为技术有限公司 File classification method and device
CN111368539A (en) * 2020-03-02 2020-07-03 贵州电网有限责任公司 Hotspot analysis modeling method
CN111931483A (en) * 2020-06-22 2020-11-13 中国电力科学研究院有限公司 Extraction method and device for structuring electric power equipment information
CN112148880A (en) * 2020-09-28 2020-12-29 深圳壹账通智能科技有限公司 Customer service dialogue corpus clustering method, system, equipment and storage medium
CN112651233A (en) * 2020-12-18 2021-04-13 北京捷通华声科技股份有限公司 Knowledge processing method, knowledge processing device, computer readable storage medium and processor
CN114266256A (en) * 2021-12-21 2022-04-01 深圳供电局有限公司 A method and system for extracting new words in the field
CN114529266A (en) * 2022-02-23 2022-05-24 福建国科信息科技有限公司 AI-based big data platform and human-job matching algorithm thereof
WO2024179519A1 (en) * 2023-03-01 2024-09-06 维沃移动通信有限公司 Semantic recognition method and apparatus

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649662A (en) * 2016-12-13 2017-05-10 成都数联铭品科技有限公司 Construction method of domain dictionary
CN108628824A (en) * 2018-04-08 2018-10-09 上海熙业信息科技有限公司 A kind of entity recognition method based on Chinese electronic health record
CN109284397A (en) * 2018-09-27 2019-01-29 深圳大学 Method, device, device and storage medium for constructing a domain dictionary
CN110287321A (en) * 2019-06-26 2019-09-27 南京邮电大学 A Power Text Classification Method Based on Improved Feature Selection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
石爱辉: "基于时空兴趣点和词袋模型的人体行为识别方法研究", 《中国优秀博硕士学位论文全文数据库(硕士)信息科技辑》 *
聂卉 等: "基于在线评论的商业竞争情报自动获取", 《情报杂志》 *

Similar Documents

Publication Publication Date Title
CN110704638A (en) A Construction Method of Electric Power Text Dictionary Based on Clustering Algorithm
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
Li et al. Twiner: named entity recognition in targeted twitter stream
CN107391486B (en) Method for identifying new words in field based on statistical information and sequence labels
WO2020192401A1 (en) System and method for generating answer based on clustering and sentence similarity
WO2023029420A1 (en) Power user appeal screening method and system, electronic device, and storage medium
US12038982B2 (en) Method of extracting table information, electronic device, and storage medium
CN109271524B (en) Entity Linking Method in Knowledge Base Question Answering System
CN118332086A (en) Question-answer pair generation method and system based on large language model
US20060277028A1 (en) Training a statistical parser on noisy data by filtering
CN108717459A (en) A kind of mobile application defect positioning method of user oriented comment information
CN112949713A (en) Text emotion classification method based on ensemble learning of complex network
CN114266256A (en) A method and system for extracting new words in the field
CN111782810A (en) Text abstract generation method based on theme enhancement
CN118535728A (en) Method, system and server for hierarchical classification of long text network information
CN112926340A (en) Semantic matching model for knowledge point positioning
CN116432638A (en) Text keyword extraction method and device, electronic equipment and storage medium
Gupta et al. Semantic parsing for technical support questions
CN118761406A (en) A HAZOP named entity recognition and entity relationship extraction method
CN117291192B (en) Government affair text semantic understanding analysis method and system
CN113792546A (en) Corpus construction method, apparatus, device and storage medium
CN113742448A (en) Knowledge point generation method and device, electronic equipment and computer readable storage medium
Zhang et al. Predicting author age from weibo microblog posts
Panahandeh et al. Correction of spaces in Persian sentences for tokenization
JP6168057B2 (en) Failure occurrence cause extraction device, failure occurrence cause extraction method, and failure occurrence cause extraction program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117
