WO2022126810A1 - Text clustering method - Google Patents


Info

Publication number
WO2022126810A1
Authority
WO
WIPO (PCT)
Prior art keywords
matrix
clustering
text
adjacency
laplacian
Prior art date
Application number
PCT/CN2021/071166
Other languages
French (fr)
Chinese (zh)
Inventor
张校源
马祥祥
Original Assignee
上海爱数信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海爱数信息技术股份有限公司
Publication of WO2022126810A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Definitions

  • the present application relates to the technical field of text analysis, for example, to a text clustering method.
  • Text clustering is mainly based on the well-known clustering assumption: documents of the same class are more similar, while documents of different classes are less similar.
  • as an unsupervised machine learning method, clustering requires no training process and no manual labeling of documents in advance, so it is flexible and highly automated; it has become an important means of effectively organizing, summarizing, and navigating text information, and is attracting attention from more and more researchers.
  • the spectral clustering algorithm regards each object in the data set as a vertex V of a graph and quantifies the similarity between vertices as the weight of the edge E connecting them, which yields an undirected weighted graph G(V, E) based on similarity, so the clustering problem can be transformed into a graph partitioning problem.
  • the optimal division criterion based on graph theory is to maximize the similarity within the subgraphs and minimize the similarity between the subgraphs.
  • the spectral clustering algorithm has different implementations, but they can all be summarized in three main steps: 1) construct the similarity matrix S representing the object set; 2) calculate the degree matrix and the Laplacian matrix, and construct the eigenvector space; 3) use kmeans or another classical clustering algorithm to cluster the eigenvectors in the eigenvector space.
  • the above clustering methods can only perform text clustering when the number of categories is known, and cannot give the category keywords after clustering, so that users cannot directly know the subject content to be expressed in this category according to the keywords.
  • Most of the clustering results calculated by the clustering method have the problem of low precision and recall, that is, the accuracy of the clustering results is low.
  • the present application provides a text clustering method, which can cluster the text in the case of known or unknown number of categories, and can output keywords corresponding to each category at the same time.
  • a text clustering method including:
  • when the number of categories is known, the obtained clustering result is used as the final clustering result;
  • when the number of categories is unknown, multiple clustering results are obtained by performing the following operation multiple times, the multiple clustering results are evaluated, and the final clustering result is selected according to the evaluation: adjust the clustering parameters, then return to constructing the adjacency matrix, the degree matrix, and the Laplacian matrix, calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain a feature matrix, and clustering the feature matrix to obtain a clustering result;
  • category keywords are extracted based on a term frequency-inverse document frequency (Term Frequency-Inverse Document Frequency, TF-IDF) algorithm;
  • FIG. 1 is a schematic flowchart of a text clustering method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a text clustering process provided by an embodiment of the present application.
  • a text clustering method based on an improved spectral clustering algorithm includes the following steps:
  • step S9 is performed.
  • step S8 is executed.
  • step S9: combine the clustering result obtained in step S6 or step S8 with the keywords extracted in step S1, and extract category keywords based on the TF-IDF algorithm.
  • text keywords need to be extracted for two reasons: one is to reduce the dimension of the vector for creating a text similarity matrix, and the other is to extract category keywords based on text keywords after clustering is completed.
  • when extracting keywords, part-of-speech filtering mainly retains keywords that are nouns, verbs, gerunds, person names, place names, and organization names, to improve the accuracy of text similarity.
  • this application calculates text similarity by constructing a bag of words: the TF-IDF value of each keyword in a text is calculated and stored in a bag-like structure; the method checks whether two texts share keywords and then uses the TF-IDF values in the bag to calculate the similarity between the texts. This is similar to the cosine distance calculation, but it reduces the amount of computation and so improves efficiency.
  • the obtained text similarity matrix is an N*N matrix, and each value is the similarity between texts.
  • s_ij is the Euclidean distance between element x_i and element x_j in the text similarity matrix.
  • the adjacency matrix W is defined as follows:
  • w_ij is the element in row i, column j of the adjacency matrix W.
  • in the K-nearest-neighbor method, s_ij is retained either as long as one point is among the K nearest neighbors of the other, or only when the two points are among each other's K nearest neighbors:
  • KNN(x_i) is the set of K nearest neighbors of element x_i
  • KNN(x_j) is the set of K nearest neighbors of element x_j
  • σ is the variance
  • in the fully connected method, unlike the first two methods, the weights between all pairs of points are greater than 0, hence the name.
  • different kernel functions can be chosen to define the edge weights; commonly used ones are the polynomial kernel, the Gaussian kernel, and the sigmoid kernel.
  • for the Gaussian kernel function, i.e. the radial basis function (RBF), the similarity matrix and the adjacency matrix are the same:
  • the degree matrix D is constructed from the adjacency matrix.
  • the degree matrix is a diagonal matrix, only the main diagonal has values, and the values in other positions are 0.
  • the value on the diagonal is the sum of all the values in this row, namely:
  • d_i is the main-diagonal element in row i of the degree matrix D
  • n is the number of texts.
  • the Laplacian matrix is a symmetric matrix, resulting from the fact that both D and W are symmetric, and all of its eigenvalues are real:
  • L is the Laplace matrix
  • D is the degree matrix
  • W is the adjacency matrix
  • calculate the eigenvalues and eigenvectors from the Laplacian matrix: first solve the characteristic polynomial of the Laplacian matrix to obtain the eigenvalues, then solve for the eigenvectors from the eigenvalues, and then use the number of clusters (m) to count the eigenvalues that satisfy a condition (for example, the eigenvalue is less than (1-1/m)*0.95).
  • this count is used as the number of dimensions for dimensionality reduction, and the feature matrix of the document set to be clustered is obtained through dimensionality reduction.
  • the feature matrix is clustered by kmeans
  • the traditional classical clustering algorithm kmeans is used to cluster the feature matrix.
  • spectral clustering needs only the similarity matrix between texts, so it handles sparse data effectively, which is hard to achieve with kmeans applied directly;
  • spectral clustering uses dimensionality reduction, so it works better than direct kmeans on high-dimensional data. If the number of categories is passed in directly, the sixth step below can be skipped once kmeans clustering completes, and the seventh step extracts the category keywords directly to complete the clustering task. If the number of categories is not passed in, the sixth step is needed to find a number of clusters that gives a good result, after which keywords are extracted to complete the clustering task.
  • by adjusting the cluster-number parameter and returning to step 3, clustering results are obtained again; the histograms of the clustering results are evaluated, and the number of clusters whose histogram shows the best effect is taken as the number of categories for this clustering task.
  • the category keywords are extracted by the TF-IDF algorithm, and the content described in this category can be roughly judged according to the category keywords. Keywords in this category are keywords extracted based on the TF-IDF values calculated for several categories under this clustering task, and have nothing to do with text data other than this task.
  • the method of the present application and the kmeans and DBSCAN algorithms are respectively used to perform clustering processing on four types of data, wherein the four types of data are:
  • Data 4 (network download data set, used for classification model training data set):
  • This embodiment uses the above four kinds of data, combined with kmeans, DBSCAN algorithm and the method proposed in this application to test to obtain the precision rate, recall rate and F1 value.
  • the three test indicators are explained as follows. According to the confusion matrix, for a binary classification problem, combining the predicted results with the actual results gives the following four cases:
  • TP, FP, FN, TN can be understood as:
  • TP: the prediction is 1 and the actual value is 1; the prediction is correct.
  • TN: the prediction is 0 and the actual value is 0; the prediction is correct.
  • Precision: defined with respect to the prediction results, it is the proportion of samples predicted positive that are actually positive.
  • the expression is:
  • the F1 score expression is:
  • this embodiment refines the test data for data 4: category data with overlapping text is removed directly, and only 5 of the 9 categories (Art, Economy, Politics, Space, Sports) are used, with 800 items in each category and 4000 text items in total. A second test was carried out, and the results are shown in Table 11:
  • this application improves on the basis of the original spectral clustering.
  • first, clustering can be performed without specifying the number of clusters; second, the number of dimensions kept after dimensionality reduction depends on the number of smaller eigenvalues; third, category keywords can be extracted after clustering is completed.
  • the main process is: after constructing the text similarity matrix, calculate the adjacency matrix (W), degree matrix (D), and Laplacian matrix (L) while adjusting the cluster-number parameter, then calculate the eigenvalues and eigenvectors; count the eigenvalues that satisfy the condition (k of them), reduce the eigenvector dimension to k, and construct an eigenvector matrix; cluster the eigenvector matrix with a classical clustering algorithm (such as kmeans) and evaluate the clustering effect; select the number of clusters with the better clustering effect, so that the result still meets requirements when no number of clusters is input, while retaining the original spectral clustering approach in which the number of clusters can be specified.
  • the text clustering method not only helps users cluster unknown data sets but also lets users cluster text when the number of categories is known, and it outputs category keywords so that users can judge the subject content of each category from the keywords. In testing, the clustering effect of the present application also showed some improvement in precision and recall over traditional clustering algorithms.
  • the method proposed in this application can cluster a document set whether the number of categories is unknown or known. It suits customers who want to classify unlabeled text sets and extract the keywords of each category. It can also be extended to clustering sensitive document sets of unknown categories: the keywords of the resulting labeled sensitive documents are then used for document classification, so that known sensitive documents help judge whether an unknown document is sensitive and which category it belongs to, and a response can be made according to the sensitive category determined.
  • the present application improves the spectral clustering algorithm by adding a process of adjusting the clustering parameters, so that candidate numbers of categories are supplied automatically; by evaluating the corresponding adjusted clustering results, the optimal clustering result is selected and the corresponding number of categories is determined. Document sets with an unknown number of categories can thus be clustered: users only need to provide the document set data, and the method proposed in this application completes the work of distinguishing categories.
  • the present application combines the clustering results with the extracted keywords and uses the TF-IDF algorithm to extract the category keywords corresponding to the clustering results, so that users can view the category keywords of each category of text directly and learn the subject content of the text without opening the files.
  • the present application screens the eigenvalues based on the number of categories, and uses the number of screened eigenvalues as the number of dimensions for dimensionality reduction, so that the feature matrix corresponding to the set of documents to be clustered can be obtained by dimensionality reduction processing, which can greatly reduce the cost of subsequent clustering processing.
  • the present application uses keywords extracted from the document set to be clustered to construct a text similarity matrix, which can effectively cluster sparse data.
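The precision, recall, and F1 indicators named in the test discussion above follow the standard confusion-matrix definitions. The sketch below shows them for a binary problem; the function name `precision_recall_f1` and the choice to return 0.0 when a denominator is zero are illustrative assumptions, not part of the application.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: predicted 1, actual 1; fp: predicted 1, actual 0; fn: predicted 0, actual 1.
    Returning 0.0 for an empty denominator is an assumption made for this sketch.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Example: 8 true positives, 2 false positives, 4 false negatives.
p, r, f = precision_recall_f1(8, 2, 4)
```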

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text clustering method, comprising: performing word segmentation, stop word removal, and keyword extraction processing on a document set to be clustered (S1); creating a text similarity matrix, an adjacency matrix, a degree matrix, and a Laplacian matrix; calculating eigenvalues and eigenvectors of the Laplacian matrix to obtain an eigenmatrix; using a clustering method to cluster the eigenmatrix to obtain a clustering result (S6); if the number of categories is known, then setting the clustering result as the final clustering result; if the number of categories is unknown, then obtaining multiple clustering results by executing the following operations multiple times and evaluating the multiple clustering results to select a final clustering result: adjusting the clustering parameters, and executing the operations of constructing the adjacency matrix, degree matrix, and Laplacian matrix, calculating eigenvalues and eigenvectors of the Laplacian matrix to obtain an eigenmatrix, and clustering the eigenmatrix to obtain a clustering result; combining the final clustering result and the extracted keywords, and extracting a category keyword on the basis of a TF-IDF algorithm; and outputting the final clustering result and the category keyword (S10).

Description

Text Clustering Method
This application claims priority to Chinese Patent Application No. 202011464923.4, filed with the China Patent Office on December 14, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of text analysis, for example, to a text clustering method.
Background
Text clustering rests mainly on the well-known clustering assumption: documents of the same class are more similar to each other, while documents of different classes are less similar. As an unsupervised machine learning method, clustering requires no training process and no manual labeling of documents in advance, so it is flexible and highly automated; it has become an important means of effectively organizing, summarizing, and navigating text information, and is attracting attention from more and more researchers.
There are three main families of text clustering methods: 1. partitioning methods; 2. density-based methods; 3. hierarchical methods. Commonly used clustering algorithms include kmeans and kmeans++, which are partitioning methods; Density-Based Spatial Clustering of Applications with Noise (DBSCAN), which is a density-based method; and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), which is a hierarchical method. The spectral clustering algorithm is built on spectral graph theory. Compared with traditional clustering algorithms, it has the advantage of being able to cluster on a sample space of arbitrary shape and of converging to the global optimal solution. The spectral clustering algorithm regards each object in the data set as a vertex V of a graph and quantifies the similarity between vertices as the weight of the edge E connecting them, which yields an undirected weighted graph G(V, E) based on similarity, so the clustering problem can be transformed into a graph partitioning problem. The optimal partitioning criterion based on graph theory is to maximize the similarity within each subgraph and minimize the similarity between subgraphs. The spectral clustering algorithm has different implementations, but they can all be summarized in three main steps: 1) construct the similarity matrix S representing the object set; 2) calculate the degree matrix and the Laplacian matrix, and construct the eigenvector space; 3) use kmeans or another classical clustering algorithm to cluster the eigenvectors in the eigenvector space.
The above clustering methods can only perform text clustering when the number of categories is known and cannot provide category keywords after clustering, so users cannot directly learn the subject content of a category from keywords. In addition, most clustering results produced by these methods suffer from low precision and recall; that is, the accuracy of the clustering results is low.
Summary
The present application provides a text clustering method that can cluster text whether the number of categories is known or unknown, and can at the same time output keywords corresponding to each category.
A text clustering method is provided, including:
performing word segmentation, stop word removal, and keyword extraction, in that order, on the document set to be clustered;
creating a text similarity matrix according to the extracted keywords;
constructing an adjacency matrix based on the text similarity matrix, and constructing a degree matrix based on the adjacency matrix;
constructing a Laplacian matrix by combining the adjacency matrix and the degree matrix;
calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain a feature matrix corresponding to the document set to be clustered;
clustering the feature matrix using a clustering method to obtain a clustering result;
when the number of categories is known, using the obtained clustering result as the final clustering result; when the number of categories is unknown, obtaining multiple clustering results by performing the following operation multiple times, evaluating the multiple clustering results, and selecting the final clustering result according to the evaluation: adjusting the clustering parameters, then returning to constructing the adjacency matrix, the degree matrix, and the Laplacian matrix, calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain a feature matrix, and clustering the feature matrix to obtain a clustering result;
combining the final clustering result with the extracted keywords, and extracting category keywords based on the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm;
outputting the final clustering result and the category keywords.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a text clustering method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a text clustering process provided by an embodiment of the present application.
Detailed Description
The present application is described below with reference to the accompanying drawings and specific embodiments.
Embodiment
As shown in FIG. 1, a text clustering method based on an improved spectral clustering algorithm includes the following steps:
S1. Perform word segmentation, stop word removal, and keyword extraction, in that order, on the document set to be clustered.
S2. Create a text similarity matrix according to the extracted keywords.
S3. Construct an adjacency matrix based on the text similarity matrix, and construct a degree matrix based on the adjacency matrix.
S4. Construct a Laplacian matrix by combining the adjacency matrix and the degree matrix.
S5. Calculate the eigenvalues and eigenvectors of the Laplacian matrix to obtain the feature matrix corresponding to the document set to be clustered.
S6. Cluster the feature matrix using a classical clustering method to obtain the corresponding clustering result.
S7. If the number of categories is known, perform step S9.
If the number of categories is unknown, perform step S8.
S8. Adjust the clustering parameters in turn to determine the corresponding number of categories, then return to steps S3 to S6 to obtain multiple adjusted clustering results, evaluate them, and select the optimal clustering result.
S9. Combine the clustering result obtained in step S6 or step S8 with the keywords extracted in step S1, and extract category keywords based on the TF-IDF algorithm.
S10. Output the clustering result and the corresponding category keywords.
When the above method is applied in practice, the working process is as shown in FIG. 2. From the user inputting the document set to the output of the text clustering result, it mainly includes the following stages:
1. Perform word segmentation, stop word removal, and keyword extraction on the input document set
Text keywords need to be extracted before clustering for two reasons: first, to reduce the dimension of the vectors used to create the text similarity matrix; second, so that category keywords can be extracted from the text keywords after clustering is completed. When extracting keywords, part-of-speech filtering mainly retains keywords that are nouns, verbs, gerunds, person names, place names, and organization names, to improve the accuracy of text similarity.
2. Create a text similarity matrix from the extracted keywords
Common methods for calculating text similarity include cosine similarity, Euclidean distance, and Jaccard distance. This application calculates text similarity by constructing a bag of words: the TF-IDF value of each keyword in a text is calculated and stored in a bag-like structure; the method checks whether two texts share keywords and then uses the TF-IDF values in the bag to calculate the similarity between the texts. This is similar to the cosine distance calculation, but it reduces the amount of computation and so improves efficiency. The resulting text similarity matrix is an N*N matrix in which each value is the similarity between two texts.
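A minimal sketch of the bag-of-words similarity described above, under stated assumptions: the text does not give the exact TF-IDF weighting or normalization, so the smoothed IDF `log((1 + N) / (1 + df)) + 1` and the cosine-style normalization are illustrative choices, and the names `tfidf_bags`, `similarity`, and `similarity_matrix` are hypothetical. Only keywords shared by both texts enter the dot product, which is where the saving in computation comes from.

```python
import math
from collections import Counter


def tfidf_bags(docs):
    """docs: list of keyword lists. Returns one {keyword: tf-idf} dict ("bag") per text."""
    n = len(docs)
    df = Counter(kw for doc in docs for kw in set(doc))  # document frequency per keyword
    bags = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        # Smoothed IDF is an assumption; the source does not specify the variant.
        bags.append({kw: (c / total) * (math.log((1 + n) / (1 + df[kw])) + 1.0)
                     for kw, c in tf.items()})
    return bags


def similarity(bag_a, bag_b):
    """Cosine-style similarity that only visits keywords present in both bags."""
    shared = bag_a.keys() & bag_b.keys()
    if not shared:
        return 0.0
    dot = sum(bag_a[k] * bag_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
    return dot / (norm_a * norm_b)


def similarity_matrix(docs):
    """Build the symmetric N*N text similarity matrix as a list of lists."""
    bags = tfidf_bags(docs)
    n = len(bags)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        S[i][i] = 1.0
        for j in range(i + 1, n):
            S[i][j] = S[j][i] = similarity(bags[i], bags[j])
    return S


# Three tiny "documents" already reduced to keywords.
S = similarity_matrix([["cat", "dog"], ["cat", "dog"], ["stock", "bond"]])
```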
3. Calculate the adjacency matrix (W), degree matrix (D), and Laplacian matrix (L)
Adjacency matrix (W): there are three kinds of methods for constructing the adjacency matrix: the ε-neighborhood method, the K-nearest-neighbor method, and the fully connected method. The ε-neighborhood method sets a distance threshold ε and then measures the distance between any two points with the Euclidean distance. That is, the Euclidean distance for the text similarity matrix is:
[Equation image PCTCN2021071166-appb-000001, not reproduced]
where s_ij is the Euclidean distance between element x_i and element x_j in the text similarity matrix. According to the relationship between s_ij and ε, the adjacency matrix W is defined as follows:
[Corrected 07.04.2022 under Rule 91]
[Equation image WO-DOC-FIGURE-43, not reproduced]
where w_ij is the element in row i, column j of the adjacency matrix W.
In the K-nearest-neighbor method, s_ij is retained either as long as one point is among the K nearest neighbors of the other, or only when the two points are among each other's K nearest neighbors:
[Equation image PCTCN2021071166-appb-000003, not reproduced]
where KNN(x_i) is the set of K nearest neighbors of element x_i, KNN(x_j) is the set of K nearest neighbors of element x_j, and σ is the variance.
In the fully connected method, unlike the first two methods, the weights between all pairs of points are greater than 0, hence the name. Different kernel functions can be chosen to define the edge weights; commonly used ones are the polynomial kernel, the Gaussian kernel, and the sigmoid kernel. For the Gaussian kernel function, i.e. the radial basis function (RBF), the similarity matrix and the adjacency matrix are the same:
[Equation image PCTCN2021071166-appb-000004, not reproduced]
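The equation images for the adjacency-matrix constructions are not reproduced in this text. For orientation only, the forms commonly given for these three constructions in the spectral clustering literature are sketched below. They are an assumption about what the figures contain, not a transcription; in particular, the constant weight ε in the first case and the use of the Gaussian weight inside the K-nearest-neighbor case are conventions from that literature.

```latex
% epsilon-neighborhood method, with s_{ij} the Euclidean distance between x_i and x_j
w_{ij} =
\begin{cases}
0, & s_{ij} > \varepsilon \\
\varepsilon, & s_{ij} \le \varepsilon
\end{cases}

% K-nearest-neighbor method, "either point is a neighbor of the other" variant
w_{ij} = w_{ji} =
\begin{cases}
0, & x_i \notin \mathrm{KNN}(x_j) \text{ and } x_j \notin \mathrm{KNN}(x_i) \\
\exp\!\left(-\dfrac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right), & x_i \in \mathrm{KNN}(x_j) \text{ or } x_j \in \mathrm{KNN}(x_i)
\end{cases}

% fully connected method with the Gaussian (RBF) kernel
w_{ij} = \exp\!\left(-\dfrac{\lVert x_i - x_j \rVert_2^2}{2\sigma^2}\right)
```

The mutual variant of the K-nearest-neighbor method keeps the Gaussian weight only when both membership conditions hold at once.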
度矩阵D是由邻接矩阵构建,度矩阵是一个对角矩阵,只有主对角线上有值,其他位置的值都为0。对角线上的值为本行所有值得和,即:The degree matrix D is constructed from the adjacency matrix. The degree matrix is a diagonal matrix, only the main diagonal has values, and the values in other positions are 0. The value on the diagonal is the sum of all the values in this row, namely:
Figure PCTCN2021071166-appb-000005
Figure PCTCN2021071166-appb-000005
其中,d i为度矩阵D中第i行位于主对角线上的元素,n为文本个数。 Among them, d i is the element located on the main diagonal in the i-th row of the degree matrix D, and n is the number of texts.
The Laplacian matrix is symmetric, because D and W are both symmetric, and all of its eigenvalues are real:
L = D - W
where L is the Laplacian matrix, D is the degree matrix, and W is the adjacency matrix.
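As a small worked example, the degree matrix (each diagonal entry d_i is the sum of row i of W) and the Laplacian L = D - W can be computed as follows. The helper names and the toy adjacency matrix are illustrative only.

```python
def degree_matrix(w):
    """Diagonal matrix whose i-th diagonal entry is the sum of row i of w."""
    n = len(w)
    return [[sum(w[i]) if i == j else 0.0 for j in range(n)] for i in range(n)]


def laplacian(w):
    """Unnormalized graph Laplacian L = D - W."""
    d = degree_matrix(w)
    n = len(w)
    return [[d[i][j] - w[i][j] for j in range(n)] for i in range(n)]


# Symmetric toy adjacency matrix for three texts.
W = [[0.0, 1.0, 0.5],
     [1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0]]
D = degree_matrix(W)
L = laplacian(W)
```

Every row of L sums to zero, which is why the constant vector is always an eigenvector of L with eigenvalue 0.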
4. Compute the eigenvalues, eigenvectors and feature matrix
The eigenvalues and eigenvectors are computed from the Laplacian matrix: the eigenvalues are obtained by solving the characteristic polynomial of the Laplacian matrix, and the eigenvectors are then solved from the eigenvalues. Next, using the number of clusters m, the eigenvalues that satisfy a condition are counted (for example, eigenvalues smaller than (1-1/m)*0.95). This count is used as the number of dimensions for dimensionality reduction, and the reduction yields the feature matrix of the document set to be clustered.
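A minimal sketch of this step, assuming NumPy is available. Instead of solving the characteristic polynomial by hand it calls the numerical solver `numpy.linalg.eigh`, which is a substitution made here for brevity. The function name `spectral_features` and the `factor` parameter are illustrative; the threshold (1-1/m)*0.95 is the example condition given in the text, and m is assumed to be at least 2.

```python
import numpy as np


def spectral_features(laplacian, m, factor=0.95):
    """Return (k, features) for a symmetric Laplacian.

    k is the number of eigenvalues below (1 - 1/m) * factor, and features
    holds the matching eigenvectors as columns, one row per text.
    """
    lap = np.asarray(laplacian, dtype=float)
    # eigh targets symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(lap)
    threshold = (1.0 - 1.0 / m) * factor
    selected = eigenvalues < threshold
    k = int(selected.sum())
    return k, eigenvectors[:, selected]


# Two disconnected pairs of texts: L = D - W has eigenvalues 0, 0, 2, 2.
W = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W
k, features = spectral_features(L, m=2)
```

With m = 2 the threshold is 0.475, so the two zero eigenvalues are kept and each text is described by a 2-dimensional row of `features`; texts in the same connected pair receive identical rows.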
5. This embodiment clusters the feature matrix with kmeans
After the feature matrix is built, the classical clustering algorithm kmeans is used to cluster it. Spectral clustering needs only the similarity matrix between texts, which makes it effective on sparse data, something that is hard to achieve with kmeans applied directly; and because spectral clustering reduces dimensionality, it handles high-dimensional data better than kmeans applied directly. If the number of cluster categories is passed in directly, then once the kmeans clustering is complete the sixth step below can be skipped and the seventh step extracts the category keywords directly, completing the clustering task. If the number of cluster categories is not passed in, the sixth step is needed to find a cluster count that gives a good result, after which the keywords are extracted and the clustering task is complete.
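The application names kmeans without giving implementation details. The following is a minimal Lloyd-style sketch in plain Python; passing the starting centers in explicitly is an assumption made here to keep the example deterministic, and the toy points stand in for rows of the feature matrix.

```python
import math


def kmeans(points, centers, max_iter=100):
    """Plain Lloyd iteration; `centers` are the starting centers chosen by the caller."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each row goes to its nearest center.
        new_labels = [min(range(len(centers)),
                          key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move either
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Toy rows standing in for the feature matrix (one row per text).
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels, centers = kmeans(points, centers=[points[0], points[1]])
```

Even with both starting centers placed in the same group, the iteration separates the two groups after a few rounds.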
6. Evaluate the clustering result
The cluster-count parameter is adjusted and the procedure returns to step 3 to obtain a new clustering result. The histogram of each clustering result is evaluated, and the cluster count corresponding to the best histogram is taken as the number of categories for this clustering task.
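The loop described in this step can be sketched as below. The application does not state how a histogram is judged to be best, so `balance_score` (smallest cluster size divided by largest) is a purely hypothetical stand-in, and `cluster_fn` is a placeholder for the whole pipeline from step 3 through kmeans.

```python
from collections import Counter


def cluster_histogram(labels):
    """Number of texts in each cluster, largest first."""
    return sorted(Counter(labels).values(), reverse=True)


def balance_score(labels):
    """Hypothetical criterion: smallest cluster size divided by largest."""
    hist = cluster_histogram(labels)
    return hist[-1] / hist[0]


def choose_cluster_count(candidates, cluster_fn, score_fn=balance_score):
    """Run cluster_fn(m) for each candidate m and keep the best-scoring result."""
    best = None
    for m in candidates:
        labels = cluster_fn(m)
        score = score_fn(labels)
        if best is None or score > best[0]:
            best = (score, m, labels)
    return best[1], best[2]


# Precomputed toy labelings stand in for rerunning the pipeline with each m.
results = {
    2: [0, 0, 0, 0, 0, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 0, 1, 2, 3],
}
best_m, best_labels = choose_cluster_count([2, 3, 4], results.get)
```

Any other scoring function with the same signature can be passed as `score_fn`; the structure of the loop is the part that follows the text.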
7. Extract category keywords
Category keywords are extracted with the TF-IDF algorithm from the clustering result and the text keywords; the content of a category can be roughly judged from its category keywords. These category keywords are extracted from TF-IDF values computed over the categories of this clustering task only, and are independent of any text data outside this task.
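One way to read this step is to treat each cluster as a single document and rank its keywords by TF-IDF across the clusters of the current task. The sketch below follows that reading; the exact TF-IDF variant (term frequency normalized by cluster size, idf = log(number of clusters / clusters containing the word)) and the function name are assumptions, not details from the application.

```python
import math
from collections import Counter


def category_keywords(doc_keywords, labels, top_n=3):
    """Rank keywords per cluster by TF-IDF, treating each cluster as one document."""
    per_cluster = {}
    for words, lab in zip(doc_keywords, labels):
        per_cluster.setdefault(lab, Counter()).update(words)
    n_clusters = len(per_cluster)
    # Number of clusters in which each keyword occurs.
    cluster_freq = Counter(w for counts in per_cluster.values() for w in counts)
    result = {}
    for lab, counts in per_cluster.items():
        total = sum(counts.values())
        scores = {w: (c / total) * math.log(n_clusters / cluster_freq[w])
                  for w, c in counts.items()}
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[lab] = ranked[:top_n]
    return result


# Keywords already extracted per text, plus the cluster label of each text.
docs = [["match", "goal", "team"],
        ["team", "goal", "league"],
        ["stock", "market", "team"],
        ["market", "fund", "stock"]]
labels = [0, 0, 1, 1]
keywords = category_keywords(docs, labels, top_n=2)
```

The word "team" occurs in both clusters, so its idf is zero and it never ranks as a category keyword, which matches the idea that the keywords depend only on the categories of the current task.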
8. The whole process ends; return the category texts and the category keywords
In this embodiment, the method of the present application and the kmeans and DBSCAN algorithms are each used to cluster four data sets, which are as follows:
Data 1:
Table 1
Figure PCTCN2021071166-appb-000006
Figure PCTCN2021071166-appb-000007
Data 2:
Table 2
Figure PCTCN2021071166-appb-000008
Data 3 (data downloaded from the web, 14 news categories in total):
Table 3
Category          Quantity (articles)
Finance           200
Lottery           200
Real estate       200
Stocks            200
Furniture         200
Education         200
Technology        200
Society           200
Fashion           200
Current affairs   200
Sports            200
Horoscope         200
Entertainment     200
Data 4 (data set downloaded from the web, used as a training data set for classification models):
Table 4
Category      Quantity (articles)
Art           800
Economy       800
Politics      800
Space         800
Sports        800
Agriculture   300
Computer      300
Environment   300
History       300
This embodiment uses the four data sets above to test kmeans, DBSCAN and the method proposed in the present application, measuring precision, recall and the F1 value. These three metrics are explained first. In terms of the confusion matrix, for a binary classification problem, pairing each predicted result with the actual result gives the following four cases:
Table 5
Figure PCTCN2021071166-appb-000009
Figure PCTCN2021071166-appb-000010
Because the digits 1 and 0 are inconvenient to read, T (True) is used for a correct prediction, F (False) for an incorrect one, P (Positive) for 1 and N (Negative) for 0. The predicted result (P|N) is considered first, and it is then compared with the actual result to give the judgment (T|F). Following this logic, the cases are relabeled as:
Table 6
Figure PCTCN2021071166-appb-000011
TP, FP, FN and TN can be understood as follows:
TP: predicted 1, actual 1; the prediction is correct.
FP: predicted 1, actual 0; the prediction is wrong.
FN: predicted 0, actual 1; the prediction is wrong.
TN: predicted 0, actual 0; the prediction is correct.
Accuracy: the percentage of all samples that are predicted correctly, expressed as:
Figure PCTCN2021071166-appb-000012
Precision: defined with respect to the predictions; it is the probability that a sample predicted as positive is actually positive, expressed as:
Figure PCTCN2021071166-appb-000013
Recall: defined with respect to the original samples; it is the probability that a sample that is actually positive is predicted as positive, expressed as:
Figure PCTCN2021071166-appb-000014
The F1 score is expressed as:
Figure PCTCN2021071166-appb-000015
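The formula images are not reproduced in this text. The sketch below assumes the standard definitions that match the verbal descriptions above: accuracy = (TP + TN) / (TP + FP + FN + TN), precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 as the harmonic mean of precision and recall. The confusion-matrix counts are made-up illustration values.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all samples that are predicted correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Share of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that are predicted positive."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r)


# Made-up confusion-matrix counts for illustration.
tp, fp, fn, tn = 40, 10, 5, 45
acc = accuracy(tp, fp, fn, tn)
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
```

The same harmonic-mean relation can be checked against reported figures: a precision of 88.6 and a recall of 86.5 give an F1 value of about 87.5.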
For data 1, the test was run with the number of categories known and passed in; the results are shown in Table 7:
Table 7
Algorithm     Average precision (%)   Average recall (%)   Average F1 value (%)
Kmeans        88.6                    86.5                 87.5
DBSCAN        61.8                    43.6                 51.1
This method   93.2                    90.4                 91.8
Table 7 shows that, when the different clustering algorithms are given a fixed number of clusters, the precision, recall and F1 value of this method are all better than those of the kmeans and DBSCAN algorithms.
For data 2, the method proposed in the present application was tested both with and without the number of categories specified; the results are shown in Table 8:
Table 8
Cluster count specified?      Clustering result   Precision (%)   Recall (%)   F1 value (%)
Cluster count specified (4)   4 categories        96.2            93.7         94.9
Cluster count not specified   4 categories        96.2            93.7         94.9
Table 8 shows that the test document set is known to contain 4 categories, and that whether or not the cluster count of 4 is specified, this method produces 4 categories with good clustering quality.
For data 3, this method was tested with the number of clusters specified:
Table 9
Clustering result   Precision (%)   Recall (%)   F1 value (%)
14 categories       93.2            93           93
Table 9 shows that when a document set with many categories and many documents is tested with a fixed number of categories passed in, the average precision, average recall and average F1 value of this method all exceed 90%, a good result.
For data 4, two tests were carried out. The first used all 9 categories with 300 texts per category, 2700 texts in total; the results, compared with kmeans and DBSCAN, are shown in Table 10:
Table 10
Algorithm     Average precision (%)   Average recall (%)   Average F1 value (%)
Kmeans        65.8                    63.9                 64.8
DBSCAN        52.3                    49.5                 50.86
This method   68.3                    66.9                 67.6
The test data show that this method does not stand out clearly against kmeans: the overall values are close, and none of the algorithms scores particularly high. Spot checks of the texts and analysis of the clustering results showed that some texts from different categories in this data set are quite similar and overlap. For example, the Environment and Agriculture categories share many overlapping texts and yield similar keywords, so a category is easily misjudged from those keywords.
Based on this analysis, the test data for data 4 were refined in this embodiment: the categories with overlapping texts were removed outright, leaving 5 of the 9 categories (Art, Economy, Politics, Space, Sports) with 800 texts each, 4000 texts in total. A second test was run, with the results shown in Table 11:
Table 11
Algorithm     Average precision (%)   Average recall (%)   Average F1 value (%)
Kmeans        83.3                    79.6                 81.4
DBSCAN        63.2                    65.8                 64.5
This method   89.61                   89.02                89.31
Tables 10 and 11 show that the overall results of every algorithm improve in the second test. Comparing the two tests on data 4, this method outperforms both kmeans and DBSCAN whether or not the data set is refined.
In summary, the present application improves on the original spectral clustering in three respects: first, clustering can be performed without specifying the number of clusters; second, the number of dimensions to which the eigenvectors are reduced is not the number of clusters passed in, but depends on the number of small eigenvalues; third, category keywords can be extracted after clustering is complete. The main flow is as follows. After the text similarity matrix is built, the adjacency matrix (W), degree matrix (D) and Laplacian matrix (L) are computed while the cluster-count parameter is adjusted. The eigenvalues and eigenvectors are then computed, the number k of eigenvalues that satisfy the condition is determined, and the eigenvectors are reduced to k dimensions and assembled into an eigenvector matrix. Another classical clustering algorithm (for example kmeans) clusters the eigenvector matrix, and the clustering results are evaluated to select a cluster count with a good clustering result. In this way the clustering result can still meet requirements when no cluster count is passed in, while the option of the original spectral clustering method to specify the cluster count is retained. This helps users cluster unknown data sets and also lets them cluster text when the number of categories is known. While the document set is clustered, category keywords computed from the clustering result are extracted, so that users can judge the subject of each category from its keywords. In testing, the clustering of the present application also showed some improvement in precision and recall over traditional clustering algorithms.
In practice, the proposed method can cluster a document set whether the number of categories is unknown or known. It suits cases where a customer wants to divide an unlabeled text set into categories and extract the keywords of each category. It can be extended to cluster a set of sensitive documents of unknown categories and then classify documents using the keywords of those labeled sensitive documents, so that known sensitive files are used to judge whether an unknown document is sensitive and which category it belongs to, and a corresponding response is made according to the sensitive category determined.
Text clustering can be applied not only to document sets with a known number of categories but also to those with an unknown number; a user who has the document set data can complete the category division of the documents. The method is effective on sparse data and performs better than traditional clustering algorithms; it applies dimensionality reduction to high-dimensional data, so its clustering complexity is also better than that of traditional clustering algorithms; and the precision and recall of its clustering results are better than those of traditional algorithms. It is widely applicable: it handles document sets of both unknown and known categories, and it can cluster document sets from a specific domain (for example, known sensitive or confidential files) as well as ordinary document sets. On top of the clustering, category keywords can be viewed, giving the general content of a category's texts without opening each file. The category keywords can also be used to build a text classification model for text classification.
By providing a procedure for adjusting the clustering parameter, the present application improves the spectral clustering algorithm so that it determines the number of categories on its own; by evaluating the clustering result obtained for each adjustment, the best clustering result can be selected and the corresponding number of categories determined. This makes it possible to cluster document sets with an unknown number of categories: the user only needs to supply the document set data, and the proposed method completes the category division of the document set. Combining the clustering result with the extracted keywords, the TF-IDF algorithm extracts the category keywords corresponding to the clustering result, so that the user can see the category keywords of each category of texts directly and learn the subject of the texts without opening the files. The eigenvalues are screened according to the number of categories, and the number of eigenvalues retained is used as the number of dimensions for dimensionality reduction; the feature matrix of the document set to be clustered is obtained from this reduction, which greatly lowers the complexity of the subsequent clustering. In addition, the text similarity matrix is built from keywords extracted from the document set to be clustered, which allows sparse data to be clustered effectively.

Claims (10)

  1. A text clustering method, comprising:
    performing word segmentation, stop-word removal and keyword extraction, in that order, on a document set to be clustered;
    creating a text similarity matrix according to the extracted keywords;
    constructing an adjacency matrix based on the text similarity matrix, and constructing a degree matrix based on the adjacency matrix;
    constructing a Laplacian matrix by combining the adjacency matrix and the degree matrix;
    calculating eigenvalues and eigenvectors of the Laplacian matrix to obtain a feature matrix corresponding to the document set to be clustered;
    clustering the feature matrix by a clustering method to obtain a clustering result;
    when the number of cluster categories is known, taking the obtained clustering result as a final clustering result; when the number of cluster categories is unknown, obtaining a plurality of clustering results by performing the following operation multiple times, evaluating the plurality of clustering results, and selecting the final clustering result according to the evaluation: adjusting a clustering parameter, and returning to the operations of constructing the adjacency matrix, the degree matrix and the Laplacian matrix, calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain the feature matrix, and clustering the feature matrix to obtain a clustering result;
    extracting category keywords based on the term frequency-inverse document frequency (TF-IDF) algorithm, combining the final clustering result and the extracted keywords;
    outputting the final clustering result and the category keywords.
  2. The method according to claim 1, wherein the parts of speech of the extracted keywords include nouns, verbs, gerunds, personal names, place names and organization names.
  3. The method according to claim 1, wherein creating the text similarity matrix according to the extracted keywords comprises:
    calculating the TF-IDF value of each keyword in all texts of the document set to be clustered, and putting the TF-IDF value of each keyword in all the texts into a bag of words;
    calculating the similarity between different texts according to the TF-IDF values stored in the bag of words, and constructing the text similarity matrix from the similarities between different texts.
  4. The method according to claim 3, wherein the text similarity matrix is an N*N matrix, and each element of the text similarity matrix is the similarity between different texts.
  5. The method according to claim 4, wherein constructing the adjacency matrix based on the text similarity matrix and constructing the degree matrix based on the adjacency matrix comprises:
    constructing the adjacency matrix W from the text similarity matrix using the ε-neighborhood method, the K-nearest-neighbor method or the fully connected method;
    constructing a diagonal matrix from the elements of the adjacency matrix W, and taking the diagonal matrix as the degree matrix D.
  6. [Correction 07.04.2022 under Rule 91]
    The method according to claim 5, wherein,
    when the adjacency matrix W is constructed using the ε-neighborhood method, the adjacency matrix W is:
    Figure WO-DOC-FIGURE-CL6

    Figure PCTCN2021071166-appb-100002

    where w_ij is the element in row i, column j of the adjacency matrix W, s_ij is the Euclidean distance between element x_i and element x_j in the text similarity matrix, and ε is a set distance threshold;
    when the adjacency matrix W is constructed using the K-nearest-neighbor method, the adjacency matrix W is:
    Figure PCTCN2021071166-appb-100003

    where w_ij is the element in row i, column j of the adjacency matrix W, element x_i and element x_j are elements of the text similarity matrix, KNN(x_i) is the set of K nearest neighbors of element x_i, KNN(x_j) is the set of K nearest neighbors of element x_j, and σ is the variance;
    when the adjacency matrix W is constructed using the fully connected method, the adjacency matrix W is:
    Figure PCTCN2021071166-appb-100004

    where w_ij is the element in row i, column j of the adjacency matrix W, element x_i and element x_j are elements of the text similarity matrix, and σ is the variance.
  7. The method according to claim 6, wherein the degree matrix D is:
    Figure PCTCN2021071166-appb-100005
    where d_i is the element on the main diagonal in row i of the degree matrix D, and n is the number of all the texts.
  8. The method according to claim 7, wherein the Laplacian matrix is:
    L = D - W;
    where L is the Laplacian matrix.
  9. The method according to claim 1, wherein calculating the eigenvalues and eigenvectors of the Laplacian matrix to obtain the feature matrix corresponding to the document set to be clustered comprises:
    solving the characteristic polynomial of the Laplacian matrix to obtain the eigenvalues of the Laplacian matrix;
    solving for the eigenvectors of the Laplacian matrix according to its eigenvalues;
    according to the number of cluster categories, determining the number k of eigenvalues that satisfy a preset condition, and reducing the eigenvectors of the Laplacian matrix to k dimensions to construct the dimension-reduced feature matrix, wherein the preset condition is that the eigenvalue is less than (1-1/m)*0.95, and m is the number of cluster categories.
  10. The method according to claim 1, wherein evaluating the plurality of clustering results comprises:
    evaluating the plurality of clustering results by computing a histogram.
PCT/CN2021/071166 2020-12-14 2021-01-12 Text clustering method WO2022126810A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011464923.4A CN112464638B (en) 2020-12-14 2020-12-14 Text clustering method based on improved spectral clustering algorithm
CN202011464923.4 2020-12-14

Publications (1)

Publication Number Publication Date
WO2022126810A1 true WO2022126810A1 (en) 2022-06-23

Family

ID=74804038

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/071166 WO2022126810A1 (en) 2020-12-14 2021-01-12 Text clustering method

Country Status (2)

Country Link
CN (1) CN112464638B (en)
WO (1) WO2022126810A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186778A (en) * 2022-09-13 2022-10-14 福建省特种设备检验研究院 Text analysis-based hidden danger identification method and terminal for pressure-bearing special equipment
CN117891411A (en) * 2024-03-14 2024-04-16 济宁蜗牛软件科技有限公司 Optimized storage method for massive archive data

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011153B (en) * 2021-03-15 2022-03-29 平安科技(深圳)有限公司 Text correlation detection method, device, equipment and storage medium
CN113361605B (en) * 2021-06-07 2024-05-24 汇智数字科技控股(深圳)有限公司 Product similarity quantification method based on Amazon keywords
CN113554074A (en) * 2021-07-09 2021-10-26 浙江工贸职业技术学院 Image feature analysis method based on layered Laplace
CN114328922B (en) * 2021-12-28 2022-08-02 盐城工学院 Selective text clustering integration method based on spectrogram theory
CN114969348B (en) * 2022-07-27 2023-10-27 杭州电子科技大学 Electronic file hierarchical classification method and system based on inversion adjustment knowledge base
CN115841110B (en) * 2022-12-05 2023-08-11 武汉理工大学 Method and system for obtaining scientific knowledge discovery
CN115982633B (en) * 2023-03-21 2023-06-20 北京百度网讯科技有限公司 Target object classification method, device, electronic equipment and storage medium
CN116402554B (en) * 2023-06-07 2023-08-11 江西时刻互动科技股份有限公司 Advertisement click rate prediction method, system, computer and readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080243829A1 (en) * 2007-03-29 2008-10-02 Microsoft Corporation Spectral clustering using sequential shrinkage optimization
CN104462253A (en) * 2014-11-20 2015-03-25 武汉数为科技有限公司 Topic detection or tracking method for network text big data
CN107943856A (en) * 2017-11-07 2018-04-20 南京邮电大学 A kind of file classification method and system based on expansion marker samples
CN108132968A (en) * 2017-12-01 2018-06-08 西安交通大学 Network text is associated with the Weakly supervised learning method of Semantic unit with image
CN109960730A (en) * 2019-04-19 2019-07-02 广东工业大学 A kind of short text classification method, device and equipment based on feature extension

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514183B (en) * 2012-06-19 2017-04-12 北京大学 Information search method and system based on interactive document clustering
CN104778480A (en) * 2015-05-08 2015-07-15 江南大学 Hierarchical spectral clustering method based on local density and geodesic distance
CN105159998A (en) * 2015-09-08 2015-12-16 海南大学 Keyword calculation method based on document clustering
CN106991430A (en) * 2017-02-28 2017-07-28 浙江工业大学 A kind of cluster number based on point of proximity method automatically determines Spectral Clustering
CN107590218B (en) * 2017-09-01 2020-11-06 南京理工大学 Spark-based multi-feature combined Chinese text efficient clustering method
CN111401468B (en) * 2020-03-26 2023-03-24 上海海事大学 Weight self-updating multi-view spectral clustering method based on shared neighbor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NIU HAIYAN: "Research on Fuzzy Spectral Clustering Segmentation Algorithm and Apply It to Text Clustering", MASTER THESIS, TIANJIN POLYTECHNIC UNIVERSITY, CN, 15 March 2017 (2017-03-15), CN , XP055942873, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN112464638B (en) 2022-12-30
CN112464638A (en) 2021-03-09

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21904757

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 21904757

Country of ref document: EP

Kind code of ref document: A1