WO2021027086A1 - Method, device and storage medium for text clustering - Google Patents

Method, device and storage medium for text clustering

Info

Publication number
WO2021027086A1
WO2021027086A1 PCT/CN2019/115118
Authority
WO
WIPO (PCT)
Prior art keywords
text
sub
clustering
cluster
connected graph
Prior art date
Application number
PCT/CN2019/115118
Other languages
English (en)
French (fr)
Inventor
龚朝辉
陈汝龙
陈誉
段成阁
Original Assignee
苏州朗动网络科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州朗动网络科技有限公司 filed Critical 苏州朗动网络科技有限公司
Publication of WO2021027086A1 publication Critical patent/WO2021027086A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification

Definitions

  • The present invention relates to information processing technology, and in particular to a method, device and storage medium for text clustering.
  • Text is the main carrier of information. With the development of the Internet, browsing news text published promptly online has become an important means for people to obtain information.
  • Because the amount of news text on the Internet is now enormous, news text needs to be clustered using text clustering technology so that people can navigate and browse news quickly and conveniently.
  • Text clustering technology automatically divides a text set into multiple clusters, so that texts in the same cluster share a certain similarity while the similarity between texts in different clusters is kept as low as possible.
  • Commonly used clustering methods include K-means, hierarchical clustering, and the single-pass algorithm.
  • The single-pass algorithm is input-order dependent: the same clustering objects presented in different orders produce different clustering results.
  • Other clustering algorithms have similar drawbacks: K-means requires the number of categories to be specified, and hierarchical clustering involves selecting a level; different category counts or different levels lead to inconsistent clustering results.
  • The purpose of the present invention is to provide a method, device and storage medium for text clustering.
  • An embodiment of the present invention provides a method for text clustering, the method including: obtaining a list of text titles to be clustered; constructing an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge; removing the edges of the initial connected graph that are larger than an initial distance threshold to obtain one or more sub-connected graphs; and calculating the degree of aggregation of each sub-connected graph, where if the degree of aggregation of a sub-connected graph is greater than or equal to a clustering threshold, the text set corresponding to that sub-connected graph is one text cluster.
  • Further, if the degree of aggregation of a sub-connected graph is less than the clustering threshold, the current distance threshold of the sub-connected graph is obtained and its edges larger than the current distance threshold are removed; this splitting repeats until every sub-connected graph meets the clustering threshold.
  • The degree of aggregation of a sub-connected graph refers to the ratio of the clustering coefficient of the sub-connected graph to the maximum graph diameter.
  • The method for obtaining the "distance between the vectorized text titles" includes: training a topic model on the text titles in the text title list; using the topic model to vectorize each text title to obtain text title vectors; calculating the similarity between each pair of text title vectors; and calculating the distance between each pair of text title vectors.
  • Further, the text represented by the highest-degree vertex in the sub-connected graph corresponding to the text cluster is taken as the representative text of the text cluster, and the keywords of the text cluster are extracted as the content of the text cluster.
  • Further, where the text is news and the text cluster is a news cluster, the news in the news cluster is sorted from newest to oldest by release time, the time interval between adjacent news items is calculated, and the sum of the inverses of all the time intervals is taken as the popularity of the news cluster; news clusters whose popularity is greater than a popularity threshold are defined as hot news.
  • Another embodiment of the present invention provides a method for text clustering, the method including: obtaining a list of text titles to be clustered; constructing an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge; and calculating the degree of aggregation of the initial connected graph, where if the degree of aggregation of the initial connected graph is greater than or equal to the clustering threshold, the text set corresponding to the initial connected graph is one text cluster.
  • If the degree of aggregation of the initial connected graph is less than the clustering threshold, the edges of the initial connected graph that are larger than the initial distance threshold are removed to obtain one or more sub-connected graphs; the degree of aggregation of each sub-connected graph is then calculated, and if the degree of aggregation of a sub-connected graph is greater than or equal to the clustering threshold, the text set corresponding to that sub-connected graph is one text cluster.
  • Further, the degree of aggregation of the initial connected graph refers to the ratio of the clustering coefficient of the initial connected graph to the maximum graph diameter.
  • The method for obtaining the "distance between the vectorized text titles" includes: training a topic model on the text titles in the text title list; using the topic model to vectorize each text title to obtain text title vectors; calculating the similarity between each pair of text title vectors; and calculating the distance between each pair of text title vectors.
  • Further, the text represented by the highest-degree vertex in the sub-connected graph corresponding to the text cluster is taken as the representative text of the text cluster, and the keywords of the text cluster are extracted as the content of the text cluster.
  • Further, where the text is news and the text cluster is a news cluster, the news in the news cluster is sorted from newest to oldest by release time, the time interval between adjacent news items is calculated, and the sum of the inverses of all the time intervals is taken as the popularity of the news cluster; news clusters whose popularity is greater than a popularity threshold are defined as hot news.
  • An embodiment of the present invention provides an electronic device including a memory and a processor, the memory storing a computer program that can run on the processor; when the processor executes the program, the steps in any of the above text clustering methods are implemented.
  • An embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps in any of the above text clustering methods are implemented.
  • Compared with the prior art, the present invention can cluster texts quickly and stably, and the clustering results for the same text data are consistent every time.
  • Using this method to cluster enterprise-related news quickly achieves stable extraction of enterprise hot news, and it works well for extracting enterprise-related news hotspots.
  • FIG. 1 is a schematic flowchart of the text clustering method in the first embodiment of the present invention;
  • FIG. 2 is an example of a connected graph;
  • FIG. 3 shows the sub-connected graphs obtained from FIG. 2 after edge removal;
  • FIG. 4 is a schematic flowchart of the text clustering method in the second embodiment of the present invention.
  • FIG. 1 is a schematic flowchart of the text clustering method in the first embodiment of the present invention.
  • In this embodiment, the relationships between texts are represented by a connected graph, and the connected graph is then disassembled into different sub-connected graphs, thereby clustering the texts.
  • the method includes:
  • Step S11: Obtain a list of text titles to be clustered.
  • The text title list may be a list of news titles related to a specific enterprise, or another type of text title list. Each text title represents one text.
  • Step S12: Construct an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge.
  • In a preferred embodiment, this step includes:
  • Step S121: Train a topic model on the text titles in the list. Vectorization methods such as TF-IDF, word2vec, LSI, or LDA can be used to perform topic training on the text titles in the list to obtain the topic model.
  • Step S122: Use the topic model to vectorize each text title to obtain a text title vector, i.e., obtain the topic-vector representation of each text title.
  • Step S123: Calculate the similarity between each pair of text title vectors. Cosine distance, the Jaccard coefficient, or Euclidean distance can be used to compute the distance between two text title vectors. Taking cosine distance as an example, first calculate the cosine similarity between the two text title vectors, i.e., the cosine of the angle between the two vectors. The cosine value lies in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors; the closer to -1, the more opposite their directions; a value close to 0 means the two vectors are nearly orthogonal.
  • Step S124: Calculate the distance between each pair of text title vectors. Continuing with cosine distance, cosine distance = 1 - cosine similarity, with a range of [0, 2]; the smaller the distance, the closer the directions of the two text title vectors, i.e., the more similar the two text titles.
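The cosine similarity and the derived cosine distance (distance = 1 - similarity) can be sketched as follows. This is an illustrative sketch, not part of the patent; the function names are chosen here for clarity.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two title vectors; range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Edge length between two titles; range [0, 2], smaller = more similar."""
    return 1.0 - cosine_similarity(u, v)

# Vectors pointing the same way give distance ~0; orthogonal vectors give 1.
print(cosine_distance([1.0, 2.0], [2.0, 4.0]))  # ≈ 0.0
```

Two titles whose topic vectors point in nearly the same direction thus get a short edge, which is exactly what survives the edge removal of step S13.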
  • Step S125: Construct the initial connected graph. With each text title as a vertex (a vertex represents one text title, and a text title represents one text, so a vertex represents one text) and the vectorized distance between text titles as an edge, construct the initial connected graph between the text titles. A connected graph (which in this document covers both the initial connected graph and the subsequent sub-connected graphs) is characterized by a path between every two vertices; see FIG. 2 for an example. Preferably, the initial connected graph constructed here is a complete graph, i.e., any two vertices are joined by an edge.
  • Step S13: Remove the edges of the initial connected graph that are larger than the initial distance threshold to obtain one or more sub-connected graphs.
  • The length of an edge of the connected graph reflects the similarity between its vertices: the longer the edge, the lower the similarity; the shorter the edge, the higher the similarity. Clustering is the process of removing the low-similarity edges, and the initial distance threshold is preferably 0.4. Removing the edges of the initial connected graph that are larger than the initial distance threshold yields one or more sub-connected graphs (see FIG. 3: the connected graph in FIG. 2 yields two sub-connected graphs after edge removal).
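Step S13 can be sketched as follows: drop the over-threshold edges of the complete graph, then collect the connected components by breadth-first search. This is a minimal illustrative sketch; the function and variable names are invented here, and the pairwise distances would come from the vectorized titles of step S12.

```python
from collections import deque

def split_graph(vertices, dist, threshold):
    """Remove edges whose distance exceeds `threshold` from the complete
    graph over `vertices`, then return the connected components (the
    sub-connected graphs), each as a sorted list of vertices."""
    # Keep only the edges at or below the distance threshold.
    adj = {v: set() for v in vertices}
    for i, u in enumerate(vertices):
        for w in vertices[i + 1:]:
            d = dist.get((u, w), dist.get((w, u)))
            if d <= threshold:
                adj[u].add(w)
                adj[w].add(u)
    # Breadth-first search to collect each component.
    seen, components = set(), []
    for v in vertices:
        if v in seen:
            continue
        comp, queue = set(), deque([v])
        while queue:
            x = queue.popleft()
            if x in comp:
                continue
            comp.add(x)
            queue.extend(adj[x] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components
```

With the preferred initial threshold of 0.4, two tightly related titles (distance 0.1) stay connected while loosely related ones (distance 0.9) fall into separate sub-connected graphs.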
  • Step S14: Calculate the degree of aggregation of each sub-connected graph. If the degree of aggregation of a sub-connected graph is greater than or equal to the clustering threshold, the text set corresponding to that sub-connected graph is one text cluster.
  • If the degree of aggregation of a sub-connected graph is relatively high, i.e., greater than or equal to the clustering threshold, the texts in the corresponding text set are highly similar, and the set forms one text cluster.
  • Preferably, the degree of aggregation of a connected graph refers to the ratio of the clustering coefficient of the connected graph to the maximum graph diameter.
  • The clustering coefficient is an index measuring how tightly a graph is clustered.
  • The maximum graph diameter, also called the diameter of the tree, is the longest connected path in the graph. The larger the clustering coefficient, the more tightly the graph is knit; the larger the maximum graph diameter, the looser the graph. The ratio of the two therefore measures the degree of aggregation well: the larger the ratio, the better the graph clusters.
  • The threshold on the degree of aggregation is the clustering threshold, with a default of 0.09: when the ratio of the clustering coefficient of a connected graph to the maximum graph diameter is greater than or equal to the clustering threshold, the degree of aggregation of the connected graph meets the requirement, the similarity of the corresponding text set reaches the clustering standard, and the set can be treated as one text cluster.
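The degree of aggregation can be sketched as below. The patent does not specify which variant of the clustering coefficient is meant, so the average local clustering coefficient is assumed here; the maximum graph diameter is computed as the longest shortest path via per-vertex breadth-first search. The sketch assumes a connected graph with at least two vertices.

```python
from collections import deque
from itertools import combinations

def aggregation_degree(adj):
    """Clustering coefficient divided by maximum graph diameter, for a
    connected graph given as {vertex: set_of_neighbours}."""
    # Average local clustering coefficient: for each vertex, the share of
    # its neighbour pairs that are themselves connected by an edge.
    coeffs = []
    for v, nbrs in adj.items():
        pairs = list(combinations(sorted(nbrs), 2))
        if not pairs:
            coeffs.append(0.0)
        else:
            coeffs.append(sum(b in adj[a] for a, b in pairs) / len(pairs))
    clustering_coefficient = sum(coeffs) / len(coeffs)
    # Maximum graph diameter: longest shortest path, one BFS per vertex.
    diam = 0
    for src in adj:
        depth, queue = {src: 0}, deque([src])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in depth:
                    depth[y] = depth[x] + 1
                    queue.append(y)
        diam = max(diam, max(depth.values()))
    return clustering_coefficient / diam

# A triangle is as tight as possible: coefficient 1.0, diameter 1,
# so the degree of aggregation is 1.0, far above the 0.09 default.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(aggregation_degree(triangle))  # 1.0
```

A three-vertex path, by contrast, has coefficient 0 and diameter 2, giving a degree of aggregation of 0, i.e., below the default clustering threshold of 0.09.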
  • Performing this text clustering method realizes clustering graphically, so texts can be clustered quickly and stably; the clustering results for the same text data are consistent every time, and the results are clear.
  • Preferably, the method further includes:
  • S21: If the degree of aggregation of a sub-connected graph is less than the clustering threshold, obtain the current distance threshold of the sub-connected graph and remove its edges that are larger than the current distance threshold to obtain one or more sub-connected graphs;
  • S22: Calculate the degree of aggregation of each sub-connected graph, and repeat steps S21 to S22 until the degrees of aggregation of all sub-connected graphs are greater than or equal to the clustering threshold; the text set corresponding to each sub-connected graph meeting the threshold is one text cluster.
  • Edge removal means removing the edges that do not meet the requirement. After the initial connected graph yields one or more first-level sub-connected graphs, a sub-connected graph whose degree of aggregation is less than the clustering threshold corresponds to a text set whose similarity does not reach the clustering standard, so that sub-connected graph must be disassembled further, again by removing edges. Since all edges of such a sub-connected graph are already smaller than the initial distance threshold, the threshold is decremented at each level (by default in equal steps of 0.05): the current distance threshold of a first-level sub-connected graph is the previous threshold (the initial distance threshold) minus the default step, i.e., 0.4 - 0.05 = 0.35. Removing the edges larger than the current threshold yields second-level sub-connected graphs, whose degrees of aggregation are calculated in turn, with the threshold decremented again (0.35 - 0.05 = 0.3) when further splitting is needed, and so on until every sub-connected graph meets the clustering threshold. Finally, the text title set represented by the list to be clustered is divided into multiple text clusters.
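The per-level decrement of the distance threshold (an initial threshold of 0.4, reduced by a default step of 0.05 at each further level of splitting, per the detailed description) can be written as a small helper. The function name is invented here for illustration.

```python
def current_threshold(level, initial=0.4, step=0.05):
    """Distance threshold used when splitting a level-`level` graph
    (level 0 is the initial connected graph; defaults from the text).
    Rounding avoids floating-point drift in the repeated subtraction."""
    return round(initial - level * step, 10)

# 0.4 for the initial graph, then 0.35, 0.3, ... for deeper levels.
print([current_threshold(k) for k in range(4)])  # [0.4, 0.35, 0.3, 0.25]
```

Because the threshold strictly decreases, the splitting loop always terminates: once the threshold drops below every remaining edge length, each vertex ends up in its own sub-connected graph.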
  • Preferably, the method further includes: taking the text represented by the highest-degree vertex in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
  • The degree of a vertex of a connected graph is the number of edges incident to the vertex, so "the highest-degree vertex in the sub-connected graph" is the vertex with the most connected edges in that sub-connected graph. The text represented by the highest-degree vertex is taken as the representative text of the text cluster, and the keywords of the text cluster are extracted as its content; from the representative text and the content, the general situation of the text cluster can be grasped quickly.
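Choosing the representative text can be sketched as follows (illustrative only; keyword extraction is not shown because the patent does not specify a particular extractor):

```python
def representative_title(adj, titles):
    """Return the title of the highest-degree vertex of a sub-connected
    graph: the text with the most surviving edges stands for the cluster."""
    best = max(adj, key=lambda v: len(adj[v]))
    return titles[best]

# A star-shaped sub-connected graph: vertex 0 has degree 3, the rest 1.
adj = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
titles = {0: "title-0", 1: "title-1", 2: "title-2", 3: "title-3"}
print(representative_title(adj, titles))  # title-0
```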
  • Preferably, the method further includes: where the text is news and the text cluster is a news cluster, sorting the news in the news cluster from newest to oldest by release time, calculating the time interval between adjacent news items, and taking the sum of the inverses of all the time intervals as the popularity of the news cluster; news clusters whose popularity is greater than the popularity threshold are defined as hot news.
  • Since the popularity of news is related to how densely the news bursts, the sum of the inverses of all the time intervals is taken as the popularity of the news cluster.
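The popularity computation can be sketched as follows. This is illustrative only: the time unit is left to the caller, and skipping zero-length intervals is an assumption made here to avoid division by zero (the patent does not say how identical timestamps are handled).

```python
def popularity(release_times):
    """Sum of inverses of the gaps between consecutive releases
    (sorted newest to oldest); denser bursts of news score higher."""
    ts = sorted(release_times, reverse=True)
    gaps = [a - b for a, b in zip(ts, ts[1:])]
    # Zero gaps (identical timestamps) are skipped in this sketch.
    return sum(1.0 / g for g in gaps if g > 0)

# Three releases one time unit apart: popularity 1/1 + 1/1 = 2.0
print(popularity([10.0, 11.0, 12.0]))  # 2.0
```

A news cluster is then labelled hot when `popularity(...)` exceeds the popularity threshold.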
  • As a concrete example, the obtained list of text (news) titles to be clustered includes items such as:
  • Company A received 110 million yuan in financing, invested exclusively and strategically by Company C
  • In the final clustering result, the two news hotspots are well separated.
  • The first hotspot has a popularity of 164.800618 with 5 related news items; its clustering coefficient is 1.0, its maximum graph diameter is 1, and its keywords are "Company A, Company C, financing, new round, exclusive, strategic, investment".
  • The second hotspot has a popularity of 111.744243 with 6 related news items; its clustering coefficient is 0.55, its maximum graph diameter is 2, and its keywords are "Zhang, trap, Company B, recollection".
  • FIG. 4 is a schematic flowchart of the text clustering method in the second embodiment of the present invention; the method includes:
  • Step S31: Obtain a list of text titles to be clustered;
  • Step S32: Construct an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge;
  • Step S33: Calculate the degree of aggregation of the initial connected graph; if the degree of aggregation of the initial connected graph is greater than or equal to the clustering threshold, the text set corresponding to the initial connected graph is one text cluster.
  • The difference from the first embodiment is that the degree of aggregation of the initial connected graph itself is also calculated: if it is greater than or equal to the clustering threshold, the text set corresponding to the initial connected graph is one text cluster.
  • If the degree of aggregation of the initial connected graph is less than the clustering threshold, then, following the method of the first embodiment, edges are removed from the initial connected graph to obtain sub-connected graphs, the degrees of aggregation of the sub-connected graphs are calculated, and further edge removal is decided; for the specific manner, refer to the first embodiment, which is not repeated here.
  • The present invention also provides an electronic device including a memory and a processor, the memory storing a computer program that can run on the processor; when the processor executes the program, the steps in the above text clustering method are implemented.
  • The present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps in the above text clustering method are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, device and storage medium for text clustering, the method including: obtaining a list of text titles to be clustered (S11); constructing an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge (S12); removing the edges of the initial connected graph that are larger than an initial distance threshold to obtain one or more sub-connected graphs (S13); and calculating the degree of aggregation of each sub-connected graph, where if the degree of aggregation of a sub-connected graph is greater than or equal to a clustering threshold, the text set corresponding to that sub-connected graph is one text cluster (S14). The method can cluster texts quickly and stably, and the clustering results for the same text data are consistent every time.

Description

Method, device and storage medium for text clustering

Technical Field
The present invention relates to information processing technology, and in particular to a method, device and storage medium for text clustering.

Background
Text is the main carrier of information. With the development of the Internet, browsing news text published promptly online has become an important means for people to obtain information. The amount of news text on the Internet is now enormous; to let people navigate and browse news quickly and conveniently, news text needs to be clustered using text clustering technology. Text clustering technology can automatically divide a text set into multiple clusters, so that texts in the same cluster share a certain similarity while the similarity between texts in different clusters is as low as possible. Commonly used clustering methods include K-means, hierarchical clustering, and the single-pass algorithm.
However, the single-pass algorithm is input-order dependent: the same clustering objects presented in different orders produce different clustering results. Other clustering algorithms have similar drawbacks: K-means requires the number of categories to be specified, and hierarchical clustering involves selecting a level; different category counts or different levels lead to inconsistent clustering results.
Summary of the Invention
The purpose of the present invention is to provide a method, device and storage medium for text clustering.
To achieve one of the above purposes, an embodiment of the present invention provides a method for text clustering, the method including:
obtaining a list of text titles to be clustered;
constructing an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge;
removing the edges of the initial connected graph that are larger than an initial distance threshold to obtain one or more sub-connected graphs;
calculating the degree of aggregation of each sub-connected graph, where if the degree of aggregation of a sub-connected graph is greater than or equal to a clustering threshold, the text set corresponding to that sub-connected graph is one text cluster.
As a further improvement of an embodiment of the present invention, the method further includes:
S21: if the degree of aggregation of a sub-connected graph is less than the clustering threshold, obtaining the current distance threshold of the sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold to obtain one or more sub-connected graphs;
S22: calculating the degree of aggregation of each sub-connected graph, and repeating steps S21 to S22 until the degrees of aggregation of all sub-connected graphs are greater than or equal to the clustering threshold; the text set corresponding to each sub-connected graph whose degree of aggregation is greater than or equal to the clustering threshold is one text cluster.
As a further improvement of an embodiment of the present invention, the degree of aggregation of a sub-connected graph refers to the ratio of the clustering coefficient of the sub-connected graph to the maximum graph diameter.
As a further improvement of an embodiment of the present invention, the method for obtaining the "distance between the vectorized text titles" includes:
training a topic model on the text titles in the text title list;
using the topic model to vectorize each text title to obtain text title vectors;
calculating the similarity between each pair of text title vectors;
calculating the distance between each pair of text title vectors.
As a further improvement of an embodiment of the present invention, the method further includes:
taking the text represented by the highest-degree vertex in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
As a further improvement of an embodiment of the present invention, the method further includes:
the text being news and the text cluster being a news cluster, sorting the news in the news cluster from newest to oldest by release time, calculating the time interval between adjacent news items, taking the sum of the inverses of all the time intervals as the popularity of the news cluster, and defining news clusters whose popularity is greater than a popularity threshold as hot news.
To achieve one of the above purposes, an embodiment of the present invention provides a method for text clustering, the method including:
obtaining a list of text titles to be clustered;
constructing an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge;
calculating the degree of aggregation of the initial connected graph, where if the degree of aggregation of the initial connected graph is greater than or equal to the clustering threshold, the text set corresponding to the initial connected graph is one text cluster.
As a further improvement of an embodiment of the present invention, if the degree of aggregation of the initial connected graph is less than the clustering threshold, the edges of the initial connected graph that are larger than the initial distance threshold are removed to obtain one or more sub-connected graphs;
the degree of aggregation of each sub-connected graph is calculated, and if the degree of aggregation of a sub-connected graph is greater than or equal to the clustering threshold, the text set corresponding to that sub-connected graph is one text cluster.
As a further improvement of an embodiment of the present invention, the method further includes:
S41: if the degree of aggregation of a sub-connected graph is less than the clustering threshold, obtaining the current distance threshold of the sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold to obtain one or more sub-connected graphs;
S42: calculating the degree of aggregation of each sub-connected graph, and repeating steps S41 to S42 until the degrees of aggregation of all sub-connected graphs are greater than or equal to the clustering threshold; the text set corresponding to each sub-connected graph whose degree of aggregation is greater than or equal to the clustering threshold is one text cluster.
As a further improvement of an embodiment of the present invention, the degree of aggregation of the initial connected graph refers to the ratio of the clustering coefficient of the initial connected graph to the maximum graph diameter.
As a further improvement of an embodiment of the present invention, the method for obtaining the "distance between the vectorized text titles" includes:
training a topic model on the text titles in the text title list;
using the topic model to vectorize each text title to obtain text title vectors;
calculating the similarity between each pair of text title vectors;
calculating the distance between each pair of text title vectors.
As a further improvement of an embodiment of the present invention, the method further includes:
taking the text represented by the highest-degree vertex in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
As a further improvement of an embodiment of the present invention, the method further includes:
the text being news and the text cluster being a news cluster, sorting the news in the news cluster from newest to oldest by release time, calculating the time interval between adjacent news items, taking the sum of the inverses of all the time intervals as the popularity of the news cluster, and defining news clusters whose popularity is greater than a popularity threshold as hot news.
To achieve one of the above purposes, an embodiment of the present invention provides an electronic device including a memory and a processor, the memory storing a computer program that can run on the processor, characterized in that the processor, when executing the program, implements the steps in any one of the above methods for text clustering.
To achieve one of the above purposes, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps in any one of the above methods for text clustering.
Compared with the prior art, the present invention can cluster texts quickly and stably, and the clustering results for the same text data are consistent every time. Moreover, using this method to cluster enterprise-related news quickly achieves stable extraction of enterprise hot news, and it works well for extracting enterprise-related news hotspots.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the text clustering method in the first embodiment of the present invention;
FIG. 2 is an example of a connected graph;
FIG. 3 shows the sub-connected graphs obtained from FIG. 2 after edge removal;
FIG. 4 is a schematic flowchart of the text clustering method in the second embodiment of the present invention.
Detailed Description
The present invention is described in detail below with reference to the specific embodiments shown in the drawings. These embodiments do not limit the present invention, however, and structural, methodological, or functional changes made by those of ordinary skill in the art according to these embodiments are all included within the scope of protection of the present invention.
As shown in FIG. 1, a schematic flowchart of the text clustering method in the first embodiment of the present invention: in this embodiment, the relationships between texts are represented by a connected graph, and the connected graph is then disassembled into different sub-connected graphs, thereby clustering the texts. The method includes:
Step S11: Obtain a list of text titles to be clustered.
The text title list may be a list of news titles related to a specific enterprise, or another type of text title list. Each text title represents one text.
Step S12: Construct an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge.
This step represents the relationships between the texts with a connected graph. In a preferred embodiment, the step includes:
Step S121: Train a topic model on the text titles in the list: vectorization methods such as TF-IDF, word2vec, LSI, or LDA can be used to perform topic training on the text titles in the list to obtain a topic model.
Step S122: Use the topic model to vectorize each text title to obtain a text title vector: the topic model yields a topic-vector representation of each text title, i.e., each text title is vectorized into a text title vector.
Step S123: Calculate the similarity between each pair of text title vectors: cosine distance, the Jaccard coefficient, Euclidean distance, and the like can be used to compute the distance between two text title vectors. Taking cosine distance as an example, first compute the cosine similarity between the two text title vectors, i.e., the cosine of the angle between the two vectors. The cosine value lies in [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors; the closer to -1, the more opposite their directions; a value close to 0 means the two vectors are nearly orthogonal.
Step S124: Calculate the distance between each pair of text title vectors: continuing with cosine distance, cosine distance = 1 - cosine similarity, with a range of [0, 2]; the smaller the distance, the closer the directions of the two text title vectors, i.e., the more similar the two text titles.
Step S125: Construct the initial connected graph: with each text title as a vertex (a vertex represents one text title, and a text title represents one text, so a vertex represents one text) and the vectorized distance between text titles as an edge, construct the initial connected graph between the text titles. A connected graph (which in this document covers both the initial connected graph here and the subsequent sub-connected graphs) is characterized by a path between every two vertices of the graph; see FIG. 2 for a concrete example. Preferably, the initial connected graph constructed here is a complete graph, i.e., any two vertices are joined by an edge.
Step S13: Remove the edges of the initial connected graph that are larger than the initial distance threshold to obtain one or more sub-connected graphs.
The length of an edge of the connected graph reflects the similarity between its vertices: the longer the edge, the lower the similarity; the shorter the edge, the higher the similarity. Clustering is therefore the process of removing the low-similarity edges, and the initial distance threshold is preferably 0.4. Removing the edges of the initial connected graph that are larger than the initial distance threshold yields one or more sub-connected graphs (see FIG. 3: the connected graph in FIG. 2 yields two sub-connected graphs after edge removal).
Step S14: Calculate the degree of aggregation of each sub-connected graph; if the degree of aggregation of a sub-connected graph is greater than or equal to the clustering threshold, the text set corresponding to that sub-connected graph is one text cluster.
Calculate the degree of aggregation of each sub-connected graph; if it is relatively high, i.e., greater than or equal to the clustering threshold, the texts in the corresponding set are highly similar and the set forms one text cluster.
Preferably, the degree of aggregation of a connected graph refers to the ratio of the clustering coefficient of the connected graph to the maximum graph diameter. The clustering coefficient is an index measuring how tightly a graph is clustered; the maximum graph diameter, also called the diameter of the tree, is the longest connected path in the graph. The larger the clustering coefficient, the more tightly the graph is knit; the larger the maximum graph diameter, the looser the graph. The ratio of the two therefore measures the degree of aggregation well: the larger the ratio, the better the graph clusters. The threshold on the degree of aggregation is the clustering threshold, with a default of 0.09: when the ratio of the clustering coefficient of a connected graph to the maximum graph diameter is greater than or equal to the clustering threshold, the degree of aggregation of the connected graph meets the requirement, the similarity of the corresponding text set reaches the clustering standard, and the set can be taken as one text cluster.
In this embodiment, performing this text clustering method realizes clustering graphically, so texts can be clustered quickly and stably; the clustering results for the same text data are consistent every time, and the results are clear.
Preferably, the method further includes:
S21: if the degree of aggregation of a sub-connected graph is less than the clustering threshold, obtaining the current distance threshold of the sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold to obtain one or more sub-connected graphs;
S22: calculating the degree of aggregation of each sub-connected graph, and repeating steps S21 to S22 until the degrees of aggregation of all sub-connected graphs are greater than or equal to the clustering threshold; the text set corresponding to each sub-connected graph meeting the threshold is one text cluster.
After the initial connected graph has been edge-trimmed (edge removal means removing the edges that do not meet the requirement) into one or more first-level sub-connected graphs, if the degree of aggregation of some sub-connected graph is less than the clustering threshold, the similarity of its text set does not reach the clustering standard and the sub-connected graph must be disassembled further, again by removing edges. Since all edges of such a sub-connected graph are already smaller than the initial distance threshold, the initial distance threshold must be decremented (by default in equal steps of 0.05 per level): the current distance threshold of a first-level sub-connected graph is the previous level's threshold (the initial distance threshold) minus the default step, i.e., 0.4 - 0.05 = 0.35. For a first-level sub-connected graph whose degree of aggregation is below the clustering threshold, removing the edges larger than the current distance threshold yields second-level sub-connected graphs, whose degrees of aggregation are calculated in turn. If some second-level sub-connected graph still falls below the clustering threshold, its current distance threshold is computed as the previous threshold minus the default step (0.35 - 0.05 = 0.3), its over-threshold edges are removed to yield third-level sub-connected graphs, and their degrees of aggregation are calculated to decide whether further edge removal is needed. This loop continues until the degrees of aggregation of all sub-connected graphs are greater than or equal to the clustering threshold; the text set corresponding to each such sub-connected graph is one text cluster.
Finally, the set of text titles represented by the list to be clustered is divided into multiple text clusters.
Preferably, the method further includes:
taking the text represented by the highest-degree vertex in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
The degree of a vertex of a connected graph is the number of edges incident to the vertex, and "the highest-degree vertex in the sub-connected graph" is the vertex with the most connected edges in the sub-connected graph. The text represented by the highest-degree vertex is taken as the representative text of the text cluster, and the keywords of the text cluster are extracted as its content; from the representative text and the content, the general situation of the text cluster can be grasped quickly.
Preferably, the method further includes:
the text being news and the text cluster being a news cluster, sorting the news in the news cluster from newest to oldest by release time, calculating the time interval between adjacent news items, taking the sum of the inverses of all the time intervals as the popularity of the news cluster, and defining news clusters whose popularity is greater than the popularity threshold as hot news.
Since the popularity of news is related to how densely the news bursts, the sum of the inverses of all the time intervals is taken as the popularity of the news cluster, and news clusters whose popularity is greater than the popularity threshold are defined as hot news.
This embodiment is further explained and illustrated below with a specific example.
The obtained list of text (news) titles to be clustered is as follows:
Company A completed a new round of 110 million yuan financing
Technology company "Company A" received 110 million yuan in exclusive strategic financing from Company C
Company A received 110 million yuan in financing, invested exclusively and strategically by Company C
Company A completed a new round of 110 million yuan financing
Frontline | Company A completed a new round of 110 million yuan financing, exclusively and strategically invested by Company C
Zhang comments on Company C stepping on a 1-billion-yuan trap: exposing the risk is a good thing
Zhang recalls "Company C stepping on a 1-billion-yuan trap": it has happened before
Zhang recalls "Company C stepping on a 1-billion-yuan trap": it has happened before, but the other party refused to verify
The core of the Rashomon over Company C stepping on Company D's 1-billion-yuan financing trap: whether Company B participated
Company C steps on a 1-billion-yuan trap; the road ahead for Company D is unclear
Company C steps on a 1-billion-yuan trap and shifts the blame to Company B?
The final clustering result obtained by this embodiment is:
##Group1 164.800618(5)-1.000000(1)-Company A, Company C, financing, new round, exclusive, strategic, investment
20190709 14:56:00 Company A completed a new round of 110 million yuan financing
20190705 18:32:09 Technology company "Company A" received 110 million yuan in exclusive strategic financing from Company C
20190705 17:08:00 Company A received 250 million yuan in financing, invested exclusively and strategically by Company C
20190705 16:40:00 Company A completed a new round of 110 million yuan financing
20190705 16:33:00 Frontline | Company A completed a new round of 110 million yuan financing, exclusively and strategically invested by Company C
##Group2 111.744243(6)-0.550000(2)-Zhang, trap, Company B, recollection
20190710 14:00:00 Zhang comments on Company C stepping on a 1-billion-yuan trap: exposing the risk is a good thing
20190710 13:49:00 Zhang recalls "Company C stepping on a 1-billion-yuan trap": it has happened before
20190710 11:15:36 Zhang recalls "Company C stepping on a 1-billion-yuan trap": it has happened before, but the other party refused to verify
20190709 22:51:00 The core of the Rashomon over Company C stepping on Company D's 1-billion-yuan financing trap: whether Company B participated
20190709 13:59:00 Company C steps on a 1-billion-yuan trap; the future road for Company D is unclear
20190709 00:00:00 Company C steps on a 1-billion-yuan trap and shifts the blame to Company B?
From the result it can be seen that the two news hotspots are well separated. The first hotspot has a popularity of 164.800618 with 5 related news items; its clustering coefficient is 1.0, its maximum graph diameter is 1, and its keywords are "Company A, Company C, financing, new round, exclusive, strategic, investment". The second hotspot has a popularity of 111.744243 with 6 related news items; its clustering coefficient is 0.55, its maximum graph diameter is 2, and its keywords are "Zhang, trap, Company B, recollection".
As shown in FIG. 4, a schematic flowchart of the text clustering method in the second embodiment of the present invention; the method includes:
Step S31: Obtain a list of text titles to be clustered;
Step S32: Construct an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge;
Step S33: Calculate the degree of aggregation of the initial connected graph; if the degree of aggregation of the initial connected graph is greater than or equal to the clustering threshold, the text set corresponding to the initial connected graph is one text cluster.
The difference between this embodiment and the first embodiment is that the degree of aggregation is also calculated for the initial connected graph itself: if it is greater than or equal to the clustering threshold, the text set corresponding to the initial connected graph is one text cluster.
It should be noted that if the degree of aggregation of the initial connected graph is less than the clustering threshold, then, following the method of the first embodiment, edges are removed from the initial connected graph to obtain sub-connected graphs, their degrees of aggregation are calculated, and further edge removal is decided; for the specific manner, refer to the first embodiment, which is not repeated here.
The present invention also provides an electronic device including a memory and a processor, the memory storing a computer program that can run on the processor; when the processor executes the program, the steps in the above method for text clustering are implemented.
The present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps in the above method for text clustering are implemented.
It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution; this manner of description is adopted merely for clarity. Those skilled in the art should take the specification as a whole, and the technical solutions in the embodiments may also be appropriately combined to form other embodiments understandable to those skilled in the art.
The series of detailed descriptions listed above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit its scope of protection; equivalent embodiments or modifications made without departing from the technical spirit of the present invention shall all be included within the scope of protection of the present invention.

Claims (15)

  1. A method for text clustering, characterized in that the method includes:
    obtaining a list of text titles to be clustered;
    constructing an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge;
    removing the edges of the initial connected graph that are larger than an initial distance threshold to obtain one or more sub-connected graphs;
    calculating the degree of aggregation of each of the sub-connected graphs, where if the degree of aggregation of one of the sub-connected graphs is greater than or equal to a clustering threshold, the text set corresponding to that sub-connected graph is one text cluster.
  2. The method for text clustering according to claim 1, characterized in that the method further includes:
    S21: if the degree of aggregation of one of the sub-connected graphs is less than the clustering threshold, obtaining the current distance threshold of the sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold to obtain one or more sub-connected graphs;
    S22: calculating the degree of aggregation of each of the sub-connected graphs, and repeating steps S21 to S22 until the degrees of aggregation of all the sub-connected graphs are greater than or equal to the clustering threshold, the text set corresponding to each sub-connected graph whose degree of aggregation is greater than or equal to the clustering threshold being one text cluster.
  3. The method for text clustering according to claim 1, characterized in that:
    the degree of aggregation of the sub-connected graph refers to the ratio of the clustering coefficient of the sub-connected graph to the maximum graph diameter.
  4. The method for text clustering according to claim 1, characterized in that the method for obtaining the "distance between the vectorized text titles" includes:
    training a topic model on the text titles in the text title list;
    using the topic model to vectorize each of the text titles to obtain text title vectors;
    calculating the similarity between each pair of the text title vectors;
    calculating the distance between each pair of text title vectors.
  5. The method for text clustering according to claim 1, characterized in that the method further includes:
    taking the text represented by the highest-degree vertex in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
  6. The method for text clustering according to claim 1, characterized in that the method further includes:
    the text being news and the text cluster being a news cluster, sorting the news in the news cluster from newest to oldest by release time, calculating the time interval between adjacent news items, taking the sum of the inverses of all the time intervals as the popularity of the news cluster, and defining news clusters whose popularity is greater than a popularity threshold as hot news.
  7. A method for text clustering, characterized in that the method includes:
    obtaining a list of text titles to be clustered;
    constructing an initial connected graph between the text titles, with each text title as a vertex and the distance between the vectorized text titles as an edge;
    calculating the degree of aggregation of the initial connected graph, where if the degree of aggregation of the initial connected graph is greater than or equal to a clustering threshold, the text set corresponding to the initial connected graph is one text cluster.
  8. The method for text clustering according to claim 7, characterized in that:
    if the degree of aggregation of the initial connected graph is less than the clustering threshold, the edges of the initial connected graph that are larger than an initial distance threshold are removed to obtain one or more sub-connected graphs;
    the degree of aggregation of each of the sub-connected graphs is calculated, and if the degree of aggregation of one of the sub-connected graphs is greater than or equal to the clustering threshold, the text set corresponding to that sub-connected graph is one text cluster.
  9. The method for text clustering according to claim 8, characterized in that the method further includes:
    S41: if the degree of aggregation of one of the sub-connected graphs is less than the clustering threshold, obtaining the current distance threshold of the sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold to obtain one or more sub-connected graphs;
    S42: calculating the degree of aggregation of each of the sub-connected graphs, and repeating steps S41 to S42 until the degrees of aggregation of all the sub-connected graphs are greater than or equal to the clustering threshold, the text set corresponding to each sub-connected graph whose degree of aggregation is greater than or equal to the clustering threshold being one text cluster.
  10. The method for text clustering according to claim 7, characterized in that:
    the degree of aggregation of the initial connected graph refers to the ratio of the clustering coefficient of the initial connected graph to the maximum graph diameter.
  11. The method for text clustering according to claim 7, characterized in that the method for obtaining the "distance between the vectorized text titles" includes:
    training a topic model on the text titles in the text title list;
    using the topic model to vectorize each of the text titles to obtain text title vectors;
    calculating the similarity between each pair of the text title vectors;
    calculating the distance between each pair of text title vectors.
  12. The method for text clustering according to claim 7, characterized in that the method further includes:
    taking the text represented by the highest-degree vertex in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
  13. The method for text clustering according to claim 7, characterized in that the method further includes:
    the text being news and the text cluster being a news cluster, sorting the news in the news cluster from newest to oldest by release time, calculating the time interval between adjacent news items, taking the sum of the inverses of all the time intervals as the popularity of the news cluster, and defining news clusters whose popularity is greater than a popularity threshold as hot news.
  14. An electronic device including a memory and a processor, the memory storing a computer program that can run on the processor, characterized in that the processor, when executing the program, implements the steps in the method for text clustering according to any one of claims 1-13.
  15. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps in the method for text clustering according to any one of claims 1-13.
PCT/CN2019/115118 2019-08-15 2019-11-01 Method, device and storage medium for text clustering WO2021027086A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910753636.6 2019-08-15
CN201910753636.6A CN110532388B (zh) 2019-08-15 2019-08-15 Method, device and storage medium for text clustering

Publications (1)

Publication Number Publication Date
WO2021027086A1 true WO2021027086A1 (zh) 2021-02-18

Family

ID=68663389

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/115118 WO2021027086A1 (zh) 2019-08-15 2019-11-01 Method, device and storage medium for text clustering

Country Status (2)

Country Link
CN (1) CN110532388B (zh)
WO (1) WO2021027086A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114911939A (zh) * 2022-05-24 2022-08-16 腾讯科技(深圳)有限公司 Hotspot mining method and apparatus, electronic device, storage medium, and program product
CN117034905A (zh) * 2023-08-07 2023-11-10 重庆邮电大学 Big-data-based method for identifying fake news on the Internet

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597284B (zh) * 2021-03-08 2021-06-15 中邮消费金融有限公司 Company-name matching method and apparatus, computer device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778480A (zh) * 2015-05-08 2015-07-15 江南大学 Hierarchical spectral clustering method based on local density and geodesic distance
CN107273412A (zh) * 2017-05-04 2017-10-20 北京拓尔思信息技术股份有限公司 Method, apparatus and system for clustering text data
CN107451183A (zh) * 2017-06-19 2017-12-08 中国信息通信研究院 Knowledge-map construction method based on the idea of text clustering
US10129276B1 (en) * 2016-03-29 2018-11-13 EMC IP Holding Company LLC Methods and apparatus for identifying suspicious domains using common user clustering
CN109033200A (zh) * 2018-06-29 2018-12-18 北京百度网讯科技有限公司 Method, apparatus, device and computer-readable medium for event extraction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106599181B (zh) * 2016-12-13 2019-06-18 浙江网新恒天软件有限公司 Topic-model-based news hotspot detection method
US10489440B2 (en) * 2017-02-01 2019-11-26 Wipro Limited System and method of data cleansing for improved data classification

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104778480A (zh) * 2015-05-08 2015-07-15 江南大学 Hierarchical spectral clustering method based on local density and geodesic distance
US10129276B1 (en) * 2016-03-29 2018-11-13 EMC IP Holding Company LLC Methods and apparatus for identifying suspicious domains using common user clustering
CN107273412A (zh) * 2017-05-04 2017-10-20 北京拓尔思信息技术股份有限公司 Method, apparatus and system for clustering text data
CN107451183A (zh) * 2017-06-19 2017-12-08 中国信息通信研究院 Knowledge-map construction method based on the idea of text clustering
CN109033200A (zh) * 2018-06-29 2018-12-18 北京百度网讯科技有限公司 Method, apparatus, device and computer-readable medium for event extraction

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114911939A (zh) * 2022-05-24 2022-08-16 腾讯科技(深圳)有限公司 Hotspot mining method and apparatus, electronic device, storage medium, and program product
CN117034905A (zh) * 2023-08-07 2023-11-10 重庆邮电大学 Big-data-based method for identifying fake news on the Internet
CN117034905B (zh) * 2023-08-07 2024-05-14 重庆邮电大学 Big-data-based method for identifying fake news on the Internet

Also Published As

Publication number Publication date
CN110532388B (zh) 2022-07-01
CN110532388A (zh) 2019-12-03

Similar Documents

Publication Publication Date Title
WO2021027086A1 (zh) Method, device and storage medium for text clustering
Jiang et al. Relation extraction with multi-instance multi-label convolutional neural networks
US10671812B2 (en) Text classification using automatically generated seed data
Li et al. Distribution distance minimization for unsupervised user identity linkage
CN104750798B (zh) Method and apparatus for recommending an application program
CN103116588A (zh) Personalized recommendation method and system
CN105512277B (zh) Short-text clustering method for book titles in the book market
US20140032207A1 (en) Information Classification Based on Product Recognition
CN107229731B (zh) Method and apparatus for classifying data
CN105426426A (zh) KNN text classification method based on improved K-Medoids
Gadepally et al. Using a power law distribution to describe big data
Fitriyani et al. The K-means with mini batch algorithm for topics detection on online news
CN107357895B (zh) Processing method for text representation based on the bag-of-words model
CN112968872B (zh) Malicious traffic detection method, system and terminal based on natural language processing
Prasad et al. Glocal: Incorporating global information in local convolution for keyphrase extraction
Galhotra et al. Robust entity resolution using random graphs
WO2017092581A1 (zh) Method and device for sharing user data
Guo et al. Homophily-oriented heterogeneous graph rewiring
CN112906873A (zh) Graph neural network training method and apparatus, electronic device, and storage medium
CN116150125A (zh) Training method, apparatus, device and storage medium for a structured-data generation model
CN109299887B (zh) Data processing method, apparatus, and electronic device
CN108470035B (zh) Entity-citation relevance classification method based on a discriminative mixture model
KR20110115281A (ko) Data partitioning method for similarity search of high-dimensional data
CN110941645B (zh) Method, apparatus, storage medium and processor for automatically identifying related serial cases
CN107665443B (zh) Method and apparatus for obtaining target users

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19941347

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19941347

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.09.2022)
