CN110532388B - Text clustering method, equipment and storage medium

Info

Publication number: CN110532388B
Application number: CN201910753636.6A
Authority: CN
Inventors: 龚朝辉; 陈汝龙; 陈誉; 段成阁
Original assignee: Qichacha Technology Co ltd
Current assignee: Qichacha Technology Co ltd
Priority date: 2019-08-15
Filing date: 2019-08-15
Publication date: 2022-07-01
Anticipated expiration: 2039-08-15
Also published as: CN110532388A; WO2021027086A1

Abstract

The invention discloses a text clustering method, text clustering equipment and a storage medium. The method comprises the following steps: acquiring a list of text titles to be clustered; constructing an initial connected graph between the text titles, taking the text titles as vertexes and the distances between the vectorized text titles as edges; removing the edges of the initial connected graph that are larger than an initial distance threshold to obtain one or more sub-connected graphs; and calculating the aggregation degree of each sub-connected graph, wherein if the aggregation degree of a sub-connected graph is greater than or equal to a clustering threshold, the text set corresponding to that sub-connected graph is a text cluster. Compared with the prior art, the method clusters texts quickly and stably, and the clustering result for the same text data is the same on every run. When applied to enterprise-related news, the method can quickly and stably extract the hot news about an enterprise.

Description

Text clustering method, equipment and storage medium

Technical Field

The present invention relates to information processing technologies, and in particular, to a method, an apparatus, and a storage medium for text clustering.

Background

Text is a main carrier of information. With the development of the Internet, browsing news texts published online has become an important way for people to obtain information in a timely manner. The amount of news text on the network is now huge, and to let people navigate and browse news quickly and conveniently, news texts need to be clustered with a text clustering technique. Text clustering automatically divides a text set into several clusters, so that texts within the same cluster are similar to one another while the similarity between texts in different clusters is as low as possible. Commonly used clustering methods currently include K-means, hierarchical clustering, the Single-pass algorithm and the like.

However, the Single-pass algorithm depends on input order: the same objects, fed in a different order, produce different clustering results. Other clustering algorithms have their own problems: K-means requires the number of clusters to be specified in advance, and hierarchical clustering requires a level to be selected, so the clustering result changes with the specified number of clusters or the selected level.

Disclosure of Invention

The invention aims to provide a text clustering method, equipment and a storage medium.

In order to achieve one of the above objects, an embodiment of the present invention provides a method for clustering texts, including:

acquiring a text title list to be clustered;

constructing an initial connected graph between the text titles by taking the text titles as vertexes and taking the vectorized distance of the text titles as an edge;

removing edges of the initial connected graph, which are larger than an initial distance threshold value, to obtain one or more sub connected graphs;

and calculating the aggregation degree of each sub-connected graph, wherein if the aggregation degree of one sub-connected graph is greater than or equal to a clustering threshold value, the text set corresponding to the sub-connected graph is a text cluster.

As a further improvement of an embodiment of the present invention, the method further comprises:

S21, if the aggregation degree of a sub-connected graph is smaller than the clustering threshold, acquiring the current distance threshold of the sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold, to obtain one or more sub-connected graphs;

and S22, calculating the aggregation degree of each sub-connected graph, and repeating the steps S21-S22 until the aggregation degrees of all the sub-connected graphs are larger than or equal to a clustering threshold value, wherein a text set corresponding to each sub-connected graph larger than or equal to the clustering threshold value is a text cluster.

As a further improvement of an embodiment of the present invention, the aggregation degree of the sub-connected graph refers to the ratio of the clustering coefficient of the sub-connected graph to the maximum graph diameter.

As a further improvement of an embodiment of the present invention, the method for obtaining the "distance after vectorization of the text title" includes:

performing topic training on the text titles in the text title list to obtain a topic model;

vectorizing each text title by using the topic model to obtain a text title vector;

calculating the similarity between every two text title vectors;

calculating the distance between every two text title vectors.

As a further improvement of an embodiment of the present invention, the method further comprises:

taking the text represented by the vertex with the highest degree in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.

As a further improvement of an embodiment of the present invention, the method further comprises:

the text is news and the text cluster is a news cluster; the news items in the news cluster are sorted from newest to oldest by release time, the time interval between adjacent news items is calculated, the sum of the reciprocals of all the time intervals is taken as the heat of the news cluster, and a news cluster whose heat is greater than a heat threshold is defined as hot news.

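In order to achieve one of the above objects, an embodiment of the present invention provides a method for clustering texts, including:

acquiring a text title list to be clustered;

constructing an initial connected graph between the text titles by taking the text titles as vertexes and taking the vectorized distance of the text titles as an edge;

and calculating the aggregation degree of the initial connected graph, wherein if the aggregation degree of the initial connected graph is greater than or equal to a clustering threshold value, the text set corresponding to the initial connected graph is a text cluster.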

As a further improvement of an embodiment of the present invention, if the aggregation degree of the initial connected graph is smaller than the clustering threshold, removing an edge of the initial connected graph that is larger than the initial distance threshold to obtain one or more sub-connected graphs;

S41, if the aggregation degree of a sub-connected graph is smaller than the clustering threshold, acquiring the current distance threshold of the sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold, to obtain one or more sub-connected graphs;

and S42, calculating the aggregation degree of each sub-connected graph, and repeating the steps S41-S42 until the aggregation degrees of all the sub-connected graphs are larger than or equal to a clustering threshold value, wherein a text set corresponding to each sub-connected graph larger than or equal to the clustering threshold value is a text cluster.

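As a further improvement of an embodiment of the present invention, the aggregation degree of the initial connected graph refers to the ratio of the clustering coefficient of the initial connected graph to the maximum graph diameter.

As a further improvement of an embodiment of the present invention, the method for obtaining the "distance after vectorization of the text title" includes:

performing topic training on the text titles in the text title list to obtain a topic model;

vectorizing each text title by using the topic model to obtain a text title vector;

calculating the similarity between every two text title vectors;

calculating the distance between every two text title vectors.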

In order to achieve one of the above objects, an embodiment of the present invention provides an electronic device, which includes a memory and a processor, where the memory stores a computer program operable on the processor, and the processor implements the steps in any of the above methods for text clustering when executing the program.

To achieve one of the above objects, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps in any of the above methods for text clustering.

Compared with the prior art, the method clusters texts quickly and stably, and the clustering result for the same text data is the same on every run. When applied to enterprise-related news, the method can quickly and stably extract the hot news about an enterprise.

Drawings

FIG. 1 is a flow chart illustrating a method for clustering texts according to a first embodiment of the present invention;

FIG. 2 is an example of a connected graph;

FIG. 3 shows the sub-connected graphs obtained from FIG. 2 after edge removal;

FIG. 4 is a flow chart illustrating a method for clustering texts according to a second embodiment of the present invention.

Detailed Description

The present invention will be described in detail below with reference to specific embodiments shown in the drawings. These embodiments are not intended to limit the present invention, and structural, methodological, or functional changes made by those skilled in the art according to these embodiments are included in the scope of the present invention.

FIG. 1 shows a flowchart of a text clustering method according to a first embodiment of the present invention. In this embodiment, the relationships between texts are represented by a connected graph, and the connected graph is then split into different sub-connected graphs so as to cluster the texts. The method comprises the following steps:

Step S11: acquiring a text title list to be clustered.

The text title list may be a list of news titles associated with a particular enterprise, or a list of text titles of another type. Each text title represents one text.

Step S12: constructing an initial connected graph between the text titles by taking the text titles as vertexes and taking the vectorized distance of the text titles as an edge.

This step represents the relationships between the texts by a connected graph. In a preferred embodiment, the step includes:

Step S121: obtaining a topic model by performing topic training on the text titles in the list. The topic model can be obtained by training on the text titles in the list with a vectorization method such as TF-IDF, word2vec, LSI or LDA.

Step S122: vectorizing each text title by using the topic model to obtain a text title vector. The topic model yields a topic vector representation of each text title; that is, each text title is vectorized to obtain its text title vector.
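As an illustration of steps S121 and S122, the following is a minimal sketch that uses TF-IDF, one of the vectorization methods listed above. The function name `tfidf_vectors`, the whitespace tokenization and the smoothed IDF formula are assumptions made for this example; the description does not specify them, and Chinese titles would first need word segmentation.

```python
import math
from collections import Counter


def tfidf_vectors(titles):
    """Return one sparse TF-IDF vector (dict: term -> weight) per title."""
    # Assumption: titles are already segmented into whitespace-separated words.
    docs = [title.lower().split() for title in titles]
    n_docs = len(docs)
    # Document frequency: in how many titles does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency
            # (the smoothing is an assumption, chosen to avoid zero weights).
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors
```

A term shared by every title receives a lower weight than a term that occurs in only one title, so titles that differ in their distinctive words end up further apart.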

Step S123: calculating the similarity between every two text title vectors. The distance between two text title vectors can be calculated using the cosine distance, the Jaccard coefficient, the Euclidean distance or other measures. Taking the cosine distance as an example, the cosine similarity between the two text title vectors, i.e. the cosine of the angle between them, is calculated first. The cosine value lies in the range [-1, 1]: the closer it is to 1, the closer the directions of the two vectors; the closer it is to -1, the more opposite their directions; a value close to 0 means that the two vectors are nearly orthogonal.

Step S124: calculating the distance between every two text title vectors. Continuing with the cosine distance example, cosine distance = 1 - cosine similarity, with a value range of [0, 2]. The smaller the distance, the closer the directions of the two text title vectors, i.e. the more similar the two text titles.
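The cosine similarity and cosine distance of steps S123 and S124 can be sketched as follows. The sparse dict representation of a vector and the handling of an all-zero vector (similarity 0) are assumptions for this example only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dict: term -> weight)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty vector is treated as orthogonal
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    """Cosine distance = 1 - cosine similarity, so it lies in [0, 2]."""
    return 1.0 - cosine_similarity(u, v)
```

Vectors pointing the same way give a distance near 0, orthogonal vectors give 1, and opposite vectors give 2, which matches the value range stated in step S124.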

Step S125: constructing an initial connected graph. The text titles are used as vertexes (a vertex represents a text title and a text title represents a text, so one vertex represents one text), and the distances between the vectorized text titles are used as edges to construct an initial connected graph between the text titles. A connected graph (which here covers both the initial connected graph and the subsequent sub-connected graphs) is a graph in which a path exists between every two vertexes; see FIG. 2 for a specific example. The initial connected graph constructed in the invention is preferably a complete graph, i.e. every two vertexes are joined by an edge.
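A complete initial graph as described in step S125 can be represented as a mapping from vertex pairs to distances. The sketch below is illustrative only: `build_complete_graph` and the edge-dictionary layout are hypothetical names and choices, and the distance function is supplied by the caller.

```python
from itertools import combinations


def build_complete_graph(vectors, distance):
    """Return {(i, j): distance} for every pair of vertex indices with i < j.

    Vertex i stands for the i-th text title; `distance` is any function of
    two title vectors, for example a cosine distance.
    """
    return {
        (i, j): distance(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }


# Toy usage with one-dimensional "vectors" and absolute difference as a
# stand-in distance (not the cosine distance of the description).
toy_edges = build_complete_graph([0.0, 0.1, 0.9], lambda a, b: abs(a - b))
```

For n titles the graph has n(n-1)/2 edges, so the three toy vertexes above yield three edges.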

Step S13: removing the edges of the initial connected graph that are larger than the initial distance threshold to obtain one or more sub-connected graphs.

The length of an edge of the connected graph is the distance between its two vertexes and therefore reflects their similarity: the longer the edge, the lower the similarity, and the shorter the edge, the higher the similarity. Clustering proceeds by removing the edges with low similarity, and the initial distance threshold is preferably 0.4. Removing the edges of the initial connected graph that are larger than the initial distance threshold yields one or more sub-connected graphs (see FIG. 3, in which the connected graph of FIG. 2 has been split into two sub-connected graphs by edge removal).
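Step S13 can be sketched as an edge filter followed by a connected-component search. The function name `split_by_threshold` and the edge-dictionary layout `{(a, b): distance}` are assumptions for this example; the description does not prescribe a data structure.

```python
from collections import deque


def split_by_threshold(vertices, edges, threshold):
    """Drop edges longer than `threshold`; return (components, kept_edges)."""
    kept = {pair: d for pair, d in edges.items() if d <= threshold}
    adjacency = {v: set() for v in vertices}
    for a, b in kept:
        adjacency[a].add(b)
        adjacency[b].add(a)

    components, seen = [], set()
    for start in vertices:
        if start in seen:
            continue
        # Breadth-first search collects every vertex reachable from `start`.
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            v = queue.popleft()
            component.append(v)
            for neighbour in adjacency[v]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        components.append(sorted(component))
    return components, kept
```

With the preferred initial threshold of 0.4, a four-vertex graph whose only short edges are (0, 1) and (2, 3) splits into the two sub-connected graphs [0, 1] and [2, 3].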

Step S14: calculating the aggregation degree of each sub-connected graph, wherein if the aggregation degree of a sub-connected graph is greater than or equal to a clustering threshold, the text set corresponding to that sub-connected graph is a text cluster.

The aggregation degree of each sub-connected graph is calculated. If it is high, i.e. greater than or equal to the clustering threshold, the texts in the text set corresponding to the sub-connected graph are highly similar, and the text set is a text cluster.

Preferably, the aggregation degree of a connected graph is the ratio of its clustering coefficient to its maximum graph diameter. The clustering coefficient is a measure of how tightly the vertexes of a graph cluster together, and the maximum graph diameter (also called the tree diameter) is the longest of the shortest paths between any two vertexes of the graph. The larger the clustering coefficient, the more tightly the graph is connected; the larger the maximum graph diameter, the looser the graph. The ratio of the two balances these effects, and the larger the ratio, the better the graph clusters. The threshold on the aggregation degree is the clustering threshold, 0.09 by default. That is, when the ratio of the clustering coefficient of a connected graph to its maximum graph diameter is greater than or equal to the clustering threshold, the aggregation degree of the connected graph meets the requirement, the similarity of the text set corresponding to the connected graph reaches the clustering standard, and the text set can be grouped into one text cluster.
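The ratio described above can be sketched as follows. The description does not say which clustering coefficient is meant, so this example assumes the average local clustering coefficient; it also assumes the diameter is counted in hops on the unweighted graph and that a single-vertex graph has an aggregation degree of 0. The adjacency-dict layout is likewise an assumption.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adjacency):
    """Average local clustering coefficient (an assumed definition)."""
    total = 0.0
    for neighbours in adjacency.values():
        k = len(neighbours)
        if k < 2:
            continue  # vertexes with fewer than two neighbours contribute 0
        links = sum(1 for a, b in combinations(neighbours, 2) if b in adjacency[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adjacency)


def diameter(adjacency):
    """Longest shortest path, in hops, of a connected graph."""
    longest = 0
    for source in adjacency:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for neighbour in adjacency[v]:
                if neighbour not in dist:
                    dist[neighbour] = dist[v] + 1
                    queue.append(neighbour)
        longest = max(longest, max(dist.values()))
    return longest


def aggregation_degree(adjacency):
    """Clustering coefficient divided by the maximum graph diameter."""
    d = diameter(adjacency)
    if d == 0:
        return 0.0  # assumption: a single vertex has no aggregation
    return clustering_coefficient(adjacency) / d
```

Under these assumptions a complete graph on five vertexes has a clustering coefficient of 1.0 and a diameter of 1, so its aggregation degree is 1.0, well above the default clustering threshold of 0.09.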

The text clustering method of this embodiment clusters by means of a graph. It clusters texts quickly and stably, the clustering result for the same text data is the same on every run, and the clustering result is clear.

Preferably, the method further comprises:

After edge removal (i.e. removing the edges that do not meet the requirement) is performed on the initial connected graph, one or more first-level sub-connected graphs are obtained. If the aggregation degree of some sub-connected graph is smaller than the clustering threshold, i.e. the similarity of the text set corresponding to that sub-connected graph does not reach the clustering standard, the sub-connected graph needs to be split further. The splitting is again done by edge removal. Since no edge of the sub-connected graph at this point is larger than the initial distance threshold, the distance threshold needs to be decreased (by default in equal steps of 0.05). That is, for a first-level sub-connected graph obtained by edge removal on the initial connected graph, the current distance threshold is the distance threshold of the previous level (the initial distance threshold) minus the default step (0.4 - 0.05 = 0.35).

For each first-level sub-connected graph whose aggregation degree is smaller than the clustering threshold, the edges larger than the current distance threshold are removed to obtain second-level sub-connected graphs, and the aggregation degree of each second-level sub-connected graph is calculated. If the aggregation degree of some second-level sub-connected graph is smaller than the clustering threshold, its current distance threshold is calculated as the previous-level distance threshold minus the default step (0.35 - 0.05 = 0.3). The edges of that second-level sub-connected graph that are larger than this threshold are then removed to obtain third-level sub-connected graphs, whose aggregation degrees are calculated to decide whether further edge removal is needed.

This is repeated until the aggregation degrees of all the sub-connected graphs are greater than or equal to the clustering threshold; the text set corresponding to each sub-connected graph whose aggregation degree is greater than or equal to the clustering threshold is a text cluster.
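The level-by-level splitting can be sketched as a loop over pending sub-graphs, each carrying its own current distance threshold. Everything named here is hypothetical: the aggregation measure is passed in as a callable so that the loop can be read on its own, and `edge_density` is only a stand-in used to exercise the loop, not the clustering-coefficient to diameter ratio preferred by the description. The sketch also assumes that a sub-graph with no edges left is accepted as a cluster, since it cannot be split further.

```python
def connected_components(vertices, edges):
    """Connected components via union-find; `edges` is {(a, b): distance}."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in vertices:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())


def cluster(vertices, edges, aggregation, cluster_threshold=0.09,
            initial_distance=0.4, step=0.05):
    """Split by edge removal, lowering the distance threshold level by level."""
    clusters = []
    pending = [(list(vertices), dict(edges), initial_distance)]
    while pending:
        verts, graph_edges, threshold = pending.pop()
        kept = {p: d for p, d in graph_edges.items() if d <= threshold}
        for component in connected_components(verts, kept):
            members = set(component)
            sub_edges = {p: d for p, d in kept.items() if p[0] in members}
            # Assumption: a sub-graph without edges is a terminal cluster.
            if not sub_edges or aggregation(component, sub_edges) >= cluster_threshold:
                clusters.append(sorted(component))
            else:
                pending.append((component, sub_edges, threshold - step))
    return sorted(clusters)


def edge_density(vertices, edges):
    """Stand-in aggregation measure: fraction of possible edges present."""
    n = len(vertices)
    return len(edges) / (n * (n - 1) / 2)
```

Because the threshold drops by a fixed step at each level and distances are non-negative, every branch eventually runs out of edges, so the loop terminates. The result does not depend on the order in which titles are supplied, only on the distances.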

And finally, dividing a text title set represented by the text title list to be clustered into a plurality of text clusters.

Preferably, the method further comprises:

The degree of the vertex of the connected graph refers to the number of edges connected by the vertex, and the vertex with the highest degree in the sub-connected graph refers to the vertex with the most connected edges in the sub-connected graph. The text represented by the vertex with the highest degree is taken as the representative text of the text cluster, the keywords of the text cluster are extracted as the content of the text cluster, and the general situation of the text cluster can be quickly known through the representative text and the content of the text cluster.
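Choosing the representative text by vertex degree can be sketched as below. The function name `representative_title`, the adjacency-dict layout and the tie-breaking rule (lowest vertex index wins) are assumptions; the description does not say how ties are resolved, and keyword extraction is not covered here.

```python
def representative_title(adjacency, titles):
    """Return the title of the vertex with the most incident edges.

    `adjacency` maps each vertex index of one sub-connected graph to the set
    of its neighbours; `titles` is the full list of text titles.
    """
    # max() keeps the first maximum it meets, so iterating in sorted order
    # breaks ties in favour of the lowest vertex index (an assumption).
    best = max(sorted(adjacency), key=lambda v: len(adjacency[v]))
    return titles[best]
```

In a star-shaped sub-graph the centre vertex is returned, since it is connected to every other text in the cluster.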

Preferably, the method further comprises:

The text is news and the text cluster is a news cluster. The news items in the news cluster are sorted from newest to oldest by release time, the time interval between adjacent news items is calculated, the sum of the reciprocals of all the time intervals is taken as the heat of the news cluster, and a news cluster whose heat is greater than a heat threshold is defined as hot news.

The heat of news is related to how concentrated its outbreak is in time: the shorter the interval between two news items, the larger its reciprocal. The sum of the reciprocals of all time intervals is therefore used as the heat of the news cluster.
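The heat computation can be sketched as follows. The description does not state the time unit or what happens when two news items share a timestamp, so this example assumes hours as the unit (`unit_seconds=3600.0`) and skips zero-length intervals; `cluster_heat` is a hypothetical name.

```python
from datetime import datetime


def cluster_heat(timestamps, unit_seconds=3600.0):
    """Sum of reciprocals of the gaps between adjacent release times."""
    ordered = sorted(timestamps, reverse=True)  # newest to oldest
    heat = 0.0
    for newer, older in zip(ordered, ordered[1:]):
        gap = (newer - older).total_seconds() / unit_seconds
        if gap > 0:  # assumption: identical timestamps are skipped
            heat += 1.0 / gap
    return heat


# Three news items released 60 and 30 minutes apart.
example_heat = cluster_heat([
    datetime(2019, 7, 5, 12, 0),
    datetime(2019, 7, 5, 11, 0),
    datetime(2019, 7, 5, 10, 30),
])
```

Here the gaps are 1 hour and 0.5 hours, so the heat is 1/1 + 1/0.5 = 3. Halving every gap would double the heat, which is how the measure rewards news that breaks in a short burst.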

The present embodiment is further explained below with reference to a specific example.

The obtained list of text (news) titles to be clustered is as follows:

company A completes a new round of 1.1 billion investment

One science and technology company "company A" to C company 1.1 billion of exclusive strategy funding

Company A obtains 1.1 million dollar investment and is invested by company C in a unique strategy

Company A completes a new round of 1.1 billion investment

One line I A company completes a new round of 1.1 million yuan funding C company's exclusive strategic investment

Zhang Yi comment C company trample 10 hundred million sets of thunder: exposes the risk, which is a good matter

Zhang somebody Recall that "C company tramples 10 hundred million sets of thunder": has been encountered before

Zhang somebody Recall that "C company tramples 10 hundred million sets of thunder": previously encountered, but contra-rejection verification

Company C stepped on the Lei D company 10 billion fundamentals Roche core: whether company B participates in

The road ahead of 10 million D company, C company, is unclear

Company C steps 10 hundred million mines and remits responsibility to company B?

The final clustering result obtained by the present embodiment is:

## Group1 164.800618 (5) - 1.000000 (1) - A company C funding a new round of unique strategy investment

2019-07-09 14:56:00 Company A completed a new round of 1.1 billion funding

2019-07-05 18:32:09 Science and technology company "company A" to C company 1.1 billion-dollar exclusive strategy funding

2019-07-05 17:08:00 Company A acquired 2.5 billion funding, invested by company C in a unique strategy

2019-07-05 16:40:00 Company A completed a new round of 1.1 billion funding

2019-07-05 16:33:00 Yi I A company completes a new round of 1.1 hundred million yuan fund C company exclusive strategic investment

## Group2 111.744243 (6) - 0.550000 (2) - Zhang snare B company recall

2019-07-10 14:00:00 Ann company C10 hundred million wells exposed to risk, which is a good matter

2019-07-10 13:49:00 A memory of "C company stepping on thunder 10 hundred million snare": has been encountered before

2019-07-10 11:15:36 Memories of "C company stepping on thunder 10 hundred million snares": previously encountered, but contra-rejection verification

2019-07-09 22:51:00 C Trend 10 billion fundamentals Roche D core: whether company B participates in

2019-07-09 13:59:00 The future road view of the C mine-10 billion D company is unclear

2019-07-09 00:00:00 10 hundred million thunder was stepped on by C, and responsibility was offloaded to company B?

The results show that the two news hotspots are well separated. The first hotspot has a heat of 164.800618, 5 related news items, a clustering coefficient of 1.0, a maximum graph diameter of 1, and the keywords "company A C invests in a new round of exclusive strategy". The second hotspot has a heat of 111.744243, 6 related news items, a clustering coefficient of 0.55, a maximum graph diameter of 2, and the keywords "Zhang snare B company recall".
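As a quick check of the reported figures, the aggregation degree of each group can be recomputed from its clustering coefficient and maximum graph diameter and compared with the default clustering threshold of 0.09. The variable names are illustrative; the numbers are those stated in the example.

```python
CLUSTER_THRESHOLD = 0.09  # default clustering threshold from the description

# (clustering coefficient, maximum graph diameter) as reported per group.
reported = {"Group1": (1.0, 1), "Group2": (0.55, 2)}

# Aggregation degree = clustering coefficient / maximum graph diameter.
aggregation = {
    name: coefficient / diameter
    for name, (coefficient, diameter) in reported.items()
}

passes = {name: value >= CLUSTER_THRESHOLD for name, value in aggregation.items()}
```

Group1 yields 1.0 / 1 = 1.0 and Group2 yields 0.55 / 2 = 0.275; both exceed 0.09, which is consistent with both groups being accepted as text clusters.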

FIG. 4 shows a flowchart of a text clustering method according to a second embodiment of the present invention. The method includes:

Step S31: acquiring a text title list to be clustered;

Step S32: constructing an initial connected graph between the text titles by taking the text titles as vertexes and taking the vectorized distance of the text titles as an edge;

Step S33: calculating the aggregation degree of the initial connected graph, wherein if the aggregation degree of the initial connected graph is greater than or equal to a clustering threshold value, the text set corresponding to the initial connected graph is a text cluster.

The difference between this embodiment and the first embodiment is that the aggregation degree of the initial connected graph is also calculated, and if the aggregation degree of the initial connected graph is greater than or equal to the clustering threshold, the text set corresponding to the initial connected graph is a text cluster.

It should be noted that, if the aggregation degree of the initial connected graph is smaller than the clustering threshold, the method of the first embodiment is followed: edge removal is performed on the initial connected graph to obtain sub-connected graphs, the aggregation degree of each sub-connected graph is calculated, it is determined whether further edge removal is needed, and so on.

The invention also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the program to realize the steps in the text clustering method.

The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method of text clustering.

It should be understood that although the specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be appropriately combined to form other embodiments understandable to those skilled in the art.

The detailed description set out above is only a specific description of feasible embodiments of the present invention and is not intended to limit the scope of the present invention; equivalent embodiments or modifications made without departing from the technical spirit of the present invention shall be included in the scope of the present invention.

Claims

1. A method of text clustering, the method comprising:

acquiring a text title list to be clustered;

calculating the aggregation degree of each sub-connected graph, wherein if the aggregation degree of one sub-connected graph is larger than or equal to a clustering threshold value, a text set corresponding to the sub-connected graph is a text cluster;

the method further comprises the following steps:

2. The method of text clustering of claim 1 wherein:

the aggregation degree of the sub-connected graph refers to the ratio of the clustering coefficient of the sub-connected graph to the maximum graph diameter.

3. The method for clustering texts according to claim 1, wherein the method for obtaining distance after vectorization of text titles comprises:

calculating the similarity between every two text title vectors;

calculating the distance between every two text title vectors.

4. The method of text clustering according to claim 1, further comprising:

5. The method of text clustering according to claim 1, further comprising:

6. A method of text clustering, the method comprising:

acquiring a text title list to be clustered;

calculating the aggregation degree of the initial connected graph, wherein if the aggregation degree of the initial connected graph is greater than or equal to a clustering threshold value, a text set corresponding to the initial connected graph is a text cluster;

the method further comprises the following steps:

if the aggregation degree of the initial connected graph is smaller than the clustering threshold, removing the edge of the initial connected graph, which is larger than the initial distance threshold, to obtain one or more sub connected graphs;

the method further comprises the following steps:

S41, if the aggregation degree of a sub-connected graph is smaller than the clustering threshold, acquiring the current distance threshold of the sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold, to obtain one or more sub-connected graphs;

7. The method of text clustering of claim 6 wherein:

the aggregation degree of the initial connected graph refers to the ratio of the clustering coefficient of the initial connected graph to the maximum graph diameter.

8. The method for clustering text according to claim 6, wherein the method for obtaining the vectorized distance of the text title comprises:

calculating the similarity between every two text title vectors;

calculating the distance between every two text title vectors.

9. The method of text clustering according to claim 6, further comprising:

10. The method of text clustering according to claim 6, further comprising:

11. An electronic device comprising a memory and a processor, said memory storing a computer program operable on said processor, wherein said processor implements the steps in the method of text clustering according to any one of claims 1-10 when executing said program.

12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for text clustering according to any one of the claims 1 to 10.