CN110532388B - Text clustering method, equipment and storage medium - Google Patents

Publication number
CN110532388B
Authority
CN
China
Prior art keywords
text
sub
clustering
connected graph
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910753636.6A
Other languages
Chinese (zh)
Other versions
CN110532388A (en)
Inventor
龚朝辉
陈汝龙
陈誉
段成阁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qichacha Technology Co ltd
Original Assignee
Qichacha Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qichacha Technology Co ltd
Priority to CN201910753636.6A
Priority to PCT/CN2019/115118 (WO2021027086A1)
Publication of CN110532388A
Application granted
Publication of CN110532388B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text clustering method, a text clustering device, and a storage medium. The method comprises: acquiring a list of text titles to be clustered; constructing an initial connected graph over the text titles, with the text titles as vertices and the distances between the vectorized text titles as edges; removing the edges of the initial connected graph that are larger than an initial distance threshold to obtain one or more sub-connected graphs; and calculating the aggregation degree of each sub-connected graph, wherein, if the aggregation degree of a sub-connected graph is greater than or equal to a clustering threshold, the text set corresponding to that sub-connected graph is a text cluster. Compared with the prior art, the method clusters texts quickly and stably, and the same text data yields the same clustering result on every run. When applied to enterprise-related news, the method extracts an enterprise's hot news quickly and stably.

Description

Text clustering method, equipment and storage medium
Technical Field
The present invention relates to information processing technologies, and in particular, to a method, an apparatus, and a storage medium for text clustering.
Background
Text is a main carrier of information. With the development of the internet, browsing news texts published online has become an important way for people to obtain information in a timely manner. The volume of news text on the network is now enormous, and to let people navigate and browse news quickly and conveniently, news texts need to be grouped using text clustering technology. Text clustering automatically divides a text set into several clusters such that texts in the same cluster are similar to one another, while the similarity between texts in different clusters is as low as possible. Commonly used clustering methods include K-means, hierarchical clustering, and the Single-pass algorithm.
However, the Single-pass algorithm depends on input order: the same set of objects, input in different orders, produces different clustering results. Other clustering algorithms have their own problems: K-means requires the number of classes to be specified, and hierarchical clustering requires a level to be selected, so the clustering results are inconsistent for different specified numbers of classes or different selected levels.
Disclosure of Invention
The invention aims to provide a text clustering method, equipment and a storage medium.
In order to achieve one of the above objects, an embodiment of the present invention provides a method for clustering texts, including:
acquiring a text title list to be clustered;
constructing an initial connected graph between the text titles by taking the text titles as vertexes and taking the vectorized distance of the text titles as an edge;
removing edges of the initial connected graph, which are larger than an initial distance threshold value, to obtain one or more sub connected graphs;
and calculating the aggregation degree of each sub-connected graph, wherein if the aggregation degree of one sub-connected graph is greater than or equal to a clustering threshold value, the text set corresponding to the sub-connected graph is a text cluster.
As a further improvement of an embodiment of the present invention, the method further comprises:
S21: if the aggregation degree of a sub-connected graph is smaller than the clustering threshold, acquiring the current distance threshold of that sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold to obtain one or more sub-connected graphs;
S22: calculating the aggregation degree of each sub-connected graph, and repeating steps S21-S22 until the aggregation degrees of all sub-connected graphs are greater than or equal to the clustering threshold, wherein the text set corresponding to each sub-connected graph whose aggregation degree is greater than or equal to the clustering threshold is a text cluster.
As a further improvement of an embodiment of the present invention, the aggregation degree of a sub-connected graph refers to the ratio of the clustering coefficient of the sub-connected graph to its maximum graph diameter.
As a further improvement of an embodiment of the present invention, the method for obtaining the "distance after vectorization of the text title" includes:
performing topic training on the text titles in the text title list to obtain a topic model;
vectorizing each text title by using the topic model to obtain a text title vector;
calculating the similarity between every two text title vectors;
calculating the distance between every two text title vectors.
As a further improvement of an embodiment of the present invention, the method further comprises:
taking the text represented by the vertex with the highest degree in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
As a further improvement of an embodiment of the present invention, the method further comprises:
the text is news, the text cluster is a news cluster, the news in the news cluster are sorted from new to old according to the release time, the time interval between adjacent news is calculated, the sum of the reciprocals of all the time intervals is used as the heat of the news cluster, and the news cluster with the heat larger than the heat threshold value is defined as hot news.
In order to achieve one of the above objects, an embodiment of the present invention provides a method for clustering texts, including:
acquiring a text title list to be clustered;
constructing an initial connected graph between the text titles by taking the text titles as vertexes and taking the vectorized distance of the text titles as an edge;
and calculating the aggregation degree of the initial connected graph, wherein if the aggregation degree of the initial connected graph is greater than or equal to a clustering threshold value, the text set corresponding to the initial connected graph is a text cluster.
As a further improvement of an embodiment of the present invention, if the aggregation degree of the initial connected graph is smaller than the clustering threshold, removing an edge of the initial connected graph that is larger than the initial distance threshold to obtain one or more sub-connected graphs;
and calculating the aggregation degree of each sub-connected graph, wherein if the aggregation degree of one sub-connected graph is greater than or equal to a clustering threshold value, the text set corresponding to the sub-connected graph is a text cluster.
As a further improvement of an embodiment of the present invention, the method further comprises:
S41: if the aggregation degree of a sub-connected graph is smaller than the clustering threshold, acquiring the current distance threshold of that sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold to obtain one or more sub-connected graphs;
S42: calculating the aggregation degree of each sub-connected graph, and repeating steps S41-S42 until the aggregation degrees of all sub-connected graphs are greater than or equal to the clustering threshold, wherein the text set corresponding to each sub-connected graph whose aggregation degree is greater than or equal to the clustering threshold is a text cluster.
As a further improvement of an embodiment of the present invention, the aggregation degree of the initial connected graph refers to the ratio of the clustering coefficient of the initial connected graph to its maximum graph diameter.
As a further improvement of an embodiment of the present invention, the method for obtaining the "distance after vectorization of the text title" includes:
performing topic training on the text titles in the text title list to obtain a topic model;
vectorizing each text title by using the topic model to obtain a text title vector;
calculating the similarity between every two text title vectors;
calculating the distance between every two text title vectors.
As a further improvement of an embodiment of the present invention, the method further comprises:
taking the text represented by the vertex with the highest degree in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
As a further improvement of an embodiment of the present invention, the method further comprises:
the text is news, the text cluster is a news cluster, the news in the news cluster are sorted from new to old according to the release time, the time interval between adjacent news is calculated, the sum of the reciprocals of all the time intervals is used as the heat of the news cluster, and the news cluster with the heat larger than the heat threshold value is defined as hot news.
In order to achieve one of the above objects, an embodiment of the present invention provides an electronic device, which includes a memory and a processor, where the memory stores a computer program operable on the processor, and the processor implements the steps in any of the above methods for text clustering when executing the program.
To achieve one of the above objects, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps in any of the above methods for text clustering.
Compared with the prior art, the method can quickly and stably cluster the texts, and the clustering result of the same text data is consistent each time. Meanwhile, the method is used for clustering the news related to the enterprise, the hot news of the enterprise can be rapidly and stably extracted, and the method has a good effect on extracting the hot news related to the enterprise.
Drawings
FIG. 1 is a flow chart illustrating a method for clustering texts according to a first embodiment of the present invention;
FIG. 2 is an example of a connected graph;
FIG. 3 shows the sub-connected graphs obtained from FIG. 2 after edge removal;
FIG. 4 is a flow chart illustrating a method for clustering texts according to a second embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments shown in the drawings. These embodiments are not intended to limit the present invention, and structural, methodological, or functional changes made by those skilled in the art according to these embodiments are included in the scope of the present invention.
FIG. 1 is a flowchart of the text clustering method in the first embodiment of the present invention. In this embodiment, the relationships between texts are represented by a connected graph, and the connected graph is then broken up into different sub-connected graphs so as to cluster the texts. The method comprises the following steps:
Step S11: acquiring a list of text titles to be clustered.
The text title list may be a list of news headlines associated with a particular enterprise, or a list of other types of text titles. Each text title represents one text.
Step S12: constructing an initial connected graph over the text titles, with the text titles as vertices and the distances between the vectorized text titles as edges.
This step represents the relationships between the texts by a connected graph. In a preferred embodiment, the step includes:
Step S121: obtaining a topic model by performing topic training on the text titles in the list. The training may use a vectorization method such as TF-IDF, word2vec, LSI, or LDA.
Step S122: vectorizing each text title by using the topic model to obtain a text title vector. The topic model yields a topic vector representation of each text title; this representation is the text title vector.
Step S123: calculating the similarity between every two text title vectors. The distance between two text title vectors may be calculated using the cosine distance, the Jaccard coefficient, the Euclidean distance, or other measures. Taking the cosine distance as an example, the cosine similarity between the two text title vectors, i.e. the cosine of the angle between them, is calculated first. The cosine value lies in the range [-1, 1]: the closer the value is to 1, the closer the directions of the two vectors; the closer it is to -1, the more opposite their directions; a value close to 0 means that the two vectors are nearly orthogonal.
Step S124: calculating the distance between every two text title vectors. Continuing with the cosine example, the cosine distance equals 1 - cosine similarity and lies in the range [0, 2]; the smaller the distance, the closer the directions of the two text title vectors, i.e. the more similar the two text titles.
Step S125: constructing the initial connected graph. The text titles are used as vertices (a vertex represents a text title, and a text title represents a text, so each vertex represents one text), and the distances between the vectorized text titles are used as edges. A connected graph (the term covers both the initial connected graph and the subsequent sub-connected graphs) is a graph in which every two vertices are joined by a path; see FIG. 2 for an example. The initial connected graph constructed in the invention is preferably a complete graph, i.e. every two vertices are directly joined by an edge.
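Steps S121 to S125 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`vectorize`, `cosine_distance`, `build_complete_graph`) are hypothetical, and a plain bag-of-words count stands in for the trained topic model, which the text leaves open (TF-IDF, word2vec, LSI, LDA, and so on).

```python
import math
from collections import Counter
from itertools import combinations


def vectorize(title):
    """Hypothetical stand-in for the topic model: raw term counts per title."""
    return Counter(title.lower().split())


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: an empty title is orthogonal to everything
    return 1.0 - dot / (norm_u * norm_v)


def build_complete_graph(titles):
    """Map every vertex pair (i, j), i < j, to the distance between titles i and j."""
    vectors = [vectorize(t) for t in titles]
    return {
        (i, j): cosine_distance(vectors[i], vectors[j])
        for i, j in combinations(range(len(titles)), 2)
    }


titles = [
    "company a completes new funding round",
    "company a completes funding round",
    "company c loses money on company d",
]
graph = build_complete_graph(titles)
```

With count vectors the cosine similarity is never negative, so the distances here fall in [0, 1]; the full [0, 2] range only arises for vectors with negative components.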
Step S13: removing the edges of the initial connected graph that are larger than the initial distance threshold to obtain one or more sub-connected graphs.
The length of an edge of the connected graph reflects the similarity between its two vertices: the longer the edge, the lower the similarity, and the shorter the edge, the higher the similarity. Clustering proceeds by removing the edges with low similarity. The initial distance threshold is preferably 0.4. Removing the edges of the initial connected graph that are larger than the initial distance threshold yields one or more sub-connected graphs (see FIG. 3, in which edge removal applied to the connected graph of FIG. 2 yields two sub-connected graphs).
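Edge removal followed by extraction of the sub-connected graphs might look like the following sketch. The function names and the toy distances are assumptions made for illustration; only the rule "drop every edge larger than the threshold" and the preferred value 0.4 come from the text.

```python
def prune_edges(edges, threshold):
    """Keep only the edges whose distance does not exceed the threshold."""
    return {pair: dist for pair, dist in edges.items() if dist <= threshold}


def connected_components(vertices, edges):
    """Return the vertex sets of the sub-connected graphs, each as a sorted list."""
    adjacency = {v: set() for v in vertices}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, parts = set(), []
    for start in sorted(vertices):
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:  # depth-first walk from the start vertex
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adjacency[v] - part)
        seen |= part
        parts.append(sorted(part))
    return parts


# Toy complete graph on four vertices (distances are made up).
edges = {
    (0, 1): 0.1, (0, 2): 0.7, (0, 3): 0.9,
    (1, 2): 0.8, (1, 3): 0.6, (2, 3): 0.2,
}
kept = prune_edges(edges, threshold=0.4)
parts = connected_components(range(4), kept)
```

Here only the edges (0, 1) and (2, 3) survive the 0.4 threshold, so the graph falls apart into two sub-connected graphs.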
Step S14: calculating the aggregation degree of each sub-connected graph, wherein, if the aggregation degree of a sub-connected graph is greater than or equal to a clustering threshold, the text set corresponding to that sub-connected graph is a text cluster.
If the aggregation degree of a sub-connected graph is high, i.e. greater than or equal to the clustering threshold, the texts in the corresponding text set are highly similar and the set forms a text cluster.
Preferably, the aggregation degree of a connected graph is the ratio of the clustering coefficient of the connected graph to its maximum graph diameter. The clustering coefficient is an index that measures how strongly the vertices of a graph cluster together; the maximum graph diameter, also called the tree diameter, is the longest connecting path in the graph. The larger the clustering coefficient, the more tightly the graph is knit; the larger the maximum graph diameter, the looser the graph. The ratio of the two therefore gives a balanced measure of the degree of clustering, and the larger the ratio, the better clustered the graph. The threshold on the aggregation degree is the clustering threshold, whose default value is 0.09: when the ratio of the clustering coefficient of a connected graph to its maximum graph diameter is greater than or equal to the clustering threshold, the aggregation degree of the connected graph meets the requirement, the similarity of the corresponding text set reaches the clustering standard, and the text set can be assigned to one text cluster.
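The ratio can be computed as in the sketch below. The text does not say which clustering coefficient is meant or how the diameter is measured, so two assumptions are made here: the coefficient is the average local clustering coefficient, and the diameter is the longest shortest path counted in hops. Returning 0.0 for a single vertex is likewise an assumption.

```python
from collections import deque


def average_clustering(adjacency):
    """Average local clustering coefficient (assumed definition)."""
    total = 0.0
    for nbrs in adjacency.values():
        k = len(nbrs)
        if k < 2:
            continue  # a vertex with fewer than two neighbours contributes 0
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adjacency[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adjacency)


def diameter(adjacency):
    """Longest shortest path, in hops, of a connected graph (assumed definition)."""
    longest = 0
    for source in adjacency:
        dist, queue = {source: 0}, deque([source])
        while queue:  # breadth-first search from source
            v = queue.popleft()
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        longest = max(longest, max(dist.values()))
    return longest


def aggregation_degree(adjacency):
    """Clustering coefficient divided by graph diameter."""
    d = diameter(adjacency)
    if d == 0:
        return 0.0  # assumption: a single vertex has no aggregation
    return average_clustering(adjacency) / d


# Complete graph on five vertices: coefficient 1.0, diameter 1.
k5 = {v: {w for w in range(5) if w != v} for v in range(5)}
# Triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
kite = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
```

For `k5` the aggregation degree is 1.0; for `kite` the coefficient is 7/12 and the diameter is 2, giving 7/24, which is still above the default clustering threshold of 0.09.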
The text clustering method of this embodiment performs clustering on a graph. It clusters texts quickly and stably, the same text data yields the same clustering result on every run, and the clustering result is clear.
Preferably, the method further comprises:
S21: if the aggregation degree of a sub-connected graph is smaller than the clustering threshold, acquiring the current distance threshold of that sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold to obtain one or more sub-connected graphs;
S22: calculating the aggregation degree of each sub-connected graph, and repeating steps S21-S22 until the aggregation degrees of all sub-connected graphs are greater than or equal to the clustering threshold, wherein the text set corresponding to each sub-connected graph whose aggregation degree is greater than or equal to the clustering threshold is a text cluster.
After edge removal (i.e. removing the edges that do not meet the requirement) is applied to the initial connected graph, one or more first-level sub-connected graphs are obtained. If the aggregation degree of a sub-connected graph is smaller than the clustering threshold, i.e. the similarity of its text set does not reach the clustering standard, that sub-connected graph needs to be broken up further, again by edge removal. Since all edges of the sub-connected graph at this point are no larger than the initial distance threshold, the distance threshold needs to be decremented (by default in equal steps of 0.05). That is, for a first-level sub-connected graph obtained from the initial connected graph, the current distance threshold is the distance threshold of the previous level (the initial distance threshold) minus the default step: 0.4 - 0.05 = 0.35. For each first-level sub-connected graph whose aggregation degree is smaller than the clustering threshold, removing the edges larger than the current distance threshold yields second-level sub-connected graphs, whose aggregation degrees are then calculated. If the aggregation degree of a second-level sub-connected graph is smaller than the clustering threshold, its current distance threshold is again the previous-level distance threshold minus the default step (0.35 - 0.05 = 0.3); removing its edges larger than this threshold yields third-level sub-connected graphs, whose aggregation degrees are calculated to decide whether further edge removal is needed.
This process repeats until the aggregation degrees of all sub-connected graphs are greater than or equal to the clustering threshold; the text set corresponding to each sub-connected graph whose aggregation degree is greater than or equal to the clustering threshold is a text cluster.
Finally, the set of text titles represented by the text title list to be clustered is divided into several text clusters.
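The loop S21-S22 can be sketched as follows, using the default values quoted above (initial distance threshold 0.4, step 0.05, clustering threshold 0.09). Several points are assumptions rather than part of the described method: the clustering coefficient is taken as the average local coefficient, the diameter is counted in hops, and a sub-graph that can no longer be split (fewer than three vertices, or the threshold used up) is set aside as a leftover, since the text does not specify a stopping rule for such cases. All function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def neighbours(vertices, edges, threshold):
    """Adjacency over `vertices`, keeping edges no larger than the threshold."""
    adj = {v: set() for v in vertices}
    for (a, b), dist in edges.items():
        if a in adj and b in adj and dist <= threshold:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def components(adj):
    """Vertex sets of the sub-connected graphs, each as a sorted list."""
    seen, parts = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        part, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in part:
                part.add(v)
                stack.extend(adj[v] - part)
        seen |= part
        parts.append(sorted(part))
    return parts


def aggregation(adj):
    """Average local clustering coefficient divided by hop-count diameter."""
    if len(adj) < 2:
        return 0.0  # assumption: a single vertex has no aggregation
    coeff = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k >= 2:
            links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
            coeff += 2.0 * links / (k * (k - 1))
    coeff /= len(adj)
    diam = 0
    for src in adj:
        dist, queue = {src: 0}, deque([src])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        diam = max(diam, max(dist.values()))
    return coeff / diam


def cluster(vertices, edges, dist_threshold=0.4, step=0.05, cluster_threshold=0.09):
    """Split until every sub-graph passes the clustering threshold or cannot be split."""
    clusters, leftovers = [], []
    pending = [(sorted(vertices), dist_threshold)]
    while pending:
        group, threshold = pending.pop()
        adj = neighbours(group, edges, threshold)
        for part in components(adj):
            sub = {v: adj[v] for v in part}
            if aggregation(sub) >= cluster_threshold:
                clusters.append(part)
            elif len(part) < 3 or threshold - step <= 0:
                leftovers.append(part)  # assumed stopping rule
            else:
                pending.append((part, threshold - step))  # S21 with a lower threshold
    return sorted(clusters), sorted(leftovers)


# Toy data: a tight triangle 0-1-2 with a loose chain 2-3-4-5-6 attached.
vertices = range(7)
edges = {pair: 0.9 for pair in combinations(vertices, 2)}
edges.update({(0, 1): 0.1, (0, 2): 0.1, (1, 2): 0.1})
edges.update({(2, 3): 0.38, (3, 4): 0.38, (4, 5): 0.38, (5, 6): 0.38})
clusters, leftovers = cluster(vertices, edges)
```

At threshold 0.4 the seven vertices form one sub-connected graph whose aggregation degree is about 0.067, below 0.09. Lowering the threshold to 0.35 cuts the chain, leaving the triangle as a cluster and the chain vertices as singletons.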
Preferably, the method further comprises:
taking the text represented by the vertex with the highest degree in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
The degree of the vertex of the connected graph refers to the number of edges connected by the vertex, and the vertex with the highest degree in the sub-connected graph refers to the vertex with the most connected edges in the sub-connected graph. The text represented by the vertex with the highest degree is taken as the representative text of the text cluster, the keywords of the text cluster are extracted as the content of the text cluster, and the general situation of the text cluster can be quickly known through the representative text and the content of the text cluster.
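Selecting the representative text reduces to finding the vertex with the most edges. In the sketch below the function name and the sample data are hypothetical, and breaking ties in favour of the lowest vertex index is an assumption; keyword extraction is not shown because the text does not name a method for it.

```python
def representative(titles, adjacency):
    """Return the title whose vertex has the highest degree in the sub-graph."""
    # sorted() makes the tie-break deterministic: lowest index wins (assumption).
    best = max(sorted(adjacency), key=lambda v: len(adjacency[v]))
    return titles[best]


titles = ["title zero", "title one", "title two", "title three"]
adjacency = {0: {1}, 1: {0, 2, 3}, 2: {1, 3}, 3: {1, 2}}
chosen = representative(titles, adjacency)
```

Vertex 1 has three edges, more than any other vertex, so its title is chosen.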
Preferably, the method further comprises:
the text is news, the text cluster is a news cluster, the news in the news cluster are sorted from new to old according to the release time, the time interval between adjacent news is calculated, the sum of the inverses of all the time intervals is used as the heat of the news cluster, and the news cluster with the heat larger than the heat threshold value is defined as hot news.
Since the popularity of news is related to the concentration of news outbreaks, the sum of the reciprocals of all time intervals is taken as the popularity of the news cluster, and the news cluster with the popularity greater than the popularity threshold is defined as hot news.
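The heat calculation can be sketched as below. The unit of the time intervals is not stated in the text, so measuring them in hours (`unit_seconds=3600.0`) is an assumption, as is skipping zero-length intervals to avoid division by zero; the function name is hypothetical.

```python
from datetime import datetime


def heat(timestamps, unit_seconds=3600.0):
    """Sum of reciprocals of the gaps between adjacent release times."""
    ordered = sorted(timestamps, reverse=True)  # newest first, as in the text
    total = 0.0
    for newer, older in zip(ordered, ordered[1:]):
        gap = (newer - older).total_seconds() / unit_seconds
        if gap > 0:  # assumption: identical timestamps are skipped
            total += 1.0 / gap
    return total


times = [
    datetime(2019, 7, 9, 10, 0, 0),
    datetime(2019, 7, 9, 10, 30, 0),
    datetime(2019, 7, 9, 12, 30, 0),
]
score = heat(times)
```

The two gaps are 2 hours and 0.5 hours, so the heat is 1/2 + 1/0.5 = 2.5. News items released in quick succession produce small gaps and therefore a large heat.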
This embodiment is further explained below with reference to a specific example.
The obtained list of text (news) titles to be clustered is as follows:
company A completes a new round of 1.1 billion investment
One science and technology company "company A" to C company 1.1 billion of exclusive strategy funding
Company A obtains 1.1 million dollar investment and is invested by company C in a unique strategy
Company A completes a new round of 1.1 billion investment
One line I A company completes a new round of 1.1 million yuan funding C company's exclusive strategic investment
Zhang Yi comment C company trample 10 hundred million sets of thunder: exposes the risk, which is a good matter
Zhang somebody Recall that "C company tramples 10 hundred million sets of thunder": has been encountered before
Zhang somebody Recall that "C company tramples 10 hundred million sets of thunder": previously encountered, but contra-rejection verification
Company C stepped on the Lei D company 10 billion fundamentals Roche core: whether company B participates in
The road ahead of 10 million D company, C company, is unclear
Company C steps 10 hundred million mines and remits responsibility to company B?
The final clustering result obtained by the present embodiment is:
## Group1 164.800618 (5) - 1.000000 (1) - A company C funding a new round of unique strategy investment
20190709 14:56:00 Company A completed a new round of 1.1 billion funding
20190705 18:32:09 science and technology company "company A" to C company 1.1 billion-dollar exclusive strategy funding
20190705 17:08:00 Company A acquired 2.5 billion funding, invested by company C in a unique strategy
20190705 16:40:00 Company A completed a new round of 1.1 billion funding
20190705 16:33:00 Yi I A company completes a new round of 1.1 hundred million yuan fund C company exclusive strategic investment
## Group2 111.744243 (6) - 0.550000 (2) - Zhangyi snare B company recall
20190710 14:00:00 Ann company C10 hundred million wells exposed to risk, which is a good matter
20190710 13:49:00 A memory of "C company stepping on thunder 10 hundred million snare": has been encountered before
20190710 11:15:36 memories of "C company stepping on thunder 10 hundred million snares": previously encountered, but contra-rejection verification
20190709 22:51:00 C Trend 10 billion fundamentals Roche D core: whether company B participates in
20190709 13:59:00 The future road view of the C mine-10 billion D company is unclear
20190709 00:00:00 10 hundred million thunder was stepped on by C, and responsibility was offloaded to company B?
From the results, it can be seen that two news hotspots are well separated, the first hotspot has a popularity of 164.800618, 5 related news, a clustering coefficient of 1.0, a maximum graph diameter of 1, and a keyword "company a C invests in a new round of exclusive strategy". The second hotspot has a heat of 111.744243, 6 related news items, a clustering coefficient of 0.55, a maximum graph diameter of 2, and a keyword "zhangyi snare B company recall".
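Using the figures reported for the two hotspots and the default clustering threshold of 0.09 given earlier, the aggregation degrees work out as follows (the symbols A_1 and A_2 are introduced here only for this check):

```latex
A_1 = \frac{1.0}{1} = 1.0 \ge 0.09, \qquad
A_2 = \frac{0.55}{2} = 0.275 \ge 0.09
```

Both sub-connected graphs therefore meet the clustering threshold and are accepted as text clusters.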
FIG. 4 is a flowchart of the text clustering method according to the second embodiment of the present invention. The method includes:
Step S31: acquiring a list of text titles to be clustered;
Step S32: constructing an initial connected graph over the text titles, with the text titles as vertices and the distances between the vectorized text titles as edges;
Step S33: calculating the aggregation degree of the initial connected graph, wherein, if the aggregation degree of the initial connected graph is greater than or equal to a clustering threshold, the text set corresponding to the initial connected graph is a text cluster.
The difference between this embodiment and the first embodiment is that the aggregation degree of the initial connected graph is also calculated; if it is greater than or equal to the clustering threshold, the text set corresponding to the initial connected graph is a text cluster.
It should be noted that, if the aggregation degree of the initial connected graph is smaller than the clustering threshold, the method of the first embodiment is followed: edge removal is applied to the initial connected graph to obtain sub-connected graphs, the aggregation degree of each sub-connected graph is calculated to decide whether further edge removal is required, and so on.
The invention also provides an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the program to realize the steps in the text clustering method.
The invention also provides a computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method of text clustering.
It should be understood that, although this specification is described in terms of embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions of the embodiments may be appropriately combined to form other embodiments that those skilled in the art can understand.
The above-listed detailed description is only a specific description of a possible embodiment of the present invention, and they are not intended to limit the scope of the present invention, and equivalent embodiments or modifications made without departing from the technical spirit of the present invention should be included in the scope of the present invention.

Claims (12)

1. A method of text clustering, the method comprising:
acquiring a text title list to be clustered;
constructing an initial connected graph between the text titles by taking the text titles as vertexes and taking the vectorized distance of the text titles as an edge;
removing edges of the initial connected graph, which are larger than an initial distance threshold value, to obtain one or more sub connected graphs;
calculating the aggregation degree of each sub-connected graph, wherein if the aggregation degree of one sub-connected graph is larger than or equal to a clustering threshold value, a text set corresponding to the sub-connected graph is a text cluster;
the method further comprises the following steps:
S21: if the aggregation degree of a sub-connected graph is smaller than the clustering threshold, acquiring the current distance threshold of that sub-connected graph, and removing the edges of the sub-connected graph that are larger than the current distance threshold to obtain one or more sub-connected graphs;
S22: calculating the aggregation degree of each sub-connected graph, and repeating steps S21-S22 until the aggregation degrees of all sub-connected graphs are greater than or equal to the clustering threshold, wherein the text set corresponding to each sub-connected graph whose aggregation degree is greater than or equal to the clustering threshold is a text cluster.
2. The method of text clustering of claim 1 wherein:
the aggregation degree of the sub-connected graph refers to the ratio of the clustering coefficient of the sub-connected graph to the maximum graph diameter.
3. The method of text clustering according to claim 1, wherein obtaining the distance between the vectorized text titles comprises:
performing topic training on the text titles in the text title list to obtain a topic model;
vectorizing each text title by using the topic model to obtain a text title vector;
calculating the similarity between every two text title vectors;
calculating the distance between the two text title vectors.
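The last two steps of claim 3 can be sketched in Python as follows. The claim names neither the similarity measure nor the mapping from similarity to distance. This sketch assumes cosine similarity and a distance of one minus the similarity. It also assumes the topic vectors have already been produced by some topic model, so the training step is omitted. The function names and the sample vectors are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def distance_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise distances, assuming distance = 1 - cosine similarity."""
    n = len(vectors)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - cosine_similarity(vectors[i], vectors[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical topic vectors for three titles over two topics.
title_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
dist = distance_matrix(title_vectors)
print(dist[0][1], dist[0][2])  # 0.0 1.0
```

The resulting symmetric matrix is the kind of input from which the edge weights of the initial connected graph could be taken.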
4. The method of text clustering according to claim 1, further comprising:
taking the text represented by the highest vertex in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
5. The method of text clustering according to claim 1, wherein:
the text is news and the text cluster is a news cluster; the news items in the news cluster are sorted from newest to oldest by release time; the time interval between adjacent news items is calculated; the sum of the reciprocals of all the time intervals is taken as the heat of the news cluster; and a news cluster whose heat is greater than a heat threshold is defined as hot news.
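The heat calculation in claim 5 is simple arithmetic and can be sketched in Python as below. The claim does not state the time unit or what to do when two items share a release time. This sketch assumes intervals measured in hours and clamps each interval to a hypothetical minimum (`min_gap_hours`) so that the reciprocal stays finite. The function name and the sample timestamps are illustrative only.

```python
from datetime import datetime
from typing import Iterable


def news_heat(release_times: Iterable[datetime],
              min_gap_hours: float = 1.0 / 60.0) -> float:
    """Sum of reciprocals of the gaps between adjacent release times."""
    ordered = sorted(release_times, reverse=True)  # newest to oldest
    heat = 0.0
    for newer, older in zip(ordered, ordered[1:]):
        gap_hours = (newer - older).total_seconds() / 3600.0
        # Assumed handling of identical timestamps: clamp to a minimum gap.
        heat += 1.0 / max(gap_hours, min_gap_hours)
    return heat


times = [
    datetime(2019, 8, 15, 9, 0),
    datetime(2019, 8, 15, 12, 0),
    datetime(2019, 8, 15, 10, 0),
]
heat = news_heat(times)
print(heat)  # 1.5  (gaps of 2 h and 1 h give 1/2 + 1/1)

heat_threshold = 1.0  # hypothetical value
print(heat > heat_threshold)  # True, so this cluster would count as hot news
```

Because each short gap contributes a large reciprocal, a burst of closely spaced reports raises the heat much more than the same number of reports spread over days.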
6. A method of text clustering, the method comprising:
acquiring a text title list to be clustered;
constructing an initial connected graph over the text titles, with the text titles as vertices and the distances between the vectorized text titles as edges;
calculating the aggregation degree of the initial connected graph, wherein, if the aggregation degree of the initial connected graph is greater than or equal to a clustering threshold, the text set corresponding to the initial connected graph is a text cluster;
the method further comprising the following steps:
if the aggregation degree of the initial connected graph is less than the clustering threshold, removing the edges of the initial connected graph whose distance is greater than an initial distance threshold, to obtain one or more sub-connected graphs;
calculating the aggregation degree of each sub-connected graph, wherein, if the aggregation degree of a sub-connected graph is greater than or equal to the clustering threshold, the text set corresponding to that sub-connected graph is a text cluster;
the method further comprising the following steps:
S41: if the aggregation degree of a sub-connected graph is less than the clustering threshold, acquiring the current distance threshold of that sub-connected graph and removing the edges of the sub-connected graph whose distance is greater than the current distance threshold, to obtain one or more sub-connected graphs;
S42: calculating the aggregation degree of each sub-connected graph, and repeating steps S41-S42 until the aggregation degrees of all sub-connected graphs are greater than or equal to the clustering threshold, wherein the text set corresponding to each sub-connected graph whose aggregation degree is greater than or equal to the clustering threshold is a text cluster.
7. The method of text clustering according to claim 6, wherein:
the aggregation degree of the initial connected graph is the ratio of the clustering coefficient of the initial connected graph to its maximum graph diameter.
8. The method of text clustering according to claim 6, wherein obtaining the distance between the vectorized text titles comprises:
performing topic training on the text titles in the text title list to obtain a topic model;
vectorizing each text title by using the topic model to obtain a text title vector;
calculating the similarity between every two text title vectors;
calculating the distance between the two text title vectors.
9. The method of text clustering according to claim 6, further comprising:
taking the text represented by the highest vertex in the sub-connected graph corresponding to the text cluster as the representative text of the text cluster, and extracting the keywords of the text cluster as the content of the text cluster.
10. The method of text clustering according to claim 6, wherein:
the text is news and the text cluster is a news cluster; the news items in the news cluster are sorted from newest to oldest by release time; the time interval between adjacent news items is calculated; the sum of the reciprocals of all the time intervals is taken as the heat of the news cluster; and a news cluster whose heat is greater than a heat threshold is defined as hot news.
11. An electronic device comprising a memory and a processor, said memory storing a computer program operable on said processor, wherein said processor implements the steps in the method of text clustering according to any one of claims 1-10 when executing said program.
12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method of text clustering according to any one of claims 1 to 10.
CN201910753636.6A 2019-08-15 2019-08-15 Text clustering method, equipment and storage medium Active CN110532388B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910753636.6A CN110532388B (en) 2019-08-15 2019-08-15 Text clustering method, equipment and storage medium
PCT/CN2019/115118 WO2021027086A1 (en) 2019-08-15 2019-11-01 Text clustering method, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910753636.6A CN110532388B (en) 2019-08-15 2019-08-15 Text clustering method, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110532388A (en) 2019-12-03
CN110532388B (en) 2022-07-01

Family

ID=68663389

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910753636.6A Active CN110532388B (en) 2019-08-15 2019-08-15 Text clustering method, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN110532388B (en)
WO (1) WO2021027086A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597284B (en) * 2021-03-08 2021-06-15 中邮消费金融有限公司 Company name matching method and device, computer equipment and storage medium
CN113963221B (en) * 2021-09-17 2024-07-02 深圳云天励飞技术股份有限公司 Image clustering method and device, computer equipment and readable storage medium
CN114911939B (en) * 2022-05-24 2024-08-02 腾讯科技(深圳)有限公司 Hot spot mining method, hot spot mining device, electronic equipment, storage medium and program product
CN117034905B (en) * 2023-08-07 2024-05-14 重庆邮电大学 Internet false news identification method based on big data

Citations (3)

Publication number Priority date Publication date Assignee Title
CN106599181A (en) * 2016-12-13 2017-04-26 浙江网新恒天软件有限公司 Hot news detecting method based on topic model
CN107273412A (en) * 2017-05-04 2017-10-20 北京拓尔思信息技术股份有限公司 A kind of clustering method of text data, device and system
CN107451183A (en) * 2017-06-19 2017-12-08 中国信息通信研究院 Knowledge Map construction method based on text cluster thought

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN104778480A (en) * 2015-05-08 2015-07-15 江南大学 Hierarchical spectral clustering method based on local density and geodesic distance
US10129276B1 (en) * 2016-03-29 2018-11-13 EMC IP Holding Company LLC Methods and apparatus for identifying suspicious domains using common user clustering
US10489440B2 (en) * 2017-02-01 2019-11-26 Wipro Limited System and method of data cleansing for improved data classification
CN109033200B (en) * 2018-06-29 2021-03-02 北京百度网讯科技有限公司 Event extraction method, device, equipment and computer readable medium

Also Published As

Publication number Publication date
CN110532388A (en) 2019-12-03
WO2021027086A1 (en) 2021-02-18

Similar Documents

Publication Publication Date Title
CN110532388B (en) Text clustering method, equipment and storage medium
CN109189991B (en) Duplicate video identification method, device, terminal and computer readable storage medium
CN107644010B (en) Text similarity calculation method and device
US10423648B2 (en) Method, system, and computer readable medium for interest tag recommendation
CN104750798B (en) Recommendation method and device for application program
Rekabsaz et al. Exploration of a threshold for similarity based on uncertainty in word embedding
CN110750704B (en) Method and device for automatically completing query
CN105512277B (en) A short text clustering method for book market titles
CN105426426A (en) KNN text classification method based on improved K-Medoids
CN115630640B (en) Intelligent writing method, device, equipment and medium
CN112256822A (en) Text search method and device, computer equipment and storage medium
CN108664512B (en) Text object classification method and device
CN112000783B (en) Patent recommendation method, device and equipment based on text similarity analysis and storage medium
CA3059929A1 (en) Text searching method, apparatus, and non-transitory computer-readable storage medium
US8832015B2 (en) Fast binary rule extraction for large scale text data
CN106557777A (en) An improved K-means clustering method based on SimHash
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
CN107861945A (en) Finance data analysis method, application server and computer-readable recording medium
CN110837555A (en) Method, equipment and storage medium for removing duplicate and screening of massive texts
CN108470035B (en) Entity-quotation correlation classification method based on discriminant hybrid model
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN107766419B (en) Threshold denoising-based TextRank document summarization method and device
JP6426074B2 (en) Related document search device, model creation device, method and program thereof
CN113656575B (en) Training data generation method and device, electronic equipment and readable medium
CN110457455B (en) Ternary logic question-answer consultation optimization method, system, medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 503, 5th floor, C1 Building, 88 Dongchang Road, Suzhou Industrial Park, Jiangsu Province, 215000

Applicant after: Qicha Technology Co.,Ltd.

Address before: Room 503, 5th floor, C1 Building, 88 Dongchang Road, Suzhou Industrial Park, Jiangsu Province, 215000

Applicant before: SUZHOU LANGDONG NET TEC Co.,Ltd.

GR01 Patent grant
CP03 Change of name, title or address

Address after: No. 8 Huizhi Street, Suzhou Industrial Park, Suzhou Area, China (Jiangsu) Pilot Free Trade Zone, Suzhou City, Jiangsu Province, 215000

Patentee after: Qichacha Technology Co.,Ltd.

Address before: Room 503, 5th floor, C1 Building, 88 Dongchang Road, Suzhou Industrial Park, Jiangsu Province, 215000

Patentee before: Qicha Technology Co.,Ltd.
