CN112256842B - Method, electronic device and storage medium for text clustering

Publication number
CN112256842B (application number CN202011491126.5A)
Authority
CN
China
Prior art keywords
text, texts, clusters, density, word
Prior art date
2020-12-17
Legal status
Active
Application number
CN202011491126.5A
Other languages
Chinese (zh)
Other versions
CN112256842A (en)
Inventor
尹扬
郭鹏华
Current Assignee
Shanghai Suntime Information Technology Co., Ltd.
Original Assignee
Shanghai Suntime Information Technology Co., Ltd.
Priority date / Filing date
2020-12-17
Publication date
2021-03-26
Application filed by Shanghai Suntime Information Technology Co., Ltd.
Priority to CN202011491126.5A
Publication of CN112256842A (2021-01-22)
Application granted; publication of CN112256842B (2021-03-26)
PCT application PCT/CN2021/087169 (published as WO2022126944A1)

Classifications

    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F40/216 Parsing using statistical methods
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present disclosure relate to a method, an electronic device, and a storage medium for text clustering, in the field of information processing. According to the method, the word frequency-inverse document frequency of each word in a plurality of first texts is determined; entity identifications are removed from a plurality of text titles in the plurality of first texts to generate a plurality of non-entity titles; a plurality of first feature representations associated with the non-entity titles are determined based on the word frequency-inverse document frequency; the first texts are density clustered based on the first feature representations and a first density radius to generate a plurality of first text clusters and a plurality of unclustered second texts; a plurality of second feature representations associated with the second texts are determined based on the word frequency-inverse document frequency; and the second texts are density clustered based on the second feature representations and a second density radius, larger than the first density radius, to generate a plurality of second text clusters. Multi-level text clustering is thereby realized.

Description

Method, electronic device and storage medium for text clustering
Technical Field
Embodiments of the present disclosure relate generally to the field of information processing, and more particularly, to a method, an electronic device, and a computer storage medium for text clustering.
Background
Nowadays, readers can access a massive number of articles (news reports, research reports, and the like) on the network every day, but the following problems exist: 1. Because the number of articles is huge, a reader cannot read them all in a short time even if reading only the titles, and therefore cannot quickly capture the articles that are interesting or valuable to him or her among the mass of articles. 2. When a reader is interested in a particular topic or event, he or she may wish to see articles about the topic from different angles, or to see the context of the event as it progresses over time, so as to gain a more thorough and complete understanding of the topic or event. In reality, however, such articles are scattered disorderly among a large number of other articles and cannot be presented collectively in the form the reader desires. 3. While reading, a reader constantly encounters large numbers of identical or similar articles, which wastes the reader's time and energy.
Since the topics of daily news are unpredictable and not fixed, common clustering models such as K-Means clustering, which require the number of clusters K to be specified in advance, cannot be applied. For a supervised text classification model, not only do the classes of texts need to be specified in advance, but manually labeled training data is also needed for machine learning training; these prerequisites are impossible to satisfy for the unknown mass of news emerging every day.
Disclosure of Invention
A method, an electronic device, and a computer storage medium for text clustering are provided, which can implement multi-level text clustering.
According to a first aspect of the present disclosure, a method for text clustering is provided. The method comprises the following steps: determining the word frequency-inverse document frequency of each word in a plurality of first texts to be clustered based on a text library; removing entity identifications from a plurality of text titles in the plurality of first texts to generate a plurality of non-entity titles; determining a plurality of first feature representations associated with the plurality of non-entity titles based on the word frequency-inverse document frequency of each word in the plurality of non-entity titles and a bag of words model; density clustering the plurality of first texts based on the plurality of first feature representations and a first density radius to generate a plurality of first text clusters and a plurality of unclustered second texts; determining a plurality of second feature representations associated with the plurality of second texts based on the word frequency-inverse document frequency of each word in the plurality of second texts and the bag of words model; and density clustering the plurality of second texts based on the plurality of second feature representations and a second density radius to generate a plurality of second text clusters, wherein the second density radius is larger than the first density radius.
According to a second aspect of the present disclosure, an electronic device is provided. The electronic device includes: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method according to the first aspect.
In a third aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements a method according to the first aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements.
FIG. 1 is a schematic diagram of an information handling environment 100 according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a method 200 for text clustering according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a method 300 for cluster segmentation according to an embodiment of the present disclosure.
Fig. 4 is a schematic diagram of a method 400 for generating cluster titles according to an embodiment of the present disclosure.
FIG. 5 is a block diagram of an electronic device for implementing a method for text clustering in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
As described above, the conventional scheme requires pre-specifying categories and manual training, and is not suitable for the mass of news emerging every day. On the other hand, if conventional density clustering is used to cluster news texts, it suffers from the limitation of a single dimension and granularity, and the actual clustering effect is not ideal.
To address, at least in part, one or more of the above problems, as well as other potential problems, example embodiments of the present disclosure propose a scheme for text clustering. In the scheme, the word frequency-inverse document frequency of each word in a plurality of first texts to be clustered is determined based on a text library; entity identifications are removed from a plurality of text titles in the plurality of first texts to generate a plurality of non-entity titles; a plurality of first feature representations associated with the plurality of non-entity titles are determined based on the word frequency-inverse document frequency of each word in the plurality of non-entity titles and a bag of words model; the plurality of first texts are density clustered based on the plurality of first feature representations and a first density radius to generate a plurality of first text clusters and a plurality of unclustered second texts; a plurality of second feature representations associated with the plurality of second texts are determined based on the word frequency-inverse document frequency of each word in the plurality of second texts and the bag of words model; and the plurality of second texts are density clustered based on the plurality of second feature representations and a second density radius, larger than the first density radius, to generate a plurality of second text clusters. In this way, multi-level text clustering can be achieved.
Hereinafter, specific examples of the present scheme will be described in more detail with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of an example of an information processing environment 100, according to an embodiment of the present disclosure. The information processing environment 100 may include a computing device 110, a plurality of first texts 120-1 through 120-n (collectively 120) to be clustered, a text library 130, and a clustering result 140 of the plurality of first texts 120.
With respect to the computing device 110, it includes, for example and without limitation, a personal computer, desktop computer, server computer, multiprocessor system, mainframe computer, distributed computing environment including any of the above systems or devices, and the like. In some embodiments, the computing device 110 may have one or more processing units, including special-purpose processing units such as graphics processing units (GPUs), field programmable gate arrays (FPGAs), and application-specific integrated circuits (ASICs), and general-purpose processing units such as central processing units (CPUs).
As to each of the plurality of first texts 120, it includes, for example, a text title and a text body. The plurality of first texts 120 includes, for example, but is not limited to, a plurality of news texts.
With respect to the text repository 130, it may, for example, include a large amount of text, on the order of millions of texts. The inverse document frequency may be determined in advance for each word in all the text included in the text repository 130 and stored in the text repository 130 for subsequent use.
The computing device 110 is configured to determine the word frequency-inverse document frequency of each word in the plurality of first texts 120 to be clustered based on the text repository 130; remove entity identifications from the plurality of text titles in the plurality of first texts 120 to generate a plurality of non-entity titles; determine a plurality of first feature representations associated with the plurality of non-entity titles based on the word frequency-inverse document frequency of each word in the plurality of non-entity titles and a bag of words model; density cluster the plurality of first texts 120 based on the plurality of first feature representations and a first density radius to generate a plurality of first text clusters and a plurality of unclustered second texts; determine a plurality of second feature representations associated with the plurality of second texts based on the word frequency-inverse document frequency of each word in the plurality of second texts and the bag of words model; and density cluster the plurality of second texts based on the plurality of second feature representations and a second density radius to generate a plurality of second text clusters, wherein the second density radius is larger than the first density radius.
Therefore, multilevel text clustering can be realized.
Fig. 2 shows a flow diagram of a method 200 for text clustering according to an embodiment of the present disclosure. For example, the method 200 may be performed by the computing device 110 as shown in FIG. 1. It should be understood that method 200 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the present disclosure is not limited in this respect.
At block 202, the computing device 110 determines a word frequency-inverse document frequency for each term in the plurality of first texts 120 to be clustered based on the text corpus 130. The plurality of first texts 120 includes, for example, but is not limited to, a plurality of news texts. The first text may for example comprise a text title and a text body, such as a news title and a news body.
Specifically, for each first text of the plurality of first texts 120, the computing device 110 may perform word segmentation on the first text to obtain a plurality of words. The computing device 110 then determines the frequency of occurrence (i.e., the word frequency) of each of those words within the first text and determines the inverse document frequency of each word over the text repository 130 (the more texts of the text repository 130 a word occurs in, the lower its inverse document frequency), and then multiplies the word frequency of each word by its inverse document frequency to generate the word frequency-inverse document frequency of each word.
The inverse document frequency may, for example, take the form

IDF(x) = log( N / N(x) )

where N is the total number of texts in the text repository 130 and N(x) is the number of texts in the text repository 130 that include the word x.
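As a concrete illustration, the following Python sketch computes the word frequency-inverse document frequency of each word of one tokenized first text. It is a minimal sketch under stated assumptions: the raw-count word-frequency normalization and the guard for unseen words are illustrative choices, not specified by the description above, which only fixes IDF(x) = log(N / N(x)).

```python
import math
from collections import Counter

def tf_idf(tokens, doc_freq, corpus_size):
    """Word frequency-inverse document frequency for one tokenized text.

    tokens      -- words obtained by segmenting the first text
    doc_freq    -- dict: word -> N(x), number of repository texts containing it
    corpus_size -- N, total number of texts in the text repository
    """
    counts = Counter(tokens)
    scores = {}
    for word, count in counts.items():
        tf = count / len(tokens)              # word frequency within this text
        n_x = max(doc_freq.get(word, 0), 1)   # assumption: avoid division by zero
        scores[word] = tf * math.log(corpus_size / n_x)
    return scores
```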
At block 204, the computing device 110 removes the entity identification from the plurality of text titles in the plurality of first texts 120 to generate a plurality of non-entity titles.
In particular, the computing device 110 may perform entity recognition on the plurality of text titles to determine an entity identification, such as a company name or the like, in the plurality of text titles. Subsequently, the computing device 110 may remove the identified entity identification from the plurality of text titles to generate a plurality of non-entity titles.
For example, if a text title is "Company A successfully listed," "Company A" may be determined to be the entity identification, and after "Company A" is removed from the text title, the non-entity title "successfully listed" is generated.
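A minimal sketch of this step is given below. The recognizer is passed in as a function because the description does not prescribe a particular named-entity-recognition component; strip_entities and toy_ner are hypothetical names used only for illustration.

```python
def strip_entities(title, recognize_entities):
    """Remove every entity mention returned by the recognizer from the title."""
    for entity in recognize_entities(title):
        title = title.replace(entity, "")
    return " ".join(title.split())  # tidy up leftover whitespace

# Toy recognizer that only knows one company name (illustrative only):
toy_ner = lambda t: ["Company A"] if "Company A" in t else []
print(strip_entities("Company A successfully listed", toy_ner))  # -> "successfully listed"
```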
At block 206, the computing device 110 determines a plurality of first feature representations associated with the plurality of non-entity titles based on the word frequency-inverse document frequency and the bag of words model for each word in the plurality of non-entity titles.
In some embodiments, for each non-entity title of the plurality of non-entity titles, the computing device 110 may determine the word frequency-inverse document frequencies of the plurality of words included in the non-entity title. Subsequently, the computing device 110 may generate a vector including the plurality of word frequency-inverse document frequencies based on the bag of words model. Next, the computing device 110 may perform L2 norm normalization on the vector to generate a first feature representation associated with the non-entity title.
Taking the non-entity title "successfully listed" above as an example: it includes two words, "successful" and "listed," whose word frequency-inverse document frequencies are, for example, 0.2 and 0.4, respectively, so the resulting vector is [0.2, 0.4]. The L2 norm normalization yields a vector in which the squares of the elements sum to 1.
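The vectorization and L2 norm normalization can be sketched as follows, reproducing the [0.2, 0.4] example; representing the bag-of-words dimensions by an explicit vocabulary list is an assumption of this sketch.

```python
import math

def bow_feature(tfidf, vocabulary):
    """Bag-of-words vector of TF-IDF weights, L2-normalized so that
    the squares of the elements sum to 1."""
    vec = [tfidf.get(w, 0.0) for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

# "successfully listed" example: [0.2, 0.4] -> approx. [0.447, 0.894]
print(bow_feature({"successful": 0.2, "listed": 0.4}, ["successful", "listed"]))
```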
At block 208, the computing device 110 density clusters the plurality of first texts 120 based on the plurality of first feature representations and the first density radius to generate a plurality of first text clusters and a plurality of second texts that are not clustered.
Specifically, the computing device 110 may determine a similarity between two of the plurality of first texts 120 based on the plurality of first feature representations. The similarity includes, for example, but is not limited to, cosine similarity.
Subsequently, the computing device 110 may density cluster (e.g., via DBSCAN) the plurality of first texts 120 based on the similarity and the first density radius to generate a plurality of first text clusters and a plurality of unclustered second texts.
For example, the distance between two texts of the plurality of first texts 120 may be represented as 1 - similarity, and the neighborhood of each first text may be determined from these pairwise distances and the first density radius. For a first text A, the neighborhood includes every first text of the plurality of first texts 120 whose distance from A is smaller than the first density radius.
A first text A is said to be a core text if the neighborhood of A includes at least a predetermined number of first texts.
If a first text B is located in the neighborhood of a first text A and A is a core text, B is said to be directly density-reachable from A.
For first texts A and B, if there is a sequence of first texts p1, p2, ..., pt, where p1 = A, pt = B, and p(n+1) is directly density-reachable from pn (where n is 1 or more and t-1 or less), then B is said to be density-reachable from A.
For first texts A and B, if there is a core text C such that both A and B are density-reachable from C, then A and B are said to be density-connected.
The density clustering process may specifically select an unclustered core text among the plurality of first texts 120 as a seed, determine the set of first texts density-reachable from that core text as one first text cluster, and repeat this selection and determination until every core text belongs to a cluster, thereby generating a plurality of first text clusters and a plurality of unclustered second texts.
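A minimal sketch of this first-level clustering using scikit-learn's DBSCAN over a precomputed 1 - cosine-similarity matrix is shown below; it assumes the feature vectors are already L2-normalized (so the dot product equals the cosine similarity), and the default min_texts=3 standing in for the "predetermined number" of neighbors is an assumed value.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def density_cluster(features, density_radius, min_texts=3):
    """Density cluster L2-normalized feature vectors.

    Returns DBSCAN labels; label -1 marks the unclustered texts
    (the "second texts" passed on to the next clustering level).
    """
    X = np.asarray(features, dtype=float)
    dist = 1.0 - X @ X.T                 # pairwise 1 - cosine similarity
    np.clip(dist, 0.0, None, out=dist)   # guard tiny negatives from rounding
    return DBSCAN(eps=density_radius, min_samples=min_texts,
                  metric="precomputed").fit_predict(dist)
```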
At block 210, the computing device 110 determines a plurality of second feature representations associated with the plurality of second texts based on the word frequency-inverse document frequency and the bag of words model for each word in the plurality of second texts.
In particular, computing device 110 may determine, based on the word frequency-inverse document frequency and the bag of words model for each word in the plurality of second texts, a plurality of third feature representations associated with a plurality of text headings in the plurality of second texts and a plurality of fourth feature representations associated with a plurality of text bodies in the plurality of second texts.
For a text title, for example, the word frequency-inverse document frequencies of the words included in the text title may be vectorized to generate a third feature representation associated with the text title. For a text body, likewise, the word frequency-inverse document frequencies of the words included in the text body may be vectorized to generate a fourth feature representation associated with the text body.
Subsequently, the computing device 110 may weight the plurality of third feature representations and the plurality of fourth feature representations based on the predetermined weights to generate a plurality of second feature representations associated with a plurality of second texts.
For example, for a second text with title feature representation Vt, body feature representation Vc, and predetermined weight wt, the second feature representation of the second text is V = wt × Vt + Vc. Further, the second feature representation may also be L2-normalized, as sketched below.
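A minimal sketch of this weighting, assuming the title and body vectors are NumPy arrays over the same vocabulary:

```python
import numpy as np

def second_feature(title_vec, body_vec, title_weight):
    """V = wt * Vt + Vc, followed by L2 normalization."""
    v = title_weight * np.asarray(title_vec) + np.asarray(body_vec)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```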
At block 212, the computing device 110 density clusters the second plurality of texts based on the second plurality of feature representations and a second density radius to generate a second plurality of text clusters, the second density radius being greater than the first density radius.
In particular, the computing device 110 may determine a similarity between two of the plurality of second texts based on the plurality of second feature representations. The similarity includes, for example, but is not limited to, cosine similarity.
Subsequently, the computing device 110 may density cluster the plurality of second texts based on the similarity and the second density radius to generate a plurality of second text clusters.
The specific process is similar to the density clustering process described above and is not repeated.
Therefore, multi-level text clustering can be realized. The method clusters texts such as news articles by applying density clustering repeatedly, each pass emphasizing different aspects and dimensions of the articles; it supports finer-grained, hierarchical clustering of sub-topics under large topics, and overcomes the problems that conventional clustering is too single in dimension and granularity, lacks hierarchy, and clusters inaccurately. A sketch of the overall two-level flow follows.
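Reusing the density_cluster sketch above, the two-level flow of blocks 202-212 might be orchestrated as follows; build_second_feats is an assumed helper that builds title-plus-body features only for the texts left unclustered by the first pass.

```python
def two_level_clustering(first_feats, build_second_feats, radius1, radius2):
    """Two-pass clustering: a strict first radius over non-entity-title
    features, then a looser second radius (radius2 > radius1) over
    title+body features of the texts the first pass left unclustered."""
    labels1 = density_cluster(first_feats, radius1)
    leftover = [i for i, lab in enumerate(labels1) if lab == -1]
    labels2 = density_cluster(build_second_feats(leftover), radius2)
    return labels1, leftover, labels2
```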
Since density clustering continually expands through density-connected samples, two different classes of text can easily be connected by a few individual connection points. For example, assume that there are two clusters A and B describing two unrelated events E1 and E2, respectively. If a news article x is a commentary that discusses both event E1 and event E2, x is likely to be adjacent to some articles of class A and some of class B at the same time, so density clustering may merge A and B into one large class that can no longer be separated. Thus news commentaries and other review-style articles have a significant impact on density clustering. In serious cases, such mixed large clusters can extend in chains, for example linking a series of clusters A-B-C-D ... into a single cluster.
Therefore, the present disclosure further provides a cluster segmentation method, which performs cluster segmentation on the plurality of second text clusters after generating the plurality of second text clusters.
Fig. 3 shows a flow diagram of a method 300 for cluster segmentation according to an embodiment of the present disclosure. For example, the method 300 may be performed by the computing device 110 as shown in FIG. 1. It should be understood that method 300 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect. The method 300 may include, for each second text cluster in the plurality of second text clusters, the computing device 110 performing the following steps.
At block 302, the computing device 110 determines a core set of text from the second text cluster, a distance between any core text in the core set of text and at least a predetermined number of texts in the second text cluster being less than the second density radius. The predetermined number may be any suitable number, for example.
At block 304, the computing device 110 determines a set of segmented texts from the core text set; a segmented text is any core text such that, after it is removed from the core text set, the remaining plurality of core texts can be divided into a plurality of first connected subsets.
In particular, the computing device 110 may traverse each core text in the core text set and, for each, search for first connected subsets among the core texts remaining after that core text is removed from the core text set.
For example, the computing device 110 may search for a first connected subset as follows: select any core text P in the set V formed by the remaining core texts as a seed; find the set P1 of all neighbor core texts of P within V; then find the set P2 of all not-yet-found neighbor core texts of P1; then the set P3 of all not-yet-found neighbor core texts of P2; and so on, until no new neighbor core text can be found. If the sets P1, P2, P3, ..., Pn have been obtained, all the core texts in these n sets together with the seed core text P form one first connected subset. Repeating these steps for all remaining core texts in V yields the plurality of first connected subsets of V. If only one set is output, i.e., V itself is returned, then V is connected as a whole and cannot be divided into a plurality of first connected subsets.
A first connected subset has the following characteristics: (1) the subset is internally connected, i.e., between any text x and any other text y in the subset a path x-p1-p2-...-pn-y can be found in which x is adjacent to p1, p1 is adjacent to p2, ..., and pn is adjacent to y; (2) the distance between any text in one subset and any text in another subset is necessarily greater than the second density radius, i.e., no such connecting path exists between two texts in different subsets. Here "adjacent" (or "neighbor") means that the distance between two texts does not exceed the second density radius.
If the remaining plurality of core texts can be divided into a plurality of first connected subsets, the removed core text is determined to be a segmented text; otherwise, it is not a segmented text.
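A sketch of the layered neighbor search and of the segmented-text test described above; is_neighbor stands in for "the distance between two texts does not exceed the second density radius", and the per-candidate re-scan is written for clarity rather than efficiency.

```python
from collections import deque

def connected_subsets(texts, is_neighbor):
    """Partition texts into connected subsets by breadth-first expansion
    (the P1, P2, ..., Pn neighbor layers described above)."""
    remaining, subsets = set(texts), []
    while remaining:
        seed = next(iter(remaining))
        component, frontier = {seed}, deque([seed])
        while frontier:
            p = frontier.popleft()
            for q in list(remaining - component):
                if is_neighbor(p, q):
                    component.add(q)
                    frontier.append(q)
        subsets.append(component)
        remaining -= component
    return subsets

def find_segmented_texts(core_texts, is_neighbor):
    """A core text is a segmented text if removing it splits the
    remaining core texts into more than one connected subset."""
    return [t for t in core_texts
            if len(connected_subsets([c for c in core_texts if c != t],
                                     is_neighbor)) > 1]
```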
At block 306, the computing device 110 removes, from the second text cluster, the segmented text set and the set of non-core texts whose distance from the segmented text set is less than the second density radius, to generate a remaining text set.
At block 308, the computing device 110 determines whether the remaining text set can be divided into a plurality of second connected subsets.
For example, it is determined whether the remaining text set can be divided into a plurality of second connected subsets by searching the remaining text set for second connected subsets. The specific search process is similar to that for the first connected subsets; see above, it is not repeated here.
If the computing device 110 determines at block 308 that the remaining text set can be divided into a plurality of second connected subsets, then at block 310 the plurality of second connected subsets are taken as a plurality of third text clusters obtained by segmenting the second text cluster. For example, the texts in the second connected subsets are labeled with the corresponding third text clusters.
At block 312, the computing device 110 partitions the segmented text set and the set of non-core texts whose distance from the segmented text set is less than the second density radius (hereinafter, the segmented text set and its non-core neighbor text set) into the plurality of third text clusters based on the second density radius.
For example, for each text in the segmented text set and its non-core neighbor text set, if the distance between the text and a core text of one of the third text clusters is smaller than the second density radius, the text is assigned to the third text cluster corresponding to that core text. Note that if the text is within the second density radius of core texts belonging to several of the third text clusters, it is assigned to all of them, i.e., labeled with multiple third text clusters. For example, suppose it is judged whether a text P is adjacent to a core text Pa in a third text cluster A (their distance being smaller than the second density radius); if P is adjacent to Pa, then P belongs to third text cluster A and is labeled A. If P is adjacent to core texts in several third text clusters, it may carry several different cluster labels.
For the texts in the segmented text set and its non-core neighbor text set that have not been assigned to any third text cluster: if among them there is a segmented text together with the non-core neighbor texts associated with it (their distance from the segmented text being smaller than the second density radius), that segmented text and its associated non-core neighbor texts are taken as a new cluster, for example labeled with a new cluster label; otherwise, the texts are marked as noise.
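The cluster-membership part of block 312 can be sketched as below, where texts are represented by hashable identifiers and cluster_cores maps each third-text-cluster label to its core texts; a text that ends up with an empty label list then goes through the new-cluster-or-noise rule just described.

```python
def reassign_removed_texts(removed_texts, cluster_cores, is_neighbor):
    """Label each removed text with every third text cluster that has a
    core text adjacent to it; multiple labels per text are allowed."""
    return {t: [label for label, cores in cluster_cores.items()
                if any(is_neighbor(t, c) for c in cores)]
            for t in removed_texts}
```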
Therefore, the wrongly aggregated classes in the second text cluster can be corrected and segmented, and the clustering accuracy is improved.
Alternatively or additionally, in some embodiments, for each of the plurality of second text clusters and the plurality of third text clusters, the computing device 110 may further determine the maximum distance between any two texts in the text cluster and, if the maximum distance is determined to be greater than a threshold, density cluster the text cluster based on the set of second feature representations and a third density radius to generate a plurality of fourth text clusters, the third density radius being less than the second density radius.
Therefore, density clustering is carried out on a plurality of second text clusters and a plurality of third text clusters obtained by clustering under a wider clustering condition by adopting a stricter clustering condition, so that finer-grained secondary clustering is realized.
The three successive rounds of density clustering, ordered as above, together with the one round of cluster segmentation, overcome the problems that conventional clustering is too single in dimension and granularity, lacks hierarchy, and clusters inaccurately.
In addition, in order to enable a reader to roughly know the main content of each cluster without looking at specific articles in each cluster, the disclosure also provides a method for generating cluster titles.
Fig. 4 shows a flow diagram of a method 400 for generating cluster titles according to an embodiment of the present disclosure. For example, the method 400 may be performed by the computing device 110 as shown in FIG. 1. It should be understood that method 400 may also include additional blocks not shown and/or may omit blocks shown, as the scope of the disclosure is not limited in this respect. The method 400 may include, for at least one of the first plurality of text clusters, the second plurality of text clusters, the third plurality of text clusters, and the fourth plurality of text clusters, the computing device 110 performing the following steps.
At block 402, the computing device 110 divides a plurality of text titles in a text cluster into a plurality of title segments based on punctuation. For example, the text titles in the text cluster are divided into title segments at punctuation marks such as commas, exclamation marks, question marks, semicolons, and spaces. In some embodiments, certain additional rules may also be applied, such as removing leading and trailing commas and spaces.
At block 404, the computing device 110 determines a plurality of first scores associated with the plurality of title segments based on the segment occurrence frequency of each word in the plurality of title segments. The segment occurrence frequency of a word is the number of title segments in which the word occurs.
In some embodiments, the first score of a title segment may be determined, for example, as the average of the segment occurrence frequencies of the words included in the title segment.
At block 406, the computing device 110 determines a plurality of feature representations associated with the plurality of title segments based on the word frequency-inverse document frequency and the bag of words model for each word in the plurality of title segments.
For example, a vector is generated as the feature representation of a title segment based on the word frequency-inverse document frequencies of the words included in the title segment.
At block 408, the computing device 110 density clusters the plurality of title segments based on the plurality of feature representations associated with the plurality of title segments to generate a plurality of segment clusters.
For example, a similarity between two title segments may be generated based on a plurality of feature representations of the plurality of title segments, and the plurality of title segments may be density clustered based on the similarity and the density radius to generate a plurality of segment clusters. The density clustering method can be referred to above, and is not described herein again.
At block 410, the computing device 110 determines a plurality of second scores associated with the plurality of segment clusters based on the plurality of first scores associated with the plurality of title segments.
For example, a plurality of first scores of a plurality of title segments included in a segment cluster may be summed as a second score of the segment cluster.
At block 412, the computing device 110 determines a first segment cluster with the highest second score from the plurality of segment clusters.
At block 414, the computing device 110 determines, from the title segments in the first segment cluster that include the word with the highest segment occurrence frequency, the shortest title segment as the cluster title of the text cluster.
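The whole flow of blocks 402-414 condenses into the sketch below. The tokenizer and the segment-level density clusterer are passed in as stand-ins (the clustering reuses the machinery sketched earlier), and the punctuation set is the one listed at block 402; all helper names are illustrative.

```python
import re
from statistics import mean

def cluster_title(titles, tokenize, cluster_segments):
    # Block 402: split titles into segments at commas, !, ?, ; and spaces
    segments = [s.strip() for t in titles
                for s in re.split(r"[,，!！?？;；\s]+", t) if s.strip()]
    segments = [s for s in segments if tokenize(s)]  # drop untokenizable bits
    # Block 404: segment occurrence frequency of each word, then first scores
    seg_freq = {}
    for s in segments:
        for w in set(tokenize(s)):
            seg_freq[w] = seg_freq.get(w, 0) + 1
    first_score = {s: mean(seg_freq[w] for w in tokenize(s)) for s in segments}
    # Blocks 408-412: density cluster the segments, keep the top-scoring cluster
    best = max(cluster_segments(segments),
               key=lambda c: sum(first_score[s] for s in c))
    # Block 414: shortest segment containing the most frequent word
    top = max({w for s in best for w in tokenize(s)}, key=seg_freq.get)
    return min((s for s in best if top in tokenize(s)), key=len)
```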
Therefore, the titles capable of summarizing the subject contents of each cluster can be automatically generated, a reader can generally know the main contents of each cluster without looking up the specific articles in each cluster, then the interested clusters are selected to read the specific contents of the articles in the clusters, and the efficiency of the reader in mastering information is further improved.
Alternatively or additionally, in some embodiments, the computing device 110 may also present the plurality of first text clusters, the plurality of second text clusters, the plurality of third text clusters, and the plurality of fourth text clusters, together with the cluster titles associated with them.
Alternatively or additionally, in some embodiments, the computing device 110 may further obtain search results based on a search keyword input by a user, the search results including the plurality of first texts to be clustered, and present the plurality of first text clusters, the plurality of second text clusters, the plurality of third text clusters, and the plurality of fourth text clusters, together with the cluster titles associated with them.
The scheme has broad application prospects and can be used in any scenario or product involving news information. For example, daily news can be clustered by the scheme of the present disclosure and displayed to the user in clustered form, improving the user's reading efficiency. It can also be used in news search scenarios: the user inputs search keywords, the news set returned by the search is clustered using the scheme of the present disclosure, and the final result is returned to the user in the form of news clusters.
Fig. 5 illustrates a schematic block diagram of an example device 500 that may be used to implement embodiments of the present disclosure. For example, computing device 110 as shown in fig. 1 may be implemented by device 500. As shown, device 500 includes a Central Processing Unit (CPU) 501 that may perform various appropriate actions and processes according to computer program instructions stored in a Read Only Memory (ROM) 502 or computer program instructions loaded from a storage unit 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, a microphone, and the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The various processes and procedures described above, such as methods 200 through 400, may be performed by the central processing unit 501. For example, in some embodiments, methods 200 through 400 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the central processing unit 501, one or more actions of methods 200 through 400 described above may be performed.
The present disclosure relates to methods, apparatuses, systems, electronic devices, computer-readable storage media and/or computer program products. The computer program product may include computer-readable program instructions for performing various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, thereby implementing aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (9)

1. A method for text clustering, comprising:
determining the word frequency-inverse document frequency of each word in a plurality of first texts to be clustered based on a text library;
removing entity identifications from a plurality of text titles in the plurality of first texts to generate a plurality of non-entity titles;
determining a plurality of first feature representations associated with the plurality of non-entity titles based on the word frequency-inverse document frequency and the bag of words model for each word in the plurality of non-entity titles;
density clustering the first texts based on the first feature representations and the first density radius to generate first text clusters and second texts which are not clustered;
determining a plurality of second feature representations associated with the plurality of second texts based on the word frequency-inverse document frequency of each word in the plurality of second texts and the bag of words model; and
density clustering the second plurality of texts based on the second plurality of feature representations and a second density radius to generate a second plurality of text clusters, the second density radius being greater than the first density radius;
the method further comprises, for each second text cluster of the plurality of second text clusters, performing the steps of:
determining a set of core texts from the second text cluster, wherein the distance between any core text in the set of core texts and at least a predetermined number of texts in the second text cluster is smaller than the second density radius;
determining a set of segmented texts from the core text set, wherein after any segmented text in the set of segmented texts is removed from the core text set, the remaining plurality of core texts can be divided into a plurality of first connected subsets;
removing the segmented text set and a set of non-core text having a distance from the segmented text set less than the second density radius from the second text cluster to generate a set of remaining text;
if it is determined that the set of remaining text can be divided into a plurality of second connected subsets:
taking the plurality of second connected subsets as a plurality of third text clusters obtained by segmenting the second text cluster; and
partitioning the segmented text set and the non-core text set into the plurality of third text clusters based on the second density radius.
2. The method of claim 1, further comprising, for each of the plurality of second text clusters and the plurality of third text clusters, performing the steps of:
determining the maximum distance between every two texts in the text cluster; and
density clustering the text clusters based on the plurality of second feature representations and a third density radius to generate a plurality of fourth text clusters if it is determined that the maximum distance is greater than a threshold, the third density radius being less than the second density radius.
3. The method of claim 2, further comprising performing the following steps for at least one of the first plurality of text clusters, the second plurality of text clusters, the third plurality of text clusters, and the fourth plurality of text clusters:
dividing a plurality of text titles in the text cluster into a plurality of title segments based on punctuation marks;
determining a plurality of first scores associated with the plurality of title segments based on a segment frequency of occurrence of each word in the plurality of title segments;
determining a plurality of feature representations associated with the plurality of title segments based on the word frequency-inverse document frequency of each word in the plurality of title segments and the bag of words model;
density clustering the plurality of title segments based on a plurality of feature representations associated with the plurality of title segments to generate a plurality of segment clusters;
determining a plurality of second scores associated with the plurality of segment clusters based on a plurality of first scores associated with the plurality of title segments;
determining a first segment cluster with a highest second score from the plurality of segment clusters; and
determining, from the title segments in the first segment cluster that include the word with the highest segment occurrence frequency, the shortest title segment as the cluster title of the text cluster.
4. The method of claim 1, wherein determining the plurality of first feature representations comprises, for each non-entity title of the plurality of non-entity titles, performing the steps of:
determining the word frequency-inverse document frequencies of a plurality of words included in the non-entity title;
generating a vector comprising the plurality of word frequencies-inverse document frequencies based on the bag of words model; and
performing L2 norm normalization on the vector to generate a first feature representation associated with the non-entity title.
5. The method of claim 1, wherein density clustering the plurality of first texts comprises:
determining similarity between two texts in the first texts based on the first feature representations; and
density clustering the plurality of first texts based on the similarity and the first density radius to generate the plurality of first text clusters and the plurality of second texts which are not clustered.
6. The method of claim 1, wherein determining the plurality of second feature representations comprises:
determining a plurality of third feature representations associated with a plurality of text titles in the plurality of second texts and a plurality of fourth feature representations associated with a plurality of text bodies in the plurality of second texts based on the word frequency-inverse document frequency of each word in the plurality of second texts and the bag-of-words model; and
weighting the plurality of third feature representations and the plurality of fourth feature representations based on predetermined weights to generate the plurality of second feature representations associated with the plurality of second texts.
7. The method of claim 1, wherein the plurality of first texts comprises a plurality of news texts.
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
9. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
CN202011491126.5A 2020-12-17 2020-12-17 Method, electronic device and storage medium for text clustering Active CN112256842B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011491126.5A CN112256842B (en) 2020-12-17 2020-12-17 Method, electronic device and storage medium for text clustering
PCT/CN2021/087169 WO2022126944A1 (en) 2020-12-17 2021-04-14 Text clustering method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011491126.5A CN112256842B (en) 2020-12-17 2020-12-17 Method, electronic device and storage medium for text clustering

Publications (2)

Publication Number Publication Date
CN112256842A CN112256842A (en) 2021-01-22
CN112256842B true CN112256842B (en) 2021-03-26

Family

ID=74225827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011491126.5A Active CN112256842B (en) 2020-12-17 2020-12-17 Method, electronic device and storage medium for text clustering

Country Status (2)

Country Link
CN (1) CN112256842B (en)
WO (1) WO2022126944A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256842B (en) * 2020-12-17 2021-03-26 上海朝阳永续信息技术股份有限公司 Method, electronic device and storage medium for text clustering
CN113011155B (en) * 2021-03-16 2023-09-05 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for text matching
CN113792784B (en) * 2021-09-14 2022-06-21 上海任意门科技有限公司 Method, electronic device and storage medium for user clustering
CN115577124B (en) * 2022-11-10 2023-04-07 上海朝阳永续信息技术股份有限公司 Method, apparatus and medium for interacting financial data
CN115544213B (en) * 2022-11-28 2023-03-10 上海朝阳永续信息技术股份有限公司 Method, device and storage medium for acquiring information in text
CN118016225B (en) * 2024-04-09 2024-06-25 山东第一医科大学附属省立医院(山东省立医院) Intelligent management method for electronic health record data after kidney transplantation

Citations (3)

Publication number Priority date Publication date Assignee Title
CN108520009A (en) * 2018-03-19 2018-09-11 北京工业大学 A kind of English text clustering method and system
CN109189934A (en) * 2018-11-13 2019-01-11 平安科技(深圳)有限公司 Public sentiment recommended method, device, computer equipment and storage medium
CN111985228A (en) * 2020-07-28 2020-11-24 招联消费金融有限公司 Text keyword extraction method and device, computer equipment and storage medium

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
US9710544B1 (en) * 2016-05-19 2017-07-18 Quid, Inc. Pivoting from a graph of semantic similarity of documents to a derivative graph of relationships between entities mentioned in the documents
CN110209808B (en) * 2018-08-08 2023-03-10 腾讯科技(深圳)有限公司 Event generation method based on text information and related device
CN110489558B (en) * 2019-08-23 2022-03-18 网易传媒科技(北京)有限公司 Article aggregation method and device, medium and computing equipment
CN111143479B (en) * 2019-12-10 2023-09-01 易点生活数字科技有限公司 Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
CN112256842B (en) * 2020-12-17 2021-03-26 上海朝阳永续信息技术股份有限公司 Method, electronic device and storage medium for text clustering

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN108520009A (en) * 2018-03-19 2018-09-11 北京工业大学 A kind of English text clustering method and system
CN109189934A (en) * 2018-11-13 2019-01-11 平安科技(深圳)有限公司 Public sentiment recommended method, device, computer equipment and storage medium
CN111985228A (en) * 2020-07-28 2020-11-24 招联消费金融有限公司 Text keyword extraction method and device, computer equipment and storage medium

Non-Patent Citations (1)

Title
Research on hot topic discovery based on density clustering of title feature words and similarity calculation; 韩晨靖; China Master's Theses Full-text Database, Information Science and Technology; 2014-01-15 (No. 01); pp. 1-68 *

Also Published As

Publication number Publication date
CN112256842A (en) 2021-01-22
WO2022126944A1 (en) 2022-06-23

Similar Documents

Publication Publication Date Title
CN112256842B (en) Method, electronic device and storage medium for text clustering
CN108399228B (en) Article classification method and device, computer equipment and storage medium
US10565244B2 (en) System and method for text categorization and sentiment analysis
WO2021212675A1 (en) Method and apparatus for generating adversarial sample, electronic device and storage medium
Al-Anazi et al. Finding similar documents using different clustering techniques
CN107004159B (en) Active machine learning
WO2015007175A1 (en) Subject-matter analysis of tabular data
CN107357895B (en) Text representation processing method based on bag-of-words model
US20140040297A1 (en) Keyword extraction
WO2014073206A1 (en) Information-processing device and information-processing method
Patel et al. Dynamic lexicon generation for natural scene images
WO2020172649A1 (en) System and method for text categorization and sentiment analysis
CN111324810A (en) Information filtering method and device and electronic equipment
CN111353045A (en) Method for constructing text classification system
CN114995903B (en) Class label identification method and device based on pre-training language model
CN111753514B (en) Automatic generation method and device of patent application text
Jayady et al. Theme Identification using Machine Learning Techniques
Kadhim et al. Improving TF-IDF with singular value decomposition (SVD) for feature extraction on Twitter
CN113934848A (en) Data classification method and device and electronic equipment
CN108038109A (en) Method and system, the computer program of Feature Words are extracted from non-structured text
CN112417147A (en) Method and device for selecting training samples
US9946765B2 (en) Building a domain knowledge and term identity using crowd sourcing
JP2008276344A (en) Multi-topic classification apparatus, multi-topic classification method and multi-topic classification program
CN103678355B (en) Text mining method and text mining device
CN113378557B (en) Automatic keyword extraction method, medium and system based on fault-tolerant rough set

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 201203 Room 501, building 4, No. 690, Bibo Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Patentee after: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.

Address before: Building 4, 690 Bibo Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Patentee before: SHANGHAI SUNTIME INFORMATION TECHNOLOGY CO.,LTD.