WO2022143069A1 - Text clustering method and apparatus, electronic device, and storage medium

Info

Publication number: WO2022143069A1
Application number: PCT/CN2021/136677
Authority: WO
Inventors: 封江涛; 陈家泽; 周浩; 李磊
Original assignee: 北京有竹居网络技术有限公司
Priority date: 2020-12-31
Filing date: 2021-12-09
Publication date: 2022-07-07
Also published as: CN112632285A

Abstract

A text clustering method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a target text data set to be clustered (110); for each piece of target text data in the target text data set, calculating a first importance score of at least one word in each piece of target text data, and sorting the at least one word in each piece of target text data on the basis of the first importance score to generate a word sequence to be searched corresponding to each piece of target text data (120); for each word sequence, searching in a pre-constructed dictionary tree for a target word sequence adapted to each word sequence, the target word sequence belonging to a sub-sequence of each word sequence (130); and according to at least one target word sequence, respectively clustering the target text data corresponding to the at least one target word sequence to obtain a text clustering result (140).

Description

Text clustering method, apparatus, electronic device, and storage medium

This application claims the priority of the Chinese Patent Application No. 202011630633.2 filed with the China Patent Office on December 31, 2020, the entire contents of which are incorporated herein by reference.

Technical field

The embodiments of the present disclosure relate to the field of computer technology, for example, to a text clustering method, apparatus, electronic device, and storage medium.

Background

Text clustering divides similar text data into the same cluster and separates dissimilar text data into different clusters. Clustering methods are applied in many fields, such as networking, medicine, biology, computer vision, and natural language processing.

A text clustering method in the related art represents each text as a feature vector, calculates the similarity between texts from the corresponding feature vectors, and finally clusters the texts according to that similarity. Because the texts must first be represented as feature vectors before any similarity can be calculated, the calculation process of text clustering is complicated and its efficiency is low.

Summary of the invention

Embodiments of the present disclosure provide a text clustering method, apparatus, electronic device, and storage medium, which can effectively improve the efficiency and accuracy of text clustering.

In a first aspect, an embodiment of the present disclosure provides a text clustering method, including:

obtaining a target text data set to be clustered, wherein the target text data set includes at least one piece of target text data;

for each piece of target text data in the target text data set, calculating a first importance score of at least one word in the piece of target text data, and sorting the at least one word based on the first importance score to generate a to-be-searched word sequence corresponding to the piece of target text data;

for each to-be-searched word sequence, searching a pre-built dictionary tree for a target word sequence adapted to the to-be-searched word sequence, wherein the target word sequence is a subsequence of the to-be-searched word sequence; and

clustering, according to at least one target word sequence, the target text data corresponding to the at least one target word sequence, to obtain a text clustering result.

In a second aspect, an embodiment of the present disclosure further provides a text clustering apparatus, including:

a text data acquisition module, configured to acquire a target text data set to be clustered, wherein the target text data set includes at least one piece of target text data;

a to-be-searched word sequence generation module, configured to, for each piece of target text data in the target text data set, calculate a first importance score of at least one word in the piece of target text data, sort the at least one word based on the first importance score, and generate a to-be-searched word sequence corresponding to the piece of target text data;

a target word sequence determination module, configured to, for each to-be-searched word sequence, search a pre-built dictionary tree for a target word sequence adapted to the to-be-searched word sequence, wherein the target word sequence is a subsequence of the to-be-searched word sequence; and

a text clustering module, configured to cluster, according to at least one target word sequence, the target text data corresponding to the at least one target word sequence, to obtain a text clustering result.

In a third aspect, an embodiment of the present disclosure further provides an electronic device, the electronic device comprising:

at least one processing device;

a storage device configured to store at least one program;

When the at least one program is executed by the at least one processing apparatus, the at least one processing apparatus implements the text clustering method according to the embodiment of the present disclosure.

In a fourth aspect, an embodiment of the present disclosure further provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the text clustering method according to the embodiment of the present disclosure.

Description of drawings

FIG. 1 is a flowchart of a text clustering method in an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a dictionary tree in an embodiment of the present disclosure;

FIG. 3 is a flowchart of another text clustering method in an embodiment of the present disclosure;

FIG. 4 is a flowchart of yet another text clustering method in an embodiment of the present disclosure;

FIG. 5 is a flowchart of yet another text clustering method in an embodiment of the present disclosure;

FIG. 6 is a schematic structural diagram of a text clustering apparatus in an embodiment of the present disclosure;

FIG. 7 is a schematic structural diagram of an electronic device in an embodiment of the present disclosure.

Detailed description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings.

It should be understood that the various steps described in the method embodiments of the present disclosure may be performed in different orders and/or in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this regard.

As used herein, the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order of the functions performed by these devices, modules, or units, or their interdependence.

It should be noted that the modifiers "a" and "a plurality" mentioned in the present disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "at least one".

The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are only used for illustrative purposes, and are not used to limit the scope of these messages or information.

FIG. 1 is a flowchart of a text clustering method provided by an embodiment of the present disclosure. The embodiment is applicable to text clustering scenarios. The method may be performed by a text clustering apparatus, which may be implemented in hardware and/or software and can generally be integrated into a device with a text clustering function, such as a server, a mobile terminal, or a server cluster. As shown in FIG. 1, the method includes the following steps:

Step 110: Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data.

In the embodiment of the present disclosure, the target text data set includes at least one piece of target text data, where the target text data may be text data of various types, such as news, advertising, network, natural language, or medical text data. The categories of the pieces of target text data in the target text data set may be the same or different. The target text data may be English text, Chinese text, or Korean text. Exemplarily, the target text data to be clustered can be collected by web crawling, and can also be obtained through optical character recognition, speech recognition, handwriting recognition, and the like. Optionally, when the target text data set to be clustered contains a single piece of target text data, text data input by a user may be collected in real time and used as the text data to be clustered.

It should be noted that the embodiments of the present disclosure do not limit the content category, language category and acquisition method of the target text data.

Step 120: For each piece of target text data in the target text data set, calculate a first importance score of at least one word in the target text data, and sort the at least one word of the target text data based on the first importance score to generate a to-be-searched word sequence corresponding to the target text data.

In the embodiment of the present disclosure, word segmentation is performed on each piece of target text data in the target text data set, so as to divide each piece of target text data into at least one word. Optionally, before word segmentation, each piece of target text data may also be preprocessed, for example by removing punctuation and stop words. Then, the first importance score of at least one word in each piece of target text data is calculated. The first importance score reflects the importance of each word in the target text data: the larger the first importance score, the more important the word is in the target text data; conversely, the smaller the first importance score, the less important the word is in the target text data.

Optionally, the number of occurrences of each word in the target text data may be counted and used as the first importance score of that word. Optionally, the word frequency-inverse document frequency (TF-IDF) of a word in the target text data may be used as the first importance score of the word. Optionally, calculating the first importance score of at least one word in each piece of target text data includes: calculating a first word frequency-inverse document frequency of each of the at least one word in the target text data, and calculating the first importance score of each word according to its first word frequency-inverse document frequency. It should be noted that the embodiment of the present disclosure does not limit the method for calculating the first importance score of at least one word in the target text data.

Exemplarily, the at least one word in the target text data is sorted based on the first importance score. For example, the words may be sorted in descending order of the first importance score, and the sequence composed of the sorted words is used as the to-be-searched word sequence corresponding to the target text data. It can be understood that the earlier a word appears in the to-be-searched word sequence, the greater its first importance score, which indicates that the word is more important in the target text data and better indicates the meaning and content that the target text data expresses, or the category of the target text data.
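Step 120 can be illustrated with a minimal Python sketch. It uses the simplest scoring option mentioned above (the occurrence count as the first importance score). The whitespace tokenizer, the stop-word list `STOP_WORDS`, the alphabetical tie-breaking rule, and the function names are assumptions made for illustration and are not taken from the disclosure.

```python
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "of"}


def tokenize(text):
    """Naive whitespace segmentation with punctuation and stop-word removal.

    A production system would use a proper word segmenter, especially for
    languages such as Chinese that do not separate words with spaces.
    """
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return [w for w in words if w and w not in STOP_WORDS]


def to_search_sequence(text):
    """Return the distinct words of `text` in descending order of importance.

    The importance score here is simply the occurrence count.
    Ties are broken alphabetically so the result is deterministic (assumption).
    """
    counts = Counter(tokenize(text))
    return sorted(counts, key=lambda w: (-counts[w], w))


sequence = to_search_sequence("The price of the phone, the phone price, a phone case.")
print(sequence)
```

Swapping the occurrence count for a TF-IDF-based score only changes how `counts` is computed; the sorting step stays the same.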

Step 130: For each to-be-searched word sequence, search a pre-built dictionary tree for a target word sequence adapted to the to-be-searched word sequence, wherein the target word sequence is a subsequence of the to-be-searched word sequence.

In the embodiment of the present disclosure, a pre-built dictionary tree is acquired, the dictionary tree having been constructed from a pre-configured target corpus. For each to-be-searched word sequence, the dictionary tree is searched for a target word sequence adapted to that sequence. Optionally, the search proceeds through the dictionary tree in order from the root node to the child nodes. Exemplarily, the dictionary tree is first searched for a first target node that matches the first word in the to-be-searched word sequence; the child nodes connected to the first target node are then searched for a second target node that matches the second word; the child nodes connected to the second target node are searched for a third target node that matches the third word; and so on, until none of the child nodes connected to the p-th target node matches the (p+1)-th word in the to-be-searched word sequence. The sequence formed by the words in the matched target nodes is taken as the target word sequence. In other words, the target word sequence consists of the words of the to-be-searched word sequence that can be matched to nodes in the dictionary tree, and it is a subsequence of the to-be-searched word sequence.

For example, suppose the to-be-searched word sequence is [A-B-C-D-E], where A, B, C, D, and E each represent a word, and that searching the dictionary tree from the root node to the child nodes finds target nodes matching A, B, and C. That is, a first target node matching A is found in the dictionary tree, a second target node matching B is found among the child nodes connected to the first target node, and a third target node matching C is found among the child nodes connected to the second target node, but no fourth target node matching D is found among the child nodes connected to the third target node. The sequence consisting of A, B, and C is then taken as the target word sequence.
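The root-to-child search can be sketched in a few lines of Python. The representation of the dictionary tree as nested dictionaries, the example tree contents, and the function name `search_target_sequence` are assumptions chosen for illustration; the disclosure does not prescribe a data structure.

```python
def search_target_sequence(trie, words):
    """Follow `words` from the root of the trie and return the matched prefix.

    The walk stops at the first word that has no matching child node, so the
    result is a (possibly empty) leading subsequence of `words`.
    """
    node, matched = trie, []
    for word in words:
        if word not in node:
            break
        matched.append(word)
        node = node[word]  # descend to the child node for this word
    return matched


# Hypothetical dictionary tree: the outer dict plays the role of an empty root,
# and each key is a word stored in a child node.
trie = {"A": {"B": {"C": {}, "X": {}}}, "Q": {}}

target = search_target_sequence(trie, ["A", "B", "C", "D", "E"])
print(target)
```

With this tree, the walk matches A, B, and C and stops at D, mirroring the [A-B-C-D-E] example. Each lookup costs time proportional to the length of the word sequence, independent of the number of texts already clustered.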

Step 140: Cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.

In the embodiment of the present disclosure, the target text data corresponding to the at least one target word sequence is clustered according to the at least one target word sequence. It can be understood that the target word sequence intuitively reflects the category or the content of the target text data. If the target word sequences corresponding to two pieces of target text data are the same or highly similar, the categories or the contents of those pieces of target text data are the same or similar. Therefore, the target text data can be clustered according to the target word sequences. Exemplarily, target text data with the same target word sequence can be clustered into the same cluster, and target text data with different target word sequences can be clustered into different clusters. Alternatively, the similarity between target word sequences can be calculated; target text data whose similarity is greater than a preset threshold are clustered into the same cluster, and target text data whose similarity is less than the preset threshold are clustered into different clusters. It should be noted that the embodiment of the present disclosure does not limit the manner of clustering the corresponding target text data according to the target word sequences.
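The first clustering option (identical target word sequences share a cluster) amounts to grouping by key. The sketch below assumes each text has an identifier and an already computed target word sequence; the identifiers and the function name are hypothetical. The similarity-threshold variant is not shown.

```python
from collections import defaultdict


def cluster_by_target_sequence(target_sequences):
    """Group text ids whose target word sequences are identical.

    `target_sequences` maps a text id to its target word sequence (a list of
    words). Lists are converted to tuples so they can be used as dict keys.
    """
    clusters = defaultdict(list)
    for text_id, sequence in target_sequences.items():
        clusters[tuple(sequence)].append(text_id)
    return dict(clusters)


# Hypothetical target word sequences for three texts.
target_sequences = {
    "doc1": ["A", "B", "C"],
    "doc2": ["A", "B"],
    "doc3": ["A", "B", "C"],
}

clusters = cluster_by_target_sequence(target_sequences)
print(clusters)
```

Because no pairwise similarity is computed, the grouping is a single pass over the texts.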

In this embodiment of the present disclosure, a target text data set to be clustered is obtained, wherein the target text data set includes at least one piece of target text data. For each piece of target text data in the target text data set, a first importance score of at least one word in the target text data is calculated, and the at least one word is sorted based on the first importance score to generate a to-be-searched word sequence corresponding to the target text data. For each to-be-searched word sequence, a pre-built dictionary tree is searched for a target word sequence adapted to the to-be-searched word sequence, wherein the target word sequence is a subsequence of the to-be-searched word sequence. The target text data corresponding to at least one target word sequence is then clustered according to the at least one target word sequence to obtain a text clustering result. In other words, the text clustering method provided by the embodiment of the present disclosure calculates the importance score of at least one word in the text data to be clustered, sorts the words based on the importance score to generate a to-be-searched word sequence, and then finds the target word sequence adapted to the to-be-searched word sequence in the pre-built dictionary tree, so that the text data is clustered based on the target word sequence. This simplifies the text clustering process, greatly reduces its time complexity, and effectively improves the efficiency and accuracy of text clustering.

In some embodiments, for each piece of target text data in the target text data set, calculating the first importance score of at least one word in the target text data includes: calculating a first word frequency-inverse document frequency of each of the at least one word in the target text data; and calculating the first importance score of each word according to its first word frequency-inverse document frequency. The first word frequency-inverse document frequency (Term Frequency-Inverse Document Frequency, TF-IDF) indirectly reflects the importance of each word in the target text data. Therefore, the first word frequency-inverse document frequency of each word in the target text data can be calculated first, and the first importance score of each word can then be calculated from it.

Optionally, calculating the first word frequency-inverse document frequency of at least one word in the target text data includes: determining the first word frequency and the first inverse document frequency of each word in the target text data; and calculating the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency, wherein the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency. Exemplarily, determining the first word frequency and the first inverse document frequency of each word in the target text data includes: determining the number of occurrences of each word in the target text data, and using the number of occurrences as the first word frequency of the corresponding word; obtaining parameter configuration information corresponding to the dictionary tree, wherein the parameter configuration information includes an inverse document frequency list, and the inverse document frequency list includes the inverse document frequency of each word contained in the dictionary tree; and searching the inverse document frequency list for the inverse document frequency corresponding to each word in the target text data, as the first inverse document frequency of that word.

Exemplarily, the number of occurrences of each word in the target text data is counted and used as the first word frequency (Term Frequency, TF) of the corresponding word. It can be understood that a given word may appear once or several times in the target text data, and the more times the word appears, the more important the word is to the content or language expression of the target text data. The parameter configuration information corresponding to the dictionary tree is then acquired, where the parameter configuration information is the parameter information determined in the process of constructing the dictionary tree from the target corpus. The parameter configuration information may include an inverse document frequency list composed of the inverse document frequency (Inverse Document Frequency, IDF) of each word contained in the dictionary tree. It can be understood that, in the process of constructing the dictionary tree from the target corpus, the inverse document frequencies of the words in the target corpus must be calculated, and the dictionary tree is then constructed based on those inverse document frequencies. The inverse document frequency corresponding to each word in the target text data is looked up in the inverse document frequency list corresponding to the dictionary tree, and the value found is used as the first inverse document frequency of that word. Finally, the product of the first word frequency and the first inverse document frequency is taken as the first word frequency-inverse document frequency of the corresponding word.
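A minimal Python sketch of this lookup-based calculation follows. The contents of `idf_list`, the function name, and the handling of words that are absent from the list (they receive 0.0) are assumptions for illustration; the disclosure does not state how unseen words are treated.

```python
from collections import Counter


def first_tf_idf(words, idf_list):
    """Compute TF-IDF per word, with TF as the raw occurrence count.

    `idf_list` stands for the inverse document frequency list stored in the
    parameter configuration of the dictionary tree. Words missing from the
    list get an IDF of 0.0 here (assumption).
    """
    tf = Counter(words)
    return {word: tf[word] * idf_list.get(word, 0.0) for word in tf}


# Hypothetical IDF values precomputed while building the dictionary tree.
idf_list = {"phone": 1.5, "price": 2.0, "case": 3.0}

scores = first_tf_idf(["phone", "price", "phone", "case", "phone", "price"], idf_list)
print(scores)
```

Reusing the precomputed IDF values means that no corpus statistics have to be recomputed when new text arrives; only the per-text counts are needed.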

In some embodiments, the parameter configuration information further includes a distribution deviation list, which includes the distribution deviation of each word contained in the dictionary tree. Before the first importance score of at least one word in the target text data is calculated according to at least one first word frequency-inverse document frequency, the method further includes: searching the distribution deviation list for the distribution deviation corresponding to each word in the target text data, as the first distribution deviation of that word. Calculating the first importance score of at least one word in the target text data according to at least one first word frequency-inverse document frequency then includes: calculating the first importance score of each word in the target text data according to its first word frequency-inverse document frequency and the corresponding first distribution deviation, wherein the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.

Exemplarily, the parameter configuration information may further include a distribution deviation list consisting of the distribution deviation of each word in the dictionary tree. It can be understood that, in the process of constructing the dictionary tree from the target corpus, it is necessary to calculate not only the inverse document frequency of each word in the target corpus but also the distribution deviation of each word in the target corpus, and the dictionary tree is then built based on both the inverse document frequencies and the distribution deviations of the words. The distribution deviation reflects how the distribution of a word in the target corpus deviates from its distribution in the total corpus. The distribution deviation corresponding to each word in the target text data is looked up in the distribution deviation list corresponding to the dictionary tree, and the value found is used as the first distribution deviation of that word. Then, the first importance score of the word in the target text data is calculated from the first word frequency-inverse document frequency and the corresponding first distribution deviation, wherein the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.
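The product rule can be sketched as follows. The numeric values, the function name, and the neutral factor of 1.0 for words missing from the deviation list are assumptions for illustration only.

```python
def first_importance_scores(tf_idf, deviation_list):
    """Multiply each word's TF-IDF by its looked-up distribution deviation.

    Words missing from `deviation_list` keep their TF-IDF unchanged, i.e. a
    neutral factor of 1.0 is used (assumption, not stated in the disclosure).
    """
    return {word: value * deviation_list.get(word, 1.0) for word, value in tf_idf.items()}


# Hypothetical TF-IDF values and distribution deviation list.
tf_idf = {"phone": 4.5, "price": 4.0, "case": 3.0}
deviation_list = {"phone": 0.5, "price": 2.0, "case": 1.0}

scores = first_importance_scores(tf_idf, deviation_list)
# Sorting in descending order of score yields the to-be-searched word sequence.
sequence = sorted(scores, key=scores.get, reverse=True)
print(scores, sequence)
```

In this example "phone" has the highest TF-IDF but drops to last place after weighting, which shows how the deviation factor can reorder words that are frequent everywhere rather than characteristic of the target corpus.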

In some embodiments, before acquiring the target text data set to be clustered, the method further includes: acquiring a total corpus and a target corpus, wherein the total corpus includes the target corpus, and the target corpus contains at least one piece of sample text data; calculating a second distribution deviation of each word contained in the target corpus relative to the total corpus; for each piece of sample text data in the target corpus, calculating a second importance score of each word according to the second distribution deviation of that word, sorting the at least one word in the piece of sample text data in descending order of the second importance score, and generating a sample word sequence corresponding to the sample text data; and constructing the dictionary tree based on at least one sample word sequence. In this way, the dictionary tree corresponding to the target corpus can be built accurately and quickly.

Exemplarily, the target corpus may be a corpus belonging to a certain category or field. For example, the target corpus may be an advertising corpus, a network corpus, a legal corpus, or a medical corpus. The total corpus is a larger corpus that includes the target corpus. For example, if the target corpus is an advertising corpus, the total corpus may include a network corpus, a legal corpus, a medical corpus, and the advertising corpus. The target corpus includes at least one piece of sample text data. Exemplarily, the total corpus and the target corpus can be obtained by web crawling. It should be noted that the embodiment of the present disclosure does not limit the type of the target corpus, nor the contents of the total corpus other than the target corpus.

Exemplarily, corpora of different fields or categories contain different words, and the importance of those words also differs; for example, the words contained in an advertising corpus and in a legal corpus are quite different. Therefore, the second distribution deviation of each word contained in the target corpus relative to the total corpus can be calculated, where the second distribution deviation reflects the difference between the distribution of the word in the target corpus and its distribution in the total corpus. Optionally, the second distribution deviation of each word contained in the target corpus relative to the total corpus is calculated according to the following formula:

Here, b represents the second distribution deviation of the word w in the target corpus relative to the total corpus, freq_a(w) represents the frequency of occurrence of the word w in the target corpus, freq(w) represents the frequency of occurrence of the word w in the total corpus, t represents the number of occurrences of the word w in the target corpus, M represents the total number of words contained in the target corpus, t' represents the number of occurrences of the word w in the total corpus, and M' represents the total number of words contained in the total corpus.

Exemplarily, if the total number of words contained in the target corpus is 1000 and the word "movement" appears 100 times in the target corpus, the frequency of occurrence of "movement" in the target corpus is:

$$freq_a(\text{movement}) = \frac{t}{M} = \frac{100}{1000} = 0.1$$

If the total number of words contained in the total corpus is 5000 and the word "movement" appears 120 times in the total corpus, the frequency of occurrence of "movement" in the total corpus is:

$$freq(\text{movement}) = \frac{t'}{M'} = \frac{120}{5000} = 0.024$$

The second distribution deviation of "movement" is then obtained from these two frequencies.
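The arithmetic of this example can be checked with a short Python sketch. The two frequencies follow directly from the counts given in the text. The deviation itself is computed here as the ratio freq_a(w) / freq(w); that ratio is an assumption made purely for illustration, because the formula of the disclosure is not reproduced in this text, and the function names are hypothetical.

```python
def frequency(count, total):
    """Frequency of occurrence: number of occurrences divided by total words."""
    return count / total


def distribution_deviation(t, M, t_total, M_total):
    """Deviation of a word in the target corpus relative to the total corpus.

    ASSUMPTION: the deviation is taken as the ratio freq_a(w) / freq(w).
    The original formula is not available in this text, so other definitions
    (for example a logarithm of the ratio) cannot be ruled out.
    """
    return frequency(t, M) / frequency(t_total, M_total)


freq_a = frequency(100, 1000)   # "movement" in the target corpus
freq = frequency(120, 5000)     # "movement" in the total corpus
b = distribution_deviation(100, 1000, 120, 5000)
print(freq_a, freq, b)
```

Under the ratio assumption, a value above 1 means the word is relatively more frequent in the target corpus than in the total corpus, which is the case for "movement" here.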

In the embodiment of the present disclosure, for each piece of sample text data in the target corpus, the second importance score of each word is calculated according to the second distribution deviation of that word. The second importance score reflects the importance of each word in the sample text data: the larger the second importance score, the more important the word is in the sample text data; conversely, the smaller the second importance score, the less important the word is in the sample text data. The at least one word in the sample text data is then sorted in descending order of the second importance score, and the sequence consisting of the sorted words is used as the sample word sequence corresponding to the sample text data. It can be understood that the earlier a word appears in the sample word sequence, the greater its second importance score, which indicates that the word is more important in the sample text data and better indicates the meaning and content that the sample text data expresses, or the category of the sample text data.

A dictionary tree is constructed based on the sample word sequences corresponding to the at least one piece of sample text data in the target corpus. Exemplarily, when the first words of the sample word sequences are not all the same, an empty node can be used as the root node of the dictionary tree. The first word of each sample word sequence is used as a child node of the root node, the second word of each sample word sequence is used as a child node of the node holding the first word of the same sequence, the third word is used as a child node of the node holding the second word of the same sequence, and so on, until all words in all sample word sequences have been placed in nodes of the dictionary tree. When the first words of all sample word sequences are the same, that first word can be used as the root node of the dictionary tree, the second word of each sample word sequence is used as a child node of the root node, the third word is used as a child node of the node holding the second word of the same sequence, and so on, until all words in all sample word sequences have been placed in nodes of the dictionary tree. Exemplarily, the sample word sequences corresponding to five pieces of sample text data in the target corpus are: [intermediate commodity], [intermediate bigbuy], [intermediate business Korean], [business middle], [behind the middle]. The dictionary tree constructed from these five sample word sequences is shown in FIG. 2.
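The construction procedure can be sketched in Python as repeated insertion into nested dictionaries. The sketch always uses an empty root node, which covers the first case described above; the special case in which a shared first word becomes the root is not modelled. The sample sequences and the function name are hypothetical and are not the sequences of FIG. 2.

```python
def build_trie(sample_sequences):
    """Insert every sample word sequence below a shared empty root node.

    The trie is a nested dict: each key is a word, each value is the dict of
    that node's children. Sequences with a common prefix share nodes.
    """
    root = {}
    for sequence in sample_sequences:
        node = root
        for word in sequence:
            # Reuse the child node if it exists, otherwise create it.
            node = node.setdefault(word, {})
    return root


# Hypothetical sample word sequences, already sorted by importance score.
sample_sequences = [
    ["A", "B"],
    ["A", "C"],
    ["A", "D", "E"],
    ["D", "A"],
]

trie = build_trie(sample_sequences)
print(trie)
```

Because each sequence is ordered by descending importance, the most characteristic words sit near the root, so texts that share their most important words also share the upper part of their path through the tree.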

In some embodiments, for each piece of sample text data in the target corpus, calculating the second importance score of each word according to the second distribution deviation of each word in the sample text data includes: for each piece of sample text data in the target corpus, calculating the second word frequency-inverse document frequency of each word in the sample text data; and calculating the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation. The second word frequency-inverse document frequency indirectly reflects the importance of each word in the sample text data. Therefore, the second word frequency-inverse document frequency of each word in the sample text data can be calculated first, and the second importance score of each word is then calculated from it and the corresponding second distribution deviation. The second importance score is the product of the second word frequency-inverse document frequency and the corresponding second distribution deviation. Exemplarily, the second importance score of each word in the sample text data is calculated according to the following formula:

s(w)=tf-idf_a(w)*b(w)

Wherein, s(w) represents the second importance score of the word w in the sample text data, tf-idf_a(w) represents the second word frequency-inverse document frequency of the word w in the sample text data, and b(w) represents the second distribution deviation of the word w in the sample text data.

Optionally, determining the second word frequency and the second inverse document frequency of each word in the sample text data respectively includes: calculating the second word frequency and the second inverse document frequency of each word in the sample text data according to the following formulas:

tf(w)=m

idf(w)=log(N/n)

Calculating the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency includes: calculating the second word frequency-inverse document frequency of each word in the sample text data according to the following formula:

tf-idf(w)=tf(w)*idf(w)

Wherein, w represents any word in the sample text data, tf(w) represents the second word frequency of the word w in the sample text data, idf(w) represents the second inverse document frequency of the word w in the sample text data, tf-idf(w) represents the second word frequency-inverse document frequency of the word w in the sample text data, m represents the number of times the word w appears in the sample text data, n represents the number of pieces of sample text data in the target corpus that contain the word w, and N represents the total number of pieces of sample text data contained in the target corpus.

Exemplarily, if the target corpus includes a total of 200 pieces of sample text data, then N=200. If the word "movement" appears twice in a certain piece of sample text data, then m=2; if 80 of the 200 pieces of sample text data contain the word "movement", then n=80. The second word frequency of the word "movement" in that sample text data is therefore tf(w)=m=2, the second inverse document frequency is idf(w)=log(N/n)=log(200/80)=0.398, and the second word frequency-inverse document frequency of the word "movement" in that sample text data is tf-idf(w)=tf(w)*idf(w)=2*0.398=0.796.
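The worked example can be checked with a short sketch. The value 0.398 implies a base-10 logarithm, so `math.log10` is used here; the function names are hypothetical.

```python
import math


def inverse_document_frequency(total_texts, texts_with_word):
    """idf(w) = log(N / n), using the base-10 logarithm implied by the example."""
    return math.log10(total_texts / texts_with_word)


def tf_idf(occurrences, total_texts, texts_with_word):
    """tf-idf(w) = tf(w) * idf(w), with tf(w) = m (raw occurrence count)."""
    return occurrences * inverse_document_frequency(total_texts, texts_with_word)


# Values from the example: N = 200, n = 80, m = 2.
idf_movement = inverse_document_frequency(200, 80)
tfidf_movement = tf_idf(2, 200, 80)
print(round(idf_movement, 3), round(tfidf_movement, 3))  # prints: 0.398 0.796
```

Note that tf(w) here is the raw count m rather than a count normalised by text length, which follows the formula tf(w)=m given above.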

In some embodiments, after constructing the dictionary tree based on at least one sample word sequence, the method further includes: determining the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences; and pruning the dictionary tree according to those numbers of occurrences until the number of nodes contained in the dictionary tree reaches a preset number. On the premise that the target word sequence corresponding to the target text data can still be accurately determined based on the dictionary tree, this setting effectively improves the search speed of the target word sequence and thereby the efficiency of text clustering. Exemplarily, in the dictionary tree shown in FIG. 2, in the order from the root node to the child nodes, the word "middle" in the first level appears 3 times at the same position in all sample word sequences, the word "quotient" in the first level appears once, and the word "behind" in the first level appears once; the word "quotient" in the second level appears 2 times, the word "big" in the second level appears once, and the word "middle" in the second level appears 2 times; the words "pin", "bu" and "han" in the third level each appear once. The dictionary tree is then pruned according to these numbers of occurrences until the number of nodes it contains reaches the preset number.
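The per-position counts in this example can be reproduced with a small sketch. It assumes that "the same position" means the zero-based index of the word within its sample word sequence, and it writes the five example sequences with the word labels used in this paragraph; `position_counts` is a hypothetical name.

```python
from collections import Counter


def position_counts(sequences):
    """Count how often each word occurs at each position over all sequences."""
    counts = Counter()
    for sequence in sequences:
        for position, word in enumerate(sequence):
            counts[(position, word)] += 1
    return counts


sequences = [
    ["middle", "quotient", "pin"],
    ["middle", "big", "bu"],
    ["middle", "quotient", "han"],
    ["quotient", "middle"],
    ["behind", "middle"],
]
counts = position_counts(sequences)
# counts[(0, "middle")] is the count for the first-level word "middle".
```

Position 0 corresponds to the first level of the dictionary tree, position 1 to the second level, and so on.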

Optionally, pruning the dictionary tree according to the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches a preset number, includes: in ascending order of the number of occurrences of the word of each node at the same position in all sample word sequences, sequentially deleting the nodes corresponding to the same number of occurrences in the dictionary tree, until the number of nodes contained in the dictionary tree reaches the preset number. Exemplarily, the nodes whose words appear once at the same position in all sample word sequences can be deleted first, then the nodes whose words appear 2 times at the same position in all sample word sequences, and so on, until the number of nodes contained in the dictionary tree reaches the preset number. Wherein, the nodes corresponding to the same number of occurrences in the dictionary tree can be deleted in sequence from the root node to the child nodes.
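One way to read this pruning rule is sketched below. Several points are assumptions made for illustration only: the tree is a nested dictionary with an empty root, deleting a node also removes its subtree, pruning stops as soon as the node count is at or below the preset number (so the final count may undershoot it), and all function names are hypothetical.

```python
from collections import Counter


def build_trie(sequences):
    """Nested-dict dictionary tree with an empty root node."""
    root = {}
    for sequence in sequences:
        node = root
        for word in sequence:
            node = node.setdefault(word, {})
    return root


def count_nodes(node):
    """Number of word nodes below `node`."""
    return sum(1 + count_nodes(child) for child in node.values())


def prune(root, sequences, preset_number):
    """Delete nodes in ascending order of position count, shallow nodes first."""
    counts = Counter((i, w) for seq in sequences for i, w in enumerate(seq))
    candidates = []

    def walk(node, path):
        for word, child in node.items():
            full_path = path + (word,)
            candidates.append((counts[(len(path), word)], len(path), full_path))
            walk(child, full_path)

    walk(root, ())
    candidates.sort()  # by count, then depth (root side first), then path
    for _, _, path in candidates:
        if count_nodes(root) <= preset_number:
            break
        parent = root
        for word in path[:-1]:
            parent = parent.get(word)
            if parent is None:  # an ancestor was already deleted
                break
        if parent is not None:
            parent.pop(path[-1], None)  # removes the node and its subtree
    return root


sequences = [
    ["middle", "quotient", "pin"],
    ["middle", "big", "bu"],
    ["middle", "quotient", "han"],
    ["quotient", "middle"],
    ["behind", "middle"],
]
pruned = prune(build_trie(sequences), sequences, preset_number=5)
```

With a preset number of 5, this sketch removes the first-level nodes "behind" and "quotient" and then the second-level node "big", leaving the four nodes on the paths middle, quotient, pin and middle, quotient, han.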

FIG. 3 is a flowchart of a text clustering method in another embodiment of the present disclosure. As shown in FIG. 3 , the method includes the following steps:

Step 310: Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data.

Step 320: Obtain parameter configuration information corresponding to the pre-built dictionary tree; wherein, the parameter configuration information includes an inverse document frequency list and a distribution deviation list; the inverse document frequency list includes the inverse document frequency of each word contained in the dictionary tree, and the distribution deviation list includes the distribution deviation of each word contained in the dictionary tree.

Step 330: For each piece of target text data in the target text data set, determine the number of occurrences of each word in the target text data, and use the number of occurrences as the first word frequency of the corresponding word.

Step 340: In the inverse document frequency list, search for the inverse document frequency corresponding to at least one word in the target text data, as the first inverse document frequency of the at least one word in the target text data.

Step 350: Calculate the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency; wherein the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.

Step 360: In the distribution deviation list, search for the distribution deviation corresponding to each word in the target text data, as the first distribution deviation of each word in the target text data.

Step 370: Calculate the first importance score of each word in the target text data according to each first word frequency-inverse document frequency and the corresponding first distribution deviation; wherein, the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.

Step 380: Rank at least one word in the target text data based on the first importance score, and generate a to-be-searched word sequence corresponding to the target text data.

Step 390: For each word sequence to be searched, in the pre-built dictionary tree, search for a target word sequence adapted to the word sequence to be searched in the order from the root node to the child nodes.

Step 3100: Cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.

The technical solution of the embodiment of the present disclosure calculates the importance score of each word by determining the word frequency, inverse document frequency and distribution deviation of each word in the text data to be clustered, sorts at least one word in the text data to be clustered based on the importance score to generate a word sequence to be searched, and then searches the pre-built dictionary tree for a target word sequence adapted to the word sequence to be searched, so that the text data is clustered based on the target word sequence. This simplifies the text clustering process, greatly reduces the time complexity of text clustering, and effectively improves the efficiency and accuracy of text clustering.
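Steps 330 to 380 can be sketched as a single function. This is an illustration under stated assumptions: the two lists of the parameter configuration information are modelled as plain dictionaries, words missing from either list are skipped, sorting is in descending order of score (mirroring how sample word sequences are built), and the name `word_sequence_to_search` and all example values are hypothetical.

```python
from collections import Counter


def word_sequence_to_search(words, idf_list, deviation_list):
    """Score the words of one target text and sort them by first importance score."""
    first_word_frequency = Counter(words)  # step 330: occurrences per word
    scores = {}
    for word, frequency in first_word_frequency.items():
        if word not in idf_list or word not in deviation_list:
            continue  # assumption: words unknown to the dictionary tree are skipped
        tf_idf = frequency * idf_list[word]           # steps 340-350
        scores[word] = tf_idf * deviation_list[word]  # steps 360-370
    # Step 380: highest score first; ties broken alphabetically (an assumption).
    return sorted(scores, key=lambda w: (-scores[w], w))


# Hypothetical parameter configuration information.
idf_list = {"goal": 0.5, "match": 0.4, "the": 0.01}
deviation_list = {"goal": 2.0, "match": 3.0, "the": 0.1}
result = word_sequence_to_search(
    ["goal", "the", "goal", "match"], idf_list, deviation_list
)
```

Because the inverse document frequency and distribution deviation are looked up rather than recomputed, scoring a new text only requires counting its own words.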

FIG. 4 is a flowchart of a text clustering method in another embodiment of the present disclosure. As shown in FIG. 4 , the method includes the following steps:

Step 410: Obtain a total corpus and a target corpus; wherein, the total corpus includes a target corpus, and the target corpus includes at least one piece of sample text data.

Step 420: Calculate the second distribution deviation of each word contained in the target corpus relative to the total corpus.

Optionally, calculating the second distribution deviation of each word included in the target corpus relative to the total corpus includes: calculating the second distribution deviation of each word included in the target corpus relative to the total corpus according to the following formula:

Wherein, b represents the second distribution deviation of the word w in the target corpus relative to the total corpus, freq_a(w) represents the frequency of the word w in the target corpus, freq(w) represents the frequency of the word w in the total corpus, t represents the number of occurrences of the word w in the target corpus, M represents the total number of words contained in the target corpus, t' represents the number of occurrences of the word w in the total corpus, and M' represents the total number of words contained in the total corpus.

Step 430: For each piece of sample text data in the target corpus, calculate the second importance score of each word according to the second distribution deviation of each word in the sample text data, and sort at least one word in each piece of sample text data in descending order of the second importance score to generate a sample word sequence corresponding to the sample text data.

Optionally, for each piece of sample text data in the target corpus, calculating the second importance score of each word according to the second distribution deviation of each word in the sample text data includes: for each piece of sample text data in the target corpus, calculating the second word frequency-inverse document frequency of each word in the sample text data; and calculating the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation.

Optionally, calculating the second word frequency-inverse document frequency of each word in the sample text data includes: determining the second word frequency and the second inverse document frequency of each word in the sample text data respectively; and calculating the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency.

Optionally, determining the second word frequency and the second inverse document frequency of each word in the sample text data respectively includes: calculating the second word frequency and the second inverse document frequency of each word in the sample text data according to the following formula:

tf(w)=m

idf(w)=log(N/n)

Calculating the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency includes: calculating the second word frequency-inverse document frequency of each word in the sample text data according to the following formula:

tf-idf(w)=tf(w)*idf(w)

Wherein, w represents any word in the sample text data, tf(w) represents the second word frequency of the word w in the sample text data, idf(w) represents the second inverse document frequency of the word w in the sample text data, tf-idf(w) represents the second word frequency-inverse document frequency of the word w in the sample text data, m represents the number of times the word w appears in the sample text data, n represents the number of pieces of sample text data in the target corpus that contain the word w, and N represents the total number of pieces of sample text data contained in the target corpus.

Optionally, calculating the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation includes:

Calculating the second importance score of each word in the sample text data according to the following formula:

s(w)=tf-idf_a(w)*b(w)

Wherein, s(w) represents the second importance score of the word w in the sample text data, tf-idf_a(w) represents the second word frequency-inverse document frequency of the word w in the sample text data, and b(w) represents the second distribution deviation of the word w in the sample text data.

Step 440, construct a dictionary tree based on at least one sample word sequence.

Step 450: Determine the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences.

Step 460: In ascending order of the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, sequentially delete the nodes corresponding to the same number of occurrences in the dictionary tree, until the number of nodes contained in the dictionary tree reaches the preset number.

Step 470: Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data.

Step 480: For each piece of target text data in the target text data set, calculate the first importance score of at least one word in the target text data, and sort the at least one word in the target text data based on the first importance score to generate a word sequence to be searched corresponding to the target text data.

Step 490: For each word sequence to be searched, search the pre-built dictionary tree, in the order from the root node to the child nodes, for a target word sequence adapted to the word sequence to be searched; wherein, the target word sequence belongs to a subsequence of the word sequence to be searched.

Step 4100: Cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.

The text clustering method provided by the embodiment of the present disclosure builds a dictionary tree matching the target corpus, prunes the dictionary tree, then calculates the importance score of at least one word in the text data to be clustered, sorts the at least one word based on the importance score to generate a word sequence to be searched, and searches the dictionary tree for a target word sequence adapted to the word sequence to be searched, so that the text data is clustered based on the target word sequence. Pruning reduces the depth of the dictionary tree. On the premise that the target word sequence corresponding to the target text data can still be accurately determined based on the dictionary tree, this effectively improves the search speed of the target word sequence, greatly reduces the time complexity of text clustering, and effectively improves the efficiency and accuracy of text clustering.
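A minimal sketch of step 420 follows. It assumes that the second distribution deviation is the ratio of a word's relative frequency in the target corpus to its relative frequency in the total corpus, i.e. b = freq_a(w)/freq(w) with freq_a(w) = t/M and freq(w) = t'/M'. Only the symbols t, M, t' and M' come from the description; the ratio form, the function name and the example corpora are assumptions made for illustration.

```python
from collections import Counter


def distribution_deviation(target_texts, total_texts):
    """Return an assumed deviation b(w) for every word of the target corpus.

    Each text is a list of words. `total_texts` must include `target_texts`,
    as the total corpus contains the target corpus.
    """
    target_counts = Counter(w for text in target_texts for w in text)  # t
    total_counts = Counter(w for text in total_texts for w in text)    # t'
    target_size = sum(target_counts.values())                          # M
    total_size = sum(total_counts.values())                            # M'
    deviation = {}
    for word, t in target_counts.items():
        freq_a = t / target_size
        freq = total_counts[word] / total_size
        deviation[word] = freq_a / freq  # assumed ratio form
    return deviation


# Hypothetical corpora: the total corpus is the target corpus plus other texts.
target_corpus = [["goal", "match"], ["goal"]]
total_corpus = target_corpus + [["news", "match"], ["news", "news", "weather"]]
b = distribution_deviation(target_corpus, total_corpus)
```

Under this assumption, a word that is much more frequent in the target corpus than in the total corpus receives a large deviation, which raises its importance score.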

FIG. 5 is a flowchart of a text clustering method in another embodiment of the present disclosure. As shown in FIG. 5 , the method includes the following steps:

Step 510: Obtain a total corpus and a target corpus; wherein, the total corpus includes a target corpus, and the target corpus includes at least one piece of sample text data.

Step 520: Calculate the second distribution deviation of each word contained in the target corpus relative to the total corpus.

Step 530: For each piece of sample text data in the target corpus, determine the second word frequency and the second inverse document frequency of each word in the sample text data, respectively.

Step 540: Calculate the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency.

Step 550: Calculate the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation.

Step 560: Sort at least one word in each piece of sample text data in descending order of the second importance score to generate a sample word sequence corresponding to the sample text data.

Step 570, construct a dictionary tree based on at least one sample word sequence.

Step 580: Store a distribution deviation list composed of at least one second distribution deviation and an inverse document frequency list composed of at least one second inverse document frequency as parameter configuration information of the dictionary tree.

Step 590: Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data.

Step 5100: For each piece of target text data in the target text data set, determine the number of occurrences of each word in the target text data, and use the number of occurrences as the first word frequency of each word.

Step 5110: In the inverse document frequency list, search for the inverse document frequency corresponding to each word in the target text data, as the first inverse document frequency of each word in the target text data.

Step 5120: Calculate the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency; wherein the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.

Step 5130: In the distribution deviation list, search for the distribution deviation corresponding to each word in the target text data, as the first distribution deviation of each word in the target text data.

Step 5140: Calculate the first importance score of each word in the target text data according to each first word frequency-inverse document frequency and the corresponding first distribution deviation; wherein, the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.

Step 5150: Sort at least one word in the target text data based on the first importance score, and generate a to-be-searched word sequence corresponding to the target text data.

Step 5160: For each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched; wherein the target word sequence belongs to a subsequence of the word sequence to be searched.

Step 5170: Cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.

The text clustering method provided by the embodiments of the present disclosure clusters text data based on a dictionary tree, which simplifies the process of text clustering, greatly reduces the time complexity of text clustering, and effectively improves the efficiency and accuracy of text clustering.
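The search in step 5160 can be illustrated with a greedy walk from the root node towards the child nodes. The matching rule is an assumption: this section only states that the target word sequence is a subsequence of the word sequence to be searched, so the sketch simply descends whenever the next word of the sequence is a child of the current node and skips the word otherwise. The nested-dictionary tree and the name `find_target_sequence` are hypothetical.

```python
def find_target_sequence(trie, words_to_search):
    """Walk the tree from the root, keeping the words that extend the path."""
    node = trie
    target = []
    for word in words_to_search:
        child = node.get(word)
        if child is not None:
            target.append(word)  # word continues a path in the tree
            node = child
        # otherwise the word is skipped, so `target` stays a subsequence
    return target


# Hypothetical dictionary tree with an empty root node.
trie = {
    "middle": {"quotient": {"pin": {}, "han": {}}, "big": {"bu": {}}},
    "quotient": {"middle": {}},
    "behind": {"middle": {}},
}
target = find_target_sequence(trie, ["middle", "sale", "quotient", "han"])
```

Each word of the sequence is examined once, so the cost of one search grows with the length of the word sequence rather than with the number of texts already clustered.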

FIG. 6 is a schematic structural diagram of a text clustering apparatus according to another embodiment of the present disclosure. As shown in FIG. 6 , the apparatus includes: a text data acquisition module 610 , a search word sequence generation module 620 , a target word sequence determination module 630 and a text clustering module 640 .

The text data acquisition module 610 is configured to acquire the target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data;

The search word sequence generation module 620 is configured to, for each piece of target text data in the target text data set, calculate the first importance score of at least one word in the target text data, and sort the at least one word in the target text data based on the first importance score to generate a word sequence to be searched corresponding to the target text data;

The target word sequence determination module 630 is configured to, for each word sequence to be searched, search a pre-built dictionary tree for a target word sequence adapted to the word sequence to be searched; wherein, the target word sequence belongs to a subsequence of the word sequence to be searched;

The text clustering module 640 is configured to cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.

In this embodiment of the present disclosure, a target text data set to be clustered is obtained, wherein the target text data set includes at least one piece of target text data; for each piece of target text data in the target text data set, the first importance score of at least one word in the target text data is calculated, and the at least one word in the target text data is sorted based on the first importance score to generate a word sequence to be searched corresponding to the target text data; for each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched, wherein the target word sequence belongs to a subsequence of the word sequence to be searched; and the target text data corresponding to at least one target word sequence is clustered according to the at least one target word sequence to obtain a text clustering result. The text clustering apparatus provided by the embodiment of the present disclosure calculates the importance score of each word in the text data to be clustered, sorts at least one word in the text data to be clustered based on the importance score to generate a word sequence to be searched, and then searches the pre-built dictionary tree for a target word sequence adapted to the word sequence to be searched, so that the text data is clustered based on the target word sequence. This simplifies the text clustering process, greatly reduces the time complexity of text clustering, and effectively improves the efficiency and accuracy of text clustering.
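The final clustering step performed by the text clustering module can be sketched as a grouping operation. The sketch assumes that texts whose target word sequences are identical fall into the same cluster; the function name and the example data are hypothetical.

```python
from collections import defaultdict


def cluster_by_target_sequence(texts, target_sequences):
    """Group texts that share the same target word sequence.

    `texts[i]` is a piece of target text data and `target_sequences[i]`
    is the target word sequence found for it in the dictionary tree.
    """
    clusters = defaultdict(list)
    for text, sequence in zip(texts, target_sequences):
        clusters[tuple(sequence)].append(text)  # tuples are hashable dict keys
    return dict(clusters)


# Hypothetical texts and the target word sequences found for them.
texts = ["text A", "text B", "text C"]
found = [["middle", "quotient"], ["behind", "middle"], ["middle", "quotient"]]
clusters = cluster_by_target_sequence(texts, found)
```

Grouping by a dictionary key takes a single pass over the texts, with no pairwise similarity computation between them.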

Optionally, the search word sequence generation module includes:

a first word frequency-inverse document frequency calculation unit, configured to calculate the first word frequency-inverse document frequency of at least one word in the target text data for each piece of target text data in the target text data set;

The first importance score calculation unit is configured to calculate the first importance score of at least one word in the target text data according to at least one first word frequency-inverse document frequency, respectively.

Optionally, the first word frequency-inverse document frequency calculation unit includes:

a first frequency determination subunit, configured to respectively determine the first word frequency and the first inverse document frequency of each word in the target text data;

a first word frequency-inverse document frequency calculation subunit, configured to calculate the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency; wherein, the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.

Optionally, the first frequency determination subunit is set to:

Determine the number of occurrences of each word in the target text data, and use the number of occurrences as the first word frequency of the corresponding word;

Acquiring parameter configuration information corresponding to the dictionary tree; wherein the parameter configuration information includes an inverse document frequency list, and the inverse document frequency list includes the inverse document frequency of each word contained in the dictionary tree;

In the inverse document frequency list, the inverse document frequency corresponding to each word in the target text data is respectively searched as the first inverse document frequency of each word in the target text data.

Optionally, the parameter configuration information further includes a distribution deviation list; wherein, the distribution deviation list includes the distribution deviation of each word contained in the dictionary tree;

The device also includes:

The distribution deviation determination module is configured to, before the first importance score of at least one word in the target text data is calculated according to the at least one first word frequency-inverse document frequency, search the distribution deviation list for the distribution deviation corresponding to each word in the target text data, as the first distribution deviation of each word in the target text data;

The first importance score calculation unit is set to:

Calculate the first importance score of each word in the target text data according to each first word frequency-inverse document frequency and the corresponding first distribution deviation; wherein, the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.

Optionally, the target word sequence determination module is set to:

For each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched in the order from the root node to the child node.

Optionally, the device further includes:

The corpus acquisition module is configured to acquire a total corpus and a target corpus before acquiring the target text data set to be clustered; wherein, the total corpus includes the target corpus, and the target corpus contains at least one piece of sample text data;

a distribution deviation calculation module, configured to calculate the second distribution deviation of each word contained in the target corpus relative to the total corpus;

The sample word sequence generation module is configured to, for each piece of sample text data in the target corpus, calculate the second importance score of each word according to the second distribution deviation of each word in the sample text data, and sort at least one word in each piece of sample text data in descending order of the second importance score to generate a sample word sequence corresponding to the sample text data;

A dictionary tree building module, configured to build the dictionary tree based on at least one sample word sequence.

Optionally, the sample word sequence generation module includes:

A second word frequency-inverse document frequency calculation unit, configured to calculate the second word frequency-inverse document frequency of each word in the sample text data for each piece of sample text data in the target corpus;

The second importance score calculation unit is configured to calculate the second importance score of each word in the sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation, respectively.

Optionally, the second word frequency-inverse document frequency calculation unit includes:

A second frequency determination subunit, configured to respectively determine the second word frequency and the second inverse document frequency of each word in the sample text data;

The second word frequency-inverse document frequency calculation subunit is configured to calculate the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency.

Optionally, the second frequency determination subunit is set to:

Calculate the second word frequency and the second inverse document frequency of each word in the sample text data according to the following formulas:

tf(w)=m

idf(w)=log(N/n)

The second word frequency-inverse document frequency calculation subunit is set to:

Calculate the second word frequency-inverse document frequency of each word in the sample text data according to the following formula:

tf-idf(w)=tf(w)*idf(w)

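Optionally, the second importance score calculation unit is set to:

Calculate the second importance score of each word in the sample text data according to the following formula:

s(w)=tf-idf_a(w)*b(w)

Wherein, s(w) represents the second importance score of the word w in the sample text data, tf-idf_a(w) represents the second word frequency-inverse document frequency of the word w in the sample text data, and b(w) represents the second distribution deviation of the word w in the sample text data.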

Optionally, the distribution deviation calculation module is set to:

Calculate the second distribution deviation of each word contained in the target corpus relative to the total corpus according to the following formula:

Optionally, the device further includes:

A number of occurrence determination module, configured to determine the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences after constructing the dictionary tree based on at least one sample word sequence;

The dictionary tree pruning module is configured to prune the dictionary tree according to the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches the preset number.

Optionally, the dictionary tree pruning module is set to:

According to the order of occurrences of the words of each node in the dictionary tree at the same position in all sample word sequences from small to large, delete the nodes corresponding to the same occurrence in the dictionary tree in turn, until the dictionary tree contains The number of nodes reaches the preset number.
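The pruning step can be sketched in Python as follows. This is a simplified illustration, not the disclosed implementation: the names `TrieNode`, `insert`, `count_nodes`, and `prune` are hypothetical, each node's `count` stands for the number of sample word sequences that pass through it, and nodes are removed one leaf at a time (lowest count first, ties broken alphabetically) so that the tree stays connected. Removing leaves first is an assumption of this sketch.

```python
class TrieNode:
    def __init__(self):
        self.count = 0        # occurrences of this word at this position
        self.children = {}    # word -> TrieNode

def insert(root, words):
    """Add one sample word sequence, updating per-node occurrence counts."""
    node = root
    for word in words:
        node = node.children.setdefault(word, TrieNode())
        node.count += 1

def count_nodes(node):
    """Number of nodes below `node` (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.children.values())

def prune(root, max_nodes):
    """Delete lowest-count leaves until at most `max_nodes` nodes remain."""
    max_nodes = max(max_nodes, 0)
    while count_nodes(root) > max_nodes:
        leaves = []
        stack = [root]
        while stack:
            node = stack.pop()
            for word, child in node.children.items():
                if child.children:
                    stack.append(child)
                else:
                    leaves.append((child.count, word, node))
        _, word, parent = min(leaves, key=lambda leaf: (leaf[0], leaf[1]))
        del parent.children[word]

root = TrieNode()
for seq in [["a", "b"], ["a", "c"], ["a", "b"], ["d"]]:
    insert(root, seq)
prune(root, 2)   # keeps the two most frequently used nodes: a -> b
```

Because a child node can never be visited more often than its parent, the globally rarest node is always a leaf or has only equally rare descendants, so leaf-first removal still deletes nodes in ascending order of occurrence count.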

The foregoing apparatus can execute the methods provided by all the foregoing embodiments of the present disclosure, and has functional modules corresponding to executing the foregoing methods. For technical details that are not described in detail in the embodiments of the present disclosure, reference may be made to the methods provided by all the foregoing embodiments of the present disclosure.

Referring next to FIG. 7, it shows a schematic structural diagram of an electronic device 300 suitable for implementing an embodiment of the present disclosure. The electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablets (PADs), portable multimedia players (PMPs), and in-vehicle terminals (such as in-vehicle navigation terminals); stationary terminals such as digital televisions (TVs) and desktop computers; and various forms of servers, such as independent servers or server clusters. The electronic device shown in FIG. 7 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.

As shown in FIG. 7, the electronic device 300 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 301, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 302 or a program loaded from a storage device 308 into a random access memory (RAM) 303. The RAM 303 also stores various programs and data required for the operation of the electronic device 300. The processing device 301, the ROM 302, and the RAM 303 are connected to each other through a bus 304. An input/output (I/O) interface 305 is also connected to the bus 304.

Typically, the following devices can be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 307 including, for example, a liquid crystal display (LCD), speaker, vibrator, etc.; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication device 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 7 illustrates the electronic device 300 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from the network via the communication device 309, or installed from the storage device 308, or installed from the ROM 302. When the computer program is executed by the processing device 301, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.

It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to, an electrical connection having at least one wire, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

In this disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. Program code embodied on a computer-readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, radio frequency (RF), etc., or any suitable combination of the foregoing.

In some embodiments, the client and server can communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.

The computer-readable medium carries at least one program, and when the at least one program is executed by the electronic device, the electronic device: acquires a target text data set to be clustered, wherein the target text data set includes at least one piece of target text data; for each piece of target text data in the target text data set, calculates the first importance score of each word in the target text data, and sorts the words in the target text data based on the first importance score to generate a to-be-searched word sequence corresponding to the target text data; for each to-be-searched word sequence, searches a pre-built dictionary tree for a target word sequence adapted to the to-be-searched word sequence, wherein the target word sequence is a subsequence of the to-be-searched word sequence; and clusters the corresponding target text data according to each target word sequence to obtain a text clustering result.
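The final clustering step of this flow can be sketched as a simple grouping: pieces of target text data whose target word sequences are identical land in the same cluster. The sketch below is illustrative only; the function name `cluster_by_target_sequence` and the sample data are hypothetical, and it assumes the target word sequences have already been found in the dictionary tree.

```python
from collections import defaultdict

def cluster_by_target_sequence(texts, target_sequences):
    """Group texts by their target word sequence.

    texts: list of target text data (strings).
    target_sequences: list of word lists, one per text, in the same order.
    Returns {tuple(target word sequence): [texts in that cluster]}.
    """
    clusters = defaultdict(list)
    for text, sequence in zip(texts, target_sequences):
        clusters[tuple(sequence)].append(text)   # tuples are hashable dict keys
    return dict(clusters)

texts = ["flight delayed again", "my flight was delayed", "cheap hotel deals"]
target_sequences = [["flight", "delayed"], ["flight", "delayed"], ["hotel"]]
clusters = cluster_by_target_sequence(texts, target_sequences)
```

Using the target word sequence itself as the cluster key also gives each cluster a readable label.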

Computer program code for performing operations of the present disclosure may be written in one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of code that contains at least one executable instruction for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It is also noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented in dedicated hardware-based systems that perform the specified functions or operations, or can be implemented in a combination of dedicated hardware and computer instructions.

The units involved in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself.

The functions described herein above may be performed, at least in part, by at least one hardware logic component. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include an electrical connection based on at least one wire, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

According to at least one embodiment of the present disclosure, the present disclosure provides a text clustering method, including:

Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data;

For each piece of target text data in the target text data set, calculate a first importance score of at least one word in the target text data, sort the at least one word in the target text data based on the first importance score, and generate a sequence of words to be searched corresponding to the target text data;

For each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to the word sequence to be searched; wherein, the target word sequence is a subsequence of the word sequence to be searched;

The target text data corresponding to each target word sequence is clustered according to the target word sequence to obtain a text clustering result.
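The dictionary tree search can be sketched as a greedy walk from the root: the words of the to-be-searched sequence are visited in order, and the walk descends whenever the current word is a child of the current node. This is only one plausible reading of the search step, not a confirmed detail of the disclosure; the nested-dict tree representation and the names `build_trie` and `find_target_sequence` are hypothetical.

```python
def build_trie(sample_sequences):
    """Build a dictionary tree as nested dicts: word -> subtree."""
    root = {}
    for sequence in sample_sequences:
        node = root
        for word in sequence:
            node = node.setdefault(word, {})
    return root

def find_target_sequence(trie, words_to_search):
    """Walk from the root; keep each word that is a child of the current node.

    The returned list is both a root-to-node path in the tree and a
    subsequence of `words_to_search` (words that do not match are skipped).
    """
    node, matched = trie, []
    for word in words_to_search:
        if word in node:
            matched.append(word)
            node = node[word]
    return matched

trie = build_trie([["flight", "delay"], ["hotel", "cheap"]])
found = find_target_sequence(trie, ["flight", "cheap", "delay"])
missing = find_target_sequence(trie, ["refund"])
```

In this example "cheap" is skipped because it is not a child of the "flight" node, so the target word sequence is a subsequence rather than a prefix of the searched sequence.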

Optionally, for each piece of target text data in the target text data set, calculating the first importance score of at least one word in the target text data, including:

For each piece of target text data in the target text data set, calculate the first word frequency-inverse document frequency of at least one word in the target text data respectively;

A first importance score of at least one word in the target text data is calculated according to at least one first word frequency-inverse document frequency, respectively.

Optionally, respectively calculating the first word frequency-inverse document frequency of at least one word in the target text data, including:

respectively determining the first word frequency and the first inverse document frequency of each word in the target text data;

Calculate the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency; wherein, the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.
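As a hedged illustration, the product above can be computed and used to order the words of one piece of target text data. The sketch assumes that inverse document frequencies are looked up in a precomputed table (here the hypothetical `idf_table`), that words missing from the table get a default of 0.0, and that words are sorted in descending order of score; the function name `words_to_search` is also hypothetical.

```python
from collections import Counter

def words_to_search(tokens, idf_table, default_idf=0.0):
    """Return the distinct words of one text, highest tf * idf first.

    tokens: list of words in one piece of target text data.
    idf_table: {word: inverse document frequency}, assumed precomputed.
    """
    counts = Counter(tokens)                       # first word frequency
    scores = {
        word: tf * idf_table.get(word, default_idf)
        for word, tf in counts.items()
    }
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda word: (-scores[word], word))

idf_table = {"flight": 1.0, "delay": 1.5, "the": 0.1}
sequence = words_to_search(["flight", "delay", "flight", "the"], idf_table)
```

Here "flight" scores 2 * 1.0 = 2.0, "delay" scores 1.5, and "the" scores 0.1, which fixes the order of the resulting word sequence.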

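Optionally, respectively determining the first word frequency and the first inverse document frequency of each word in the target text data, including:

Determine the number of occurrences of each word in the target text data, and use the number of occurrences as the first word frequency of the corresponding word;

Acquire parameter configuration information corresponding to the dictionary tree; wherein, the parameter configuration information includes an inverse document frequency list, and the inverse document frequency list includes the inverse document frequency of each word contained in the dictionary tree;

In the inverse document frequency list, respectively find the inverse document frequency corresponding to each word in the target text data as the first inverse document frequency of each word in the target text data.

Optionally, the parameter configuration information further includes a distribution deviation list; wherein, the distribution deviation list includes the distribution deviation of each word contained in the dictionary tree;

Before calculating the first importance score of at least one word in the target text data according to at least one first word frequency-inverse document frequency, the method further includes:

In the distribution deviation list, respectively find the distribution deviation corresponding to each word in the target text data as the first distribution deviation of each word in the target text data;

Calculating the first importance score of at least one word in the target text data according to at least one first word frequency-inverse document frequency, including:

Calculate the first importance score of each word in the target text data according to each first word frequency-inverse document frequency and the corresponding first distribution deviation; wherein, the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.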

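Optionally, for each word sequence to be searched, searching a pre-built dictionary tree for a target word sequence adapted to the word sequence to be searched, including:

For each word sequence to be searched, search the pre-built dictionary tree, in order from the root node to the child nodes, for a target word sequence adapted to the word sequence to be searched.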

Optionally, before acquiring the target text data set to be clustered, the method further includes:

Acquiring a total corpus and a target corpus; wherein, the total corpus includes the target corpus, and the target corpus contains at least one piece of sample text data;

calculating the second distribution deviation of each word contained in the target corpus relative to the total corpus;

For each piece of sample text data in the target corpus, calculate the second importance score of the corresponding word according to the second distribution deviation of each word in the sample text data, and sort at least one word in each piece of sample text data in descending order of the second importance score to generate a sample word sequence corresponding to the sample text data;

The dictionary tree is constructed based on at least one sample word sequence.
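Constructing the dictionary tree from the sample word sequences can be sketched as ordinary prefix-tree insertion, where sequences that share leading words share nodes. The node layout (a dict with `count` and `children` keys) and the function name `build_dictionary_tree` are assumptions made for this illustration; the per-node `count` is kept because later pruning relies on how often a word occurs at a given position.

```python
def build_dictionary_tree(sample_sequences):
    """Insert each sample word sequence into a prefix tree.

    Each node is {"count": int, "children": {word: node}}; `count` is the
    number of sample word sequences that pass through the node.
    """
    root = {"count": 0, "children": {}}
    for sequence in sample_sequences:
        node = root
        for word in sequence:
            node = node["children"].setdefault(
                word, {"count": 0, "children": {}}
            )
            node["count"] += 1
    return root

tree = build_dictionary_tree([
    ["flight", "delay"],
    ["flight", "refund"],
    ["hotel"],
])
```

Because the words of each sample word sequence are already sorted by importance, the most important words sit near the root and are shared by many sequences.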

Optionally, for each piece of sample text data in the target corpus, calculate the second importance score of the corresponding word according to the second distribution deviation of each word in the sample text data, including:

For each piece of sample text data in the target corpus, calculate the second word frequency-inverse document frequency of each word in the sample text data;

A second importance score of each word in the sample text data is calculated according to each second word frequency-inverse document frequency and the corresponding second distribution deviation, respectively.

Optionally, separately calculating the second word frequency-inverse document frequency of each word in the sample text data, including:

respectively determining the second word frequency and the second inverse document frequency of each word in the sample text data;

The second word frequency-inverse document frequency of the corresponding word in the sample text data is calculated according to the second word frequency and the second inverse document frequency.

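Optionally, respectively determining the second word frequency and the second inverse document frequency of each word in the sample text data, including:

Calculate the second word frequency and the second inverse document frequency of each word in the sample text data according to the following formulas:

tf(w)=m

idf(w)=log(N/n)

Calculating the second word frequency-inverse document frequency of the corresponding word in the sample text data according to the second word frequency and the second inverse document frequency, including:

Calculate the second word frequency-inverse document frequency of each word in the sample text data according to the following formula:

tf-idf(w)=tf(w)*idf(w)

Wherein, w represents any word in the sample text data, tf(w) represents the second word frequency of the word w in the sample text data, idf(w) represents the second inverse document frequency of the word w in the sample text data, m represents the number of occurrences of the word w in the target corpus, n represents the number of pieces of sample text data containing the word w in the target corpus, and N represents the total number of pieces of sample text data contained in the target corpus.

Optionally, the second importance score of each word in the sample text data is calculated according to the following formula:

[formula not reproduced in the source text]

Wherein, s(w) represents the second importance score of the word w in the sample text data, tf-idf_a(w) represents the second word frequency-inverse document frequency of the word w in the sample text data, and the remaining symbol of the formula represents the second distribution deviation of the word w in the sample text data.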

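Optionally, calculating the second distribution deviation of each word contained in the target corpus relative to the total corpus, including:

Calculate the second distribution deviation of each word contained in the target corpus relative to the total corpus according to the following formula:

[formula not reproduced in the source text]

Wherein, b represents the second distribution deviation of the word w in the target corpus relative to the total corpus, freq_a(w) represents the frequency of the word w in the target corpus, freq(w) represents the frequency of the word w in the total corpus, t represents the number of occurrences of the word w in the target corpus, M represents the total number of words contained in the target corpus, t' represents the number of occurrences of the word w in the total corpus, and M' represents the total number of words contained in the total corpus.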

Optionally, after constructing the dictionary tree based on each sample word sequence, the method further includes:

Determine the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences;

The dictionary tree is pruned according to the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, until the number of nodes included in the dictionary tree reaches a preset number.

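Optionally, pruning the dictionary tree according to the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches a preset number, including:

In ascending order of the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, delete the nodes corresponding to each number of occurrences from the dictionary tree in turn, until the number of nodes contained in the dictionary tree reaches the preset number.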

Claims

A text clustering method comprising:

Obtain a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data;

For each piece of target text data in the target text data set, calculate a first importance score of at least one word in each piece of target text data, and sort the at least one word in each piece of target text data based on the first importance score to generate a to-be-searched word sequence corresponding to each piece of target text data;

For each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to each word sequence to be searched; wherein, the target word sequence is a subsequence of each word sequence to be searched;

The target text data corresponding to the at least one target word sequence is clustered according to the at least one target word sequence, respectively, to obtain a text clustering result.
The method according to claim 1, wherein, for each piece of target text data in the target text data set, calculating the first importance score of at least one word in the each piece of target text data, comprising:

For each piece of target text data in the target text data set, calculate the first word frequency-inverse document frequency of at least one word in each piece of target text data;

Calculate the first importance score of at least one word in each piece of target text data according to at least one first word frequency-inverse document frequency, respectively.
The method according to claim 2, wherein calculating the first word frequency-inverse document frequency of at least one word in each piece of target text data respectively comprises:

Respectively determine the first word frequency and the first inverse document frequency of each word in each piece of target text data;

Calculate the first word frequency-inverse document frequency of the corresponding word according to the first word frequency and the first inverse document frequency; wherein, the first word frequency-inverse document frequency is the product of the first word frequency and the first inverse document frequency.
The method according to claim 3, wherein, respectively determining the first word frequency and the first inverse document frequency of each word in each piece of target text data, comprising:

Determine the number of occurrences of each word in each piece of target text data, and use the number of occurrences as the first word frequency of the corresponding word;

Acquiring parameter configuration information corresponding to the dictionary tree; wherein the parameter configuration information includes an inverse document frequency list, and the inverse document frequency list includes the inverse document frequency of each word contained in the dictionary tree;

In the inverse document frequency list, search for the inverse document frequency corresponding to each word in each piece of target text data, as the first inverse document frequency of each word in each piece of target text data .
The method according to claim 4, wherein the parameter configuration information further includes a distribution deviation list; wherein, the distribution deviation list includes the distribution deviation of each word included in the dictionary tree;

Before calculating the first importance score of at least one word in each piece of target text data according to at least one first word frequency-inverse document frequency, the method further includes:

In the distribution deviation list, find the distribution deviation corresponding to each word in each piece of target text data, as the first distribution deviation of each word in each piece of target text data;

Calculate the first importance score of at least one word in each piece of target text data according to at least one first word frequency-inverse document frequency, including:

Calculate the first importance score of each word in each piece of target text data according to each first word frequency-inverse document frequency and the corresponding first distribution deviation; wherein, the first importance score is the product of the first word frequency-inverse document frequency and the first distribution deviation.
The method according to claim 1, wherein, for each to-be-searched word sequence, searching a pre-built dictionary tree for a target word sequence adapted to the to-be-searched word sequence, comprising:

For each word sequence to be searched, a pre-built dictionary tree is searched for a target word sequence adapted to each word sequence to be searched in the order from the root node to the child node.
The method according to claim 1, before acquiring the target text data set to be clustered, also comprising:

Acquiring a total corpus and a target corpus; wherein, the total corpus includes the target corpus, and the target corpus contains at least one piece of sample text data;

calculating the second distribution deviation of each word contained in the target corpus relative to the total corpus;

For each piece of sample text data in the target corpus, calculate the second importance score of the corresponding word according to the second distribution deviation of each word in each piece of sample text data, and sort at least one word in each piece of sample text data in descending order of the second importance score to generate a sample word sequence corresponding to each piece of sample text data;

The dictionary tree is constructed based on at least one sample word sequence.
The method according to claim 7, wherein, for each piece of sample text data in the target corpus, calculating a second importance score of a corresponding word according to the second distribution deviation of each word in each piece of sample text data, respectively, comprising:

For each piece of sample text data in the target corpus, calculate the second word frequency-inverse document frequency of each word in the each piece of sample text data;

The second importance score of each word in each piece of sample text data is calculated according to each second word frequency-inverse document frequency and the corresponding second distribution deviation, respectively.
The method according to claim 8, wherein calculating the second word frequency-inverse document frequency of each word in each piece of sample text data respectively comprises:

respectively determining the second word frequency and the second inverse document frequency of each word in the each piece of sample text data;

The second word frequency-inverse document frequency of the corresponding word in each piece of sample text data is calculated according to the second word frequency and the second inverse document frequency.
The method according to claim 9, wherein determining the second word frequency and the second inverse document frequency of each word in each piece of sample text data respectively comprises:

Calculate the second word frequency and the second inverse document frequency of each word in each piece of sample text data according to the following formulas:

tf(w)=m

idf(w)=log(N/n)

Calculate the second word frequency-inverse document frequency of the corresponding word in each piece of sample text data according to the second word frequency and the second inverse document frequency, including:

Calculate the second word frequency-inverse document frequency of each word in each piece of sample text data according to the following formula:

tf-idf(w)=tf(w)*idf(w)

Wherein, w represents any word in each piece of sample text data, tf(w) represents the second word frequency of the word w in each piece of sample text data, idf(w) represents the second inverse document frequency of the word w in each piece of sample text data, m represents the number of occurrences of the word w in the target corpus, n represents the number of pieces of sample text data containing the word w in the target corpus, and N represents the total number of pieces of sample text data contained in the target corpus.
The method according to claim 8, wherein calculating the second importance score of each word in each piece of sample text data according to each second word frequency-inverse document frequency and the corresponding second distribution deviation, respectively, comprising:

Calculate the second importance score of each word in each piece of sample text data according to the following formula:

[formula not reproduced in the source text]

Wherein, s(w) represents the second importance score of the word w in each piece of sample text data, tf-idf_a(w) represents the second word frequency-inverse document frequency of the word w in each piece of sample text data, and the remaining symbol of the formula represents the second distribution deviation of the word w in each piece of sample text data.
The method according to any one of claims 8-11, wherein calculating the second distribution deviation of each word included in the target corpus relative to the total corpus comprises:

Calculate the second distribution deviation of each word contained in the target corpus relative to the total corpus according to the following formula:

[formula not reproduced in the source text]

Wherein, b represents the second distribution deviation of the word w in the target corpus relative to the total corpus, freq_a(w) represents the frequency of the word w in the target corpus, freq(w) represents the frequency of the word w in the total corpus, t represents the number of occurrences of the word w in the target corpus, M represents the total number of words contained in the target corpus, t' represents the number of occurrences of the word w in the total corpus, and M' represents the total number of words contained in the total corpus.
The method according to claim 7, after constructing the dictionary tree based on at least one sample word sequence, further comprising:

Determine the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences;

The dictionary tree is pruned according to the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, until the number of nodes included in the dictionary tree reaches a preset number.
The method according to claim 13, wherein pruning the dictionary tree according to the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, until the number of nodes contained in the dictionary tree reaches the preset number, comprising:

In ascending order of the number of occurrences of the word of each node in the dictionary tree at the same position in all sample word sequences, delete the nodes corresponding to each number of occurrences from the dictionary tree in turn, until the number of nodes contained in the dictionary tree reaches the preset number.
A text clustering device, comprising:

A text data acquisition module, configured to acquire a target text data set to be clustered; wherein, the target text data set includes at least one piece of target text data;

A search word sequence generation module, configured to calculate the first importance score of at least one word in each piece of target text data for each piece of target text data in the target text data set, and based on the first importance score Sort at least one word in each piece of target text data, and generate a word sequence to be searched corresponding to each piece of target text data;

The target word sequence determination module is configured to search, for each to-be-searched word sequence, a pre-built dictionary tree for a target word sequence adapted to each to-be-searched word sequence; wherein, the target word sequence is a subsequence of each to-be-searched word sequence;

The text clustering module is configured to cluster the target text data corresponding to the at least one target word sequence according to the at least one target word sequence, respectively, to obtain a text clustering result.
An electronic device comprising:

at least one processing device;

a storage device configured to store at least one program;

When the at least one program is executed by the at least one processing device, the at least one processing device implements the text clustering method according to any one of claims 1-14.
A computer-readable medium storing a computer program which, when executed by a processing device, implements the text clustering method according to any one of claims 1-14.