Detailed Description
Referring to fig. 1, a first embodiment of the present application provides a k-means text clustering method with built-in constraint rules, which includes the following steps S100 to S600.
S100: preprocess a text set to be clustered by using a second constraint rule to obtain a second preprocessing set corresponding to the second constraint rule. The second constraint rule comprises two sub-rules; texts conforming to one sub-rule and texts conforming to the other sub-rule must be clustered into different clusters. The second preprocessing set comprises two subsets, each containing the texts that conform to the corresponding sub-rule.
In step S100, the second constraint rule includes two sub-rules, and a text conforming to one sub-rule and a text conforming to the other sub-rule must be clustered into different clusters. For example, suppose second constraint rule 1 includes sub-rule 1-1 and sub-rule 1-2, text A conforms to sub-rule 1-1, and text J conforms to sub-rule 1-2. When clustering a text set to be clustered that includes text A and text J, texts A and J must be placed in different clusters.
Specifically, in one implementation of the second constraint rules, each sub-rule includes at least one mutually exclusive word bag (bag of words), and each mutually exclusive word bag includes at least one preset second keyword. When a mutually exclusive word bag contains two or more second keywords, the bag further specifies a logical AND relationship among them. When a text contains any mutually exclusive word bag of a sub-rule of a second constraint rule, the text conforms to that sub-rule.
An example of a second constraint rule is given in Table 1, where "+" indicates a logical AND relationship. Taking the second constraint rule of Table 1 as an example, if text A includes mutually exclusive word bag 1, that is, text A includes second keyword 1, second keyword 2, and second keyword 3 at the same time, then text A conforms to sub-rule 1-1 of second constraint rule 1. If text J includes mutually exclusive word bag 3, that is, text J includes both second keyword 7 and second keyword 8, then text J conforms to sub-rule 1-2 of second constraint rule 1. Because texts A and J respectively conform to the two sub-rules of second constraint rule 1, when a text set to be clustered including texts A and J is clustered, texts A and J are placed in different clusters.
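The bag-matching logic described above can be sketched as follows. This is an illustrative sketch, not the patent's implementation; the keyword strings and the sub-rule name are hypothetical English stand-ins.

```python
# A mutually exclusive word bag is modeled as a list of second keywords
# joined by logical AND; a sub-rule is a list of such bags.
def matches_bag(text, bag):
    """The text matches a bag if it contains every second keyword (logical AND)."""
    return all(keyword in text for keyword in bag)

def matches_subrule(text, subrule):
    """The text conforms to a sub-rule if it matches any of its mutex bags."""
    return any(matches_bag(text, bag) for bag in subrule)

# Hypothetical sub-rule 1-1 ("card handling problem") with two mutex bags:
subrule_1_1 = [
    ["card handling", "not received", "approval passed"],  # mutex bag 1
    ["apply for", "card", "not received"],                 # mutex bag 2
]
print(matches_subrule("I apply for a card but it is still not received",
                      subrule_1_1))  # prints True (mutex bag 2 matches)
```

A text only needs to contain one complete bag, so adding more bags to a sub-rule broadens the topic coverage without weakening any single bag's AND condition.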
Table 1 example of one implementation of the second constraint rule
Each second keyword may be an explicit word, such as "transact" or "error". In different application scenarios, the second keywords in the mutually exclusive word bags of a sub-rule may be words closely related to the topic of the cluster desired by the user, so that a mutually exclusive word bag of second keywords can represent a cluster topic. Within one second constraint rule, all mutually exclusive word bags of the same sub-rule represent the same cluster topic, while the bags of different sub-rules represent different cluster topics.
For example, consider a text set to be clustered formed by the customer-service work-order texts of a bank's credit card department. The user expects each resulting cluster to represent a different topic, such as "card handling problem", "accounting problem", "data entry error", or "unskilled staff", so that the texts can later be handled by topic. Suppose the second keywords in each mutually exclusive word bag of sub-rule 1-1 of second constraint rule 1 represent "card handling problem": second keyword 1 may be "card handling", second keyword 2 may be "not received so far", and second keyword 3 may be "approval passed", so that mutually exclusive word bag 1 is "card handling + not received so far + approval passed" and embodies the "card handling problem" cluster topic. Similarly, mutually exclusive word bag 2 and further bags can be set to embody the same topic; for example, bag 2 may be "apply for + card + not received". In the same rule, the second keywords in each bag of sub-rule 1-2 may represent "data entry error": second keyword 4 may be "card handling", second keyword 5 may be "no progress", and second keyword 6 may be "error found", so that mutually exclusive word bag 3 is "card handling + no progress + error found" and embodies the "data entry error" cluster topic.
Similarly, a plurality of second constraint rules such as the second constraint rule 2 may also be set. For example, in the second constraint rule 2, each mutually exclusive word bag of the sub-rule 2-1 may embody the subject of the cluster of the "accounting problem"; each mutex bag of sub-rule 2-2 may then embody the subject of the "card handling problem" cluster.
Further, the second keywords may also be represented by regular expressions; that is, each second keyword may include several sub-keywords and association relations between the sub-keywords. The association relations can be expressed by metacharacters of a regular expression. For example, "+" indicates a logical AND (as in Table 1); "|" represents a logical OR; "()" indicates grouping; "*" matches the preceding sub-expression any number of times; "?" matches the preceding sub-expression zero or one time; and "(.*?)" denotes the shortest (non-greedy) match, that is, the match ends at the first occurrence of the character following "(.*?)".
For example, second keyword 1 can be expressed as the regular expression "(do|transact|apply)(.*?)card". Here "do", "transact", "apply", and "card" are all sub-keywords, while "()", "|", and "(.*?)" are metacharacters. If a text contains any of the three sub-keywords "do", "transact", or "apply" followed by the sub-keyword "card", the regular expression matches the text; that is, the text includes second keyword 1.
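A minimal sketch of matching a regex-form second keyword; the pattern is an illustrative English stand-in for the example described above, not the patent's actual pattern.

```python
import re

# Any of three alternatives followed, via a shortest (non-greedy) match,
# by the sub-keyword "card".
keyword_1 = re.compile(r"(do|transact|apply)(.*?)card")

def contains_keyword(text, pattern):
    """True if the regular expression matches anywhere in the text."""
    return pattern.search(text) is not None

print(contains_keyword("I went to transact a credit card last week",
                       keyword_1))  # prints True
```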
In step S100, when the text set to be clustered is preprocessed by using the second constraint rules, each text (or its unique identification information, such as a number) that conforms to a sub-rule of a second constraint rule is stored in the corresponding subset of the second preprocessing set for use in subsequent steps. Some texts in the text set to be clustered may not conform to any sub-rule of any second constraint rule; such texts (and their identification information) are not stored in any subset of the second preprocessing sets.
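Step S100 can be sketched as a single scan over the text set. The data shapes (text ids, rule pairs) and the pluggable sub-rule predicate are assumptions for illustration:

```python
def preprocess(texts, rules, matches_subrule):
    """texts: dict {text_id: content}; rules: list of (subrule_a, subrule_b).
    Returns one second preprocessing set, i.e. a pair of id subsets, per rule.
    Texts matching no sub-rule of any rule appear in no subset."""
    sets = []
    for subrule_a, subrule_b in rules:
        subset_a = {tid for tid, body in texts.items()
                    if matches_subrule(body, subrule_a)}
        subset_b = {tid for tid, body in texts.items()
                    if matches_subrule(body, subrule_b)}
        sets.append((subset_a, subset_b))
    return sets
```

Storing ids rather than full texts keeps the subsets small, which matches the note below that either form may be stored.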
It should be noted that the second preprocessing sets, the cluster mutual exclusion sets, and the first preprocessing set involved in the second embodiment below may store either the texts themselves or the unique identification information of the texts; whichever form is stored does not depart from the core idea of the present application. For convenience of description, the following embodiments make no distinction and simply use "text".
When there are n second constraint rules in the k-means text clustering method of this embodiment, preprocessing the text set to be clustered with the second constraint rules correspondingly yields n second preprocessing sets, each comprising 2 subsets, where n is a positive integer greater than or equal to 1.
This is further illustrated by an example. Assume the text set to be clustered includes 11 texts: text A, text B, text C, ..., text H, text I, text J, and text L. Table 2 shows the correspondence between the second constraint rules and the second preprocessing sets when there are 2 second constraint rules. Text A conforms to sub-rule 1-1 and text J to sub-rule 1-2; texts C, D, and I conform to sub-rule 2-1, and texts F and G to sub-rule 2-2.
Table 2 second preprocessing set example
S200: acquire k texts in the text set to be clustered as cluster centers, where k < N and N is the total number of texts in the text set to be clustered.
In step S200, k texts in the text set to be clustered are selected as cluster centers, which means the remaining (N-k) texts will be classified into these k clusters by the subsequent steps. The k cluster centers may be manually specified and then acquired by the computer, or selected at random by the computer; the present application does not limit this.
S300: if a cluster center is contained in one subset of a second preprocessing set, add the texts in the other subset of that second preprocessing set to the cluster mutual exclusion set corresponding to the cluster center.

In step S300, all cluster centers may be traversed. For each cluster center, if it is contained in a subset of a second preprocessing set, the texts in the other subset of that set are added to the cluster mutual exclusion set corresponding to the cluster center. If a cluster center is not contained in any subset of any second preprocessing set, the next cluster center is processed, until all cluster centers have been handled.

In this step, the cluster mutual exclusion sets correspond one to one with the cluster centers: since k texts are acquired as cluster centers in step S200, there are also k cluster mutual exclusion sets. Each one stores the texts of the text set to be clustered that must end up in a different cluster from its cluster center.
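Step S300 can be sketched as follows, assuming the second preprocessing sets are pairs of id subsets as in the S100 sketch; the example data reproduce the scenario of Tables 2 and 3:

```python
def build_exclusion_sets(centers, preprocessing_sets):
    """centers: list of cluster-center text ids.
    preprocessing_sets: list of (subset_a, subset_b) pairs of text ids.
    Returns one cluster mutual exclusion set per center."""
    exclusion = [set() for _ in centers]
    for i, center in enumerate(centers):
        for subset_a, subset_b in preprocessing_sets:
            if center in subset_a:      # center matches one sub-rule:
                exclusion[i] |= subset_b  # texts of the other sub-rule are excluded
            elif center in subset_b:
                exclusion[i] |= subset_a
    return exclusion

# Centers A, D, H with the two second preprocessing sets of Table 2:
sets_ = [({"A"}, {"J"}), ({"C", "D", "I"}, {"F", "G"})]
print(build_exclusion_sets(["A", "D", "H"], sets_))
```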
Following the example of step S100, assume that step S200 acquires 3 texts as cluster centers: text A, text D, and text H. After step S300 is performed, the result of Table 3 is obtained. Specifically, for the center of cluster 1, text A: since text A is contained in subset 1-1 of second preprocessing set 1, text J of subset 1-2 is stored in cluster mutual exclusion set 1 corresponding to cluster center 1. For the next cluster center, text D: since text D is contained in subset 2-1 of second preprocessing set 2, texts F and G of subset 2-2 are stored in cluster mutual exclusion set 2 corresponding to cluster center 2. Finally, for the next cluster center, text H: since text H is not contained in any subset of any second preprocessing set, cluster mutual exclusion set 3 corresponding to cluster center 3 remains empty for now.
Table 3 Example of the cluster mutual exclusion sets after step S300

Cluster center No. | Cluster center | Mutual exclusion set No. | Texts in the mutual exclusion set
Cluster center 1   | Text A         | Mutual exclusion set 1   | Text J
Cluster center 2   | Text D         | Mutual exclusion set 2   | Text F, Text G
Cluster center 3   | Text H         | Mutual exclusion set 3   | ——
S400: if the current text in the text set to be clustered is contained in x cluster mutual exclusion sets, calculate the distances between the current text and the other (k-x) cluster centers, excluding the cluster centers corresponding to those mutual exclusion sets, and add the current text to the cluster of the closest center, where 0 < x < k.

S500: if the current text in the text set to be clustered is contained in no cluster mutual exclusion set, or is contained in all of them, calculate the distances between the current text and all k cluster centers, and add the current text to the cluster of the closest center.

In steps S400 and S500, the texts of the text set to be clustered may be traversed, and for each text it is determined whether it is contained in the cluster mutual exclusion sets. If the current text is contained in x cluster mutual exclusion sets (0 < x < k), the distances between it and the other (k-x) cluster centers, excluding those corresponding to the mutual exclusion sets, are calculated, and the text is added to the cluster of the closest center. If the text is contained in no cluster mutual exclusion set, or in all of them, the distance between the text and every cluster center is calculated, and the text is added to the cluster of the closest center.
It should be noted that if a text is contained in the cluster mutual exclusion sets of all clusters, the second constraint rules require the text to be clustered apart from every cluster center. If this were enforced literally, the text could not be classified in the current iteration, and the clustering method would fail. Therefore, when this situation arises, all cluster centers are treated as equal: the distances from the text to every cluster center are calculated one by one, and the text is then added to the cluster of the closest center.
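Steps S400 and S500 reduce to one candidate-filtering rule, sketched below; the `distance` function is an assumption, since the text makes no commitment here to a particular text-distance measure:

```python
def assign_text(text_id, centers, exclusion_sets, distance):
    """Return the index of the cluster the text joins (steps S400/S500).

    Centers whose mutual exclusion set contains the text are skipped, unless
    every center excludes it, in which case all centers compete equally."""
    candidates = [i for i in range(len(centers))
                  if text_id not in exclusion_sets[i]]
    if not candidates:  # contained in all exclusion sets: compare with every center
        candidates = list(range(len(centers)))
    return min(candidates, key=lambda i: distance(text_id, centers[i]))
```

Filtering the candidate list up front is exactly what removes the excluded centers from the distance computation.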
Following the example of step S300 in this embodiment, 8 texts in the text set to be clustered have not yet been classified: text B, text C, text E, text F, text G, text I, text J, and text L. They are placed into the 3 clusters in turn.
(1) Check whether text B is contained in a cluster mutual exclusion set of Table 3. It is not contained in any, so the distances between text B and the three cluster centers are calculated. Assuming text B is closest to cluster center 1 ("text A"), text B is added to cluster 1.
(2) Check whether text C is contained in a cluster mutual exclusion set of Table 3. It is not contained in any, so the distances between text C and the three cluster centers are calculated. Assuming text C is closest to cluster center 2 ("text D"), text C is added to cluster 2.
(3) Check text E likewise. It is not contained in any cluster mutual exclusion set, so the distances between text E and the three cluster centers are calculated. Assuming text E is closest to cluster center 2 ("text D"), text E is added to cluster 2.
(4) Check whether text F is contained in a cluster mutual exclusion set of Table 3. It is contained in cluster mutual exclusion set 2, whose cluster center 2 is text D, so the distances between text F and the remaining cluster centers, cluster center 1 and cluster center 3, are calculated. Assuming text F is closest to cluster center 3 ("text H"), text F is added to cluster 3.
(5) Check text G likewise. It is contained in cluster mutual exclusion set 2, so the distances between text G and cluster centers 1 and 3 are calculated. Assuming text G is closest to cluster center 3 ("text H"), text G is added to cluster 3.
(6) Check text I. It is not contained in any cluster mutual exclusion set, so the distances between text I and the three cluster centers are calculated. Assuming text I is closest to cluster center 1 ("text A"), text I is added to cluster 1.
(7) Check whether text J is contained in a cluster mutual exclusion set of Table 3. It is contained in cluster mutual exclusion set 1, whose cluster center 1 is text A, so the distances between text J and cluster centers 2 and 3 are calculated. Assuming text J is closest to cluster center 3 ("text H"), text J is added to cluster 3.
(8) Check text L. It is not contained in any cluster mutual exclusion set, so the distances between text L and the three cluster centers are calculated. Assuming text L is closest to cluster center 3 ("text H"), text L is added to cluster 3.
In this classification process, when texts F, G, and J are assigned to clusters, only the distances between the current text and the cluster centers other than the one corresponding to its cluster mutual exclusion set need to be calculated. This reduces the number of cluster centers involved in the computation and thus the computational complexity of the text clustering.
So far, all texts in the text set to be clustered have been clustered into corresponding clusters, and the result is shown in table 4.
Table 4 Example of the clusters after steps S400 and S500

Cluster No. | Cluster center | Remaining texts in the cluster
Cluster 1   | Text A         | Text B, Text I
Cluster 2   | Text D         | Text C, Text E
Cluster 3   | Text H         | Text F, Text G, Text J, Text L
S600: recalculate a new cluster center for each cluster, and if the new cluster centers meet a preset stop condition, output all the clusters.
In step S600, a new cluster center is recalculated for each cluster. Specifically, the calculation used in a typical k-means clustering method may be adopted: the distance between every pair of texts in the cluster is calculated, and the text whose total distance to the other texts in the cluster is smallest is selected as the new cluster center.
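The cluster-center update just described can be sketched as a medoid selection; `distance` is again an assumed plug-in:

```python
def new_cluster_center(cluster_texts, distance):
    """Return the member of cluster_texts with the smallest total distance
    to all other members (the medoid)."""
    return min(cluster_texts,
               key=lambda text: sum(distance(text, other)
                                    for other in cluster_texts))
```

Choosing an existing text as the center (rather than an averaged vector) keeps the center comparable against the preprocessing subsets in the next iteration's step S300.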
If the new cluster centers do not meet the preset stop condition, the text set to be clustered is reset, steps S300 to S500 are repeated with the new cluster centers serving as the cluster centers, and new cluster centers are then recalculated for each cluster. Each repetition of steps S300, S400, and S500 together with the recalculation of cluster centers counts as one iteration. This continues until the new cluster centers meet the preset stop condition, at which point all the clusters are output.
In step S600, the preset stop condition may be set according to the situation. For example, the new cluster centers may satisfy the stop condition if the deviation between each new cluster center and the corresponding original cluster center is smaller than a preset threshold; as another example, they may satisfy it if the accumulated number of iterations reaches a preset maximum.
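Both example stop conditions can be sketched together; the threshold and iteration budget below are illustrative values, not ones fixed by the text:

```python
def should_stop(old_centers, new_centers, distance, iteration,
                threshold=0.5, max_iterations=100):
    """Stop when every new center stays within `threshold` of its original
    center, or when the iteration budget is exhausted."""
    drift_small = all(distance(old, new) < threshold
                      for old, new in zip(old_centers, new_centers))
    return drift_small or iteration >= max_iterations
```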
The case where the new cluster centers satisfy the preset stop condition is described below, continuing the example of steps S400 and S500.
After one iteration, assume the new cluster centers are recalculated as: cluster 1, text B; cluster 2, text C; cluster 3, text J.
Suppose the calculations show that the distance between the original center of cluster 1 (text A) and its new center (text B) is smaller than the preset threshold, the distance between the original center of cluster 2 (text D) and its new center (text C) is smaller than the preset threshold, and the distance between the original center of cluster 3 (text H) and its new center (text J) is smaller than the preset threshold. The new cluster centers then satisfy the stop condition, and all the clusters are output as shown in Table 5.
Table 5 One example of outputting all the clusters

Cluster No. | Texts in the cluster
Cluster 1   | Text A, Text B, Text I
Cluster 2   | Text D, Text C, Text E
Cluster 3   | Text F, Text G, Text H, Text J, Text L
The case where the new cluster centers do not satisfy the preset stop condition is explained below, again continuing the example of steps S400 and S500.
After one iteration, assume the new cluster centers are recalculated as: cluster 1, text B; cluster 2, text C; cluster 3, text J. If the calculations show that the new cluster centers do not meet the preset stop condition, a second iteration begins. The specific process is as follows.
The text set to be clustered is reset so that it again contains the 11 texts: text A, text B, text C, ..., text H, text I, text J, and text L. The cluster mutual exclusion sets are also reset.
With the new cluster center of each cluster serving as the cluster center, if a cluster center is contained in one subset of a second preprocessing set, the texts in the other subset of that set are added to the cluster mutual exclusion set corresponding to the cluster center, producing the new cluster mutual exclusion sets. After all cluster centers are traversed, the cluster mutual exclusion sets of Table 6 are obtained.
Table 6 Example of the cluster mutual exclusion sets after the above steps in the second iteration

Cluster center No. | Cluster center | Mutual exclusion set No. | Texts in the mutual exclusion set
Cluster center 1   | Text B         | Mutual exclusion set 1   | ——
Cluster center 2   | Text C         | Mutual exclusion set 2   | Text G, Text F
Cluster center 3   | Text J         | Mutual exclusion set 3   | Text A
At this point, 8 texts in the text set to be clustered have not yet been classified: text A, text D, text E, text F, text G, text H, text I, and text L. They are placed into the 3 clusters in turn.
(1) Check whether text A is contained in a cluster mutual exclusion set of Table 6. It is contained in cluster mutual exclusion set 3, whose cluster center 3 is text J, so the distances between text A and the remaining cluster centers, cluster center 1 and cluster center 2, are calculated. Assuming text A is closest to cluster center 1 ("text B"), text A is added to cluster 1.
(2) Check text D. It is not contained in any cluster mutual exclusion set of Table 6, so the distances between text D and the three cluster centers are calculated. Assuming text D is closest to cluster center 2 ("text C"), text D is added to cluster 2.
(3) Check text E. It is not contained in any cluster mutual exclusion set, so the distances between text E and the three cluster centers are calculated. Assuming text E is closest to cluster center 2 ("text C"), text E is added to cluster 2.
(4) Check text F. It is contained in cluster mutual exclusion set 2, whose cluster center 2 is text C, so the distances between text F and cluster centers 1 and 3 are calculated. Assuming text F is closest to cluster center 3 ("text J"), text F is added to cluster 3.
(5) Check text G. It is contained in cluster mutual exclusion set 2, whose cluster center 2 is text C, so the distances between text G and cluster centers 1 and 3 are calculated. Assuming text G is closest to cluster center 3 ("text J"), text G is added to cluster 3.
(6) Check text H. It is not contained in any cluster mutual exclusion set, so the distances between text H and the three cluster centers are calculated. Assuming text H is closest to cluster center 2 ("text C"), text H is added to cluster 2.
(7) Check text I. It is not contained in any cluster mutual exclusion set, so the distances between text I and the three cluster centers are calculated. Assuming text I is closest to cluster center 1 ("text B"), text I is added to cluster 1.
(8) Check text L. It is not contained in any cluster mutual exclusion set, so the distances between text L and the three cluster centers are calculated. Assuming text L is closest to cluster center 1 ("text B"), text L is added to cluster 1.
So far, all texts in the text set to be clustered have been clustered into corresponding clusters, and the result is shown in table 7.
Table 7 Example of the clusters after the above steps in the second iteration

Cluster No. | Cluster center | Remaining texts in the cluster
Cluster 1   | Text B         | Text A, Text I, Text L
Cluster 2   | Text C         | Text D, Text E, Text H
Cluster 3   | Text J         | Text F, Text G
New cluster centers are then recalculated for each cluster once more; assume they are: cluster 1, text A; cluster 2, text E; cluster 3, text J.
Suppose the calculations show that the distance between the original center of cluster 1 (text B) and its new center (text A) is smaller than the preset threshold, the distance between the original center of cluster 2 (text C) and its new center (text E) is smaller than the preset threshold, and the distance between the original center of cluster 3 (text J) and its new center (text J) is smaller than the preset threshold. The new cluster centers then satisfy the stop condition, and all the clusters are output as shown in Table 8.
Table 8 Another example of outputting all the clusters

Cluster No. | Texts in the cluster
Cluster 1   | Text A, Text B, Text I, Text L
Cluster 2   | Text D, Text C, Text E, Text H
Cluster 3   | Text F, Text G, Text J
In a typical k-means clustering method, after k texts are acquired as cluster centers, for each text except for the cluster center, the distance from the text to each cluster center needs to be measured one by one, and then the text can be added to a class cluster corresponding to the cluster center closest to the text.
In the clustering method of this embodiment, the second constraint rules built into the k-means clustering method are first used to preprocess the text set to be clustered, obtaining the second preprocessing sets, and k texts of the text set are then acquired as cluster centers. If a cluster center is contained in one subset of a second preprocessing set, the texts in the other subset are added to the cluster mutual exclusion set corresponding to that center, so that the center and the texts in its mutual exclusion set conform to a second constraint rule. The texts of the text set to be clustered are then examined one by one: if a text is contained in a cluster mutual exclusion set, the text and the corresponding cluster center conform to a second constraint rule and must fall into different clusters. Consequently, when classifying such a text, the distance to every cluster center need not be calculated; only the distances to the cluster centers other than those corresponding to its mutual exclusion sets are needed, which reduces the number of cluster centers involved in the computation. Specifically, if the text is contained in x cluster mutual exclusion sets, only the distances between it and the other (k-x) cluster centers are calculated. The k-means clustering process is thereby improved, and the computational complexity of text clustering is reduced.
In particular, when the text set to be clustered contains a large number of texts and many clusters are required, the k-means clustering method with built-in constraint rules markedly reduces the computational complexity.
In addition, a typical k-means clustering method performs poorly on text sets with substantial feature overlap. Specifically, when calculating the distance between a text to be clustered and a text serving as a cluster center, feature words are extracted from each text, and the similarity between the two sets of feature words is compared to obtain the distance. Among the feature words, some have little or no relevance to the topic a text actually expresses. If the feature words extracted from two texts overlap heavily in topic-irrelevant or weakly relevant words while containing few words closely related to their topics, the features of the two texts are severely crossed. For two texts with different topics but severely crossed features, the user expects them to fall into different clusters, yet the computer may well place them in the same cluster because their feature words overlap so much; that is, it may add one text to be clustered to the cluster whose center is the other text. This lowers the clustering accuracy and makes the overall clustering result unsatisfactory.
In the clustering method of this embodiment, each sub-rule of each second constraint rule includes at least one mutually exclusive bag, each mutually exclusive bag includes at least one preset second keyword, and the second keywords can be preset as words closely related to the cluster-like subject, so that the mutually exclusive bags formed by the second keywords can embody the cluster-like subject. When the problem of text feature crossing is serious, firstly, preprocessing is carried out by using a second constraint rule, and the text which is taken as a cluster center and is definitely expected by a user to be clustered in different clusters with the cluster center can be determined. Then, when the texts in the text set to be clustered are classified, if a certain text is contained in the cluster mutual exclusion set corresponding to a certain cluster center, when the distance between the text and the cluster center is calculated, the cluster center is directly excluded, only the distance between the text and other cluster centers is calculated, and finally the text is added into the cluster corresponding to the cluster center closest to the text. By the mode, on one hand, the subject words concerned by the user are highlighted, and the text which is not expected to be clustered into the same cluster by the user is eliminated by utilizing the second constraint rule; on the other hand, the number of the cluster centers needing to participate in calculation in the process of classifying the text to be clustered is reduced, so that the influence of feature words which are irrelevant to the theme or have low relevance but are crossed seriously on the clustering is weakened, and the clustering precision is improved.
For the text to be clustered, it may satisfy the condition of "being included in x mutually exclusive sets of class clusters", or may satisfy the condition of "not being included in any one mutually exclusive set of class clusters or being included in all mutually exclusive sets of class clusters". However, for a certain text to be clustered, it cannot satisfy both of the above two conditions, and accordingly, only one of the steps S400 and S500 may be performed. In both step S400 and step S500, the text is finally required to be added to the class cluster corresponding to the cluster center closest to the text.
On this basis, optionally, referring to fig. 2, after the step of adding the current text to the cluster class corresponding to the cluster center closest to the current text in S400 or S500, the method may further include:
S700: if the current text is contained in any one of the subsets of the second preprocessing set, adding the text in the other subset of the second preprocessing set to the class cluster mutual exclusion set corresponding to the cluster center of the class cluster to which the current text belongs.
In the step of executing S700, there may be a case where all or part of the text in another subset of the second preprocessing set is already stored in the corresponding cluster-like mutually exclusive set. At this time, for the text already existing in the class cluster mutual exclusion set, repeated addition is not needed, and only the text not existing in the class cluster mutual exclusion set is added.
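Step S700 can be sketched with Python sets, which make the "do not add duplicates" rule automatic since adding an element already present has no effect. The data layout and function name below are assumptions for illustration only.

```python
# Sketch of step S700: after a text is placed in a class cluster, if the text
# belongs to one subset of a second preprocessing set, the texts of the other
# subset are merged into that cluster's mutual-exclusion set. set.update()
# silently skips texts that are already present.

def update_mutex_set(text, cluster_id, second_pre_sets, mutex_sets):
    """second_pre_sets: list of (subset_a, subset_b) pairs; mutex_sets: {cid: set}."""
    for subset_a, subset_b in second_pre_sets:
        if text in subset_a:
            mutex_sets.setdefault(cluster_id, set()).update(subset_b)
        elif text in subset_b:
            mutex_sets.setdefault(cluster_id, set()).update(subset_a)
```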
Here, still following the example of step S300 in this embodiment, there are 8 texts that have not been classified in the text set to be clustered: text B, text C, text E, text F, text G, text I, text J, text L. The following sequentially puts 8 texts into 3 clusters.
(1) And judging whether the text B is contained in the class cluster mutual exclusion set shown in the table 3, and respectively calculating the distances between the text B and three cluster centers if the judgment result shows that the text B is not contained in any class cluster mutual exclusion set. Assuming that text B is closest in distance to the cluster center 1 "text A", text B is added to class cluster 1.
The cluster center of the class cluster 1 to which the text B belongs is a cluster center 1 'text A', and the cluster center 1 corresponds to the class cluster mutual exclusion set 1. Since text B is not included in any of the second preprocessed sets, the next uncategorized text is then categorized.
(2) And judging whether the text C is contained in the class cluster mutual exclusion set shown in the table 3, and respectively calculating the distances between the text C and three cluster centers if the judgment result shows that the text C is not contained in any class cluster mutual exclusion set. Assuming that text C is closest in distance to cluster center 2 "text D", text C is added to class cluster 2.
The cluster center of the class cluster 2 to which the text C belongs is a cluster center 2 'text D', and the cluster center 2 corresponds to the class cluster mutual exclusion set 2. The text C is contained in the subset 2-1 of the second preprocessing set 2, the subset 2-2 comprises the text F and the text G which are both contained in the class cluster mutual exclusion set 2 of the table 3, and therefore the addition of the text F and the text G to the class cluster mutual exclusion set 2 is not repeated. The situation of the mutually exclusive set of the class clusters in the current iteration is still shown in table 3.
(3) And judging whether the text E is contained in the class cluster mutual exclusion set shown in the table 3, and respectively calculating the distances between the text E and three cluster centers if the judgment result shows that the text E is not contained in any class cluster mutual exclusion set. Assuming that text E is closest in distance to cluster center 2 "text D", text E is added to class cluster 2.
Since text E is not included in any subset of the second preprocessing sets, the next uncategorized text is then categorized.
(4) And judging whether the text F is contained in the class cluster mutual exclusion set of the table 3, wherein the judgment result is that the text F is contained in the class cluster mutual exclusion set 2, and the cluster center 2 corresponding to the class cluster mutual exclusion set 2 is the text D, so that the distances between the other cluster centers except the cluster center 2 'text D', namely the cluster center 1 and the cluster center 3, and the text F are respectively calculated. Assuming that text F is closest in distance to cluster center 3 "text H", text F is added to class cluster 3.
The cluster center of the class cluster 3 to which the text F belongs is the cluster center 3 "text H", and the cluster center 3 corresponds to the class cluster mutual exclusion set 3. Since the text F is contained in the subset 2-2 of the second preprocessing set 2, and the subset 2-1 comprises the text C, the text D, and the text I, these three texts are added to the class cluster mutual exclusion set 3. The class cluster mutual exclusion sets at this point are shown in table 9.
TABLE 9 example of the case of a cluster-like mutual exclusion set after performing the above steps
(5) And judging whether the text G is contained in the class cluster mutual exclusion set of the table 9, wherein the judgment result is that the text G is contained in the class cluster mutual exclusion set 2, and the cluster center 2 corresponding to the class cluster mutual exclusion set 2 is the text D, so that the distances between the other cluster centers except the cluster center 2 'text D', namely the cluster center 1 and the cluster center 3, and the text G are respectively calculated. Assuming that text G is closest in distance to cluster center 3 "text H", text G is added to class cluster 3.
The cluster center of the class cluster 3 to which the text G belongs is the cluster center 3 "text H", and the cluster center 3 corresponds to the class cluster mutual exclusion set 3. Since the text G is contained in the subset 2-2 of the second preprocessing set 2, and the text C, the text D, and the text I in the subset 2-1 have already been added to the class cluster mutual exclusion set 3, they are not added again. The class cluster mutual exclusion sets are still as shown in table 9.
(6) Judging whether the text I is contained in the class cluster mutual exclusion set of the table 9, wherein the judgment result is that the text I is contained in the class cluster mutual exclusion set 3, and the cluster center 3 corresponding to the class cluster mutual exclusion set 3 is the text H, so that the distances between the other cluster centers except the cluster center 3 'text H', namely the cluster center 1 and the cluster center 2, and the text I are respectively calculated. Assuming that text I is closest in distance to the cluster center 1 "text A", text I is added to class cluster 1.
The cluster center of the class cluster 1 to which the text I belongs is the cluster center 1 "text A", and the cluster center 1 corresponds to the class cluster mutual exclusion set 1. Since the text I is contained in the subset 2-1 of the second preprocessing set 2, the text F and the text G in the subset 2-2 are added to the class cluster mutual exclusion set 1. The class cluster mutual exclusion sets at this point are shown in table 10.
TABLE 10 example of the case of a cluster-like mutual exclusion set after performing the above steps
(7) Judging whether the text J is contained in the class cluster mutual exclusion set of the table 10, wherein the judgment result is that the text J is contained in the class cluster mutual exclusion set 1, and the cluster center 1 corresponding to the class cluster mutual exclusion set 1 is the text A, so that the distances between the other cluster centers except the cluster center 1 and the text A, namely the cluster center 2 and the cluster center 3, and the text J are respectively calculated. Assuming that text J is closest in distance to cluster center 3 "text H", text J is added to class cluster 3.
The cluster center of the class cluster 3 to which the text J belongs is the cluster center 3 "text H", and the cluster center 3 corresponds to the class cluster mutual exclusion set 3. Since the text J is contained in the subset 1-2 of the second preprocessing set 1, the text A in the subset 1-1 is added to the class cluster mutual exclusion set 3. The class cluster mutual exclusion sets at this point are shown in table 11.
TABLE 11 example of the case of a cluster-like mutually exclusive set after performing the above steps
Serial number of cluster center | Cluster center | Class cluster mutual exclusion set sequence number | Text of class cluster mutual exclusion set
1 | Text A | Class cluster mutual exclusion set 1 | Text J, text F, text G
2 | Text D | Class cluster mutual exclusion set 2 | Text F, text G
3 | Text H | Class cluster mutual exclusion set 3 | Text C, text D, text I, text A
(8) And judging whether the text L is contained in the class cluster mutual exclusion set of the table 11, and respectively calculating the distances between the text L and three cluster centers if the judgment result shows that the text L is not contained in any class cluster mutual exclusion set. Assuming that text L is closest in distance to cluster center 3 "text H", text L is added to class cluster 3.
Since the text L is the last text in the text set to be clustered, all the texts in the text set to be clustered have been clustered into corresponding clusters, and the result is as shown in table 4.
Comparing the above example with the examples in the steps of S300 and S400, it can be seen that, in the process of traversing the unclassified texts in the text set to be clustered and classifying the texts, the cluster mutual exclusion set is continuously updated by using the previously classified texts and the second preprocessing set, so that the number of cluster centers participating in the calculation of the subsequently classified texts during classification can be reduced, thereby further reducing the computational complexity of clustering. For example, in the above example, after the text F is classified into the cluster 3, the text in the subset 2-1 is added to the cluster exclusive set, so that when the text I is classified subsequently, the distance between the text I and all the cluster centers is not required to be calculated, only the distance between the text I and the cluster center 1 and the cluster center 2 is required to be calculated, and then the text I is classified into the cluster corresponding to the cluster center closest to the text I, thereby further reducing the calculation complexity of clustering.
Referring to fig. 3, in a second embodiment of the present application, a method for clustering k-means texts with built-in constraint rules is provided, which includes steps S100, S801, S200, S300, S802, S400, S500, and S600.
S100: and preprocessing a text set to be clustered by utilizing a second constraint rule to obtain a second preprocessing set corresponding to the second constraint rule, wherein the second constraint rule comprises two sub-rules, texts conforming to one sub-rule and texts conforming to the other sub-rule are required to be clustered into different clusters, the second preprocessing set comprises two sub-sets, and each sub-set comprises texts conforming to one corresponding sub-rule.
Step S100 may refer to the description related to the first embodiment, and is not described herein again.
S801: preprocessing a text set to be clustered by utilizing a first constraint rule to obtain a first preprocessing set corresponding to the first constraint rule, wherein texts conforming to the first constraint rule are required to be clustered into the same cluster, and the first preprocessing set comprises texts conforming to the corresponding first constraint rule.
Here, the step S801 may be before the step S100, or may be after the step S100, and the present application does not limit this.
In step S801, the first constraint rule refers to a preset rule that the texts meeting the rule must be clustered into the same class cluster. For example, if two texts a and B both conform to a certain first constraint rule, when clustering a text set to be clustered including a and B, a and B need to be clustered into the same cluster. Here, the texts in the text set to be clustered, which meet different first constraint rules, are piled up and stored in a first preprocessing set.
Specifically, in one implementation manner of the first constraint rules, each first constraint rule includes at least one aggregation word bag, each aggregation word bag includes at least one preset first keyword, and when the number of the first keywords in the same aggregation word bag is greater than or equal to 2, the aggregation word bag further includes a logical and relationship between the first keywords. When a certain text includes any aggregate bag of words in a certain first constraint rule, then the text conforms to the first constraint rule.
An example of a first constraint rule is given in table 12, where "+" indicates a logical and relationship in table 12. Taking the constraint rule of table 12 as an example, if the text a includes the aggregation bag of words 1, that is, the text a includes the first keyword 1, the first keyword 2, and the first keyword 3 at the same time, the text a conforms to the first constraint rule 1. If the text B comprises an aggregate bag of words 3, i.e. the text B comprises both the first keywords 6 and the first keywords 7, the text B also complies with the first constraint rule 1. Because the text A and the text B both accord with the first constraint rule 1, when clustering is carried out on a text set to be clustered including the text A and the text B, the text A and the text B are clustered into the same cluster.
TABLE 12 example of one implementation of the first constraint rule
Each of the first keywords may be an explicit word, such as "transact card", "transact credit card", etc. In different application scenarios, the first keyword in the aggregation word bag of the first constraint rule may be a word that is closely related to the topic of the class cluster desired by the user, so that the aggregation word bag composed of the first keyword can embody the topic of the class cluster.
For example, for a text set formed by customer service work order texts of the bank credit card department, a user expects that each cluster obtained by clustering the text set can represent different topics, such as "card handling problem", "accounting problem", "data entry error", "unskilled work of staff", and the like, so as to be respectively processed according to different topics subsequently. Therefore, assuming that the first keyword in each clustering word bag in the first constraint rule 1 can represent a "card-handling problem", the first keyword 1 may be "card-handling", the first keyword 2 may be "no until now", the first keyword 3 may be "approval-pass", the clustering word bag 1 is "card-handling + no until now + approval-pass", and the clustering word bag represents a topic of a cluster of "card-handling problem". Besides the aggregation word bag 1, other aggregation word bags such as an aggregation word bag 2 and an aggregation word bag 3 can be arranged to embody the theme of the cluster of the card transaction problem. Similarly, each aggregation word bag in the first constraint rule 2 may embody the subject of the category cluster of "accounting problem"; each aggregate bag in the first constraint rule 3 may then embody the subject of the cluster of "data entry errors".
Further, the first keywords may also be represented by regular expressions, i.e., each first keyword may include several sub-keywords and association relationships between the sub-keywords. The association relationships can be represented by meta characters in a regular expression. For example, "|" represents a logical "or"; "()" indicates grouping; "*" indicates matching the preceding sub-expression any number of times; "?" indicates matching the preceding sub-expression zero or one time; and ".*?" indicates the shortest (non-greedy) match, i.e., the match ends as soon as the character following ".*?" is encountered.
For example, the first keyword 1 in table 12 can be expressed as the regular expression "(do|transact|apply)(.*?)card". Here, "do", "transact", "apply", and "card" are all sub-keywords, while "()", "|", and ".*?" are meta characters. If a text contains any of the sub-keywords "do", "transact", or "apply", followed (possibly after other characters) by the sub-keyword "card", the regular expression matches the text. That is, the text includes the first keyword 1.
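Such a regular-expression keyword can be checked with Python's `re` module. The pattern below is a stand-in for the hypothetical keyword discussed above; the actual first keywords would be configured by the user.

```python
import re

# Hypothetical first keyword: one of "do"/"transact"/"apply", then any
# characters (non-greedy), then "card".
pattern = re.compile(r"(do|transact|apply)(.*?)card")

def contains_keyword(text):
    """Return True if the text includes the regular-expression keyword."""
    return pattern.search(text) is not None
```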
The first keyword and the second keyword may be the same or different, and the user may adjust the first keyword and the second keyword according to different focused keywords or different application scenarios, which is not limited in the present application.
The text set to be clustered comprises at least 2 texts to be clustered. When the first constraint rule is utilized to preprocess the text set to be clustered, the texts in the text set to be clustered, which accord with the first constraint rule, are stored in the corresponding first preprocessing set for subsequent steps. And partial texts in the text set to be clustered may not meet any first constraint rule, and the texts are not stored in the first preprocessing set.
When m first constraint rules exist in the k-means text clustering method in the embodiment of the application, preprocessing a text set to be clustered by using the first constraint rules to correspondingly obtain m first preprocessing sets, wherein m is a positive integer larger than or equal to 1.
The step of S801 is further explained below by another example. Assume that the text set to be clustered includes 11 texts, namely, a text a, a text B, a text C … …, a text H, a text I, a text J, and a text L. Table 13 shows the correspondence relationship between the first constraint rule and the first preprocessing set when there are 3 first constraint rules in this example. Wherein the text A and the text B both accord with a first constraint rule 1; the text C, the text D and the text E all accord with a first constraint rule 2; both text F and text G conform to the first constraint rule 3.
TABLE 13 example of the first set of preconditions
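The preprocessing in step S801 under the bag-of-words reading above can be sketched as follows: a text satisfies a first constraint rule when it contains every first keyword of at least one aggregation bag. The rule contents and function names here are invented for the example.

```python
# Sketch of step S801: build one first preprocessing set per first constraint
# rule. A rule is a list of aggregation bags; a bag is a list of keywords
# joined by a logical AND.

def matches_rule(text, rule):
    """True if the text contains all keywords of at least one aggregation bag."""
    return any(all(kw in text for kw in bag) for bag in rule)

def build_first_pre_sets(texts, rules):
    """Return one first preprocessing set (a list of texts) per rule."""
    return [[t for t in texts if matches_rule(t, rule)] for t_ignored, rule in
            ((None, r) for r in rules)]
```

Texts matching no rule simply appear in no first preprocessing set, matching the behavior described above.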
S200: and acquiring k texts in the text set to be clustered as cluster centers, wherein k is less than N, and N is the total number of the texts in the text set to be clustered.
S300: if the cluster center is contained in any one of the subsets of the second preprocessing set, the text in the other subset of the second preprocessing set is added to the cluster-like mutually exclusive set corresponding to the cluster center.
Steps S200 and S300 may refer to the description related to the first embodiment, and are not described herein again. Following the example in S100, and performing the steps of S200-S300 with the same operations as in the example of step S300 of the first embodiment, the same results as those shown in table 3 in the first embodiment can be obtained.
S802: if the cluster center is contained in the first preprocessing set, adding the rest texts in the first preprocessing set except the cluster center into the class cluster corresponding to the cluster center, and removing the texts which are added into the class cluster in the text set to be clustered.
In the step S802, all the cluster centers may be sequentially traversed, and for each cluster center, if the cluster center is included in the first preprocessing set, the remaining texts in the same first preprocessing set as the cluster center are added to the class cluster corresponding to the cluster center, and then the texts already added to the class cluster in the text set to be clustered are removed. And if the cluster center is not contained in any first preprocessing set, processing the next cluster center until all the cluster centers are processed.
It should be noted that the order of the step S802 and the step S300 may be exchanged, and the order of the steps is not limited in this application. Since the step of S802 and the step of S300 both need to traverse the cluster centers, in an implementation manner, the cluster centers may be traversed only once, for example, for each cluster center, it may be determined first whether the cluster center is included in any subset of the second pre-processing set, and if so, the text in another subset of the second pre-processing set is added to the class cluster mutual exclusion set corresponding to the cluster center; and then judging whether the cluster center is contained in the first preprocessing set, if so, adding the rest texts in the first preprocessing set except the cluster center into the class cluster corresponding to the cluster center, and removing the texts which are added into the class cluster in the text set to be clustered. And performing the subsequent steps of S400 and S500 until all the cluster centers are traversed.
Following the example in the foregoing step of S801 in the present embodiment, it is assumed that the step of S200 acquires 3 texts as cluster centers, which are: text A, text D and text H, then the total of the unclassified texts in the text set to be clustered at this time is 8: text B, text C, text E, text F, text G, text I, text J, text L.
After the step of S802 is performed, the results shown in table 14 can be obtained. Specifically, first, for the cluster center of class cluster 1, text a, text B is added to class cluster 1 because text a is contained in the first preprocessing set 1. Then for the next cluster center, text D, text C and text E are added to class cluster 2 as text D is contained in the first pre-processing set 2. Finally, for the next cluster center, text H, there is no text other than text H in the cluster 3 at this time because text H is not included in any of the first preprocessed sets.
TABLE 14 Example of class clusters after performing the step of S802

Class cluster number | Cluster center | Remaining texts in the class cluster
1 | Text A | Text B
2 | Text D | Text C, text E
3 | Text H | ——
Before the step of S802, the text set to be clustered includes 8 texts that have not been classified yet. After the step of adding the rest of texts in the first preprocessing set except the cluster center into the class cluster corresponding to the cluster center, removing the texts already added into the class cluster in the text set to be clustered, namely removing the texts B, C and E. At this time, 5 texts which are not classified yet remain in the text set to be clustered: text F, text G, text I, text J, and text L.
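Step S802 as walked through above can be sketched as follows. All names and data structures are illustrative assumptions; the sets mirror the running example of this embodiment.

```python
# Sketch of step S802: if a cluster center appears in a first preprocessing
# set, the other texts of that set join its class cluster directly and leave
# the pool of unclassified texts, so they never require distance computations.

def apply_first_rule(centers, first_pre_sets, clusters, unclassified):
    """centers: {cid: text}; first_pre_sets: list of text lists; clusters: {cid: set}."""
    for cid, center in centers.items():
        for pre_set in first_pre_sets:
            if center in pre_set:
                rest = [t for t in pre_set if t != center]
                clusters.setdefault(cid, set()).update(rest)
                for t in rest:
                    unclassified.discard(t)
    return clusters, unclassified
```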
S400: if the current text in the text set to be clustered is contained in the x cluster mutual exclusion sets, calculating the distances between the current text and other (k-x) cluster centers except the cluster center corresponding to the cluster mutual exclusion set, and adding the current text into the cluster corresponding to the cluster center closest to the current text, wherein x is more than 0 and less than k;
S500: and if the current text in the text set to be clustered is not contained in any one class cluster mutual exclusion set or the current text is contained in all the class cluster mutual exclusion sets, calculating the distances between all cluster centers and the current text, and adding the current text into the class cluster corresponding to the cluster center closest to the current text.
S600: and recalculating a new cluster center of each class cluster, and outputting all the class clusters if the new cluster center meets a preset stop condition.
The steps S400 to S600 may refer to the description related to the first embodiment, and are not described herein again.
In the step S600, if the new cluster center does not satisfy the preset stop condition, the text set to be clustered is reset, the cluster mutual exclusion set is reset, the steps S300, S802, S400, and S500 are repeatedly executed with the new cluster center as the cluster center, and then the new cluster center of each cluster is recalculated. Here, the steps of S300, S802, S400, and S500 and the process of calculating a new cluster center for each class cluster are repeated each time, and an iterative process is accumulated. And outputting all the cluster classes until the new cluster core meets the preset stop condition.
The following is further explained by following the example in the aforementioned step of S802 in the present embodiment. At this time, 5 texts which are not classified yet remain in the text set to be clustered: text F, text G, text I, text J, and text L.
(1) And judging whether the text F is contained in the class cluster mutual exclusion set of the table 3, wherein the judgment result is that the text F is contained in the class cluster mutual exclusion set 2, and the cluster center 2 corresponding to the class cluster mutual exclusion set 2 is the text D, so that the distances between the other cluster centers except the cluster center 2 'text D', namely the cluster center 1 and the cluster center 3, and the text F are respectively calculated. Assuming that text F is closest in distance to cluster center 3 "text H", text F is added to class cluster 3.
(2) And judging whether the text G is contained in the class cluster mutual exclusion set of the table 3, wherein the judgment result is that the text G is contained in the class cluster mutual exclusion set 2, and the cluster center 2 corresponding to the class cluster mutual exclusion set 2 is the text D, so that the distances between the other cluster centers except the cluster center 2 'text D', namely the cluster center 1 and the cluster center 3, and the text G are respectively calculated. Assuming that text G is closest in distance to cluster center 3 "text H", text G is added to class cluster 3.
(3) And judging whether the text I is contained in the class cluster mutual exclusion set shown in the table 3, and respectively calculating the distances between the text I and three cluster centers if the judgment result shows that the text I is not contained in any class cluster mutual exclusion set. Assuming that text I is closest in distance to the cluster center 1 "text A", text I is added to class cluster 1.
(4) Judging whether the text J is contained in the class cluster mutual exclusion set shown in the table 3, wherein the judgment result is that the text J is contained in the class cluster mutual exclusion set 1, and the cluster center 1 corresponding to the class cluster mutual exclusion set 1 is the text A, so that the distances between the other cluster centers except the cluster center 1 and the text A, namely the cluster center 2 and the cluster center 3, and the text J are respectively calculated. Assuming that text J is closest in distance to cluster center 3 "text H", text J is added to class cluster 3.
(5) And judging whether the text L is contained in the class cluster mutual exclusion set shown in the table 3, and respectively calculating the distances between the text L and three cluster centers if the judgment result shows that the text L is not contained in any class cluster mutual exclusion set. Assuming that text L is closest in distance to cluster center 3 "text H", text L is added to class cluster 3.
So far, all texts in the text set to be clustered have been clustered into corresponding clusters, and the result is the same as table 4.
After one iteration, it is assumed that a new cluster center of each class cluster is recalculated, which is: class cluster 1-text B, class cluster 2-text C, class cluster 3-text J.
If the distance between the original cluster center text A of the cluster 1 and the new cluster center text B is smaller than the preset threshold value, the distance between the original cluster center text D of the cluster 2 and the new cluster center text C is smaller than the preset threshold value, and the distance between the original cluster center text H of the cluster 3 and the new cluster center text J is smaller than the preset threshold value through calculation, the new cluster center meets the stop condition at the moment, all the clusters are output, and the result is as shown in the table 5.
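The stop condition described above can be sketched as: iteration stops when every new cluster center lies within a preset distance threshold of the old center of the same class cluster. The toy length-based distance is an assumption for the example; any text distance could be substituted.

```python
def length_distance(a, b):
    """Toy distance used only for illustration: absolute length difference."""
    return abs(len(a) - len(b))

def has_converged(old_centers, new_centers, distance, threshold):
    """Both center maps go from cluster ids to center texts."""
    return all(distance(old_centers[cid], new_centers[cid]) < threshold
               for cid in old_centers)
```

When `has_converged` returns False, the new centers replace the old ones, the text set and the class cluster mutual exclusion sets are reset, and another iteration of steps S300, S802, S400, and S500 is run.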
In the clustering method of this embodiment, first, a first constraint rule and a second constraint rule built in a k-means clustering method are respectively used to preprocess a text set to be clustered, so as to obtain a first preprocessing set and a second preprocessing set. And then classifying the texts which are to be clustered together with the cluster center and accord with a first constraint rule by utilizing the first preprocessing set. In this way, the texts classified into the class clusters do not need to calculate the distance from each cluster center one by one, and therefore the calculation complexity is reduced. Then, for each remaining text which is not classified in the text to be clustered, whether the text is included in the cluster-like mutual exclusion set is judged, so that the cluster centers which need to calculate the distance between the text and the text are excluded, the number of the cluster centers which need to participate in calculation during classification is reduced, and the calculation complexity of text clustering is further reduced. Particularly, under the conditions that the number of texts in a text set to be clustered is large and the number of clustered clusters is large, the k-means clustering method adopting the built-in constraint rule can obviously reduce the calculation complexity.
Optionally, referring to fig. 4, after the step of adding the current text to the cluster class corresponding to the cluster center closest to the current text in S400 or S500, the method may further include:
s803: if the current text is contained in the first preprocessing set, adding the rest texts in the first preprocessing set except the current text into the class cluster where the current text is located, and removing the texts which are already added into the class cluster in the text set to be clustered.
The following continues the example from the foregoing step of S802. At this point, 5 texts in the text set to be clustered remain unclassified: text F, text G, text I, text J, and text L.
(1) It is judged whether text F is contained in any class cluster mutual exclusion set of table 3. The result is that text F is contained in class cluster mutual exclusion set 2, and the cluster center 2 corresponding to class cluster mutual exclusion set 2 is text D; therefore, the distances between text F and the other cluster centers except cluster center 2 "text D", namely cluster center 1 and cluster center 3, are respectively calculated. Assuming that text F is closest to cluster center 3 "text H", text F is added to class cluster 3.
Since text F is contained in the first preprocessing set 3 shown in table 13, the remaining text in the first preprocessing set 3 other than text F, namely text G, is added to class cluster 3, and text G is then removed from the text set to be clustered.
At this point, 3 texts in the text set to be clustered remain unclassified: text I, text J, and text L.
(2) It is judged whether text I is contained in any class cluster mutual exclusion set shown in table 3. Since text I is not contained in any class cluster mutual exclusion set, the distances between text I and the three cluster centers are respectively calculated. Assuming that text I is closest to cluster center 1 "text A", text I is added to class cluster 1.
(3) It is judged whether text J is contained in any class cluster mutual exclusion set shown in table 3. The result is that text J is contained in class cluster mutual exclusion set 1, and the cluster center 1 corresponding to class cluster mutual exclusion set 1 is text A; therefore, the distances between text J and the other cluster centers except cluster center 1 "text A", namely cluster center 2 and cluster center 3, are respectively calculated. Assuming that text J is closest to cluster center 3 "text H", text J is added to class cluster 3.
(4) It is judged whether text L is contained in any class cluster mutual exclusion set shown in table 3. Since text L is not contained in any class cluster mutual exclusion set, the distances between text L and the three cluster centers are respectively calculated. Assuming that text L is closest to cluster center 3 "text H", text L is added to class cluster 3.
At this point, all texts in the text set to be clustered have been clustered into corresponding class clusters, and the result is the same as table 4.
Through the above example, it can be seen that by adding the step of S803, after text F has been added to class cluster 3, since text G and text F are both in the first preprocessing set 3, that is, both conform to the first constraint rule 3, text G is directly added to class cluster 3, and the step of calculating the distance between text G and each cluster center one by one is omitted. The number of texts to be clustered for which the distance to each cluster center must be calculated one by one is thereby further reduced, which further reduces the calculation complexity of text clustering.
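The classification of steps S400/S500 and the must-link propagation of S803, as walked through above, can be sketched as follows. The representation of texts as labels with a pluggable `distance` function, and all variable and parameter names, are illustrative assumptions.

```python
def assign_text(text, centers, exclusion_sets, distance):
    """S400/S500: pick the nearest cluster center, skipping any center
    whose class cluster mutual exclusion set contains `text`; if `text`
    appears in every exclusion set, all centers participate (S500)."""
    excluded = {cid for cid, excl in exclusion_sets.items() if text in excl}
    candidates = [cid for cid in centers if cid not in excluded]
    if not candidates:                 # contained in all exclusion sets
        candidates = list(centers)
    return min(candidates, key=lambda cid: distance(text, centers[cid]))

def propagate_must_link(text, cluster_id, first_sets, clusters, pool):
    """S803: when `text` belongs to a first preprocessing set, move the
    set's remaining texts into the same class cluster and drop them
    from the pool of unclassified texts."""
    for fset in first_sets:
        if text in fset:
            for other in fset:
                if other != text and other in pool:
                    clusters[cluster_id].add(other)
                    pool.remove(other)
```

In the worked example, calling `assign_text` for text F with cluster center 2 excluded yields class cluster 3, after which `propagate_must_link` pulls text G into the same cluster without any distance calculation for it.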
In a third embodiment of the present application, a method for clustering k-means texts with built-in constraint rules is provided, which combines the methods in the first and second embodiments, that is, the method includes:
s801: preprocessing a text set to be clustered by utilizing a first constraint rule to obtain a first preprocessing set corresponding to the first constraint rule, wherein texts conforming to the first constraint rule are required to be clustered into the same cluster, and the first preprocessing set comprises texts conforming to the corresponding first constraint rule.
S100: and preprocessing a text set to be clustered by utilizing a second constraint rule to obtain a second preprocessing set corresponding to the second constraint rule, wherein the second constraint rule comprises two sub-rules, texts conforming to one sub-rule and texts conforming to the other sub-rule are required to be clustered into different clusters, the second preprocessing set comprises two sub-sets, and each sub-set comprises texts conforming to one corresponding sub-rule.
S200: and acquiring k texts in the text set to be clustered as cluster centers, wherein k is less than N, and N is the total number of the texts in the text set to be clustered.
Each cluster center is traversed, and for each cluster center it is respectively judged whether the cluster center is contained in any subset of the second preprocessing set and whether the cluster center is contained in the first preprocessing set. If the conditions for performing S300 and S802 are satisfied, the steps of S300 and S802 are performed. Preferably, the step of S300 is performed before the step of S802.
S300: if the cluster center is contained in any one of the subsets of the second preprocessing set, the text in the other subset of the second preprocessing set is added to the cluster-like mutually exclusive set corresponding to the cluster center.
S802: if the cluster center is contained in the first preprocessing set, adding the rest texts in the first preprocessing set except the cluster center into the class cluster corresponding to the cluster center, and removing the texts which are added into the class cluster in the text set to be clustered.
Each unclassified text in the text set to be clustered is then traversed. For each text, it is judged whether the text is contained in the class cluster mutual exclusion sets. If the text is contained in some, but not all, of the class cluster mutual exclusion sets, the step of S400 is performed; if the text is contained in all the class cluster mutual exclusion sets or is not contained in any class cluster mutual exclusion set, the step of S500 is performed. It is then judged whether the current text is contained in any subset of the second preprocessing set, and if so, the step of S700 is performed; and it is judged whether the current text is contained in the first preprocessing set, and if so, the step of S803 is performed.
S400: if the current text in the text set to be clustered is contained in the x cluster mutual exclusion sets, calculating the distances between the current text and other (k-x) cluster centers except the cluster center corresponding to the cluster mutual exclusion set, and adding the current text into the cluster corresponding to the cluster center closest to the current text, wherein x is more than 0 and less than k.
S500: and if the current text in the text set to be clustered is not contained in any one cluster mutual exclusion set or the current text is contained in all the cluster mutual exclusion sets, calculating the distances between all cluster centers and the current text, and adding the current text into the cluster corresponding to the cluster center closest to the current text.
S700: if the current text is contained in any one of the subsets of the second preprocessing set, adding the text in the other subset of the second preprocessing set to a cluster-exclusive set of the clusters corresponding to the cluster center of the cluster to which the current text belongs.
S803: if the current text is contained in the first preprocessing set, adding the rest texts in the first preprocessing set except the current text into the class cluster where the current text is located, and removing the texts which are already added into the class cluster in the text set to be clustered.
Through the steps of S400 and S500, every text in the text set to be clustered is added to the class cluster corresponding to the cluster center closest to it, completing the classification process. Through the steps of S700 and S803, when subsequent texts in the text set to be clustered are classified, both the number of cluster centers and the number of texts that need to participate in the calculation are reduced, thereby reducing the calculation complexity of text clustering.
And after all the texts in the text set to be clustered are traversed, all the texts are classified, and then the step S600 is executed.
S600: and recalculating a new cluster center of each class cluster, and outputting all the class clusters if the new cluster center meets a preset stop condition.
If the new cluster centers do not meet the preset stop condition, the text set to be clustered is reset, and the class cluster mutual exclusion sets are reset to empty. Then, with the new cluster centers serving as the cluster centers, the steps of S300, S802, S400, S500, S700, and S803 are repeatedly performed, and the new cluster center of each class cluster is recalculated. This iterative process is repeated until the new cluster centers meet the preset stop condition, at which point all the class clusters are output.
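For reference, the iterative procedure of this embodiment can be condensed into a single sketch. Everything concrete here is an assumption made for illustration: texts are labels mapped to vectors, the distance is squared Euclidean, the initial k centers are simply the first k texts, and the new cluster center of each class cluster is taken to be its medoid; the embodiment itself prescribes none of these choices.

```python
def kmeans_with_constraints(texts, vectors, k, first_sets, second_sets,
                            threshold, max_iter=100):
    """Condensed sketch of the combined method of the third embodiment.
    `vectors` maps each text label to a numeric vector; all names and
    concrete choices (metric, init, medoid update) are illustrative."""
    def dist(a, b):
        # squared Euclidean distance between two texts (illustrative)
        return sum((x - y) ** 2 for x, y in zip(vectors[a], vectors[b]))

    centers = list(texts)[:k]                    # S200: pick k texts
    clusters = {}
    for _ in range(max_iter):
        clusters = {i: {c} for i, c in enumerate(centers)}
        exclusion = {i: set() for i in range(k)}  # reset each round
        pool = set(texts) - set(centers)
        for i, c in enumerate(centers):           # S300 and S802
            for sub_a, sub_b in second_sets:
                if c in sub_a: exclusion[i] |= sub_b
                if c in sub_b: exclusion[i] |= sub_a
            for fset in first_sets:
                if c in fset:
                    clusters[i] |= fset & pool
                    pool -= fset
        while pool:                               # S400/S500/S700/S803
            t = pool.pop()
            cand = [i for i in range(k) if t not in exclusion[i]]
            cand = cand or list(range(k))         # S500 fallback
            best = min(cand, key=lambda i: dist(t, centers[i]))
            clusters[best].add(t)
            for sub_a, sub_b in second_sets:      # S700
                if t in sub_a: exclusion[best] |= sub_b
                if t in sub_b: exclusion[best] |= sub_a
            for fset in first_sets:               # S803
                if t in fset:
                    clusters[best] |= fset & pool
                    pool -= fset
        new_centers = [min(cl, key=lambda x: sum(dist(x, y) for y in cl))
                       for cl in clusters.values()]  # medoid per cluster
        if all(dist(o, n) < threshold             # S600 stop condition
               for o, n in zip(centers, new_centers)):
            break
        centers = new_centers
    return clusters
```

The `cand or list(range(k))` line implements the S500 fallback: when the current text is excluded from every class cluster, all k cluster centers participate in the distance calculation again.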
Referring to fig. 5, in a fourth embodiment of the present application, there is provided a k-means text clustering device with built-in constraint rules, including:
the second preprocessing unit 3 is configured to preprocess a to-be-clustered text set by using a second constraint rule to obtain a second preprocessing set corresponding to the second constraint rule, where the second constraint rule includes two sub-rules, a text that conforms to one of the sub-rules and a text that conforms to the other sub-rule must be clustered into different clusters, the second preprocessing set includes two sub-sets, and each sub-set includes a text that conforms to one corresponding sub-rule; optionally, the sub-rule of the second constraint rule includes at least one mutex bag, the mutex bag includes at least one preset second keyword, and when the number of the second keywords in the same mutex bag is greater than or equal to 2, the mutex bag further includes a logical and relationship between the second keywords.
The clustering unit 2 is used for acquiring k texts in the text set to be clustered as cluster centers; adding the text in another subset of the second preprocessing set to a cluster-like mutually exclusive set corresponding to the cluster center if the cluster center is contained in any subset of the second preprocessing set; under the condition that the current text in the text set to be clustered is contained in x cluster mutual exclusion sets, calculating the distances between the current text and other (k-x) cluster centers except the cluster center corresponding to the cluster mutual exclusion set, and adding the current text into the cluster corresponding to the cluster center closest to the current text; calculating the distance between all cluster centers and the current text under the condition that the current text in the text set to be clustered is not contained in any one cluster mutual exclusion set or the current text is contained in all cluster mutual exclusion sets, and adding the current text into the cluster corresponding to the cluster center closest to the current text; and recalculating a new cluster center of each class cluster, and outputting all the class clusters if the new cluster center meets a preset stop condition. Wherein k is less than N, and N is the total number of texts in the text set to be clustered; x is more than 0 and less than k.
Optionally, the clustering unit 2 is further configured to, if the current text is included in any subset of the second pre-processing set, add the text in another subset of the second pre-processing set to a cluster-like mutually exclusive set corresponding to a cluster center of a cluster to which the current text belongs.
Optionally, the k-means text clustering device with built-in constraint rule further includes:
the first preprocessing unit 1 is configured to preprocess a to-be-clustered text set by using a first constraint rule to obtain a first preprocessing set corresponding to the first constraint rule, where texts meeting the first constraint rule must be clustered into the same class cluster, and the first preprocessing set includes texts meeting the corresponding first constraint rule; optionally, the first constraint rule includes at least one aggregation word bag, the aggregation word bag includes at least one preset first keyword, and when the number of the first keywords in the same aggregation word bag is greater than or equal to 2, the aggregation word bag further includes a logical and relationship between the first keywords.
The clustering unit 2 is further configured to, if a cluster center is contained in the first preprocessing set, add the remaining texts in the first preprocessing set except the cluster center to the class cluster corresponding to the cluster center, and remove the texts already added to the class cluster from the text set to be clustered.
Optionally, the clustering unit 2 is further configured to, if the current text is contained in the first preprocessing set, add the remaining texts in the first preprocessing set except the current text to the class cluster where the current text is located, and remove the texts already added to the class cluster from the text set to be clustered.
The same and similar parts in the various embodiments in this specification may be referred to each other. The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.