CN115455943A - Text clustering method and device, nonvolatile storage medium and electronic equipment - Google Patents


Info

Publication number
CN115455943A
CN115455943A
Authority
CN
China
Prior art keywords
connection matrix
target
clustering
cluster
matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211227973.XA
Other languages
Chinese (zh)
Inventor
阮禄
冉猛
危枫
王晨子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202211227973.XA priority Critical patent/CN115455943A/en
Publication of CN115455943A publication Critical patent/CN115455943A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a text clustering method and apparatus, a non-volatile storage medium, and an electronic device. The method includes: clustering texts to be clustered according to a first algorithm to obtain a plurality of clusters, and determining keywords of the clusters according to a second algorithm; determining a connection matrix, which is a symmetric matrix, according to the clusters and the keywords; normalizing the connection matrix to obtain a first target connection matrix, and decomposing the first target connection matrix according to its average value to obtain a second target connection matrix; and merging the second target connection matrix to generate a target clustering result. The method and apparatus solve the technical problem that existing clustering algorithms, when the number of clusters is set too high, split different expressions of the same category into different categories, resulting in inaccurate clustering.

Description

Text clustering method and device, nonvolatile storage medium and electronic equipment
Technical Field
The present application relates to the field of text clustering technologies, and in particular, to a text clustering method and apparatus, a non-volatile storage medium, and an electronic device.
Background
One simple and practical algorithm for text clustering is the K-means algorithm. Although it cannot guarantee an optimal clustering result, it performs well in practice.
However, in practical experiments, K-means clustering results still exhibit the following problems:
1) If the number of clusters is set too high, the model splits different expressions of frequently occurring problems into different categories. For example, some users complain directly about the problem of "not enjoying the offer", e.g. "the payment was not discounted"; other users may say, more completely, "I could not enjoy the offer after paying with Wing Pay when I signed up for a certain meal-package promotion". These describe the same problem, but because the customers' expressions differ greatly, the model may erroneously treat them as different categories.
2) If the number of clusters is set somewhat lower, the model mitigates problem 1) to a certain extent, but a new problem arises: since sentence encoding based on the word-vector mean cannot fully capture the semantics of a sentence, noise or bias is inevitably introduced. Therefore, with a small number of clusters, the model may erroneously group samples of different classes into one class.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the present application provide a text clustering method and apparatus, a non-volatile storage medium, and an electronic device, so as to at least solve the technical problem that existing clustering algorithms, when the number of clusters is set too high, split different expressions of the same category into different categories, resulting in inaccurate clustering.
According to an aspect of the embodiments of the present application, there is provided a text clustering method, including: clustering texts to be clustered according to a first algorithm to obtain a plurality of clusters, and determining keywords of the clusters according to a second algorithm; determining a connection matrix, which is a symmetric matrix, according to the clusters and the keywords; normalizing the connection matrix to obtain a first target connection matrix, and decomposing the first target connection matrix according to its average value to obtain a second target connection matrix; and merging the second target connection matrix to generate a target clustering result.
Optionally, determining a connection matrix according to the clusters and the keywords includes: determining a first connection matrix according to the initialized connection matrix, the first target word frequency, the criticality of the target words in the first cluster, and the length of the target intersection, wherein the criticality of the target words in the first cluster is determined according to the first algorithm; determining a second connection matrix according to the initialized connection matrix, the second target word frequency, the criticality of the target words in the second cluster, and the length of the target intersection, wherein the criticality of the target words in the second cluster is determined according to the second algorithm; and determining the connection matrix according to the first connection matrix, the second connection matrix, and the length of the target intersection.
Optionally, before determining the first connection matrix, the method further comprises: generating an initialized connection matrix, wherein the initialized connection matrix is a K×K connection matrix with all values 0, and K is the number of clusters; acquiring a keyword of a first cluster among the plurality of clusters and a keyword of a second cluster among the plurality of clusters, and determining a target intersection of the keywords of the first cluster and the keywords of the second cluster; obtaining a first target word frequency of a target word according to the number of occurrences of the target word in the texts to be clustered in the first cluster and the number of occurrences of all words in the texts to be clustered, wherein the first target word frequency is the word frequency of the target word in the direction from the first cluster to the second cluster; and obtaining a second target word frequency of the target word according to the number of occurrences of the target word in the texts to be clustered in the second cluster and the number of occurrences of all words in the texts to be clustered, wherein the second target word frequency is the word frequency of the target word in the direction from the second cluster to the first cluster.
Optionally, K is the number of clusters, randomly generated within a target range according to the first algorithm.
Optionally, normalizing the connection matrix to obtain the first target connection matrix includes: converting the connection matrix into a first connection matrix, wherein the first connection matrix is an asymmetric matrix; performing first processing on the first connection matrix to obtain a second connection matrix, wherein the first processing includes at least one of matrix broadcasting and matrix reshaping, and the first processing is used to remove a first target number from each row of the first connection matrix; performing second processing on the second connection matrix to obtain a third connection matrix, wherein the second processing at least includes raising the second target number to the power of each element of the second connection matrix; performing third processing on the third connection matrix to obtain a fourth connection matrix, wherein the third processing at least includes summing the elements of each row of the third connection matrix; and performing fourth processing on the fourth connection matrix to obtain the first target connection matrix, wherein the fourth processing at least includes determining the dimension of the fourth connection matrix according to the dimension of the connection matrix.
Optionally, before merging the second target connection matrix, the method further includes: removing abnormal clusters from the second target connection matrix, wherein an abnormal cluster is a subset of the set formed by the plurality of clusters.
Optionally, decomposing the first target connection matrix according to its average value to obtain the second target connection matrix includes: traversing all elements of the first target connection matrix, wherein an element whose value is greater than the average value is set to a first number, and an element whose value is not greater than the average value is set to a second number; and determining the second target connection matrix according to the first numbers and the second numbers.
According to still another aspect of the embodiments of the present application, there is provided a non-volatile storage medium, where the storage medium includes a stored program, and the program, when running, controls a device on which the storage medium is located to perform the above text clustering method.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device, including: the device comprises a memory and a processor, wherein the processor is used for running a program stored in the memory, and the program is used for executing the text clustering method.
In the embodiments of the present application, the texts to be clustered are clustered according to a first algorithm to obtain a plurality of clusters, and keywords of the clusters are determined according to a second algorithm; a connection matrix, which is a symmetric matrix, is determined according to the clusters and the keywords; the connection matrix is normalized to obtain a first target connection matrix, which is decomposed according to its average value to obtain a second target connection matrix; and the second target connection matrix is merged to generate the target clustering result. By constructing the connection matrix and normalizing it, the K-means clustering method is optimized, achieving the technical effect of clustering texts more accurately and solving the technical problem that existing clustering algorithms, when the number of clusters is set too high, classify different expressions of the same category into different categories, resulting in inaccurate clustering.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a flow chart of a method of clustering text according to an embodiment of the present application;
FIG. 2 is a diagram of an effect of text clustering according to the related art;
FIG. 3 is a diagram of an effect of another text clustering according to the related art;
FIG. 4 is a diagram of an effect of text clustering according to the present application;
FIG. 5 is a diagram of an effect of another text clustering according to the present application;
FIG. 6 is a diagram of an effect of another text clustering according to the related art;
FIG. 7 is a diagram of an effect of another text clustering according to the present application;
FIG. 8 is a silhouette-coefficient comparison diagram according to the related art;
FIG. 9 is a silhouette-coefficient comparison diagram according to the present application;
FIG. 10 is a silhouette-coefficient comparison diagram in a low-dimensional data space according to the related art;
FIG. 11 is a silhouette-coefficient comparison diagram in a low-dimensional data space according to an embodiment of the present application;
FIG. 12 is a block diagram of a text clustering apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the accompanying drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
For a better understanding of the embodiments of the present application, technical terms referred to in the embodiments of the present application are explained as follows:
clustering: refers to the process of dividing a given set of objects into different subsets with the goal of making the elements within each subset as similar as possible and the elements between different subsets as dissimilar as possible. These subsets, also referred to as clusters, generally do not intersect.
Text clustering: also known as document clustering, refers to cluster analysis of documents, and is widely used in the fields of text mining and information retrieval. Initially, text clustering was used only for text archiving, and later people explored many new uses, such as improving search results, generating synonyms, and so on.
Contour coefficient: an evaluation mode for good and bad clustering effect. The contour coefficient can be used for evaluating the influence of different algorithms or different operation modes of the algorithms on the clustering result on the basis of the same original data by combining two factors of the cohesion degree and the separation degree.
Clustering is commonly used to pre-process data or to archive similar data. The workflow is the same as for most other learning tasks, except that the data need not be labeled: features are extracted and then handed to some machine-learning algorithm. Through clustering, websites can offer users popularity-based recommendations. Movies of, say, the science-fiction genre may automatically be grouped in the same cluster, and when a user requests one of them, the website recommends the movies most similar to it. Such recommendations are not personalized per user, since clustering considers only the features of the movies themselves and not the user's personal taste. Popularity-based recommendations are particularly friendly to new users, because a freshly registered user has almost no play history, making preferences hard to predict; recommending similar movies via clustering is often a smooth "cold start" strategy. With a small amount of manual spot-checking, clustering can also automatically screen out samples sharing certain common traits. For example, an application with inflated ratings generally has a high ratio of rating count to download count, while its daily active users and retention rate are low. The exact values of these indicators are difficult to determine manually, but they should fall within fixed intervals. After newly listed applications are grouped into several clusters, a few samples are randomly selected from each cluster for manual spot-checking. A cluster containing applications manually identified as having inflated ratings is likely to contain more similar applications, thereby narrowing the scope of the spot check and reducing labor costs.
Clustering can also play a role in text pre-processing. For example, before annotating a corpus, a certain number of representative documents are generally selected from it as samples. If N documents need to be annotated, the raw corpus can be clustered into N clusters and one document randomly selected from each cluster. Exploiting the property that elements within each cluster are similar, clustering can even be used for text deduplication.
In the related art, a simple and practical algorithm for text clustering is the K-means algorithm. Although the algorithm cannot guarantee an optimal clustering result, it works very well in practice. The basic idea of K-means is: cluster around k center points in space, assigning each object to the nearest center, and iteratively update the value of each cluster center until the best clustering result is obtained. The steps of K-means are as follows:
1. Select k initial samples as the initial cluster centers a = {a_1, a_2, ..., a_k};
2. For each sample x^(i) in the data set, compute its distance to each of the k cluster centers and assign it to the class whose center is nearest: c^(i) := argmin_j ||x^(i) - μ_j||^2;
3. For each class c_j, recompute its cluster center μ_j = (1/|c_j|) Σ_{i∈c_j} x^(i) (i.e., the centroid of all samples belonging to that class);
4. Repeat steps 2 and 3 until some termination condition (number of iterations, minimum error change, etc.) is reached.
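The four steps above can be sketched as a minimal NumPy implementation (an illustration only, not the patent's code; initial centers are drawn uniformly at random, and an empty cluster keeps its previous center):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: X is an (n, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: select k samples as the initial cluster centers a_1..a_k.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each sample to the class of its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the centroid of its class.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: terminate when the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

The termination test here is "centers stop moving"; an iteration cap or a minimum-error-change threshold, as the text mentions, would serve equally well.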
The K-means algorithm has the following disadvantages:
1. The K value must be set manually, and different K values yield different results;
2. It is sensitive to the initial cluster centers; different initializations can yield different results;
3. It is sensitive to outliers;
4. Each sample can only be assigned to one class, making it unsuitable for multi-label tasks;
5. It is unsuitable for sample classes that are too discrete or unbalanced, and for non-convex cluster shapes.
Because the K-means clustering method requires the number of clusters to be specified in advance, and this number has a large influence on the clustering effect, multiple values must be tried according to business logic before clustering, continually tuning the number of clusters. As a result, K-means clustering results have the following problems:
1. When K is set too small, the model mixes samples of different classes together, which is difficult to correct, as shown in fig. 2.
2. When K is set too large, the model splits samples of the same class into multiple classes, as shown in fig. 3; in this case, categories 1 and 2, and categories 3 and 4, need to be re-merged into new categories, which is easier to correct than problem 1 (fig. 2).
To solve the problem that a clustering model splits different expressions of the same category into different categories when the number of clusters is set too high, the embodiments of the present application provide a solution: as shown in fig. 4, when K is large, the characteristics of each cluster are mined, and highly correlated clusters are merged through a certain mechanism to obtain an ideal clustering result.
In accordance with an embodiment of the present application, there is provided a method embodiment of a text clustering method, it should be noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
Fig. 1 is a flowchart of a text clustering method according to an embodiment of the present application, and as shown in fig. 1, the method includes the following steps:
step S102, clustering the texts to be clustered according to a first algorithm to obtain a plurality of clustering clusters, and determining keywords of the clustering clusters according to a second algorithm.
According to an alternative embodiment of the present application, the texts to be clustered are clustered with the K-means algorithm, using a K value chosen randomly within a certain range; in K-means, K represents the number of classes, i.e., the number of clusters. The keywords of the clusters generated by K-means are then extracted with the TextRank algorithm. The keywords may be domain keywords; for example, in the big-data field, domain keywords may include data reclamation, data governance, data capitalization, data circulation, data security, and the like.
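A minimal sketch of the keyword-determination half of step S102, assuming cluster labels have already been produced by K-means; plain term frequency stands in for TextRank scoring here purely for brevity, and the whitespace tokenizer is a simplification:

```python
from collections import Counter

def cluster_keywords(texts, labels, top_n=3):
    """Return the top_n most frequent tokens of each cluster as proxy keywords.
    (The patent uses TextRank; term frequency is a stand-in for illustration.)"""
    keywords = {}
    for c in set(labels):
        counts = Counter(
            tok
            for text, lab in zip(texts, labels) if lab == c
            for tok in text.split()
        )
        keywords[c] = [w for w, _ in counts.most_common(top_n)]
    return keywords
```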
Step S104, determining a connection matrix according to the clusters and the keywords, wherein the connection matrix is a symmetric matrix.
According to another alternative embodiment of the present application, the specific calculation steps for determining the connection matrix are as follows:

First, initialize a K×K connection matrix C with all values 0, where K represents the number of initial clusters of the K-means algorithm:

    C = 0_{K×K}    (1)

The initial connection matrix is then populated by calculating the degree of connection between classes according to the following formulas:

    I = W_S ∩ W_T    (2)
    tf_i^{S→T} = n_i^S / N    (3)
    E_i^{T→S} = tf_i^{T→S} · w_i^S / L_I    (4)
    E_j^{S→T} = tf_j^{S→T} · w_j^T / L_I    (5)
    C_TS = Σ_{i=1}^{L_I} (E_i^{S→T} + E_i^{T→S})    (6)

In formula (2), I represents the intersection of the keywords of the S-th and T-th clusters, and W_S and W_T represent the keyword sets of the corresponding clusters.

In formula (3), tf_i^{S→T} represents the word frequency of the keyword in the direction from cluster S to cluster T, n_i^S represents the number of occurrences, in the texts to be clustered, of that keyword within cluster S, and N represents the total number of occurrences of all words in the text; formula (3) is a normalization whose purpose is to eliminate differences in document length.

In formula (4), E_i^{T→S} represents the influence of the i-th word, in the direction pointing to S, on the connectivity of the two clusters, where w_i^S represents the criticality of the i-th word in cluster S, which can be obtained according to the K-means algorithm, and L_I is the length of the keyword intersection of the two clusters. By the same reasoning, E_j^{S→T} in formula (5) represents the influence of the j-th word, in the direction from S to T, on the connectivity of the two clusters.

Formula (6) first superimposes the connection degrees of the words in the two directions (S to T and T to S) and then accumulates them over the keyword intersection to obtain the final degree of connection between the two clusters; C_TS is the connection matrix determined according to the clusters and the keywords in step S104. The advantage of formula (6) is that it simultaneously considers the word units in the keyword intersection of the two clusters and the cluster units: on the word units it applies the influence of each word on the connection degree in the corresponding direction together with the normalization constraint, while the mutual influence between the clusters is also taken into account.
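Under a reconstruction of formulas (2) to (6), the connection-matrix computation can be sketched as follows (an illustration only; the per-cluster word frequencies `tf` and criticality weights `crit` are assumed precomputed, and their names are hypothetical):

```python
import numpy as np

def connection_matrix(keywords, tf, crit):
    """Build the symmetric K x K connection matrix C.

    keywords[s] : set of keywords of cluster s           (W_S)
    tf[s][w]    : normalized frequency of word w in s    (formula (3))
    crit[s][w]  : criticality of word w in cluster s
    """
    K = len(keywords)
    C = np.zeros((K, K))
    for s in range(K):
        for t in range(s + 1, K):
            inter = keywords[s] & keywords[t]        # formula (2)
            L_I = len(inter)
            if L_I == 0:
                continue
            # Superimpose both directions, then accumulate (formula (6)).
            C[s, t] = C[t, s] = sum(
                tf[s][w] * crit[t][w] / L_I          # direction S -> T
                + tf[t][w] * crit[s][w] / L_I        # direction T -> S
                for w in inter
            )
    return C
```

Clusters with no shared keywords get connection degree 0, so only clusters whose keyword sets overlap can later be merged.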
Step S106, normalizing the connection matrix to obtain a first target connection matrix, and decomposing the first target connection matrix according to its average value to obtain a second target connection matrix.
In some optional embodiments of the present application, the connection matrix is normalized into the interval (0, 1) using a symmetric softmax to facilitate subsequent calculations. The softmax is computed as follows:

    C_TS: (K, K) → (1, K·K)    (7)
    m = C_TS - max(C_TS)    (8)
    m_exp = e^m    (9)
    m_sum = Σ m_exp    (10)
    m_softmax = (m_exp / m_sum): (1, K·K) → (K, K)    (11)

Formula (7) first converts the symmetric K×K matrix C_TS(K, K) into a 1×(K·K) matrix C_TS(1, K·K) to facilitate subsequent calculation: if softmax were applied to the original matrix row by row, the symmetry of the matrix could not be guaranteed, and conversely, guaranteeing the symmetry would break the softmax, so the shape of the matrix must be converted first.

Formula (8) subtracts the maximum value of the row from each row of the transformed matrix, in order to prevent overflow during the calculation; operations such as matrix broadcasting and matrix reshaping are used here.

Formula (9) raises e to the power of each value in the matrix, formula (10) sums each row of the matrix, and formula (11) divides each original element by its row sum and restores the matrix to its original shape.
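Formulas (7) to (11) amount to flattening the matrix, applying one numerically stable softmax over all K·K entries, and restoring the shape; a NumPy sketch:

```python
import numpy as np

def symmetric_softmax(C):
    """Normalize a symmetric K x K matrix into (0, 1), formulas (7)-(11)."""
    K = C.shape[0]
    flat = C.reshape(1, K * K)   # (7) flatten so symmetry survives the softmax
    flat = flat - flat.max()     # (8) subtract the row maximum to avoid overflow
    exp = np.exp(flat)           # (9) raise e to the power of each element
    out = exp / exp.sum()        # (10)-(11) divide by the row sum
    return out.reshape(K, K)     # (11) restore the original K x K shape
```

Because the softmax is taken over the whole flattened matrix rather than row by row, equal entries C[s, t] and C[t, s] map to equal outputs, preserving symmetry.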
In some optional embodiments of the present application, the normalized connection matrix is decomposed according to a threshold; the threshold used in this embodiment is the average value of the matrix. The decomposition proceeds according to the following formula:

    L_ST = 1 if m_softmax[S][T] > mean(m_softmax), else 0    (12)

(The binarized matrix produced by this conversion for the worked example is shown in the original figure.)
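The mean-threshold decomposition of formula (12) reduces to a single comparison in NumPy (a sketch):

```python
import numpy as np

def binarize(M):
    """Formula (12): 1 where an element exceeds the matrix mean, else 0."""
    return (M > M.mean()).astype(int)
```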
and step S108, merging the second target connection matrixes to generate a target clustering result.
As an alternative embodiment of the present application, following step S106, the second target connection matrix L_ST is a K×K matrix of 0s and 1s. (The worked-example matrix is shown in the original figure.)
as another alternative embodiment of the present application, the merging algorithm pseudo-code is as follows:
input is a normalized symmetric matrix L ST =[K,K],L ST And (3) representing a symmetrical matrix after the decomposition of the normalized matrix, wherein the size of the matrix is the shape of K x K, and K represents the number of clustering centers set by a front K-means clustering algorithm.
Output-clustering center array after initial merging, each element represents a set.
01:for i←1to Input.shape do
02:{row_i.set()and row_i.add(i)}
03:for j←1to Input.shape do
05:if Input[i][j]=1do
06:{row_i.add(j)}
07:end if
08:if len(Output)=0do
09:{Output.append(row_i))
10:else
11:for k←1to Output
12:if row_i.issubset(k)do
13:{break};
14:elif k.issubset(row_i)do
15:{k=value}
16:else
17:if index=len(Output)-1do
18:{Output.append(row_i)}
19:end if
20:end if
21:end for
22:end for
23:end for
24:return Output
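The pseudo-code can be expressed as a self-contained Python function (a sketch; the 6-by-6 example matrix is hypothetical, constructed only to reproduce the merge result described in the text):

```python
def merge_clusters(L_ST):
    """Merge cluster indices that are connected in the binarized matrix L_ST."""
    K = len(L_ST)
    output = []
    for i in range(K):
        # Collect cluster i and every cluster connected to it.
        row_i = {i} | {j for j in range(K) if L_ST[i][j] == 1}
        if not output:
            output.append(row_i)
            continue
        for idx, group in enumerate(output):
            if row_i.issubset(group):
                break                    # already covered by an existing group
            if group.issubset(row_i):
                output[idx] = row_i      # replace the smaller group
                break
            if idx == len(output) - 1:
                output.append(row_i)     # unrelated to every group: add it
                break
    return output

# Hypothetical matrix: clusters 1, 5 and 6 connected, clusters 3 and 4
# connected, cluster 2 isolated (1-indexed, as in the worked example).
L = [[1, 0, 0, 0, 1, 1],
     [0, 1, 0, 0, 0, 0],
     [0, 0, 1, 1, 0, 0],
     [0, 0, 1, 1, 0, 0],
     [1, 0, 0, 0, 1, 1],
     [1, 0, 0, 0, 1, 1]]
groups = merge_clusters(L)  # [{0, 4, 5}, {1}, {2, 3}] in 0-indexed form
```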
From the second target connection matrix L_ST, it can be seen that the first, fifth, and sixth clusters can be merged into a new cluster; the second cluster remains a separate category; and the third and fourth clusters can be merged into another new cluster. The final clustering effect graph is shown in fig. 5.
Through the above steps, the purpose of optimizing the K-means clustering method is achieved by constructing the connection matrix and normalizing it, thereby achieving the technical effect of clustering texts more accurately.
According to an optional embodiment of the present application, the connection matrix is determined according to the clusters and the keywords as follows: determining a first connection matrix according to the initialized connection matrix, the first target word frequency, the criticality of the target words in the first cluster, and the length of the target intersection, wherein the criticality of the target words in the first cluster is determined according to the first algorithm; determining a second connection matrix according to the initialized connection matrix, the second target word frequency, the criticality of the target words in the second cluster, and the length of the target intersection, wherein the criticality of the target words in the second cluster is determined according to the second algorithm; and determining the connection matrix according to the first connection matrix, the second connection matrix, and the length of the target intersection.
According to another alternative embodiment of the present application, the initialized connection matrix is a K×K connection matrix whose values are all 0, where K represents the number of initialized class clusters of the K-means algorithm. The first target word frequency is the ratio of the number of occurrences of the target word in the texts to be clustered in the first cluster to the number of occurrences of all words in the texts to be clustered. The criticality of the target word in the first cluster may be determined according to the K-means algorithm. The target intersection is the intersection of the target words in the S-th cluster and the T-th cluster.
In some optional embodiments of the present application, before the first connection matrix is determined, the following may be performed: generating the initialized connection matrix, wherein the initialized connection matrix is a K×K connection matrix whose values are all 0 and K is the number of clusters; acquiring the keywords of a first cluster and of a second cluster among the plurality of clusters, and determining the target intersection of the keywords of the first cluster and the keywords of the second cluster; obtaining the first target word frequency of the target word from the number of its occurrences in the texts to be clustered in the first cluster and the number of occurrences of all words in the texts to be clustered, wherein the first target word frequency is the word frequency of the target word in the direction in which the first cluster points to the second cluster; and obtaining the second target word frequency of the target word from the number of its occurrences in the texts to be clustered in the second cluster and the number of occurrences of all words in the texts to be clustered, wherein the second target word frequency is the word frequency of the target word in the direction in which the second cluster points to the first cluster.
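These two quantities, the keyword intersection of two clusters and the word frequency of a target word within one cluster, can be sketched in Python. The texts are assumed here to be whitespace-tokenized (the patent does not fix a tokenizer), and all function names are illustrative:

```python
def target_intersection(keywords_s, keywords_t):
    """Intersection of the keyword sets of cluster S and cluster T."""
    return set(keywords_s) & set(keywords_t)


def target_word_frequency(word, cluster_texts):
    """Occurrences of `word` in a cluster's texts divided by the
    total number of words in those texts."""
    words = [w for text in cluster_texts for w in text.split()]
    return words.count(word) / len(words) if words else 0.0
```

The length of the target intersection referred to in the text is then simply `len(target_intersection(...))`.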
In some optional embodiments of the present application, K is a number of clusters randomly generated within a target range according to the first algorithm.
According to an optional embodiment of the present application, normalizing the connection matrix to obtain the first target connection matrix includes the following steps: converting the connection matrix into a first connection matrix, wherein the first connection matrix is an asymmetric matrix; performing first processing on the first connection matrix to obtain a second connection matrix, wherein the first processing includes at least one of matrix broadcasting and matrix shuffling and is used to remove the first target number from each row of the first connection matrix; performing second processing on the second connection matrix to obtain a third connection matrix, wherein the second processing at least includes computing the exponential power of the second target number for each element of the second connection matrix; performing third processing on the third connection matrix to obtain a fourth connection matrix, wherein the third processing at least includes summing the elements of each row of the third connection matrix; and performing fourth processing on the fourth connection matrix to obtain the first target connection matrix, wherein the fourth processing at least includes determining the dimension of the fourth connection matrix from the dimension of the connection matrix.
According to another alternative embodiment of the present application, the connection matrix is normalized into the 0-1 space to facilitate subsequent calculations, using a symmetric softmax. The word softmax can literally be split into two parts, soft and max. Max, as the name implies, means taking the maximum value; the core of softmax is soft, as opposed to hard. In many scenarios we need to find the element with the largest value among all elements of an array, which is essentially the hardmax given by the following formula:
hardmax(x) = max(x_1, x_2, ..., x_n)
The defining feature of hardmax is that only the single largest value is selected, i.e., it is all-or-nothing. In practice, however, this approach is often unjustified. In text classification, for example, an article contains various topic information to a greater or lesser extent, and we would rather obtain a probability value (confidence) for each possible category of the article, which can simply be understood as the confidence that it belongs to the corresponding category. Hence the concept of soft: softmax no longer uniquely determines a single maximum value, but assigns each output class a probability value representing the likelihood of belonging to that class. The expression of softmax is as follows:
softmax(z_i) = e^{z_i} / Σ_{c=1}^{C} e^{z_c}
where z_i is the output value of the i-th node and C is the number of output nodes, i.e., the number of classes. The softmax function converts the multi-class output values into a probability distribution over the range [0,1] that sums to 1. softmax can thereby normalize the output space into the 0-1 space, avoiding the computational difficulties and data inefficiencies caused by large gaps in the magnitude of the data.
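The first through fourth processing steps described earlier (remove each row's maximum via broadcasting, exponentiate, sum each row, and divide back to the original shape) can be read as a numerically stable row-wise softmax. A NumPy sketch, with an illustrative function name:

```python
import numpy as np

def rowwise_softmax(m):
    """Normalize each row of m into a probability distribution in [0, 1].

    Subtracting the row maximum first (via matrix broadcasting) leaves
    the result unchanged but keeps the exponentials from overflowing.
    """
    m = np.asarray(m, dtype=float)
    shifted = m - m.max(axis=1, keepdims=True)  # remove each row's maximum
    e = np.exp(shifted)                         # exponentiate every element
    return e / e.sum(axis=1, keepdims=True)     # divide by each row's sum
```

Each row of the returned matrix sums to 1, giving the 0-1 normalized connection weights described above.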
In some optional embodiments of the present application, before merging the second target connection matrices, the method further comprises: and removing abnormal cluster clusters in the second target connection matrix, wherein the abnormal cluster clusters are subsets of a set formed by a plurality of cluster clusters.
As an alternative embodiment of the present application, an abnormal cluster is a cluster that forms a separate category but is contained in the set formed by the other clusters. For example, suppose the original classification has 10 categories (clusters) 1; 2; 3; 4; 5; 6; 7; 8; 9; 10, and after reclassification the clusters are {1,2,3}; {2}; {4,5,6}; {7,8,9}; {10}. The cluster {2} is then an abnormal cluster, since 2 already belongs to {1,2,3}; to ensure clustering accuracy, the abnormal cluster needs to be removed.
According to an alternative embodiment of the present application, the pseudo code for removing abnormal clusters (cluster islands) is as follows:
Input: the cluster center array after initial merging, where each element represents a set.
Output: the cluster center array after data islands have been removed, where each element represents a set.
01:define big_set,Output
02:for i←1 to Input
03:if len(i)>1 do
04:big_set=set.union(i,big_set)
05:end if
06:end for
07:for j←1 to Input
08:if len(j)=1 do
09:if j not∈big_set do
10:Output.append(j)
11:end if
12:else
13:Output.append(j)
14:end if
15:end for
16:return Output
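The island-removal pseudocode above can be sketched in Python as follows; `centers` is assumed to be a list of sets, and the names are illustrative:

```python
def remove_cluster_islands(centers):
    """Drop singleton cluster centers whose element already appears in
    some merged (multi-element) center set."""
    # Union of all merged (multi-element) cluster centers.
    big_set = set()
    for c in centers:
        if len(c) > 1:
            big_set |= c
    output = []
    for c in centers:
        if len(c) == 1 and c <= big_set:
            continue  # abnormal cluster (island): already covered
        output.append(c)
    return output
```

With the example from the text, the singleton `{2}` is removed because 2 is already contained in `{1, 2, 3}`, while `{10}` is kept because it is not covered by any merged set.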
In some optional embodiments of the present application, decomposing the first target connection matrix according to the average value of the first target connection matrix to obtain a second target connection matrix includes: traversing all elements in the first target connection matrix, wherein when the value of the element is larger than the average value, the value of the element is a first number, and when the value of the element is not larger than the average value, the value of the element is a second number; and determining a second target connection matrix according to the first number and the second number.
As another optional embodiment of the present application, all elements of the first target connection matrix are traversed; when the value of an element exceeds the average value of the first target connection matrix, the element is set to 1, and when it does not exceed the average value, the element is set to 0. The decomposition process is as follows:
[Decomposition example: the matrix C_ST is compared element-wise with its average value and mapped to a 0/1 matrix.]
Here, the element in the first row and second column of matrix C_ST has the value 0.22, which is smaller than the average value of matrix C_ST, so the element in the first row and second column of the decomposed matrix is 0.
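The mean-threshold decomposition can be sketched in NumPy as follows (the function name is illustrative):

```python
import numpy as np

def binarize_by_mean(m):
    """Set each element to 1 when it exceeds the matrix mean, else 0."""
    m = np.asarray(m, dtype=float)
    return (m > m.mean()).astype(int)
```

For a matrix whose mean is 0.43, the element 0.22 maps to 0 and the element 0.9 maps to 1, mirroring the example in the text.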
According to an optional embodiment of the present application, in order to evaluate the effectiveness of the present application, two clustering evaluation methods are introduced below. The first directly visualizes the features of the clustered data and the topics of the cluster centers, so that the distribution space of the whole data set and the clustering effect can be observed at the same time; this evaluation is shown in fig. 6 and fig. 7. The second introduces the contour coefficient as an index to compare the present application with the related art, as shown in figs. 8, 9, 10 and 11.
Fig. 6 shows the result of clustering with the conventional K-means algorithm, using 100 initialized cluster centers. Many categories are close to each other and should be merged. However, the K-means algorithm requires the number of cluster centers to be fixed at every initialization, and since clustering is unsupervised and the number of cluster centers in the data to be processed is unknown, a randomly initialized K value cannot accurately match the true distribution of the data; as a result, data that belong together are split across multiple clusters.
Fig. 7 is a graph of the clustering effect of the present application. As shown in fig. 7, the present application effectively merges the redundant categories: the redundant K-means clusters are merged while the originally well-distributed clusters are left unaffected, achieving a linearly separable clustering effect.
According to another alternative embodiment of the present application, the core idea of the contour coefficient (silhouette coefficient) is to judge the relative size of the inter-class distance and the intra-class distance: if the inter-class distance is greater than the intra-class distance, the clustering result is good; otherwise it is not. This idea is similar to Fisher linear discrimination, which also compares inter-class and intra-class distances. The difference is that the contour coefficient measures the quality of a clustering result, whereas the inter-class distance comparison in Fisher linear discrimination serves to reduce the original data to a one-dimensional linear space, under the premise that the categories remain well separated in the reduced space.
1. Calculate the average distance a(i) from sample i to the other samples in the same cluster. The smaller a(i) is, the more sample i belongs to its cluster. a(i) is called the intra-cluster dissimilarity of sample i; the mean of a(i) over all samples in cluster C is called the cluster dissimilarity of cluster C.
2. Calculate the average distance b_ij from sample i to all samples of some other cluster C_j, called the dissimilarity between sample i and cluster C_j. The inter-cluster dissimilarity of sample i is defined as:
b(i) = min{b_i1, b_i2, ..., b_ik}
The larger b(i) is, the less sample i belongs to any other cluster.
3. From the intra-cluster dissimilarity a(i) and the inter-cluster dissimilarity b(i) of sample i, define the contour coefficient of sample i:
s(i) = (b(i) - a(i)) / max{a(i), b(i)}
4. Judgment:
if s(i) is close to 1, sample i is reasonably clustered;
if s(i) is close to -1, sample i would be better assigned to another cluster;
if s(i) is close to 0, sample i lies on the boundary between two clusters.
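Steps 1-4 above can be sketched in Python with Euclidean distances. This is a simplified, illustrative implementation for a single sample (for production use, scikit-learn provides `silhouette_score` and `silhouette_samples`):

```python
import numpy as np

def silhouette(samples, labels, i):
    """Contour (silhouette) coefficient of sample i:
    s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean
    distance to the other samples of i's own cluster and b(i) is the
    smallest mean distance to the samples of any other cluster."""
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(samples - samples[i], axis=1)
    # a(i): mean distance to the other members of i's own cluster.
    own = (labels == labels[i]) & (np.arange(len(samples)) != i)
    a = dists[own].mean()
    # b(i): minimum over other clusters of the mean distance to them.
    b = min(dists[labels == c].mean()
            for c in set(labels.tolist()) if c != labels[i])
    return (b - a) / max(a, b)
```

For two tight, well-separated clusters the coefficient is close to 1, matching the judgment criteria above.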
In some optional embodiments of the present application, the contour coefficient is a standard evaluation method for clustering models and describes the clustering effect from the following two directions:
1. A contour coefficient close to -1 indicates a poor clustering result; close to +1, that the instances within a cluster are compact; and close to 0, that clusters overlap.
2. The larger the contour coefficient, the more compact the instances within each cluster and the larger the distance between clusters, which is the standard notion of good clustering.
The contour coefficient comparison graph of the related art is shown in fig. 8. From the definition of the contour coefficient and the criteria for judging the clustering effect, it can be seen that with the initial model's k value in fig. 8 set as large as 100, the proportion of contour coefficients below 0 is large and the distribution is scattered. The clustering effect is not ideal: the final clustering result does not match the real distribution space of the data, and several clusters could in fact be merged into one.
Fig. 9 shows the model after the decomposition and merging of the keyword-normalized connection matrix: much of the clustered data with coefficients below 0 is merged into new clusters, and the clustering effect is significantly improved. The 11th cluster in fig. 9 effectively merges many scattered samples with coefficients below zero that originally belonged to other clusters.
Fig. 10 shows the clustering result in the related art when the number of cluster centers is set to 20. In this case many samples do not actually belong to their assigned category, so the contour coefficients of many samples are below 0.
Fig. 11 is the effect diagram after the optimization and correction of the present application (the Key-means model). After Key-means optimization, the original clusters are adjusted to 8; apart from some samples in cluster 0 whose coefficients are below 0, all other samples are correctly assigned to the right clusters. As fig. 11 shows, the Key-means model effectively remedies the deficiency of the related art (the K-means model): it resolves the problem of the fixed cluster-center count that K-means cannot determine correctly, and the data distribution of the clusters after Key-means optimization conforms better to the real data distribution space.
Fig. 12 is a block diagram of a text clustering device according to an embodiment of the present application, and as shown in fig. 12, the device includes:
the first determining module 1202 is configured to cluster the texts to be clustered according to a first algorithm to obtain a plurality of clustering clusters, and determine keywords of the clustering clusters according to a second algorithm;
a second determining module 1204, configured to determine a connection matrix according to the cluster and the keyword, where the connection matrix is a symmetric matrix;
the first generation module 1206 is configured to perform normalization processing on the connection matrix to obtain a first target connection matrix, and decompose the first target connection matrix according to an average value of the first target connection matrix to obtain a second target connection matrix;
and a second generating module 1208, configured to merge the second target connection matrices to generate a target clustering result.
It should be noted that, reference may be made to the description related to the embodiment shown in fig. 1 for a preferred implementation of the embodiment shown in fig. 12, and details are not repeated here.
The embodiment of the application also provides a nonvolatile storage medium, which comprises a stored program, wherein when the program runs, the device where the storage medium is located is controlled to execute the text clustering method.
The nonvolatile storage medium executes a program for: clustering texts to be clustered according to a first algorithm to obtain a plurality of clustering clusters, and determining keywords of the clustering clusters according to a second algorithm; determining a connection matrix according to the clustering cluster and the key words, wherein the connection matrix is a symmetric matrix; normalizing the connection matrix to obtain a first target connection matrix, and decomposing the first target connection matrix according to the average value of the first target connection matrix to obtain a second target connection matrix; and merging the second target connection matrixes to generate a target clustering result.
An embodiment of the present application further provides an electronic device, including: the device comprises a memory and a processor, wherein the processor is used for running a program stored in the memory, and the program is used for executing the text clustering method.
The processor is used for running a program for executing the following functions: clustering texts to be clustered according to a first algorithm to obtain a plurality of clustering clusters, and determining keywords of the clustering clusters according to a second algorithm; determining a connection matrix according to the clustering cluster and the key words, wherein the connection matrix is a symmetric matrix; performing normalization processing on the connection matrix to obtain a first target connection matrix, and decomposing the first target connection matrix according to the average value of the first target connection matrix to obtain a second target connection matrix; and merging the second target connection matrixes to generate a target clustering result.
The above-mentioned serial numbers of the embodiments of the present application are merely for description, and do not represent the advantages and disadvantages of the embodiments.
In the embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described apparatus embodiments are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or may not be executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, in essence or part of the technical solutions contributing to the related art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk, and various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A text clustering method, comprising:
clustering texts to be clustered according to a first algorithm to obtain a plurality of clustering clusters, and determining keywords of the clustering clusters according to a second algorithm;
determining a connection matrix according to the clustering cluster and the key words, wherein the connection matrix is a symmetric matrix;
performing normalization processing on the connection matrix to obtain a first target connection matrix, and decomposing the first target connection matrix according to the average value of the first target connection matrix to obtain a second target connection matrix;
and merging the second target connection matrixes to generate a target clustering result.
2. The method of claim 1, wherein determining a connection matrix based on the cluster and the keyword comprises:
determining a first connection matrix according to an initialized connection matrix, a first target word frequency, the key degree of a target word in a first cluster and the length of a target intersection, wherein the key degree of the target word in the first cluster is determined according to the first algorithm;
determining a second connection matrix according to the initialized connection matrix, a second target word frequency, the key degree of the target words in a second cluster and the length of the target intersection, wherein the key degree of the target words in the second cluster is determined according to the second algorithm;
and determining the connection matrix according to the first connection matrix, the second connection matrix and the length of the target intersection.
3. The method of text clustering according to claim 2, wherein prior to determining the first connection matrix, the method further comprises:
generating the initialized connection matrix, wherein the initialized connection matrix is a K×K connection matrix with a value of 0, and K is the number of the cluster clusters;
acquiring a keyword of a first cluster in the plurality of clusters and a keyword of a second cluster in the plurality of clusters, and determining the target intersection of the keyword of the first cluster and the keyword of the second cluster;
obtaining the first target word frequency of the target vocabulary according to the occurrence frequency of the target vocabulary in the text to be clustered in the first clustering cluster and the occurrence frequency of all vocabularies in the text to be clustered, wherein the first target word frequency is the word frequency of the target vocabulary in the direction in which the first clustering cluster points to the second clustering cluster;
and obtaining a second target word frequency of the target vocabulary according to the occurrence frequency of the target vocabulary in the text to be clustered in the second clustering cluster and the occurrence frequency of all vocabularies in the text to be clustered, wherein the second target word frequency is the word frequency of the target vocabulary in the direction pointing to the first clustering cluster in the second clustering cluster.
4. The text clustering method according to claim 3, characterized in that K is the number of the cluster clusters in the target range randomly generated according to the first algorithm.
5. The text clustering method according to claim 1, wherein the normalizing the connection matrix to obtain a first target connection matrix comprises:
converting the connection matrix into a first connection matrix, wherein the first connection matrix is an asymmetric matrix;
performing first processing on the first connection matrix to obtain a second connection matrix, wherein the first processing includes at least one of: matrix broadcasting and matrix shuffling, the first process being used to remove the first target number in each row of the first connection matrix;
and performing second processing on the second connection matrix to obtain a third connection matrix, wherein the second processing at least comprises: computing an exponential power of a second target number for each element in the second connection matrix;
performing third processing on the third connection matrix to obtain a fourth connection matrix, wherein the third processing at least comprises: summing elements in each row in the third connection matrix;
performing fourth processing on the fourth connection matrix to obtain the first target connection matrix, wherein the fourth processing at least includes: and determining the dimension of the fourth connection matrix according to the dimension of the connection matrix.
6. The method of text clustering according to claim 1, wherein before merging the second target connection matrices, the method further comprises:
and removing abnormal cluster in the second target connection matrix, wherein the abnormal cluster is a subset of a set formed by a plurality of cluster clusters.
7. The text clustering method according to claim 1, wherein decomposing the first target connection matrix according to the average value of the first target connection matrix to obtain a second target connection matrix comprises:
traversing all elements in the first target connection matrix, wherein when the value of the element is larger than the average value, the value of the element is a first number, and when the value of the element is not larger than the average value, the value of the element is a second number;
and determining the second target connection matrix according to the first number and the second number.
8. A text clustering apparatus, comprising:
the first determining module is used for clustering texts to be clustered according to a first algorithm to obtain a plurality of clustering clusters and determining keywords of the clustering clusters according to a second algorithm;
the second determining module is used for determining a connection matrix according to the clustering cluster and the keyword, wherein the connection matrix is a symmetric matrix;
the first generation module is used for carrying out normalization processing on the connection matrix to obtain a first target connection matrix, and decomposing the first target connection matrix according to the average value of the first target connection matrix to obtain a second target connection matrix;
and the second generation module is used for merging the second target connection matrixes to generate a target clustering result.
9. A non-volatile storage medium, comprising a stored program, wherein the program, when executed, controls a device in which the non-volatile storage medium is located to perform the text clustering method according to any one of claims 1 to 7.
10. An electronic device, comprising: a memory and a processor for executing a program stored in the memory, wherein the program when executed performs the method of text clustering of any one of claims 1 to 7.
CN202211227973.XA 2022-10-09 2022-10-09 Text clustering method and device, nonvolatile storage medium and electronic equipment Pending CN115455943A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211227973.XA CN115455943A (en) 2022-10-09 2022-10-09 Text clustering method and device, nonvolatile storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211227973.XA CN115455943A (en) 2022-10-09 2022-10-09 Text clustering method and device, nonvolatile storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115455943A true CN115455943A (en) 2022-12-09

Family

ID=84308169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211227973.XA Pending CN115455943A (en) 2022-10-09 2022-10-09 Text clustering method and device, nonvolatile storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115455943A (en)

Similar Documents

Publication Publication Date Title
RU2628431C1 (en) Selection of text classifier parameter based on semantic characteristics
US10860654B2 (en) System and method for generating an answer based on clustering and sentence similarity
RU2628436C1 (en) Classification of texts on natural language based on semantic signs
CN107229668B (en) Text extraction method based on keyword matching
US20070112754A1 (en) Method and apparatus for identifying data of interest in a database
Kirelli et al. Sentiment analysis of shared tweets on global warming on twitter with data mining methods: a case study on Turkish language
CN110929525B (en) Network loan risk behavior analysis and detection method, device, equipment and storage medium
CN112949713A (en) Text emotion classification method based on ensemble learning of complex network
CN111737560A (en) Content search method, field prediction model training method, device and storage medium
US20070112747A1 (en) Method and apparatus for identifying data of interest in a database
CN112417152A (en) Topic detection method and device for case-related public sentiment
CN114676346A (en) News event processing method and device, computer equipment and storage medium
CN110765266A (en) Method and system for merging similar dispute focuses of referee documents
Cozzolino et al. Document clustering
CN113807073B (en) Text content anomaly detection method, device and storage medium
CN110457707B (en) Method and device for extracting real word keywords, electronic equipment and readable storage medium
CN115186650B (en) Data detection method and related device
CN113792131B (en) Keyword extraction method and device, electronic equipment and storage medium
CN112270189B (en) Question type analysis node generation method, system and storage medium
CN115455943A (en) Text clustering method and device, nonvolatile storage medium and electronic equipment
US11822609B2 (en) Prediction of future prominence attributes in data set
CN111625579B (en) Information processing method, device and system
CN106776530B (en) Method and device for extracting subject term
CN115688771B (en) Document content comparison performance improving method and system
CN117391071B (en) News topic data mining method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination