CN110647919A - Text clustering method and system based on K-means clustering and capsule network - Google Patents

Text clustering method and system based on K-means clustering and capsule network Download PDF

Info

Publication number
CN110647919A
CN110647919A · Application CN201910794559.9A
Authority
CN
China
Prior art keywords
document
text
clustering
capsule
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910794559.9A
Other languages
Chinese (zh)
Inventor
张伟
汤旭东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN201910794559.9A priority Critical patent/CN110647919A/en
Publication of CN110647919A publication Critical patent/CN110647919A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The invention provides a text clustering method based on K-means clustering and a capsule network, which comprises the following steps: acquire text data and preprocess it, then train word2vec on the data set to obtain word representations; average the word vectors of all words in a document to obtain the vector representation of that document; generate pseudo-labels by applying K-means clustering to these document vectors; finally, take the word sequences, word vectors and pseudo-labels as training data, train a capsule-network-based classifier while keeping the training loss within a controlled range, and cluster with the trained classifier. By combining K-means clustering with a capsule network, the method converts the unsupervised text clustering problem into a supervised classification problem and further improves the clustering quality over traditional clustering methods. The invention also provides a text clustering system based on K-means clustering and a capsule network.

Description

Text clustering method and system based on K-means clustering and capsule network
Technical Field
The invention relates to the field of natural language processing, in particular to a text clustering method and a text clustering system for converting an unsupervised task into a supervised task by utilizing K-means clustering and a capsule network.
Background
In recent years, with the rapid development of internet technology, massive network data is continuously generated, and in information storage, text is the most widely used form, and massive information is stored in text form. Text mining techniques investigate how to mine interesting, valuable information from various forms of text data. One branch of text mining is text clustering, and the method is widely applied to the directions of pattern recognition, topic recognition, recommendation systems and the like.
Text clustering applies a clustering algorithm to texts and is an important component of text mining technology. Applied to a search engine, it allows users to find the information they want quickly and effectively; it can extract the day's hot topics from news gathered from various channels, or, combined with a user's history, recommend content of interest. Text clustering is an unsupervised machine learning method; unlike supervised methods, it offers greater flexibility and automatic processing capability.
Disclosure of Invention
The invention applies a capsule network to text clustering for the first time, converting the unsupervised problem into a supervised one. Pseudo-labels are generated with K-means clustering, and a capsule network is then trained with a controlled training loss, so that the final clustering outperforms the clustering method that generated the pseudo-labels.
During feature learning, the capsule network exploits latent characteristics of the data to reassign the documents lying near the fuzzy boundaries of the K-means pseudo-labels, achieving a better clustering result.
The text clustering method provided by the invention comprises the following steps:
firstly, selecting a text data set, and preprocessing text data in the text data set;
secondly, converting the text sequence into vector characteristic representation by using word vectors;
thirdly, averaging word vectors of each document to serve as vector feature representation of the document, and carrying out K-means clustering on the representation of the document to generate a pseudo label of the document;
fourthly, taking the word vectors of the documents and the pseudo labels generated in the third step as training data, training a classifier based on a capsule network; the classifier is deliberately not trained to convergence, so that a certain training loss is retained;
and fifthly, clustering the text data set by using the trained capsule network classifier.
In the present invention, text data includes, but is not limited to, data from network platforms such as Twitter, microblogs and news sites.
In the first step, preprocessing the text data means the following: because the text contains words and characters that carry no information, stop words, special symbols and links are removed from it.
The stop words are words or phrases that occur very frequently in English but can be removed without affecting overall understanding; they are usually articles, prepositions, adverbs or conjunctions.
The special symbols include ordinary commas and periods, mathematical symbols, emoticons and the like.
The links are website links describing objects; they are removed during data preprocessing.
In the second step, the word vector is used to convert the text sequence into vector feature representation, specifically:
training the preprocessed text data with the word vector model word2vec to learn a vector representation of each word in the whole data set; the dimension of the word vector is D_e.
In the third step, a vector representation of the document is generated as follows:
for each document d_i of the data set, the word vectors of its N_i words form an N_i × D_e matrix according to the acquired word vectors; average pooling over the first dimension yields a D_e-dimensional vector representation of the document;
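The document-vector averaging described above can be sketched in Python with NumPy (a toy example; the function name and array values are illustrative, not from the patent):

```python
import numpy as np

def document_vector(word_vectors: np.ndarray) -> np.ndarray:
    """Average-pool an (N_i, D_e) matrix of word vectors over the
    first dimension, yielding a D_e-dimensional document vector."""
    return word_vectors.mean(axis=0)

# A toy document with N_i = 3 words and D_e = 4 dimensions.
doc = np.array([[1.0, 0.0, 2.0, 0.0],
                [3.0, 0.0, 0.0, 0.0],
                [2.0, 0.0, 1.0, 0.0]])
vec = document_vector(doc)  # -> array([2., 0., 1., 0.])
```

In practice the word vectors would come from the word2vec model trained in the second step; the averaging itself is a single mean over the word axis.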
in the third step, the pseudo tag is generated as follows:
Suppose the data set contains M documents in total; vectorizing them yields an M × D_e matrix. K-means clustering is performed on these vectors, where the value of K can be chosen according to actual needs. The K-means cluster assignment of each document is recorded as the pseudo-label of that document.
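The pseudo-label step can be sketched as follows. The patent's embodiment uses scikit-learn's KMeans; the pure-NumPy Lloyd iteration below, with a farthest-first initialization, is only a stand-in sketch of the same idea:

```python
import numpy as np

def kmeans_pseudo_labels(X: np.ndarray, k: int, n_iter: int = 20) -> np.ndarray:
    """Cluster the (M, D_e) document matrix X into k groups and return
    one pseudo-label per document (a toy Lloyd's algorithm)."""
    # Deterministic farthest-first init: start from X[0], then repeatedly
    # add the point farthest from all centers chosen so far.
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.stack(centers)
    for _ in range(n_iter):
        # Assign each document to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its assigned documents.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Two well-separated blobs of document vectors (M = 6, D_e = 2).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = kmeans_pseudo_labels(X, k=2)
```

The resulting label array plays the role of the pseudo-labels fed to the capsule network classifier in the fourth step.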
In the fourth step, the word vector representation of a document is constructed as follows: a maximum document length N is specified for the data set. For a document d_i with N_i words, if N_i ≥ N the document is truncated to its first N words; otherwise the document is padded with N − N_i copies of a special character ε representing a blank. Finally, each word in the document is replaced in sequence by the corresponding word vector trained in the second step, and ε is replaced by an all-zero vector. Each document thus corresponds to an N × D_e word vector matrix.
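The truncation-and-padding scheme above can be sketched like this (a minimal sketch; the toy embedding dictionary stands in for the trained word2vec vectors):

```python
import numpy as np

def to_matrix(words, embed, N):
    """Build the N x D_e matrix for one document: truncate to the first
    N words, or pad with the all-zero vector that replaces the blank
    character. `embed` maps word -> D_e vector."""
    D_e = len(next(iter(embed.values())))
    rows = [embed[w] for w in words[:N]]          # truncate if too long
    rows += [[0.0] * D_e] * (N - len(rows))       # zero-pad if too short
    return np.array(rows)

embed = {"cat": [1.0, 2.0], "sat": [3.0, 4.0]}   # toy D_e = 2 embeddings
m = to_matrix(["cat", "sat"], embed, N=4)        # padded to 4 rows
```

A document longer than N would simply be cut to its first N rows, matching the N_i ≥ N case in the text.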
In the fourth step, the classifier based on the capsule network uses convolution in the shallow layers; the deep layers use a dynamic routing mechanism, and the norm of each capsule output by the last layer represents the probability of the corresponding category. The classifier comprises the following:
(1) the input is an N × D_e matrix, where N is the maximum sentence length and D_e is the dimension of the word vector;
(2) n-gram convolutional layer: let W_a be a sliding window of size K_1 × D_e; convolving the input with it yields a feature map whose i-th entry is

m_i = f(W_a ∘ X_{i:i+K_1−1} + b_0)

where ∘ denotes element-wise multiplication, b_0 is the bias term, and f is the ReLU activation function. Thus, with B sliding windows of the same size, a feature matrix of size (L − K_1 + 1) × B is obtained (L being the sentence length):

M = [m_1, m_2, ..., m_B]
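The n-gram convolution above can be sketched as follows (a toy sketch; window values and input are illustrative, and `ngram_conv` is a name introduced here, not from the patent):

```python
import numpy as np

def ngram_conv(X, windows, b0=0.0):
    """N-gram convolution: each window W_a (K1 x D_e) slides over the
    N x D_e input; m_i = ReLU(sum(W_a * X[i:i+K1]) + b0). With B windows
    the result is an (N - K1 + 1) x B feature matrix."""
    K1 = windows[0].shape[0]
    N = X.shape[0]
    relu = lambda z: np.maximum(z, 0.0)
    M = np.empty((N - K1 + 1, len(windows)))
    for b, Wa in enumerate(windows):
        for i in range(N - K1 + 1):
            # Element-wise product of the window with one K1-gram, summed.
            M[i, b] = relu(np.sum(Wa * X[i:i + K1]) + b0)
    return M

X = np.arange(8.0).reshape(4, 2)               # N = 4, D_e = 2
windows = [np.ones((2, 2)), -np.ones((2, 2))]  # B = 2 windows, K1 = 2
M = ngram_conv(X, windows)                      # shape (3, 2)
```

In a real model the windows would be learned parameters and the loops replaced by a framework's conv primitive; the sketch only shows the shape arithmetic (N − K_1 + 1 positions per window).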
(3) primary capsule layer: capsules are introduced here; the i-th capsule is

p_i = g((W_b)^T M_i + b_1)

where W_b is a weight matrix of dimension B × d, d is the dimension of a capsule, M_i is the B-dimensional vector given by the i-th component of the previous layer's output, and g is the squash function. The output of this layer can then be written as

P = [p_1, p_2, ..., p_C]

i.e. (L − K_1 + 1) × C capsules of dimension d;
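The squash function g above can be sketched as follows (the weight values are toy assumptions; only the squash formula itself follows the standard capsule-network definition):

```python
import numpy as np

def squash(s, eps=1e-9):
    """Squash nonlinearity g: shrinks short vectors toward 0 and long
    vectors toward unit norm, so a capsule's norm can act as a probability."""
    n2 = np.sum(s * s, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

v = squash(np.array([3.0, 4.0]))   # input norm 5 -> output norm 25/26

# One toy primary capsule: p_i = g((W_b)^T M_i) with B = 3, d = 2.
Wb = 0.1 * np.ones((3, 2))
Mi = np.array([1.0, 2.0, 3.0])
pi = squash(Wb.T @ Mi)
```

Because the squashed norm is always below 1, the output-layer capsule norms can be read directly as class probabilities in the fifth step.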
(4) fully connected capsule layer:

û_{j|i} = W_{ij} p_i

where W_{ij} is a shared weight matrix; a dynamic routing algorithm is then used to compute the upper-layer capsules v_j.
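The dynamic routing between the prediction vectors û_{j|i} and the upper capsules v_j can be sketched as below. This follows the standard routing-by-agreement procedure of capsule networks; the toy predictions are assumptions for illustration:

```python
import numpy as np

def squash(s, eps=1e-9):
    n2 = np.sum(s * s, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iter=3):
    """Routing-by-agreement over predictions u_hat of shape
    (num_lower, num_upper, d): coupling logits b start at 0 and grow
    where a lower capsule's prediction agrees with the upper capsule."""
    n_lower, n_upper, d = u_hat.shape
    b = np.zeros((n_lower, n_upper))
    for _ in range(n_iter):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax
        s = (c[:, :, None] * u_hat).sum(axis=0)               # (n_upper, d)
        v = squash(s)
        b = b + (u_hat * v[None, :, :]).sum(axis=2)           # agreement
    return v

# 2 lower capsules: both agree on upper 0, disagree on upper 1.
u_hat = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[1.0, 0.0], [0.0, -1.0]]])
v = dynamic_routing(u_hat)
```

Agreement makes the first upper capsule's norm large while the conflicting predictions for the second cancel out, which is exactly the behavior the patent relies on to reassign ambiguous pseudo-labeled documents.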
In the fourth step, the classifier based on the capsule network adopts the following loss function during training:

L_k = T_k max(0, m^+ − ||v_k||)^2 + λ(1 − T_k) max(0, ||v_k|| − m^−)^2

where T_k = 1 if and only if the label of the text is category k (and T_k = 0 otherwise), ||v_k|| is the norm of the k-th capsule of the output layer, and m^+, m^−, λ are adjustable hyper-parameters, e.g. m^+ = 0.9, m^− = 0.1, λ = 0.5;
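The margin loss above can be written directly from the formula (a minimal sketch; the capsule norms passed in are illustrative values):

```python
import numpy as np

def margin_loss(v_norms, true_k, m_pos=0.9, m_neg=0.1, lam=0.5):
    """L_k = T_k max(0, m+ - ||v_k||)^2 + lam (1 - T_k) max(0, ||v_k|| - m-)^2,
    summed over the K output capsules; T_k = 1 only for the true class."""
    T = np.zeros_like(v_norms)
    T[true_k] = 1.0
    pos = np.maximum(0.0, m_pos - v_norms) ** 2
    neg = np.maximum(0.0, v_norms - m_neg) ** 2
    return float(np.sum(T * pos + lam * (1 - T) * neg))

# True class 0: its capsule norm already reaches m+, so only the third
# capsule (norm 0.3 > m-) contributes: 0.5 * (0.3 - 0.1)^2 = 0.02.
loss = margin_loss(np.array([0.9, 0.1, 0.3]), true_k=0)
```

During training the patent keeps this loss within a band (e.g. 0.2 ± 0.01 in the embodiment) rather than driving it to zero.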
In the fifth step, when clustering texts with the trained capsule network, the index of the output capsule with the largest norm is taken, namely:

prediction(x) = argmax_j ||v_j||.
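The prediction rule is a one-liner (toy capsule outputs shown for illustration):

```python
import numpy as np

def predict(v):
    """Cluster assignment: index of the output capsule with largest norm."""
    return int(np.argmax(np.linalg.norm(v, axis=1)))

v = np.array([[0.1, 0.1], [0.6, 0.5], [0.2, 0.0]])  # K = 3 capsules, d = 2
label = predict(v)   # capsule 1 has the largest norm
```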
based on the method, the invention also provides a text clustering system based on K-means clustering and a capsule network, which comprises the following steps:
the input representation unit is used for preprocessing the text data and serializing the text data by using the word vectors;
the pseudo label generating unit is used for clustering the preprocessed data with the K-means algorithm to obtain pseudo labels;
and the class label generating unit is used for training the classifier based on the capsule network by adopting the serialized text data and the pseudo label, controlling the training loss and acquiring the network output.
Compared with the prior art, the beneficial effects of the invention include: by combining K-means clustering with a capsule network, the unsupervised text clustering problem is converted into a supervised classification problem, and the clustering quality is further improved over traditional clustering methods.
Drawings
FIG. 1 is a flow chart of the text clustering method according to the present invention.
FIG. 2 is a flow chart of data processing in an example of the present invention.
FIG. 3 is a diagram of the model architecture of the capsule network classifier in an example of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following specific examples and the accompanying drawings. Except for the contents specifically mentioned below, the procedures, conditions and experimental methods for carrying out the invention are common general knowledge in the art, and the invention is not particularly limited thereto. It should be noted that persons skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The text clustering method provided by the invention, as shown in fig. 1, comprises the following steps:
firstly, selecting a text data set, and preprocessing text data in the text data set;
secondly, converting the text sequence into vector characteristic representation by using word vectors;
thirdly, averaging word vectors of each document to serve as vector feature representation of the document, and carrying out K-means clustering on the representation of the document to generate a pseudo label of the document;
fourthly, taking the word vectors of the documents and the pseudo labels generated in the third step as training data, training a classifier based on a capsule network; the classifier is deliberately not trained to convergence, so that a certain training loss is retained;
and fifthly, clustering the text data set by using the trained capsule network classifier.
The specific flow of this embodiment is shown in fig. 1.
Firstly, selecting a text data set Google News;
for the selected original text data, the following describes the conversion manner of the data:
vectorized representation of text:
(a) For the preprocessed text, a word vector representation of each word in the data set is learned using the word vector model word2vec; the dimension of the word vector is D_e.
(b) For each document d_i of the data set, the word vectors of its N_i words form an N_i × D_e matrix according to the acquired word vectors; average pooling over the first dimension yields a D_e-dimensional representation of the document.
Then the KMeans module of scikit-learn is used: the number of clusters K is specified and K-means clustering is performed on the vectorized texts to generate the pseudo-labels of the documents.
Then the maximum document length N is specified. For a document d_i with N_i words, if N_i ≥ N the document is truncated to its first N words; otherwise it is padded with N − N_i copies of a special character ε representing a blank. Each word is then replaced in sequence by the corresponding word vector trained in the second step, and ε is replaced by an all-zero vector, so that each document corresponds to an N × D_e word vector matrix.
The word vector matrices of the documents and the corresponding pseudo-labels are used as training data for the capsule network classifier, and the number of capsules in the network's output layer is set equal to the K of the K-means clustering. The structure of the capsule network in this example is shown in fig. 3. Training is stopped once the training loss falls within a certain range, such as 0.2 ± 0.01.
Finally, the trained capsule network classifier assigns a category to each document in the data set; by the definition of the capsule network, the category of a document is the index of the output-layer capsule with the largest norm.
The method can also be applied to other various text data sets, and the specific process is not described in detail.
The invention provides a text clustering system based on K-means clustering and a capsule network, which comprises the following steps:
the input representation unit is used for preprocessing the text data and serializing the text data by using the word vectors;
the pseudo label generating unit is used for clustering the preprocessed data with the K-means algorithm to obtain pseudo labels;
and the class label generating unit is used for training the classifier based on the capsule network by adopting the serialized text data and the pseudo label, controlling the training loss and acquiring the network output.
The parameters in the above embodiments are determined from experimental results: different parameter combinations are tested and the group with the best accuracy is selected. The parameters can be adjusted appropriately according to requirements while still achieving the purpose of the invention.
The protection of the present invention is not limited to the above embodiments. Variations and advantages that may occur to those skilled in the art may be incorporated into the invention without departing from the spirit and scope of the inventive concept, which is set forth in the following claims.

Claims (8)

1. A text clustering method based on K-means clustering and a capsule network is characterized by comprising the following steps:
selecting a text data set, and preprocessing text data in the text data set;
converting the text sequence into vector characteristic representation by using the word vector;
averaging word vectors of each document to serve as vector feature representation of the document, and carrying out K-means clustering on the representation of the document to generate a pseudo label of the document;
step four, taking the word vector of the document and the pseudo label as training data to train a classifier based on a capsule network;
and step five, clustering the text data set by using the trained capsule network classifier.
2. The method according to claim 1, wherein in the first step, preprocessing the text data comprises: removing stop words, special symbols and links.
3. The text clustering method according to claim 1, wherein the second step specifically comprises: training the preprocessed text data with the word vector model word2vec to learn the word vector representation of each word in the whole text data set; the dimension of the word vector is D_e.
4. The text clustering method according to claim 1, wherein in the third step, the pseudo label of the document is generated according to the following steps:
(1) for each document d_i of the text data set, the word vectors of its N_i words form an N_i × D_e matrix according to the acquired word vectors; average pooling over the first dimension yields a D_e-dimensional text representation;
(2) suppose the text data set contains M documents in total; K-means clustering is performed on the M × D_e matrix obtained in step (1), and the K-means cluster assignment of each document is recorded as the pseudo-label of that document.
5. The text clustering method according to claim 1, wherein in the fourth step, the word vector representation of a document is constructed as follows: a maximum document length N is specified for the data set; for a document d_i with N_i words, if N_i ≥ N the document is truncated to its first N words, otherwise it is padded with N − N_i copies of a special character ε representing a blank; finally, each word in the document is replaced in sequence by the corresponding word vector trained in the second step, and ε is replaced by an all-zero vector, so that each document corresponds to an N × D_e word vector matrix.
6. The text clustering method according to claim 1, wherein in the fourth step, the classifier based on the capsule network uses convolution in the shallow layers; the deep layers use a dynamic routing mechanism, with the norm of each capsule output by the last layer representing the probability of the corresponding class.
7. The text clustering method according to claim 6, wherein the probability calculation for each category comprises the steps of:
(1) the input is an N × D_e matrix, where N is the maximum sentence length and D_e is the dimension of the word vector;
(2) n-gram convolutional layer: let W_a be a sliding window of size K_1 × D_e; convolving the input with it yields a feature map whose i-th entry is

m_i = f(W_a ∘ X_{i:i+K_1−1} + b_0)

where ∘ denotes element-wise multiplication, b_0 is the bias term, and f is the ReLU activation function; thus, with B sliding windows of the same size, a feature matrix of size (L − K_1 + 1) × B is obtained (L being the sentence length):

M = [m_1, m_2, ..., m_B]
(3) primary capsule layer: capsules are introduced here; the i-th capsule is

p_i = g((W_b)^T M_i + b_1)

where W_b is a weight matrix of dimension B × d, d is the dimension of a capsule, M_i is the B-dimensional vector given by the i-th component of the previous layer's output, and g is the squash function; the output of the primary capsule layer is then

P = [p_1, p_2, ..., p_C]

i.e. (L − K_1 + 1) × C capsules of dimension d;
(4) fully connected capsule layer:

û_{j|i} = W_{ij} p_i

where W_{ij} is a shared weight matrix; a dynamic routing algorithm is used to compute the upper-layer capsules v_j;
in the fourth step, the classifier based on the capsule network adopts the following loss function during training:

L_k = T_k max(0, m^+ − ||v_k||)^2 + λ(1 − T_k) max(0, ||v_k|| − m^−)^2

where T_k = 1 if and only if the label of the text is category k (and T_k = 0 otherwise), ||v_k|| is the norm of the k-th capsule of the output layer, and m^+, m^−, λ are adjustable hyper-parameters;
in the fifth step, when clustering texts with the trained capsule network, the index of the output capsule with the largest norm is taken, that is:

prediction(x) = argmax_j ||v_j||.
8. a text clustering system based on K-means clustering and capsule networks, characterized in that the text clustering method according to any one of claims 1 to 7 is used, the system comprising the following:
the input representation unit is used for preprocessing the text data and serializing the text data by using the word vectors;
the pseudo label generating unit is used for clustering the preprocessed data with the K-means algorithm to obtain pseudo labels;
and the class label generating unit is used for training the classifier based on the capsule network by adopting the serialized text data and the pseudo label, controlling the training loss and acquiring the network output.
CN201910794559.9A 2019-08-27 2019-08-27 Text clustering method and system based on K-means clustering and capsule network Pending CN110647919A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910794559.9A CN110647919A (en) 2019-08-27 2019-08-27 Text clustering method and system based on K-means clustering and capsule network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910794559.9A CN110647919A (en) 2019-08-27 2019-08-27 Text clustering method and system based on K-means clustering and capsule network

Publications (1)

Publication Number Publication Date
CN110647919A true CN110647919A (en) 2020-01-03

Family

ID=69009820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910794559.9A Pending CN110647919A (en) 2019-08-27 2019-08-27 Text clustering method and system based on K-means clustering and capsule network

Country Status (1)

Country Link
CN (1) CN110647919A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460818A (en) * 2020-03-31 2020-07-28 中国测绘科学研究院 Web page text classification method based on enhanced capsule network and storage medium
CN111737456A (en) * 2020-05-15 2020-10-02 恩亿科(北京)数据科技有限公司 Corpus information processing method and apparatus
CN112115259A (en) * 2020-06-17 2020-12-22 上海金融期货信息技术有限公司 Feature word driven text multi-label hierarchical classification method and system
CN112235434A (en) * 2020-10-16 2021-01-15 重庆理工大学 DGA network domain name detection and identification system fusing k-means and capsule network thereof
CN112261028A (en) * 2020-10-16 2021-01-22 重庆理工大学 DGA botnet domain name detection method based on capsule network and k-means
WO2021247610A1 (en) * 2020-06-01 2021-12-09 Cognizer, Inc. Semantic frame identification using capsule networks

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070094500A1 (en) * 2005-10-20 2007-04-26 Marvin Shannon System and Method for Investigating Phishing Web Sites
CN107145503A (en) * 2017-03-20 2017-09-08 中国农业大学 Remote supervision non-categorical relation extracting method and system based on word2vec
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109118479A (en) * 2018-07-26 2019-01-01 中睿能源(北京)有限公司 Defects of insulator identification positioning device and method based on capsule network
CN109492678A (en) * 2018-10-24 2019-03-19 浙江工业大学 A kind of App classification method of integrated shallow-layer and deep learning
CN109784405A (en) * 2019-01-16 2019-05-21 山东建筑大学 Cross-module state search method and system based on pseudo label study and semantic consistency
CN110046671A (en) * 2019-04-24 2019-07-23 吉林大学 A kind of file classification method based on capsule network
CN110059181A (en) * 2019-03-18 2019-07-26 中国科学院自动化研究所 Short text stamp methods, system, device towards extensive classification system
CN110097096A (en) * 2019-04-16 2019-08-06 天津大学 A kind of file classification method based on TF-IDF matrix and capsule network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070094500A1 (en) * 2005-10-20 2007-04-26 Marvin Shannon System and Method for Investigating Phishing Web Sites
CN107145503A (en) * 2017-03-20 2017-09-08 中国农业大学 Remote supervision non-categorical relation extracting method and system based on word2vec
CN108595632A (en) * 2018-04-24 2018-09-28 福州大学 A kind of hybrid neural networks file classification method of fusion abstract and body feature
CN109118479A (en) * 2018-07-26 2019-01-01 中睿能源(北京)有限公司 Defects of insulator identification positioning device and method based on capsule network
CN109492678A (en) * 2018-10-24 2019-03-19 浙江工业大学 A kind of App classification method of integrated shallow-layer and deep learning
CN109784405A (en) * 2019-01-16 2019-05-21 山东建筑大学 Cross-module state search method and system based on pseudo label study and semantic consistency
CN110059181A (en) * 2019-03-18 2019-07-26 中国科学院自动化研究所 Short text stamp methods, system, device towards extensive classification system
CN110097096A (en) * 2019-04-16 2019-08-06 天津大学 A kind of file classification method based on TF-IDF matrix and capsule network
CN110046671A (en) * 2019-04-24 2019-07-23 吉林大学 A kind of file classification method based on capsule network

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
HAO REN 等: "COMPOSITIONAL CODING CAPSULE NETWORK WITH K-MEANS ROUTING FOR TEXT CLASSIFICATION", 《ARXIV》 *
WEI ZHAO 等: "Investigating Capsule Networks with Dynamic Routing for Text Classification", 《ARXIV》 *
曾谁飞 et al.: "A New Text Representation Model Method Based on Neural Networks", Journal on Communications (《通信学报》) *
衷路生 et al.: "Research on Bearing Fault Diagnosis with Multi-level Neural Networks", Computer Engineering and Applications (《计算机工程与应用》) *
阳馨 et al.: "A Chinese Text Classification Algorithm Based on Multiple Feature Pooling", Journal of Sichuan University (Natural Science Edition) *
陈培新: "Research on Vector Representation and Modeling Methods of Text Semantics", China Master's Theses Full-text Database, Information Science and Technology *
陈龙 et al.: "Progress in Sentiment Classification Research", Journal of Computer Research and Development (《计算机研究与发展》) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460818A (en) * 2020-03-31 2020-07-28 中国测绘科学研究院 Web page text classification method based on enhanced capsule network and storage medium
CN111737456A (en) * 2020-05-15 2020-10-02 恩亿科(北京)数据科技有限公司 Corpus information processing method and apparatus
WO2021247610A1 (en) * 2020-06-01 2021-12-09 Cognizer, Inc. Semantic frame identification using capsule networks
CN112115259A (en) * 2020-06-17 2020-12-22 上海金融期货信息技术有限公司 Feature word driven text multi-label hierarchical classification method and system
CN112235434A (en) * 2020-10-16 2021-01-15 重庆理工大学 DGA network domain name detection and identification system fusing k-means and capsule network thereof
CN112261028A (en) * 2020-10-16 2021-01-22 重庆理工大学 DGA botnet domain name detection method based on capsule network and k-means

Similar Documents

Publication Publication Date Title
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN111368996B (en) Retraining projection network capable of transmitting natural language representation
CN109325231B (en) Method for generating word vector by multitasking model
CN110647919A (en) Text clustering method and system based on K-means clustering and capsule network
CN111160037B (en) Fine-grained emotion analysis method supporting cross-language migration
CN110046248B (en) Model training method for text analysis, text classification method and device
Dashtipour et al. Exploiting deep learning for Persian sentiment analysis
CN110263325B (en) Chinese word segmentation system
CN111078833A (en) Text classification method based on neural network
CN112925904B (en) Lightweight text classification method based on Tucker decomposition
CN111222318A (en) Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
Chen et al. Deep neural networks for multi-class sentiment classification
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
CN112988970A (en) Text matching algorithm serving intelligent question-answering system
Chakravarthy et al. HYBRID ARCHITECTURE FOR SENTIMENT ANALYSIS USING DEEP LEARNING.
VeeraSekharReddy et al. An attention based bi-LSTM DenseNet model for named entity recognition in english texts
Yang et al. Text classification based on convolutional neural network and attention model
CN113065350A (en) Biomedical text word sense disambiguation method based on attention neural network
Litvinov Research of neural network methods of text information classification
CN107729509A (en) The chapter similarity decision method represented based on recessive higher-dimension distributed nature
Ariwibowo et al. Hate Speech Text Classification Using Long Short-Term Memory (LSTM)
Fan et al. Multi-label Chinese question classification based on word2vec
Kumari et al. An integrated single framework for text, image and voice for sentiment mining of social media posts
Rao et al. Algorithm for using NLP with extremely small text datasets
Kim Research on Text Classification Based on Deep Neural Network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200103

RJ01 Rejection of invention patent application after publication