CN114625879A - Short text clustering method based on self-adaptive variational encoder - Google Patents

Short text clustering method based on self-adaptive variational encoder

Info

Publication number
CN114625879A
CN114625879A (application CN202210299111.1A)
Authority
CN
China
Prior art keywords
clustering
encoder
text
distribution
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210299111.1A
Other languages
Chinese (zh)
Inventor
范青武
王子栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210299111.1A priority Critical patent/CN114625879A/en
Publication of CN114625879A publication Critical patent/CN114625879A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A short text clustering method based on an adaptive variational autoencoder, relating to the technical field of text clustering. First, short texts are represented as vectors using Sentence-BERT. Second, an autoencoder converts the vectors into low-dimensional feature vectors, and cluster centers are extracted with the K-means method. The input vectors are then pre-trained with a variational autoencoder that uses these cluster centers as the expected means of its latent distribution, converting the inputs into feature vectors that follow a distribution centered on the cluster centers. A classifier is constructed from the feature vectors by the K-means algorithm, and the weights of the classifier and the encoder are fine-tuned according to the assignment distribution. Finally, the clustering result is obtained from the fine-tuned encoder and classifier. The method alleviates the high-dimensional sparsity of text vectors in short text clustering and provides a new deep feature embedding algorithm for short text clustering.

Description

Short text clustering method based on self-adaptive variational encoder
Technical Field
The invention relates to text clustering technology, in particular to the clustering of short texts and the construction of a corresponding deep clustering algorithm.
Background
With the rapid development of information technology, massive amounts of short text data are generated on media platforms, and fields such as news recommendation, user surveys and event detection all need to mine valuable information from these data. Compared with long texts, short texts contain few words, are ambiguous and carry irregular information, which makes their features difficult to extract and express.
Most words in a short text appear only once, so traditional vectorization methods based on word frequency cannot express the text features well, and the resulting sparse representations suffer from missing word co-occurrence and a lack of context information. To address these problems, a number of word embedding models have been proposed, such as Word2vec, GloVe, ELMo and BERT. These word embedding models are trained on large corpora and represent texts with high-dimensional vectors, which enriches the features of short texts to a certain extent and alleviates the sparsity of text vectors, but also places higher requirements on the clustering algorithm.
Long-established clustering methods such as K-means and Gaussian Mixture Models (GMM) perform well in low-dimensional data spaces but poorly in high-dimensional ones. Deep neural networks, on the other hand, are an effective feature embedding method: they can map vectorized text data to a low-dimensional, separable representation space and thereby reduce the difficulty faced by the text clustering algorithm.
Deep Embedded Clustering (DEC) combines clustering with deep embedded learning. It is a method proposed by Xie in 2016 in which a deep neural network learns feature representations and cluster assignments simultaneously. DEC maps the data from the original space to a low-dimensional feature space and, after a soft cluster assignment, iteratively optimizes a clustering objective; it has become the baseline algorithm of deep embedded clustering. However, DEC typically uses an autoencoder (AE) for the deep embedding, which optimizes the network parameters by minimizing the mean squared error (MSE) between output and input. This yields low-dimensional feature vectors that retain the information of the original data, but because the representation space is not regularized, the distribution of the data is easily disturbed, and different classes intersect and overlap in the representation space.
Disclosure of Invention
In order to solve the problems mentioned above, the invention provides a short text clustering method based on an adaptive variational autoencoder, which aims to convert the high-dimensional features of vectorized texts into low-dimensional separable features, so that short texts with similar semantics can be clustered accurately.
The short text clustering algorithm based on the adaptive variational autoencoder comprises the following steps:
S1, collecting data;
S2, inputting the texts into Sentence-BERT and converting them into sentence vectors;
S3, pre-training the sentence vectors with an autoencoder to obtain a dimension-reduction encoder;
S4, clustering the reduced-dimension data with K-means to obtain a cluster label and a cluster center for each text;
S5, pre-training a variational autoencoder on the text vectors, using the cluster centers as the expected means when training the encoder network parameters;
S6, clustering the feature vectors generated by the pre-trained encoder with K-means to obtain the initial cluster centers;
S7, soft-assigning the vectors using the cluster centers;
S8, using an auxiliary target distribution derived from the current high-confidence assignments to update the pre-trained encoder and redefine the cluster centroids;
S9, repeating S7 and S8, and outputting the clustering result when the convergence criterion or the maximum number of iterations is reached.
Drawings
FIG. 1 is a schematic diagram showing the details of the short text clustering algorithm based on the adaptive variational autoencoder.
FIG. 2 is a flow chart of the short text clustering algorithm based on the adaptive variational autoencoder.
Detailed Description
The invention provides a short text clustering algorithm based on an adaptive variational autoencoder, which mainly comprises the steps described below.
The detailed description of the invention is provided with reference to the accompanying FIG. 1:
In step S1, a text data set is collected. Microblog source texts are extracted from the microblog platform to construct a short text corpus D = {(s_i, l_i) | 1 ≤ i ≤ n}, where n is the number of texts in corpus D. S = {s_1, s_2, ..., s_n} denotes the textual representation of all texts, and L = {l_1, l_2, ..., l_n} denotes the true labels corresponding to them. Because an unsupervised clustering approach is adopted, the labels are used only to evaluate the final result and do not participate in model training.
In step S2, without preprocessing the text data, Sentence-BERT is used to represent the texts in vector space. Taking the i-th short text as an example, it is represented as D_i = {x_i : x_i ∈ R^m}, where m is the dimension of the converted sentence vector; this dimension is determined by the model employed, here 384.
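A minimal sketch of this vectorization step, assuming the sentence-transformers library and the multilingual MiniLM checkpoint named later in the experimental setup, could look as follows:

```python
# Sketch of step S2 (assumes the sentence-transformers package is available).
from sentence_transformers import SentenceTransformer

# Multilingual MiniLM model referenced in the experimental setup; it outputs 384-d vectors.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

texts = ["example short text", "another microblog post"]  # placeholder corpus S
X = model.encode(texts)  # array of shape (n, 384): one sentence vector x_i per text
```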
In step S3, the text vectors are trained with an autoencoder. For a converted sentence vector x_i ∈ R^m, an encoder is constructed to encode the original data:

z_i = f_φ(x_i) = σ_e(W_e x_i + b_e) ∈ R^l    (1)

The original data are then reconstructed with a decoder:

x̂_i = g_ψ(z_i) = σ_d(W_d z_i + b_d) ∈ R^m    (2)

The loss function minimizes the reconstruction error:

L_AE = (1/n) Σ_{i=1}^{n} ||x_i − x̂_i||²    (3)

where x_i, x̂_i and z_i are the input data, the reconstructed output and the latent variable respectively, and f_φ and g_ψ denote the transformation functions of the encoder and decoder. σ is the activation function, chosen here as ReLU(x); W_e and b_e are the weights and biases, with subscripts e and d denoting the encoder and decoder respectively.
By repeatedly minimizing the reconstruction error, the autoencoder updates the network weights W_e and biases b_e. After the set number of iterations t is finished, an encoder f_φ(x): X ∈ R^m → Z ∈ R^l is obtained. t depends on the complexity of the network; the invention sets t = 10. Here Z is the latent feature space, m = 384 is the dimension of the input sentence vector mentioned above, and l, the dimension of the hidden layer, is the same as the number k of target clusters. Because k is smaller than the input dimension m, a dimension-reduction encoder f_φ(x) is obtained.
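A hedged sketch of this pre-training step, assuming PyTorch and the [500, 500, 2000, k] layer sizes given later in the experimental setup (full-batch training is used here only for brevity):

```python
# Sketch of the step-S3 autoencoder pre-training (PyTorch assumed).
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, in_dim=384, k=20):
        super().__init__()
        dims = [in_dim, 500, 500, 2000, k]
        enc, dec = [], []
        for a, b in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(a, b), nn.ReLU()]      # eq. (1): sigma_e(W_e x + b_e)
        for a, b in zip(dims[::-1][:-1], dims[::-1][1:]):
            dec += [nn.Linear(a, b), nn.ReLU()]      # eq. (2): sigma_d(W_d z + b_d)
        self.encoder = nn.Sequential(*enc[:-1])      # no activation on the code layer
        self.decoder = nn.Sequential(*dec[:-1])      # no activation on the output layer

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def pretrain_ae(X, k=20, epochs=10, lr=1e-3):
    """Minimize the reconstruction loss of eq. (3) for t = epochs iterations."""
    X = torch.as_tensor(X, dtype=torch.float32)
    model = AE(X.shape[1], k)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        z, x_hat = model(X)
        loss = loss_fn(x_hat, X)                     # eq. (3)
        loss.backward()
        opt.step()
    return model
```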
In step S4, K-means is used as the clustering algorithm to cluster the reduced-dimension text representations z_i. The Euclidean distance is adopted as the distance measure of the K-means algorithm, whose goal is to select centroids μ_k that minimize the within-cluster sum of squares:

Σ_{i=0}^{n} min_{μ_k ∈ C} ||z_i − μ_k||²    (4)

The purpose of this clustering step is to find the centroids μ* = {μ*_1, ..., μ*_k} and the text category k assigned to each text. From its category k and the corresponding centroid μ*_k, the expected mean of each text is obtained as μ*_i.
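A short sketch of step S4, assuming scikit-learn is used for K-means; the per-text expected mean is simply the centroid of the cluster each text falls into:

```python
# Sketch of step S4: cluster the low-dimensional codes and keep one expected mean per text.
from sklearn.cluster import KMeans

def cluster_codes(Z, k=20):
    km = KMeans(n_clusters=k, n_init=10).fit(Z)    # eq. (4): minimize within-cluster SSE
    centroids = km.cluster_centers_                # mu* = {mu*_1, ..., mu*_k}
    labels = km.labels_                            # category of each text
    expected_means = centroids[labels]             # mu*_i for every text i
    return labels, centroids, expected_means
```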
In step S5, the variational autoencoder adopts the network configuration of the deep autoencoder, with the number of network layers increased accordingly. The variational autoencoder (VAE) is trained using the expected means μ*. A VAE aims to learn a generative model p(X, Z') that maximizes the marginal likelihood log p(X) of the data set; Z' denotes the representation space of the VAE, to distinguish it from the space of the AE. The marginal likelihood cannot be computed directly because the latent variables are difficult to integrate out. To address this problem, the VAE introduces a variational distribution q_φ(Z'|X), parameterized by a neural network, that approximates the true posterior, and optimizes the evidence lower bound (ELBO) of log p(X):

log p(X) ≥ E_{q_φ(Z'|X)}[log p_θ(X|Z')] − KL(q_φ(Z'|X) || p(Z'))    (5)
where φ denotes the parameters of the inference network and θ those of the decoder; the first term is the reconstruction loss and the second term is the KL divergence between the approximate posterior and the prior. In a VAE the Gaussian distribution N(0, I) is the common choice of prior p(Z'), and the KL divergence between the approximate posterior q_φ(Z'|X) and the prior p(Z') can be calculated as:

KL(q_φ(Z'|X) || p(Z')) = (1/2) Σ_i (μ_i² + σ_i² − log σ_i² − 1)    (6)

where μ_i and σ_i are respectively the mean and standard deviation of the approximate posterior in the i-th dimension of the representation space.
The invention uses the cluster centers μ* from step S4 as the expected means of the VAE feature distribution, letting p(Z') become N(μ*, I). The KL divergence is therefore calculated as:

KL(q_φ(Z'|X) || p(Z')) = (1/2) Σ_i ((μ_i − μ*_i)² + σ_i² − log σ_i² − 1)    (7)
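The modified KL term of equation (7) can be written against the log-variance parameterization commonly used in VAE implementations; the following is a sketch under that assumption:

```python
import torch

def kl_to_cluster_prior(mu, logvar, mu_star):
    """KL( N(mu, sigma^2) || N(mu_star, I) ), summed over latent dimensions as in eq. (7).

    mu, logvar : outputs of the inference network, shape (batch, k)
    mu_star    : per-sample expected mean, i.e. the centroid of the cluster
                 each text was assigned to in step S4, shape (batch, k)
    """
    var = logvar.exp()
    return 0.5 * torch.sum((mu - mu_star) ** 2 + var - logvar - 1.0, dim=1).mean()
```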
the second term of the VAE loss function is different from the ordinary self-encoder, and the improvement according to the data set is mainly directed to the KL divergence part, if the term is not included, the change is basically degraded to the conventional AE, the improvement of the invention is lost, and the phenomenon that the KL divergence disappears appears.
The invention applies a fixed Batch Normalization (BN) to the output μ_i of the inference network. BN is a regularization technique widely used in deep learning; it keeps the neuron outputs in a normal range and is also an effective way of preventing gradient explosion. Unlike other tasks, which apply BN to hidden layers in pursuit of fast and stable training, here BN is used as a tool to convert μ_i into a distribution with fixed mean and variance. Mathematically, the regularized μ_i is

μ̂_i = γ · (μ_i − μ_{B_i}) / σ_{B_i} + β    (8)

where μ_i and μ̂_i denote the approximate posterior mean before and after BN, μ_{B_i} and σ_{B_i} are the mean and standard deviation of μ_i over the batch (biased estimates for each dimension of the sample), and γ and β are the scale and shift parameters; a fixed γ is used here:
γ = sqrt(τ + (1 − τ) · sigmoid(θ))    (9)

where τ ∈ (0, 1) is a constant (the invention takes τ = 0.5) and θ is a trainable parameter. Thus μ̂_i has mean β and variance γ². β is a learnable parameter that makes the distribution more flexible; it is set to 0 in the invention.
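A sketch of such a fixed-scale BN layer over the posterior means, assuming PyTorch; the construction of γ from τ and θ follows the description above and β is fixed to 0 as stated:

```python
import torch
import torch.nn as nn

class FixedScaleBN(nn.Module):
    """BatchNorm over the posterior means mu with a constrained scale gamma.

    gamma^2 = tau + (1 - tau) * sigmoid(theta) keeps the variance of the normalized
    means above tau, which prevents the KL term of eq. (7) from collapsing to zero.
    beta is fixed to 0 as in the description.
    """
    def __init__(self, dim, tau=0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim, affine=False)   # plain normalization, no affine part
        self.theta = nn.Parameter(torch.zeros(dim))   # trainable parameter controlling gamma
        self.tau = tau
        self.beta = 0.0

    def forward(self, mu):
        gamma = torch.sqrt(self.tau + (1.0 - self.tau) * torch.sigmoid(self.theta))
        return gamma * self.bn(mu) + self.beta
```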
The invention refers to this improved variational autoencoder as the adaptive variational autoencoder (SVAE). After SVAE pre-training, the path from the input to the sampling step is taken as the SVAE encoder. With this encoder, the invention performs a non-linear mapping of the data f(x): X ∈ R^d → Z' ∈ R^c, where the dimension c is the same as the hidden-layer dimension of the autoencoder and is likewise set with reference to the number of cluster categories k.
Step S6: the features Z' of the representation space are clustered using K-means to obtain the cluster centers, which are used as the initial weights of the DEC clustering layer.
Step S7: the soft cluster assignment of each data point of Z' in the feature space is computed from the K-means cluster centers. A Student's t-distribution with a single degree of freedom is used as the kernel q_ij to measure the similarity between the embedded point z'_i and the centroid k_j:

q_ij = (1 + ||z'_i − k_j||² / α)^(−(α+1)/2) / Σ_{j'} (1 + ||z'_i − k_{j'}||² / α)^(−(α+1)/2)    (10)

where z'_i = f(x_i) ∈ Z' is the embedding corresponding to x_i ∈ X, and α is the degree of freedom of the Student's t-distribution; α is taken as 1. q_ij can be interpreted as the probability of assigning sample i to cluster j (i.e., a soft assignment).
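A sketch of the soft assignment of equation (10), assuming PyTorch tensors for the embeddings and the centroids:

```python
import torch

def soft_assign(z, centroids, alpha=1.0):
    """Student's-t soft assignment q_ij of embeddings z (n, c) to centroids (k, c)."""
    dist2 = torch.cdist(z, centroids) ** 2                    # ||z'_i - k_j||^2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)                     # normalize over clusters
```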
Step S8: an auxiliary distribution p_ij is used to improve the purity of the clusters and to emphasize data points assigned with high confidence. The probability p_ij in the auxiliary distribution P is calculated as:

p_ij = (q_ij² / f_j) / Σ_{j'} (q_{ij'}² / f_{j'}),  where f_j = Σ_i q_ij    (11)

Fine-tuning is carried out by matching the soft assignments to the target distribution; for this purpose the objective is defined as the KL divergence between the soft assignment Q and the auxiliary distribution P, as follows:

L = KL(P || Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (12)
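The auxiliary target distribution of equation (11) and the KL objective of equation (12) can be sketched as follows (PyTorch assumed, with q as returned by the soft-assignment sketch above):

```python
import torch

def target_distribution(q):
    """Auxiliary distribution p_ij of eq. (11): square q, divide by the soft
    cluster frequencies f_j, then renormalize each row."""
    weight = q ** 2 / q.sum(dim=0)                 # q_ij^2 / f_j
    return weight / weight.sum(dim=1, keepdim=True)

def dec_loss(q, p):
    """KL(P || Q) of eq. (12)."""
    return torch.sum(p * torch.log(p / q))
```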
step S9: joint optimization of cluster center k using Stochastic Gradient Descent (SGD)jWith the encoder parameters θ in SVAE, the gradient at each sample and each cluster center is calculated as follows:
Figure BDA0003544229260000063
Figure BDA0003544229260000064
The K-means result is used to initialize the weights of the clustering layer; the encoder is then updated and class clusters are assigned from high-confidence predictions. Steps S7 and S8 are repeated until the iteration count t1 reaches 2000 or the label change rate δ falls below 0.001. Each sample is then assigned to the centroid for which its soft assignment q_ij is largest, which completes the assignment and clustering of the samples and yields the clustering result. The label change rate δ is calculated as

δ = (1/n) Σ_{i=1}^{n} 1(L_i ≠ L'_i)    (15)

where L_i and L'_i are respectively the current label of the i-th text and its label in the previous iteration, and n is the total number of samples.
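Putting steps S7 to S9 together, a hedged sketch of the fine-tuning loop (PyTorch assumed; `encoder` is the pre-trained SVAE encoder, `centroids` come from step S6, the helper functions are the sketches given above, and the momentum value is an assumption):

```python
import torch

def finetune(encoder, centroids, X, max_iter=2000, tol=1e-3, lr=0.1):
    centroids = torch.nn.Parameter(centroids.clone())
    opt = torch.optim.SGD(list(encoder.parameters()) + [centroids], lr=lr, momentum=0.9)
    prev_labels = None
    for t1 in range(max_iter):
        z = encoder(X)
        q = soft_assign(z, centroids)                # step S7
        p = target_distribution(q).detach()          # step S8, fixed target
        labels = q.argmax(dim=1)                     # hard assignment from soft q_ij
        if prev_labels is not None:
            delta = (labels != prev_labels).float().mean()   # label change rate, eq. (15)
            if delta < tol:
                break
        prev_labels = labels
        loss = dec_loss(q, p)                        # eq. (12)
        opt.zero_grad()
        loss.backward()                              # gradients of eqs. (13)-(14) via autograd
        opt.step()
    return labels
```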
Experimental setup: the hardware environment used for verification is, but is not limited to: an Intel Xeon 4210R CPU at 2.4 GHz, 64 GB of memory, two NVIDIA GeForce RTX 3060 graphics cards, and Windows 10 as the operating system. Text vector representation for the general data sets is realized with the general sentence-transformer library of Sentence-BERT (paraphrase-multilingual-MiniLM-L12-v2). The model has a maximum sequence length of 128, converts texts into 384-dimensional vectors, and is 384 MB in size; it is the multilingual version of the paraphrase-MiniLM-L12-v2 model, trained on parallel data covering more than 50 languages, mainly on data sets such as AllNLI, sentence-compression and SimpleWiki. Adam optimization (Kingma and Ba, 2015) is used during pre-training, with batch_size set to 64 and the number of pre-training epochs set to 15; the encoder of the stacked autoencoder used in pre-training adopts the network structure [500, 500, 2000, 20], the decoder mirrors it, and the same structure is used in the VAE during formal training. For the DEC training stage, an SGD optimizer is adopted with a learning rate of 0.1 and a decay rate of 0.9, the maximum number of iterations is 1500, and the batch size is the same as in pre-training.
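For reference, the stated hyper-parameters can be collected into a single configuration; the field names below are illustrative, only the values come from this description:

```python
# Illustrative configuration; field names are assumptions, values are as stated above.
config = {
    "sbert_model": "paraphrase-multilingual-MiniLM-L12-v2",
    "sentence_dim": 384,
    "ae_layers": [500, 500, 2000, 20],
    "pretrain": {"optimizer": "adam", "batch_size": 64, "epochs": 15},
    "svae": {"tau": 0.5, "beta": 0.0},
    "dec": {"optimizer": "sgd", "lr": 0.1, "decay": 0.9, "max_iter": 1500, "batch_size": 64},
}
```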
The experiments use four English short-text clustering data sets and one Chinese data set:
(1) SearchSnippets: a collection of web search snippets containing 8 categories of different topics. (2) StackOverflow: a collection of web site post titles released as a Kaggle challenge data set; it contains selected question titles from 20 different categories. (3) Biomedical: a subset of the PubMed data set, in which 20000 paper titles are randomly chosen from 20 groups. (4) Tweet: 2472 tweets from 89 categories. (5) ChineseNews: the Chinese public opinion data set used in this text, drawn from an actual project. Six typical events from 2017-2019 were selected, the corresponding microblogs were crawled by event keywords, and 2000 entries were randomly extracted for each event, giving 12000 entries in total. To ensure the reliability of the data set, the data were screened manually; during annotation a seventh event, "Weibo Night", was extracted as its own category, and the texts unrelated to any event were taken as an eighth category, so that a data set with 8 categories in total was obtained.
The invention is compared with the following clustering algorithms:
(1) BoW & TF-IDF: sentences are vectorized by word frequency into 1500-dimensional vectors, and the K-means algorithm is applied for clustering evaluation. (2) Sentence-BERT (SBERT): texts are converted into 384-dimensional vectors by Sentence-BERT, the same vectorization used in this text, and the K-means algorithm is applied for clustering evaluation. (3) VaDE: a combination of a Gaussian mixture model with a variational autoencoder. Its idea is similar to that of the invention, in that an autoencoder is first pre-trained, but the initial prior of the data is obtained through a GMM, and feature embedding and clustering are completed through encoding, decoding and back-propagated parameter updates. (4) STC2: composed of three separate stages; for each data set it first pre-trains word embeddings on a large corpus with the Word2Vec method, then optimizes a convolutional neural network to further enrich the representation, and finally feeds the sentence representations into K-means for clustering. (5) Self-Train: uses SIF-enhanced pre-trained word embeddings following Xu et al. and the deep embedded clustering algorithm following Xie et al.; the autoencoder is obtained by layer-wise pre-training and is then further optimized. (6) SCCL: the model consists of three components; for each data set the data comprise the original data and augmented data, the input is mapped to a representation space through a neural network, and the encoder parameters are optimized with both a contrastive loss and a clustering loss to complete the classification of the texts.
The results of the experiments are shown in Table 1.
TABLE 1: text clustering results (presented as an image in the original publication)
Table 1 presents the results of the algorithms on the 5 data sets. On the Chinese public opinion data set used in the invention, SVAE obtains the best result, leading the strong Self-Train clustering algorithm by 3.1% in ACC, because the VAE improves the distribution of features in the latent space better than the AE.
On the other 4 general data sets used in the invention, SVAE achieves good results on three of the standard data sets. The large improvement on StackOverflow comes on one hand from the text vectorization model, since SBERT already reaches high accuracy on this data set, and on the other hand from SVAE improving the quality of the feature embedding, which brings a further gain on that basis. Before this, StackOverflow was difficult to cluster well because of its large amount of data and large number of categories. The method described here brings a large improvement in accuracy because it makes good use of the cluster centers produced by the clustering algorithm.
SVAE mainly takes the Self-Train algorithm as its reference. Compared with Self-Train, ACC is improved by 4.5% and 22.4% on SearchSnippets and StackOverflow respectively, while the Biomedical score decreases, mainly because the general pre-trained model used here contains little content from that domain; the same conclusion holds for SCCL. SCCL leads by 3.6% in ACC on SearchSnippets because, in addition to the original data set, SCCL trains the model on augmented data and optimizes the encoder with a contrastive loss as well as the clustering loss, using a more complex architecture than SVAE; however, SVAE improves ACC over SCCL by 6.7% and 2% on StackOverflow and Biomedical respectively. These results verify the effectiveness and importance of the proposed framework: by making full use of prior clustering information on top of a general language model, the adaptability of the algorithm to different data sets is improved.

Claims (4)

1. The short text clustering algorithm based on the self-adaptive variational encoder is characterized by comprising the following steps of:
S1, collecting data;
S2, inputting the texts into Sentence-BERT and converting them into sentence vectors;
S3, pre-training the sentence vectors with an autoencoder to obtain a dimension-reduction encoder;
S4, clustering the reduced-dimension data with K-means to obtain a cluster label and a cluster center for each text;
S5, pre-training a variational autoencoder on the text vectors, using the cluster centers as the expected means when training the encoder network parameters;
S6, clustering the feature vectors generated by the pre-trained encoder with K-means to obtain the initial cluster centers;
S7, soft-assigning the vectors using the cluster centers;
S8, using an auxiliary target distribution derived from the current high-confidence assignments to update the pre-trained encoder and redefine the cluster centroids;
S9, repeating S7 and S8, and outputting the clustering result when the convergence criterion or the maximum number of iterations is reached.
2. The adaptive variational encoder based short text clustering algorithm according to claim 1, characterized in that:
in step S2, vector space representation is performed on the texts using Sentence-BERT, without performing any preprocessing operation on the data;
in step S3, the text vectors are trained with an autoencoder; for a converted sentence vector x_i ∈ R^m, an encoder is constructed to encode the original data:

z_i = f_φ(x_i) = σ_e(W_e x_i + b_e) ∈ R^l    (1)

the original data are then reconstructed with a decoder:

x̂_i = g_ψ(z_i) = σ_d(W_d z_i + b_d) ∈ R^m    (2)

the loss function minimizes the reconstruction error:

L_AE = (1/n) Σ_{i=1}^{n} ||x_i − x̂_i||²    (3)

where x_i, x̂_i and z_i are respectively the input data, the reconstructed output and the latent variable, and f_φ and g_ψ respectively denote the transformation functions of the encoder and decoder; σ is the activation function, chosen here as ReLU(x); W_e and b_e are the weights and biases, with subscripts e and d denoting the encoder and decoder respectively;
the autoencoder updates the network weights W, often by minimizing reconstruction errorseAnd deviation beAfter the set iteration number t is finished, an encoder f is obtainedφ(x):X∈Rm→Z∈Rl(ii) a t is set to t of 10, where Z is potentialA feature space, where m is 384 dimensions of the input sentence vector mentioned above, and l is the same dimension of the hidden layer as the clustering target class k of the clustering text, and the obtained dimension reduction encoder f is obtained because the clustering class k is smaller than the input dimension dφ(x);
in step S4, K-means is used as the clustering algorithm to cluster the reduced-dimension text representations z_i; the Euclidean distance is adopted as the distance measure of the K-means algorithm, whose goal is to select centroids μ_k that minimize the within-cluster sum of squares:

Σ_{i=0}^{n} min_{μ_k ∈ C} ||z_i − μ_k||²    (4)

the purpose of this clustering step is to find the centroids μ* = {μ*_1, ..., μ*_k} and the text category k assigned to each text; from its category k and the corresponding centroid μ*_k, the expected mean of each text is obtained as μ*_i;
the preprocessing thus yields the vectorization X of the texts and the cluster centers μ* of the reduced-dimension texts.
3. The adaptive variational encoder based short text clustering algorithm of claim 2, characterized in that:
in step S5, the variational autoencoder (VAE) is trained using the expected means μ*, and a BN layer is added to the VAE to prevent the KL divergence in the VAE loss function from vanishing, together forming the SVAE framework;
the cluster centers μ* from step S4 are used as the expected means of the VAE feature distribution, letting p(Z') become N(μ*, I); the KL divergence is therefore calculated as:

KL(q_φ(Z'|X) || p(Z')) = (1/2) Σ_i ((μ_i − μ*_i)² + σ_i² − log σ_i² − 1)

BN is used as a tool to convert μ_i into a distribution with fixed mean and variance; the mathematically regularized μ_i is

μ̂_i = γ · (μ_i − μ_{B_i}) / σ_{B_i} + β

where μ_i and μ̂_i denote the approximate posterior mean before and after BN; μ_{B_i} and σ_{B_i} are the mean and standard deviation of μ_i over the batch, biased estimates for each dimension of the sample; γ and β are the scale and shift parameters, with
γ = sqrt(τ + (1 − τ) · sigmoid(θ))

where τ ∈ (0, 1) is a constant and θ is a trainable parameter; thus μ̂_i has mean β and variance γ²; β is a learnable parameter.
4. The adaptive variational encoder based short text clustering algorithm according to claim 1, characterized in that:
in step S6, the features Z' of the SVAE representation space are clustered using K-means to obtain the cluster centers, which are used as the initial weights of the DEC clustering layer;
in step S7, the soft cluster assignment of each data point of Z' in the feature space is computed from the cluster centers; a Student's t-distribution with a single degree of freedom is used as the kernel q_ij to measure the similarity between the embedded point z'_i and the centroid k_j:

q_ij = (1 + ||z'_i − k_j||² / α)^(−(α+1)/2) / Σ_{j'} (1 + ||z'_i − k_{j'}||² / α)^(−(α+1)/2)

where z'_i = f(x_i) ∈ Z' is the SVAE embedding corresponding to x_i ∈ X, α is the degree of freedom of the Student's t-distribution and is taken as 1, and q_ij is the probability of assigning sample i to cluster j, i.e., the soft assignment;
in step S8, an auxiliary distribution p_ij is used to improve the purity of the clusters and to emphasize data points assigned with high confidence; the probability p_ij in the auxiliary distribution P is calculated as:

p_ij = (q_ij² / f_j) / Σ_{j'} (q_{ij'}² / f_{j'}),  where f_j = Σ_i q_ij

fine-tuning is carried out by matching the soft assignments to the target distribution; for this purpose the objective is defined as the KL divergence between the soft assignment Q and the auxiliary distribution P, as follows:

L = KL(P || Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)
in step S9, the cluster centers k_j and the encoder parameters θ in the SVAE are optimized jointly using stochastic gradient descent; the gradients with respect to each embedded sample and each cluster center are calculated as follows:

∂L/∂z'_i = ((α + 1)/α) Σ_j (1 + ||z'_i − k_j||² / α)^(−1) (p_ij − q_ij)(z'_i − k_j)

∂L/∂k_j = −((α + 1)/α) Σ_i (1 + ||z'_i − k_j||² / α)^(−1) (p_ij − q_ij)(z'_i − k_j)

the K-means result is used to initialize the weights of the clustering layer, the encoder is then updated and class clusters are assigned from high-confidence predictions, and steps S7 and S8 are repeated; when the iteration count t1 reaches 2000 or the label change rate δ falls below 0.001, each sample is assigned to the centroid for which its soft assignment q_ij is largest, which completes the assignment and clustering of the samples and yields the clustering result; the label change rate δ is calculated as

δ = (1/n) Σ_{i=1}^{n} 1(L_i ≠ L'_i)

where L_i and L'_i are respectively the current label of the i-th text and its label in the previous iteration, and n is the total number of samples.
CN202210299111.1A 2022-03-13 2022-03-13 Short text clustering method based on self-adaptive variational encoder Pending CN114625879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210299111.1A CN114625879A (en) 2022-03-13 2022-03-13 Short text clustering method based on self-adaptive variational encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210299111.1A CN114625879A (en) 2022-03-13 2022-03-13 Short text clustering method based on self-adaptive variational encoder

Publications (1)

Publication Number Publication Date
CN114625879A true CN114625879A (en) 2022-06-14

Family

ID=81903320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210299111.1A Pending CN114625879A (en) 2022-03-13 2022-03-13 Short text clustering method based on self-adaptive variational encoder

Country Status (1)

Country Link
CN (1) CN114625879A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115344678A (en) * 2022-07-11 2022-11-15 北京容联易通信息技术有限公司 Clustering method based on fusion of multiple algorithms
CN116010603A (en) * 2023-01-31 2023-04-25 浙江中电远为科技有限公司 Feature clustering dimension reduction method for commercial text classification


Similar Documents

Publication Publication Date Title
Zhang et al. Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition
CN108319686B (en) Antagonism cross-media retrieval method based on limited text space
CN110929030B (en) Text abstract and emotion classification combined training method
CN109241255B (en) Intention identification method based on deep learning
CN106407333B (en) Spoken language query identification method and device based on artificial intelligence
CN109829299B (en) Unknown attack identification method based on depth self-encoder
CN111160467B (en) Image description method based on conditional random field and internal semantic attention
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN111026869B (en) Method for predicting multi-guilty names by using sequence generation network based on multilayer attention
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN111859935B (en) Method for constructing cancer-related biomedical event database based on literature
Gupta et al. Integration of textual cues for fine-grained image captioning using deep CNN and LSTM
CN109858015B (en) Semantic similarity calculation method and device based on CTW (computational cost) and KM (K-value) algorithm
CN114625879A (en) Short text clustering method based on self-adaptive variational encoder
CN114743020A (en) Food identification method combining tag semantic embedding and attention fusion
CN113051399A (en) Small sample fine-grained entity classification method based on relational graph convolutional network
Vu et al. Investigating the learning effect of multilingual bottle-neck features for ASR
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN106021402A (en) Multi-modal multi-class Boosting frame construction method and device for cross-modal retrieval
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN114239612A (en) Multi-modal neural machine translation method, computer equipment and storage medium
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network
CN112560440A (en) Deep learning-based syntax dependence method for aspect-level emotion analysis
CN111914084A (en) Deep learning-based emotion label text generation and evaluation system
US20240104353A1 (en) Sequence-to sequence neural network systems using look ahead tree search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination