CN114625879A - Short text clustering method based on self-adaptive variational encoder - Google Patents

Short text clustering method based on self-adaptive variational encoder

Info

Publication number
CN114625879A
CN114625879A (application CN202210299111.1A)
Authority
CN
China
Prior art keywords
clustering
encoder
text
distribution
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210299111.1A
Other languages
Chinese (zh)
Inventor
范青武
王子栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202210299111.1A priority Critical patent/CN114625879A/en
Publication of CN114625879A publication Critical patent/CN114625879A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A short text clustering method based on an adaptive variational autoencoder, relating to the technical field of text clustering. First, short texts are represented as vectors using Sentence-BERT. Second, an autoencoder converts the vectors into low-dimensional feature vectors, and cluster centers are extracted with the K-means method. The input vectors are then pre-trained with a variational autoencoder that uses these cluster centers as the expected means of its latent distribution, converting the inputs into feature vectors that follow a distribution centered on the cluster centers. A classifier is constructed from the feature vectors by the K-means algorithm, and the weights of the classifier and the encoder are fine-tuned according to the assignment distribution. Finally, the clustering result is obtained from the fine-tuned encoder and classifier. The method alleviates the high-dimensional sparsity of text vectors in short text clustering and provides a new deep feature embedding algorithm for short text clustering.

Description

Short text clustering method based on self-adaptive variational encoder
Technical Field
The invention relates to text clustering technology, in particular to the clustering of short texts and the construction of a corresponding deep clustering algorithm.
Background
With the rapid development of information technology, massive amounts of short text data are generated on media platforms, and fields such as news recommendation, user surveys and event detection all need to mine valuable information from these data. Compared with long texts, short texts contain few words, are ambiguous and carry irregular information, which makes their features difficult to extract and express.
Most words in a short text appear only once, so traditional vectorization methods based on word frequency cannot express the text features well, and the resulting sparse representations suffer from missing word co-occurrence and a lack of context information. To address these problems, a number of word embedding models have been proposed, such as Word2vec, GloVe, ELMo and BERT. These word embedding models are trained on large corpora and represent texts with high-dimensional vectors, which enriches the features of short texts to a certain extent and alleviates the sparsity of text vectors, but also places higher requirements on the clustering algorithm.
Long-established clustering methods such as K-means and Gaussian Mixture Models (GMM) perform well in low-dimensional data spaces but poorly in high-dimensional ones. Deep neural networks, on the other hand, are an effective feature embedding method: they can map vectorized text data to a low-dimensional, separable representation space and thereby reduce the difficulty faced by the text clustering algorithm.
Deep Embedded Clustering (DEC) combines clustering with deep embedded learning. It is a method proposed by Xie in 2016 in which a deep neural network learns feature representations and cluster assignments simultaneously. DEC maps the data from the original space to a low-dimensional feature space and, after a soft cluster assignment, iteratively optimizes a clustering objective; it has become the baseline algorithm of deep embedded clustering. However, DEC typically uses an autoencoder (AE) for the deep embedding, which optimizes the network parameters by minimizing the mean squared error (MSE) between output and input. This yields low-dimensional feature vectors that retain the information of the original data, but because the representation space is not regularized, the distribution of the data is easily disturbed, and different classes intersect and overlap in the representation space.
Disclosure of Invention
In order to solve the problems mentioned above, the invention provides a short text clustering method based on an adaptive variational autoencoder, which aims to convert the high-dimensional features of vectorized texts into low-dimensional separable features, so that short texts with similar semantics can be clustered accurately.
The short text clustering algorithm based on the adaptive variational autoencoder comprises the following steps:
S1, collecting data;
S2, inputting the texts into Sentence-BERT and converting them into sentence vectors;
S3, pre-training the sentence vectors with an autoencoder to obtain a dimension-reduction encoder;
S4, clustering the reduced-dimension data with K-means to obtain a cluster label and a cluster center for each text;
S5, pre-training a variational autoencoder on the text vectors, using the cluster centers as the expected means when training the encoder network parameters;
S6, clustering the feature vectors generated by the pre-trained encoder with K-means to obtain the initial cluster centers;
S7, soft-assigning the vectors using the cluster centers;
S8, using an auxiliary target distribution derived from the current high-confidence assignments to update the pre-trained encoder and redefine the cluster centroids;
S9, repeating S7 and S8, and outputting the clustering result when the convergence criterion or the maximum number of iterations is reached.
Drawings
FIG. 1 is a schematic diagram showing the details of the short text clustering algorithm based on the adaptive variational autoencoder.
FIG. 2 is a flow chart of the short text clustering algorithm based on the adaptive variational autoencoder.
Detailed Description
The invention provides a short text clustering algorithm based on an adaptive variational autoencoder, which mainly comprises the steps described below.
The detailed description of the invention is provided with reference to the accompanying FIG. 1:
In step S1, a text data set is collected. Microblog source texts are extracted from the microblog platform to construct a short text corpus D = {(s_i, l_i) | 1 ≤ i ≤ n}, where n is the number of texts in corpus D. S = {s_1, s_2, ..., s_n} denotes the textual representation of all texts, and L = {l_1, l_2, ..., l_n} denotes the true labels corresponding to them. Because an unsupervised clustering approach is adopted, the labels are used only to evaluate the final result and do not participate in model training.
In step S2, without preprocessing the text data, Sentence-BERT is used to represent the texts in vector space. Taking the i-th short text as an example, it is represented as D_i = {x_i : x_i ∈ R^m}, where m is the dimension of the converted sentence vector; this dimension is determined by the model employed, here 384.
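A minimal sketch of this vectorization step, assuming the sentence-transformers library and the multilingual MiniLM checkpoint named later in the experimental setup, could look as follows:

```python
# Sketch of step S2 (assumes the sentence-transformers package is available).
from sentence_transformers import SentenceTransformer

# Multilingual MiniLM model referenced in the experimental setup; it outputs 384-d vectors.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

texts = ["example short text", "another microblog post"]  # placeholder corpus S
X = model.encode(texts)  # array of shape (n, 384): one sentence vector x_i per text
```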
In step S3, the text vectors are trained with an autoencoder. For a converted sentence vector x_i ∈ R^m, an encoder is constructed to encode the original data:

z_i = f_φ(x_i) = σ_e(W_e x_i + b_e) ∈ R^l    (1)

The original data are then reconstructed with a decoder:

x̂_i = g_ψ(z_i) = σ_d(W_d z_i + b_d) ∈ R^m    (2)

The loss function minimizes the reconstruction error:

L_AE = (1/n) Σ_{i=1}^{n} ||x_i − x̂_i||²    (3)

where x_i, x̂_i and z_i are the input data, the reconstructed output and the latent variable respectively, and f_φ and g_ψ denote the transformation functions of the encoder and decoder. σ is the activation function, chosen here as ReLU(x); W_e and b_e are the weights and biases, with subscripts e and d denoting the encoder and decoder respectively.
By repeatedly minimizing the reconstruction error, the autoencoder updates the network weights W_e and biases b_e. After the set number of iterations t is finished, an encoder f_φ(x): X ∈ R^m → Z ∈ R^l is obtained. t depends on the complexity of the network; the invention sets t = 10. Here Z is the latent feature space, m = 384 is the dimension of the input sentence vector mentioned above, and l, the dimension of the hidden layer, is the same as the number k of target clusters. Because k is smaller than the input dimension m, a dimension-reduction encoder f_φ(x) is obtained.
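A hedged sketch of this pre-training step, assuming PyTorch and the [500, 500, 2000, k] layer sizes given later in the experimental setup (full-batch training is used here only for brevity):

```python
# Sketch of the step-S3 autoencoder pre-training (PyTorch assumed).
import torch
import torch.nn as nn

class AE(nn.Module):
    def __init__(self, in_dim=384, k=20):
        super().__init__()
        dims = [in_dim, 500, 500, 2000, k]
        enc, dec = [], []
        for a, b in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(a, b), nn.ReLU()]      # eq. (1): sigma_e(W_e x + b_e)
        for a, b in zip(dims[::-1][:-1], dims[::-1][1:]):
            dec += [nn.Linear(a, b), nn.ReLU()]      # eq. (2): sigma_d(W_d z + b_d)
        self.encoder = nn.Sequential(*enc[:-1])      # no activation on the code layer
        self.decoder = nn.Sequential(*dec[:-1])      # no activation on the output layer

    def forward(self, x):
        z = self.encoder(x)
        return z, self.decoder(z)

def pretrain_ae(X, k=20, epochs=10, lr=1e-3):
    """Minimize the reconstruction loss of eq. (3) for t = epochs iterations."""
    X = torch.as_tensor(X, dtype=torch.float32)
    model = AE(X.shape[1], k)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        z, x_hat = model(X)
        loss = loss_fn(x_hat, X)                     # eq. (3)
        loss.backward()
        opt.step()
    return model
```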
In step S4, K-means is used as the clustering algorithm to cluster the reduced-dimension text representations z_i. The Euclidean distance is adopted as the distance measure of the K-means algorithm, whose goal is to select centroids μ_k that minimize the within-cluster sum of squares:

Σ_{i=0}^{n} min_{μ_k ∈ C} ||z_i − μ_k||²    (4)

The purpose of this clustering step is to find the centroids μ* = {μ*_1, ..., μ*_k} and the text category k assigned to each text. From its category k and the corresponding centroid μ*_k, the expected mean of each text is obtained as μ*_i.
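A short sketch of step S4, assuming scikit-learn is used for K-means; the per-text expected mean is simply the centroid of the cluster each text falls into:

```python
# Sketch of step S4: cluster the low-dimensional codes and keep one expected mean per text.
from sklearn.cluster import KMeans

def cluster_codes(Z, k=20):
    km = KMeans(n_clusters=k, n_init=10).fit(Z)    # eq. (4): minimize within-cluster SSE
    centroids = km.cluster_centers_                # mu* = {mu*_1, ..., mu*_k}
    labels = km.labels_                            # category of each text
    expected_means = centroids[labels]             # mu*_i for every text i
    return labels, centroids, expected_means
```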
In step S5, the variational autoencoder adopts the network configuration of the deep autoencoder, with the number of network layers increased accordingly. The variational autoencoder (VAE) is trained using the expected means μ*. A VAE aims to learn a generative model p(X, Z') that maximizes the marginal likelihood log p(X) of the data set; Z' denotes the representation space of the VAE, to distinguish it from the space of the AE. The marginal likelihood cannot be computed directly because the latent variables are difficult to integrate out. To address this problem, the VAE introduces a variational distribution q_φ(Z'|X), parameterized by a neural network, that approximates the true posterior, and optimizes the evidence lower bound (ELBO) of log p(X):

log p(X) ≥ E_{q_φ(Z'|X)}[log p_θ(X|Z')] − KL(q_φ(Z'|X) || p(Z'))    (5)
where φ denotes the parameters of the inference network and θ those of the decoder; the first term is the reconstruction loss and the second term is the KL divergence between the approximate posterior and the prior. In a VAE the Gaussian distribution N(0, I) is the common choice of prior p(Z'), and the KL divergence between the approximate posterior q_φ(Z'|X) and the prior p(Z') can be calculated as:

KL(q_φ(Z'|X) || p(Z')) = (1/2) Σ_i (μ_i² + σ_i² − log σ_i² − 1)    (6)

where μ_i and σ_i are respectively the mean and standard deviation of the approximate posterior in the i-th dimension of the representation space.
The invention uses the cluster centers μ* from step S4 as the expected means of the VAE feature distribution, letting p(Z') become N(μ*, I). The KL divergence is therefore calculated as:

KL(q_φ(Z'|X) || p(Z')) = (1/2) Σ_i ((μ_i − μ*_i)² + σ_i² − log σ_i² − 1)    (7)
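The modified KL term of equation (7) can be written against the log-variance parameterization commonly used in VAE implementations; the following is a sketch under that assumption:

```python
import torch

def kl_to_cluster_prior(mu, logvar, mu_star):
    """KL( N(mu, sigma^2) || N(mu_star, I) ), summed over latent dimensions as in eq. (7).

    mu, logvar : outputs of the inference network, shape (batch, k)
    mu_star    : per-sample expected mean, i.e. the centroid of the cluster
                 each text was assigned to in step S4, shape (batch, k)
    """
    var = logvar.exp()
    return 0.5 * torch.sum((mu - mu_star) ** 2 + var - logvar - 1.0, dim=1).mean()
```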
the second term of the VAE loss function is different from the ordinary self-encoder, and the improvement according to the data set is mainly directed to the KL divergence part, if the term is not included, the change is basically degraded to the conventional AE, the improvement of the invention is lost, and the phenomenon that the KL divergence disappears appears.
The invention applies a fixed Batch Normalization (BN) to the output μ_i of the inference network. BN is a regularization technique widely used in deep learning; it keeps the neuron outputs in a normal range and is also an effective way of preventing gradient explosion. Unlike other tasks, which apply BN to hidden layers in pursuit of fast and stable training, here BN is used as a tool to convert μ_i into a distribution with fixed mean and variance. Mathematically, the regularized μ_i is

μ̂_i = γ · (μ_i − μ_{B_i}) / σ_{B_i} + β    (8)

where μ_i and μ̂_i denote the approximate posterior mean before and after BN, μ_{B_i} and σ_{B_i} are the mean and standard deviation of μ_i over the batch (biased estimates for each dimension of the sample), and γ and β are the scale and shift parameters; a fixed γ is used here:
γ = sqrt(τ + (1 − τ) · sigmoid(θ))    (9)

where τ ∈ (0, 1) is a constant (the invention takes τ = 0.5) and θ is a trainable parameter. Thus μ̂_i has mean β and variance γ². β is a learnable parameter that makes the distribution more flexible; it is set to 0 in the invention.
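A sketch of such a fixed-scale BN layer over the posterior means, assuming PyTorch; the construction of γ from τ and θ follows the description above and β is fixed to 0 as stated:

```python
import torch
import torch.nn as nn

class FixedScaleBN(nn.Module):
    """BatchNorm over the posterior means mu with a constrained scale gamma.

    gamma^2 = tau + (1 - tau) * sigmoid(theta) keeps the variance of the normalized
    means above tau, which prevents the KL term of eq. (7) from collapsing to zero.
    beta is fixed to 0 as in the description.
    """
    def __init__(self, dim, tau=0.5):
        super().__init__()
        self.bn = nn.BatchNorm1d(dim, affine=False)   # plain normalization, no affine part
        self.theta = nn.Parameter(torch.zeros(dim))   # trainable parameter controlling gamma
        self.tau = tau
        self.beta = 0.0

    def forward(self, mu):
        gamma = torch.sqrt(self.tau + (1.0 - self.tau) * torch.sigmoid(self.theta))
        return gamma * self.bn(mu) + self.beta
```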
The invention refers to this improved variational autoencoder as the adaptive variational autoencoder (SVAE). After SVAE pre-training, the path from the input to the sampling step is taken as the SVAE encoder. With this encoder, the invention performs a non-linear mapping of the data f(x): X ∈ R^d → Z' ∈ R^c, where the dimension c is the same as the hidden-layer dimension of the autoencoder and is likewise set with reference to the number of cluster categories k.
Step S6: the features Z' of the representation space are clustered using K-means to obtain the cluster centers, which are used as the initial weights of the DEC clustering layer.
Step S7: the soft cluster assignment of each data point of Z' in the feature space is computed from the K-means cluster centers. A Student's t-distribution with a single degree of freedom is used as the kernel q_ij to measure the similarity between the embedded point z'_i and the centroid k_j:

q_ij = (1 + ||z'_i − k_j||² / α)^(−(α+1)/2) / Σ_{j'} (1 + ||z'_i − k_{j'}||² / α)^(−(α+1)/2)    (10)

where z'_i = f(x_i) ∈ Z' is the embedding corresponding to x_i ∈ X, and α is the degree of freedom of the Student's t-distribution; α is taken as 1. q_ij can be interpreted as the probability of assigning sample i to cluster j (i.e., a soft assignment).
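A sketch of the soft assignment of equation (10), assuming PyTorch tensors for the embeddings and the centroids:

```python
import torch

def soft_assign(z, centroids, alpha=1.0):
    """Student's-t soft assignment q_ij of embeddings z (n, c) to centroids (k, c)."""
    dist2 = torch.cdist(z, centroids) ** 2                    # ||z'_i - k_j||^2
    q = (1.0 + dist2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)                     # normalize over clusters
```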
Step S8: an auxiliary distribution p_ij is used to improve the purity of the clusters and to emphasize data points assigned with high confidence. The probability p_ij in the auxiliary distribution P is calculated as:

p_ij = (q_ij² / f_j) / Σ_{j'} (q_{ij'}² / f_{j'}),  where f_j = Σ_i q_ij    (11)

Fine-tuning is carried out by matching the soft assignments to the target distribution; for this purpose the objective is defined as the KL divergence between the soft assignment Q and the auxiliary distribution P, as follows:

L = KL(P || Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)    (12)
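The auxiliary target distribution of equation (11) and the KL objective of equation (12) can be sketched as follows (PyTorch assumed, with q as returned by the soft-assignment sketch above):

```python
import torch

def target_distribution(q):
    """Auxiliary distribution p_ij of eq. (11): square q, divide by the soft
    cluster frequencies f_j, then renormalize each row."""
    weight = q ** 2 / q.sum(dim=0)                 # q_ij^2 / f_j
    return weight / weight.sum(dim=1, keepdim=True)

def dec_loss(q, p):
    """KL(P || Q) of eq. (12)."""
    return torch.sum(p * torch.log(p / q))
```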
step S9: joint optimization of cluster center k using Stochastic Gradient Descent (SGD)jWith the encoder parameters θ in SVAE, the gradient at each sample and each cluster center is calculated as follows:
Figure BDA0003544229260000063
Figure BDA0003544229260000064
The K-means result is used to initialize the weights of the clustering layer; the encoder is then updated and class clusters are assigned from high-confidence predictions. Steps S7 and S8 are repeated until the iteration count t1 reaches 2000 or the label change rate δ falls below 0.001. Each sample is then assigned to the centroid for which its soft assignment q_ij is largest, which completes the assignment and clustering of the samples and yields the clustering result. The label change rate δ is calculated as

δ = (1/n) Σ_{i=1}^{n} 1(L_i ≠ L'_i)    (15)

where L_i and L'_i are respectively the current label of the i-th text and its label in the previous iteration, and n is the total number of samples.
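Putting steps S7 to S9 together, a hedged sketch of the fine-tuning loop (PyTorch assumed; `encoder` is the pre-trained SVAE encoder, `centroids` come from step S6, the helper functions are the sketches given above, and the momentum value is an assumption):

```python
import torch

def finetune(encoder, centroids, X, max_iter=2000, tol=1e-3, lr=0.1):
    centroids = torch.nn.Parameter(centroids.clone())
    opt = torch.optim.SGD(list(encoder.parameters()) + [centroids], lr=lr, momentum=0.9)
    prev_labels = None
    for t1 in range(max_iter):
        z = encoder(X)
        q = soft_assign(z, centroids)                # step S7
        p = target_distribution(q).detach()          # step S8, fixed target
        labels = q.argmax(dim=1)                     # hard assignment from soft q_ij
        if prev_labels is not None:
            delta = (labels != prev_labels).float().mean()   # label change rate, eq. (15)
            if delta < tol:
                break
        prev_labels = labels
        loss = dec_loss(q, p)                        # eq. (12)
        opt.zero_grad()
        loss.backward()                              # gradients of eqs. (13)-(14) via autograd
        opt.step()
    return labels
```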
Experimental setup: the hardware environment used for verification is, but is not limited to: an Intel Xeon 4210R CPU at 2.4 GHz, 64 GB of memory, two NVIDIA GeForce RTX 3060 graphics cards, and Windows 10 as the operating system. Text vector representation for the general data sets is realized with the general sentence-transformer library of Sentence-BERT (paraphrase-multilingual-MiniLM-L12-v2). The model has a maximum sequence length of 128, converts texts into 384-dimensional vectors, and is 384 MB in size; it is the multilingual version of the paraphrase-MiniLM-L12-v2 model, trained on parallel data covering more than 50 languages, mainly on data sets such as AllNLI, sentence-compression and SimpleWiki. Adam optimization (Kingma and Ba, 2015) is used during pre-training, with batch_size set to 64 and the number of pre-training epochs set to 15; the encoder of the stacked autoencoder used in pre-training adopts the network structure [500, 500, 2000, 20], the decoder mirrors it, and the same structure is used in the VAE during formal training. For the DEC training stage, an SGD optimizer is adopted with a learning rate of 0.1 and a decay rate of 0.9, the maximum number of iterations is 1500, and the batch size is the same as in pre-training.
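For reference, the stated hyper-parameters can be collected into a single configuration; the field names below are illustrative, only the values come from this description:

```python
# Illustrative configuration; field names are assumptions, values are as stated above.
config = {
    "sbert_model": "paraphrase-multilingual-MiniLM-L12-v2",
    "sentence_dim": 384,
    "ae_layers": [500, 500, 2000, 20],
    "pretrain": {"optimizer": "adam", "batch_size": 64, "epochs": 15},
    "svae": {"tau": 0.5, "beta": 0.0},
    "dec": {"optimizer": "sgd", "lr": 0.1, "decay": 0.9, "max_iter": 1500, "batch_size": 64},
}
```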
The experiments use four English short-text clustering data sets and one Chinese data set:
(1) SearchSnippets: a collection of web search snippets containing 8 categories of different topics. (2) StackOverflow: a collection of web site post titles released as a Kaggle challenge data set; it contains selected question titles from 20 different categories. (3) Biomedical: a subset of the PubMed data set, in which 20000 paper titles are randomly chosen from 20 groups. (4) Tweet: 2472 tweets from 89 categories. (5) ChineseNews: the Chinese public opinion data set used in this text, drawn from an actual project. Six typical events from 2017-2019 were selected, the corresponding microblogs were crawled by event keywords, and 2000 entries were randomly extracted for each event, giving 12000 entries in total. To ensure the reliability of the data set, the data were screened manually; during annotation a seventh event, "Weibo Night", was extracted as its own category, and the texts unrelated to any event were taken as an eighth category, so that a data set with 8 categories in total was obtained.
The invention is compared with the following clustering algorithms:
(1) BoW & TF-IDF: sentences are vectorized by word frequency into 1500-dimensional vectors, and the K-means algorithm is applied for clustering evaluation. (2) Sentence-BERT (SBERT): texts are converted into 384-dimensional vectors by Sentence-BERT, the same vectorization used in this text, and the K-means algorithm is applied for clustering evaluation. (3) VaDE: a combination of a Gaussian mixture model with a variational autoencoder. Its idea is similar to that of the invention, in that an autoencoder is first pre-trained, but the initial prior of the data is obtained through a GMM, and feature embedding and clustering are completed through encoding, decoding and back-propagated parameter updates. (4) STC2: composed of three separate stages; for each data set it first pre-trains word embeddings on a large corpus with the Word2Vec method, then optimizes a convolutional neural network to further enrich the representation, and finally feeds the sentence representations into K-means for clustering. (5) Self-Train: uses SIF-enhanced pre-trained word embeddings following Xu et al. and the deep embedded clustering algorithm following Xie et al.; the autoencoder is obtained by layer-wise pre-training and is then further optimized. (6) SCCL: the model consists of three components; for each data set the data comprise the original data and augmented data, the input is mapped to a representation space through a neural network, and the encoder parameters are optimized with both a contrastive loss and a clustering loss to complete the classification of the texts.
The results of the experiments are shown in Table 1.
TABLE 1: text clustering results (presented as an image in the original publication)
Table 1 presents the results of the algorithms on the 5 data sets. On the Chinese public opinion data set used in the invention, SVAE obtains the best result, leading the strong Self-Train clustering algorithm by 3.1% in ACC, because the VAE improves the distribution of features in the latent space better than the AE.
On the other 4 general data sets used in the invention, SVAE achieves good results on three of the standard data sets. The large improvement on StackOverflow comes on one hand from the text vectorization model, since SBERT already reaches high accuracy on this data set, and on the other hand from SVAE improving the quality of the feature embedding, which brings a further gain on that basis. Before this, StackOverflow was difficult to cluster well because of its large amount of data and large number of categories. The method described here brings a large improvement in accuracy because it makes good use of the cluster centers produced by the clustering algorithm.
SVAE mainly takes the Self-Train algorithm as its reference. Compared with Self-Train, ACC is improved by 4.5% and 22.4% on SearchSnippets and StackOverflow respectively, while the Biomedical score decreases, mainly because the general pre-trained model used here contains little content from that domain; the same conclusion holds for SCCL. SCCL leads by 3.6% in ACC on SearchSnippets because, in addition to the original data set, SCCL trains the model on augmented data and optimizes the encoder with a contrastive loss as well as the clustering loss, using a more complex architecture than SVAE; however, SVAE improves ACC over SCCL by 6.7% and 2% on StackOverflow and Biomedical respectively. These results verify the effectiveness and importance of the proposed framework: by making full use of prior clustering information on top of a general language model, the adaptability of the algorithm to different data sets is improved.

Claims (4)

1. The short text clustering algorithm based on the self-adaptive variational encoder is characterized by comprising the following steps of:
S1, collecting data;
S2, inputting the texts into Sentence-BERT and converting them into sentence vectors;
S3, pre-training the sentence vectors with an autoencoder to obtain a dimension-reduction encoder;
S4, clustering the reduced-dimension data with K-means to obtain a cluster label and a cluster center for each text;
S5, pre-training a variational autoencoder on the text vectors, using the cluster centers as the expected means when training the encoder network parameters;
S6, clustering the feature vectors generated by the pre-trained encoder with K-means to obtain the initial cluster centers;
S7, soft-assigning the vectors using the cluster centers;
S8, using an auxiliary target distribution derived from the current high-confidence assignments to update the pre-trained encoder and redefine the cluster centroids;
S9, repeating S7 and S8, and outputting the clustering result when the convergence criterion or the maximum number of iterations is reached.
2. The adaptive variational encoder based short text clustering algorithm according to claim 1, characterized in that:
in step S2, vector space representation is performed on the texts using Sentence-BERT, without performing any preprocessing operation on the data;
in step S3, the text vectors are trained with an autoencoder; for a converted sentence vector x_i ∈ R^m, an encoder is constructed to encode the original data:

z_i = f_φ(x_i) = σ_e(W_e x_i + b_e) ∈ R^l    (1)

the original data are then reconstructed with a decoder:

x̂_i = g_ψ(z_i) = σ_d(W_d z_i + b_d) ∈ R^m    (2)

the loss function minimizes the reconstruction error:

L_AE = (1/n) Σ_{i=1}^{n} ||x_i − x̂_i||²    (3)

where x_i, x̂_i and z_i are respectively the input data, the reconstructed output and the latent variable, and f_φ and g_ψ respectively denote the transformation functions of the encoder and decoder; σ is the activation function, chosen here as ReLU(x); W_e and b_e are the weights and biases, with subscripts e and d denoting the encoder and decoder respectively;
the autoencoder updates the network weights W, often by minimizing reconstruction errorseAnd deviation beAfter the set iteration number t is finished, an encoder f is obtainedφ(x):X∈Rm→Z∈Rl(ii) a t is set to t of 10, where Z is potentialA feature space, where m is 384 dimensions of the input sentence vector mentioned above, and l is the same dimension of the hidden layer as the clustering target class k of the clustering text, and the obtained dimension reduction encoder f is obtained because the clustering class k is smaller than the input dimension dφ(x);
in step S4, K-means is used as the clustering algorithm to cluster the reduced-dimension text representations z_i; the Euclidean distance is adopted as the distance measure of the K-means algorithm, whose goal is to select centroids μ_k that minimize the within-cluster sum of squares:

Σ_{i=0}^{n} min_{μ_k ∈ C} ||z_i − μ_k||²    (4)

the purpose of this clustering step is to find the centroids μ* = {μ*_1, ..., μ*_k} and the text category k assigned to each text; from its category k and the corresponding centroid μ*_k, the expected mean of each text is obtained as μ*_i;
the preprocessing thus yields the vectorization X of the texts and the cluster centers μ* of the reduced-dimension texts.
3. The adaptive variational encoder based short text clustering algorithm of claim 2, characterized in that:
in step S5, the variational autoencoder (VAE) is trained using the expected means μ*, and a BN layer is added to the VAE to prevent the KL divergence in the VAE loss function from vanishing, together forming the SVAE framework;
the cluster centers μ* from step S4 are used as the expected means of the VAE feature distribution, letting p(Z') become N(μ*, I); the KL divergence is therefore calculated as:

KL(q_φ(Z'|X) || p(Z')) = (1/2) Σ_i ((μ_i − μ*_i)² + σ_i² − log σ_i² − 1)

BN is used as a tool to convert μ_i into a distribution with fixed mean and variance; the mathematically regularized μ_i is

μ̂_i = γ · (μ_i − μ_{B_i}) / σ_{B_i} + β

where μ_i and μ̂_i denote the approximate posterior mean before and after BN; μ_{B_i} and σ_{B_i} are the mean and standard deviation of μ_i over the batch, biased estimates for each dimension of the sample; γ and β are the scale and shift parameters, with
γ = sqrt(τ + (1 − τ) · sigmoid(θ))

where τ ∈ (0, 1) is a constant and θ is a trainable parameter; thus μ̂_i has mean β and variance γ²; β is a learnable parameter.
4. The adaptive variational encoder based short text clustering algorithm according to claim 1, characterized in that:
in step S6, the features Z' of the SVAE representation space are clustered using K-means to obtain the cluster centers, which are used as the initial weights of the DEC clustering layer;
in step S7, the soft cluster assignment of each data point of Z' in the feature space is computed from the cluster centers; a Student's t-distribution with a single degree of freedom is used as the kernel q_ij to measure the similarity between the embedded point z'_i and the centroid k_j:

q_ij = (1 + ||z'_i − k_j||² / α)^(−(α+1)/2) / Σ_{j'} (1 + ||z'_i − k_{j'}||² / α)^(−(α+1)/2)

where z'_i = f(x_i) ∈ Z' is the SVAE embedding corresponding to x_i ∈ X, α is the degree of freedom of the Student's t-distribution and is taken as 1, and q_ij is the probability of assigning sample i to cluster j, i.e., the soft assignment;
in step S8, an auxiliary distribution p_ij is used to improve the purity of the clusters and to emphasize data points assigned with high confidence; the probability p_ij in the auxiliary distribution P is calculated as:

p_ij = (q_ij² / f_j) / Σ_{j'} (q_{ij'}² / f_{j'}),  where f_j = Σ_i q_ij

fine-tuning is carried out by matching the soft assignments to the target distribution; for this purpose the objective is defined as the KL divergence between the soft assignment Q and the auxiliary distribution P, as follows:

L = KL(P || Q) = Σ_i Σ_j p_ij log(p_ij / q_ij)
in step S9, the cluster centers k_j and the encoder parameters θ in the SVAE are optimized jointly using stochastic gradient descent; the gradients with respect to each embedded sample and each cluster center are calculated as follows:

∂L/∂z'_i = ((α + 1)/α) Σ_j (1 + ||z'_i − k_j||² / α)^(−1) (p_ij − q_ij)(z'_i − k_j)

∂L/∂k_j = −((α + 1)/α) Σ_i (1 + ||z'_i − k_j||² / α)^(−1) (p_ij − q_ij)(z'_i − k_j)

the K-means result is used to initialize the weights of the clustering layer, the encoder is then updated and class clusters are assigned from high-confidence predictions, and steps S7 and S8 are repeated; when the iteration count t1 reaches 2000 or the label change rate δ falls below 0.001, each sample is assigned to the centroid for which its soft assignment q_ij is largest, which completes the assignment and clustering of the samples and yields the clustering result; the label change rate δ is calculated as

δ = (1/n) Σ_{i=1}^{n} 1(L_i ≠ L'_i)

where L_i and L'_i are respectively the current label of the i-th text and its label in the previous iteration, and n is the total number of samples.
CN202210299111.1A 2022-03-13 2022-03-13 Short text clustering method based on self-adaptive variational encoder Pending CN114625879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210299111.1A CN114625879A (en) 2022-03-13 2022-03-13 Short text clustering method based on self-adaptive variational encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210299111.1A CN114625879A (en) 2022-03-13 2022-03-13 Short text clustering method based on self-adaptive variational encoder

Publications (1)

Publication Number Publication Date
CN114625879A true CN114625879A (en) 2022-06-14

Family

ID=81903320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210299111.1A Pending CN114625879A (en) 2022-03-13 2022-03-13 Short text clustering method based on self-adaptive variational encoder

Country Status (1)

Country Link
CN (1) CN114625879A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115344678A (en) * 2022-07-11 2022-11-15 北京容联易通信息技术有限公司 Clustering method based on fusion of multiple algorithms
CN116010603A (en) * 2023-01-31 2023-04-25 浙江中电远为科技有限公司 Feature clustering dimension reduction method for commercial text classification


Similar Documents

Publication Publication Date Title
Zhang et al. Watch, attend and parse: An end-to-end neural network based approach to handwritten mathematical expression recognition
CN108319686B (en) Antagonism cross-media retrieval method based on limited text space
CN110929030B (en) Text abstract and emotion classification combined training method
CN109241255B (en) Intention identification method based on deep learning
CN106407333B (en) Spoken language query identification method and device based on artificial intelligence
CN109829299B (en) Unknown attack identification method based on depth self-encoder
CN111160467B (en) Image description method based on conditional random field and internal semantic attention
CN112732916B (en) BERT-based multi-feature fusion fuzzy text classification system
CN111026869B (en) Method for predicting multi-guilty names by using sequence generation network based on multilayer attention
CN107273913B (en) Short text similarity calculation method based on multi-feature fusion
CN111859935B (en) Method for constructing cancer-related biomedical event database based on literature
Gupta et al. Integration of textual cues for fine-grained image captioning using deep CNN and LSTM
CN109858015B (en) Semantic similarity calculation method and device based on CTW (computational cost) and KM (K-value) algorithm
CN114625879A (en) Short text clustering method based on self-adaptive variational encoder
CN114743020A (en) Food identification method combining tag semantic embedding and attention fusion
CN113051399A (en) Small sample fine-grained entity classification method based on relational graph convolutional network
Vu et al. Investigating the learning effect of multilingual bottle-neck features for ASR
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN106021402A (en) Multi-modal multi-class Boosting frame construction method and device for cross-modal retrieval
CN115270752A (en) Template sentence evaluation method based on multilevel comparison learning
CN114239612A (en) Multi-modal neural machine translation method, computer equipment and storage medium
CN113076744A (en) Cultural relic knowledge relation extraction method based on convolutional neural network
CN112560440A (en) Deep learning-based syntax dependence method for aspect-level emotion analysis
CN111914084A (en) Deep learning-based emotion label text generation and evaluation system
US20240104353A1 (en) Sequence-to sequence neural network systems using look ahead tree search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination