CN117808104B - Viewpoint mining method for hot topics based on self-supervised representation learning - Google Patents
- Publication number: CN117808104B (application CN202410226614.5A)
- Authority: CN (China)
- Legal status: Active
Abstract
The invention belongs to the technical field of natural language processing and discloses a viewpoint mining method for hot topics based on self-supervised representation learning, comprising the following steps: acquire a text corpus and perform data preprocessing; represent each text in the corpus with a bag-of-words model; apply data augmentation to the bag-of-words representation of each document to obtain pairs of similar document vector representations; feed the paired similar document vector representations into an encoder network, whose output is a vector representation of the viewpoint distribution of the input document; sample from a Dirichlet distribution to obtain a prior over viewpoint distributions; train the model by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with a prior loss aligning it with the Dirichlet prior distribution. The invention exploits the advantages of self-supervised learning to obtain viewpoint representations of documents, yielding high-quality viewpoints and mining diverse viewpoint representations.
Description
Technical Field
The invention belongs to the field of natural language processing, and in particular relates to a viewpoint mining method for hot topics based on self-supervised representation learning.
Background
The topic model, as a data mining tool, can automatically mine latent topics from large unstructured corpora. These corpora are typically unlabeled and often contain various kinds of noise, such as grammatical mistakes and spelling problems, which pose a series of challenges for topic mining. Researchers have therefore focused on designing models that overcome these problems and achieve high topic coherence and topic diversity across datasets from different domains. One research direction is to effectively preprocess the corpus before model training to eliminate noise, handle spelling problems, and improve text quality, which helps the topic model understand and model the text; the other is to innovate in model architecture and algorithms to better accommodate unstructured, noise-rich corpora.
The goal of topic modeling is to identify latent topics by automatically analyzing word co-occurrence relationships in documents and to assign each document a set of topic weights. The classical probabilistic topic model, Latent Dirichlet Allocation (LDA), assumes that document generation is governed by a topic distribution and per-topic word distributions, and effectively mines the latent topics of a corpus. However, such models require complex mathematical derivations for inference and are not easily extended. With the advent of neural topic models, two main research directions have emerged: models based on VAEs and models based on GANs. The former often impose unsuitable prior constraints on the topic distribution, which leads to poorly interpretable learned topic representations; the latter rely on adversarial training, whose optimization direction is unstable and prone to topic collapse, e.g. insufficient topic diversity, losing key information of the original corpus.
Disclosure of Invention
To solve the problems in existing research, the invention provides a viewpoint mining method for hot topics based on self-supervised representation learning. It mines viewpoints under hot events with a topic model; by using a Dirichlet distribution as the prior constraint, the learned representations can capture the multimodal semantics of texts, and by adopting self-supervised learning combined with loss optimization during training, the diversity of the viewpoint representations is improved.
To achieve the above objectives, the invention is realized through the following technical scheme:
the invention relates to a viewpoint mining method based on self-supervision expression learning for hot topics, which is characterized by comprising the following steps of: the viewpoint mining method comprises the following steps:
Step 1: preprocess the collected social media comment texts and, using the TF-IDF weighting scheme, obtain a bag-of-words representation $x$ of each document;
Step 2: apply data augmentation to the bag-of-words representation $x$ obtained in Step 1 to obtain a pair of similar document vector representations $x^A$ and $x^B$;
Step 3: take the augmented pair of similar document vector representations $x^A, x^B$ obtained in Step 2 as input to the encoder network and obtain its output, a vector representation of the viewpoint distribution of the input document;
Step 4: constrain the model parameters by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with the prior loss aligning it with the Dirichlet prior, iterating until the loss function converges, so as to ensure the stability of the model and the accuracy of viewpoint mining.
In a further refinement of the invention, Step 1 specifically comprises the following steps:
Step 1-1, data preprocessing: collect the content of public comments from a social media platform, parse out the meaningful comment entities in the content, remove texts that do not match the required language, apply lemmatization and spell checking to the words of each text, remove stop words, and discard texts shorter than a set document-length threshold;
Step 1-2, obtain the document representation: for a word $t$ in document $d$, compute the ratio of the number of occurrences of $t$ in $d$ to the total number of words in $d$, i.e. the term frequency $\mathrm{TF}(t,d)$; compute the importance of the word to the whole corpus, i.e. the inverse document frequency $\mathrm{IDF}(t,D) = \log\frac{|D|}{1 + |\{d \in D : t \in d\}|}$, the logarithm of the total number of documents in the corpus divided by one plus the number of documents containing $t$; the TF-IDF weight of a word with respect to the document and corpus is the product $\mathrm{TF}(t,d) \cdot \mathrm{IDF}(t,D)$; for a given document this yields a bag-of-words representation $x$ composed of the TF-IDF weights of all words $t$.
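As an illustration of Step 1-2, the TF-IDF weighting can be sketched in plain Python. This is a minimal sketch, not the patent's implementation; the function name, the sorted vocabulary order, and the use of the smoothed logarithm log(|D| / (1 + df)) are assumptions consistent with the description above.

```python
import math
from collections import Counter

def tfidf_bow(docs):
    """Build TF-IDF bag-of-words vectors for a list of tokenized documents.

    Illustrative sketch: TF(t, d) is the within-document frequency ratio,
    IDF(t, D) = log(|D| / (1 + number of documents containing t)).
    """
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # document frequency of each term
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d)
        vec = []
        for t in vocab:
            tf = counts[t] / total                # term frequency TF(t, d)
            idf = math.log(n_docs / (1 + df[t]))  # inverse document frequency IDF(t, D)
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors
```

Each document is thus mapped to a fixed-length vector over the corpus vocabulary, ready for the augmentation of Step 2.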
The invention further improves that: step 2, representing the bag-of-words model obtained in step 1The data enhancement is specifically: assume that the bag of words model representation/>Is/>Dimension vector, vector dimension/>The size of the word list is equal to that of the corpus, and the probability/> issetAnd get the vector/>The numerically smallest word representation, for the bag of words model representation/>According to random probabilistic selection of three data enhancement modes:
a) With probability Reducing the/>, of the word t value%;
B) With probabilityIncreasing the/>, of the word t value%;
C) With probabilityThe t-value of the word is set to zero.
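The three augmentation modes can be sketched as follows. This is a hedged illustration: the helper name `augment_bow`, the restriction to non-zero entries, and the way a single random draw selects among the three modes are assumptions consistent with, but not dictated by, the description above.

```python
import random

def augment_bow(x, p, q, L, rng=None):
    """Create one augmented copy of a TF-IDF bag-of-words vector x.

    Only the L entries with the smallest non-zero weights are perturbed;
    for each, one of the three modes (shrink by q%, grow by q%, zero out)
    fires with probability p each.
    """
    rng = rng or random.Random(0)
    x = list(x)
    # indices of the L numerically smallest non-zero entries
    nonzero = [i for i, v in enumerate(x) if v > 0]
    targets = sorted(nonzero, key=lambda i: x[i])[:L]
    for i in targets:
        r = rng.random()
        if r < p:                # mode A: reduce the value by q%
            x[i] *= 1 - q / 100
        elif r < 2 * p:          # mode B: increase the value by q%
            x[i] *= 1 + q / 100
        elif r < 3 * p:          # mode C: set the value to zero
            x[i] = 0.0
    return x
```

Calling the function twice on the same vector produces the paired similar document representations used by the encoder.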
In a further refinement of the invention, the encoder network in Step 3 uses the fully connected layer transformations below; taking the augmented pair of similar document vector representations $x^A, x^B$ as input, it infers the viewpoint representation of the text. The specific implementation steps comprise:
Step 3.1: randomly sample from the corpus obtained in Step 2 a $V$-dimensional pair of similar document vector representations $x^A, x^B$ and feed it to the encoder network, which maps it into an $H$-dimensional latent semantic space via the following two linear transformations:

$h_1 = \mathrm{SN}(W_1)\,x + b_1, \quad e_1 = f(h_1)$
$h_2 = \mathrm{SN}(W_2)\,e_1 + b_2, \quad e_2 = f(h_2)$

where $W_1$ is the weight matrix of the first layer, $W_2$ is the weight matrix of the second layer, $b_1$ and $b_2$ are bias terms, $h_1$ and $h_2$ are the hidden states of the layers, $e_1$ and $e_2$ are the layer representations after activation, $\mathrm{SN}$ is spectral normalization, and $f$ is the activation function;
Step 3.2: map the vector $e_2$ from Step 3.1 through a fully connected layer into a $K$-dimensional document viewpoint distribution:

$h_\theta = W_\theta\, e_2 + b_\theta, \quad \theta = \mathrm{softmax}(h_\theta)$

where $W_\theta$ and $b_\theta$ are the weight matrix and bias term of this layer, $h_\theta$ is the hidden state of the document viewpoint distribution layer, and $\theta$ is the $K$-dimensional document viewpoint distribution corresponding to the pair of similar document vector representations; the multinomial distribution $\theta$ is such that $\theta_k$ gives the proportion of the $k$-th viewpoint in the similar document vector representation.
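A minimal numpy sketch of the encoder forward pass in Steps 3.1-3.2 follows, assuming spectral normalization has already been applied to the weight matrices in `params` and that softplus stands in for the unspecified activation $f$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_forward(x, params):
    """Two linear layers plus a viewpoint head (Steps 3.1-3.2, sketch).

    x: (M, V) batch of bag-of-words vectors; returns (M, K) viewpoint
    distributions, each row a point on the K-simplex.
    """
    W1, b1, W2, b2, Wt, bt = params
    h1 = x @ W1 + b1              # first linear layer
    e1 = np.logaddexp(0.0, h1)    # softplus activation
    h2 = e1 @ W2 + b2             # second linear layer
    e2 = np.logaddexp(0.0, h2)
    ht = e2 @ Wt + bt             # viewpoint head
    theta = softmax(ht)           # K-dim document viewpoint distribution
    return theta
```

The softmax guarantees each output row is a valid multinomial viewpoint distribution.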
In a further refinement of the invention, minimizing the invariance, variance, and covariance regularization losses of the encoder network output and the prior loss aligning it with the Dirichlet prior in Step 4 specifically comprises the following steps:
Step 4.1: the augmented pair of similar document vector representations should remain similar after mapping; this is enforced by computing an invariance regularization loss. Given the inferred viewpoint distributions $\theta^A$ and $\theta^B$, the loss $\mathcal{L}_{inv}$ is computed as:

$\mathcal{L}_{inv} = \frac{1}{M}\sum_{i=1}^{M}\left\|\theta_i^A - \theta_i^B\right\|_2^2$

where $M$ is the batch size, $i$ is the traversal index of the summation over the pairs of similar document vector representations, and $\theta_i^A$ and $\theta_i^B$ are the paired viewpoint distributions of the similar document vector representations $x_i^A, x_i^B$;
Step 4.2: to prevent the viewpoint mapping from becoming constant, a variance loss function is used to address model collapse; it is computed as:

$\mathcal{L}_{var} = v(\Theta^A) + v(\Theta^B), \quad v(\Theta) = \frac{1}{K}\sum_{k=1}^{K}\max\left(0,\; \gamma - \sqrt{\mathrm{Var}(\theta_{\cdot,k}) + \varepsilon}\right)$

where $\theta_{\cdot,k}$ is the vector composed of the $k$-th dimension of every viewpoint distribution in the set $\Theta$, $\varepsilon$ is a tiny scalar for numerical stability, $K$ is the number of document viewpoints, and $\gamma$ is the target value of the standard deviation;
Step 4.3: a covariance loss is used as a further constraint; it is computed as:

$\mathcal{L}_{cov} = c(\Theta^A) + c(\Theta^B), \quad c(\Theta) = \frac{1}{K}\sum_{j \neq j'}\left[C(\Theta)\right]_{j,j'}^2, \quad C(\Theta) = \frac{1}{M-1}\sum_{i=1}^{M}(\theta_i - \bar{\theta})(\theta_i - \bar{\theta})^T$

where $\bar{\theta} = \frac{1}{M}\sum_{i=1}^{M}\theta_i$, $j$ denotes the $j$-th column of the matrix, and $T$ denotes matrix transposition;
Step 4.4: combining the formulas of Steps 4.1, 4.2 and 4.3 yields the three regularization losses:

$\mathcal{L}_{reg} = \lambda\,\mathcal{L}_{inv} + \mu\,\mathcal{L}_{var} + \eta\,\mathcal{L}_{cov}$

where $\lambda$, $\mu$, $\eta$ are distinct hyperparameters;
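The three regularization losses of Steps 4.1-4.4 follow a VICReg-style pattern and can be sketched in numpy as below. The value of the variance target `gamma` and the exact normalization of the covariance term are assumptions.

```python
import numpy as np

def regularization_losses(theta_a, theta_b, lam, mu, eta, gamma=1.0, eps=1e-4):
    """Invariance, variance and covariance regularization (sketch).

    theta_a, theta_b: (M, K) paired viewpoint distributions from the
    two augmented views; lam, mu, eta are the weighting hyperparameters.
    """
    M, K = theta_a.shape
    # invariance: mean squared distance between paired distributions
    l_inv = np.mean(np.sum((theta_a - theta_b) ** 2, axis=1))

    def variance_term(theta):
        s = np.sqrt(theta.var(axis=0) + eps)        # per-dimension std
        return np.mean(np.maximum(0.0, gamma - s))  # hinge on the std

    # variance: keep each viewpoint dimension from collapsing
    l_var = variance_term(theta_a) + variance_term(theta_b)

    def covariance_term(theta):
        c = np.cov(theta, rowvar=False)             # K x K covariance matrix
        off_diag = c - np.diag(np.diag(c))
        return np.sum(off_diag ** 2) / K            # squared off-diagonal entries

    # covariance: decorrelate different viewpoint dimensions
    l_cov = covariance_term(theta_a) + covariance_term(theta_b)

    # weighted combination with hyperparameters lambda, mu, eta
    return lam * l_inv + mu * l_var + eta * l_cov
```

In a real training loop these operations would be expressed in an autodiff framework so gradients can flow back to the encoder.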
Step 4.5: constrain the output distribution of the encoder network by computing the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution.
In a further refinement of the invention, constraining the output distribution of the encoder network in Step 4.5 by computing the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution comprises the following steps:
Step 4.5.1: given the set of inferred viewpoint distributions $\Theta = \{\theta_1, \ldots, \theta_M\}$, randomly sample from a Dirichlet distribution with parameter $\alpha$ to obtain $M$ prior samples $\tilde{\Theta} = \{\tilde{\theta}_1, \ldots, \tilde{\theta}_M\}$, using the following density:

$p(\tilde{\theta} \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K}\alpha_i\right)}{\prod_{i=1}^{K}\Gamma(\alpha_i)} \prod_{i=1}^{K} \tilde{\theta}_i^{\alpha_i - 1}$

where $K$ is set to the number of viewpoints used in model training, $\tilde{\theta}$ is a prior sample drawn from the Dirichlet distribution with parameter $\alpha$, and $\alpha_i$ is the $i$-th value of the parameter vector $\alpha$;
Step 4.5.2: having obtained the set of viewpoint distributions $\Theta$ and the prior samples $\tilde{\Theta}$, compute the maximum mean discrepancy between the two distributions using the following formula:

$\mathrm{MMD}^2(\Theta, \tilde{\Theta}) = \frac{1}{M(M-1)}\sum_{i \neq j} k(\theta_i, \theta_j) + \frac{1}{M(M-1)}\sum_{i \neq j} k(\tilde{\theta}_i, \tilde{\theta}_j) - \frac{2}{M^2}\sum_{i,j} k(\theta_i, \tilde{\theta}_j)$

where $k(\cdot,\cdot)$ is a kernel function, $\theta_i$ and $\theta_j$ are viewpoint distributions produced by the encoder, and $\tilde{\theta}_i$ and $\tilde{\theta}_j$ are viewpoint distributions sampled from the prior distribution;
From the above formulas, the total loss for model training is:

$\mathcal{L} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{MMD}$

where $\beta$ is a hyperparameter used in model training and $\mathcal{L}_{MMD}$ is the prior matching loss computed with the maximum mean discrepancy distance.
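Steps 4.5.1-4.5.2 can be sketched as follows; a Gaussian RBF kernel is an assumption, since the description only requires "a kernel function":

```python
import numpy as np

def mmd_loss(theta, theta_prior, bandwidth=1.0):
    """MMD^2 between inferred and Dirichlet-prior samples (sketch).

    theta, theta_prior: (M, K) arrays of viewpoint distributions; the
    within-sample terms exclude the diagonal as in the formula above.
    """
    def kernel(a, b):
        # pairwise squared distances, then RBF kernel
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))

    M = theta.shape[0]
    k_xx = kernel(theta, theta)
    k_yy = kernel(theta_prior, theta_prior)
    k_xy = kernel(theta, theta_prior)
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (M * (M - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (M * (M - 1))
    term_xy = 2 * k_xy.mean()
    return term_xx + term_yy - term_xy

# sampling the Dirichlet prior (Step 4.5.1); alpha and sizes are illustrative
rng = np.random.default_rng(0)
prior = rng.dirichlet(alpha=[0.1] * 5, size=8)  # 8 samples of a 5-viewpoint prior
```

A sparse $\alpha$ (here 0.1) concentrates prior mass near the simplex corners, encouraging each document to commit to few viewpoints.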
In a further refinement of the invention, constraining the model parameters in Step 4 and iterating until the loss function converges, i.e. model training, specifically comprises the following steps:
Step 4.2.1: construct the encoder network, choose a suitable optimizer, and specify the model hyperparameters $\lambda, \mu, \eta, \beta$, each with a value greater than zero;
Step 4.2.2: randomly sample inputs to the encoder network from the corpus, sample the prior distribution from the Dirichlet distribution with parameter $\alpha$, and obtain the viewpoint distribution of the text from the encoder network output;
Step 4.2.3: perform stochastic gradient descent on the three regularization losses and the Dirichlet prior matching loss, updating the parameters of the encoder network;
Step 4.2.4: repeat Steps 4.2.2 and 4.2.3 until the model converges.
The beneficial effects of the invention are as follows: by modeling the viewpoints of texts, formalizing viewpoints through topic words, and fully exploiting the advantages of self-supervised learning, the invention obtains efficient viewpoint representations.
By optimizing the three regularization losses and aligning the model output with the Dirichlet prior, the invention obtains high-quality viewpoints and mines diverse viewpoint representations.
The invention designs specific loss functions that yield high-quality and diverse viewpoints from different optimization angles, improve the quality of the learned viewpoint representations, and address common problems in self-supervised learning such as model collapse.
The invention provides an innovative and effective solution for viewpoint analysis under hot topics.
Drawings
FIG. 1 is a specific flow chart of an embodiment of the present invention.
Fig. 2 is a structural diagram of the neural network model of the present invention.
Detailed Description
The application will be further illustrated with reference to the drawings and the detailed description below. It should be understood that the following specific examples are intended to illustrate the application rather than to limit its scope; after reading the application, various equivalent modifications by those skilled in the art fall within the scope defined by the appended claims.
As shown in Figs. 1-2, the invention is a viewpoint mining method for hot topics based on self-supervised representation learning, comprising the following steps:
Step 1: preprocess the collected social media comment texts and, using the TF-IDF weighting scheme, obtain a bag-of-words representation $x$ of each document, specifically comprising:
Step 1-1, data preprocessing: collect the content of public comments from a social media platform, parse out the meaningful comment entities in the content, remove texts that do not match the required language, apply lemmatization and spell checking to the words of each text, remove stop words, and discard texts shorter than a set document-length threshold;
Step 1-2, obtain the document representation: for a word $t$ in document $d$, compute the ratio of the number of occurrences of $t$ in $d$ to the total number of words in $d$, i.e. the term frequency $\mathrm{TF}(t,d)$; compute the importance of the word to the whole corpus, i.e. the inverse document frequency $\mathrm{IDF}(t,D) = \log\frac{|D|}{1 + |\{d \in D : t \in d\}|}$, the logarithm of the total number of documents in the corpus divided by one plus the number of documents containing $t$; the TF-IDF weight of a word with respect to the document and corpus is the product $\mathrm{TF}(t,d) \cdot \mathrm{IDF}(t,D)$; for a given document this yields a bag-of-words representation $x$ composed of the TF-IDF weights of all words $t$.
Step 2: apply data augmentation to the bag-of-words representation $x$ obtained in Step 1 to obtain a pair of similar document vector representations $x^A$ and $x^B$. The data augmentation is specifically: assume the bag-of-words representation $x$ is a $V$-dimensional vector, where the vector dimension $V$ equals the vocabulary size of the corpus; set a probability $p$ and take the $L$ words with the numerically smallest values in the vector; for each such word $t$ of the bag-of-words representation $x$, select among three data augmentation modes at random:
A) with probability $p$, decrease the value of word $t$ by $q\%$;
B) with probability $p$, increase the value of word $t$ by $q\%$;
C) with probability $p$, set the value of word $t$ to zero.
Step 3: take the augmented pair of similar document vector representations $x^A, x^B$ obtained in Step 2 as input to the encoder network and obtain its output, a vector representation of the viewpoint distribution of the input document. The specific implementation steps comprise:
Step 3.1: randomly sample from the corpus obtained in Step 2 a $V$-dimensional pair of similar document vector representations $x^A, x^B$ and feed it to the encoder network, which maps it into an $H$-dimensional latent semantic space via the following two linear transformations:

$h_1 = \mathrm{SN}(W_1)\,x + b_1, \quad e_1 = f(h_1)$
$h_2 = \mathrm{SN}(W_2)\,e_1 + b_2, \quad e_2 = f(h_2)$

where $W_1$ is the weight matrix of the first layer, $W_2$ is the weight matrix of the second layer, $b_1$ and $b_2$ are bias terms, $h_1$ and $h_2$ are the hidden states of the layers, $e_1$ and $e_2$ are the layer representations after activation, $\mathrm{SN}$ is spectral normalization, and $f$ is the activation function;
Step 3.2: map the vector $e_2$ from Step 3.1 through a fully connected layer into a $K$-dimensional document viewpoint distribution:

$h_\theta = W_\theta\, e_2 + b_\theta, \quad \theta = \mathrm{softmax}(h_\theta)$

where $W_\theta$ and $b_\theta$ are the weight matrix and bias term of this layer, $h_\theta$ is the hidden state of the document viewpoint distribution layer, and $\theta$ is the $K$-dimensional document viewpoint distribution corresponding to the pair of similar document vector representations; the multinomial distribution $\theta$ is such that $\theta_k$ gives the proportion of the $k$-th viewpoint in the similar document vector representation.
Step 4: constrain the model parameters by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with the prior loss aligning it with the Dirichlet prior, iterating until the loss function converges, so as to ensure the stability of the model and the accuracy of viewpoint mining.
Specifically, this minimization comprises the following steps:
Step 4.1: the augmented pair of similar document vector representations should remain similar after mapping; this is enforced by computing an invariance regularization loss. Given the inferred viewpoint distributions $\theta^A$ and $\theta^B$, the loss $\mathcal{L}_{inv}$ is computed as:

$\mathcal{L}_{inv} = \frac{1}{M}\sum_{i=1}^{M}\left\|\theta_i^A - \theta_i^B\right\|_2^2$

where $M$ is the batch size, $i$ is the traversal index of the summation over the pairs of similar document vector representations, and $\theta_i^A$ and $\theta_i^B$ are the paired viewpoint distributions of the similar document vector representations $x_i^A, x_i^B$;
Step 4.2: to prevent the viewpoint mapping from becoming constant, a variance loss function is used to address model collapse; it is computed as:

$\mathcal{L}_{var} = v(\Theta^A) + v(\Theta^B), \quad v(\Theta) = \frac{1}{K}\sum_{k=1}^{K}\max\left(0,\; \gamma - \sqrt{\mathrm{Var}(\theta_{\cdot,k}) + \varepsilon}\right)$

where $\theta_{\cdot,k}$ is the vector composed of the $k$-th dimension of every viewpoint distribution in the set $\Theta$, $\varepsilon$ is a tiny scalar for numerical stability, $K$ is the number of document viewpoints, and $\gamma$ is the target value of the standard deviation;
Step 4.3: a covariance loss is used as a further constraint; it is computed as:

$\mathcal{L}_{cov} = c(\Theta^A) + c(\Theta^B), \quad c(\Theta) = \frac{1}{K}\sum_{j \neq j'}\left[C(\Theta)\right]_{j,j'}^2, \quad C(\Theta) = \frac{1}{M-1}\sum_{i=1}^{M}(\theta_i - \bar{\theta})(\theta_i - \bar{\theta})^T$

where $\bar{\theta} = \frac{1}{M}\sum_{i=1}^{M}\theta_i$, $j$ denotes the $j$-th column of the matrix, and $T$ denotes matrix transposition;
Step 4.4: combining the formulas of Steps 4.1, 4.2 and 4.3 yields the three regularization losses:

$\mathcal{L}_{reg} = \lambda\,\mathcal{L}_{inv} + \mu\,\mathcal{L}_{var} + \eta\,\mathcal{L}_{cov}$

where $\lambda$, $\mu$, $\eta$ are distinct hyperparameters;
Step 4.5: constrain the output distribution of the encoder network by computing the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution, specifically:
Given the set of inferred viewpoint distributions $\Theta = \{\theta_1, \ldots, \theta_M\}$, randomly sample from a Dirichlet distribution with parameter $\alpha$ to obtain $M$ prior samples $\tilde{\Theta} = \{\tilde{\theta}_1, \ldots, \tilde{\theta}_M\}$, using the following density:

$p(\tilde{\theta} \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K}\alpha_i\right)}{\prod_{i=1}^{K}\Gamma(\alpha_i)} \prod_{i=1}^{K} \tilde{\theta}_i^{\alpha_i - 1}$

where $K$ is set to the number of viewpoints used in model training, $\tilde{\theta}$ is a prior sample drawn from the Dirichlet distribution with parameter $\alpha$, and $\alpha_i$ is the $i$-th value of the parameter vector $\alpha$.
Having obtained the set of viewpoint distributions $\Theta$ and the prior samples $\tilde{\Theta}$, compute the maximum mean discrepancy between the two distributions using the following formula:

$\mathrm{MMD}^2(\Theta, \tilde{\Theta}) = \frac{1}{M(M-1)}\sum_{i \neq j} k(\theta_i, \theta_j) + \frac{1}{M(M-1)}\sum_{i \neq j} k(\tilde{\theta}_i, \tilde{\theta}_j) - \frac{2}{M^2}\sum_{i,j} k(\theta_i, \tilde{\theta}_j)$

where $k(\cdot,\cdot)$ is a kernel function, $\theta_i$ and $\theta_j$ are viewpoint distributions produced by the encoder, and $\tilde{\theta}_i$ and $\tilde{\theta}_j$ are viewpoint distributions sampled from the prior distribution.
From the above formulas, the total loss for model training is:

$\mathcal{L} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{MMD}$

where $\beta$ is a hyperparameter used in model training and $\mathcal{L}_{MMD}$ is the prior matching loss computed with the maximum mean discrepancy distance.
Constraining the model parameters in Step 4 and iterating until the loss function converges, i.e. model training, specifically comprises the following steps:
Step 4.2.1: construct the encoder network, choose a suitable optimizer, and specify the model hyperparameters $\lambda, \mu, \eta, \beta$, each with a value greater than zero;
Step 4.2.2: randomly sample inputs to the encoder network from the corpus, sample the prior distribution from the Dirichlet distribution with parameter $\alpha$, and obtain the viewpoint distribution of the text from the encoder network output;
Step 4.2.3: perform stochastic gradient descent on the three regularization losses and the Dirichlet prior matching loss, updating the parameters of the encoder network;
Step 4.2.4: repeat Steps 4.2.2 and 4.2.3 until the model converges.
To verify the benefits of the invention, experiments were performed on comment texts about a recent hot topic, the large language model ChatGPT. Since the invention is based on topic model techniques, the quality of the mined viewpoint representations is evaluated with topic coherence and diversity metrics. The average topic coherence values on the User Query dataset are: CP 0.2983, NPMI 0.0581, UCI 0.4084, and UT 0.9870, where higher values indicate better performance. All of these exceed the comparative experiments, whose best values are CP 0.2834, NPMI 0.0397, UCI 0.3096, and UT 0.960. The comparative models used in this experiment are Comparative Examples 1, 2 and 3, where:
Comparative Example 1: the LDA method in Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993–1022;
Comparative Example 2: the ETM method in Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguistics, 8, 439–453;
Comparative Example 3: the CNTM method in Nguyen, T., & Luu, A. T. (2021). Contrastive learning for neural topic model. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pp. 11974–11986.
In the comparison of the experimental results of the invention, the UT (topic uniqueness) metric is computed as $\mathrm{TU} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\sum_{n=1}^{N}\frac{1}{\mathrm{cnt}(n,k)}$, where $\mathrm{cnt}(n,k)$ is the number of topics in which the $n$-th top word of topic $k$ appears, so that topic words occurring in only one topic contribute fully to the score. The CP, NPMI, and UCI metrics are public evaluation metrics for topic quality, i.e. the semantic coherence of topics, widely used in topic modeling experiments; all were proposed in the paper of Röder et al. (Röder, M., Both, A., & Hinneburg, A. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 2015: 399-408.)
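The UT (topic uniqueness) computation described above can be sketched as follows; this is a hedged reconstruction, and the exact variant used in the experiments may differ:

```python
from collections import Counter

def topic_uniqueness(topics):
    """Topic uniqueness (TU) score, sketch.

    topics: list of K topics, each a list of top words. cnt(n, k) is the
    number of topics containing the nth top word of topic k; TU averages
    the inverse counts, so 1.0 means no top word is shared between topics.
    """
    counts = Counter(w for topic in topics for w in set(topic))
    K = len(topics)
    total = 0.0
    for topic in topics:
        total += sum(1.0 / counts[w] for w in topic) / len(topic)
    return total / K
```

Disjoint topic word lists score 1.0; heavy word sharing drives the score toward 1/K.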
The foregoing is only a preferred embodiment of the invention. It should be noted that various improvements and adaptations apparent to those skilled in the art may be made without departing from the principles of the present invention, and such improvements and adaptations are intended to fall within the scope of the invention.
Claims (6)
1. A viewpoint mining method for hot topics based on self-supervised representation learning, characterized in that the viewpoint mining method comprises the following steps:
Step 1: preprocess the collected social media comment texts and, using the TF-IDF weighting scheme, obtain a bag-of-words representation $x$ of each document;
Step 2: apply data augmentation to the bag-of-words representation $x$ obtained in Step 1 to obtain a pair of similar document vector representations $x^A$ and $x^B$;
Step 3: take the augmented pair of similar document vector representations $x^A, x^B$ obtained in Step 2 as input to the encoder network and obtain its output, a vector representation of the viewpoint distribution of the input document;
Step 4: constrain the model parameters by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with the prior loss aligning it with the Dirichlet prior, iterating until the loss function converges to ensure the stability of the model and the accuracy of viewpoint mining, specifically comprising the following steps:
Step 4.1: the augmented pair of similar document vector representations should remain similar after mapping; this is enforced by computing an invariance regularization loss. Given the inferred viewpoint distributions $\theta^A$ and $\theta^B$, the loss $\mathcal{L}_{inv}$ is computed as:

$\mathcal{L}_{inv} = \frac{1}{M}\sum_{i=1}^{M}\left\|\theta_i^A - \theta_i^B\right\|_2^2$

where $M$ represents the batch size, $i$ is the traversal index of the summation over the pairs of similar document vector representations, and $\theta_i^A$ and $\theta_i^B$ are the paired viewpoint distributions of the similar document vector representations $x_i^A, x_i^B$;
Step 4.2: to prevent the viewpoint mapping from becoming constant, a variance loss function is used to address model collapse; it is computed as:

$\mathcal{L}_{var} = v(\Theta^A) + v(\Theta^B), \quad v(\Theta) = \frac{1}{K}\sum_{k=1}^{K}\max\left(0,\; \gamma - \sqrt{\mathrm{Var}(\theta_{\cdot,k}) + \varepsilon}\right)$

where $\theta_{\cdot,k}$ is the vector composed of the $k$-th dimension of every viewpoint distribution in the set $\Theta^A$ (respectively $\Theta^B$), $\varepsilon$ is a tiny scalar for numerical stability, $K$ represents the number of document viewpoints, and $\gamma$ is the target value of the standard deviation;
Step 4.3: a covariance loss is used as a further constraint; it is computed as:

$\mathcal{L}_{cov} = c(\Theta^A) + c(\Theta^B), \quad c(\Theta) = \frac{1}{K}\sum_{j \neq j'}\left[C(\Theta)\right]_{j,j'}^2, \quad C(\Theta) = \frac{1}{M-1}\sum_{i=1}^{M}(\theta_i - \bar{\theta})(\theta_i - \bar{\theta})^T$

where $j$ represents the $j$-th column of the matrix and $T$ represents matrix transposition;
Step 4.4: combining the formulas of Steps 4.1, 4.2 and 4.3 yields the three regularization losses:

$\mathcal{L}_{reg} = \lambda\,\mathcal{L}_{inv} + \mu\,\mathcal{L}_{var} + \eta\,\mathcal{L}_{cov}$

where $\lambda$, $\mu$, $\eta$ are different hyperparameters;
Step 4.5: constrain the output distribution of the encoder network by computing the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution.
2. The hot topic-oriented self-supervised representation learning-based viewpoint mining method as claimed in claim 1, wherein the method comprises the following steps: the step 1 specifically comprises the following steps:
Step 1-1, data preprocessing: collecting the content of public comments from a social media platform; parsing the collected content for meaningful comment entities; removing texts that do not meet the language-category requirements; performing lemmatization and spell checking on the words of each text; removing stop words from the texts; and screening out and removing texts shorter than a set document-length threshold;
Step 1-2, obtaining a document representation: for a word t in a document d, the term frequency TF(t, d) is calculated as the ratio of the number of occurrences of t in d to the total number of words in d; the inverse document frequency IDF(t, D), which measures the importance of the word for the whole corpus D, is calculated as the logarithm of the total number of documents in the corpus divided by one plus the number of documents containing t; the TF-IDF weight of the word, for the document and the corpus, is the product of the term frequency TF(t, d) and the inverse document frequency IDF(t, D); for a given document, the bag-of-words model representation x composed of the TF-IDF weights of all its words t is obtained.
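The TF-IDF computation of step 1-2 can be sketched as follows, assuming the smoothed IDF described above (logarithm of the document count over one plus the document frequency); the function name and the input format (documents as token lists) are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_bow(docs):
    """Bag-of-words TF-IDF representations as in step 1-2.

    TF(t, d)  = count of t in d / total words in d
    IDF(t, D) = log(N / (1 + number of documents containing t))
    Returns the sorted vocabulary and one TF-IDF weight vector per document.
    """
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per word
    idf = {t: math.log(n / (1 + df[t])) for t in vocab}
    reps = []
    for d in docs:
        tf = Counter(d)
        total = len(d)
        reps.append([tf[t] / total * idf[t] for t in vocab])
    return vocab, reps
```

For example, a word that appears in half the documents of a two-document corpus gets IDF = log(2/2) = 0, so its TF-IDF weight vanishes.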
3. The viewpoint mining method based on self-supervised representation learning and oriented to hot topics as claimed in claim 1, wherein the data enhancement in step 2 of the bag-of-words model representation x obtained in step 1 is specifically: assuming the bag-of-words model representation x is a V-dimensional vector whose dimension V equals the vocabulary size of the corpus, a probability p is set, the L smallest word weights in the vector are selected, and for the bag-of-words model representation one of the following three data enhancement modes is chosen at random:
a) reducing the value of the word t by q with probability p;
b) increasing the value of the word t by q with probability p;
c) setting the value of the word t to zero with probability p.
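The three enhancement modes above can be sketched as follows; the parameter names p, q and num_low (the L smallest nonzero weights), and the choice to pick a mode uniformly at random, are illustrative assumptions.

```python
import numpy as np

def augment_bow(x, p=0.1, q=0.5, num_low=20, rng=None):
    """Randomly perturb a TF-IDF bag-of-words vector (step 2 sketch).

    For each of the num_low smallest nonzero weights, with probability p one
    of three modes is applied: a) scale down by q, b) scale up by q,
    c) zero out.  All parameter defaults are illustrative.
    """
    rng = rng or np.random.default_rng()
    x = x.copy()                       # leave the input vector untouched
    nonzero = np.flatnonzero(x)
    low = nonzero[np.argsort(x[nonzero])[:num_low]]  # L smallest weights
    for t in low:
        if rng.random() >= p:
            continue
        mode = rng.integers(3)
        if mode == 0:
            x[t] *= (1 - q)            # a) reduce the weight
        elif mode == 1:
            x[t] *= (1 + q)            # b) increase the weight
        else:
            x[t] = 0.0                 # c) zero out the weight
    return x
```

Applying the function twice to the same document yields two perturbed copies, i.e. the paired similar document vector representations used by the encoder.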
4. The viewpoint mining method based on self-supervised representation learning and oriented to hot topics as claimed in claim 1, wherein the encoder network in step 3 uses the following fully connected layer transformations, takes the enhanced paired similar document vector representations $(x', x'')$ as input, and infers the viewpoint representation of the text; the specific implementation steps comprise:
Step 3.1, random sampling is performed on the corpus obtained in step 2 to obtain V-dimensional paired similar document vector representations $(x', x'')$, which are input to the encoder network and mapped into the S-dimensional latent semantic space through the following two layers of linear transformations:

$$u_1 = \text{SN}(W_1)\,x + b_1,\quad h_1 = \text{HARDSWISH}(u_1),\qquad u_2 = \text{SN}(W_2)\,h_1 + b_2,\quad h_2 = \text{HARDSWISH}(u_2)$$

where $W_1$ denotes the weight matrix of the first layer, $W_2$ denotes the weight matrix of the second layer, $b_1$ and $b_2$ are bias terms, $u_1$ and $u_2$ denote the hidden states of the two layers, $h_1$ and $h_2$ denote the representation vectors after activation, SN(·) is spectral normalization, and HARDSWISH(·) is the activation function;
Step 3.2, the representation vector $h_2$ from step 3.1 is mapped through a fully connected layer into the K-dimensional document viewpoint distribution:

$$u_3 = W_3\,h_2 + b_3,\qquad \theta = \text{softmax}(u_3)$$

where $W_3$ and $b_3$ are the weight matrix and bias term of this layer, $u_3$ is the hidden state of the document viewpoint distribution layer, $\theta$ denotes the K-dimensional document viewpoint distribution corresponding to the paired similar document vector representation, and the $k$-th component, $k \in \{1,2,\dots,K\}$, of the multinomial distribution $\theta$ represents the proportion of the $k$-th viewpoint in the similar document vector representation.
5. The viewpoint mining method based on self-supervised representation learning and oriented to hot topics as claimed in claim 1, wherein the constraining of the output distribution of the encoder network in step 4.5, by calculating the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution, specifically comprises the following steps:
Step 4.5.1, given the inferred set of viewpoint distributions $\Theta$, a prior distribution $\Theta'$ of $\Theta$ is obtained by random sampling from the Dirichlet distribution with parameter vector $\vec{\alpha}$; the formula specifically used is as follows:

$$\theta' \sim \text{Dir}(\vec{\alpha}),\qquad \vec{\alpha} = (\alpha_1, \alpha_2, \dots, \alpha_K)$$

where $K$ is set to the number of viewpoints used for model training, $\theta'$ is a prior sample drawn from the Dirichlet distribution with parameter vector $\vec{\alpha}$, and $\alpha_i$ is the $i$-th value of the parameter vector $\vec{\alpha}$;
Step 4.5.2, after obtaining the set of viewpoint distributions $\Theta$ and the prior distribution $\Theta'$, the maximum mean discrepancy between the two distributions is calculated using the following formula:

$$\text{MMD}(\Theta, \Theta') = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(\theta_i, \theta_j) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(\theta'_i, \theta'_j) - \frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(\theta_i, \theta'_j)$$

where $k(\cdot,\cdot)$ denotes the kernel function, $n = 2M$, $\theta_i$ and $\theta_j$ are viewpoint distributions produced by the encoder, and $\theta'_i$ and $\theta'_j$ are viewpoint distributions sampled from the prior distribution;
According to the above calculation formulas, the total loss of model training is obtained as:

$$\mathcal{L} = \mathcal{L}_{reg} + \beta\,\text{MMD}(\Theta, \Theta')$$

where $\beta$ is a hyper-parameter used in model training, and $\text{MMD}(\Theta, \Theta')$ is the prior-matching loss calculated using the maximum mean discrepancy distance.
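The prior-matching loss of step 4.5 can be sketched as follows, assuming a Gaussian kernel (the claim leaves the kernel function k unspecified); the bandwidth sigma and the function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between rows of a (n x K) and b (m x K)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(theta, alpha, rng=None):
    """Squared maximum mean discrepancy between inferred viewpoint
    distributions theta (n x K) and n fresh samples from Dir(alpha),
    as in steps 4.5.1 and 4.5.2.  The biased estimator used here is
    always nonnegative."""
    rng = rng or np.random.default_rng()
    prior = rng.dirichlet(alpha, size=len(theta))   # step 4.5.1 sampling
    k_xx = gaussian_kernel(theta, theta)
    k_yy = gaussian_kernel(prior, prior)
    k_xy = gaussian_kernel(theta, prior)
    return k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()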
6. The viewpoint mining method based on self-supervised representation learning and oriented to hot topics as claimed in claim 1, wherein the constraining of the parameter changes of the model in step 4, iterating continuously until the loss function converges, specifically comprises the following model-training steps:
Step 4.2.1, constructing the encoder network, optimizing the model with an optimizer, and specifying the hyper-parameters λ, μ, η and β of the model, each of which is greater than zero;
Step 4.2.2, randomly sampling the input to the encoder network from the corpus, sampling the prior distribution from the Dirichlet distribution with parameter vector $\vec{\alpha}$, and obtaining the viewpoint distributions of the texts from the output of the encoder network;
Step 4.2.3, performing stochastic gradient descent by optimizing the three regularization losses and the Dirichlet prior-matching loss, and updating the parameters of the encoder network;
Step 4.2.4, repeating step 4.2.2 and step 4.2.3 until the model converges.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410226614.5A CN117808104B (en) | 2024-02-29 | 2024-02-29 | Viewpoint mining method based on self-supervision expression learning and oriented to hot topics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117808104A CN117808104A (en) | 2024-04-02 |
CN117808104B true CN117808104B (en) | 2024-04-30 |
Family
ID=90420377
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110941721A (en) * | 2019-09-28 | 2020-03-31 | 国家计算机网络与信息安全管理中心 | Short text topic mining method and system based on variational self-coding topic model |
CN113051932A (en) * | 2021-04-06 | 2021-06-29 | 合肥工业大学 | Method for detecting category of network media event of semantic and knowledge extension topic model |
CN115099188A (en) * | 2022-06-22 | 2022-09-23 | 南京邮电大学 | Topic mining method based on word embedding and generating type neural network |
CN116150669A (en) * | 2022-12-02 | 2023-05-23 | 大连海事大学 | Mashup service multi-label classification method based on double-flow regularized width learning |
CN117236330A (en) * | 2023-11-16 | 2023-12-15 | 南京邮电大学 | Mutual information and antagonistic neural network based method for enhancing theme diversity |
Non-Patent Citations (2)
Title |
---|
Text representation and classification algorithm based on adversarial training; Zhang Xiaohui; Yu Shuangyuan; Wang Quanxin; Xu Baomin; Computer Science; 2020-06-15 (No. S1); full text *
Multi-source text topic mining model based on the Dirichlet multinomial allocation model; Xu Liyang; Huang Ruizhang; Chen Yanping; Qian Zhisen; Li Wanying; Journal of Computer Applications; 2018-11-10 (No. 11); full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||