CN117808104B - Viewpoint mining method for hot topics based on self-supervised representation learning - Google Patents
- Publication number: CN117808104B (application CN202410226614.5A)
- Authority: CN (China)
- Legal status: Active
Abstract
The invention belongs to the technical field of natural language processing and discloses a viewpoint mining method for hot topics based on self-supervised representation learning, comprising the following steps: acquire a text corpus and perform data preprocessing; represent each text in the corpus with a bag-of-words model; apply data augmentation to the bag-of-words representation of each document to obtain pairs of similar document vector representations; feed the paired similar document vector representations into an encoder network, whose output is a vector representation of the viewpoint distribution of the input document; sample from a Dirichlet distribution to obtain a prior over viewpoint distributions; train the model by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with a prior loss aligning it with the Dirichlet prior distribution. The invention exploits the advantages of self-supervised learning to obtain viewpoint representations of documents, yielding high-quality viewpoints and mining diverse viewpoint representations.
Description
Technical Field
The invention belongs to the field of natural language processing, and in particular relates to a viewpoint mining method for hot topics based on self-supervised representation learning.
Background
The topic model, as a data mining tool, can automatically mine latent topics from large unstructured corpora. These corpora are typically unlabeled and often contain various kinds of noise, such as grammatical mistakes and spelling problems, which pose a series of challenges for topic mining. Researchers have therefore focused on designing models that overcome these problems and achieve high topic coherence and topic diversity across datasets from different domains. One research direction is to effectively preprocess the corpus before model training to eliminate noise, handle spelling problems, and improve text quality, which helps the topic model understand and model the text; the other is to innovate in model architecture and algorithms to better accommodate unstructured, noise-rich corpora.
The goal of topic modeling is to identify latent topics by automatically analyzing word co-occurrence relationships in documents and to assign each document a set of topic weights. The classical probabilistic topic model, Latent Dirichlet Allocation (LDA), assumes that document generation is governed by a topic distribution and per-topic word distributions, and effectively mines the latent topics of a corpus. However, such models require complex mathematical derivations for inference and are not easily extended. With the advent of neural topic models, two main research directions have emerged: models based on VAEs and models based on GANs. The former often impose unsuitable prior constraints on the topic distribution, which leads to poorly interpretable learned topic representations; the latter rely on adversarial training, whose optimization direction is unstable and prone to topic collapse, e.g. insufficient topic diversity, losing key information of the original corpus.
Disclosure of Invention
To solve the problems in existing research, the invention provides a viewpoint mining method for hot topics based on self-supervised representation learning. It mines viewpoints under hot events with a topic model; by using a Dirichlet distribution as the prior constraint, the learned representations can capture the multimodal semantics of texts, and by adopting self-supervised learning combined with loss optimization during training, the diversity of the viewpoint representations is improved.
To achieve the above objectives, the invention is realized through the following technical scheme:
the invention relates to a viewpoint mining method based on self-supervision expression learning for hot topics, which is characterized by comprising the following steps of: the viewpoint mining method comprises the following steps:
Step 1: preprocess the collected social media comment texts and, using the TF-IDF weighting scheme, obtain a bag-of-words representation $x$ of each document;
Step 2: apply data augmentation to the bag-of-words representation $x$ obtained in Step 1 to obtain a pair of similar document vector representations $x^A$ and $x^B$;
Step 3: take the augmented pair of similar document vector representations $x^A, x^B$ obtained in Step 2 as input to the encoder network and obtain its output, a vector representation of the viewpoint distribution of the input document;
Step 4: constrain the model parameters by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with the prior loss aligning it with the Dirichlet prior, iterating until the loss function converges, so as to ensure the stability of the model and the accuracy of viewpoint mining.
In a further refinement of the invention, Step 1 specifically comprises the following steps:
Step 1-1, data preprocessing: collect the content of public comments from a social media platform, parse out the meaningful comment entities in the content, remove texts that do not match the required language, apply lemmatization and spell checking to the words of each text, remove stop words, and discard texts shorter than a set document-length threshold;
Step 1-2, obtain the document representation: for a word $t$ in document $d$, compute the ratio of the number of occurrences of $t$ in $d$ to the total number of words in $d$, i.e. the term frequency $\mathrm{TF}(t,d)$; compute the importance of the word to the whole corpus, i.e. the inverse document frequency $\mathrm{IDF}(t,D) = \log\frac{|D|}{1 + |\{d \in D : t \in d\}|}$, the logarithm of the total number of documents in the corpus divided by one plus the number of documents containing $t$; the TF-IDF weight of a word with respect to the document and corpus is the product $\mathrm{TF}(t,d) \cdot \mathrm{IDF}(t,D)$; for a given document this yields a bag-of-words representation $x$ composed of the TF-IDF weights of all words $t$.
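As an illustration of Step 1-2, the TF-IDF weighting can be sketched in plain Python. This is a minimal sketch, not the patent's implementation; the function name, the sorted vocabulary order, and the use of the smoothed logarithm log(|D| / (1 + df)) are assumptions consistent with the description above.

```python
import math
from collections import Counter

def tfidf_bow(docs):
    """Build TF-IDF bag-of-words vectors for a list of tokenized documents.

    Illustrative sketch: TF(t, d) is the within-document frequency ratio,
    IDF(t, D) = log(|D| / (1 + number of documents containing t)).
    """
    vocab = sorted({t for d in docs for t in d})
    n_docs = len(docs)
    # document frequency of each term
    df = Counter(t for d in docs for t in set(d))
    vectors = []
    for d in docs:
        counts = Counter(d)
        total = len(d)
        vec = []
        for t in vocab:
            tf = counts[t] / total                # term frequency TF(t, d)
            idf = math.log(n_docs / (1 + df[t]))  # inverse document frequency IDF(t, D)
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors
```

Each document is thus mapped to a fixed-length vector over the corpus vocabulary, ready for the augmentation of Step 2.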
The invention further improves that: step 2, representing the bag-of-words model obtained in step 1The data enhancement is specifically: assume that the bag of words model representation/>Is/>Dimension vector, vector dimension/>The size of the word list is equal to that of the corpus, and the probability/> issetAnd get the vector/>The numerically smallest word representation, for the bag of words model representation/>According to random probabilistic selection of three data enhancement modes:
a) With probability Reducing the/>, of the word t value%;
B) With probabilityIncreasing the/>, of the word t value%;
C) With probabilityThe t-value of the word is set to zero.
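The three augmentation modes can be sketched as follows. This is a hedged illustration: the helper name `augment_bow`, the restriction to non-zero entries, and the way a single random draw selects among the three modes are assumptions consistent with, but not dictated by, the description above.

```python
import random

def augment_bow(x, p, q, L, rng=None):
    """Create one augmented copy of a TF-IDF bag-of-words vector x.

    Only the L entries with the smallest non-zero weights are perturbed;
    for each, one of the three modes (shrink by q%, grow by q%, zero out)
    fires with probability p each.
    """
    rng = rng or random.Random(0)
    x = list(x)
    # indices of the L numerically smallest non-zero entries
    nonzero = [i for i, v in enumerate(x) if v > 0]
    targets = sorted(nonzero, key=lambda i: x[i])[:L]
    for i in targets:
        r = rng.random()
        if r < p:                # mode A: reduce the value by q%
            x[i] *= 1 - q / 100
        elif r < 2 * p:          # mode B: increase the value by q%
            x[i] *= 1 + q / 100
        elif r < 3 * p:          # mode C: set the value to zero
            x[i] = 0.0
    return x
```

Calling the function twice on the same vector produces the paired similar document representations used by the encoder.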
In a further refinement of the invention, the encoder network in Step 3 uses the fully connected layer transformations below; taking the augmented pair of similar document vector representations $x^A, x^B$ as input, it infers the viewpoint representation of the text. The specific implementation steps comprise:
Step 3.1: randomly sample from the corpus obtained in Step 2 a $V$-dimensional pair of similar document vector representations $x^A, x^B$ and feed it to the encoder network, which maps it into an $H$-dimensional latent semantic space via the following two linear transformations:

$h_1 = \mathrm{SN}(W_1)\,x + b_1, \quad e_1 = f(h_1)$
$h_2 = \mathrm{SN}(W_2)\,e_1 + b_2, \quad e_2 = f(h_2)$

where $W_1$ is the weight matrix of the first layer, $W_2$ is the weight matrix of the second layer, $b_1$ and $b_2$ are bias terms, $h_1$ and $h_2$ are the hidden states of the layers, $e_1$ and $e_2$ are the layer representations after activation, $\mathrm{SN}$ is spectral normalization, and $f$ is the activation function;
Step 3.2: map the vector $e_2$ from Step 3.1 through a fully connected layer into a $K$-dimensional document viewpoint distribution:

$h_\theta = W_\theta\, e_2 + b_\theta, \quad \theta = \mathrm{softmax}(h_\theta)$

where $W_\theta$ and $b_\theta$ are the weight matrix and bias term of this layer, $h_\theta$ is the hidden state of the document viewpoint distribution layer, and $\theta$ is the $K$-dimensional document viewpoint distribution corresponding to the pair of similar document vector representations; the multinomial distribution $\theta$ is such that $\theta_k$ gives the proportion of the $k$-th viewpoint in the similar document vector representation.
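A minimal numpy sketch of the encoder forward pass in Steps 3.1-3.2 follows, assuming spectral normalization has already been applied to the weight matrices in `params` and that softplus stands in for the unspecified activation $f$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_forward(x, params):
    """Two linear layers plus a viewpoint head (Steps 3.1-3.2, sketch).

    x: (M, V) batch of bag-of-words vectors; returns (M, K) viewpoint
    distributions, each row a point on the K-simplex.
    """
    W1, b1, W2, b2, Wt, bt = params
    h1 = x @ W1 + b1              # first linear layer
    e1 = np.logaddexp(0.0, h1)    # softplus activation
    h2 = e1 @ W2 + b2             # second linear layer
    e2 = np.logaddexp(0.0, h2)
    ht = e2 @ Wt + bt             # viewpoint head
    theta = softmax(ht)           # K-dim document viewpoint distribution
    return theta
```

The softmax guarantees each output row is a valid multinomial viewpoint distribution.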
In a further refinement of the invention, minimizing the invariance, variance, and covariance regularization losses of the encoder network output and the prior loss aligning it with the Dirichlet prior in Step 4 specifically comprises the following steps:
Step 4.1: the augmented pair of similar document vector representations should remain similar after mapping; this is enforced by computing an invariance regularization loss. Given the inferred viewpoint distributions $\theta^A$ and $\theta^B$, the loss $\mathcal{L}_{inv}$ is computed as:

$\mathcal{L}_{inv} = \frac{1}{M}\sum_{i=1}^{M}\left\|\theta_i^A - \theta_i^B\right\|_2^2$

where $M$ is the batch size, $i$ is the traversal index of the summation over the pairs of similar document vector representations, and $\theta_i^A$ and $\theta_i^B$ are the paired viewpoint distributions of the similar document vector representations $x_i^A, x_i^B$;
Step 4.2: to prevent the viewpoint mapping from becoming constant, a variance loss function is used to address model collapse; it is computed as:

$\mathcal{L}_{var} = v(\Theta^A) + v(\Theta^B), \quad v(\Theta) = \frac{1}{K}\sum_{k=1}^{K}\max\left(0,\; \gamma - \sqrt{\mathrm{Var}(\theta_{\cdot,k}) + \varepsilon}\right)$

where $\theta_{\cdot,k}$ is the vector composed of the $k$-th dimension of every viewpoint distribution in the set $\Theta$, $\varepsilon$ is a tiny scalar for numerical stability, $K$ is the number of document viewpoints, and $\gamma$ is the target value of the standard deviation;
Step 4.3: a covariance loss is used as a further constraint; it is computed as:

$\mathcal{L}_{cov} = c(\Theta^A) + c(\Theta^B), \quad c(\Theta) = \frac{1}{K}\sum_{j \neq j'}\left[C(\Theta)\right]_{j,j'}^2, \quad C(\Theta) = \frac{1}{M-1}\sum_{i=1}^{M}(\theta_i - \bar{\theta})(\theta_i - \bar{\theta})^T$

where $\bar{\theta} = \frac{1}{M}\sum_{i=1}^{M}\theta_i$, $j$ denotes the $j$-th column of the matrix, and $T$ denotes matrix transposition;
Step 4.4: combining the formulas of Steps 4.1, 4.2 and 4.3 yields the three regularization losses:

$\mathcal{L}_{reg} = \lambda\,\mathcal{L}_{inv} + \mu\,\mathcal{L}_{var} + \eta\,\mathcal{L}_{cov}$

where $\lambda$, $\mu$, $\eta$ are distinct hyperparameters;
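The three regularization losses of Steps 4.1-4.4 follow a VICReg-style pattern and can be sketched in numpy as below. The value of the variance target `gamma` and the exact normalization of the covariance term are assumptions.

```python
import numpy as np

def regularization_losses(theta_a, theta_b, lam, mu, eta, gamma=1.0, eps=1e-4):
    """Invariance, variance and covariance regularization (sketch).

    theta_a, theta_b: (M, K) paired viewpoint distributions from the
    two augmented views; lam, mu, eta are the weighting hyperparameters.
    """
    M, K = theta_a.shape
    # invariance: mean squared distance between paired distributions
    l_inv = np.mean(np.sum((theta_a - theta_b) ** 2, axis=1))

    def variance_term(theta):
        s = np.sqrt(theta.var(axis=0) + eps)        # per-dimension std
        return np.mean(np.maximum(0.0, gamma - s))  # hinge on the std

    # variance: keep each viewpoint dimension from collapsing
    l_var = variance_term(theta_a) + variance_term(theta_b)

    def covariance_term(theta):
        c = np.cov(theta, rowvar=False)             # K x K covariance matrix
        off_diag = c - np.diag(np.diag(c))
        return np.sum(off_diag ** 2) / K            # squared off-diagonal entries

    # covariance: decorrelate different viewpoint dimensions
    l_cov = covariance_term(theta_a) + covariance_term(theta_b)

    # weighted combination with hyperparameters lambda, mu, eta
    return lam * l_inv + mu * l_var + eta * l_cov
```

In a real training loop these operations would be expressed in an autodiff framework so gradients can flow back to the encoder.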
Step 4.5: constrain the output distribution of the encoder network by computing the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution.
In a further refinement of the invention, constraining the output distribution of the encoder network in Step 4.5 by computing the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution comprises the following steps:
Step 4.5.1: given the set of inferred viewpoint distributions $\Theta = \{\theta_1, \ldots, \theta_M\}$, randomly sample from a Dirichlet distribution with parameter $\alpha$ to obtain $M$ prior samples $\tilde{\Theta} = \{\tilde{\theta}_1, \ldots, \tilde{\theta}_M\}$, using the following density:

$p(\tilde{\theta} \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K}\alpha_i\right)}{\prod_{i=1}^{K}\Gamma(\alpha_i)} \prod_{i=1}^{K} \tilde{\theta}_i^{\alpha_i - 1}$

where $K$ is set to the number of viewpoints used in model training, $\tilde{\theta}$ is a prior sample drawn from the Dirichlet distribution with parameter $\alpha$, and $\alpha_i$ is the $i$-th value of the parameter vector $\alpha$;
Step 4.5.2: having obtained the set of viewpoint distributions $\Theta$ and the prior samples $\tilde{\Theta}$, compute the maximum mean discrepancy between the two distributions using the following formula:

$\mathrm{MMD}^2(\Theta, \tilde{\Theta}) = \frac{1}{M(M-1)}\sum_{i \neq j} k(\theta_i, \theta_j) + \frac{1}{M(M-1)}\sum_{i \neq j} k(\tilde{\theta}_i, \tilde{\theta}_j) - \frac{2}{M^2}\sum_{i,j} k(\theta_i, \tilde{\theta}_j)$

where $k(\cdot,\cdot)$ is a kernel function, $\theta_i$ and $\theta_j$ are viewpoint distributions produced by the encoder, and $\tilde{\theta}_i$ and $\tilde{\theta}_j$ are viewpoint distributions sampled from the prior distribution;
From the above formulas, the total loss for model training is:

$\mathcal{L} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{MMD}$

where $\beta$ is a hyperparameter used in model training and $\mathcal{L}_{MMD}$ is the prior matching loss computed with the maximum mean discrepancy distance.
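Steps 4.5.1-4.5.2 can be sketched as follows; a Gaussian RBF kernel is an assumption, since the description only requires "a kernel function":

```python
import numpy as np

def mmd_loss(theta, theta_prior, bandwidth=1.0):
    """MMD^2 between inferred and Dirichlet-prior samples (sketch).

    theta, theta_prior: (M, K) arrays of viewpoint distributions; the
    within-sample terms exclude the diagonal as in the formula above.
    """
    def kernel(a, b):
        # pairwise squared distances, then RBF kernel
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))

    M = theta.shape[0]
    k_xx = kernel(theta, theta)
    k_yy = kernel(theta_prior, theta_prior)
    k_xy = kernel(theta, theta_prior)
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (M * (M - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (M * (M - 1))
    term_xy = 2 * k_xy.mean()
    return term_xx + term_yy - term_xy

# sampling the Dirichlet prior (Step 4.5.1); alpha and sizes are illustrative
rng = np.random.default_rng(0)
prior = rng.dirichlet(alpha=[0.1] * 5, size=8)  # 8 samples of a 5-viewpoint prior
```

A sparse $\alpha$ (here 0.1) concentrates prior mass near the simplex corners, encouraging each document to commit to few viewpoints.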
In a further refinement of the invention, constraining the model parameters in Step 4 and iterating until the loss function converges, i.e. model training, specifically comprises the following steps:
Step 4.2.1: construct the encoder network, choose a suitable optimizer, and specify the model hyperparameters $\lambda, \mu, \eta, \beta$, each with a value greater than zero;
Step 4.2.2: randomly sample inputs to the encoder network from the corpus, sample the prior distribution from the Dirichlet distribution with parameter $\alpha$, and obtain the viewpoint distribution of the text from the encoder network output;
Step 4.2.3: perform stochastic gradient descent on the three regularization losses and the Dirichlet prior matching loss, updating the parameters of the encoder network;
Step 4.2.4: repeat Steps 4.2.2 and 4.2.3 until the model converges.
The beneficial effects of the invention are as follows: by modeling the viewpoints of texts, formalizing viewpoints through topic words, and fully exploiting the advantages of self-supervised learning, the invention obtains efficient viewpoint representations.
By optimizing the three regularization losses and aligning the model output with the Dirichlet prior, the invention obtains high-quality viewpoints and mines diverse viewpoint representations.
The invention designs specific loss functions that yield high-quality and diverse viewpoints from different optimization angles, improve the quality of the learned viewpoint representations, and address common problems in self-supervised learning such as model collapse.
The invention provides an innovative and effective solution for viewpoint analysis under hot topics.
Drawings
FIG. 1 is a specific flow chart of an embodiment of the present invention.
Fig. 2 is a structural diagram of the neural network model of the present invention.
Detailed Description
The application will be further illustrated with reference to the drawings and the detailed description below. It should be understood that the following specific examples are intended to illustrate the application rather than to limit its scope; after reading the application, various equivalent modifications by those skilled in the art fall within the scope defined by the appended claims.
As shown in Figs. 1-2, the invention is a viewpoint mining method for hot topics based on self-supervised representation learning, comprising the following steps:
Step 1: preprocess the collected social media comment texts and, using the TF-IDF weighting scheme, obtain a bag-of-words representation $x$ of each document, specifically comprising:
Step 1-1, data preprocessing: collect the content of public comments from a social media platform, parse out the meaningful comment entities in the content, remove texts that do not match the required language, apply lemmatization and spell checking to the words of each text, remove stop words, and discard texts shorter than a set document-length threshold;
Step 1-2, obtain the document representation: for a word $t$ in document $d$, compute the ratio of the number of occurrences of $t$ in $d$ to the total number of words in $d$, i.e. the term frequency $\mathrm{TF}(t,d)$; compute the importance of the word to the whole corpus, i.e. the inverse document frequency $\mathrm{IDF}(t,D) = \log\frac{|D|}{1 + |\{d \in D : t \in d\}|}$, the logarithm of the total number of documents in the corpus divided by one plus the number of documents containing $t$; the TF-IDF weight of a word with respect to the document and corpus is the product $\mathrm{TF}(t,d) \cdot \mathrm{IDF}(t,D)$; for a given document this yields a bag-of-words representation $x$ composed of the TF-IDF weights of all words $t$.
Step 2: apply data augmentation to the bag-of-words representation $x$ obtained in Step 1 to obtain a pair of similar document vector representations $x^A$ and $x^B$. The data augmentation is specifically: assume the bag-of-words representation $x$ is a $V$-dimensional vector, where the vector dimension $V$ equals the vocabulary size of the corpus; set a probability $p$ and take the $L$ words with the numerically smallest values in the vector; for each such word $t$ of the bag-of-words representation $x$, select among three data augmentation modes at random:
A) with probability $p$, decrease the value of word $t$ by $q\%$;
B) with probability $p$, increase the value of word $t$ by $q\%$;
C) with probability $p$, set the value of word $t$ to zero.
Step 3: take the augmented pair of similar document vector representations $x^A, x^B$ obtained in Step 2 as input to the encoder network and obtain its output, a vector representation of the viewpoint distribution of the input document. The specific implementation steps comprise:
Step 3.1: randomly sample from the corpus obtained in Step 2 a $V$-dimensional pair of similar document vector representations $x^A, x^B$ and feed it to the encoder network, which maps it into an $H$-dimensional latent semantic space via the following two linear transformations:

$h_1 = \mathrm{SN}(W_1)\,x + b_1, \quad e_1 = f(h_1)$
$h_2 = \mathrm{SN}(W_2)\,e_1 + b_2, \quad e_2 = f(h_2)$

where $W_1$ is the weight matrix of the first layer, $W_2$ is the weight matrix of the second layer, $b_1$ and $b_2$ are bias terms, $h_1$ and $h_2$ are the hidden states of the layers, $e_1$ and $e_2$ are the layer representations after activation, $\mathrm{SN}$ is spectral normalization, and $f$ is the activation function;
Step 3.2: map the vector $e_2$ from Step 3.1 through a fully connected layer into a $K$-dimensional document viewpoint distribution:

$h_\theta = W_\theta\, e_2 + b_\theta, \quad \theta = \mathrm{softmax}(h_\theta)$

where $W_\theta$ and $b_\theta$ are the weight matrix and bias term of this layer, $h_\theta$ is the hidden state of the document viewpoint distribution layer, and $\theta$ is the $K$-dimensional document viewpoint distribution corresponding to the pair of similar document vector representations; the multinomial distribution $\theta$ is such that $\theta_k$ gives the proportion of the $k$-th viewpoint in the similar document vector representation.
Step 4: constrain the model parameters by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with the prior loss aligning it with the Dirichlet prior, iterating until the loss function converges, so as to ensure the stability of the model and the accuracy of viewpoint mining.
Specifically, this minimization comprises the following steps:
Step 4.1: the augmented pair of similar document vector representations should remain similar after mapping; this is enforced by computing an invariance regularization loss. Given the inferred viewpoint distributions $\theta^A$ and $\theta^B$, the loss $\mathcal{L}_{inv}$ is computed as:

$\mathcal{L}_{inv} = \frac{1}{M}\sum_{i=1}^{M}\left\|\theta_i^A - \theta_i^B\right\|_2^2$

where $M$ is the batch size, $i$ is the traversal index of the summation over the pairs of similar document vector representations, and $\theta_i^A$ and $\theta_i^B$ are the paired viewpoint distributions of the similar document vector representations $x_i^A, x_i^B$;
Step 4.2: to prevent the viewpoint mapping from becoming constant, a variance loss function is used to address model collapse; it is computed as:

$\mathcal{L}_{var} = v(\Theta^A) + v(\Theta^B), \quad v(\Theta) = \frac{1}{K}\sum_{k=1}^{K}\max\left(0,\; \gamma - \sqrt{\mathrm{Var}(\theta_{\cdot,k}) + \varepsilon}\right)$

where $\theta_{\cdot,k}$ is the vector composed of the $k$-th dimension of every viewpoint distribution in the set $\Theta$, $\varepsilon$ is a tiny scalar for numerical stability, $K$ is the number of document viewpoints, and $\gamma$ is the target value of the standard deviation;
Step 4.3: a covariance loss is used as a further constraint; it is computed as:

$\mathcal{L}_{cov} = c(\Theta^A) + c(\Theta^B), \quad c(\Theta) = \frac{1}{K}\sum_{j \neq j'}\left[C(\Theta)\right]_{j,j'}^2, \quad C(\Theta) = \frac{1}{M-1}\sum_{i=1}^{M}(\theta_i - \bar{\theta})(\theta_i - \bar{\theta})^T$

where $\bar{\theta} = \frac{1}{M}\sum_{i=1}^{M}\theta_i$, $j$ denotes the $j$-th column of the matrix, and $T$ denotes matrix transposition;
Step 4.4: combining the formulas of Steps 4.1, 4.2 and 4.3 yields the three regularization losses:

$\mathcal{L}_{reg} = \lambda\,\mathcal{L}_{inv} + \mu\,\mathcal{L}_{var} + \eta\,\mathcal{L}_{cov}$

where $\lambda$, $\mu$, $\eta$ are distinct hyperparameters;
Step 4.5: constrain the output distribution of the encoder network by computing the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution, specifically:
Given the set of inferred viewpoint distributions $\Theta = \{\theta_1, \ldots, \theta_M\}$, randomly sample from a Dirichlet distribution with parameter $\alpha$ to obtain $M$ prior samples $\tilde{\Theta} = \{\tilde{\theta}_1, \ldots, \tilde{\theta}_M\}$, using the following density:

$p(\tilde{\theta} \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{K}\alpha_i\right)}{\prod_{i=1}^{K}\Gamma(\alpha_i)} \prod_{i=1}^{K} \tilde{\theta}_i^{\alpha_i - 1}$

where $K$ is set to the number of viewpoints used in model training, $\tilde{\theta}$ is a prior sample drawn from the Dirichlet distribution with parameter $\alpha$, and $\alpha_i$ is the $i$-th value of the parameter vector $\alpha$.
Having obtained the set of viewpoint distributions $\Theta$ and the prior samples $\tilde{\Theta}$, compute the maximum mean discrepancy between the two distributions using the following formula:

$\mathrm{MMD}^2(\Theta, \tilde{\Theta}) = \frac{1}{M(M-1)}\sum_{i \neq j} k(\theta_i, \theta_j) + \frac{1}{M(M-1)}\sum_{i \neq j} k(\tilde{\theta}_i, \tilde{\theta}_j) - \frac{2}{M^2}\sum_{i,j} k(\theta_i, \tilde{\theta}_j)$

where $k(\cdot,\cdot)$ is a kernel function, $\theta_i$ and $\theta_j$ are viewpoint distributions produced by the encoder, and $\tilde{\theta}_i$ and $\tilde{\theta}_j$ are viewpoint distributions sampled from the prior distribution.
From the above formulas, the total loss for model training is:

$\mathcal{L} = \mathcal{L}_{reg} + \beta\,\mathcal{L}_{MMD}$

where $\beta$ is a hyperparameter used in model training and $\mathcal{L}_{MMD}$ is the prior matching loss computed with the maximum mean discrepancy distance.
Constraining the model parameters in Step 4 and iterating until the loss function converges, i.e. model training, specifically comprises the following steps:
Step 4.2.1: construct the encoder network, choose a suitable optimizer, and specify the model hyperparameters $\lambda, \mu, \eta, \beta$, each with a value greater than zero;
Step 4.2.2: randomly sample inputs to the encoder network from the corpus, sample the prior distribution from the Dirichlet distribution with parameter $\alpha$, and obtain the viewpoint distribution of the text from the encoder network output;
Step 4.2.3: perform stochastic gradient descent on the three regularization losses and the Dirichlet prior matching loss, updating the parameters of the encoder network;
Step 4.2.4: repeat Steps 4.2.2 and 4.2.3 until the model converges.
To verify the benefits of the invention, experiments were performed on comment texts about a recent hot topic, the large language model ChatGPT. Since the invention is based on topic model techniques, the quality of the mined viewpoint representations is evaluated with topic coherence and diversity metrics. The average topic coherence values on the User Query dataset are: CP 0.2983, NPMI 0.0581, UCI 0.4084, and UT 0.9870, where higher values indicate better performance. All of these exceed the comparative experiments, whose best values are CP 0.2834, NPMI 0.0397, UCI 0.3096, and UT 0.960. The comparative models used in this experiment are Comparative Examples 1, 2 and 3, where:
Comparative Example 1: the LDA method in Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993–1022;
Comparative Example 2: the ETM method in Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguistics, 8, 439–453;
Comparative Example 3: the CNTM method in Nguyen, T., & Luu, A. T. (2021). Contrastive learning for neural topic model. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pp. 11974–11986.
In the comparison of the experimental results of the invention, the UT (topic uniqueness) metric is computed as $\mathrm{TU} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{N}\sum_{n=1}^{N}\frac{1}{\mathrm{cnt}(n,k)}$, where $\mathrm{cnt}(n,k)$ is the number of topics in which the $n$-th top word of topic $k$ appears, so that topic words occurring in only one topic contribute fully to the score. The CP, NPMI, and UCI metrics are public evaluation metrics for topic quality, i.e. the semantic coherence of topics, widely used in topic modeling experiments; all were proposed in the paper of Röder et al. (Röder, M., Both, A., & Hinneburg, A. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 2015: 399-408.)
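The UT (topic uniqueness) computation described above can be sketched as follows; this is a hedged reconstruction, and the exact variant used in the experiments may differ:

```python
from collections import Counter

def topic_uniqueness(topics):
    """Topic uniqueness (TU) score, sketch.

    topics: list of K topics, each a list of top words. cnt(n, k) is the
    number of topics containing the nth top word of topic k; TU averages
    the inverse counts, so 1.0 means no top word is shared between topics.
    """
    counts = Counter(w for topic in topics for w in set(topic))
    K = len(topics)
    total = 0.0
    for topic in topics:
        total += sum(1.0 / counts[w] for w in topic) / len(topic)
    return total / K
```

Disjoint topic word lists score 1.0; heavy word sharing drives the score toward 1/K.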
The foregoing is only a preferred embodiment of the invention. It should be noted that various improvements and adaptations apparent to those skilled in the art may be made without departing from the principles of the present invention, and such improvements and adaptations are intended to fall within the scope of the invention.
Claims (6)
1. A viewpoint mining method for hot topics based on self-supervised representation learning, characterized in that the viewpoint mining method comprises the following steps:
Step 1: preprocess the collected social media comment texts and, using the TF-IDF weighting scheme, obtain a bag-of-words representation $x$ of each document;
Step 2: apply data augmentation to the bag-of-words representation $x$ obtained in Step 1 to obtain a pair of similar document vector representations $x^A$ and $x^B$;
Step 3: take the augmented pair of similar document vector representations $x^A, x^B$ obtained in Step 2 as input to the encoder network and obtain its output, a vector representation of the viewpoint distribution of the input document;
Step 4: constrain the model parameters by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with the prior loss aligning it with the Dirichlet prior, iterating until the loss function converges to ensure the stability of the model and the accuracy of viewpoint mining, specifically comprising the following steps:
Step 4.1: the augmented pair of similar document vector representations should remain similar after mapping; this is enforced by computing an invariance regularization loss. Given the inferred viewpoint distributions $\theta^A$ and $\theta^B$, the loss $\mathcal{L}_{inv}$ is computed as:

$\mathcal{L}_{inv} = \frac{1}{M}\sum_{i=1}^{M}\left\|\theta_i^A - \theta_i^B\right\|_2^2$

where $M$ represents the batch size, $i$ is the traversal index of the summation over the pairs of similar document vector representations, and $\theta_i^A$ and $\theta_i^B$ are the paired viewpoint distributions of the similar document vector representations $x_i^A, x_i^B$;
Step 4.2: to prevent the viewpoint mapping from becoming constant, a variance loss function is used to address model collapse; it is computed as:

$\mathcal{L}_{var} = v(\Theta^A) + v(\Theta^B), \quad v(\Theta) = \frac{1}{K}\sum_{k=1}^{K}\max\left(0,\; \gamma - \sqrt{\mathrm{Var}(\theta_{\cdot,k}) + \varepsilon}\right)$

where $\theta_{\cdot,k}$ is the vector composed of the $k$-th dimension of every viewpoint distribution in the set $\Theta^A$ (respectively $\Theta^B$), $\varepsilon$ is a tiny scalar for numerical stability, $K$ represents the number of document viewpoints, and $\gamma$ is the target value of the standard deviation;
Step 4.3: a covariance loss is used as a further constraint; it is computed as:

$\mathcal{L}_{cov} = c(\Theta^A) + c(\Theta^B), \quad c(\Theta) = \frac{1}{K}\sum_{j \neq j'}\left[C(\Theta)\right]_{j,j'}^2, \quad C(\Theta) = \frac{1}{M-1}\sum_{i=1}^{M}(\theta_i - \bar{\theta})(\theta_i - \bar{\theta})^T$

where $j$ represents the $j$-th column of the matrix and $T$ represents matrix transposition;
Step 4.4: combining the formulas of Steps 4.1, 4.2 and 4.3 yields the three regularization losses:

$\mathcal{L}_{reg} = \lambda\,\mathcal{L}_{inv} + \mu\,\mathcal{L}_{var} + \eta\,\mathcal{L}_{cov}$

where $\lambda$, $\mu$, $\eta$ are different hyperparameters;
Step 4.5: constrain the output distribution of the encoder network by computing the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution.
2. The hot topic-oriented self-supervised representation learning-based viewpoint mining method as claimed in claim 1, wherein the method comprises the following steps: the step 1 specifically comprises the following steps:
Step 1-1, data preprocessing: collecting the content of public comments from a social media platform; parsing the collected content for meaningful comment entities; removing texts that do not meet the language-category requirements; performing lemmatization and spell checking on the words of each text; removing stop words from the texts; and screening out and removing texts shorter than a set document-length threshold;
Step 1-2, obtaining a document representation: for a word t in a document d, the term frequency TF(t, d) is calculated as the ratio of the number of occurrences of t in d to the total number of words in d; the inverse document frequency IDF(t, D), which measures the importance of the word for the whole corpus D, is calculated as the logarithm of the total number of documents in the corpus divided by one plus the number of documents containing t; the TF-IDF weight of the word, for the document and the corpus, is the product of the term frequency TF(t, d) and the inverse document frequency IDF(t, D); for a given document, the bag-of-words model representation x composed of the TF-IDF weights of all its words t is obtained.
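The TF-IDF computation of step 1-2 can be sketched as follows, assuming the smoothed IDF described above (logarithm of the document count over one plus the document frequency); the function name and the input format (documents as token lists) are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_bow(docs):
    """Bag-of-words TF-IDF representations as in step 1-2.

    TF(t, d)  = count of t in d / total words in d
    IDF(t, D) = log(N / (1 + number of documents containing t))
    Returns the sorted vocabulary and one TF-IDF weight vector per document.
    """
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per word
    idf = {t: math.log(n / (1 + df[t])) for t in vocab}
    reps = []
    for d in docs:
        tf = Counter(d)
        total = len(d)
        reps.append([tf[t] / total * idf[t] for t in vocab])
    return vocab, reps
```

For example, a word that appears in half the documents of a two-document corpus gets IDF = log(2/2) = 0, so its TF-IDF weight vanishes.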
3. The viewpoint mining method based on self-supervised representation learning and oriented to hot topics as claimed in claim 1, wherein the data enhancement in step 2 of the bag-of-words model representation x obtained in step 1 is specifically: assuming the bag-of-words model representation x is a V-dimensional vector whose dimension V equals the vocabulary size of the corpus, a probability p is set, the L smallest word weights in the vector are selected, and for the bag-of-words model representation one of the following three data enhancement modes is chosen at random:
a) reducing the value of the word t by q with probability p;
b) increasing the value of the word t by q with probability p;
c) setting the value of the word t to zero with probability p.
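The three enhancement modes above can be sketched as follows; the parameter names p, q and num_low (the L smallest nonzero weights), and the choice to pick a mode uniformly at random, are illustrative assumptions.

```python
import numpy as np

def augment_bow(x, p=0.1, q=0.5, num_low=20, rng=None):
    """Randomly perturb a TF-IDF bag-of-words vector (step 2 sketch).

    For each of the num_low smallest nonzero weights, with probability p one
    of three modes is applied: a) scale down by q, b) scale up by q,
    c) zero out.  All parameter defaults are illustrative.
    """
    rng = rng or np.random.default_rng()
    x = x.copy()                       # leave the input vector untouched
    nonzero = np.flatnonzero(x)
    low = nonzero[np.argsort(x[nonzero])[:num_low]]  # L smallest weights
    for t in low:
        if rng.random() >= p:
            continue
        mode = rng.integers(3)
        if mode == 0:
            x[t] *= (1 - q)            # a) reduce the weight
        elif mode == 1:
            x[t] *= (1 + q)            # b) increase the weight
        else:
            x[t] = 0.0                 # c) zero out the weight
    return x
```

Applying the function twice to the same document yields two perturbed copies, i.e. the paired similar document vector representations used by the encoder.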
4. The viewpoint mining method based on self-supervised representation learning and oriented to hot topics as claimed in claim 1, wherein the encoder network in step 3 uses the following fully connected layer transformations, takes the enhanced paired similar document vector representations $(x', x'')$ as input, and infers the viewpoint representation of the text; the specific implementation steps comprise:
Step 3.1, random sampling is performed on the corpus obtained in step 2 to obtain V-dimensional paired similar document vector representations $(x', x'')$, which are input to the encoder network and mapped into the S-dimensional latent semantic space through the following two layers of linear transformations:

$$u_1 = \text{SN}(W_1)\,x + b_1,\quad h_1 = \text{HARDSWISH}(u_1),\qquad u_2 = \text{SN}(W_2)\,h_1 + b_2,\quad h_2 = \text{HARDSWISH}(u_2)$$

where $W_1$ denotes the weight matrix of the first layer, $W_2$ denotes the weight matrix of the second layer, $b_1$ and $b_2$ are bias terms, $u_1$ and $u_2$ denote the hidden states of the two layers, $h_1$ and $h_2$ denote the representation vectors after activation, SN(·) is spectral normalization, and HARDSWISH(·) is the activation function;
Step 3.2, the representation vector $h_2$ from step 3.1 is mapped through a fully connected layer into the K-dimensional document viewpoint distribution:

$$u_3 = W_3\,h_2 + b_3,\qquad \theta = \text{softmax}(u_3)$$

where $W_3$ and $b_3$ are the weight matrix and bias term of this layer, $u_3$ is the hidden state of the document viewpoint distribution layer, $\theta$ denotes the K-dimensional document viewpoint distribution corresponding to the paired similar document vector representation, and the $k$-th component, $k \in \{1,2,\dots,K\}$, of the multinomial distribution $\theta$ represents the proportion of the $k$-th viewpoint in the similar document vector representation.
5. The viewpoint mining method based on self-supervised representation learning and oriented to hot topics as claimed in claim 1, wherein the constraining of the output distribution of the encoder network in step 4.5, by calculating the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution, specifically comprises the following steps:
Step 4.5.1, given the inferred set of viewpoint distributions $\Theta$, a prior distribution $\Theta'$ of $\Theta$ is obtained by random sampling from the Dirichlet distribution with parameter vector $\vec{\alpha}$; the formula specifically used is as follows:

$$\theta' \sim \text{Dir}(\vec{\alpha}),\qquad \vec{\alpha} = (\alpha_1, \alpha_2, \dots, \alpha_K)$$

where $K$ is set to the number of viewpoints used for model training, $\theta'$ is a prior sample drawn from the Dirichlet distribution with parameter vector $\vec{\alpha}$, and $\alpha_i$ is the $i$-th value of the parameter vector $\vec{\alpha}$;
Step 4.5.2, after obtaining the set of viewpoint distributions $\Theta$ and the prior distribution $\Theta'$, the maximum mean discrepancy between the two distributions is calculated using the following formula:

$$\text{MMD}(\Theta, \Theta') = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(\theta_i, \theta_j) + \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(\theta'_i, \theta'_j) - \frac{2}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} k(\theta_i, \theta'_j)$$

where $k(\cdot,\cdot)$ denotes the kernel function, $n = 2M$, $\theta_i$ and $\theta_j$ are viewpoint distributions produced by the encoder, and $\theta'_i$ and $\theta'_j$ are viewpoint distributions sampled from the prior distribution;
According to the above calculation formulas, the total loss of model training is obtained as:

$$\mathcal{L} = \mathcal{L}_{reg} + \beta\,\text{MMD}(\Theta, \Theta')$$

where $\beta$ is a hyper-parameter used in model training, and $\text{MMD}(\Theta, \Theta')$ is the prior-matching loss calculated using the maximum mean discrepancy distance.
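The prior-matching loss of step 4.5 can be sketched as follows, assuming a Gaussian kernel (the claim leaves the kernel function k unspecified); the bandwidth sigma and the function names are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between rows of a (n x K) and b (m x K)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd_loss(theta, alpha, rng=None):
    """Squared maximum mean discrepancy between inferred viewpoint
    distributions theta (n x K) and n fresh samples from Dir(alpha),
    as in steps 4.5.1 and 4.5.2.  The biased estimator used here is
    always nonnegative."""
    rng = rng or np.random.default_rng()
    prior = rng.dirichlet(alpha, size=len(theta))   # step 4.5.1 sampling
    k_xx = gaussian_kernel(theta, theta)
    k_yy = gaussian_kernel(prior, prior)
    k_xy = gaussian_kernel(theta, prior)
    return k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()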
6. The viewpoint mining method based on self-supervised representation learning and oriented to hot topics as claimed in claim 1, wherein the constraining of the parameter changes of the model in step 4, iterating continuously until the loss function converges, specifically comprises the following model-training steps:
Step 4.2.1, constructing the encoder network, optimizing the model with an optimizer, and specifying the hyper-parameters λ, μ, η and β of the model, each of which is greater than zero;
Step 4.2.2, randomly sampling the input to the encoder network from the corpus, sampling the prior distribution from the Dirichlet distribution with parameter vector $\vec{\alpha}$, and obtaining the viewpoint distributions of the texts from the output of the encoder network;
Step 4.2.3, performing stochastic gradient descent by optimizing the three regularization losses and the Dirichlet prior-matching loss, and updating the parameters of the encoder network;
Step 4.2.4, repeating step 4.2.2 and step 4.2.3 until the model converges.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410226614.5A CN117808104B (en) | 2024-02-29 | 2024-02-29 | Viewpoint mining method based on self-supervision expression learning and oriented to hot topics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117808104A CN117808104A (en) | 2024-04-02 |
CN117808104B true CN117808104B (en) | 2024-04-30 |
Family
ID=90420377
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110941721A (en) * | 2019-09-28 | 2020-03-31 | 国家计算机网络与信息安全管理中心 | Short text topic mining method and system based on variational self-coding topic model |
CN113051932A (en) * | 2021-04-06 | 2021-06-29 | 合肥工业大学 | Method for detecting category of network media event of semantic and knowledge extension topic model |
CN115099188A (en) * | 2022-06-22 | 2022-09-23 | 南京邮电大学 | Topic mining method based on word embedding and generating type neural network |
CN116150669A (en) * | 2022-12-02 | 2023-05-23 | 大连海事大学 | Mashup service multi-label classification method based on double-flow regularized width learning |
CN117236330A (en) * | 2023-11-16 | 2023-12-15 | 南京邮电大学 | Mutual information and antagonistic neural network based method for enhancing theme diversity |
Non-Patent Citations (2)
Title |
---|
Text representation and classification algorithm based on adversarial training; Zhang Xiaohui; Yu Shuangyuan; Wang Quanxin; Xu Baomin; Computer Science; 2020-06-15 (No. S1); full text *
Multi-source text topic mining model based on the Dirichlet multinomial allocation model; Xu Liyang; Huang Ruizhang; Chen Yanping; Qian Zhisen; Li Wanying; Journal of Computer Applications; 2018-11-10 (No. 11); full text *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||