CN117808104B - Viewpoint mining method based on self-supervised representation learning for hot topics

Viewpoint mining method based on self-supervised representation learning for hot topics

Info

Publication number: CN117808104B
Application number: CN202410226614.5A
Authority: CN (China)
Other versions: CN117808104A (Chinese)
Prior art keywords: document, distribution, word, representation, model
Legal status: Active (granted)
Inventors: 王睿, 刘星, 任鹏, 王延安, 常舒予, 黄海平
Current and original assignee: Nanjing University of Posts and Telecommunications
Priority and filing date: 2024-02-29
Publication date: 2024-04-30 (grant)

Abstract

The invention belongs to the technical field of natural language processing and discloses a viewpoint mining method based on self-supervised representation learning for hot topics, comprising the following steps: acquiring a text corpus and preprocessing the data; representing each text in the corpus with a bag-of-words model; applying data enhancement to the bag-of-words representation of a document to obtain paired similar document vector representations; feeding the paired similar document vector representations into an encoder network whose output is a vector representation of the viewpoint distribution of the input document; sampling a prior on the viewpoint distribution from a Dirichlet distribution; and training the model by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with a prior loss that aligns the output with the Dirichlet prior. The invention exploits the advantages of self-supervised learning to obtain viewpoint representations of documents, yielding high-quality viewpoints and mining diverse viewpoint representations.

Description

Viewpoint mining method based on self-supervised representation learning for hot topics
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a viewpoint mining method based on self-supervised representation learning for hot topics.
Background
As a data mining tool, the topic model can automatically mine latent topics from large unstructured corpora. These corpora are typically unlabeled and often contain noise such as grammatical mistakes and spelling problems, which poses a series of challenges for topic mining. Researchers have therefore focused on designing models that overcome these problems and achieve high topic coherence and topic diversity across data sets from different domains. One research direction is to preprocess the corpus effectively before model training in order to eliminate noise, handle spelling problems, and improve text quality, which helps the topic model understand and model the text; on the other hand, researchers also innovate in model architectures and algorithms to better accommodate unstructured, noise-rich corpora.
The goal of topic modeling is to identify these latent topics by automatically analyzing word co-occurrence relationships in documents and to assign each document a weight over the related topics. The classical probabilistic topic model, Latent Dirichlet Allocation (LDA), assumes that documents are generated from topic distributions and word distributions, and it effectively mines the latent topics in a corpus. However, such models require complex mathematical derivation during inference and are not easily extended. With the advent of neural topic models, two main research directions have emerged: models based on the VAE and models based on the GAN. The former, because of unsuitable prior constraints imposed on the topic distribution, often yields topic representations with poor interpretability; the latter relies on adversarial training, whose optimization direction is unstable and prone to topic collapse, such as insufficient topic diversity, losing key information of the original corpus.
Disclosure of Invention
In order to solve the problems in existing research, the invention provides a viewpoint mining method based on self-supervised representation learning for hot topics, which mines viewpoints under hot events on the basis of a topic model. By using the Dirichlet distribution as a prior constraint, the learned representation can capture the multimodal semantics in texts; by adopting self-supervised learning combined with the loss optimization during training, the diversity of the viewpoint representations is improved.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
the invention relates to a viewpoint mining method based on self-supervision expression learning for hot topics, which is characterized by comprising the following steps of: the viewpoint mining method comprises the following steps:
Step 1, preprocess the collected social media comment texts and obtain a bag-of-words representation x of each document using the TF-IDF weighting scheme;
Step 2, apply data enhancement to the bag-of-words representation x obtained in step 1 to obtain paired similar document vector representations (x^a, x^b);
Step 3, take the enhanced paired similar document vector representations (x^a, x^b) obtained in step 2 as input to the encoder network and derive the output of the encoder network, the output being a vector representation of the viewpoint distribution of the input document;
Step 4, constrain the change of the model parameters by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with the prior loss aligned with the Dirichlet prior, and iterate continuously until the loss function converges, so as to ensure the stability of the model and the accuracy of viewpoint mining.
The invention is further improved in that step 1 specifically comprises the following steps:
Step 1-1, data preprocessing: collect the content of public comments from a social media platform, parse out the meaningful comment entities in the content, remove texts that do not meet the language category requirement, perform lemmatization and spell checking on the words of the texts, remove stop words from the texts, and filter out and discard texts shorter than a set document length threshold;
Step 1-2, obtaining a document representation: for a word t in a document d, calculate the ratio of the number of occurrences of t in d to the total number of words in d, i.e. the term frequency TF(t, d) of the word in the document; calculate the importance of the word for the whole corpus D, i.e. the inverse document frequency IDF(t, D), by taking the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the word t plus one:

$$TF(t,d)=\frac{n_{t,d}}{\sum_{t'}n_{t',d}},\qquad IDF(t,D)=\log\frac{|D|}{1+|\{d\in D:t\in d\}|}$$

The TF-IDF weight of a word in the document and the corpus is the product of the term frequency TF(t, d) and the inverse document frequency IDF(t, D); for a given document, a bag-of-words representation x composed of the TF-IDF weights corresponding to all words t is obtained.
The invention is further improved in that the data enhancement of the bag-of-words representation x obtained in step 1 in step 2 is specifically: assume that the bag-of-words representation x is a V-dimensional vector, where the dimension V equals the vocabulary size of the corpus; set a probability p and take the L numerically smallest word entries of the vector; for the bag-of-words representation x, select among three data enhancement modes at random:
a) with probability p, reduce the value of word t by q%;
b) with probability p, increase the value of word t by q%;
c) with probability p, set the value of word t to zero.
The invention is further improved in that the encoder network in step 3 uses fully-connected layer transformations, taking the enhanced paired similar document vector representations (x^a, x^b) as input and inferring the viewpoint representation of the text; the specific implementation steps comprise:
Step 3.1, randomly sample the corpus obtained in step 2 to get V-dimensional paired similar document vector representations (x^a, x^b), input them into the encoder network, and map them into an S-dimensional implicit semantic space through the following two layers of linear transformations:

$$h_1=SN(W_1)x+b_1,\qquad a_1=\mathrm{Hardswish}(h_1)$$
$$h_2=SN(W_2)a_1+b_2,\qquad a_2=\mathrm{Hardswish}(h_2)$$

where W_1 is the weight matrix of layer one, W_2 is the weight matrix of layer two, b_1 and b_2 are bias terms, h_1 and h_2 are the hidden states of the layers, a_1 and a_2 are the representation vectors after layer activation, SN(·) is spectral normalization, and Hardswish(·) is the activation function;
Step 3.2, map the representation vector a_2 from step 3.1 through a fully-connected layer into a K-dimensional document viewpoint distribution:

$$h_\theta=W_3a_2+b_3,\qquad \theta=\mathrm{softmax}(h_\theta)$$

where W_3 and b_3 are the weight matrix and bias term of this layer, h_θ is the hidden state of the document viewpoint distribution layer, θ is the K-dimensional document viewpoint distribution corresponding to the paired similar document vector representation, and the k-th component θ_k, k ∈ {1, 2, ..., K}, of the polynomial distribution θ is the weight of the k-th viewpoint in the paired similar document vector representation.
The invention is further improved in that minimizing the invariance, variance, and covariance regularization losses and the prior loss aligned with the Dirichlet prior of the encoder network output in step 4 specifically comprises the following steps:
Step 4.1, the enhanced paired similar document vector representations (x^a, x^b) should remain similar after mapping, which is constrained by calculating an invariance regularization loss; given the inferred viewpoint distributions θ_i^a and θ_i^b, the loss L_inv is calculated as:

$$\mathcal{L}_{inv}=\frac{1}{M}\sum_{i=1}^{M}\left\|\theta_i^a-\theta_i^b\right\|_2^2$$

where M represents the batch size, i is the traversal index of the summation over the paired similar document vector representations, and θ_i^a and θ_i^b are the paired viewpoint distributions of the paired similar document vector representations (x_i^a, x_i^b);
Step 4.2, to prevent the viewpoint mapping from becoming identical for all inputs, a variance loss function is used to address the model collapse problem, calculated as:

$$\mathcal{L}_{var}=\frac{1}{K}\sum_{k=1}^{K}\max\left(0,\ \gamma-\sqrt{\mathrm{Var}(\theta^{(k)})+\epsilon}\right)$$

where θ^(k) is the vector composed of the values of the k-th dimension across all viewpoint distributions in Θ, ε is a small scalar for numerical stability, K represents the number of document viewpoints, and γ is a constant target value for the standard deviation;
Step 4.3, a covariance loss is used as a further constraint, calculated as:

$$C(\Theta)=\frac{1}{M-1}\sum_{i=1}^{M}(\theta_i-\bar{\theta})(\theta_i-\bar{\theta})^{T},\qquad \mathcal{L}_{cov}=\frac{1}{K}\sum_{j\neq j'}\left[C(\Theta)\right]_{j,j'}^{2}$$

where θ̄ is the batch mean of the viewpoint distributions, [C(Θ)]_{j,j'} denotes the entry in row j and column j' of the matrix, and T denotes the matrix transposition operation;
Step 4.4, combining the calculation formulas of step 4.1, step 4.2 and step 4.3, the three regularization losses are obtained as:

$$\mathcal{L}_{reg}=\lambda\mathcal{L}_{inv}+\mu\mathcal{L}_{var}+\eta\mathcal{L}_{cov}$$

where λ, μ, η are different hyper-parameters;
Step 4.5, constrain the output distribution of the encoder network by calculating the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution.
The invention is further improved in that step 4.5, constraining the output distribution of the encoder network by calculating the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution, comprises the following steps:
Step 4.5.1, given the set of inferred viewpoint distributions Θ (the inferred distributions of both augmented branches), randomly sample from a Dirichlet distribution with parameter vector α to obtain a prior distribution set Θ' of the same size as Θ; the formula used is:

$$\theta'_i\sim\mathrm{Dir}(\alpha),\qquad \alpha=(\alpha_1,\dots,\alpha_K)$$

where K is set to the number of viewpoints used for model training, θ'_i is a prior sample drawn from the Dirichlet distribution with parameter α, and α_i is the i-th value of the parameter vector α;
Step 4.5.2, after obtaining the set of viewpoint distributions Θ and the prior distribution set Θ', calculate the maximum mean discrepancy between the two distributions using the following formula:

$$\mathrm{MMD}^2(\Theta,\Theta')=\frac{1}{n(n-1)}\sum_{i\neq j}k(\theta_i,\theta_j)+\frac{1}{n(n-1)}\sum_{i\neq j}k(\theta'_i,\theta'_j)-\frac{2}{n^2}\sum_{i,j}k(\theta_i,\theta'_j)$$

where k(·,·) represents the kernel function, n = 2M, θ_i and θ_j are viewpoint distributions produced by the encoder, and θ'_i and θ'_j are viewpoint distributions sampled from the prior distribution;
according to the above calculation formulas, the total loss of model training is obtained as:

$$\mathcal{L}=\lambda\mathcal{L}_{inv}+\mu\mathcal{L}_{var}+\eta\mathcal{L}_{cov}+\beta\mathcal{L}_{MMD}$$

where β is a hyper-parameter used in model training and L_MMD is the prior matching loss calculated using the maximum mean discrepancy distance.
The invention is further improved in that constraining the model parameter change in step 4 and iterating continuously until the loss function converges, the model training specifically comprises the following steps:
Step 4.2.1, construct the encoder network, optimize the model with a suitable optimizer, and specify the model hyper-parameters λ, μ, η, β, where the value of each hyper-parameter is greater than zero;
Step 4.2.2, randomly sample the input of the encoder network from the corpus, sample the prior distribution from the Dirichlet distribution with parameter α, and take the viewpoint distribution of the text from the encoder network output;
Step 4.2.3, perform stochastic gradient descent by optimizing the three regularization losses and the Dirichlet prior matching loss, updating the parameters of the encoder network;
Step 4.2.4, repeat step 4.2.2 and step 4.2.3 until the model converges.
The beneficial effects of the invention are as follows: by modeling the viewpoints of texts, formalizing viewpoints with topic words, and fully exploiting the advantages of self-supervised learning, the invention obtains efficient viewpoint representations;
by optimizing the three regularization losses and aligning the model output with the Dirichlet prior, the invention obtains high-quality viewpoints and mines diverse viewpoint representations.
The invention designs specific loss functions that yield high-quality and diverse viewpoints from different optimization angles, improving the quality of the learned viewpoint representations and addressing common problems in self-supervised learning such as model collapse.
The invention provides an innovative and effective solution for viewpoint analysis under hot topics.
Drawings
FIG. 1 is a specific flow chart of an embodiment of the present invention.
FIG. 2 is a structural diagram of the neural network model of the present invention.
Detailed Description
The application will be further illustrated with reference to the drawings and the detailed description. It should be understood that the following specific examples are intended to illustrate the application and not to limit its scope, and that various equivalent modifications made by those skilled in the art after reading the application shall fall within the scope defined by the appended claims.
As shown in FIG. 1 and FIG. 2, the invention is a viewpoint mining method based on self-supervised representation learning for hot topics, which comprises the following steps:
Step 1, preprocess the collected social media comment texts and obtain a bag-of-words representation x of each document using the TF-IDF weighting scheme, which specifically comprises the following steps:
Step 1-1, data preprocessing: collect the content of public comments from a social media platform, parse out the meaningful comment entities in the content, remove texts that do not meet the language category requirement, perform lemmatization and spell checking on the words of the texts, remove stop words from the texts, and filter out and discard texts shorter than a set document length threshold;
Step 1-2, obtaining a document representation: for a word t in a document d, calculate the ratio of the number of occurrences of t in d to the total number of words in d, i.e. the term frequency TF(t, d) of the word in the document; calculate the importance of the word for the whole corpus D, i.e. the inverse document frequency IDF(t, D), by taking the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the word t plus one:

$$TF(t,d)=\frac{n_{t,d}}{\sum_{t'}n_{t',d}},\qquad IDF(t,D)=\log\frac{|D|}{1+|\{d\in D:t\in d\}|}$$

The TF-IDF weight of a word in the document and the corpus is the product of the term frequency TF(t, d) and the inverse document frequency IDF(t, D); for a given document, a bag-of-words representation x composed of the TF-IDF weights corresponding to all words t is obtained.
Step 2, apply data enhancement to the bag-of-words representation x obtained in step 1 to obtain paired similar document vector representations (x^a, x^b); the data enhancement is specifically as follows (a sketch follows the list): assume that the bag-of-words representation x is a V-dimensional vector, where the dimension V equals the vocabulary size of the corpus; set a probability p and take the L numerically smallest word entries of the vector; for the bag-of-words representation x, select among three data enhancement modes at random:
a) with probability p, reduce the value of word t by q%;
b) with probability p, increase the value of word t by q%;
c) with probability p, set the value of word t to zero.
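The per-word sampling scheme below is one plausible reading of this enhancement step, assuming the patent's p (selection probability), q (percentage change) and L (number of smallest entries considered) parameters; it is a sketch, not the definitive implementation.

```python
import numpy as np

def augment_bow(x, p=0.1, q=20, L=100, rng=None):
    """Return one augmented view of a V-dimensional TF-IDF vector x."""
    rng = rng or np.random.default_rng()
    x_aug = x.astype(float)
    nonzero = np.flatnonzero(x_aug)
    # candidates: the L numerically smallest non-zero entries of the vector
    candidates = nonzero[np.argsort(x_aug[nonzero])][:L]
    for t in candidates:
        if rng.random() < p:
            mode = rng.integers(3)             # pick one of the three modes at random
            if mode == 0:
                x_aug[t] *= 1.0 - q / 100.0    # a) reduce the value of word t by q%
            elif mode == 1:
                x_aug[t] *= 1.0 + q / 100.0    # b) increase the value of word t by q%
            else:
                x_aug[t] = 0.0                 # c) set the value of word t to zero
    return x_aug

# paired similar views of the same document:
# x_a, x_b = augment_bow(x), augment_bow(x)
```

Applying the function twice to the same document yields the paired similar representations (x^a, x^b) used in step 3.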
Step 3, take the enhanced paired similar document vector representations (x^a, x^b) obtained in step 2 as input to the encoder network and derive the output of the encoder network, the output being a vector representation of the viewpoint distribution of the input document; the specific implementation steps comprise:
Step 3.1, randomly sample the corpus obtained in step 2 to get V-dimensional paired similar document vector representations (x^a, x^b), input them into the encoder network, and map them into an S-dimensional implicit semantic space through the following two layers of linear transformations:

$$h_1=SN(W_1)x+b_1,\qquad a_1=\mathrm{Hardswish}(h_1)$$
$$h_2=SN(W_2)a_1+b_2,\qquad a_2=\mathrm{Hardswish}(h_2)$$

where W_1 is the weight matrix of layer one, W_2 is the weight matrix of layer two, b_1 and b_2 are bias terms, h_1 and h_2 are the hidden states of the layers, a_1 and a_2 are the representation vectors after layer activation, SN(·) is spectral normalization, and Hardswish(·) is the activation function;
Step 3.2, map the representation vector a_2 from step 3.1 through a fully-connected layer into a K-dimensional document viewpoint distribution:

$$h_\theta=W_3a_2+b_3,\qquad \theta=\mathrm{softmax}(h_\theta)$$

where W_3 and b_3 are the weight matrix and bias term of this layer, h_θ is the hidden state of the document viewpoint distribution layer, θ is the K-dimensional document viewpoint distribution corresponding to the paired similar document vector representation, and the k-th component θ_k, k ∈ {1, 2, ..., K}, of the polynomial distribution θ is the weight of the k-th viewpoint in the paired similar document vector representation.
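A PyTorch sketch of the encoder of steps 3.1 and 3.2: two spectrally normalized fully-connected layers with the Hardswish activation, followed by a softmax head that produces the K-dimensional viewpoint distribution. The softmax head and all layer sizes are assumptions consistent with the description, not confirmed details of the patent.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ViewpointEncoder(nn.Module):
    """Map a V-dim TF-IDF vector to a K-dim document viewpoint distribution."""
    def __init__(self, V, S, K):
        super().__init__()
        self.fc1 = spectral_norm(nn.Linear(V, S))    # h1 = SN(W1) x + b1
        self.fc2 = spectral_norm(nn.Linear(S, S))    # h2 = SN(W2) a1 + b2
        self.head = nn.Linear(S, K)                  # h_theta = W3 a2 + b3
        self.act = nn.Hardswish()

    def forward(self, x):
        a1 = self.act(self.fc1(x))                   # S-dim implicit semantic space
        a2 = self.act(self.fc2(a1))
        return torch.softmax(self.head(a2), dim=-1)  # viewpoint distribution theta
```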
Step 4, constrain the change of the model parameters by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with the prior loss aligned with the Dirichlet prior, and iterate continuously until the loss function converges, so as to ensure the stability of the model and the accuracy of viewpoint mining.
Specifically, minimizing the invariance, variance, and covariance regularization losses and the prior loss aligned with the Dirichlet prior of the encoder network output comprises the following steps:
Step 4.1, the enhanced paired similar document vector representations (x^a, x^b) should remain similar after mapping, which is constrained by calculating an invariance regularization loss; given the inferred viewpoint distributions θ_i^a and θ_i^b, the loss L_inv is calculated as:

$$\mathcal{L}_{inv}=\frac{1}{M}\sum_{i=1}^{M}\left\|\theta_i^a-\theta_i^b\right\|_2^2$$

where M represents the batch size, i is the traversal index of the summation over the paired similar document vector representations, and θ_i^a and θ_i^b are the paired viewpoint distributions of the paired similar document vector representations (x_i^a, x_i^b);
Step 4.2, to prevent the viewpoint mapping from becoming identical for all inputs, a variance loss function is used to address the model collapse problem, calculated as:

$$\mathcal{L}_{var}=\frac{1}{K}\sum_{k=1}^{K}\max\left(0,\ \gamma-\sqrt{\mathrm{Var}(\theta^{(k)})+\epsilon}\right)$$

where θ^(k) is the vector composed of the values of the k-th dimension across all viewpoint distributions in Θ, ε is a small scalar for numerical stability, K represents the number of document viewpoints, and γ is a constant target value for the standard deviation;
Step 4.3, a covariance loss is used as a further constraint, calculated as:

$$C(\Theta)=\frac{1}{M-1}\sum_{i=1}^{M}(\theta_i-\bar{\theta})(\theta_i-\bar{\theta})^{T},\qquad \mathcal{L}_{cov}=\frac{1}{K}\sum_{j\neq j'}\left[C(\Theta)\right]_{j,j'}^{2}$$

where θ̄ is the batch mean of the viewpoint distributions, [C(Θ)]_{j,j'} denotes the entry in row j and column j' of the matrix, and T denotes the matrix transposition operation;
Step 4.4, combining the calculation formulas of step 4.1, step 4.2 and step 4.3, the three regularization losses are obtained as:

$$\mathcal{L}_{reg}=\lambda\mathcal{L}_{inv}+\mu\mathcal{L}_{var}+\eta\mathcal{L}_{cov}$$

where λ, μ, η are different hyper-parameters;
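The three regularization losses of steps 4.1 to 4.4 match the VICReg-style formulation suggested by the description; the sketch below is a reading of those formulas in which the target standard deviation γ = 1 and the stability scalar ε are assumed values, and the variance and covariance terms are applied to both augmented branches.

```python
import torch
import torch.nn.functional as F

def regularization_losses(theta_a, theta_b, gamma=1.0, eps=1e-4):
    """Invariance / variance / covariance losses for paired (M, K) batches."""
    M, K = theta_a.shape
    # step 4.1: invariance, paired views should map to similar distributions
    l_inv = F.mse_loss(theta_a, theta_b)

    def var_term(z):
        # step 4.2: hinge loss on the per-dimension standard deviation
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.relu(gamma - std).mean()

    def cov_term(z):
        # step 4.3: penalize off-diagonal entries of the covariance matrix
        zc = z - z.mean(dim=0)
        cov = (zc.T @ zc) / (M - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return (off_diag ** 2).sum() / K

    l_var = var_term(theta_a) + var_term(theta_b)
    l_cov = cov_term(theta_a) + cov_term(theta_b)
    return l_inv, l_var, l_cov
```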
Step 4.5, constrain the output distribution of the encoder network by calculating the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution, specifically:
given the set of inferred viewpoint distributions Θ (the inferred distributions of both augmented branches), randomly sample from a Dirichlet distribution with parameter vector α to obtain a prior distribution set Θ' of the same size as Θ; the formula used is:

$$\theta'_i\sim\mathrm{Dir}(\alpha),\qquad \alpha=(\alpha_1,\dots,\alpha_K)$$

where K is set to the number of viewpoints used for model training, θ'_i is a prior sample drawn from the Dirichlet distribution with parameter α, and α_i is the i-th value of the parameter vector α;
after obtaining the set of viewpoint distributions Θ and the prior distribution set Θ', the maximum mean discrepancy between the two distributions is calculated using the following formula:

$$\mathrm{MMD}^2(\Theta,\Theta')=\frac{1}{n(n-1)}\sum_{i\neq j}k(\theta_i,\theta_j)+\frac{1}{n(n-1)}\sum_{i\neq j}k(\theta'_i,\theta'_j)-\frac{2}{n^2}\sum_{i,j}k(\theta_i,\theta'_j)$$

where k(·,·) represents the kernel function, n = 2M, θ_i and θ_j are viewpoint distributions produced by the encoder, and θ'_i and θ'_j are viewpoint distributions sampled from the prior distribution;
according to the above calculation formulas, the total loss of model training is obtained as:

$$\mathcal{L}=\lambda\mathcal{L}_{inv}+\mu\mathcal{L}_{var}+\eta\mathcal{L}_{cov}+\beta\mathcal{L}_{MMD}$$

where β is a hyper-parameter used in model training and L_MMD is the prior matching loss calculated using the maximum mean discrepancy distance.
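A sketch of the prior matching loss of step 4.5: sample Θ' from Dir(α) and compare it with the inferred distributions via the MMD. The Gaussian kernel and its bandwidth are assumptions, since the patent does not name the kernel it uses.

```python
import torch

def mmd_prior_loss(theta, alpha, bandwidth=1.0):
    """MMD between inferred viewpoint distributions theta (n, K) and
    n samples from a Dirichlet(alpha) prior, with a Gaussian kernel."""
    n = theta.shape[0]
    prior = torch.distributions.Dirichlet(alpha).sample((n,))   # theta'
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))
    k_xx, k_yy, k_xy = k(theta, theta), k(prior, prior), k(theta, prior)
    # unbiased estimator: drop the diagonal terms of the within-set sums
    return ((k_xx.sum() - k_xx.diagonal().sum()) / (n * (n - 1))
            + (k_yy.sum() - k_yy.diagonal().sum()) / (n * (n - 1))
            - 2.0 * k_xy.mean())
```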
In step 4, the model parameter change is constrained and iteration continues until the loss function converges; the model training specifically comprises the following steps (an assembled sketch follows these steps):
Step 4.2.1, construct the encoder network, optimize the model with a suitable optimizer, and specify the model hyper-parameters λ, μ, η, β, where the value of each hyper-parameter is greater than zero;
Step 4.2.2, randomly sample the input of the encoder network from the corpus, sample the prior distribution from the Dirichlet distribution with parameter α, and take the viewpoint distribution of the text from the encoder network output;
Step 4.2.3, perform stochastic gradient descent by optimizing the three regularization losses and the Dirichlet prior matching loss, updating the parameters of the encoder network;
Step 4.2.4, repeat step 4.2.2 and step 4.2.3 until the model converges.
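Assembling steps 4.2.1 to 4.2.4 gives the training-loop sketch below, under assumed hyper-parameter values; `sample_minibatch` and `augment_batch` are hypothetical helpers standing in for the corpus sampling of step 4.2.2 and a batched version of the step 2 enhancement.

```python
import torch

V, S, K, M = 5000, 256, 20, 128              # assumed sizes
lam, mu, eta, beta = 1.0, 1.0, 0.04, 1.0     # lambda, mu, eta, beta, all > 0
alpha = torch.full((K,), 0.1)                # Dirichlet prior parameter

encoder = ViewpointEncoder(V, S, K)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

for step in range(10000):
    x = sample_minibatch(M)                  # hypothetical: (M, V) TF-IDF batch
    theta_a = encoder(augment_batch(x))      # paired similar views of the batch
    theta_b = encoder(augment_batch(x))
    l_inv, l_var, l_cov = regularization_losses(theta_a, theta_b)
    l_mmd = mmd_prior_loss(torch.cat([theta_a, theta_b]), alpha)  # n = 2M
    loss = lam * l_inv + mu * l_var + eta * l_cov + beta * l_mmd  # total loss
    opt.zero_grad()
    loss.backward()                          # stochastic gradient descent step
    opt.step()
```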
In order to verify the benefits of the invention, experiments were performed on comment texts about ChatGPT, a recent hot-topic large language model. Because the invention is based on topic model technology, the quality of the mined viewpoint representations is evaluated with topic coherence and diversity indexes. The average topic coherence values tested on the User Query dataset are: CP 0.2983, NPMI 0.0581, UCI 0.4084, and UT 0.9870; higher values indicate better performance, and all indexes are higher than those of the comparative experiments, where the best comparative results are CP 0.2834, NPMI 0.0397, UCI 0.3096 and UT 0.960. The comparative models used in this experiment are Comparative Example 1, Comparative Example 2 and Comparative Example 3, wherein:
Comparative Example 1: the LDA method in Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993-1022;
Comparative Example 2: the ETM method in Dieng, A. B., Ruiz, F. J. R., & Blei, D. M. (2020). Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguistics, 8, 439-453;
Comparative Example 3: the CNTM method in Nguyen, T., & Luu, A. T. (2021). Contrastive learning for neural topic model. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021), pp. 11974-11986.
In the comparison of the experimental results of the invention, the UT (topic uniqueness) index is calculated as

$$UT=\frac{1}{K}\sum_{k=1}^{K}\frac{1}{L}\sum_{l=1}^{L}\frac{1}{\mathrm{cnt}(l,k)}$$

where cnt(l, k) is the number of topics in which the l-th top word of topic k appears, so a word that occurs in only one topic contributes its full weight. The CP, NPMI and UCI indexes are public evaluation metrics for topic quality, i.e. the degree of semantic coherence of the topics; they are widely used in scientific experiments in the field of topic modeling and were all proposed in the research paper of Röder (Röder, M., Both, A., & Hinneburg, A. Exploring the space of topic coherence measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 2015: 399-408).
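A small sketch of the UT computation as reconstructed above; whether this normalization matches the patent's exact UT definition is an assumption.

```python
from collections import Counter

def topic_uniqueness(top_words):
    """UT over K topics, each given as its list of top-L words: the average
    of 1 / cnt(l, k), where cnt counts the topics a word appears in."""
    cnt = Counter(w for topic in top_words for w in set(topic))
    K, L = len(top_words), len(top_words[0])
    return sum(1.0 / cnt[w] for topic in top_words for w in topic) / (K * L)

# topic_uniqueness([["price", "cost"], ["price", "speed"]]) == 0.75
```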
The foregoing is only a preferred embodiment of the invention. It should be noted that various improvements and modifications may be made by those skilled in the art without departing from the principles of the present invention, and such improvements and modifications shall also fall within the scope of protection of the invention.

Claims (6)

1. A viewpoint mining method based on self-supervised representation learning for hot topics, characterized in that the viewpoint mining method comprises the following steps:
Step 1, preprocess the collected social media comment texts and obtain a bag-of-words representation x of each document using the TF-IDF weighting scheme;
Step 2, apply data enhancement to the bag-of-words representation x obtained in step 1 to obtain paired similar document vector representations (x^a, x^b);
Step 3, take the enhanced paired similar document vector representations (x^a, x^b) obtained in step 2 as input to the encoder network and derive the output of the encoder network, the output being a vector representation of the viewpoint distribution of the input document;
Step 4, constrain the change of the model parameters by minimizing the invariance, variance, and covariance regularization losses of the encoder network output together with the prior loss aligned with the Dirichlet prior, and iterate continuously until the loss function converges, so as to ensure the stability of the model and the accuracy of viewpoint mining, which specifically comprises the following steps:
Step 4.1, the enhanced paired similar document vector representations (x^a, x^b) should remain similar after mapping, which is constrained by calculating an invariance regularization loss; given the inferred viewpoint distributions θ_i^a and θ_i^b, the loss L_inv is calculated as:

$$\mathcal{L}_{inv}=\frac{1}{M}\sum_{i=1}^{M}\left\|\theta_i^a-\theta_i^b\right\|_2^2$$

where M represents the batch size, i is the traversal index of the summation over the paired similar document vector representations, and θ_i^a and θ_i^b are the paired viewpoint distributions of the paired similar document vector representations (x_i^a, x_i^b);
Step 4.2, to prevent the viewpoint mapping from becoming identical for all inputs, a variance loss function is used to address the model collapse problem, calculated as:

$$\mathcal{L}_{var}=\frac{1}{K}\sum_{k=1}^{K}\max\left(0,\ \gamma-\sqrt{\mathrm{Var}(\theta^{(k)})+\epsilon}\right)$$

where θ^(k) is the vector composed of the values of the k-th dimension across all viewpoint distributions in Θ, ε is a small scalar for numerical stability, K represents the number of document viewpoints, and γ is a constant target value for the standard deviation;
Step 4.3, a covariance loss is used as a further constraint, calculated as:

$$C(\Theta)=\frac{1}{M-1}\sum_{i=1}^{M}(\theta_i-\bar{\theta})(\theta_i-\bar{\theta})^{T},\qquad \mathcal{L}_{cov}=\frac{1}{K}\sum_{j\neq j'}\left[C(\Theta)\right]_{j,j'}^{2}$$

where θ̄ is the batch mean of the viewpoint distributions, j denotes the j-th column of the matrix, and T denotes the matrix transposition operation;
Step 4.4, combining the calculation formulas of step 4.1, step 4.2 and step 4.3, the three regularization losses are obtained as:

$$\mathcal{L}_{reg}=\lambda\mathcal{L}_{inv}+\mu\mathcal{L}_{var}+\eta\mathcal{L}_{cov}$$

where λ, μ, η are different hyper-parameters;
Step 4.5, constrain the output distribution of the encoder network by calculating the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution.
2. The viewpoint mining method based on self-supervised representation learning for hot topics as claimed in claim 1, characterized in that step 1 specifically comprises the following steps:
Step 1-1, data preprocessing: collect the content of public comments from a social media platform, parse out the meaningful comment entities in the content, remove texts that do not meet the language category requirement, perform lemmatization and spell checking on the words of the texts, remove stop words from the texts, and filter out and discard texts shorter than a set document length threshold;
Step 1-2, obtaining a document representation: for a word t in a document, calculate the ratio of the number of occurrences of the word t in the document D to the total number of words in the document D, i.e. the term frequency TF(t, D) of the word t in the document; calculate the importance of the word for the whole corpus, i.e. the inverse document frequency IDF(t, D), by taking the logarithm of the ratio of the total number of documents in the corpus to the number of documents containing the word t plus one; the TF-IDF weight of the word in the document and the corpus is calculated as the product of the term frequency TF(t, D) and the inverse document frequency IDF(t, D); for a given document, a bag-of-words representation x composed of the TF-IDF weights corresponding to all words t is obtained.
3. The viewpoint mining method based on self-supervised representation learning for hot topics as claimed in claim 1, characterized in that the data enhancement of the bag-of-words representation x obtained in step 1 in step 2 is specifically: assume that the bag-of-words representation x is a V-dimensional vector, where the dimension V equals the vocabulary size of the corpus; set a probability p and take the L numerically smallest word entries of the vector; for the bag-of-words representation x, select among three data enhancement modes at random:
a) with probability p, reduce the value of word t by q%;
b) with probability p, increase the value of word t by q%;
c) with probability p, set the value of word t to zero.
4. The viewpoint mining method based on self-supervised representation learning for hot topics as claimed in claim 1, characterized in that the encoder network in step 3 uses fully-connected layer transformations, taking the enhanced paired similar document vector representations (x^a, x^b) as input and inferring the viewpoint representation of the text; the specific implementation steps comprise:
Step 3.1, randomly sample the corpus obtained in step 2 to get V-dimensional paired similar document vector representations (x^a, x^b), input them into the encoder network, and map them into an S-dimensional implicit semantic space through the following two layers of linear transformations:

$$h_1=SN(W_1)x+b_1,\qquad a_1=\mathrm{HARDSWISH}(h_1)$$
$$h_2=SN(W_2)a_1+b_2,\qquad a_2=\mathrm{HARDSWISH}(h_2)$$

where W_1 is the weight matrix of layer one, W_2 is the weight matrix of layer two, b_1 and b_2 are bias terms, h_1 and h_2 are the hidden states of the layers, a_1 and a_2 are the representation vectors after layer activation, SN(·) is spectral normalization, and HARDSWISH(·) is the activation function;
Step 3.2, map the representation vector a_2 from step 3.1 through a fully-connected layer into a K-dimensional document viewpoint distribution:

$$h_\theta=W_3a_2+b_3,\qquad \theta=\mathrm{softmax}(h_\theta)$$

where W_3 and b_3 are the weight matrix and bias term of this layer, h_θ is the hidden state of the document viewpoint distribution layer, θ is the K-dimensional document viewpoint distribution corresponding to the paired similar document vector representation, and the k-th component θ_k, k ∈ {1, 2, ..., K}, of the polynomial distribution θ is the weight of the k-th viewpoint in the paired similar document vector representation.
5. The viewpoint mining method based on self-supervised representation learning for hot topics as claimed in claim 1, characterized in that step 4.5, constraining the output distribution of the encoder network by calculating the maximum mean discrepancy between the inferred document viewpoint distributions and the Dirichlet prior distribution, specifically comprises the following steps:
Step 4.5.1, given the set of inferred viewpoint distributions Θ, randomly sample from a Dirichlet distribution with parameter vector α to obtain a prior distribution set Θ' of the same size as Θ; the formula used is:

$$\theta'_i\sim\mathrm{Dir}(\alpha),\qquad \alpha=(\alpha_1,\dots,\alpha_K)$$

where K is set to the number of viewpoints used for model training, θ'_i is a prior sample drawn from the Dirichlet distribution with parameter α, and α_i is the i-th value of the parameter vector α;
Step 4.5.2, after obtaining the set of viewpoint distributions Θ and the prior distribution set Θ', calculate the maximum mean discrepancy between the two distributions using the following formula:

$$\mathrm{MMD}^2(\Theta,\Theta')=\frac{1}{n(n-1)}\sum_{i\neq j}k(\theta_i,\theta_j)+\frac{1}{n(n-1)}\sum_{i\neq j}k(\theta'_i,\theta'_j)-\frac{2}{n^2}\sum_{i,j}k(\theta_i,\theta'_j)$$

where k(·,·) represents the kernel function, n = 2M, θ_i and θ_j are viewpoint distributions produced by the encoder, and θ'_i and θ'_j are viewpoint distributions sampled from the prior distribution;
according to the above calculation formulas, the total loss of model training is obtained as:

$$\mathcal{L}=\lambda\mathcal{L}_{inv}+\mu\mathcal{L}_{var}+\eta\mathcal{L}_{cov}+\beta\mathcal{L}_{MMD}$$

where β is a hyper-parameter used in model training and L_MMD is the prior matching loss calculated using the maximum mean discrepancy distance.
6. The viewpoint mining method based on self-supervised representation learning for hot topics as claimed in claim 1, characterized in that constraining the model parameter change in step 4 and iterating continuously until the loss function converges, the model training specifically comprises the following steps:
Step 4.2.1, construct the encoder network, optimize the model with an optimizer, and specify the model hyper-parameters λ, μ, η, β, where the values of the hyper-parameters λ, μ, η, β are all greater than zero;
Step 4.2.2, randomly sample the input of the encoder network from the corpus, sample the prior distribution from the Dirichlet distribution with parameter α, and take the viewpoint distribution of the text from the encoder network output;
Step 4.2.3, perform stochastic gradient descent by optimizing the three regularization losses and the Dirichlet prior matching loss, updating the parameters of the encoder network;
Step 4.2.4, repeat step 4.2.2 and step 4.2.3 until the model converges.
Priority application: CN202410226614.5A, priority and filing date 2024-02-29, "Viewpoint mining method based on self-supervised representation learning for hot topics", status Active, granted as CN117808104B.

Publications (2)

Publication number  Publication date
CN117808104A       2024-04-02
CN117808104B       2024-04-30

Family ID: 90420377
Country: China (CN)

Patent Citations (5)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN110941721A * | 2019-09-28 | 2020-03-31 | 国家计算机网络与信息安全管理中心 | Short text topic mining method and system based on a variational auto-encoding topic model
CN113051932A * | 2021-04-06 | 2021-06-29 | 合肥工业大学 | Method for detecting the category of network media events with a semantically and knowledge-extended topic model
CN115099188A * | 2022-06-22 | 2022-09-23 | 南京邮电大学 | Topic mining method based on word embedding and a generative neural network
CN116150669A * | 2022-12-02 | 2023-05-23 | 大连海事大学 | Mashup service multi-label classification method based on dual-stream regularized broad learning
CN117236330A * | 2023-11-16 | 2023-12-15 | 南京邮电大学 | Method for enhancing topic diversity based on mutual information and adversarial neural networks

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party

Title
Text representation and classification algorithm based on adversarial training; 张晓辉, 于双元, 王全新, 徐保民; Computer Science; 2020-06-15 (No. S1); full text *
Multi-source text topic mining model based on the Dirichlet multinomial allocation model; 徐立洋, 黄瑞章, 陈艳平, 钱志森, 黎万英; Journal of Computer Applications; 2018-11-10 (No. 11); full text *

Also Published As

Publication number Publication date
CN117808104A (en) 2024-04-02


Legal Events

Code  Title
PB01  Publication
SE01  Entry into force of request for substantive examination
GR01  Patent grant