CN117236330B - Method for enhancing topic diversity based on mutual information and adversarial neural networks - Google Patents

Method for enhancing topic diversity based on mutual information and adversarial neural networks

Info

Publication number
CN117236330B
CN117236330B
Authority
CN
China
Prior art keywords
distribution
text
topic
real
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311524544.3A
Other languages
Chinese (zh)
Other versions
CN117236330A (en)
Inventor
王睿
郝仁
刘星
黄海平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202311524544.3A priority Critical patent/CN117236330B/en
Publication of CN117236330A publication Critical patent/CN117236330A/en
Application granted granted Critical
Publication of CN117236330B publication Critical patent/CN117236330B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention belongs to the technical field of natural language processing and discloses a method for enhancing topic diversity based on mutual information and an adversarial neural network, comprising the following steps: preprocessing the words in the corpus to obtain real text-word distributions; using randomly sampled corpus texts as encoder input to generate real text-topic distribution vectors; pairing real text-word distributions with their topic distributions to form real distribution pairs, and shuffling within the batch to form negative sample distribution pairs; sampling fake text-topic distributions from a Dirichlet distribution and converting them with a generator into fake text-word distribution vectors; generating topic words during adversarial training using the real and fake distribution pairs; and training with the objective of the discriminator loss function plus a regularization loss that maximizes mutual information. The invention models text topics, mines high-quality topics, integrates a mutual-information-maximization technique into the adversarial neural topic modeling process, enhances topic diversity, and achieves higher topic coherence and diversity scores.

Description

Method for enhancing topic diversity based on mutual information and adversarial neural networks
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a method for enhancing topic diversity based on mutual information and an adversarial neural network.
Background
The topic model is an important tool for text mining that uncovers hidden information in a corpus, with wide application in scenarios such as topic aggregation, information extraction from unstructured text, and feature selection. Latent Dirichlet allocation (LDA) is the most representative such model for inferring the topic distribution of text. However, because the model is complex to solve and difficult to adjust and extend, researchers must design a corresponding theoretical method for each variant, which hinders subsequent topic modeling at the application level.
To address the shortcomings of traditional topic models, and building on the rapid development of generative neural networks in recent years, neural topic models have attracted the attention of many scholars in the fields of text mining and natural language processing and have been studied intensively; for example, an adversarial neural topic model and a bidirectional adversarial neural topic model have been proposed based on adversarial training. These models use the Dirichlet distribution as the prior over the topic space, and their encoders and generators produce more realistic data distributions and more accurate topic representations, but they ignore valuable information between the generated data distribution and the real data distribution, resulting in insufficient topic diversity.
Disclosure of Invention
To solve the above technical problems, the invention provides a method for enhancing topic diversity based on mutual information and an adversarial neural network, which makes the implicit topic information in text obey a Dirichlet distribution and integrates a mutual-information-maximization mechanism into an adversarial neural topic modeling framework to improve the diversity of the topics mined by the model.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
the invention relates to a method for enhancing theme diversity based on mutual information and an antagonistic neural network, which comprises the following steps:
S1: perform data preprocessing on online social-platform text to obtain real text, and represent the real text as real text-word distribution vectors using a bag-of-words model;
S2: place a number of real text-word distribution vectors in a batch as input to an encoder to obtain real text-topic distribution vectors; form real distribution pairs from the real text-word distribution vectors and their corresponding topic distributions, and form negative sample distribution pairs by shuffling the real text-word distribution vectors within the batch against the real text-topic distributions;
S3: randomly sample a topic vector from a Dirichlet distribution as a fake text-topic distribution, input it into a generator to obtain a fake text-word distribution vector, and combine the two into a fake distribution pair;
S4: use the real and fake distribution pairs as input to the adversarial generation network and the real and negative sample pairs as input to the statistical network; during adversarial training, train the encoder and generator with the signals produced by the adversary, with the discriminator loss regularized by the mutual-information-maximization term as the training objective.
S5: to approximate the earth mover's (Wasserstein) distance and the Jensen-Shannon divergence between the two high-dimensional distributions during training, repeatedly optimize and iterate the training objective during adversarial training until the loss function converges.
The invention is further improved in that: the encoder E in step 2 trains the mapping from the real text-word distribution vectors to the real text-topic distribution vectors, and comprises a V-dimensional text-word distribution layer, an S-dimensional semantic-implicit representation layer, and a K-dimensional text-topic distribution layer; the specific steps are:
S2.1: represent the real text from step 1 with the bag-of-words model and randomly sample a V-dimensional text-word distribution representation $x_r$ as input; the encoder E maps it to the S-dimensional implicit semantic space and then maps the resulting S-dimensional implicit semantic space to the K-dimensional text-topic distribution layer:

$$h_e = \mathrm{BN}\big(\mathrm{LR}(W_e^{(1)} x_r + b_e^{(1)})\big), \qquad \theta_r = \mathrm{softmax}\big(W_e^{(2)} h_e + b_e^{(2)}\big)$$

where $W_e^{(1)}$ is the weight matrix and $b_e^{(1)}$ the bias term from the text-word distribution layer to the semantic-implicit representation layer, LR is the LeakyReLU activation function, BN(·) is batch normalization, $W_e^{(2)}$ is the weight matrix and $b_e^{(2)}$ the bias term from the semantic-implicit representation layer to the text-topic distribution layer, and $\theta_r$ is the text-topic distribution corresponding to the real text, whose $k$-th dimension $\theta_r^{(k)}$, $k \in \{1,2,\dots,K\}$, indicates the proportion of the $k$-th topic in the real text;
S2.2: then splice the real V-dimensional word distribution vector and the real K-dimensional topic distribution vector into a real distribution pair $p_r = (x_r, \theta_r)$, denote the within-batch shuffled real text-word distribution vector as $\tilde{x}_r$, and form negative sample distribution pairs $p_n = (\tilde{x}_r, \theta_r)$ from the mismatched topic and word distributions within the batch.
The generator G in step 3 generates the mapping from text-topic distributions to text-word distributions, and comprises a K-dimensional text-topic distribution layer, an S-dimensional semantic-implicit representation layer, and a V-dimensional text-word distribution layer; a Dirichlet distribution with parameter $\alpha$ is used as the prior for the fake text-topic distribution $\theta_f$, obtained using the following formula:

$$p(\theta_f \mid \alpha) = \frac{\Gamma\big(\sum_{k=1}^{K}\alpha_k\big)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{f,k}^{\,\alpha_k - 1}$$

where $\alpha$ is the parameter of the probability density of the Dirichlet distribution, $k \in \{1,2,\dots,K\}$ indexes the topic parameters of the model, and $\theta_{f,k}$ represents the proportion of the $k$-th topic in the fake text.
S3.1: the generator G first converts the fake text-topic distribution $\theta_f$ to the S-dimensional semantic-implicit representation layer, then maps the obtained S-dimensional implicit semantic space to the V-dimensional text-word distribution layer:

$$h_g = \mathrm{BN}\big(\mathrm{LR}(W_g^{(1)} \theta_f + b_g^{(1)})\big), \qquad x_f = \mathrm{softmax}\big(W_g^{(2)} h_g + b_g^{(2)}\big)$$

where $W_g^{(1)}$ is the weight matrix and $b_g^{(1)}$ the bias term from the text-topic distribution layer to the semantic-implicit representation layer, LR is the LeakyReLU activation function, BN(·) is batch normalization, $W_g^{(2)}$ is the weight matrix and $b_g^{(2)}$ the bias term from the semantic-implicit representation layer to the text-word distribution layer, and $x_f$ is the fake text-word distribution, whose $v$-th dimension represents the probability of the $v$-th vocabulary word in the fake text;
S3.2: then splice the fake text-topic distribution $\theta_f$ and the fake text-word distribution $x_f$ into a fake distribution pair $p_f = (x_f, \theta_f)$.
S4.1: the real distribution pairs $p_r$ and fake distribution pairs $p_f$ in step 4 are regarded as random samples drawn from two $(K+V)$-dimensional joint distributions $\mathbb{P}_r$ and $\mathbb{P}_f$, each a joint distribution composed of a K-dimensional Dirichlet distribution and a V-dimensional Dirichlet distribution. Training the adversarial generation network makes the fake joint distribution $\mathbb{P}_f$ approximate the real joint distribution $\mathbb{P}_r$; the statistical network uses real sample pairs $p_r$ and negative sample pairs $p_n$ to estimate the mutual information between the text-word distribution space and the text-topic distribution space and maximizes it to promote topic diversity. When training is complete, the encoder E and the generator G realize the bidirectional mapping relation and the internal mutual-information-maximization relation between the text-topic distributions and the text-word distributions; the specific steps are:
s4.2 discriminatorIs composed of three layers of full-connection networks including one->+/>A combination distribution layer of dimensions +.>Semantic-implicit representation layer of dimension, one output layer. In true distribution pair->And pseudo distribution pair->For inputting and outputting +.>To judge the true or false of the input distribution pair, the method adopts the following formula:
wherein,for bulldozer distance>For the output signal of the arbiter, a value close to 1 indicates that the arbiter is more prone to discriminate it as true, and vice versa;
s4.3 statistical networkComprising a global arbiter->And maximizing the mutual information loss function, global arbiter +.>Comprises a->+/>A combination distribution layer of dimensions +.>Semantic-implicit representation layer of dimension, one output layer. Statistical network->For calculating the true sample pair +.>And negative sample pair->Mutual trust between each otherOutput->The method adopts the following formula:
wherein,representation->Activating function->Input representing an activation function->And->Representing the true data distribution of the text-word distribution layer and the true distribution of the text-topic distribution layer, respectively,/->And->Representing distribution pairs of lot size->Is in the same batch (batch) and +.>Non-matching real text-word distribution.
S4.4: the final training objective of the model is:

$$\mathcal{L} = \mathcal{L}_D - \lambda\,\hat{I}(x_r; \theta_r)$$

where $\mathcal{L}_D$ is the adversarial discriminator loss of S4.2, $\hat{I}(x_r; \theta_r) = S_{out}$ is the mutual-information estimate of S4.3, and $\lambda$ is the weight of the mutual-information regularization term.
the step 5 specifically comprises the following steps:
step 5-1, loading the dataset including text data, vocabulary, and word vectors
Step 5-2, build GeneratorEncoder->Discriminator->(mutual information) statistical network->The model is optimized by constructing an optimizer;
step 5-3, true distribution pairsAnd pseudo-distribution pair->As a discriminator +.>Input, its output signal +.>Can guide encoder->And generator->And thereby mine the topics in the text.
Step 5-4, counting the networkBy using realitySample pair->And negative sample distribution pair->Mutual information between the text-word distribution and the text topic distribution space is estimated for input and maximized to promote topic diversity.
Step 5-5, performing random gradient descent optimization according to the loss function of the discriminator and the regularized mutual information loss function, and updating parameters of the encoder and the decoder, namely:
step 5-6, repeating step 5-3, step 5-4 and step 5-5 until convergence.
The beneficial effects of the invention are as follows: through the mutual-information-maximization mechanism, the invention helps the topic model learn richer and more diverse topic representations; maximizing the mutual information among different words in the text encourages the model to organize related words into more coherent and better-differentiated topics. By optimizing the objective function with maximized mutual information, the model adapts better to task demands, improving its performance in tasks such as generation, classification, and clustering. Experiments on the 20Newsgroups dataset show that, compared with other methods, the method achieves higher C_P, C_V, C_A, NPMI, and UT scores, significantly improving the quality of the mined topics.
Drawings
Fig. 1 is a model diagram of the present invention.
Fig. 2 is a specific training flow chart of the present invention.
Detailed Description
Embodiments of the invention are disclosed in the drawings, and for purposes of explanation, numerous practical details are set forth in the following description. However, it should be understood that these practical details are not to be taken as limiting the invention. That is, in some embodiments of the invention, these practical details are unnecessary.
As shown in figs. 1-2, the present invention is a method for enhancing topic diversity based on mutual information and adversarial neural networks, specifically comprising the steps of:
step 1, preprocessing an online text of a social platform to obtain a real text, and representing the real text sample into a real text-word distribution vector by using a word bag model method.
Step 2, take the real text-word distribution vectors from step 1 as input to the encoder to obtain the mapping to real text-topic distribution vectors; form real distribution pairs from the real text-word distribution vectors and their topic distributions, and form negative sample distribution pairs from within-batch shuffled real text-word distribution vectors and real text-topic distribution vectors.
The encoder E in step 2 trains the mapping from the real text-word distribution vectors to the real text-topic distribution vectors, and comprises a V-dimensional text-word distribution layer, an S-dimensional semantic-implicit representation layer, and a K-dimensional text-topic distribution layer; the specific steps are:
Step 2-1, represent the real text from step 1 with the bag-of-words model and randomly sample a V-dimensional text-word distribution representation $x_r$ as input; the encoder E maps it to the S-dimensional implicit semantic space and then maps the resulting S-dimensional implicit semantic space to the K-dimensional text-topic distribution layer:

$$h_e = \mathrm{BN}\big(\mathrm{LR}(W_e^{(1)} x_r + b_e^{(1)})\big), \qquad \theta_r = \mathrm{softmax}\big(W_e^{(2)} h_e + b_e^{(2)}\big)$$

where $W_e^{(1)}$ is the weight matrix and $b_e^{(1)}$ the bias term from the text-word distribution layer to the semantic-implicit representation layer, LR is the LeakyReLU activation function, BN(·) is batch normalization, $W_e^{(2)}$ is the weight matrix and $b_e^{(2)}$ the bias term from the semantic-implicit representation layer to the text-topic distribution layer, and $\theta_r$ is the text-topic distribution corresponding to the real text, whose $k$-th dimension $\theta_r^{(k)}$, $k \in \{1,2,\dots,K\}$, indicates the proportion of the $k$-th topic in the real text.
In the present example, the encoder network dimensions are V-S-K, where V is the word-vector dimension, S is the semantic hidden-layer dimension, and K is the topic-vector dimension.
Step 2-2, then splice the real V-dimensional word distribution vector and the real K-dimensional topic distribution vector into a real distribution pair $p_r = (x_r, \theta_r)$, denote the within-batch shuffled real text-word distribution vector as $\tilde{x}_r$, and form negative sample distribution pairs $p_n = (\tilde{x}_r, \theta_r)$ from the mismatched topic and word distributions within the batch.
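A minimal PyTorch sketch of the V-S-K encoder of steps 2-1 and 2-2 might look as follows; the softmax output layer and the LeakyReLU slope of 0.2 are assumptions chosen so that $\theta_r$ lies on the topic simplex, and all module names are illustrative.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """V -> S -> K encoder: text-word distribution to text-topic distribution.
    Sketch of steps 2-1/2-2; softmax output and slope 0.2 are assumed."""
    def __init__(self, V: int, S: int, K: int):
        super().__init__()
        self.hidden = nn.Linear(V, S)   # W_e1, b_e1
        self.bn = nn.BatchNorm1d(S)     # BN(.)
        self.act = nn.LeakyReLU(0.2)    # LR
        self.topic = nn.Linear(S, K)    # W_e2, b_e2

    def forward(self, x_r: torch.Tensor) -> torch.Tensor:
        h = self.act(self.bn(self.hidden(x_r)))
        return torch.softmax(self.topic(h), dim=1)  # theta_r

# Real pairs p_r = (x_r, theta_r); negative pairs p_n shuffle x_r in the batch
# so that word and topic distributions are mismatched.
def make_pairs(x_r: torch.Tensor, theta_r: torch.Tensor):
    perm = torch.randperm(x_r.size(0))
    p_real = torch.cat([x_r, theta_r], dim=1)       # (B, V+K)
    p_neg = torch.cat([x_r[perm], theta_r], dim=1)  # mismatched pair
    return p_real, p_neg
```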
Step 3, the generator G in step 3 generates the mapping from text-topic distributions to text-word distributions, and comprises a K-dimensional text-topic distribution layer, an S-dimensional semantic-implicit representation layer, and a V-dimensional text-word distribution layer; a Dirichlet distribution with parameter $\alpha$ is used as the prior for the fake text-topic distribution $\theta_f$, obtained using the following formula:

$$p(\theta_f \mid \alpha) = \frac{\Gamma\big(\sum_{k=1}^{K}\alpha_k\big)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{f,k}^{\,\alpha_k - 1}$$

where $\alpha$ is the parameter of the probability density of the Dirichlet distribution, $k \in \{1,2,\dots,K\}$ indexes the topic parameters of the model, and $\theta_{f,k}$ represents the proportion of the $k$-th topic in the fake text.
Step 3-1, the generator G first converts the fake text-topic distribution $\theta_f$ to the S-dimensional semantic-implicit representation layer by the following transformation, and maps the obtained S-dimensional implicit semantic space to the V-dimensional text-word distribution layer:

$$h_g = \mathrm{BN}\big(\mathrm{LR}(W_g^{(1)} \theta_f + b_g^{(1)})\big), \qquad x_f = \mathrm{softmax}\big(W_g^{(2)} h_g + b_g^{(2)}\big)$$

where $W_g^{(1)}$ is the weight matrix and $b_g^{(1)}$ the bias term from the text-topic distribution layer to the semantic-implicit representation layer, LR is the LeakyReLU activation function, BN(·) is batch normalization, $W_g^{(2)}$ is the weight matrix and $b_g^{(2)}$ the bias term from the semantic-implicit representation layer to the text-word distribution layer, and $x_f$ is the fake text-word distribution, whose $v$-th dimension represents the probability of the $v$-th vocabulary word in the fake text.
In the present example, the generator network dimensions are K-S-V, where K is the topic-vector dimension, S is the semantic hidden-layer dimension, and V is the word-vector dimension.
Step 3-2, then splice the fake text-topic distribution $\theta_f$ and the fake text-word distribution $x_f$ into a fake distribution pair $p_f = (x_f, \theta_f)$.
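The generator and its Dirichlet prior (steps 3 through 3-2) admit a similar sketch; the concentration value alpha = 0.1 and the softmax output layer are assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """K -> S -> V generator: fake text-topic distribution to fake text-word
    distribution (steps 3-1/3-2). Softmax output is assumed so that x_f is
    a distribution over the V vocabulary words."""
    def __init__(self, K: int, S: int, V: int):
        super().__init__()
        self.hidden = nn.Linear(K, S)  # W_g1, b_g1
        self.bn = nn.BatchNorm1d(S)
        self.act = nn.LeakyReLU(0.2)
        self.word = nn.Linear(S, V)    # W_g2, b_g2

    def forward(self, theta_f: torch.Tensor) -> torch.Tensor:
        h = self.act(self.bn(self.hidden(theta_f)))
        return torch.softmax(self.word(h), dim=1)  # x_f

# Sample fake text-topic distributions from the Dirichlet prior; alpha = 0.1
# is an illustrative concentration parameter.
def sample_fake_pairs(generator: Generator, batch: int, K: int,
                      alpha: float = 0.1) -> torch.Tensor:
    prior = torch.distributions.Dirichlet(torch.full((K,), alpha))
    theta_f = prior.sample((batch,))
    x_f = generator(theta_f)
    return torch.cat([x_f, theta_f], dim=1)  # fake pair p_f, shape (B, V+K)
```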
The real distribution pairs $p_r$ and fake distribution pairs $p_f$ in step 4 are regarded as random samples drawn from two $(K+V)$-dimensional joint distributions $\mathbb{P}_r$ and $\mathbb{P}_f$, each a joint distribution composed of a K-dimensional Dirichlet distribution and a V-dimensional Dirichlet distribution. The training goal is to make the fake distribution pair $\mathbb{P}_f$ approximate the real distribution pair $\mathbb{P}_r$; the statistical network uses real distribution pairs $p_r$ and negative sample pairs $p_n$ to estimate the mutual information between the text-word distribution space and the text-topic distribution space and maximizes it to promote topic diversity. When training is complete, the encoder E and the generator G realize the bidirectional mapping relation and the internal mutual-information-maximization relation between the text-topic distributions and the text-word distributions; the specific steps are:
step 4-1, discriminatorIs composed of three layers of full-connection networks including one->+/>A combination distribution layer of dimensions +.>Semantic-implicit representation layer of dimension, one output layer. In true distribution pair->And pseudo distribution pair->For inputting and outputting +.>To judge the true or false of the input distribution pair, the method adopts the following formula:
wherein,distance to bulldozer,/>For the output signal of the arbiter, a value close to 1 indicates that the arbiter is more prone to discriminate it as true, and vice versa;
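One possible reading of the step 4-1 discriminator as a WGAN-style critic is sketched below; the unbounded linear output and the note about enforcing the Lipschitz constraint are assumptions layered on the earth mover's distance formulation, and the layer width is illustrative.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Three-layer critic over (V+K)-dimensional distribution pairs (step 4-1).
    The linear (unbounded) output follows the Wasserstein reading above."""
    def __init__(self, V: int, K: int, S: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(V + K, S),
            nn.LeakyReLU(0.2),
            nn.Linear(S, 1),
        )

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        return self.net(pair).squeeze(1)  # D_out

# Assumed WGAN-style critic loss approximating the earth mover's distance;
# in practice a gradient penalty or weight clipping would be added to
# enforce the 1-Lipschitz constraint on D.
def d_loss(D: Discriminator, p_real: torch.Tensor, p_fake: torch.Tensor):
    return D(p_fake).mean() - D(p_real).mean()
```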
step 4-2, counting the networkComprising a global arbiter->And maximizing mutual information loss function, global arbiterComprises a->+/>A combination distribution layer of dimensions +.>Semantic-implicit representation layer of dimension, an output layer, statistical networkFor calculating the true sample pair +.>And negative sample pair->Mutual information between them and output +.>The method adopts the following formula:
wherein,representation->Activating function->Input representing an activation function->And->Representing the true data distribution of the text-word distribution layer and the true distribution of the text-topic distribution layer, respectively,/->And->Representing distribution pairs of lot size->Is in the same batch (batch) and +.>Non-matching real text-word distribution.
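The statistical network and the softplus-based mutual-information estimate of step 4-2 could be sketched as follows; the estimator mirrors the Jensen-Shannon-style lower bound in the formula above, while the network width is an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatisticsNetwork(nn.Module):
    """Global discriminator D' of the statistical network H (step 4-2):
    scores (V+K)-dimensional joint pairs for the MI estimate."""
    def __init__(self, V: int, K: int, S: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(V + K, S),
            nn.LeakyReLU(0.2),
            nn.Linear(S, 1),
        )

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        return self.net(pair).squeeze(1)

# Softplus-based MI lower bound matching the step 4-2 formula: matched pairs
# are pulled up, shuffled (negative) pairs are pushed down; maximizing the
# returned value maximizes the mutual-information bound.
def mi_estimate(H: StatisticsNetwork, p_real: torch.Tensor, p_neg: torch.Tensor):
    pos = -F.softplus(-H(p_real)).mean()  # E[-sp(-D'(x_r, theta_r))]
    neg = F.softplus(H(p_neg)).mean()     # E[ sp( D'(x~_r, theta_r))]
    return pos - neg
```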
In summary, the final training objective of the model is:

$$\mathcal{L} = \mathcal{L}_D - \lambda\,\hat{I}(x_r; \theta_r)$$

where $\mathcal{L}_D$ is the adversarial discriminator loss of step 4-1, $\hat{I}(x_r; \theta_r) = S_{out}$ is the mutual-information estimate of step 4-2, and $\lambda$ is the weight of the mutual-information regularization term.
and 5, in order to approximate the bulldozer distance and the jensen-shannon distance between two high latitude distributions during training, repeatedly optimizing and iterating the training target in the countermeasure training process until the loss function converges.
Step 5-1, load the dataset, including text data, the vocabulary, and word vectors;
Step 5-2, build the generator G, the encoder E, the discriminator D, and the statistical network H, and construct an optimizer to optimize the model;
Step 5-3, feed the real distribution pairs $p_r$ and fake distribution pairs $p_f$ to the discriminator D as input; its output signal $D_{out}$ guides the learning of the encoder E and the generator G and thereby mines the topics in the text;
Step 5-4, the statistical network H takes real sample pairs $p_r$ and negative sample distribution pairs $p_n$ as input, estimates the mutual information between the text-word distribution and text-topic distribution spaces, and maximizes it to promote topic diversity;
Step 5-5, perform stochastic gradient descent optimization according to the discriminator loss function and the regularized mutual-information loss function, and update the parameters of the encoder and the generator;
Step 5-6, repeat step 5-3, step 5-4, and step 5-5 until convergence.
By maximizing the mutual information between the topic distribution and the word distribution, the method for enhancing topic diversity based on mutual information and adversarial neural networks improves the correlation between the two distributions and enhances topic diversity.
The invention provides an adversarial neural network method for enhancing the diversity of topic models. Topic coherence was tested on the 20Newsgroups dataset under 5 topic-number settings [20, 30, 50, 75, 100]. The average topic coherence values measured for the method are: C_P 0.273, CA 0.206, UCI 0.139, NPMI 0.052, and UT 0.761, all higher than those of the comparison experiments, in which the best scores are C_P 0.260, CA 0.158, UCI 0.09, NPMI 0.047, and UT 0.732.
Through the mutual-information-maximization mechanism, the invention helps the topic model learn richer and more diverse topic representations; maximizing the mutual information among different words in the text encourages the model to organize related words into more coherent and better-differentiated topics. By optimizing the objective function with maximized mutual information, the model adapts better to task demands, improving its performance in tasks such as generation, classification, and clustering.
The foregoing description is merely illustrative of the invention and is not to be construed as limiting it. Various modifications and variations of the present invention will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall be included in the scope of the claims of the present invention.

Claims (4)

1. A method for enhancing topic diversity based on mutual information and adversarial neural networks, characterized in that the method comprises the following steps:
step 1, performing data preprocessing on online texts of a social platform to obtain real texts, and representing the real texts as real text-word distribution vectors using a bag-of-words model;
step 2, placing a plurality of the real text-word distribution vectors from step 1 in the same batch as input to an encoder to obtain real text-topic distribution vectors, forming real distribution pairs from the real text-word distribution vectors and the corresponding topic distributions, and splicing the within-batch shuffled real text-word distribution vectors with the real text-topic distribution vectors to form negative sample distribution pairs;
step 3, randomly sampling a topic vector from a Dirichlet distribution as a fake text-topic distribution, inputting it into a generator to obtain a fake text-word distribution vector, and forming a fake distribution pair from the fake text-word distribution vector and the fake text-topic distribution;
step 4, the discriminator receiving the real distribution pairs obtained in step 2 and the fake distribution pairs generated in step 3 as input and computing the losses of the real and fake distribution pairs to distinguish the real data distribution pairs from the generated data distribution pairs, and introducing a statistical network, wherein the statistical network receives the real distribution pairs and the negative sample distribution pairs as input and computes the mutual information between them, the regularization loss of the mutual information being added to the loss of the discriminator;
step 5, using adversarial training to approximate the earth mover's (Wasserstein) distance between the real distribution pairs and the fake distribution pairs and the Jensen-Shannon divergence between the real distribution pairs and the negative sample distribution pairs, and optimizing the objective and iterating the model through adversarial training until the loss function converges, specifically comprising the following steps:
step 5-1, loading a data set comprising text data, a vocabulary and word vectors;
step 5-2, constructing an encoder E, a generator G, a discriminator D and a statistical network H model, and constructing an optimizer to optimize the model;
step 5-3, feeding the real distribution pairs $p_r$ and fake distribution pairs $p_f$ to the discriminator D as input, whose output signal $D_{out}$ during adversarial training guides the learning of the encoder E and the generator G so as to mine out the topics in the text;
step 5-4, the statistical network H taking real sample pairs $p_r$ and negative sample distribution pairs $p_n$ as input, estimating the mutual information between the text-word distribution and text-topic distribution spaces, and maximizing it to promote topic diversity;
step 5-5, performing stochastic gradient descent optimization according to the discriminator loss function and the regularized mutual-information loss function, and updating the parameters of the encoder and the generator;
step 5-6, repeating the steps 5-3 to 5-5 until convergence.
2. The method for enhancing topic diversity based on mutual information and adversarial neural networks according to claim 1, characterized in that: the encoder E in step 2 trains the mapping from the real text-word distribution vectors to the real text-topic distribution vectors, and comprises a V-dimensional text-word distribution layer, an S-dimensional semantic-implicit representation layer, and a K-dimensional text-topic distribution layer, specifically comprising the steps of:
step 2-1, representing the real text from step 1 with the bag-of-words model and randomly sampling a V-dimensional text-word distribution representation $x_r$ as input; the encoder E maps it to an S-dimensional latent semantic space, and then maps the resulting S-dimensional latent semantic space to a K-dimensional text-topic distribution layer, using the following formula:

$$h_e = \mathrm{BN}\big(\mathrm{LR}(W_e^{(1)} x_r + b_e^{(1)})\big), \qquad \theta_r = \mathrm{softmax}\big(W_e^{(2)} h_e + b_e^{(2)}\big)$$

wherein $W_e^{(1)}$ is the weight matrix and $b_e^{(1)}$ the bias term from the text-word distribution layer to the semantic-implicit representation layer, LR is the LeakyReLU activation function, BN(·) is batch normalization, $W_e^{(2)}$ is the weight matrix and $b_e^{(2)}$ the bias term from the semantic-implicit representation layer to the text-topic distribution layer, and $\theta_r$ is the text-topic distribution corresponding to the real text, whose $k$-th dimension $\theta_r^{(k)}$, $k \in \{1,2,\dots,K\}$, represents the proportion of the $k$-th topic in the real text;
step 2-2, then splicing the real V-dimensional word distribution vector and the real K-dimensional topic distribution vector into a real distribution pair $p_r = (x_r, \theta_r)$, representing the within-batch shuffled real text-word distribution vector as $\tilde{x}_r$, and forming negative sample distribution pairs $p_n = (\tilde{x}_r, \theta_r)$ from the mismatched topic and word distributions within the batch.
3. The method for enhancing topic diversity based on mutual information and adversarial neural networks according to claim 2, characterized in that: in step 3, the generator G generates the mapping from text-topic distributions to text-word distributions, and comprises a K-dimensional text-topic distribution layer, an S-dimensional semantic-implicit representation layer, and a V-dimensional text-word distribution layer; a Dirichlet distribution with parameter $\alpha$ is used as the prior for the fake text-topic distribution $\theta_f$, obtained using the following formula:

$$p(\theta_f \mid \alpha) = \frac{\Gamma\big(\sum_{k=1}^{K}\alpha_k\big)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_{f,k}^{\,\alpha_k - 1}$$

wherein $\alpha$ is the parameter of the probability density of the Dirichlet distribution, $k \in \{1,2,\dots,K\}$ indexes the topic parameters of the method for enhancing topic diversity, and $\theta_{f,k}$ represents the proportion of the $k$-th topic in the fake text;
step 3-1, the generator G first converts the fake text-topic distribution $\theta_f$ to the S-dimensional semantic-implicit representation layer by the following transformation, and maps the obtained S-dimensional implicit semantic space to the V-dimensional text-word distribution layer:

$$h_g = \mathrm{BN}\big(\mathrm{LR}(W_g^{(1)} \theta_f + b_g^{(1)})\big), \qquad x_f = \mathrm{softmax}\big(W_g^{(2)} h_g + b_g^{(2)}\big)$$

wherein $W_g^{(1)}$ is the weight matrix and $b_g^{(1)}$ the bias term from the text-topic distribution layer to the semantic-implicit representation layer, LR is the LeakyReLU activation function, BN(·) is batch normalization, $W_g^{(2)}$ is the weight matrix and $b_g^{(2)}$ the bias term from the semantic-implicit representation layer to the text-word distribution layer, and $x_f$ is the fake text-word distribution, whose $v$-th dimension represents the probability of the $v$-th vocabulary word in the fake text;
step 3-2, then splicing the fake text-topic distribution $\theta_f$ and the fake text-word distribution $x_f$ to form a fake distribution pair $p_f = (x_f, \theta_f)$.
4. The method for enhancing topic diversity based on mutual information and adversarial neural networks according to claim 3, characterized in that: the real distribution pairs $p_r$ and fake distribution pairs $p_f$ in step 4 are regarded as random samples drawn from two $(K+V)$-dimensional joint distributions $\mathbb{P}_r$ and $\mathbb{P}_f$, each a joint distribution composed of a K-dimensional Dirichlet distribution and a V-dimensional Dirichlet distribution; the training goal of the discriminator D is to make the fake distribution $\mathbb{P}_f$ approximate the real distribution $\mathbb{P}_r$; the statistical network H uses real distribution pairs $p_r$ and negative sample distribution pairs $p_n$ to estimate the mutual information between the text-word distribution space and the text-topic distribution space and maximizes it to improve topic diversity; when training is complete, the encoder E and the generator G obtain the bidirectional mapping relation and the internal mutual-information-maximization relation between the text-topic distributions and the text-word distributions, comprising the following steps of:
step 4-1, the discriminator D is composed of a three-layer fully connected network, specifically a $(V+K)$-dimensional joint distribution layer, an S-dimensional semantic-implicit representation layer, and an output layer; with the real distribution pairs $p_r$ and fake distribution pairs $p_f$ as input, it outputs $D_{out}$ to judge whether an input distribution pair is real or fake, using the following formula:

$$W(\mathbb{P}_r, \mathbb{P}_f) = \sup_{\|D\|_L \le 1} \; \mathbb{E}_{p_r \sim \mathbb{P}_r}[D(p_r)] - \mathbb{E}_{p_f \sim \mathbb{P}_f}[D(p_f)]$$

wherein W is the earth mover's (Wasserstein) distance and D(·) is the output signal of the discriminator; a value close to 1 indicates that the discriminator tends to judge the input pair as real, and vice versa;
step 4-2, the statistical network H comprises a global discriminator D′ and a maximized mutual-information loss function; the global discriminator D′ comprises a $(V+K)$-dimensional joint distribution layer, an S-dimensional semantic-implicit representation layer, and an output layer; the statistical network H is used to calculate the mutual information between real sample pairs $p_r$ and negative sample pairs $p_n$ and to output $S_{out}$, using the following formula:

$$S_{out} = \mathbb{E}_{p_r}\!\left[-\mathrm{sp}\big(-D'(x_r, \theta_r)\big)\right] - \mathbb{E}_{p_n}\!\left[\mathrm{sp}\big(D'(\tilde{x}_r, \theta_r)\big)\right]$$

$$\mathrm{sp}(x) = \log(1 + e^{x})$$

where sp(·) represents the softplus activation function, x represents the input of the activation function, $x_r$ and $\theta_r$ represent the real data distribution of the text-word distribution layer and the real distribution of the text-topic distribution layer respectively, and $\tilde{x}_r$ is a real text-word distribution in the same batch that does not match $\theta_r$;
step 4-3, the final training objective of the model is:

$$\mathcal{L} = \mathcal{L}_D - \lambda\,\hat{I}(x_r; \theta_r)$$

where $\mathcal{L}_D$ is the adversarial discriminator loss of step 4-1, $\hat{I}(x_r; \theta_r) = S_{out}$ is the mutual-information estimate of step 4-2, and $\lambda$ is the weight of the mutual-information regularization term.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311524544.3A CN117236330B (en) 2023-11-16 2023-11-16 Method for enhancing topic diversity based on mutual information and adversarial neural networks


Publications (2)

Publication Number Publication Date
CN117236330A (en) 2023-12-15
CN117236330B (en) 2024-01-26

Family

ID=89095326

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311524544.3A Active CN117236330B (en) 2023-11-16 2023-11-16 Method for enhancing topic diversity based on mutual information and adversarial neural networks

Country Status (1)

Country Link
CN (1) CN117236330B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant