CN113821642B - Method and system for cleaning text based on GAN clustering - Google Patents


Info

Publication number: CN113821642B
Application number: CN202111369093.1A
Authority: CN (China)
Prior art keywords: text, vector, network, real, loss
Priority / filing date: 2021-11-18
Legal status: Active (granted; the listed status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN113821642A
Inventors: 韩瑞峰, 金霞, 杨红飞
Current assignee: Huoshi Creation Technology Co., Ltd.
Original assignee: Hangzhou Firestone Technology Co., Ltd.
Application filed by Hangzhou Firestone Technology Co., Ltd.

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and system for cleaning text based on GAN clustering. A GAN comprising a generation network, an encoding network, and a discrimination network is first constructed; the encoding network maps a text to a latent variable and a text-type distribution vector, whose concatenation serves as the text's vector representation, and the per-type mean of these representations serves as the anchor vector of each text type. For a text to be cleaned, its vector representation is computed, and the distance between this vector and each type's anchor vector measures the text's degree of noise; whether each piece of text data is noise is then judged manually against a noise threshold. Using the GAN adversarial training method, assisted by several losses, the invention obtains reliable text vector representations for computing text-type anchor vectors, which can be obtained even without labels; measuring a text's noisiness by its distance to the anchor vectors enables efficient text cleaning without supervision.

Description

Method and system for cleaning text based on GAN clustering
Technical Field
The invention relates to the field of text data mining, in particular to a method and a system for text cleaning based on GAN clustering.
Background
In text data mining applications, crawling and cleaning the data is the first step. At present this is mostly done either by writing cleaning rules or by collecting large numbers of positive and negative samples and training a text classifier to filter out noisy data. Writing rules requires extensive manual observation, summarization, and continuous investment in optimization; as the number of rules grows, rules begin to conflict, a rule engine is needed to manage them, and rules cope poorly with the diversity of natural language. Training a text classifier likewise requires manually collecting large numbers of positive and negative samples, at comparable cost, and the collection and labeling must be repeated for every new scenario with different and changing requirements.
Disclosure of Invention
The invention aims to provide, against the defects of the prior art, a method and a system for cleaning data with an unsupervised clustering method. The invention clusters texts using the GAN adversarial training method to assist the text cleaning work; the training process needs no text-type labels, which are used only to evaluate the clustering result.
The purpose of the invention is achieved by the following technical scheme: a text cleaning method based on GAN clustering comprises the following steps:
(1) Construct and train a GAN comprising a generation network, an encoding network, and a discrimination network. The generation network's input consists of a latent variable and a one-hot vector of length n_c, where n_c is the predefined number of text types; its output is a fixed-length generated text. The encoding network's input is a real text or a text produced by the generation network; its output is a latent variable and a text-type distribution vector of length n_c, from which softmax yields a one-hot vector. The discrimination network's input is a real text or a text produced by the generation network; its output is a scalar representing the probability that the input is real.
(2) Perform clustering analysis and cleaning of the text, specifically as follows:
(2.1) For a batch of real texts, use the encoding network to obtain their latent variables and text-type distribution vectors, convert the text-type distribution vectors into one-hot vectors, and obtain the text types from the one-hot vectors;
(2.2) Take the concatenation of the latent variable and the text-type distribution vector as the text's vector representation; collect the vectors of all texts of each text type in the batch of real texts and take their mean as that text type's anchor vector;
(2.3) For a text to be cleaned, compute its vector representation and the distance between this vector and each text type's anchor vector. If the text types include a noise type, take the minimum computed distance, store texts whose minimum distance is to the noise anchor and texts whose minimum distance is to a non-noise anchor in two separate lists, and sort each list by distance. If the text types do not include a noise type, use the minimum distance between the vector of the text to be cleaned and the anchor vectors as its degree of noise, store it in a list, and sort by distance;
(2.4) For the text data in the lists obtained in step (2.3), manually judge whether each piece of text data is noise according to a set noise threshold.
Further, in step (1), the latent variable is a floating-point vector of length dim_latent; the input of the generation network is a vector of length dim_latent + n_c, formed by concatenating a floating-point vector of length dim_latent with a one-hot vector of length n_c.
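A minimal sketch of how this generation-network input could be assembled, assuming PyTorch; the sizes dim_latent = 64 and n_c = 3 are illustrative assumptions, and the variable names follow the text:

    import torch
    import torch.nn.functional as F

    dim_latent, n_c = 64, 3                          # illustrative sizes
    z = torch.randn(1, dim_latent)                   # latent variable: float vector of length dim_latent
    zc_idx = torch.randint(0, n_c, (1,))             # randomly selected text-type index
    zc = F.one_hot(zc_idx, num_classes=n_c).float()  # one-hot vector of length n_c
    g_input = torch.cat([z, zc], dim=1)              # generation-network input, length dim_latent + n_c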
Further, in step (1), the text types are defined according to the actual text classification task and either include both a noise type and non-noise types, or include only non-noise types.
Further, the generation network consists of an Embedding layer, several LSTM layers, and several fully connected layers.
Furthermore, the discrimination network and the encoding network each consist of an Embedding layer, convolutional or LSTM layers, and a fully connected layer.
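A sketch of the three networks under the layer composition just described, assuming PyTorch; vocabulary size, sequence length, and hidden width are illustrative assumptions, and the generation network's Embedding layer (which would enter in an autoregressive variant) is omitted for brevity:

    import torch
    import torch.nn as nn

    vocab_size, seq_len, dim_latent, n_c, hidden = 5000, 32, 64, 3, 128

    class Generator(nn.Module):
        # latent + one-hot in, fixed-length text (per-position token logits) out
        def __init__(self):
            super().__init__()
            self.fc_in = nn.Linear(dim_latent + n_c, hidden)
            self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
            self.fc_out = nn.Linear(hidden, vocab_size)

        def forward(self, z_zc):                        # (B, dim_latent + n_c)
            h = self.fc_in(z_zc).unsqueeze(1).repeat(1, seq_len, 1)
            out, _ = self.lstm(h)
            return self.fc_out(out)                     # (B, seq_len, vocab_size)

    class Encoder(nn.Module):
        # text in, latent variable and text-type logits out
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, hidden)
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, dim_latent + n_c)

        def forward(self, tokens):                      # (B, seq_len) token ids
            h, _ = self.lstm(self.emb(tokens))
            out = self.fc(h[:, -1])                     # last hidden state
            return out[:, :dim_latent], out[:, dim_latent:]

    class Discriminator(nn.Module):
        # text in, scalar realness score out
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, hidden)
            self.conv = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
            self.fc = nn.Linear(hidden, 1)

        def forward(self, tokens):                      # (B, seq_len) token ids
            h = self.conv(self.emb(tokens).transpose(1, 2))
            return self.fc(h.max(dim=2).values)         # (B, 1)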
Further, the training process of the GAN in step (1) is specifically as follows:
a. Sample latent variables: randomly draw a floating-point vector of length dim_latent (matching the encoding network's latent output), randomly select a text-type index zc_idx, and convert it into a one-hot vector;
b. Compute the loss functions, specifically comprising:
Real-text discrimination loss: input a batch of N real texts into the encoding network to obtain N latent variables; input these into the generation network to produce N generated texts; input both the real and the generated texts into the discrimination network to obtain D_real, the probability that a real-text input is real, and D_gen, the probability that a generated-text input is real;
Gradient penalty loss: interpolate between a batch of real texts and generated texts to obtain a batch of new texts, compute their gradient vectors with the discrimination network, and use the L2 norm of the gradient vectors as the gradient penalty loss;
Latent-variable reconstruction loss: input a generated text into the encoding network to obtain a latent variable and a text-type distribution vector; compute the MSE loss between this latent variable and the latent variable that generated the text, and the cross-entropy loss between the text-type distribution vector and the one-hot vector that generated the text;
Text reconstruction loss: input the latent variable and one-hot vector obtained from a real text into the generation network to obtain a generated text, and compute the MSE loss between the real text and the generated text;
Clustering loss: compute a clustering loss on the latent variables of the real texts with the k-means unsupervised clustering method;
The networks traversed when computing each loss are adjusted by backpropagating the corresponding loss value: the real-text discrimination loss and the gradient penalty loss adjust the generation and discrimination networks, the latent-variable and text reconstruction losses adjust the generation and encoding networks, and the clustering loss adjusts the encoding network.
Further, when computing the real-text discrimination loss, if labeled texts are available, discrimination of the specific type of a real text can be added; the output of the discrimination network then represents the probability that the text belongs to each type or is a generated text.
The invention also provides a text cleaning system based on GAN clustering, comprising a GAN network module, an anchor vector calculation module, and a text cleaning module:
The GAN network module consists of a generation network module, an encoding network module, and a discrimination network module.
The input of the generation network module consists of a latent variable and a one-hot vector of length n_c, where n_c is the predefined number of text types; the output of the generation network module is a fixed-length generated text.
The input of the encoding network module is a real text or a generated text output by the generation network module; the output is a latent variable and a text-type distribution vector of length n_c, from which softmax yields a one-hot vector, and the one-hot vector gives the text type.
The input of the discrimination network module is a real text or a generated text output by the generation network module; the output is a scalar representing the probability that the input is real.
The latent variable and text-type distribution vector output by the encoding network module are concatenated to form the text's vector representation; the vectors of all texts of each text type are input into the anchor vector calculation module, and their mean is taken as that text type's anchor vector.
The text cleaning module takes as input the vector representations of texts obtained through the encoding network module and the anchor vectors of the text types obtained through the anchor vector calculation module, and computes the distance between each text's vector representation and each type's anchor vector. If the text types include a noise type, it takes the minimum computed distance, stores texts whose minimum distance is to the noise anchor and texts whose minimum distance is to a non-noise anchor in two separate lists, and sorts each list by distance; if the text types do not include a noise type, it uses the minimum distance between the input text's vector and the anchor vectors as the degree of noise, stores it in a list, and sorts by distance. Finally, for the text data in the resulting lists, whether each piece of text data is noise is judged manually according to a set noise threshold and the result is output.
Further, the latent variable output by the encoding network module is a floating-point vector of length dim_latent; the input of the generation network module is a vector of length dim_latent + n_c, formed by concatenating a floating-point vector of length dim_latent with a one-hot vector of length n_c.
Furthermore, the system also comprises a latent-variable acquisition module and a loss function calculation module, used to train the GAN network module;
The latent-variable acquisition module randomly draws a floating-point vector of length dim_latent (matching the encoding network module's latent output), randomly selects a text-type index zc_idx, and converts it into a one-hot vector;
The loss function calculation module computes the following loss functions:
Real-text discrimination loss: input a batch of N real texts into the encoding network module to obtain N latent variables; input these into the generation network module to produce N generated texts; input both the real and the generated texts into the discrimination network module to obtain D_real, the probability that a real-text input is real, and D_gen, the probability that a generated-text input is real;
Gradient penalty loss: interpolate between a batch of real texts and generated texts to obtain a batch of new texts, compute their gradient vectors with the discrimination network module, and use the L2 norm of the gradient vectors as the gradient penalty loss;
Latent-variable reconstruction loss: input a generated text into the encoding network module to obtain a latent variable and a text-type distribution vector; compute the MSE loss between this latent variable and the latent variable that generated the text, and the cross-entropy loss between the text-type distribution vector and the one-hot vector that generated the text;
Text reconstruction loss: input the latent variable and one-hot vector obtained from a real text into the generation network module to obtain a generated text, and compute the MSE loss between the real text and the generated text;
Clustering loss: compute a clustering loss on the latent variables of the real texts with the k-means unsupervised clustering method;
The networks traversed when computing each loss are adjusted by backpropagating the corresponding loss value: the real-text discrimination loss and the gradient penalty loss adjust the generation and discrimination networks, the latent-variable and text reconstruction losses adjust the generation and encoding networks, and the clustering loss adjusts the encoding network.
The invention has the following beneficial effects: using the GAN adversarial training method, assisted by several losses, reliable text vector representations are obtained and used to compute text-type anchor vectors; the anchor vectors can be obtained even without labels; the distance to the anchor vectors measures a text's degree of noise; and text classification or noise-degree ranking based on this enables efficient text cleaning without supervision.
Drawings
FIG. 1 is a schematic flow chart of a text cleaning method according to the present invention;
FIG. 2 is a schematic diagram of a text cleaning system according to the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the accompanying drawings.
As shown in fig. 1, the text cleaning method based on GAN clustering provided by the present invention specifically includes the following processes:
training the GAN network:
1.1 Network structure: the GAN comprises a generation network, an encoding network, and a discrimination network.
The generation network's input is a vector of length dim_latent + n_c, formed by concatenating a floating-point vector (the latent variable) of length dim_latent with a one-hot vector of length n_c, where n_c is the predefined number of text types; the text types are defined by the actual classification task and either include a noise type alongside non-noise types or include only non-noise types. The output is a fixed-length generated text. The generation network may consist of an Embedding layer, several LSTM layers, and several fully connected layers.
The encoding network's input is a text (a real text or a generated text from the generation network); its output is a vector of length dim_latent + n_c, split into a floating-point vector (the latent variable) of length dim_latent and a text-type distribution vector of length n_c. The text-type distribution vector is converted by softmax into a one-hot vector. The encoding network may consist of an Embedding layer, convolutional or LSTM layers, and a fully connected layer. The meaning of the text-type distribution vector is a probability distribution over the text's types: with 3 text types, a distribution vector of, say, [0.7, 0.2, 0.1] means the text is taken to be of the first type.
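A minimal sketch of this softmax and one-hot conversion, assuming PyTorch; the logits are a hypothetical output of the encoding network's type head for n_c = 3:

    import torch
    import torch.nn.functional as F

    type_logits = torch.tensor([[1.9, 0.7, 0.0]])    # hypothetical type-head output
    probs = F.softmax(type_logits, dim=1)            # text-type distribution vector, sums to 1
    onehot = F.one_hot(probs.argmax(dim=1), num_classes=3).float()  # e.g. [[1., 0., 0.]]
    text_type = int(probs.argmax(dim=1))             # the text is taken to be of the first type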
The discrimination network's input is a text (real or generated) and its output is a scalar representing the probability that the input is real. The encoding and discrimination networks may share all network layers except the output layer, or share all layers. When sharing all layers except the output layer, the two output layers have different dimensions; when sharing all layers, the output text type is obtained from the one-hot vector, and the number of output categories must match the number of types defined by the one-hot vector.
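A sketch of the optional layer sharing between the encoding and discrimination networks: one shared trunk with two output heads of different dimensionality (the all-layers-shared variant would instead read the type off the one-hot part of a single head). Sizes match the illustrative assumptions above:

    import torch
    import torch.nn as nn

    vocab_size, seq_len, dim_latent, n_c, hidden = 5000, 32, 64, 3, 128

    class SharedEncoderDiscriminator(nn.Module):
        def __init__(self):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, hidden)
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # shared trunk
            self.enc_head = nn.Linear(hidden, dim_latent + n_c)    # encoding output
            self.disc_head = nn.Linear(hidden, 1)                  # realness score

        def forward(self, tokens):                   # (B, seq_len) token ids
            h, _ = self.lstm(self.emb(tokens))
            feat = h[:, -1]
            return self.enc_head(feat), self.disc_head(feat)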
1.2 training:
a. Sample latent variables: randomly draw a floating-point vector of length dim_latent (matching the encoding network's latent output), randomly select a text-type index zc_idx, and convert it into a one-hot vector.
b. Compute the loss functions:
Real-text discrimination loss: input a batch of N real texts into the encoding network to obtain N latent variables; input these into the generation network to produce N generated texts; input both the real and the generated texts into the discrimination network to obtain the probabilities D_real and D_gen, where D_real is the probability that a real-text input is real and D_gen is the probability that a generated-text input is real.
Optionally, if labeled texts are available, discrimination of the specific type of a real text can be added; the output of the discrimination network then represents the probability that the text belongs to each type or is a generated text. A sketch of the basic (unlabeled) discrimination loss follows.
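This sketch assumes WGAN-style realness scores (consistent with the gradient penalty below) and the Generator/Discriminator sketched earlier; note that discretizing generated logits with argmax blocks generator gradients, and practical text GANs work around this (e.g. with Gumbel-softmax), which is beyond this sketch:

    import torch

    def discrimination_loss(D, real_tokens, gen_logits):
        d_real = D(real_tokens).mean()               # D_real over a batch of N real texts
        d_gen = D(gen_logits.argmax(dim=-1)).mean()  # D_gen over the N generated texts
        loss_d = d_gen - d_real                      # discriminator: score real up, generated down
        loss_g = -d_gen                              # generator: make generated texts score as real
        return loss_d, loss_g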
Gradient penalty loss: interpolate between a batch of real texts and generated texts to obtain a batch of new texts, compute their gradient vectors with the discrimination network, and use the L2 norm of the gradient vectors as the gradient penalty loss.
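A sketch of this penalty, assuming the interpolation is done in the discriminator's embedding space, since interpolating discrete token ids is not meaningful; D_emb is a hypothetical discriminator variant that accepts embedded inputs, and the L2 norm is used directly as the text states (classic WGAN-GP would instead penalize the squared deviation of this norm from 1):

    import torch

    def gradient_penalty(D_emb, real_emb, gen_emb):  # (B, seq_len, emb_dim) each
        alpha = torch.rand(real_emb.size(0), 1, 1)   # per-sample mixing weight
        interp = (alpha * real_emb + (1 - alpha) * gen_emb).requires_grad_(True)
        grad, = torch.autograd.grad(D_emb(interp).sum(), interp, create_graph=True)
        return grad.flatten(1).norm(2, dim=1).mean() # L2 norm of the gradient vectors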
Latent-variable reconstruction loss: input a generated text into the encoding network to obtain a latent variable and a text-type distribution vector; compute the MSE loss between this latent variable and the latent variable that generated the text, and the cross-entropy loss between the text-type distribution vector and the one-hot vector that generated the text.
Text reconstruction loss: input the latent variable and one-hot vector obtained from a real text into the generation network to obtain a generated text, and compute the MSE loss between the real text and the generated text.
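A sketch of these two reconstruction losses, assuming the Generator G and Encoder E sketched earlier; computing the text MSE on softmax/one-hot token representations is one concrete reading of the MSE between a real and a generated text:

    import torch
    import torch.nn.functional as F

    def latent_reconstruction_loss(E, gen_tokens, z, zc_idx):
        z_hat, type_logits = E(gen_tokens)               # re-encode the generated text
        return (F.mse_loss(z_hat, z)                     # latent vs. the latent that generated it
                + F.cross_entropy(type_logits, zc_idx))  # type distribution vs. the generating one-hot

    def text_reconstruction_loss(G, E, real_tokens):
        z, type_logits = E(real_tokens)                  # latent + type from the real text
        onehot = F.one_hot(type_logits.argmax(dim=1), type_logits.size(1)).float()
        gen_logits = G(torch.cat([z, onehot], dim=1))    # regenerate from the encoding
        target = F.one_hot(real_tokens, gen_logits.size(-1)).float()
        return F.mse_loss(F.softmax(gen_logits, dim=-1), target)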
Clustering loss: for the latent variables of the real texts, compute the clustering loss with an unsupervised clustering method such as k-means.
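A sketch of this loss, assuming scikit-learn's KMeans for the unsupervised step; the loss pulls each real-text latent variable toward the centroid k-means assigns it to:

    import torch
    from sklearn.cluster import KMeans

    def clustering_loss(z, n_c):                     # z: (N, dim_latent) real-text latents
        km = KMeans(n_clusters=n_c, n_init=10).fit(z.detach().cpu().numpy())
        centers = torch.as_tensor(km.cluster_centers_, dtype=z.dtype, device=z.device)
        assigned = centers[torch.as_tensor(km.labels_, dtype=torch.long, device=z.device)]
        return ((z - assigned) ** 2).sum(dim=1).mean()  # mean squared distance to own centroid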
The networks traversed when computing each loss are adjusted by backpropagating the corresponding loss value: the real-text discrimination loss and the gradient penalty loss adjust the generation and discrimination networks, the latent-variable and text reconstruction losses adjust the generation and encoding networks, and the clustering loss adjusts the encoding network.
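A sketch of this loss-to-network routing, assuming the Generator, Encoder, and Discriminator classes sketched earlier; the optimizer choice and learning rate are illustrative, and each loss group is assumed to come from its own forward pass:

    import itertools
    import torch

    G, E, D = Generator(), Encoder(), Discriminator()   # from the architecture sketch above
    opt_gd = torch.optim.Adam(itertools.chain(G.parameters(), D.parameters()), lr=1e-4)
    opt_ge = torch.optim.Adam(itertools.chain(G.parameters(), E.parameters()), lr=1e-4)
    opt_e = torch.optim.Adam(E.parameters(), lr=1e-4)

    def apply_losses(loss_d, loss_gp, loss_z_rec, loss_x_rec, loss_cluster):
        opt_gd.zero_grad(); (loss_d + loss_gp).backward(); opt_gd.step()         # adjusts G and D
        opt_ge.zero_grad(); (loss_z_rec + loss_x_rec).backward(); opt_ge.step()  # adjusts G and E
        opt_e.zero_grad(); loss_cluster.backward(); opt_e.step()                 # adjusts E only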
Optionally, the computationally expensive latent-variable reconstruction, text reconstruction, and clustering losses are computed only once every M iterations.
Clustering analysis and cleaning of texts:
for a batch of real texts, a coding network is used for obtaining hidden variables and text type distribution vectors of the real texts, the text type distribution vectors are converted into onehot vectors, and the types of the real texts are obtained through the onehot vectors.
Optionally, take the latent variable together with the text-type distribution vector before the softmax one-hot conversion, i.e. concatenate the latent variable and the text-type distribution vector, as the text's vector representation; collect the vectors of all texts of each text type in the batch of real texts and take their mean as that text type's anchor vector.
For a text to be cleaned, compute its vector representation and the distance between this vector and each type's anchor vector, for example the L2 distance; if the distance to the noise type is the minimum, the text is considered closer to noise. If the text types include a noise type, take the minimum computed distance and store texts whose minimum distance is a noise distance and texts whose minimum distance is a non-noise distance in two separate lists, each sorted by distance; here the noise distance is the distance between the vector of the text to be cleaned and the anchor vector of the noise text type, and the non-noise distance is the distance to the anchor vector of a non-noise text type. If the text types do not include a noise type, use the minimum distance between the vector of the text to be cleaned and the anchor vectors as its degree of noise, store it in a list, and sort by distance.
The resulting lists are then judged manually: either a noise threshold is set for each list, with texts above the threshold taken as noise, or whether each piece of data is noise is judged manually. Most noise is concentrated at the large-distance end, so after manually recalling the small amount of non-noise data there, the cleaning quality can be further improved. A sketch of this computation follows.
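This sketch of the anchor-vector computation and the distance-based noise scoring assumes the texts' vector representations and types are already stacked into tensors; the L2 distance follows the text, and all names are illustrative:

    import torch

    def anchor_vectors(text_vecs, text_types, n_c):
        # text_vecs: (N, dim_latent + n_c) concatenated latent + type-distribution vectors
        return torch.stack([text_vecs[text_types == t].mean(dim=0) for t in range(n_c)])

    def noise_ranking(vecs, anchors, noise_type=None):
        dists = torch.cdist(vecs, anchors)           # L2 distance to every type's anchor
        min_dist, nearest = dists.min(dim=1)
        if noise_type is None:                       # no noise type: the minimum distance
            return min_dist.sort()                   # itself is the degree of noise
        is_noise = nearest == noise_type             # split into the two sorted lists
        return min_dist[is_noise].sort(), min_dist[~is_noise].sort()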
The invention also provides a text cleaning system based on GAN clustering. As shown in FIG. 2, the system comprises a GAN network module, an anchor vector calculation module, and a text cleaning module:
The GAN network module consists of a generation network module, an encoding network module, and a discrimination network module.
The input of the generation network module is a vector of length dim_latent + n_c, formed by concatenating a floating-point vector of length dim_latent with a one-hot vector of length n_c, where n_c is the predefined number of text types; the output of the generation network module is a fixed-length generated text.
The input of the encoding network module is a real text or a generated text output by the generation network module; the output is a latent variable and a text-type distribution vector of length n_c, where the latent variable is a floating-point vector of length dim_latent; softmax converts the text-type distribution vector into a one-hot vector, and the one-hot vector gives the text type.
The input of the discrimination network module is a real text or a generated text output by the generation network module; the output is a scalar representing the probability that the input is real.
The latent variable and text-type distribution vector output by the encoding network module are concatenated to form the text's vector representation; the vectors of all texts of each text type are input into the anchor vector calculation module, and their mean is taken as that text type's anchor vector.
The text cleaning module takes as input the vector representations of texts obtained through the encoding network module and the anchor vectors of the text types obtained through the anchor vector calculation module, and computes the distance between each text's vector representation and each type's anchor vector. If the text types include a noise type, it takes the minimum computed distance, stores texts whose minimum distance is to the noise anchor and texts whose minimum distance is to a non-noise anchor in two separate lists, and sorts each list by distance; if the text types do not include a noise type, it uses the minimum distance between the input text's vector and the anchor vectors as the degree of noise, stores it in a list, and sorts by distance. Finally, for the text data in the resulting lists, whether each piece of text data is noise is judged manually according to a set noise threshold and the result is output.
The text cleaning system also comprises a latent-variable acquisition module and a loss function calculation module, used to train the GAN network module.
The latent-variable acquisition module randomly draws a floating-point vector of length dim_latent (matching the encoding network module's latent output), randomly selects a text-type index zc_idx, and converts it into a one-hot vector.
The loss function calculation module computes the following loss functions:
Real-text discrimination loss: input a batch of N real texts into the encoding network module to obtain N latent variables; input these into the generation network module to produce N generated texts; input both the real and the generated texts into the discrimination network module to obtain D_real, the probability that a real-text input is real, and D_gen, the probability that a generated-text input is real.
Gradient penalty loss: interpolate between a batch of real texts and generated texts to obtain a batch of new texts, compute their gradient vectors with the discrimination network module, and use the L2 norm of the gradient vectors as the gradient penalty loss.
Latent-variable reconstruction loss: input a generated text into the encoding network module to obtain a latent variable and a text-type distribution vector; compute the MSE loss between this latent variable and the latent variable that generated the text, and the cross-entropy loss between the text-type distribution vector and the one-hot vector that generated the text.
Text reconstruction loss: input the latent variable and one-hot vector obtained from a real text into the generation network module to obtain a generated text, and compute the MSE loss between the real text and the generated text.
Clustering loss: compute a clustering loss on the latent variables of the real texts with the k-means unsupervised clustering method.
The networks traversed when computing each loss are adjusted by backpropagating the corresponding loss value: the real-text discrimination loss and the gradient penalty loss adjust the generation and discrimination networks, the latent-variable and text reconstruction losses adjust the generation and encoding networks, and the clustering loss adjusts the encoding network.
Embodiment:
The method is used to clean medical public-opinion text data crawled from the web, for example question-and-answer data collected from Baidu Zhidao using keywords related to medical consultation. Because the keywords also hit unwanted non-medical question-and-answer data, such as advertisement texts, a GAN is trained to remove such text data and reduce the subsequent workload. If the data contains irrelevant data plus the 2 labeled types "medical apparatus" and "drug" (n_c = 2), a certain amount of labeled data can be used to compute the discrimination loss, while the other losses need no type distinction; in the cleaning step this data is used to compute the type anchor vectors, and the distance between an input text's vector and the 2 anchor vectors gives its degree of noise, where a larger minimum distance over the anchor vectors means the text is closer to noise. If the data contains the 3 labeled types "irrelevant", "medical apparatus", and "drug" (n_c = 3), the anchor vectors computed in the cleaning step include the "irrelevant" type, the distances to the 3 anchor vectors give the degree of noise, and texts closest to the "irrelevant" anchor are closer to noise. If the data has no labels, the type is obtained from the one-hot vector produced by the encoding network, and the anchor vectors and noise degrees are computed in the same way. The computed noise degrees are shown in Tables 1 and 2: Table 1 shows texts with a small distance to the consultation type, which are non-noise texts, and Table 2 shows texts with a large distance to the consultation type, which are noise texts. The "title" and "distance" columns in Tables 1 and 2 are the text to be cleaned and its computed distance, respectively.
TABLE 1
(table image not reproduced)
TABLE 2
(table image not reproduced)
The above-described embodiments are intended to illustrate rather than limit the invention; any modification or variation that falls within the spirit of the invention and the scope of the appended claims is covered by the invention.

Claims (6)

1. A text cleaning method based on GAN clustering, characterized by comprising the following steps:
(1) Construct and train a GAN comprising a generation network, an encoding network, and a discrimination network, wherein the input of the generation network is a vector of length dim_latent + n_c consisting of a latent variable and a one-hot vector of length n_c, the latent variable being a floating-point vector of length dim_latent; specifically, the input of the generation network is formed by concatenating a floating-point vector of length dim_latent with a one-hot vector of length n_c, where n_c is the predefined number of text types; the output of the generation network is a fixed-length generated text; the input of the encoding network is a real text or a generated text of the generation network, and its output is a latent variable and a text-type distribution vector of length n_c, from which softmax yields a one-hot vector; the input of the discrimination network is a real text or a generated text of the generation network, and its output is a scalar representing the probability that the input is real;
The training process of the GAN is specifically as follows:
a. Sample latent variables: randomly draw a floating-point vector of length dim_latent (matching the encoding network's latent output), randomly select a text-type index zc_idx, and convert it into a one-hot vector;
b. Compute the loss functions, specifically comprising:
Real-text discrimination loss: input a batch of N real texts into the encoding network to obtain N latent variables; input these into the generation network to produce N generated texts; input both the real and the generated texts into the discrimination network to obtain D_real, the probability that a real-text input is real, and D_gen, the probability that a generated-text input is real;
Gradient penalty loss: interpolate between a batch of real texts and generated texts to obtain a batch of new texts, compute their gradient vectors with the discrimination network, and use the L2 norm of the gradient vectors as the gradient penalty loss;
Latent-variable reconstruction loss: input a generated text into the encoding network to obtain a latent variable and a text-type distribution vector; compute the MSE loss between this latent variable and the latent variable that generated the text, and the cross-entropy loss between the text-type distribution vector and the one-hot vector that generated the text;
Text reconstruction loss: input the latent variable and one-hot vector obtained from a real text into the generation network to obtain a generated text, and compute the MSE loss between the real text and the generated text;
Clustering loss: compute a clustering loss on the latent variables of the real texts with the k-means unsupervised clustering method;
The networks traversed when computing each loss are adjusted by backpropagating the corresponding loss value, wherein the real-text discrimination loss and the gradient penalty loss adjust the generation and discrimination networks, the latent-variable and text reconstruction losses adjust the generation and encoding networks, and the clustering loss adjusts the encoding network;
(2) Perform clustering analysis and cleaning of the text, specifically comprising:
(2.1) For a batch of real texts, use the encoding network to obtain their latent variables and text-type distribution vectors, convert the text-type distribution vectors into one-hot vectors, and obtain the text types from the one-hot vectors;
(2.2) Take the concatenation of the latent variable and the text-type distribution vector as the text's vector representation; collect the vectors of all texts of each text type in the batch of real texts and take their mean as that text type's anchor vector;
(2.3) For a text to be cleaned, compute its vector representation and the distance between this vector and each text type's anchor vector; if the text types include a noise type, take the minimum computed distance, store texts whose minimum distance is to the noise anchor and texts whose minimum distance is to a non-noise anchor in two separate lists, and sort each list by distance; if the text types do not include a noise type, use the minimum distance between the vector of the text to be cleaned and the anchor vectors as its degree of noise, store it in a list, and sort by distance;
(2.4) For the text data in the lists obtained in step (2.3), manually judge whether each piece of text data is noise according to a set noise threshold.
2. The text cleaning method based on GAN clustering according to claim 1, characterized in that in step (1) the text types are defined according to the actual text classification task and either include both a noise type and non-noise types, or include only non-noise types.
3. The text cleaning method based on GAN clustering according to claim 1, characterized in that the generation network consists of an Embedding layer, several LSTM layers, and several fully connected layers.
4. The text cleaning method based on GAN clustering according to claim 1, characterized in that the discrimination network and the encoding network each consist of an Embedding layer, convolutional or LSTM layers, and a fully connected layer.
5. The text cleaning method based on GAN clustering according to claim 1, characterized in that when computing the real-text discrimination loss, if labeled texts are available, discrimination of the specific type of a real text can be added, in which case the output of the discrimination network represents the probability that the text belongs to each type or is a generated text.
6. A text cleaning system based on GAN clustering, characterized by comprising a GAN network module, a latent-variable acquisition module, a loss function calculation module, an anchor vector calculation module, and a text cleaning module:
The GAN network module consists of a generation network module, an encoding network module, and a discrimination network module;
The input of the generation network module is a vector of length dim_latent + n_c consisting of a latent variable and a one-hot vector of length n_c, formed by concatenating a floating-point vector of length dim_latent with a one-hot vector of length n_c, where n_c is the predefined number of text types; the output of the generation network module is a fixed-length generated text;
The input of the encoding network module is a real text or a generated text output by the generation network module; the output is a latent variable and a text-type distribution vector of length n_c, from which softmax yields a one-hot vector, and the one-hot vector gives the text type; the latent variable output by the encoding network module is a floating-point vector of length dim_latent;
The input of the discrimination network module is a real text or a generated text output by the generation network module; the output is a scalar representing the probability that the input is real;
The latent-variable acquisition module and the loss function calculation module are used to train the GAN network module;
The latent-variable acquisition module randomly draws a floating-point vector of length dim_latent (matching the encoding network module's latent output), randomly selects a text-type index zc_idx, and converts it into a one-hot vector;
The loss function calculation module computes the following loss functions:
Real-text discrimination loss: input a batch of N real texts into the encoding network module to obtain N latent variables; input these into the generation network module to produce N generated texts; input both the real and the generated texts into the discrimination network module to obtain D_real, the probability that a real-text input is real, and D_gen, the probability that a generated-text input is real;
Gradient penalty loss: interpolate between a batch of real texts and generated texts to obtain a batch of new texts, compute their gradient vectors with the discrimination network module, and use the L2 norm of the gradient vectors as the gradient penalty loss;
Latent-variable reconstruction loss: input a generated text into the encoding network module to obtain a latent variable and a text-type distribution vector; compute the MSE loss between this latent variable and the latent variable that generated the text, and the cross-entropy loss between the text-type distribution vector and the one-hot vector that generated the text;
Text reconstruction loss: input the latent variable and one-hot vector obtained from a real text into the generation network module to obtain a generated text, and compute the MSE loss between the real text and the generated text;
Clustering loss: compute a clustering loss on the latent variables of the real texts with the k-means unsupervised clustering method;
The networks traversed when computing each loss are adjusted by backpropagating the corresponding loss value, wherein the real-text discrimination loss and the gradient penalty loss adjust the generation and discrimination networks, the latent-variable and text reconstruction losses adjust the generation and encoding networks, and the clustering loss adjusts the encoding network;
The latent variable and text-type distribution vector output by the encoding network module are concatenated to form the text's vector representation; the vectors of all texts of each text type are input into the anchor vector calculation module, and their mean is taken as that text type's anchor vector;
The text cleaning module takes as input the vector representations of texts obtained through the encoding network module and the anchor vectors of the text types obtained through the anchor vector calculation module, and computes the distance between each text's vector representation and each type's anchor vector; if the text types include a noise type, it takes the minimum computed distance, stores texts whose minimum distance is to the noise anchor and texts whose minimum distance is to a non-noise anchor in two separate lists, and sorts each list by distance; if the text types do not include a noise type, it uses the minimum distance between the vector of the input text and the anchor vectors as the degree of noise, stores it in a list, and sorts by distance; finally, for the text data in the resulting lists, whether each piece of text data is noise is judged manually according to a set noise threshold and the result is output.
CN202111369093.1A 2021-11-18 2021-11-18 Method and system for cleaning text based on GAN clustering Active CN113821642B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111369093.1A CN113821642B (en) 2021-11-18 2021-11-18 Method and system for cleaning text based on GAN clustering


Publications (2)

Publication Number Publication Date
CN113821642A (en) 2021-12-21
CN113821642B (en) 2022-03-01

Family

ID=78919368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111369093.1A Active CN113821642B (en) 2021-11-18 2021-11-18 Method and system for cleaning text based on GAN clustering

Country Status (1)

Country Link
CN (1) CN113821642B (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795410A (en) * 2019-10-08 2020-02-14 华南师范大学 Multi-field text classification method
KR102271740B1 (en) * 2020-09-11 2021-07-02 주식회사 뉴로클 Method and apparatus for anomaly detection

Also Published As

Publication number Publication date
CN113821642A (en) 2021-12-21


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 310051 7th floor, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Huoshi Creation Technology Co.,Ltd.

Address before: 310051 7th floor, building B, 482 Qianmo Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee before: HANGZHOU FIRESTONE TECHNOLOGY Co.,Ltd.