CN114742018A - Contrastive learning hierarchical coding text clustering method and system based on adversarial training - Google Patents

Contrastive learning hierarchical coding text clustering method and system based on adversarial training

Info

Publication number
CN114742018A
Authority
CN
China
Prior art keywords
training
encoder
clustering
text
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210646870.0A
Other languages
Chinese (zh)
Inventor
郭湘
江岭
黄鹏
郭涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xiaoduo Technology Co ltd
Original Assignee
Chengdu Xiaoduo Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xiaoduo Technology Co ltd
Priority to CN202210646870.0A
Publication of CN114742018A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/231Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to a contrastive learning hierarchical coding text clustering method and system based on adversarial training. The method comprises the following steps: using a contrastive learning model as the encoder, inputting a batch of training sets into the encoder for reconstruction learning; adding adversarial training to the training process of the encoder, and guiding the training of the encoder according to the loss function; and clustering the text vectors output by the encoder using the Infomap algorithm. Based on the better text vector representations obtained through contrastive learning and adversarial training, the constructed undirected graph is clustered with the Infomap algorithm, achieving a better clustering effect.

Description

Contrastive learning hierarchical coding text clustering method and system based on adversarial training
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a contrastive learning hierarchical coding text clustering method and system based on adversarial training.
Background
In current intelligent dialogue systems, whether chit-chat or task-oriented, a huge number of samples must be labeled manually to determine the intention of each sample, and the samples are then used to train a deep learning model for intention recognition. However, this approach has at least the following drawbacks: (1) labeling datasets of up to tens of millions of samples consumes enormous human resources; (2) manual labeling inevitably introduces labeling errors, and once a label is wrong, it brings uncertainty to model training on the one hand, and on the other hand requires reworking and re-checking of the mislabeled samples, further aggravating the waste of labor.
Disclosure of Invention
The invention aims to provide a contrastive learning hierarchical coding text clustering method and system based on adversarial training, which obtains better text vector representations through contrastive learning and adversarial training, constructs an undirected graph from the vector representations, and clusters the undirected graph with the Infomap algorithm, so that similar samples in massive unlabeled texts are clustered, samples with the same intention are divided into one cluster, and the intention of each cluster can be found, thereby solving the problems pointed out in the background art.
The embodiment of the invention is realized by the following technical scheme: a contrastive learning hierarchical coding text clustering method based on adversarial training comprises the following steps:
S1, using a contrastive learning model as the encoder, inputting a batch of training sets into the encoder for reconstruction learning;
S2, adding adversarial training to the training process of the encoder, and guiding the training of the encoder according to the loss function;
and S3, clustering the text vectors output by the encoder using the Infomap algorithm.
According to a preferred embodiment, the encoder employs one of BERT, RoBERTa, TinyBERT, or BERT-wwm.
According to a preferred embodiment, if the encoder uses BERT, the text vector output by the encoder is passed through a multi-layer perceptron (MLP) for dimensionality reduction.
According to a preferred embodiment, the training set uses unlabeled similar sentences.
According to a preferred embodiment, the adversarial training employs one of FGM, PGD, or FreeLB.
According to a preferred embodiment, the adversarial training employs FGM, and step S2 specifically includes:
adding a perturbation term to the training set input, optimizing the loss function over the training set by gradient descent, updating the parameter weights of the model, and iterating these steps until the training of the encoder is complete, with the following expression:
$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega}\ \mathrm{Loss}(x+\Delta x,\ y;\ \theta)\right]$$
In the above formula, $D$ denotes the training set, $x$ the input, $y$ the label, $\theta$ the model parameters, $\mathrm{Loss}$ the loss on a training sample, $\Delta x$ the perturbation term, and $\Omega$ the perturbation space.
According to a preferred embodiment, the perturbation term $\Delta x$ is expressed as follows:
$$\Delta x = \epsilon \cdot \frac{\nabla_{x}\mathrm{Loss}}{\lVert \nabla_{x}\mathrm{Loss} \rVert}$$
In the above formula, $\epsilon$ denotes a hyperparameter, $\nabla_{x}\mathrm{Loss}$ denotes the gradient of $\mathrm{Loss}$ with respect to $x$, and $\lVert \nabla_{x}\mathrm{Loss} \rVert$ denotes the norm of the gradient.
According to a preferred embodiment, step S3 specifically includes:
expressing each text vector as a node, and constructing an undirected graph of the text representations by computing the cosine similarity between nodes;
taking the similarity between nodes as the random-walk probability, performing random walks on the undirected graph, and constructing Huffman codes directly from the random-walk probabilities;
encoding each node and each cluster according to the Huffman coding;
and computing the shortest average coding length of nodes and clusters respectively with a greedy algorithm to complete the clustering.
The invention also provides a contrastive learning hierarchical coding text clustering system based on adversarial training, applied to the above method, comprising:
a contrastive learning module, configured to use a contrastive learning model as the encoder and input a batch of training sets into the encoder for reconstruction learning;
an adversarial training module, configured to add adversarial training to the training process of the encoder and guide the training of the encoder according to the loss function;
and a hierarchical coding clustering module, configured to cluster the text vectors output by the encoder using the Infomap algorithm.
The technical scheme of the embodiment of the invention has at least the following advantages and beneficial effects: (1) the method and the system obtain better text vector representations by combining contrastive learning with adversarial training, so that the samples are distributed as uniformly as possible over the hypersphere surface; (2) based on the better text vector representations obtained through contrastive learning and adversarial training, the constructed undirected graph is clustered with the Infomap algorithm, achieving a better clustering effect; (3) the number of clusters does not need to be determined in advance, which reduces the extra manual parameter-tuning workload caused by the uncertainty of different sample sizes and numbers of categories; (4) samples are clustered automatically without manual labeling, and annotators can label the data in batches according to the clustering result, saving workload; (5) the method and the system provided by the invention can also be used to identify bad cases in existing sentence samples and to judge the coarseness of semantic granularity.
Drawings
Fig. 1 is a schematic flowchart of the contrastive learning hierarchical coding text clustering method based on adversarial training according to embodiment 1 of the present invention;
Fig. 2 is a model framework diagram provided in embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Embodiment 1
The applicant has found through research that the traditional approach of determining the intention of each sample by manual labeling has at least the following defects: (1) labeling datasets of up to tens of millions of samples consumes enormous human resources; (2) manual labeling inevitably introduces labeling errors, and once a label is wrong, it brings uncertainty to model training on the one hand, and on the other hand requires reworking and re-checking of the mislabeled samples, further aggravating the waste of labor.
Therefore, the embodiment of the invention provides a contrastive learning hierarchical coding text clustering method based on adversarial training, which obtains better text vector representations through contrastive learning and adversarial training, constructs an undirected graph from the vector representations, and clusters the undirected graph with the Infomap algorithm; in this way, similar samples in massive unlabeled texts can be clustered, samples with the same intention can be divided into one cluster, and the intention of each cluster can be found. The specific scheme is as follows:
referring to fig. 1, the contrast learning hierarchical coding text clustering method based on confrontation training mainly includes two steps, a text vector representing step and an Infomap clustering step, which are described in detail below.
It should be noted that Infomap has rarely been used for text clustering because, without good text vector representations, it cannot achieve a good clustering effect, and existing contrastive learning vector representations are not sufficient: they are not distributed as uniformly as possible over the hypersphere surface.
Based on this, in one implementation of the embodiment of the present invention, the text vector representation step includes:
referring to fig. 2, a comparison learning model of a Bert frame is used as an encoder, and a batch of training sets are input into the encoder for reconstruction learning, where the training sets of this embodiment use unlabeled similar sentences. It should be noted that, the encoder described above is implemented based on the department of self-supervision in Bert, and the principle of Bert is described in detail below:
and performing data enhancement on the input to construct a positive sample, wherein the data enhancement mode comprises one of synonym replacement, sentence truncation, reverse translation, punctuation mark addition, unimportant word deletion and word order rearrangement. It should be noted that synonym replacement is added, so that the model can improve the score of synonym similarity, and further improve the text matching capability of the model.
Further, self-supervised contrastive learning is performed, specifically as follows: the input $x_i$ and its positive sample $x_i^{+}$ are fed into the encoder for fitting training to obtain two representation vectors $h_i$ and $h_i^{+}$, which form a positive pair; another input $x_j$ is randomly sampled within the batch to serve as a negative example of $x_i$. Further, the cosine similarity between the representation vector and the other vectors in the batch is computed, the computed cosine similarity is used as the matching score to rank the candidate texts, and the final loss of the model is obtained through a softmax function and cross entropy; the final loss function is expressed as follows:
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} \mathbb{1}_{[i=j]}\,\log \frac{e^{\operatorname{sim}(h_i,\,h_j^{+})/\tau}}{\sum_{k=1}^{N} e^{\operatorname{sim}(h_i,\,h_k^{+})/\tau}}$$
In the above formula, $\mathbb{1}_{[i=j]}$ denotes an indicator that equals 1 only when $i = j$, $N$ denotes the total number of sentences in the batch, $\tau$ denotes the temperature hyperparameter, $\operatorname{sim}(\cdot,\cdot)$ denotes cosine similarity, $h_i^{+}$ denotes the representation vector of the positive sample $x_i^{+}$ of $x_i$, $h_j^{+}$ (for $j \neq i$) denotes the representation vector of an in-batch negative sample of $x_i$, and $\theta$ denotes the trainable parameters of the model.
Further, the network parameters are trained iteratively based on the final loss function. In this embodiment, the temperature hyperparameter $\tau$ is set to 0.05 and the batch size $N$ is set to 128; experiments show that with this setting the model can be fitted in only one epoch, and the method obtains high similarity scores for similar sentences of different lengths.
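As a concrete illustration of this training objective, the sketch below shows how the encoder with an MLP dimensionality-reduction head and the softmax-plus-cross-entropy loss over in-batch cosine similarities could be implemented. It assumes PyTorch and the Hugging Face transformers library; the class name, the checkpoint `bert-base-chinese`, and the MLP layer sizes are illustrative assumptions rather than values given in the patent.

```python
import torch
import torch.nn.functional as F
from transformers import BertModel

class ContrastiveEncoder(torch.nn.Module):
    """BERT encoder followed by an MLP head for dimensionality reduction (sketch)."""
    def __init__(self, name: str = "bert-base-chinese", out_dim: int = 128):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.mlp = torch.nn.Sequential(          # dimensionality-reduction head
            torch.nn.Linear(self.bert.config.hidden_size, 256),
            torch.nn.Tanh(),
            torch.nn.Linear(256, out_dim),
        )

    def forward(self, **inputs):
        cls = self.bert(**inputs).last_hidden_state[:, 0]  # [CLS] representation
        return self.mlp(cls)

def contrastive_loss(h: torch.Tensor, h_pos: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """Softmax + cross-entropy over the in-batch cosine-similarity matrix.

    h, h_pos: (N, d) representations of the inputs and their positive samples;
    for row i, the other rows of h_pos serve as in-batch negatives.
    """
    sim = F.cosine_similarity(h.unsqueeze(1), h_pos.unsqueeze(0), dim=-1) / tau  # (N, N)
    labels = torch.arange(h.size(0), device=h.device)  # target j = i, i.e. the indicator 1[i=j]
    return F.cross_entropy(sim, labels)
```

In this sketch the batch size and temperature follow the embodiment (N = 128, τ = 0.05), and fitting within a single epoch is assumed, as stated above.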
It should be noted that, besides the BERT illustrated above, the encoder may also employ one of RoBERTa, TinyBERT, or BERT-wwm; if the encoder adopts BERT, the text vector output by the encoder is passed through a multi-layer perceptron (MLP) for dimensionality reduction, which is not repeated here.
Furthermore, adversarial training is added to the training process of the encoder, and the training of the encoder is guided according to the loss function. In one implementation of the embodiment of the present invention, the adversarial training employs FGM, with the following specific steps:
adding a perturbation term to the training set input, optimizing the loss function over the training set by gradient descent, updating the parameter weights of the model, and iterating these steps until the training of the encoder is complete, with the following expression:
$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega}\ \mathrm{Loss}(x+\Delta x,\ y;\ \theta)\right]$$
In the above formula, $D$ denotes the training set, $x$ the input, $y$ the label, $\theta$ the model parameters, $\mathrm{Loss}$ the loss on a training sample, $\Delta x$ the perturbation term, and $\Omega$ the perturbation space.
During the whole model training process, the forward pass is computed for both $x$ and $x+\Delta x$ to obtain the final Loss; during back-propagation the gradients with respect to $x$ and $x+\Delta x$ are computed respectively before the gradient descent step. Training in this manner takes roughly twice as long as training without the perturbation term.
Wherein the perturbation term $\Delta x$ is expressed as follows:
$$\Delta x = \epsilon \cdot \frac{\nabla_{x}\mathrm{Loss}}{\lVert \nabla_{x}\mathrm{Loss} \rVert}$$
In the above formula, $\epsilon$ denotes a hyperparameter, $\nabla_{x}\mathrm{Loss}$ denotes the gradient of $\mathrm{Loss}$ with respect to $x$, and $\lVert \nabla_{x}\mathrm{Loss} \rVert$ denotes the norm of the gradient. Note that $\Delta x$ is taken along the direction of gradient ascent, so the Loss computed on $x+\Delta x$ increases significantly.
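The FGM step described above can be sketched as follows. This is a generic PyTorch implementation of the Fast Gradient Method applied to the word-embedding weights; the class name, the default `epsilon`, and the `emb_name` filter are illustrative assumptions, not details specified in the patent.

```python
import torch

class FGM:
    """Fast Gradient Method: perturb the embedding weights along the gradient direction."""
    def __init__(self, model: torch.nn.Module, epsilon: float = 1.0, emb_name: str = "word_embeddings"):
        self.model, self.epsilon, self.emb_name = model, epsilon, emb_name
        self.backup = {}

    def attack(self):
        # Add the perturbation Δx = ε · g / ‖g‖ to the embedding parameters.
        for name, p in self.model.named_parameters():
            if p.requires_grad and self.emb_name in name and p.grad is not None:
                self.backup[name] = p.data.clone()
                norm = torch.norm(p.grad)            # the norm ("modal length") of the gradient
                if norm != 0:
                    p.data.add_(self.epsilon * p.grad / norm)

    def restore(self):
        # Remove the perturbation, restoring the original embedding weights.
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}

# One training step (sketch): forward/backward on x, then on x + Δx, then update.
#   loss = contrastive_loss(model(**batch), model(**batch_pos)); loss.backward()
#   fgm.attack()
#   loss_adv = contrastive_loss(model(**batch), model(**batch_pos)); loss_adv.backward()
#   fgm.restore(); optimizer.step(); optimizer.zero_grad()
```

The two forward/backward passes in the commented training step correspond to the roughly doubled training time mentioned above.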
It should be noted that, in addition to the above-mentioned FGM, one of PGD and FreeLB may also be used for the adversarial training, which is not repeated here.
In one implementation of the embodiment of the present invention, the Infomap clustering step includes:
clustering the text vectors output by the encoder using the Infomap algorithm, specifically as follows:
the method includes the steps that a text vector represents a node, an undirected graph represented by the text is constructed by calculating cosine similarity between the nodes, and it is noted that the edge construction can be carried out only by determining threshold values in advance, for example, setting threshold =0.5, that is, the edge construction and composition can be carried out on two nodes only when the cosine similarity between the two nodes is greater than or equal to 0.5. Further, taking the similarity between the nodes as the probability of random walk, carrying out random walk on the undirected graph, and directly constructing a huffman code according to the probability of the random walk; coding each node according to the huffman coding, and coding each cluster according to the huffman coding; and respectively calculating the shortest average coding length of the nodes and the clusters by using a greedy algorithm to finish clustering. It is understood that the above process does not require the confirmation of the number of clusters, the number of clusters can be automatically confirmed according to the hierarchical coding, and the above method has high efficiency and interpretability.
The embodiment of the invention also provides a contrastive learning hierarchical coding text clustering system based on adversarial training, applied to the above method, comprising:
a contrastive learning module, configured to use a contrastive learning model as the encoder and input a batch of training sets into the encoder for reconstruction learning;
an adversarial training module, configured to add adversarial training to the training process of the encoder and guide the training of the encoder according to the loss function;
and a hierarchical coding clustering module, configured to cluster the text vectors output by the encoder using the Infomap algorithm.
In summary, the technical solution of the embodiment of the present invention has at least the following advantages and beneficial effects: (1) the method and the system obtain better text vector representations by combining contrastive learning with adversarial training, so that the samples are distributed as uniformly as possible over the hypersphere surface; (2) based on the better text vector representations obtained through contrastive learning and adversarial training, the constructed undirected graph is clustered with the Infomap algorithm, achieving a better clustering effect; (3) the number of clusters does not need to be determined in advance, which reduces the extra manual parameter-tuning workload caused by the uncertainty of different sample sizes and numbers of categories; (4) samples are clustered automatically without manual labeling, and annotators can label the data in batches according to the clustering result, saving workload; (5) the method and the system provided by the invention can also be used to identify bad cases in existing sentence samples and to judge the coarseness of semantic granularity.
Moreover, those skilled in the art will appreciate that aspects of the present description may be illustrated and described in terms of several patentable species or situations, including any new and useful combination of processes, machines, manufacture, or materials, or any new and useful improvement thereof. Accordingly, aspects of this description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system." Furthermore, aspects of the present description may be represented as a computer product, including computer readable program code, embodied in one or more computer readable media.
The computer storage medium may comprise a propagated data signal with the computer program code embodied therewith, for example, on a baseband or as part of a carrier wave. The propagated signal may take any of a variety of forms, including electromagnetic, optical, etc., or any suitable combination. A computer storage medium may be any computer-readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated over any suitable medium, including radio, cable, fiber optic cable, RF or the like, or any combination of the preceding.
Computer program code required for the operation of various portions of this specification may be written in any one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, or ABAP, a dynamic programming language such as Python, Ruby, or Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any network format, such as a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet), in a cloud computing environment, or as a service, such as software as a service (SaaS).
Additionally, the order in which the elements and sequences of the process are recited in the specification, the use of alphanumeric characters, or other designations, is not intended to limit the order in which the processes and methods of the specification occur, unless otherwise specified in the claims. While various presently contemplated embodiments of the invention have been discussed in the foregoing disclosure by way of example, it is to be understood that such detail is solely for that purpose and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements that are within the spirit and scope of the embodiments herein. For example, although the system components described above may be implemented by hardware devices, they may also be implemented by software-only solutions, such as installing the described system on an existing server or mobile device.
Similarly, it should be noted that in the preceding description of embodiments of the present specification, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the embodiments. This method of disclosure, however, is not to be interpreted as requiring more features than are expressly recited in each claim. Indeed, the embodiments may be characterized as having fewer than all of the features of a single embodiment disclosed above.
The present invention has been described in terms of the preferred embodiment, and it is not intended to be limited to the embodiment. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (9)

1. A contrastive learning hierarchical coding text clustering method based on adversarial training, characterized by comprising the following steps:
S1, using a contrastive learning model as the encoder, inputting a batch of training sets into the encoder for reconstruction learning;
S2, adding adversarial training to the training process of the encoder, and guiding the training of the encoder according to a loss function;
and S3, clustering the text vectors output by the encoder using the Infomap algorithm.
2. The method as claimed in claim 1, wherein the encoder employs one of BERT, RoBERTa, TinyBERT, or BERT-wwm.
3. The method as claimed in claim 2, wherein if the encoder uses BERT, the text vector output by the encoder is passed through a multi-layer perceptron (MLP) for dimensionality reduction.
4. The method of claim 1, wherein the training set employs unlabeled similar sentences.
5. The method as claimed in claim 1, wherein the adversarial training employs one of FGM, PGD, or FreeLB.
6. The method as claimed in claim 1, wherein the adversarial training employs FGM, and step S2 specifically includes:
adding a perturbation term to the training set input, optimizing the loss function over the training set by gradient descent, updating the parameter weights of the model, and iterating these steps until the training of the encoder is complete, with the following expression:
$$\min_{\theta}\ \mathbb{E}_{(x,y)\sim D}\left[\max_{\Delta x \in \Omega}\ \mathrm{Loss}(x+\Delta x,\ y;\ \theta)\right]$$
In the above formula, $D$ denotes the training set, $x$ the input, $y$ the label, $\theta$ the model parameters, $\mathrm{Loss}$ the loss on a training sample, $\Delta x$ the perturbation term, and $\Omega$ the perturbation space.
7. The contrastive learning hierarchical coding text clustering method based on adversarial training as claimed in claim 6, wherein the perturbation term $\Delta x$ is expressed as follows:
$$\Delta x = \epsilon \cdot \frac{\nabla_{x}\mathrm{Loss}}{\lVert \nabla_{x}\mathrm{Loss} \rVert}$$
In the above formula, $\epsilon$ denotes a hyperparameter, $\nabla_{x}\mathrm{Loss}$ denotes the gradient of $\mathrm{Loss}$ with respect to $x$, and $\lVert \nabla_{x}\mathrm{Loss} \rVert$ denotes the norm of the gradient.
8. The contrastive learning hierarchical coding text clustering method based on adversarial training as claimed in claim 1, wherein step S3 specifically comprises:
expressing each text vector as a node, and constructing an undirected graph of the text representations by computing the cosine similarity between nodes;
taking the similarity between nodes as the random-walk probability, performing random walks on the undirected graph, and constructing Huffman codes directly from the random-walk probabilities;
encoding each node and each cluster according to the Huffman coding;
and computing the shortest average coding length of nodes and clusters respectively with a greedy algorithm to complete the clustering.
9. A contrastive learning hierarchical coding text clustering system based on adversarial training, applied to the method of any one of claims 1 to 8, comprising:
a contrastive learning module, configured to use a contrastive learning model as the encoder and input a batch of training sets into the encoder for reconstruction learning;
an adversarial training module, configured to add adversarial training to the training process of the encoder and guide the training of the encoder according to the loss function;
and a hierarchical coding clustering module, configured to cluster the text vectors output by the encoder using the Infomap algorithm.
CN202210646870.0A 2022-06-09 2022-06-09 Contrastive learning hierarchical coding text clustering method and system based on adversarial training Pending CN114742018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210646870.0A CN114742018A (en) Contrastive learning hierarchical coding text clustering method and system based on adversarial training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210646870.0A CN114742018A (en) Contrastive learning hierarchical coding text clustering method and system based on adversarial training

Publications (1)

Publication Number Publication Date
CN114742018A true CN114742018A (en) 2022-07-12

Family

ID=82287966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210646870.0A Pending CN114742018A (en) 2022-06-09 2022-06-09 Contrast learning level coding text clustering method and system based on confrontation training

Country Status (1)

Country Link
CN (1) CN114742018A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952852A (en) * 2022-12-20 2023-04-11 北京百度网讯科技有限公司 Model training method, text retrieval method, device, electronic equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113761192A (en) * 2021-05-18 2021-12-07 腾讯云计算(北京)有限责任公司 Text processing method, text processing device and text processing equipment
CN113837370A (en) * 2021-10-20 2021-12-24 北京房江湖科技有限公司 Method and apparatus for training a model based on contrast learning
CN114003698A (en) * 2021-12-27 2022-02-01 成都晓多科技有限公司 Text retrieval method, system, equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761192A (en) * 2021-05-18 2021-12-07 腾讯云计算(北京)有限责任公司 Text processing method, text processing device and text processing equipment
CN113673201A (en) * 2021-07-15 2021-11-19 北京三快在线科技有限公司 Text representation vector generation method and device, storage medium and electronic equipment
CN113837370A (en) * 2021-10-20 2021-12-24 北京房江湖科技有限公司 Method and apparatus for training a model based on contrast learning
CN114003698A (en) * 2021-12-27 2022-02-01 成都晓多科技有限公司 Text retrieval method, system, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DANIELA N. RIM et al.: "Adversarial Training with Contrastive Learning in NLP", arXiv:2109.09075v1 *
TRIANA DEWI SALMA et al.: "Text Classification Using XLNet with Infomap Automatic Labeling Process", 2021 8th International Conference on Advanced Informatics: Concepts, Theory and Applications (ICAICTA) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115952852A (en) * 2022-12-20 2023-04-11 北京百度网讯科技有限公司 Model training method, text retrieval method, device, electronic equipment and medium
CN115952852B (en) * 2022-12-20 2024-03-12 北京百度网讯科技有限公司 Model training method, text retrieval method, device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN110737758B (en) Method and apparatus for generating a model
CN111444340B (en) Text classification method, device, equipment and storage medium
CN113505244B (en) Knowledge graph construction method, system, equipment and medium based on deep learning
CN112256860B (en) Semantic retrieval method, system, equipment and storage medium for customer service dialogue content
CN114003698B (en) Text retrieval method, system, equipment and storage medium
CN112528677B (en) Training method and device of semantic vector extraction model and electronic equipment
CN112380863A (en) Sequence labeling method based on multi-head self-attention mechanism
CN111666500A (en) Training method of text classification model and related equipment
CN113392209A (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN113268560A (en) Method and device for text matching
CN112528654A (en) Natural language processing method and device and electronic equipment
CN113987174A (en) Core statement extraction method, system, equipment and storage medium for classification label
CN113761190A (en) Text recognition method and device, computer readable medium and electronic equipment
CN114742018A (en) Contrastive learning hierarchical coding text clustering method and system based on adversarial training
CN115203419A (en) Language model training method and device and electronic equipment
CN111597807A (en) Method, device and equipment for generating word segmentation data set and storage medium thereof
CN112148879B (en) Computer readable storage medium for automatically labeling code with data structure
CN115129826B (en) Electric power field model pre-training method, fine tuning method, device and equipment
CN116680575A (en) Model processing method, device, equipment and storage medium
CN114239583B (en) Method, device, equipment and medium for training entity chain finger model and entity chain finger
CN114491030A (en) Skill label extraction and candidate phrase classification model training method and device
CN114911940A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN113392929A (en) Biological sequence feature extraction method based on word embedding and self-encoder fusion
Bauer et al. Optimizing for measure of performance in max-margin parsing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220712