CN115640799A - Sentence vector characterization method based on enhanced momentum contrast learning - Google Patents

Sentence vector characterization method based on enhanced momentum contrast learning

Info

Publication number
CN115640799A
Authority
CN
China
Prior art keywords
queue
sentence
sentence vector
encoder
data enhancement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211105642.9A
Other languages
Chinese (zh)
Inventor
金日泽
齐士博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Polytechnic University
Original Assignee
Tianjin Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Polytechnic University filed Critical Tianjin Polytechnic University
Priority to CN202211105642.9A priority Critical patent/CN115640799A/en
Publication of CN115640799A publication Critical patent/CN115640799A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a sentence vector characterization method based on enhanced momentum contrast learning, which relates to the technical field of natural language processing. The technical scheme comprises the following steps: S1: perform data enhancement on the input sample with a data enhancement module to generate two groups of word embedding representations; S2: send the two groups of word embeddings generated in S1 to encoder E_q and encoder E_k respectively for encoding, obtaining sentence vector r_q and sentence vector r_k respectively; S3: push the sentence vector r_k obtained in S2 into queue Q, where Q is a first-in first-out queue; S4: repeat the operations of S1-S3 a number of times, so that queue Q is continuously updated and maintains a fixed number of negative samples; S5: calculate the contrastive loss between the sentence vectors generated in the final batch and the sentence vectors in queue Q. The method surpasses the current state-of-the-art results, demonstrating its strong capability in sentence vector characterization.

Description

Sentence vector characterization method based on enhanced momentum contrast learning
Technical Field
The invention relates to the technical field of natural language processing, in particular to a sentence vector characterization method based on enhanced momentum contrast learning.
Background
General-purpose sentence vector representation learning has always been one of the most important lines of work in the field of NLP (natural language processing), providing a basis for downstream tasks such as machine translation, information retrieval, and sentiment analysis. With the rise of BERT, fine-tuning strategies based on pre-trained models have achieved great success on many downstream tasks, but their results for sentence vector representation are not ideal, sometimes not even as good as some basic LSTM structures. It was not until the recent rise of contrastive learning that sentence vector representation improved further; unsupervised and self-supervised models such as SimCSE and ConSERT have obtained state-of-the-art results on many tasks. This learning paradigm does not require large amounts of expensive labeled data and is often more effective in practical applications.
Contrastive learning requires negative samples during training: the more negative samples participate in the calculation in each batch, the greater the discrimination between the positive and negative samples produced during the calculation, and the better the effect. In recent years, contrastive learning methods based on momentum updating have achieved great success; they largely solve the problem that the number of negative samples in contrastive learning is limited by the batch size, so that negative samples can be exploited more fully. However, when such a method is applied directly to the field of natural language processing, it does not reach the expected performance, because the data enhancement is imperfect and the generated negative samples are not fully utilized.
Disclosure of Invention
The invention aims to provide a sentence vector characterization method based on enhanced momentum contrast learning, which exceeds the current optimal results and demonstrates the outstanding capability of the method in sentence vector characterization.
The technical purpose of the invention is realized by the following technical scheme: the sentence vector characterization method for enhanced momentum contrast learning specifically comprises the following steps:
S1: performing data enhancement on the input sample by using a data enhancement module to generate two groups of word embedding representations;
the specific mode of data enhancement is one or two of SimCSE Drop, Cutoff, Word Repetition and Shuffle Tokens;
S2: the two groups of word embeddings generated in S1 are sent to encoder E_q and encoder E_k respectively for encoding, obtaining sentence vector r_q and sentence vector r_k respectively;
S3: the sentence vector r_k obtained in S2 is pushed into queue Q; the queue Q is a first-in first-out queue;
S4: the operations of S1-S3 are repeated a number of times, so that the queue Q is continuously updated and maintains a fixed number of negative samples;
S5: the contrastive loss is calculated between the sentence vectors generated in the final batch and the sentence vectors in the queue Q.
Further, the parameters of the encoder E_k are synchronized by the momentum update formula
θ_k ← m · θ_k + (1 − m) · θ_q,
where θ_k and θ_q denote the parameters of E_k and E_q respectively, and m ∈ (0, 1).
Further, the steps of calculating the contrastive loss in S5 are:
S5-1: calculate the similarity between the two sentence vectors generated from the same input sample after data enhancement, using the formula
sim(r_q, r_k) = (r_q · r_k) / (||r_q|| · ||r_k||);
S5-2: perform the loss calculation according to the formulas
l_moco^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / ( exp(sim(r_q^i, r_k^i)/τ) + Σ_{j=1..K} exp(sim(r_q^i, q_j)/τ) ) ],
l_batch^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / Σ_{j=1..N} exp(sim(r_q^i, r_k^j)/τ) ],
L_moco = (1/N) Σ_{i=1..N} l_moco^i,  L_batch = (1/N) Σ_{i=1..N} l_batch^i,
wherein N is the size of a batch, K is the length of the queue Q, q_j is the j-th sentence vector stored in Q, sim(·,·) is the similarity score of two sentence vectors, and τ is the temperature of the softmax classification;
S5-3: calculate the final contrastive loss by the formula
L = L_moco + α · L_batch,
wherein α is a hyperparameter.
In conclusion, the invention has the following beneficial effects:
1. The MoCo method from the computer vision field is applied to the NLP field and adapted to NLP tasks, decoupling the number of negative samples from the hardware resource constraints during training;
2. a data enhancement strategy tailored to momentum contrast learning is adopted, making model training faster and more efficient and feasible even with small samples;
3. a Dual-Negative loss combining the momentum loss and the in-batch loss is applied in contrastive learning, taking both historical samples and current samples into consideration and improving sample utilization.
Drawings
FIG. 1 is a flow chart of a sentence vector characterization method based on enhanced momentum contrast learning according to an embodiment of the present invention;
FIG. 2 is a heat map of the Spearman correlation on the STS tasks for different combinations of data enhancements in an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to FIGS. 1-2.
The embodiment is as follows: the sentence vector characterization method based on enhanced momentum contrast learning, as shown in fig. 1 and fig. 2, specifically includes the following steps:
S1: performing data enhancement on the input sample by using a data enhancement module to generate two groups of word embedding representations;
the specific mode of data enhancement is one or two of SimCSE Drop, Cutoff, Word Repetition and Shuffle Tokens;
For SimCSE Drop, the same input is passed through the pre-trained encoder twice; the two random dropout passes yield two word embeddings of different forms that constitute a positive pair. In this embodiment, the standard dropout layers inside the Transformer are used as the noise source, acting on the attention dropout and the hidden dropout; in implementation this is equivalent to performing no explicit operation on the input, and the enhanced sample is obtained by directly feeding the input into BERT;
For Cutoff, a whole row (token level) or a whole column (feature level) of the word embedding is deleted with a certain probability to realize data enhancement; in practice the dimensionality of the word embedding in the code is not changed, and the deletion is implemented by setting all elements to be deleted to 0;
For Word Repetition, a token in a sentence (except CLS and SEP) is randomly repeated with a certain probability; compared with word deletion, Word Repetition better preserves the original meaning, and compared with synonym replacement it is simpler to implement;
For Shuffle Tokens, the word embeddings are randomly scrambled; in implementation, the order of the word embeddings themselves is not changed, but the position ids that represent the word embedding order are shuffled.
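As an illustration of the last two strategies, the following Python sketch shows one possible way to realize Cutoff and Shuffle Tokens on a batch of BERT inputs. The function names, the cut_prob parameter, and the exact masking scheme are assumptions made for this sketch, not the code of the embodiment.

```python
import torch

def token_cutoff(embeddings: torch.Tensor, cut_prob: float = 0.1) -> torch.Tensor:
    """Token-level Cutoff: zero out whole rows of the word-embedding matrix.

    embeddings: (batch, seq_len, hidden). The dimensionality is left unchanged;
    "deleted" tokens are simply set to 0, as described above.
    """
    batch, seq_len, _ = embeddings.shape
    keep_mask = (torch.rand(batch, seq_len, device=embeddings.device) > cut_prob)
    return embeddings * keep_mask.unsqueeze(-1).to(embeddings.dtype)

def shuffle_tokens(input_ids: torch.Tensor) -> torch.Tensor:
    """Shuffle Tokens: keep the word embeddings in place but scramble the
    position_ids that encode their order."""
    batch, seq_len = input_ids.shape
    position_ids = torch.stack(
        [torch.randperm(seq_len) for _ in range(batch)]
    ).to(input_ids.device)
    return position_ids  # pass as `position_ids=` to a BERT-style encoder
```

In use, the shuffled position_ids would be fed to the encoder together with the unchanged input_ids, matching the description above.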
S2: the two groups of word embeddings generated in S1 are sent to encoder E_q and encoder E_k respectively for encoding, obtaining sentence vector r_q and sentence vector r_k respectively;
In the present embodiment, the encoder E_q and the encoder E_k are both based on BERT, and a pooling layer is added on top of them to facilitate sentence vector generation.
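A minimal sketch of such a BERT-based encoder with a pooling layer on top might look as follows; the use of mean pooling and L2 normalization here is an assumption for illustration, since the embodiment does not specify the pooling type.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SentenceEncoder(nn.Module):
    """BERT backbone plus a pooling layer that turns token states into one sentence vector."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask, position_ids=None):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        position_ids=position_ids)
        hidden = out.last_hidden_state                  # (batch, seq_len, hidden)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1)   # mean pooling over real tokens
        return nn.functional.normalize(pooled, dim=-1)  # unit-length sentence vector
```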
S3: the sentence vector r_k obtained in S2 is pushed into queue Q; the queue Q is a first-in first-out queue;
S4: the operations of S1-S3 are repeated a number of times, so that the queue Q is continuously updated and maintains a fixed number of negative samples;
S5: the contrastive loss is calculated between the sentence vectors generated in the final batch and the sentence vectors in the queue Q.
The encoder E_k is not updated by back-propagation of the loss function. The negative examples in Q come from different batches; if E_k were updated at every step, the samples pushed in by different batches would differ greatly, the representation would jump around too much, and training would be difficult to converge. Moreover, back-propagating through E_k would consume a great deal of resources because the negative sample queue Q is usually very large. Therefore, in this embodiment, the parameters of the encoder E_k are synchronized by the momentum update formula
θ_k ← m · θ_k + (1 − m) · θ_q,
where θ_q denotes the parameters of the encoder E_q. In this embodiment m takes the value 0.999, which ensures that E_k changes only very slightly after each batch and evolves continuously and smoothly throughout training, so that the sentence vectors r_k from different batches can be regarded as produced by E_k under essentially the same parameters, and the consistency across the whole queue Q remains high.
In this embodiment, in order to make better use of all the sentence vector representations generated within a batch, a Dual-Negative loss is proposed. At the same time, the enqueue logic of the original MoCo is changed: raising the priority of the enqueue operation ensures that the positive and negative samples currently participating in the calculation are generated in the same batch, which makes the subsequent calculation more consistent.
The steps of calculating the contrastive loss in S5 are as follows:
S5-1: calculate the similarity between the two sentence vectors generated from the same input sample after data enhancement, using the formula
sim(r_q, r_k) = (r_q · r_k) / (||r_q|| · ||r_k||);
S5-2: perform the loss calculation according to the formulas
l_moco^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / ( exp(sim(r_q^i, r_k^i)/τ) + Σ_{j=1..K} exp(sim(r_q^i, q_j)/τ) ) ],
l_batch^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / Σ_{j=1..N} exp(sim(r_q^i, r_k^j)/τ) ],
L_moco = (1/N) Σ_{i=1..N} l_moco^i,  L_batch = (1/N) Σ_{i=1..N} l_batch^i,
wherein N is the size of a batch, K is the length of the queue Q, q_j is the j-th sentence vector stored in Q, sim(·,·) is the similarity score of two sentence vectors, and τ is the temperature of the softmax classification;
s5-3: in fact, the loss is equivalent to an N-way softmax classifier, and the final contrast loss is calculated by the formula:
Figure BDA0003836123870000065
where α is a hyperparameter, and α =0.3 in this embodiment.
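Based on the loss reconstruction above, a sketch of the Dual-Negative loss (queue-based momentum loss plus the α-weighted in-batch loss) could look as follows. The exact form of the combination is an assumption derived from the description, with τ = 0.02 and α = 0.3 as stated in this embodiment.

```python
import torch
import torch.nn.functional as F

def dual_negative_loss(r_q, r_k, queue, tau: float = 0.02, alpha: float = 0.3):
    """r_q, r_k: (N, d) L2-normalized sentence vectors from E_q / E_k.
    queue: (K, d) momentum-encoded negatives from previous batches."""
    # Momentum (queue) loss: the positive is the matching r_k, negatives come from Q.
    pos = (r_q * r_k).sum(-1, keepdim=True)                 # (N, 1) positive scores
    logits_queue = torch.cat([pos, r_q @ queue.t()], dim=1) / tau   # (N, 1 + K)
    labels_queue = torch.zeros(r_q.size(0), dtype=torch.long, device=r_q.device)
    loss_moco = F.cross_entropy(logits_queue, labels_queue)

    # In-batch loss: each r_q is classified against all N r_k of the batch (N-way softmax).
    logits_batch = (r_q @ r_k.t()) / tau                    # (N, N)
    labels_batch = torch.arange(r_q.size(0), device=r_q.device)
    loss_batch = F.cross_entropy(logits_batch, labels_batch)

    return loss_moco + alpha * loss_batch
```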
The following are the specific experimental procedures and results of the above technical scheme:
1. Establishing the dataset.
In this experiment, the proposed method was trained and evaluated on an English Wikipedia dataset and on semantic textual similarity (STS) datasets. The semantic textual similarity datasets comprise STS 2012-2016 (STS12-STS16), STS Benchmark (STS-B), and SICK-Relatedness (SICK-R). Consistent with SimCSE, the Wikipedia dataset containing 1 million randomly sampled unlabeled sentences was used. The semantic textual similarity datasets consist of sentence pairs, each annotated with a semantic similarity score between 0 and 5. Note that the STS datasets are not used as training sets in the experiment; they are used for evaluation only, and Spearman correlation coefficients are used to measure the agreement between the predicted results and the gold labels.
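Since evaluation relies on the Spearman correlation between predicted similarities and the annotated 0-5 scores, a small sketch of that computation is given below, assuming cosine similarity between the two sentence vectors of each STS pair and using SciPy's spearmanr.

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(vecs_a: np.ndarray, vecs_b: np.ndarray, gold_scores: np.ndarray) -> float:
    """vecs_a, vecs_b: (num_pairs, d) sentence vectors; gold_scores: human 0-5 ratings."""
    cos = (vecs_a * vecs_b).sum(-1) / (
        np.linalg.norm(vecs_a, axis=-1) * np.linalg.norm(vecs_b, axis=-1))
    corr, _ = spearmanr(cos, gold_scores)   # correlation between predictions and gold labels
    return corr
```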
2. Establishing the baselines.
GloVe, IS-BERT, BERT-CT, BERT-flow, CLEAR, SimCSE and ConSERT were selected as comparison baselines in this experiment because they have all been tested on the STS datasets and report very similar metrics, namely the quality of the generated sentence vectors, which makes them highly relevant to our method.
3. Hyperparameter settings.
The encoders E_q and E_k are both initialized with BERT-base-uncased, and a pooling layer is added on top of the encoders to generate the final sentence vector representation. In the experiments, the length of the negative example queue Q is 10240; τ, the softmax temperature that controls the smoothness of the model's predictions, is set to 0.02; the momentum update factor m is set to 0.999; the learning rate is 1e-5; the maximum sequence length of the input is 64; and training is carried out for one epoch in an unsupervised manner.
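For reference, the hyperparameters above can be collected into a single configuration object; this is only an illustrative sketch, and the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    model_name: str = "bert-base-uncased"   # initializes both E_q and E_k
    queue_size: int = 10240                 # length of the negative example queue Q
    temperature: float = 0.02               # tau for the softmax
    momentum: float = 0.999                 # m in the momentum update
    learning_rate: float = 1e-5
    max_seq_length: int = 64
    epochs: int = 1                         # one unsupervised training round
    alpha: float = 0.3                      # Dual-Negative loss weight
```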
4. Main results.
In order to evaluate the sentence vector representations generated by the model under unsupervised learning, the English Wikipedia dataset is used for training, and evaluation experiments are carried out on the 7 semantic textual similarity (STS) datasets.
Note that ConSERT was originally trained on a random mix of unlabeled STS datasets; to make the comparison fairer, ConSERT was retrained on the English Wikipedia dataset and likewise tested on the STS datasets. As shown in Table 1, the enhanced momentum contrast model can stably improve the quality of sentence vector representation: compared with the original BERT-base baseline, the score on the semantic textual similarity evaluation task improves by 33.93%, showing a remarkable improvement over BERT in sentence vector representation. Compared with current state-of-the-art models such as SimCSE, the score on STS-B improves by 0.74%, on STS15 by 1.02%, and on STS12 by 1.16%. Similarly, on the BERT-large reference model, our method outperforms the current state-of-the-art model ConSERT on all 7 STS datasets, yielding an average performance improvement of 2.4% (from 75.05% to 77.45%).
However, on the SICK-R dataset the model is less effective, trailing SimCSE by more than two points. This is probably because SICK-R contains more specialized, mutually independent content: when a large number of negative samples are collected they become overly scattered, and computing the loss against both the in-batch negatives and all the queued negative examples reduces the effect.
TABLE 1 evaluation results of sentence vector characterization on STS task under unsupervised settings
5. Ablation experiments.
5.1. Evaluation of the Dual-Negative loss.
We first evaluate the influence of the Dual-Negative loss weight α. Table 2 verifies the importance of the proposed Dual-Negative loss in combining the momentum loss with the in-batch loss, showing that it can significantly improve the quality of sentence vector characterization.
TABLE 2 Spearman correlation scores on the STS tasks with different Dual-Negative loss weight coefficients α
α 0 0.1 0.2 0.3 0.4 0.5
score 79.04 80.12 81.03 81.54 80.87 80.81
5.2. Comparative exploration of data enhancement.
Four strategies, namely SimCSE Drop, Cutoff, Word Repetition and Shuffle Tokens, were used as Data Enhancement 1 and Data Enhancement 2, producing 4 × 4 = 16 combinations in total, which yields the heat map shown in FIG. 2. It can be seen that Shuffle Tokens + SimCSE Drop obtains the best score; we believe that such implicit random strategies are better at improving contrastive learning. Combinations that include Word Repetition perform poorly overall, probably because that strategy directly modifies the sentence so that it deviates from its original meaning, which degrades the effect.
5.3. The optimal queue length K.
K denotes the length of the negative sample queue Q, i.e., the number of negative examples involved in the calculation for each batch. For contrastive learning, a larger value is theoretically better, but in practical tests we found that K has an upper limit, and this upper limit changes with the size of the training set. For example, when the training set contains 100k sentences, the STS task score initially increases with K and is positively correlated with it; when K = 2300 the score essentially reaches a bottleneck, and increasing K further does not improve it. We call this value Best-K, i.e., the K value at which the STS score reaches its optimum for a given training set size.
To verify the relationship between Best-K and the training set size, we conducted a comparative experiment (see Table 3). As the training set grows, Best-K also increases; the two are positively correlated. We believe that when the training set is small, an overly large queue size K matches each batch against an excessive number of negatives; when K approaches the size of the training set, each batch is effectively softmax-computed against the whole training set, which reduces effectiveness or even makes training unrecoverable. The choice of Best-K therefore needs to be adjusted dynamically according to the training set size.
TABLE 3 The K value that obtains the best Spearman correlation score on the STS task under different training set sizes (the search step size takes three levels: 10, 100, 1000)
Len(dataset) 1k 10k 100k 500k 1000k
Best-K 30 160 2300 4100 10200
The present embodiment is only illustrative and not restrictive. Those skilled in the art can modify the embodiment as required, without inventive contribution, after reading this specification, but such modifications are protected by patent law only within the scope of the claims of the present invention.

Claims (3)

1. A sentence vector characterization method based on enhanced momentum contrast learning, characterized in that the method specifically comprises the following steps:
S1: performing data enhancement on the input sample by using a data enhancement module to generate two groups of word embedding representations;
the specific mode of data enhancement is one or two of SimCSE Drop, Cutoff, Word Repetition and Shuffle Tokens;
S2: the two groups of word embeddings generated in S1 are sent to encoder E_q and encoder E_k respectively, obtaining sentence vector r_q and sentence vector r_k respectively;
S3: the sentence vector r_k obtained in S2 is pushed into queue Q; the queue Q is a first-in first-out queue;
S4: the operations of S1-S3 are repeated a number of times, so that the queue Q is continuously updated and maintains a fixed number of negative samples;
S5: the contrastive loss is calculated between the sentence vectors generated in the final batch and the sentence vectors in the queue Q.
2. The sentence vector characterization method based on enhanced momentum contrast learning according to claim 1, wherein the parameters of the encoder E_k are synchronized by the momentum update formula
θ_k ← m · θ_k + (1 − m) · θ_q,
where θ_k and θ_q denote the parameters of E_k and E_q respectively, and m ∈ (0, 1).
3. The sentence vector characterization method based on enhanced momentum contrast learning according to claim 1, wherein the steps of calculating the contrastive loss in S5 are:
S5-1: calculating the similarity between the two sentence vectors generated from the same input sample after data enhancement, using the formula
sim(r_q, r_k) = (r_q · r_k) / (||r_q|| · ||r_k||);
S5-2: performing the loss calculation according to the formulas
l_moco^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / ( exp(sim(r_q^i, r_k^i)/τ) + Σ_{j=1..K} exp(sim(r_q^i, q_j)/τ) ) ],
l_batch^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / Σ_{j=1..N} exp(sim(r_q^i, r_k^j)/τ) ],
L_moco = (1/N) Σ_{i=1..N} l_moco^i,  L_batch = (1/N) Σ_{i=1..N} l_batch^i,
wherein N is the size of a batch, K is the length of the queue Q, q_j is the j-th sentence vector stored in Q, sim(·,·) is the similarity score of two sentence vectors, and τ is the temperature of the softmax classification;
S5-3: calculating the final contrastive loss by the formula
L = L_moco + α · L_batch,
wherein α is a hyperparameter.
CN202211105642.9A 2022-09-07 2022-09-07 Sentence vector characterization method based on enhanced momentum contrast learning Pending CN115640799A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211105642.9A CN115640799A (en) 2022-09-07 2022-09-07 Sentence vector characterization method based on enhanced momentum contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211105642.9A CN115640799A (en) 2022-09-07 2022-09-07 Sentence vector characterization method based on enhanced momentum contrast learning

Publications (1)

Publication Number Publication Date
CN115640799A true CN115640799A (en) 2023-01-24

Family

ID=84942074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211105642.9A Pending CN115640799A (en) 2022-09-07 2022-09-07 Sentence vector characterization method based on enhanced momentum contrast learning

Country Status (1)

Country Link
CN (1) CN115640799A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116069903A (en) * 2023-03-02 2023-05-05 特斯联科技集团有限公司 Class search method, system, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination