CN115640799A - Sentence vector characterization method based on enhanced momentum contrast learning - Google Patents

Sentence vector characterization method based on enhanced momentum contrast learning

Info

Publication number
CN115640799A
Authority
CN
China
Prior art keywords
queue
sentence
sentence vector
encoder
data enhancement
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211105642.9A
Other languages
Chinese (zh)
Inventor
金日泽
齐士博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Polytechnic University
Original Assignee
Tianjin Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Polytechnic University filed Critical Tianjin Polytechnic University
Priority to CN202211105642.9A priority Critical patent/CN115640799A/en
Publication of CN115640799A publication Critical patent/CN115640799A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a sentence vector characterization method based on enhanced momentum contrast learning, which relates to the technical field of natural language processing. The technical scheme comprises the following steps: S1: perform data enhancement on the input sample with a data enhancement module to generate two groups of word embedding representations; S2: send the two groups of word embeddings generated in S1 to encoder E_q and encoder E_k respectively for encoding, obtaining sentence vector r_q and sentence vector r_k respectively; S3: push the sentence vector r_k obtained in S2 into queue Q, where Q is a first-in first-out queue; S4: repeat the operations of S1-S3 a number of times, so that queue Q is continuously updated and maintains a fixed number of negative samples; S5: calculate the contrastive loss between the sentence vectors generated in the final batch and the sentence vectors in queue Q. The method surpasses the current state-of-the-art results, demonstrating its strong capability in sentence vector characterization.

Description

Sentence vector characterization method based on enhanced momentum contrast learning
Technical Field
The invention relates to the technical field of natural language processing, in particular to a sentence vector characterization method based on enhanced momentum contrast learning.
Background
General-purpose sentence vector representation learning has always been one of the most important lines of work in the field of NLP (natural language processing), providing a basis for downstream tasks such as machine translation, information retrieval, and sentiment analysis. With the rise of BERT, fine-tuning strategies based on pre-trained models have achieved great success on many downstream tasks, but their results for sentence vector representation are not ideal, sometimes not even as good as some basic LSTM structures. It was not until the recent rise of contrastive learning that sentence vector representation improved further; unsupervised and self-supervised models such as SimCSE and ConSERT have obtained state-of-the-art results on many tasks. This learning paradigm does not require large amounts of expensive labeled data and is often more effective in practical applications.
Contrastive learning requires negative samples during training: the more negative samples participate in the calculation in each batch, the greater the discrimination between the positive and negative samples produced during the calculation, and the better the effect. In recent years, contrastive learning methods based on momentum updating have achieved great success; they largely solve the problem that the number of negative samples in contrastive learning is limited by the batch size, so that negative samples can be exploited more fully. However, when such a method is applied directly to the field of natural language processing, it does not reach the expected performance, because the data enhancement is imperfect and the generated negative samples are not fully utilized.
Disclosure of Invention
The invention aims to provide a sentence vector characterization method based on enhanced momentum contrast learning, which exceeds the current optimal results and demonstrates the outstanding capability of the method in sentence vector characterization.
The technical purpose of the invention is realized by the following technical scheme: the sentence vector characterization method for enhanced momentum contrast learning specifically comprises the following steps:
S1: performing data enhancement on the input sample by using a data enhancement module to generate two groups of word embedding representations;
the specific mode of data enhancement is one or two of SimCSE Drop, Cutoff, Word Repetition and Shuffle Tokens;
S2: the two groups of word embeddings generated in S1 are sent to encoder E_q and encoder E_k respectively for encoding, obtaining sentence vector r_q and sentence vector r_k respectively;
S3: the sentence vector r_k obtained in S2 is pushed into queue Q; the queue Q is a first-in first-out queue;
S4: the operations of S1-S3 are repeated a number of times, so that the queue Q is continuously updated and maintains a fixed number of negative samples;
S5: the contrastive loss is calculated between the sentence vectors generated in the final batch and the sentence vectors in the queue Q.
Further, the parameters of the encoder E_k are synchronized by the momentum update formula
θ_k ← m · θ_k + (1 − m) · θ_q,
where θ_k and θ_q denote the parameters of E_k and E_q respectively, and m ∈ (0, 1).
Further, the steps of calculating the contrastive loss in S5 are:
S5-1: calculate the similarity between the two sentence vectors generated from the same input sample after data enhancement, using the formula
sim(r_q, r_k) = (r_q · r_k) / (||r_q|| · ||r_k||);
S5-2: perform the loss calculation according to the formulas
l_moco^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / ( exp(sim(r_q^i, r_k^i)/τ) + Σ_{j=1..K} exp(sim(r_q^i, q_j)/τ) ) ],
l_batch^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / Σ_{j=1..N} exp(sim(r_q^i, r_k^j)/τ) ],
L_moco = (1/N) Σ_{i=1..N} l_moco^i,  L_batch = (1/N) Σ_{i=1..N} l_batch^i,
wherein N is the size of a batch, K is the length of the queue Q, q_j is the j-th sentence vector stored in Q, sim(·,·) is the similarity score of two sentence vectors, and τ is the temperature of the softmax classification;
S5-3: calculate the final contrastive loss by the formula
L = L_moco + α · L_batch,
wherein α is a hyperparameter.
In conclusion, the invention has the following beneficial effects:
1. The MoCo method from the computer vision field is applied to the NLP field and adapted to NLP tasks, decoupling the number of negative samples from the hardware resource constraints during training;
2. a data enhancement strategy tailored to momentum contrast learning is adopted, making model training faster and more efficient and feasible even with small samples;
3. a Dual-Negative loss combining the momentum loss and the in-batch loss is applied in contrastive learning, taking both historical samples and current samples into consideration and improving sample utilization.
Drawings
FIG. 1 is a flow chart of a sentence vector characterization method based on enhanced momentum contrast learning according to an embodiment of the present invention;
FIG. 2 is a heat map of the Spearman correlation on the STS tasks for different combinations of data enhancements in an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to FIGS. 1-2.
The embodiment is as follows: the sentence vector characterization method based on enhanced momentum contrast learning, as shown in fig. 1 and fig. 2, specifically includes the following steps:
S1: performing data enhancement on the input sample by using a data enhancement module to generate two groups of word embedding representations;
the specific mode of data enhancement is one or two of SimCSE Drop, Cutoff, Word Repetition and Shuffle Tokens;
For SimCSE Drop, the same input is passed through the pre-trained encoder twice; the two random dropout passes yield two word embeddings of different forms that constitute a positive pair. In this embodiment, the standard dropout layers inside the Transformer are used as the noise source, acting on the attention dropout and the hidden dropout; in implementation this is equivalent to performing no explicit operation on the input, and the enhanced sample is obtained by directly feeding the input into BERT;
For Cutoff, a whole row (token level) or a whole column (feature level) of the word embedding is deleted with a certain probability to realize data enhancement; in practice the dimensionality of the word embedding in the code is not changed, and the deletion is implemented by setting all elements to be deleted to 0;
For Word Repetition, a token in a sentence (except CLS and SEP) is randomly repeated with a certain probability; compared with word deletion, Word Repetition better preserves the original meaning, and compared with synonym replacement it is simpler to implement;
For Shuffle Tokens, the word embeddings are randomly scrambled; in implementation, the order of the word embeddings themselves is not changed, but the position ids that represent the word embedding order are shuffled.
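As an illustration of the last two strategies, the following Python sketch shows one possible way to realize Cutoff and Shuffle Tokens on a batch of BERT inputs. The function names, the cut_prob parameter, and the exact masking scheme are assumptions made for this sketch, not the code of the embodiment.

```python
import torch

def token_cutoff(embeddings: torch.Tensor, cut_prob: float = 0.1) -> torch.Tensor:
    """Token-level Cutoff: zero out whole rows of the word-embedding matrix.

    embeddings: (batch, seq_len, hidden). The dimensionality is left unchanged;
    "deleted" tokens are simply set to 0, as described above.
    """
    batch, seq_len, _ = embeddings.shape
    keep_mask = (torch.rand(batch, seq_len, device=embeddings.device) > cut_prob)
    return embeddings * keep_mask.unsqueeze(-1).to(embeddings.dtype)

def shuffle_tokens(input_ids: torch.Tensor) -> torch.Tensor:
    """Shuffle Tokens: keep the word embeddings in place but scramble the
    position_ids that encode their order."""
    batch, seq_len = input_ids.shape
    position_ids = torch.stack(
        [torch.randperm(seq_len) for _ in range(batch)]
    ).to(input_ids.device)
    return position_ids  # pass as `position_ids=` to a BERT-style encoder
```

In use, the shuffled position_ids would be fed to the encoder together with the unchanged input_ids, matching the description above.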
S2: the two groups of word embeddings generated in S1 are sent to encoder E_q and encoder E_k respectively for encoding, obtaining sentence vector r_q and sentence vector r_k respectively;
In the present embodiment, the encoder E_q and the encoder E_k are both based on BERT, and a pooling layer is added on top of them to facilitate sentence vector generation.
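A minimal sketch of such a BERT-based encoder with a pooling layer on top might look as follows; the use of mean pooling and L2 normalization here is an assumption for illustration, since the embodiment does not specify the pooling type.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SentenceEncoder(nn.Module):
    """BERT backbone plus a pooling layer that turns token states into one sentence vector."""

    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask, position_ids=None):
        out = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask,
                        position_ids=position_ids)
        hidden = out.last_hidden_state                  # (batch, seq_len, hidden)
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(1) / mask.sum(1)   # mean pooling over real tokens
        return nn.functional.normalize(pooled, dim=-1)  # unit-length sentence vector
```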
S3: the sentence vector r_k obtained in S2 is pushed into queue Q; the queue Q is a first-in first-out queue;
S4: the operations of S1-S3 are repeated a number of times, so that the queue Q is continuously updated and maintains a fixed number of negative samples;
S5: the contrastive loss is calculated between the sentence vectors generated in the final batch and the sentence vectors in the queue Q.
The encoder E_k is not updated by back-propagation of the loss function. The negative examples in Q come from different batches; if E_k were updated at every step, the samples pushed in by different batches would differ greatly, the representation would jump around too much, and training would be difficult to converge. Moreover, back-propagating through E_k would consume a great deal of resources because the negative sample queue Q is usually very large. Therefore, in this embodiment, the parameters of the encoder E_k are synchronized by the momentum update formula
θ_k ← m · θ_k + (1 − m) · θ_q,
where θ_q denotes the parameters of the encoder E_q. In this embodiment m takes the value 0.999, which ensures that E_k changes only very slightly after each batch and evolves continuously and smoothly throughout training, so that the sentence vectors r_k from different batches can be regarded as produced by E_k under essentially the same parameters, and the consistency across the whole queue Q remains high.
In this embodiment, in order to make better use of all the sentence vector representations generated within a batch, a Dual-Negative loss is proposed. At the same time, the enqueue logic of the original MoCo is changed: raising the priority of the enqueue operation ensures that the positive and negative samples currently participating in the calculation are generated in the same batch, which makes the subsequent calculation more consistent.
The steps of calculating the contrastive loss in S5 are as follows:
S5-1: calculate the similarity between the two sentence vectors generated from the same input sample after data enhancement, using the formula
sim(r_q, r_k) = (r_q · r_k) / (||r_q|| · ||r_k||);
S5-2: perform the loss calculation according to the formulas
l_moco^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / ( exp(sim(r_q^i, r_k^i)/τ) + Σ_{j=1..K} exp(sim(r_q^i, q_j)/τ) ) ],
l_batch^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / Σ_{j=1..N} exp(sim(r_q^i, r_k^j)/τ) ],
L_moco = (1/N) Σ_{i=1..N} l_moco^i,  L_batch = (1/N) Σ_{i=1..N} l_batch^i,
wherein N is the size of a batch, K is the length of the queue Q, q_j is the j-th sentence vector stored in Q, sim(·,·) is the similarity score of two sentence vectors, and τ is the temperature of the softmax classification;
s5-3: in fact, the loss is equivalent to an N-way softmax classifier, and the final contrast loss is calculated by the formula:
Figure BDA0003836123870000065
where α is a hyperparameter, and α =0.3 in this embodiment.
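Based on the loss reconstruction above, a sketch of the Dual-Negative loss (queue-based momentum loss plus the α-weighted in-batch loss) could look as follows. The exact form of the combination is an assumption derived from the description, with τ = 0.02 and α = 0.3 as stated in this embodiment.

```python
import torch
import torch.nn.functional as F

def dual_negative_loss(r_q, r_k, queue, tau: float = 0.02, alpha: float = 0.3):
    """r_q, r_k: (N, d) L2-normalized sentence vectors from E_q / E_k.
    queue: (K, d) momentum-encoded negatives from previous batches."""
    # Momentum (queue) loss: the positive is the matching r_k, negatives come from Q.
    pos = (r_q * r_k).sum(-1, keepdim=True)                 # (N, 1) positive scores
    logits_queue = torch.cat([pos, r_q @ queue.t()], dim=1) / tau   # (N, 1 + K)
    labels_queue = torch.zeros(r_q.size(0), dtype=torch.long, device=r_q.device)
    loss_moco = F.cross_entropy(logits_queue, labels_queue)

    # In-batch loss: each r_q is classified against all N r_k of the batch (N-way softmax).
    logits_batch = (r_q @ r_k.t()) / tau                    # (N, N)
    labels_batch = torch.arange(r_q.size(0), device=r_q.device)
    loss_batch = F.cross_entropy(logits_batch, labels_batch)

    return loss_moco + alpha * loss_batch
```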
The following are the specific experimental procedures and results of the above technical scheme:
1. Establishing the dataset.
In this experiment, the proposed method was trained and evaluated on an English Wikipedia dataset and on semantic textual similarity (STS) datasets. The semantic textual similarity datasets comprise STS 2012-2016 (STS12-STS16), STS Benchmark (STS-B), and SICK-Relatedness (SICK-R). Consistent with SimCSE, the Wikipedia dataset containing 1 million randomly sampled unlabeled sentences was used. The semantic textual similarity datasets consist of sentence pairs, each annotated with a semantic similarity score between 0 and 5. Note that the STS datasets are not used as training sets in the experiment; they are used for evaluation only, and Spearman correlation coefficients are used to measure the agreement between the predicted results and the gold labels.
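Since evaluation relies on the Spearman correlation between predicted similarities and the annotated 0-5 scores, a small sketch of that computation is given below, assuming cosine similarity between the two sentence vectors of each STS pair and using SciPy's spearmanr.

```python
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(vecs_a: np.ndarray, vecs_b: np.ndarray, gold_scores: np.ndarray) -> float:
    """vecs_a, vecs_b: (num_pairs, d) sentence vectors; gold_scores: human 0-5 ratings."""
    cos = (vecs_a * vecs_b).sum(-1) / (
        np.linalg.norm(vecs_a, axis=-1) * np.linalg.norm(vecs_b, axis=-1))
    corr, _ = spearmanr(cos, gold_scores)   # correlation between predictions and gold labels
    return corr
```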
2. Establishing the baselines.
GloVe, IS-BERT, BERT-CT, BERT-flow, CLEAR, SimCSE and ConSERT were selected as comparison baselines in this experiment because they have all been tested on the STS datasets and report very similar metrics, namely the quality of the generated sentence vectors, which makes them highly relevant to our method.
3. Hyperparameter settings.
The encoders E_q and E_k are both initialized with BERT-base-uncased, and a pooling layer is added on top of the encoders to generate the final sentence vector representation. In the experiments, the length of the negative example queue Q is 10240; τ, the softmax temperature that controls the smoothness of the model's predictions, is set to 0.02; the momentum update factor m is set to 0.999; the learning rate is 1e-5; the maximum sequence length of the input is 64; and training is carried out for one epoch in an unsupervised manner.
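For reference, the hyperparameters above can be collected into a single configuration object; this is only an illustrative sketch, and the field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    model_name: str = "bert-base-uncased"   # initializes both E_q and E_k
    queue_size: int = 10240                 # length of the negative example queue Q
    temperature: float = 0.02               # tau for the softmax
    momentum: float = 0.999                 # m in the momentum update
    learning_rate: float = 1e-5
    max_seq_length: int = 64
    epochs: int = 1                         # one unsupervised training round
    alpha: float = 0.3                      # Dual-Negative loss weight
```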
4. Main results.
In order to evaluate the sentence vector representations generated by the model under unsupervised learning, the English Wikipedia dataset is used for training, and evaluation experiments are carried out on the 7 semantic textual similarity (STS) datasets.
Note that ConSERT was originally trained on a random mix of unlabeled STS datasets; to make the comparison fairer, ConSERT was retrained on the English Wikipedia dataset and likewise tested on the STS datasets. As shown in Table 1, the enhanced momentum contrast model can stably improve the quality of sentence vector representation: compared with the original BERT-base baseline, the score on the semantic textual similarity evaluation task improves by 33.93%, showing a remarkable improvement over BERT in sentence vector representation. Compared with current state-of-the-art models such as SimCSE, the score on STS-B improves by 0.74%, on STS15 by 1.02%, and on STS12 by 1.16%. Similarly, on the BERT-large reference model, our method outperforms the current state-of-the-art model ConSERT on all 7 STS datasets, yielding an average performance improvement of 2.4% (from 75.05% to 77.45%).
However, on the SICK-R dataset the model is less effective, trailing SimCSE by more than two points. This is probably because SICK-R contains more specialized, mutually independent content: when a large number of negative samples are collected they become overly scattered, and computing the loss against both the in-batch negatives and all the queued negative examples reduces the effect.
TABLE 1 evaluation results of sentence vector characterization on STS task under unsupervised settings
5. Ablation experiments.
5.1. Evaluation of the Dual-Negative loss.
We first evaluate the influence of the Dual-Negative loss weight α. Table 2 verifies the importance of the proposed Dual-Negative loss in combining the momentum loss with the in-batch loss, showing that it can significantly improve the quality of sentence vector characterization.
TABLE 2 Spearman correlation scores on the STS tasks with different Dual-Negative loss weight coefficients α
α 0 0.1 0.2 0.3 0.4 0.5
score 79.04 80.12 81.03 81.54 80.87 80.81
5.2. Comparative exploration of data enhancement.
Four strategies, namely SimCSE Drop, Cutoff, Word Repetition and Shuffle Tokens, were used as Data Enhancement 1 and Data Enhancement 2, producing 4 × 4 = 16 combinations in total, which yields the heat map shown in FIG. 2. It can be seen that Shuffle Tokens + SimCSE Drop obtains the best score; we believe that such implicit random strategies are better at improving contrastive learning. Combinations that include Word Repetition perform poorly overall, probably because that strategy directly modifies the sentence so that it deviates from its original meaning, which degrades the effect.
5.3. The optimal queue length K.
K denotes the length of the negative sample queue Q, i.e., the number of negative examples involved in the calculation for each batch. For contrastive learning, a larger value is theoretically better, but in practical tests we found that K has an upper limit, and this upper limit changes with the size of the training set. For example, when the training set contains 100k sentences, the STS task score initially increases with K and is positively correlated with it; when K = 2300 the score essentially reaches a bottleneck, and increasing K further does not improve it. We call this value Best-K, i.e., the K value at which the STS score reaches its optimum for a given training set size.
To verify the relationship between Best-K and the training set size, we conducted a comparative experiment (see Table 3). As the training set grows, Best-K also increases; the two are positively correlated. We believe that when the training set is small, an overly large queue size K matches each batch against an excessive number of negatives; when K approaches the size of the training set, each batch is effectively softmax-computed against the whole training set, which reduces effectiveness or even makes training unrecoverable. The choice of Best-K therefore needs to be adjusted dynamically according to the training set size.
TABLE 3 The K value that obtains the best Spearman correlation score on the STS task under different training set sizes (the search step size takes three levels: 10, 100, 1000)
Len(dataset) 1k 10k 100k 500k 1000k
Best-K 30 160 2300 4100 10200
The present embodiment is only illustrative and not restrictive. Those skilled in the art can modify the embodiment as required, without inventive contribution, after reading this specification, but such modifications are protected by patent law only within the scope of the claims of the present invention.

Claims (3)

1. A sentence vector characterization method based on enhanced momentum contrast learning, characterized in that the method specifically comprises the following steps:
S1: performing data enhancement on the input sample by using a data enhancement module to generate two groups of word embedding representations;
the specific mode of data enhancement is one or two of SimCSE Drop, Cutoff, Word Repetition and Shuffle Tokens;
S2: the two groups of word embeddings generated in S1 are sent to encoder E_q and encoder E_k respectively, obtaining sentence vector r_q and sentence vector r_k respectively;
S3: the sentence vector r_k obtained in S2 is pushed into queue Q; the queue Q is a first-in first-out queue;
S4: the operations of S1-S3 are repeated a number of times, so that the queue Q is continuously updated and maintains a fixed number of negative samples;
S5: the contrastive loss is calculated between the sentence vectors generated in the final batch and the sentence vectors in the queue Q.
2. The sentence vector characterization method based on enhanced momentum contrast learning according to claim 1, wherein the parameters of the encoder E_k are synchronized by the momentum update formula
θ_k ← m · θ_k + (1 − m) · θ_q,
where θ_k and θ_q denote the parameters of E_k and E_q respectively, and m ∈ (0, 1).
3. The sentence vector characterization method based on enhanced momentum contrast learning according to claim 1, wherein the steps of calculating the contrastive loss in S5 are:
S5-1: calculating the similarity between the two sentence vectors generated from the same input sample after data enhancement, using the formula
sim(r_q, r_k) = (r_q · r_k) / (||r_q|| · ||r_k||);
S5-2: performing the loss calculation according to the formulas
l_moco^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / ( exp(sim(r_q^i, r_k^i)/τ) + Σ_{j=1..K} exp(sim(r_q^i, q_j)/τ) ) ],
l_batch^i = −log [ exp(sim(r_q^i, r_k^i)/τ) / Σ_{j=1..N} exp(sim(r_q^i, r_k^j)/τ) ],
L_moco = (1/N) Σ_{i=1..N} l_moco^i,  L_batch = (1/N) Σ_{i=1..N} l_batch^i,
wherein N is the size of a batch, K is the length of the queue Q, q_j is the j-th sentence vector stored in Q, sim(·,·) is the similarity score of two sentence vectors, and τ is the temperature of the softmax classification;
S5-3: calculating the final contrastive loss by the formula
L = L_moco + α · L_batch,
wherein α is a hyperparameter.
CN202211105642.9A 2022-09-07 2022-09-07 Sentence vector characterization method based on enhanced momentum contrast learning Pending CN115640799A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211105642.9A CN115640799A (en) 2022-09-07 2022-09-07 Sentence vector characterization method based on enhanced momentum contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211105642.9A CN115640799A (en) 2022-09-07 2022-09-07 Sentence vector characterization method based on enhanced momentum contrast learning

Publications (1)

Publication Number Publication Date
CN115640799A true CN115640799A (en) 2023-01-24

Family

ID=84942074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211105642.9A Pending CN115640799A (en) 2022-09-07 2022-09-07 Sentence vector characterization method based on enhanced momentum contrast learning

Country Status (1)

Country Link
CN (1) CN115640799A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116069903A (en) * 2023-03-02 2023-05-05 特斯联科技集团有限公司 Class search method, system, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination