CN117252195A - Natural language processing method - Google Patents

Info

Publication number: CN117252195A
Application number: CN202310917948.2A
Authority: CN (China)
Prior art keywords: attention, model, natural language, gating, input
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 彭鑫 (Peng Xin), 王卫红 (Wang Weihong)
Current assignee: Zhejiang University of Technology (ZJUT)
Original assignee: Zhejiang University of Technology (ZJUT)
Application filed by Zhejiang University of Technology (ZJUT)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/0464 - Convolutional networks [CNN, ConvNet]
    • G06N3/08 - Learning methods
    • G06N3/088 - Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a natural language processing method, which relates to the technical field of natural language processing and comprises the following steps: S1: optimizing the gated attention unit to generate a gated residual attention unit; S11: optimizing the network model; S12: combining an attention residual structure with the gated attention unit and adapting the two to each other; S2: generating a general unsupervised optimization algorithm for transformer models based on the continuous bag-of-words model, which alleviates the poor performance of the gated residual attention unit when a natural language template is used in prompt learning.

Description

Natural language processing method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a natural language processing method.
Background
Since the advent of computer technology, enabling machines to converse with humans has been the pursuit of many researchers. The essence of the Turing test is nothing more than whether a machine can simulate a conversation between people, so enabling machines to acquire human language ability has always been a constant pursuit in the direction of artificial intelligence. Since the last century, as theory has evolved and technology has continued to accumulate [1], and as the two have become ever more tightly combined, machines have passed the Turing test of the last century, and natural language models that can effectively converse with humans have now emerged, which naturally marks a new era in the development of artificial intelligence. When such language processing problems are addressed at home and abroad, a teacher model is often pre-trained with text related to the relevant field, and the network scale is then reduced by pruning, distillation and other methods, so as to realize the function of the specific field. Taking the p-tuning method as an example, word embeddings initialized by an LSTM are created and combined with the word-embedded input tensor as the input of the model. During training, the backbone of the model is usually frozen. The self-learning input is optimized with the training data, and through training the self-learning template can perform the same tasks as a natural language template, and can even achieve better performance under certain conditions. However, this method has a disadvantage: it cannot achieve results in the zero-sample case the way natural language templates can. Furthermore, when used alone, this approach requires either sufficient data to approximate the fine-tuning approach, or additional MLM tasks to optimize the model. While the restriction to natural language word embeddings is removed, the knowledge carried in the natural language template is lost, and the template has to be obtained again through self-learning. In order to improve the performance of the model, a natural language processing method is provided.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a natural language processing method, which solves the problems in the background art.
In order to achieve the above purpose, the invention is realized by the following technical scheme: a natural language processing method, comprising the steps of:
S1: optimizing the gated attention unit to generate a gated residual attention unit;
S11: optimizing the network model;
S12: combining an attention residual structure with the gated attention unit, and adapting the two to each other;
S2: generating a general unsupervised optimization algorithm for transformer models based on the continuous bag-of-words (CBOW) model, which alleviates the poor performance of the gated residual attention unit when a natural language template is used in prompt learning.
Optionally, in the step of optimizing the network model in S11, input in the network model of the gated residual attention unit algorithm is Input into the model, and is Input into natural language with a length of n, the whole network structure integrally uses a gated linear unit structure to perform three linear transformations to obtain a matrix U for self-gating, a processing object V for attention weight, and a matrix Z for self-attention calculation, where the matrix V is a center of the model, the Input is subjected to linear transformation and dimension-increasing operation by linear transformation operation, the matrix Z is subjected to dimension-reducing operation, and self-attention calculation is performed by using Z lower than the Input dimension, and the matrices U, V and Z all use an activation function swish.
Optionally, the step S12 combines the attention residual structure and the gated attention unit, and the gated attention unit in the step of adapting the two is substantially consistent with the transducer model in the processing of the data. Unlike the self-attention-plus-feedforward neural network structure in the traditional transducer model, the gated attention unit combines the self-attention mechanism with the gated linear unit, expressed by the formula:
O = (U ⊙ AV)W + b ∈ R^(n×e)    (1)
The above equation combines a gated linear unit with a self-attention mechanism, and its input is similar to that of the transformer model. The natural language input of the network can be expressed as X = (x_1, x_2, …, x_(n-1), x_n). Natural language processing typically learns a word-embedded representation, converting the text input into the distributed representation mentioned above, which can be expressed as E = (e_1, e_2, …, e_(n-1), e_n) ∈ R^(n×e).
The input E is an n×e matrix. The gated attention unit applies three linear transformations to the input, followed by the activation function swish, to obtain the matrices U, V and Z. The calculation process is as follows:
U = swish(EW_u + b) ∈ R^(n×h)    (2)
V = swish(EW_v + b) ∈ R^(n×h)    (3)
Z = swish(EW_z + b) ∈ R^(n×d)    (4)
The dimensions of U and V are the same. Since the gated linear unit plays the larger role in this model, the dimension h of U and V is larger than the word-embedding dimension e. Z is used to compute the self-attention matrix of the model; because the introduction of self-gating in the gated linear unit weakens the attention mechanism, the dimension d of the matrix Z used to compute the attention matrix A is smaller than the word-embedding dimension e. After the linear transformation and the activation function, the dimension of E changes from e to 2h+d, and a block (split) operation is then performed to simplify the computation. Z serves as the calculation source of the self-attention module; Q and K are obtained from it by linear transformations, calculated as follows:
Q = ZW_q + b ∈ R^(n×d)    (5)
K = ZW_k + b ∈ R^(n×d)    (6)
Q and K are used as inputs to calculate the attention score. Under the condition that the model size is similar to the BASE version of BERT, the proposed model has 24 layers in total; n is the number of layers corresponding to the attention score and is used to normalize it.
The attention-score result is passed through the activation function relu and then squared, finally yielding an attention matrix A ∈ R^(n×n). The matrix A is multiplied by V to obtain a matrix V that achieves global attention through the self-attention matrix A, and these quantities are then substituted into equation (1).
Optionally, the step S12 combines the attention residual structure and the gated attention unit, and adapts the two, by introducing the attention residual in the result of the attention calculation score, the attention can be strengthened with less occupied memory space, and the attention residual can make the attention moment array converge fast, so that the model is fast and stable, and the regular effect is obtained, and the most direct implementation method is as follows:
in the BASE version model, n is 24, and the square operation is performed after the relu operation is performed on the residual attention in the above formula, and the calculation formula and experiment prove that A n The change speed of the numerical value of (a) is increased along with the increase of the layer number, and a certain influence is caused on the back propagation of the model, therefore, the above formula is required to be optimized, the normalization operation is carried out on the accumulated sum of attention, the operation of squaring after the activation function relu in the formula is replaced by the self-normalized softmax function, and meanwhile, the hyper-parameter n ensuring the numerical value is removed, and then the above formula is changed into the following form:
to prevent A from advancing regularization, the momentum thought is adopted to multiply the realized part of attention residual error by the super-parameterThe stabilization of A is realized at the later stage of the model, the early appearance of gradient disappearance is avoided, and in the BASE version of 24 layers, the specific realization formula is as follows:
Optionally, in the step S2 of alleviating the poor performance of the gated residual attention unit when a natural language template is used in prompt learning, an optimization algorithm based on the continuous bag-of-words model is proposed. The network structure of the algorithm comprises a model trunk task and a model branch task; in this network structure, the model trunk obtains the loss1 part through calculation along the path from E through H_1, H_2, …, H_(n-1), H_n, where E is the initial distributed representation obtained by passing the input through the word-embedding module, and H_1 to H_n are the components of the model trunk.
Optionally, the step S2 of generating a general unsupervised optimization algorithm for a transformer model based on a continuous word bag model, optimizing the problem that the gating residual attention unit performs poorly when using a natural language template in prompt learning, wherein the algorithm selects the output of the lower layer of the model trunk according to a super parameter M, marks the output with the length of N and the dimension of e of the embedded layer as V, regards the output as a distributed expression of a word vector with a phrase-level feature, uses a method consistent with a positive sampling method in the continuous word bag model, selects tensors in V as central words one by one, selects the central words from input, determines peripheral words according to the set super parameter N, creates peripheral words according to different central words, takes the central words and the number of 2N peripheral word dot products, obtains an optimal result when N takes the value of 4, uses an activation function log mod to adjust the loss optimization direction after summing, and finally calculates the task loss of the optimization algorithm, and the specific calculation formula is as follows:
In this loss, V_i is the distributed representation of the i-th word in the input V, and near(V_i) selects the adjacent words according to the input center word; the activation function logsigmoid keeps the optimization direction consistent with the model trunk, and taking the average ensures that the sentence length does not influence the similarity measure. The proposed continuous bag-of-words optimization algorithm cooperates with the model trunk task, and the overall loss of the two cooperative tasks is total_loss; the model is optimized with the loss functions of both tasks: total_loss = loss1 + loss2. The loss value of the model trunk is calculated in different ways depending on the trunk model and the downstream task. To ensure that loss1 and loss2 are treated as equally important, the initial values of the loss functions are used as weights in the final loss function: the actual value of each loss is divided by its initial value, which ensures that the model treats the tasks as equally important during optimization, so as to approach Pareto optimality.
The invention provides a natural language processing method, which has the following beneficial effects:
1. For the situation in which the attention of the gated attention unit structure is too weak, the natural language processing method provides an attention residual optimization method and improves both the gated attention unit and the attention residual so that the two are effectively combined, which improves model performance.
2. The natural language processing method provides a general unsupervised optimization algorithm for transformer models based on the continuous bag-of-words model. By optimizing the low-level representation of the transformer model, the optimization algorithm strengthens the learning of phrase-level structural information by low-level attention and, on this basis, optimizes the high-level information representation. Meanwhile, the additional loss and gradient introduced by the optimization algorithm make the model converge better.
Drawings
FIG. 1 is a network architecture diagram of a gated residual attention unit of the present invention;
FIG. 2 is a graph comparing the performance of the model on a pre-training task after using the attention residual in the present invention;
FIG. 3 is a network structure diagram of the continuous bag-of-words model optimization algorithm of the invention;
fig. 4 is a step diagram of the present invention.
Detailed Description
The embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings; it is apparent that the described embodiments are only some, and not all, of the embodiments of the present invention.
Referring to fig. 1 to 3, the present invention provides a technical solution: a natural language processing method, comprising the steps of:
S1: optimizing the gated attention unit to generate a gated residual attention unit;
S11: optimizing the network model;
The Input of the network model of the gated residual attention unit algorithm is a natural language input of length n. The whole network structure uses a gated linear unit structure and performs three linear transformations to obtain a matrix U for self-gating, a matrix V that is the object processed by the attention weights, and a matrix Z for the self-attention calculation. The matrix V is the core of the model: it is obtained from the Input by a dimension-raising linear transformation, so that the model can fit more natural language knowledge. The matrix Z is a dimension-reduced input; performing the self-attention calculation with Z, whose dimension is lower than that of the Input, keeps the cost of the whole model low. The matrix U realizes the self-gating mechanism of the model; it likewise goes through a dimension-raising linear transformation so that the model can fit more natural language knowledge and achieve a better self-gating effect. The matrices U, V and Z all use the activation function swish, which makes the model smoother and helps prevent the gradient from saturating and training from slowing down. The handling of the input text is essentially the same as in BERT: the input of the model is a text X of length n, X = {x_1, x_2, …, x_(n-1), x_n}; X is decomposed by the tokenizer into a token sequence of length n, and the tokens are converted one by one into the character codes given by the vocabulary of the tokenizer. The next step is word embedding: the input, now an array of indices, is converted into a distributed representation through a matrix of the same size as the vocabulary, yielding an n×e word-embedding representation; relative position encoding of the text is added at the same time, so that the model can learn the relative positional information of the language. Through the self-attention mechanism, the model can learn the relations among words over the whole input, and through the stacking of encoders the word meanings and sentence meanings in the input are learned;
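As a concrete illustration of the projection step just described, the following is a minimal PyTorch sketch (an assumption for illustration, not the patented implementation): the fused 2h+d linear transform, the swish activation and the dimensions e=768, h=1536, d=128 follow the text, while the class name and batching are invented for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedProjections(nn.Module):
    """Produces the self-gating matrix U, the value matrix V and the low-dimensional
    attention source Z from the word embeddings E."""
    def __init__(self, e=768, h=1536, d=128):
        super().__init__()
        self.h, self.d = h, d
        self.proj = nn.Linear(e, 2 * h + d)        # one fused transform, split into U, V, Z afterwards

    def forward(self, E):                          # E: (batch, n, e) word embeddings
        UVZ = F.silu(self.proj(E))                 # swish/SiLU activation after the linear transform
        U, V, Z = torch.split(UVZ, [self.h, self.h, self.d], dim=-1)
        return U, V, Z                             # U, V: (batch, n, h); Z: (batch, n, d)
```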
S12: combining an attention residual structure with a gated attention unit, and adapting the attention residual structure and the gated attention unit;
The gated attention unit processes data in substantially the same way as the transformer model. Unlike the self-attention-plus-feed-forward structure of the traditional transformer model, the gated attention unit combines the self-attention mechanism with the gated linear unit, expressed by the formula:
O = (U ⊙ AV)W + b ∈ R^(n×e)    (1)
The above equation combines a gated linear unit with a self-attention mechanism, and its input is similar to that of the transformer model. The natural language input of the network can be expressed as X = (x_1, x_2, …, x_(n-1), x_n). Natural language processing typically learns a word-embedded representation, converting the text input into the distributed representation mentioned above, which can be expressed as E = (e_1, e_2, …, e_(n-1), e_n) ∈ R^(n×e).
The input E is an n×e matrix. The gated attention unit applies three linear transformations to the input, followed by the activation function swish, to obtain the matrices U, V and Z. The calculation process is as follows:
U = swish(EW_u + b) ∈ R^(n×h)    (2)
V = swish(EW_v + b) ∈ R^(n×h)    (3)
Z = swish(EW_z + b) ∈ R^(n×d)    (4)
The dimensions of U and V are the same. Since the gated linear unit plays the larger role in this model, the dimension h of U and V is larger than the word-embedding dimension e. Z is used to compute the self-attention matrix of the model; because the introduction of self-gating in the gated linear unit weakens the attention mechanism, the dimension d of the matrix Z used to compute the attention matrix A is smaller than the word-embedding dimension e. After the linear transformation and the activation function, the dimension of E changes from e to 2h+d, and a block (split) operation is then performed to simplify the computation. Z serves as the calculation source of the self-attention module; Q and K are obtained from it by linear transformations, calculated as follows:
Q = ZW_q + b ∈ R^(n×d)    (5)
K = ZW_k + b ∈ R^(n×d)    (6)
Q and K are used as inputs to calculate the attention score. Under the condition that the model size is similar to the BASE version of BERT, the proposed model has 24 layers in total; n is the number of layers corresponding to the attention score and is used to normalize it. The attention-score result is passed through the activation function relu and then squared, finally yielding an attention matrix A ∈ R^(n×n). The matrix A is multiplied by V to obtain a matrix V that achieves global attention through the self-attention matrix A, after which these quantities are substituted into equation (1). After the self-gating of the input is realized through the Hadamard product, a linear transformation with a residual connection is carried out, the layer ends, and the next layer is entered;
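A hedged sketch of the attention-and-gating step just described is given below, continuing the projection sketch above; the exact normalization factor of the score is an assumption, since the patent's concrete formula is only given as an image.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttentionCore(nn.Module):
    """Q and K come from the low-dimensional Z, the score passes through relu and is squared,
    the globally attended values A·V are gated by U via a Hadamard product, and a residual
    connection closes the layer."""
    def __init__(self, e=768, h=1536, d=128):
        super().__init__()
        self.to_q = nn.Linear(d, d)
        self.to_k = nn.Linear(d, d)
        self.out = nn.Linear(h, e)                     # the W, b of O = (U ⊙ AV)W + b

    def forward(self, E, U, V, Z):                     # shapes as produced by GatedProjections above
        Q, K = self.to_q(Z), self.to_k(Z)
        scores = Q @ K.transpose(-2, -1) / E.size(1)   # assumed normalization of the raw score
        A = F.relu(scores) ** 2                        # relu followed by squaring
        O = self.out(U * (A @ V))                      # Hadamard self-gating of the attended values
        return E + O                                   # residual connection, then the next layer
```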
Compared with the traditional transformer encoder, this model structure replaces the feed-forward neural network with a gated linear unit and is structurally more complex; at the same time, because the multi-head attention mechanism is omitted, the number of parameters and the amount of computation are relatively smaller.
With 24 layers combined with gated linear units, the gated attention unit model has 96M parameters, while the 12-layer BERT-BASE has 102M parameters, so the parameter count is lower than that of BERT. The transformer encoder implements self-attention first and then adds a feed-forward neural network; the two parts are relatively isolated and have no significant relationship. The gated attention unit structure is different: its self-attention is realized inside the gated linear unit structure as a whole. This structure skillfully combines the gated linear unit with the self-attention structure, with the gated linear unit as the main part and self-attention as the auxiliary part, realizing a special network structure;
In the BASE version of the model, the dimension d of Z is 128, the dimension e of the input E is 768, and the dimension h of U and V is 1536. From the point of view of the vector dimensions, this encoder structure emphasizes the self-gating mechanism of the gated linear unit; attention plays a role that is less important than the gated linear unit but is still indispensable, even though there is only one attention head and its dimension d is only 128. At the same time, the weakening of the self-attention mechanism brings advantages: the model no longer has to bear the waste of multi-head attention, and model optimization on downstream tasks is completed with fewer branches. Besides the smaller memory footprint, another advantage of the reduced number of parameters is a faster running speed; at the same model size, the model using linear gated attention has about twice as many layers as the BERT model;
S2: generating a general unsupervised optimization algorithm for transformer models based on the continuous bag-of-words model, which alleviates the poor performance of the gated residual attention unit when a natural language template is used in prompt learning:
For the attention residual structure, an attention residual is introduced into the result of the attention-score calculation. With only a small additional memory footprint, this strengthens the attention, and the attention residual makes the attention matrix converge faster, so that the model stabilizes quickly and a regularizing effect is obtained. The most direct implementation is as follows:
The attention residual is introduced simply by providing extra space to store the attention score after each layer finishes, and then adding this stored score to the attention score of the next layer. However, this alone does not solve the problem: in the BASE version model n is 24, and the residual attention is passed through relu and then squared; the calculation formula and experiments show that the rate of change of the values of A_n keeps increasing with the number of layers, which affects the back-propagation of the model. The formula therefore needs to be optimized so that the attention score using the attention residual is easy to optimize and does not cause numerical explosion. The simplest approach is to normalize the accumulated sum of attention: the relu-then-square operation is replaced by the self-normalizing softmax function, and the hyper-parameter n that kept the values bounded is removed.
with this approach, the model's performance on the pre-training task is slightly improved, as shown in FIG. 2 below:
The performance of the gated attention unit improves to some extent after using the attention residual, but is still not ideal. The attention scores accumulate rapidly, so A stabilizes quickly and stops changing, and the gradient vanishes in the later layers of the model. To prevent A from regularizing too early, the momentum idea is adopted and the accumulated part of the attention residual is multiplied by a hyper-parameter, so that A stabilizes only in the later stage of the model and gradient vanishing does not appear too early; this is implemented in the 24-layer BASE version.
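The following is a minimal sketch of the attention-residual idea just described; the hyper-parameter name lam, its value and the exact placement of the scaling are assumptions, since the patent gives the concrete formula only as an image.

```python
import torch

def residual_attention(Q, K, prev_scores, lam=0.9):
    """One layer's attention matrix with the accumulated attention residual."""
    scores = Q @ K.transpose(-2, -1)           # raw attention score of the current layer
    mixed = scores + lam * prev_scores         # momentum-weighted accumulated residual
    A = torch.softmax(mixed, dim=-1)           # self-normalizing softmax replaces relu-then-square
    return A, mixed                            # mixed is stored and carried to the next layer
```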
The network structure of the algorithm comprises a model trunk task and a model branch task; in this network structure, the model trunk obtains the loss1 part through calculation along the path from E through H_1, H_2, …, H_(n-1), H_n, where E is the initial distributed representation obtained by passing the input through the word-embedding module, and H_1 to H_n are the components of the model trunk;
The algorithm selects the output of a lower layer of the model trunk according to the hyper-parameter M. This output, whose length equals that of the input and whose dimension equals the embedding-layer dimension e, is denoted V and regarded as a distributed representation of word vectors carrying phrase-level features. Using a method consistent with the positive-sampling method of the continuous bag-of-words model, the tensors in V are selected one by one as center words, the peripheral words are determined according to the hyper-parameter N, and for each center word the dot products between the center word and its 2N peripheral words are computed (the best result is obtained when N is 4); the dot products are summed, the activation function logsigmoid is used to adjust the direction of loss optimization, and finally the task loss of the optimization algorithm is calculated:
In this loss, V_i is the distributed representation of the i-th word in the input V, and near(V_i) selects the adjacent words according to the input center word; the activation function logsigmoid keeps the optimization direction consistent with the model trunk, and taking the average ensures that the sentence length does not influence the similarity measure. The smaller the loss value, the larger the dot products between the center word and the peripheral words in the hidden layer and thus the higher their similarity, which ensures that the attention mechanism learns short- and medium-range attention sufficiently. The effectiveness of the positive-sampling method for sentence-level semantic learning is also demonstrated by the effect of the continuous bag-of-words model. Because of poor experimental results, the negative-sampling method of the continuous bag-of-words model was abandoned: the principle of the self-attention mechanism means that even the lower layers of the model can learn long-range dependence, and the negative-sampling method destroys this long-range dependence, causing a loss of about 7% in model accuracy, with the worst results appearing when it is applied to the middle layers of the model, i.e. the middle layers of the transformer model depend most on local attention.
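A sketch of this positive-sampling auxiliary loss under stated assumptions: the window, the negative log-sigmoid form and the mean reduction follow the description above, but the function is an illustrative reading of the text, not the patent's exact formula.

```python
import torch
import torch.nn.functional as F

def cbow_auxiliary_loss(hidden, window=4):
    """hidden: (batch, seq_len, e) output of the selected lower layer of the model trunk."""
    batch, seq_len, _ = hidden.shape
    losses = []
    for i in range(seq_len):                                        # every position is a center word
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        centre = hidden[:, i:i + 1, :]                              # (batch, 1, e)
        neighbours = torch.cat([hidden[:, lo:i, :], hidden[:, i + 1:hi, :]], dim=1)
        dots = (centre * neighbours).sum(dim=-1).sum(dim=-1)        # sum of center-neighbour dot products
        losses.append(-F.logsigmoid(dots))                          # smaller loss <=> larger dot products
    return torch.stack(losses, dim=-1).mean()                       # mean removes the sentence-length effect
```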
The proposed continuous bag-of-words optimization algorithm cooperates with the model trunk task, and the overall loss of the two cooperative tasks is total_loss; the model is optimized with the loss functions of both tasks: total_loss = loss1 + loss2;
The loss value of the model trunk is calculated in different ways depending on the trunk model and the downstream task. To ensure that loss1 and loss2 are treated as equally important, the initial values of the loss functions are used as weights in the final loss function: the actual value of each loss is divided by its initial value, which ensures that the model treats the tasks as equally important during optimization, so as to approach Pareto optimality.
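The weighting just described could be sketched as follows; the class name, the clamping and the moment at which the initial values are captured are assumptions for illustration.

```python
import torch

class BalancedLoss:
    """Weights loss1 and loss2 by their initial values so that both tasks stay equally important."""
    def __init__(self):
        self.init1 = None
        self.init2 = None

    def __call__(self, loss1, loss2):
        if self.init1 is None:                            # record the initial loss values once
            self.init1 = loss1.detach().clamp(min=1e-8)
            self.init2 = loss2.detach().clamp(min=1e-8)
        return loss1 / self.init1 + loss2 / self.init2    # total_loss with initial-value weighting
```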
Experiment:
In step S1, the gated attention unit is optimized to generate the gated residual attention unit, and experiments are carried out on the generated gated residual attention unit:
The CLUECorpusSmall dataset is selected for model transfer. Model initialization: a pre-trained model of the linear attention unit is taken from the network, and after the model structure is changed, transfer training is performed with the CLUECorpusSmall dataset;
After the transfer training is completed, subtasks of the CLUE benchmark are selected to evaluate the model performance. To verify the effect of prompt learning, the simpler emotion classification task is selected. The CLUECorpusSmall dataset comprises the following sub-corpora (about 14 GB in total): 1. News corpus news2016zh_corpus: divided into an upper and a lower part, 2,000 small files in total; the corpora of the CLUECorpusSmall dataset mainly include news, advertisements and knowledge, from which the pre-trained model can learn knowledge and common sense. 2. Community interaction corpus webText2019zh_corpus: about 3 GB of text, 900 small files in total. 3. Wikipedia corpus wiki2019zh_corpus: about 1.1 GB of text, containing about 300 small files. 4. Comment data corpus comments2019zh_corpus: about 2.3 GB of text, 784 small files in total, comprising 547 files of Dianping reviews and 227 files of Amazon reviews; several comment datasets from Chinese NLP Corpus were merged, cleaned, converted in format and split into small files. CLUE [45] is a Chinese natural language understanding benchmark containing 9 tasks for evaluating the performance of models on Chinese natural language tasks; the dataset covers two major categories and 11 task subsets, of which TNEWS, IFLYTEK, WSC, AFQMC, CSL, OCNLI and CMNLI are classification tasks. TNEWS: a text classification task on Toutiao news headlines, composed of Chinese news published by Toutiao, with 73,360 items in total. Each title is labeled as one of 15 news categories (finance, technology, sports, etc.), and the whole dataset is divided into a training set, a development set and a test set;
IFLYTEK: contains 17,332 descriptions of apps; the task is to assign each description to one of the 119 categories to which the corresponding app belongs, such as food, car rental, education, and so on. Data filtering is similar to the technique used for the TNEWS dataset;
WSC: a coreference resolution task that requires the model to determine whether a pronoun and a noun in a sentence refer to the same person or thing. The dataset is built in the same way as similar English datasets. Sentences in the dataset were manually selected from 36 contemporary Chinese literary works, and their coreference relations were then manually annotated by linguists, giving 1,838 questions in total;
AFQMC: the Ant Financial Question Matching Corpus, from the developer contest of the Ant Technology Exploration Conference. It is a binary classification task aimed at predicting whether two sentences are semantically similar.
CSL: the Chinese Scientific Literature dataset contains abstracts of Chinese papers from Chinese core journals and their keywords. Chinese core journals cover many fields of natural science and social science. False keywords are generated by the TF-IDF method and mixed with the true keywords. Given an abstract and keywords, the model's task is to distinguish whether the keywords are true or false.
OCNLI: the original Chinese natural language reasoning is collected and arranged strictly according to the MNLI method. The OCNLI consists of 56000 inference pairs with five genres: the premise of news, government, novel, television real records and telephone real records is that the premise of reasoning pair is collected from Chinese information, and the assumption is written by the students who hire from linguistic professionals.
CMNLI: the data set gives two texts, and the model judges whether an implication relationship exists between the two texts. Of course, implementations are also dichotomous tasks.
The remaining tasks, CMRC, C3, CHID and CLUENER, are reading comprehension and NER tasks.
CMRC: CMRC is a span-extraction based Chinese machine reading comprehension dataset. The dataset contains about 19,071 questions from Wikipedia, with answers annotated by humans. In this dataset, every sample consists of a context, a question and the related answer, and the answer is a text span in the context.
CHID: a large-scale Chinese idiom cloze test dataset comprising about 498,611 paragraphs and 623,377 blanks, covering news, novels and essays. The candidate bank contains 3,848 Chinese idioms. For each blank in a paragraph there are 10 candidate idioms: one golden option, several similar idioms, and other idioms randomly selected from a dictionary.
C3: C3 is the first free-form multiple-choice machine reading comprehension dataset for Chinese. Given a document, which may be a dialogue or a mixture of more formal text, and a free-form question not limited to a single question type, the correct answer is selected from 2 to 4 options. 19,577 questions are designed over the 13,369 documents of the dataset, and after shuffling they are divided into a training set, a development set and a test set. These questions are collected from language exams carefully designed by education specialists for assessing the reading comprehension of language learners, similar to the English reading comprehension test datasets RACE and DREAM. CLUENER: a conventional non-nested named entity recognition task. Emotion analysis dataset: the annotated dataset includes 16,883 training items, 2,111 validation items and 2,111 test items, mainly consisting of film reviews, consumer reviews and similar information.
The model of the present invention was implemented with PyTorch on an Ubuntu 20.04 computer, and an NVIDIA RTX A5000 GPU was used for training. The pre-training task was MLM with a masking ratio of 15% and a vocabulary size of 12,000. The model was optimized with the AdamW optimizer, β1 = 0.9, β2 = 0.999, ε = 1e-10; the training precision was fp16, the learning rate 2e-5 and the warm-up ratio 0.1. The batch size was set to 32 with 8 gradient-accumulation steps, model parameters were recorded every 2,000 steps, and after two epochs the model that performed best on the pre-training task was selected for testing on the training set; pre-training took 72 hours;
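The pre-training configuration above can be sketched as follows; only the hyper-parameters (AdamW with β1 = 0.9, β2 = 0.999, ε = 1e-10, learning rate 2e-5, fp16, warm-up ratio 0.1, batch size 32, 8 accumulation steps) come from the text, while the stand-in model, the total number of steps and the linear decay after warm-up are placeholders and assumptions.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 12000)                                   # stand-in for the actual pre-training model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5,
                              betas=(0.9, 0.999), eps=1e-10)
total_steps, warmup_ratio = 100_000, 0.1                        # total_steps is a placeholder
warmup_steps = int(total_steps * warmup_ratio)

def lr_lambda(step):                                            # linear warm-up, then linear decay
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())  # fp16 mixed precision
accum_steps, batch_size = 8, 32                                 # gradient accumulation and batch size
```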
Downstream tasks were also optimized with the AdamW optimizer, β1 = 0.9, β2 = 0.999, ε = 1e-10; the learning rate was 2e-5, the training precision fp32, the batch size 32, gradient accumulation was not used, and classifiers with different structures were used for the experiments according to the different downstream tasks;
Prompt learning was implemented with the open-source toolkit OpenPrompt, likewise optimized with the AdamW optimizer; the optimal hyper-parameters were searched for during the training stage, the accuracy was verified on the validation set after each epoch, and the best value obtained during training was finally taken. With the model hyper-parameters adjusted to their optimal values, the accuracy of the model on the pre-training task MLM is about 79%. Moreover, the comparison plots of the experimental results show that the speed is greatly improved compared with BERT and the gated attention unit. The model was also evaluated on downstream tasks, with CLUE as the evaluation dataset, and its performance in prompt learning was tested according to its performance on the pre-training model;
CLUE test results: using the CLUE-related datasets and with the model hyper-parameters adjusted to their optimal values, the accuracy of the model on the pre-training task MLM is 79%. Moreover, the comparison plots show that the convergence speed and accuracy are greatly improved compared with BERT and the gated attention unit. The model was also evaluated on downstream tasks, with CLUE as the evaluation dataset, and its performance in prompt learning was tested according to its performance on the pre-training model; GRAU in the tables denotes the gated residual attention unit proposed in this work;
TABLE 1 experimental results of GRAU on part of the task in CLUE
From Table 1, it can be seen that the models used for experimental comparison include BERT, RoBERTa, RoFormer, GAU and GRAU. GRAU is better than GAU overall, and on the two classification tasks tnews and cmnli and on the cluener task it is better than RoFormer. Although it performs relatively poorly on the wsc task, it is still better than BERT and GAU. Compared with the structure it improves on, GRAU shows a certain improvement on the classification tasks and the reading comprehension task, and the experimental results show that the improvement brought by the attention residual in the pre-training stage also appears in the fine-tuning stage of downstream tasks. Because the gated residual attention unit performs well in the pre-training stage but relatively poorly in downstream-task fine-tuning, and prompt learning, as a new paradigm for model optimization in natural language processing, can transfer the performance of a model on the MLM task into performance on downstream tasks, the proposed GRAU is used to verify the task performance of this method when prompt learning is applied to an emotion analysis dataset; the specific results are shown in Table 3;
The natural language templates P1, P2 and P3 used in Table 3 are shown in Table 2:
Table 2 Natural language prompt-learning templates used with the gated linear unit
The three templates P1, P2 and P3 were selected from a number of natural language templates; their effects on the BERT model range from poor to good.
[MASK] is the word to be predicted and outputs the prediction result, and [TEXT] denotes the model input and is filled by the input. The templates are implemented with the open-source toolkit OpenPrompt. The validation set size is 2,111, and the experimental results are zero-sample results obtained by direct prediction with the pre-trained model. The models compared in this experiment include BERT-BASE, BERT-LARGE, GAU and GRAU. The experimental results are shown in Table 3:
TABLE 3 experiment results using different natural language templates on emotion analysis task
Because the model performs very well on the MLM task, its performance in the prompt-learning direction was tested. The model performs well when natural language prompt words are used: with few parameters, it performs nearly as well as the BERT-LARGE model on the zero-sample task. In the prompt-learning direction, the attention residual encoder combined with the gated linear unit performs very well. When the prompt words are poor, the results of BERT-BASE and GAU are very bad, but GRAU can still keep a test performance similar to BERT-LARGE, which shows that under zero-sample learning with a natural language template the effect of GRAU approaches that of a large model, demonstrating the effect of the residual attention unit.
Moreover, on the natural language template P2, where BERT-LARGE performs poorly, the experimental result of GRAU exceeds that of BERT-LARGE, which shows that GRAU has better prompt-learning potential than BERT-LARGE.
It can be seen that under the zero-sample condition, with a well-performing natural language template, an accuracy of 86.83 can be achieved on the binary emotion analysis task; with only 96M parameters the model's performance is essentially on a par with that of the 330M-parameter BERT-LARGE model, differing by only 0.28 in accuracy when only the best result over the three templates is considered, and it is comprehensively better than GAU with 92M parameters and BERT-BASE with 110M parameters.
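For reference, the zero-sample evaluation described above amounts to reading the logits at the [MASK] position of a natural language template and comparing two label words. The following sketch uses the Hugging Face transformers API; the backbone name, template text and label words are illustrative assumptions, not the P1/P2/P3 templates of Table 2.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")      # placeholder backbone
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

def zero_shot_sentiment(text, pos_word="好", neg_word="差"):
    prompt = f"{text}。总体来说很{tokenizer.mask_token}。"           # hypothetical template
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
    pos_id = tokenizer.convert_tokens_to_ids(pos_word)
    neg_id = tokenizer.convert_tokens_to_ids(neg_word)
    return "positive" if logits[0, mask_pos, pos_id] > logits[0, mask_pos, neg_id] else "negative"
```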
Based on this experiment, the invention continued with a few-sample experiment [53], the results of which are shown in the following table, which records the additional performance of the model on the emotion analysis task with few samples.
TABLE 4 experimental results using P3 natural language templates with few samples
BERT-LARGE could not be tested for hardware reasons. From the results in the table it can be seen that, regardless of the structure, the final accuracy improves to some extent in the few-sample case compared with the zero-sample case; in particular, the accuracy of the GAU structure improves by about 8%, which shows the importance of data to the model. From the final results, GRAU still achieves the best result: with the model fine-tuned on 200 items of data, it reaches an accuracy of 89.72% on the validation set of 2,111 items.
The attention residual encoder combined with the gated linear unit achieves an excellent effect on prompt learning with natural language templates, but the effect of natural language templates is constrained by many conditions; their design depends heavily on the selection of hand-crafted features and, although effective, brings various inconveniences. Prompt learning has therefore evolved towards self-learning prompt templates: the template is initialized by an LSTM or by the model, the model trunk is frozen during training and only the self-learning template is optimized, with the template length and the position of the masked word as hyper-parameters. After repeated adjustment on the validation set, the best-performing self-learning template is as follows:
[soft][soft][soft][soft][MASK][soft][soft][soft][soft][TEXT]
[soft] is the self-learning template to be learned, [MASK] is the position of the word to be predicted, and [TEXT] is the model input. The experimental results using the above self-learning template are as follows:
TABLE 5 experimental results of GRAU when self-learning templates are used
The training set of this experiment is 200 items of labeled data, one tenth of the size of the final test set. Compared with a good natural language template, the experimental result of using the self-learning template alone takes a small step back; however, the poor performance of the self-learning template in the few-sample case can be relieved to a certain extent by using an additional MLM task.
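A minimal sketch of how the self-learning template shown above can be realized in embedding space is given below; the experiments actually use OpenPrompt, so this is an illustrative assumption rather than the toolkit's implementation, and backbone refers to a hypothetical frozen masked language model.

```python
import torch
import torch.nn as nn

class SoftTemplate(nn.Module):
    """Eight trainable [soft] vectors arranged as [soft]x4 [MASK] [soft]x4 [TEXT];
    the backbone stays frozen and only these vectors are optimized."""
    def __init__(self, embed_dim=768, n_soft=8):
        super().__init__()
        self.soft = nn.Parameter(torch.randn(n_soft, embed_dim) * 0.02)

    def forward(self, text_emb, mask_emb):
        # text_emb: (batch, len, embed_dim) embedded [TEXT]; mask_emb: (embed_dim,) [MASK] embedding
        batch = text_emb.size(0)
        left = self.soft[:4].unsqueeze(0).expand(batch, -1, -1)
        right = self.soft[4:].unsqueeze(0).expand(batch, -1, -1)
        mask = mask_emb.view(1, 1, -1).expand(batch, 1, -1)
        return torch.cat([left, mask, right, text_emb], dim=1)

# the backbone is frozen; only the template parameters receive gradients
# for p in backbone.parameters(): p.requires_grad_(False)
```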
The shortcomings of the attention residual encoder combined with the gated linear unit can also be seen here: when the self-learning template is optimized only by the downstream task, the accuracy is much worse than when a natural language template is used, which shows that the encoder method proposed by the invention depends heavily on natural language information.
In step S2, experiments are carried out using the proposed optimization algorithm based on the continuous bag-of-words model:
The proposed optimization algorithm based on the continuous bag-of-words model aims to improve the performance of the model on downstream tasks by changing the feature representation of the model. The experiments in this section therefore verify, in the case of using a natural language template, that the algorithm improves the performance of the model through optimization of the model representation. Because the self-learning-template p-tuning method needs to freeze the trunk model during optimization, and unfreezing the trunk gives poor training results for which a solution is hard to find, the performance of the method with self-learning templates is difficult to test experimentally. Other models were therefore selected to use the unsupervised optimization algorithm, and the effectiveness of the unsupervised optimization method was verified on the downstream-task test set.
Furthermore, the effectiveness of the method in fine-tuning on downstream tasks was also verified with the BERT model on the GLUE dataset.
The General Language Understanding Evaluation benchmark (GLUE) is a tool for evaluating and analyzing the performance of models across a variety of existing natural language understanding tasks. Models are evaluated based on the average accuracy over all tasks.
The benchmark comprises 9 tasks, CoLA, SST-2, MRPC, STS-B, QQP, MNLI, QNLI, RTE and WNLI, where:
CoLA: CoLA (the Corpus of Linguistic Acceptability) is a single-sentence classification task; the corpus comes from books and journal articles on linguistic theory, and each sentence is annotated as to whether it is a grammatical word sequence. The task is a binary classification with labels 0 and 1, where 0 means ungrammatical and 1 means grammatical.
Number of samples: training set 8551, development set 1043, test set 1063.
Tasks: acceptability judgment, binary classification into grammatical and ungrammatical.
Evaluation criteria: matthewscreenrelationship.
SST-2: SST-2 (the Stanford Sentiment Treebank) is a single-sentence classification task containing sentences from movie reviews and human annotations of their emotion. The task is to classify the emotion of a given sentence into positive emotion (sample label 1) and negative emotion (sample label 0), using only sentence-level labels. That is, the task is also a binary sentence-level classification into positive and negative emotion.
Number of samples: training set 67350, development set 873, test set 1821.
Tasks: emotion classification, positive emotion and negative emotion classification.
Evaluation criteria: accuracy.
MRPC, STS-B and QQP belong to the sentence-similarity tasks: the input is a sentence pair, and the output is a similarity measure of the two sentences.
MRPC: MRPC (the Microsoft Research Paraphrase Corpus) consists of sentence pairs automatically extracted from online news sources, with human annotations of whether the sentences in each pair are semantically equivalent. The classes are imbalanced (68% positive samples), so, following common practice, both accuracy and F1 are reported.
Number of samples: training set 3,668, development set 408, test set 1,725.
Tasks: judging whether the sentence pair is a paraphrase; binary classification into paraphrase and not paraphrase.
Evaluation criteria: accuracy (accuracy) and F1 value.
STS-B: STS-B (the Semantic Textual Similarity Benchmark) is a collection of sentence pairs drawn from news headlines, video titles, image captions and natural language inference data; each pair is annotated by humans with a similarity score from 0 to 5 (a floating-point number between 0 and 5). The task is to predict these similarity scores; it is essentially a regression problem, but it can still be treated as a five-class sentence-pair classification task using classification methods.
Number of samples: training set 5749, development set 1379, test set 1377.
Tasks: a regression task predicting a floating-point similarity score; a classification method can also be used to treat it as a five-class task.
Evaluation criteria: pearsonand Spearmancovergencoaeficients.
QQP: QQP (the Quora Question Pairs) is a similarity and paraphrase task consisting of question pairs from the community question-answering website Quora. The task is to determine whether a pair of questions is semantically equivalent. Like MRPC, QQP has imbalanced positive and negative samples, except that QQP has 63% negative and 37% positive samples, so this task also reports accuracy and F1. The standard test set is used, for which special labels were obtained from the authors; it is observed that the distribution of the test set differs from that of the training set.
Number of samples: training set 363,870, development set 40,431, test set 390,965.
Tasks: judging whether a question pair is equivalent (equivalent vs. not equivalent); a binary classification task.
Evaluation criteria: accuracy (accuracy) and F1 value.
MNLI, QNLI, RTE and WNLI are all inference tasks: the input is a sentence pair, and the output is an inference of the semantic relation of the latter sentence relative to the former.
MNLI: a natural language inference task based on a crowd-sourced collection of sentence pairs with textual entailment annotations. Given a premise sentence and a hypothesis sentence, the task is to predict whether the premise entails the hypothesis (entailment), contradicts it (contradiction), or neither (neutral). The premise sentences are collected from dozens of different sources, including transcribed speech, fiction and government reports.
Number of samples: training set 392,702, development set dev-matched 9,815, development set dev-mismatched 9,832, test set test-matched 9,796, test set test-mismatched 9,847.
Because MNLI integrates texts of many different domain styles, it is divided into two dataset versions: matched, meaning that the data sources of the training set and the test set are consistent, and mismatched, meaning that the sources of the training set and the test set are inconsistent.
Tasks: sentence pairs, one a premise and one a hypothesis. There are three possible relations between premise and hypothesis: entailment, contradiction and neutral. A three-class sentence-pair classification problem.
Evaluation criteria: matrichedaccuracy/mimatcchedaccuracy.
QNLI: QNLI (Question-answering NLI) is a natural language inference task. QNLI is converted from another dataset, the Stanford Question Answering Dataset (SQuAD). SQuAD 1.0 is a question-answering dataset consisting of question-paragraph pairs, where the paragraph comes from Wikipedia and one sentence in the paragraph contains the answer to the question. There are thus three elements: a paragraph from Wikipedia, a question, and a sentence in the paragraph containing the answer to the question. The sentence pairs in QNLI are obtained by combining the question with each sentence in the context (i.e. the Wikipedia paragraph) and filtering out sentence pairs with low lexical overlap. Compared with the original SQuAD task, the requirement for the model to select the exact answer is removed, as are the simplifying assumptions that the answer is always present in the input and that lexical overlap is a reliable cue.
Number of samples: training set 104,743, development set 5,463, test set 5,461.
Tasks: judging whether the question and the sentence (a sentence in a Wikipedia paragraph) form an entailment; a binary classification task.
Evaluation criteria: accuracy (accuracy).
RTE: RTE (Recognizing Textual Entailment) is an integrated and consolidated collection of data from a series of annual textual entailment challenges, including RTE1, RTE2, RTE3 and RTE5, all constructed from news and Wikipedia. All of these data are converted into two classes; for three-class data, neutral and contradiction are converted into non-entailment in order to maintain consistency.
Number of samples: training set 2,491, development set 277, test set 3,000.
Tasks: judging whether sentence 1 and sentence 2 entail each other; a binary classification task.
Evaluation criteria: accuracy (accuracy).
WNLI: WNLI (Winograd NLI, Winograd Natural Language Inference) is a natural language inference task; the dataset comes from a conversion of the data of the Winograd Schema Challenge, a reading comprehension task in which the system must read a sentence containing a pronoun and find the pronoun's referent from a list. The samples are manually created to defeat simple statistical approaches: each sample depends on the context information provided by a single word or phrase in the sentence. To convert the problem into sentence-pair classification, the pronoun in the original sentence is replaced with each possible referent from the list. The task is to predict whether the two sentences in each pair are related (entailment or not entailment); the two classes of the training set are balanced, while the test set is imbalanced, with 65% not entailment;
number of samples: 635 training examples, 71 development examples, and 146 test examples;
tasks: determine whether the sentence pair is related (entailment or not entailment); binary classification;
Evaluation criteria: accuracy (accuracy);
The dataset covers a comprehensive range of tasks with ample data, and can measure a model's natural language processing capability from multiple angles;
GLUE contains nine NLU tasks, all in English, covering natural language inference, textual entailment, sentiment analysis, semantic similarity, and more. Well-known models such as BERT, XLNet, RoBERTa, ERNIE, and T5 have been evaluated on this benchmark, so the dataset carries a certain authority and can evaluate a model well;
the model of the present invention was implemented on a ubuntu20.04 computer using PyTorch. The present invention uses NVIDIARTX3090GPU for training. The test was performed using the standard data set provided by the GLUE test benchmark with a vocabulary size of 30122. The model was optimized by Adam optimizer, β1=0.9, β2=0.999, e=10-10, searching for optimal super parameters (i.e. initial learning rate and epoch) during the training phase. The invention sets the batch size to 64 and evaluates the model performance after trimming on the test set.
The GRAU used for natural-language-template prompt learning starts from a pre-trained model, and the BERT-BASE model used for transfer learning comes from Google.
To verify that the unsupervised optimization algorithm improves model performance, experiments are designed in this section to validate its effectiveness on natural language processing tasks.
Experiments mainly verify the performance of the method in two directions:
1: verifying the effectiveness of the algorithm on prompt learning by using a natural language template;
Although it is used as a multi-task optimization objective, the optimization algorithm is a form of unsupervised training, so it can optimize prompt learning when a natural language template is used and bring a modest performance gain.
The optimization algorithm brings a certain degree of performance improvement only when natural language templates are used. For natural language processing models in few-sample settings, additional unsupervised objectives may each contribute improvements of different magnitudes, and these unsupervised gains are likely orthogonal to the improvements brought by the model itself. In the experiments, runs that use this optimization algorithm are labeled CBOW; the experimental configuration of Table 5 is identical to that of Table 4, the results partially overlap, and the performance of the different models is recorded with 0 samples and with 200 labeled examples, respectively.
Table 5 Experimental results of the CBOW optimization algorithm on prompt learning using natural language templates
Table 5 Experimental results of the CBOW optimization algorithm on prompt learning using natural language templates (continued)
The experimental results in the table show that the unsupervised training objective proposed in this section brings a notable improvement for BERT, GAU, and GRAU in prompt learning with a natural language template.
In the zero-sample setting, simply running the unsupervised optimization algorithm on the validation set alongside the validation process already yields roughly a 1% improvement, which is a good result.
With a small amount of labeled data, running the unsupervised optimization algorithm together with the original training task yields a final accuracy of 90.29, exceeding the 90.09 achieved by the self-learned-template (p-tuning) + MLM combination, which illustrates to some extent the potential of the multi-task model optimization method. The experimental results in the table demonstrate the effectiveness of the continuous bag-of-words-based multi-task optimization method within a prompt learning setup.
In the prompt learning experiments, the best result appears when the task is applied to the last or penultimate layer of the model, while the second-best result comes from optimizing the shallow layers as originally expected; in the subsequent downstream-task experiments, the best result appears only when the shallow layers of the model are optimized.
2: using the non-supervision training task target on other models to verify the application range and the effectiveness of the algorithm;
to verify the applicability of the unsupervised optimization algorithm, experiments were run on the 9 subtasks of the GLUE dataset, covering classification, sentence-similarity, and sentence-inference tasks; the experimental results are shown in the following table:
Table 6 Experimental results of the BERT model on the GLUE dataset using the CBOW optimization algorithm
Table 6 Experimental results of the BERT model on the GLUE dataset using the CBOW optimization algorithm (continued)
It can be seen that when the optimization algorithm is applied on top of the original model, accuracy on the SST-2 and CoLA tasks improves by about 0.2%, and accuracy on QNLI, RTE, and other tasks improves by more than 1%.
In addition, on the XLNet model architecture, the optimization algorithm also improves the SST-2 task, raising accuracy by about 0.5 points.
Moreover, the natural-language-template prompt learning task is in Chinese, while the GLUE tests are English tasks, and the optimization algorithm brings some improvement on both. This shows that the unsupervised optimization algorithm does improve model performance on downstream tasks and does have a certain breadth of applicability. Meanwhile, the algorithm's computation is simple: compared with the cost of the attention mechanism, it does not consume excessive computing resources.
The optimization objective of the algorithm is largely orthogonal to that of the model trunk: it only increases the similarity of adjacent word embeddings in a lower hidden layer. The operation is similar to BERT-whitening; however, whitening acts on the model output, whereas this method optimizes an intermediate hidden layer of the model.
The best result is obtained when the method optimizes the output of the fourth hidden layer of the BERT-BASE model. In the fine-tuning task, the hyperparameter values that give the best result are completely different from those used in prompt learning.
To a certain extent, the algorithm keeps the attention mechanism in the lower layers of the model focused on short-range context, strengthens the lower-layer attention mechanism, and gives the lower-layer self-attention a relatively fixed optimization direction, improving the stability and performance of the model. At the same time, it provides an additional gradient path, which alleviates the vanishing-gradient problem of overly deep models and helps the model converge faster.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification that a person skilled in the art could make according to the technical scheme and inventive concept of the present invention, within the scope disclosed herein, shall be covered by the scope of protection of the present invention.

Claims (6)

1. A natural language processing method, characterized in that: the method comprises the following steps:
s1: optimizing the gating attention unit to generate a gating residual attention unit;
s11: optimizing the network model;
s12: combining an attention residual structure with a gated attention unit, and adapting the attention residual structure and the gated attention unit;
s2: and generating an unsupervised optimization algorithm which is universal to a transformer model based on a continuous word bag model, and optimizing the problem that the gating residual attention unit is not good in performance when a natural language template is used in prompt learning.
2. A natural language processing method according to claim 1, wherein: in the step of optimizing the network model in S11, the input to the network of the gated residual attention unit algorithm is a natural language input of length n. The overall network uses a gated-linear-unit structure and performs three linear transformations to obtain a matrix U for self-gating, a matrix V as the object weighted by attention, and a matrix Z for the self-attention calculation. The matrix V is the centre of the model; the input is projected up in dimension by linear transformation for U and V, while Z is projected down in dimension, so that the self-attention calculation uses a Z of lower dimension than the input. The matrices U, V and Z all use the activation function swish.
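A minimal sketch of the three projections described in claim 2. The sizes e, h, d are illustrative assumptions, and swish corresponds to F.silu in PyTorch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedProjections(nn.Module):
    def __init__(self, e: int = 768, h: int = 1536, d: int = 128):
        super().__init__()
        self.w_u = nn.Linear(e, h)   # up-projection for the self-gating matrix U
        self.w_v = nn.Linear(e, h)   # up-projection for V, the object weighted by attention
        self.w_z = nn.Linear(e, d)   # down-projection for Z, the source of self-attention

    def forward(self, x: torch.Tensor):
        # x: (batch, n, e) -- embedded natural language input of length n
        return F.silu(self.w_u(x)), F.silu(self.w_v(x)), F.silu(self.w_z(x))
```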
3. A natural language processing method according to claim 1, wherein: in the step S12 of combining the attention residual structure with the gated attention unit and adapting the two, the gated attention unit processes data in a way that is largely consistent with a transformer model. Unlike the self-attention-plus-feedforward structure of the traditional transformer model, the gated attention unit combines the self-attention mechanism with a gated linear unit, expressed by the formula:
O = (U ⊙ AV)W + b ∈ R^e  (1)
The above equation combines a gated linear unit with a self-attention mechanism, and its input is similar to that of a transformer model. The natural language input of the network can be expressed as X = (x_1, x_2, ..., x_{n-1}, x_n). Natural language processing typically learns a word-embedded representation, converting the text input into the above-mentioned distributed representation, which may be written as E = (e_1, e_2, ..., e_{n-1}, e_n) ∈ R^e.
The input E is an n×e matrix. The gated attention unit applies three linear transformations to the input, followed by the activation function swish, to obtain the matrices U, V and Z. The calculation process is as follows:
U = swish(EW_u + b) ∈ R^h  (2)
V = swish(EW_v + b) ∈ R^h  (3)
Z = swish(EW_z + b) ∈ R^d  (4)
The sizes of U and V are identical, and because the gated linear unit plays the larger role in the model, their dimension h is greater than the word-embedding dimension e. Z is used to compute the model's self-attention matrix; since the self-gating introduced by the gated linear unit weakens the attention mechanism, the dimension d of the matrix Z used to compute the attention matrix A is smaller than the word-embedding dimension e. In practice, E is mapped by a single linear transformation and activation from dimension e to 2h+d and then split into blocks to simplify the computation. Z serves as the source of the self-attention module: Q and K are obtained from it by linear transformations, calculated as follows:
Q = ZW_q + b ∈ R^d  (5)
K = ZW_k + b ∈ R^d  (6)
Q and K are used as inputs to calculate the attention score, and the formula is as follows:
at a size comparable to the BASE version of BERT, the proposed model has 24 layers in total; n denotes the layer count associated with the attention score and is used to normalize it, with the calculation formula as follows:
the attention score result is subjected to square operation after passing through an activation function relu, and finally an attention matrix A epsilon R is obtained n ×n The matrix a is multiplied by V to obtain a matrix V that achieves global attention through the self-attention matrix a, after which the parameters are brought into equation (1).
4. A natural language processing method according to claim 1, wherein: in the step S12 of combining the attention residual structure with the gated attention unit and adapting the attention residual structure, attention is strengthened by introducing an attention residual into the computed scores while occupying little extra memory. The attention residual allows the attention matrix to converge quickly, so the model stabilizes early and gains a regularizing effect. The most direct implementation is as follows:
in the BASE version model, n is 24, and in the above formula the residual attention is passed through relu and then squared. Both the formula and experiments show that the values of A_n change faster and faster as the number of layers increases, which affects the model's back-propagation. The formula therefore needs to be optimized: the accumulated sum of attention is normalized, the relu-then-square operation is replaced by the self-normalizing softmax function, and the hyperparameter n that kept the values bounded is removed, so that the formula takes the following form:
to prevent A from advancing regularization, the momentum thought is adopted to multiply the realized part of attention residual error by the super-parameterThe stabilization of A is realized at the later stage of the model, the early appearance of gradient disappearance is avoided, and in the BASE version of 24 layers, the specific realization formula is as follows:
5. A natural language processing method according to claim 1, wherein: S2 optimizes the problem that the gated residual attention unit performs poorly when a natural language template is used in prompt learning. In the step of providing the optimization algorithm based on the continuous bag-of-words model, the network structure of the optimization algorithm comprises a model trunk task and a model branch task; within this network structure, the continuous bag-of-words optimization algorithm runs as a branch alongside the model trunk, and the loss1 component is obtained by computation along the trunk path from E through H_1, H_2, ..., H_{n-1}, H_n, where E is the distributed representation obtained by word-embedding the input through a word embedding module (the initialization of the distributed expression), and H_1 to H_n are the constituent layers of the model trunk.
6. A natural language processing method according to claim 1, wherein: in the step S2 of generating, based on the continuous bag-of-words model, an unsupervised optimization algorithm universal to transformer models and optimizing the poor performance of the gated residual attention unit when a natural language template is used in prompt learning, the output of a lower layer of the model trunk is selected according to a hyperparameter M; this output, whose length matches the input and whose dimension is that of the embedding layer, is denoted V and regarded as a distributed expression of word vectors with phrase-level characteristics. Using a method consistent with the positive-sampling procedure of the continuous bag-of-words model, the tensors in V are taken one by one as center words, and the surrounding words are determined for each center word according to a preset hyperparameter N, so that different center words yield different surrounding-word sets; the best result is obtained when the number of surrounding words per center word is 2N. After summation, an activation function is used to adjust the direction of the loss optimization, and finally the task loss of the optimization algorithm is calculated, with the specific calculation formula as follows:
where V_i is the distributed expression of the i-th word in the input V, and near(V_i) selects the words adjacent to the given center word; the activation function logsigmoid keeps the optimization direction consistent with the model trunk, and taking the mean value ensures that sentence length does not influence the similarity measurement. The continuous bag-of-words optimization algorithm is added to the model trunk task, and the overall loss of the two cooperating tasks is total_loss; optimizing the model with the losses of both tasks gives the loss function total_loss = loss1 + loss2, where the loss value of the model trunk is computed in different ways depending on the trunk model and the downstream task. In this loss-function optimization method, to ensure that loss1 and loss2 are treated as equally important, the initial values of the loss functions are used as weights, and the final loss function is expressed as follows: dividing the actual value of each loss by its initial value ensures that the model treats the multiple tasks as equally important during optimization, so as to approach pareto optimality.
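A minimal sketch of the branch-task loss described in this claim, continuing the imports of the earlier sketches. The variable names, the handling of sentence boundaries, the negation that turns the logsigmoid similarity into a loss, and the commented total-loss weighting are assumptions; the exact formulas are given by the images not reproduced above.

```python
def cbow_branch_loss(v: torch.Tensor, n_window: int = 2) -> torch.Tensor:
    # v: (seq_len, emb) -- output of the M-th lower hidden layer, treated as
    # phrase-level distributed word representations
    seq_len = v.shape[0]
    terms = []
    for i in range(seq_len):                                   # each position as the center word
        idx = [j for j in range(i - n_window, i + n_window + 1)
               if 0 <= j < seq_len and j != i]                  # up to 2N surrounding words (N per side)
        context = v[idx].sum(dim=0)                             # sum the surrounding words
        terms.append(F.logsigmoid(context @ v[i]))              # similarity of center word vs. context
    # negate and average: logsigmoid keeps the optimization direction aligned with
    # the model trunk, and the mean removes the effect of sentence length
    return -torch.stack(terms).mean()

# Multi-task combination sketch: weight each loss by its initial value so the two
# tasks stay equally important (the pareto-style balancing described above), e.g.
# total_loss = loss1 / loss1_initial + loss2 / loss2_initial
```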
CN202310917948.2A 2023-07-25 2023-07-25 Natural language processing method Pending CN117252195A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310917948.2A CN117252195A (en) 2023-07-25 2023-07-25 Natural language processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310917948.2A CN117252195A (en) 2023-07-25 2023-07-25 Natural language processing method

Publications (1)

Publication Number Publication Date
CN117252195A true CN117252195A (en) 2023-12-19

Family

ID=89125396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310917948.2A Pending CN117252195A (en) 2023-07-25 2023-07-25 Natural language processing method

Country Status (1)

Country Link
CN (1) CN117252195A (en)

Similar Documents

Publication Publication Date Title
CN109344236B (en) Problem similarity calculation method based on multiple characteristics
Li et al. Context-aware emotion cause analysis with multi-attention-based neural network
Nayak et al. To Plan or not to Plan? Discourse Planning in Slot-Value Informed Sequence to Sequence Models for Language Generation.
US11409964B2 (en) Method, apparatus, device and storage medium for evaluating quality of answer
US20230069935A1 (en) Dialog system answering method based on sentence paraphrase recognition
Lagakis et al. Automated essay scoring: A review of the field
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
Chernova Occupational skills extraction with FinBERT
Liu et al. Cross-domain slot filling as machine reading comprehension: A new perspective
Simske et al. Functional Applications of Text Analytics Systems
Nassiri et al. Approaches, methods, and resources for assessing the readability of Arabic texts
Kumari et al. Context-based question answering system with suggested questions
Murugathas et al. Domain specific question & answer generation in tamil
Karpagam et al. Deep learning approaches for answer selection in question answering system for conversation agents
CN113157932B (en) Metaphor calculation and device based on knowledge graph representation learning
Molino et al. Distributed representations for semantic matching in non-factoid question answering.
Lee Natural Language Processing: A Textbook with Python Implementation
Luo Automatic short answer grading using deep learning
Van Tu A Deep Learning Model of Multiple Knowledge Sources Integration for Community Question Answering
CN117252195A (en) Natural language processing method
CN113011141A (en) Buddha note model training method, Buddha note generation method and related equipment
Guo An automatic scoring method for Chinese-English spoken translation based on attention LSTM
CN113515935A (en) Title generation method, device, terminal and medium
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
Bañeras-Roux et al. HATS: An Open Data Set Integrating Human Perception Applied to the Evaluation of Automatic Speech Recognition Metrics

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination