CN113743095A - Chinese question generation unified pre-training method based on word lattice and relative position embedding - Google Patents

Chinese question generation unified pre-training method based on word lattice and relative position embedding

Info

Publication number
CN113743095A
CN113743095A CN202110814546.0A
Authority
CN
China
Prior art keywords
training
model
relative position
domain
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110814546.0A
Other languages
Chinese (zh)
Inventor
朱磊
皎玖圆
张亚玲
姬文江
王一川
黑新宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority to CN202110814546.0A priority Critical patent/CN113743095A/en
Publication of CN113743095A publication Critical patent/CN113743095A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 Indexing; Data structures therefor; Storage structures
    • G06F 16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses a unified pre-training method for Chinese question generation based on word lattice and relative position embedding, which specifically comprises the following steps: performing domain pre-training on the RoBERTa parameters; generating a target-domain dictionary quickly and accurately in a semi-supervised, semi-manual manner; merging the relative position information of the input Chinese characters and words into a Transformer layer according to the dictionary; performing task pre-training of a newly built Transformer layer with a large amount of open-domain question-answer data; and training and inference for question generation. Because the relative position information of each single character and each domain word is added to the model input, the model can learn more positional relations and performs better when generating questions for target-domain input. Domain pre-training and task pre-training are also applied to the model to enhance its inference ability in a specific domain. On the same question-answer data set, the proposed model achieves better results.

Description

Chinese question generation unified pre-training method based on word lattice and relative position embedding
Technical Field
The invention belongs to the field of question generation in Chinese natural language processing, and provides a unified pre-training method for Chinese question generation based on word lattice and relative position embedding.
Background
With the development of the internet and information technology, a large amount of information has flooded onto the internet, which has also driven the development of artificial intelligence. In natural language processing, intelligent systems usually have to process large amounts of input text, so finding better ways to process large corpora is valuable.
Question generation (QG) is a research hotspot in natural language processing. Because human thinking is active and innovative, traditional rule-based question-answering systems struggle to produce satisfactory question sentences; at the same time, the computing power of computers has increased greatly in recent years, so many deep-learning-based question-answering systems have gradually come into use. One very important application is education: students are exposed to a large amount of professional knowledge and vocabulary during their studies, and to become familiar with that knowledge they need to consolidate their memory through questions. A question generation system can assist teachers in posing relevant questions in a field and relieve teaching pressure. Question generation can also be used in chatbots to enhance human-computer interaction. In summary, generating high-quality questions can not only advance research in natural language processing but also promote fields such as psychotherapy and education. It is therefore of great practical significance to study question-answering systems that can pose high-quality questions.
In recent years, Transformer-based models have developed rapidly. The attention mechanism introduced by the Transformer can effectively acquire context information from the input corpus, and training a Transformer on large amounts of text lets the model learn the implicit contextual relationships in natural language. Models such as BERT, RoBERTa, GPT-2, and UniLM all perform well in NLP. These models can be transferred to different downstream tasks: after pre-training, a downstream task needs only a small amount of labeled text to make the model converge, and the transferred model performs better on that task. The UniLM language model combines the masking ideas of several other models and, depending on the specific task, adopts bidirectional, left-to-right, right-to-left, and sequence-to-sequence masking, so that different masking strategies make the model better at different kinds of tasks; for text generation, for example, left-to-right masking improves generation ability. For Chinese language models, Cui et al. trained a whole-word-masking Chinese RoBERTa model: because several Chinese characters often combine into a complete word with a new meaning, masking all the tokens of a word together better captures word boundary relationships, and the model reached the best level among pre-trained models on multiple Chinese data sets. However, because vocabulary and semantics differ greatly across Chinese domains, the model cannot achieve good results in every specialized domain.
Disclosure of Invention
The invention aims to provide a unified pre-training method for Chinese question generation based on word lattice and relative position embedding, which integrates domain word lattice (Lattice) embedding and relative position encoding (Relative Position Embedding) and adds both domain pre-training and task pre-training. This improves the generation accuracy of the model in the target domain and produces meaningful question sentences more efficiently.
The technical scheme adopted by the invention is as follows.
The unified pre-training method for Chinese question generation based on word lattice and relative position embedding specifically comprises the following steps:
step 1, performing domain pre-training on the RoBERTa parameters;
step 2, generating a target-domain dictionary quickly and accurately in a semi-supervised, semi-manual manner;
step 3, constructing a special mask matrix to improve the generation capability of the model;
step 4, constructing a special relative position embedding matrix and, according to the dictionary of step 2, fusing the relative position information of the input Chinese characters and words into a Transformer layer;
step 5, the newly built Transformer layer inherits the 12th-layer parameters of the RoBERTa model and performs task pre-training with a large amount of open-domain question-answer data;
step 6, training and inference for question generation.
The present invention is also characterized in that,
the step 1 comprises the following specific steps:
initial parameters of a Transformer block of a model in the field pre-training are taken from a basic Robert of Wiki hundred-subject corpus training, and then model pre-training is carried out on a field information text crawled on the internet. Pre-training uses the Robert's two-way hiding pre-training mechanism and the full word hiding mechanism. The dictionary in full word masking uses the open domain dictionary disclosed to accommodate the need for pre-training. By using these two mechanisms, we have optimized the pre-processing of the model.
Step 2 comprises the following specific steps:
To acquire the target-domain dictionary more quickly, the invention uses a semi-supervised, semi-manual approach to speed up dictionary generation. First, electronic documents of the target domain and a large-scale open-domain dictionary are selected manually; the target-domain documents are fed into a named entity recognition deep learning model, and the entities it recognizes are added to the domain dictionary. Then, the words of the large-scale open-domain dictionary are indexed against the target-domain text in a rule-based manner, and the words found in the text are added to the target-domain dictionary. Finally, the resulting domain dictionary is reviewed manually to form the final domain vocabulary dictionary.
Step 3 comprises the following specific steps:
During training, the original text and the target question sentence are concatenated and then fed into the model. Tokens in the first half of the input can attend to text in both directions, while tokens in the second half can only attend to the first half and to the tokens on their left.
Step 4 comprises the following specific steps:
Word lattice and relative position embedding add the positional relationship between every pair of single characters or words to the attention calculation, strengthening the attention mechanism in the Transformer. The invention therefore uses relative position encoding for every single character and word in the task pre-training stage. At the same time, relative position encoding expresses the positional information between any two words unambiguously.
Step 5 comprises the following specific steps:
To save computational resources and adapt to smaller manually labeled datasets, a transfer scheme based on pre-trained models is needed to provide sufficient general encyclopedic knowledge and domain information. The Transformer layer that integrates the word lattice and relative position encoding therefore inherits the last-layer RoBERTa parameters obtained from the domain pre-training of step 1, transferring both encyclopedic and domain knowledge.
Because the model has many parameters and manually labeled question-answer data are usually scarce, task pre-training is added: the model is pre-trained on a large amount of open-domain question-answer data crawled from the web, which strengthens its question generation ability.
Step 6 comprises the following steps:
The question-answer text of the target domain is used to train the UniLM language model whose last encoder module has been replaced and which has undergone task pre-training; the cross entropy between the model's decoded predictions and the questions given in the original training data is computed, and the model is optimized with the Adam optimizer. Inference mainly uses beam search.
The invention has the following beneficial effects:
The invention provides a unified pre-training method for Chinese question generation based on word lattice and relative position embedding, built on the Transformer and on a constructed dictionary of the target domain. The core of the model is the embedding of additional information, namely lattice embedding (domain vocabulary embedding) and relative position embedding, designed specifically for the particularities of question generation; domain pre-training and task pre-training are also applied to the model to enhance its inference ability in a specific domain. Because the relative position information of each single character and each domain word is added to the model input, the model can learn more positional relations and performs better when generating questions for target-domain input. On the same question-answer data set, the proposed model achieves better results.
Drawings
FIG. 1 is a flow chart of the unified pre-training model for Chinese question generation based on word lattice and relative position embedding according to the present invention.
FIG. 2 shows the method for generating a domain dictionary according to the present invention.
FIG. 3 is a diagram of how the invention embeds the word lattice with relative position encoding and the domain vocabulary.
FIG. 4 shows the Seq2Seq mask matrix M in step 3 of the present invention.
Detailed Description
The unified pre-training method for Chinese question generation based on word lattice and relative position embedding according to the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1:
step 1: domain pre-training for Robert parameters
The specific steps are 1.1: acquiring domain pre-training data;
initial parameters of a Transformer block of a model in the field pre-training are taken from a basic Robert of Wiki hundred-subject corpus training, and then model pre-training is carried out on a field information text crawled on the internet. Pre-training uses the Robert's two-way hiding pre-training mechanism and the full word hiding mechanism. The dictionary in full word masking uses the open domain dictionary disclosed to accommodate the need for pre-training. By using these two mechanisms, we have optimized the pre-processing of the model.
Step 1.2: RoBERTa's bidirectional masking pre-training mechanism.
The model uses a bidirectional whole-word masking prediction mechanism, allowing tokens to attend to the text in both directions. To adapt to Chinese, whole-word masking pre-training is used, because in Chinese the meaning expressed by a whole word is sometimes completely different from the meanings of the individual characters it is split into. In this way context information can be encoded efficiently, yielding a contextual representation. Concretely, the model randomly replaces characters or words with "[MASK]": 15% of the tokens in the sequence are selected; of these, 80% are replaced with "[MASK]", 10% are replaced with other characters or words, and 10% are left unchanged.
The cross entropy loss function computes the loss between the predictions and the original tokens for training.
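The masking scheme just described can be sketched as follows; this is a minimal illustration that assumes the text has already been segmented into words, and the function name, the `None` label convention, and the toy vocabulary are assumptions rather than the patented implementation:

```python
import random

MASK_TOKEN = "[MASK]"

def whole_word_mask(words, vocab, mask_prob=0.15):
    """Whole-word masking: 15% of words are selected; for a selected word,
    80% of the time every character becomes [MASK], 10% a random token,
    10% left unchanged. Only selected positions carry a prediction label."""
    input_tokens, labels = [], []
    for word in words:
        if random.random() < mask_prob:
            r = random.random()
            for ch in word:
                labels.append(ch)                              # predict the original character
                if r < 0.8:
                    input_tokens.append(MASK_TOKEN)            # mask the whole word
                elif r < 0.9:
                    input_tokens.append(random.choice(vocab))  # random replacement
                else:
                    input_tokens.append(ch)                    # keep unchanged
        else:
            for ch in word:
                input_tokens.append(ch)
                labels.append(None)                            # excluded from the MLM loss
    return input_tokens, labels

# Toy example: words produced by a Chinese word segmenter.
tokens, labels = whole_word_mask(["西安", "理工", "大学"], vocab=["的", "了", "是"])
```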
Step 2: generate the target-domain dictionary quickly and accurately in a semi-supervised, semi-manual manner
Step 2.1: acquire dictionary entries by named entity recognition.
Referring to FIG. 2, to acquire the target-domain dictionary more quickly, the invention uses a semi-supervised, semi-manual approach to speed up dictionary generation. Documents of the target domain are selected manually and fed into a named entity recognition deep learning model, and the entities it recognizes are added to the domain dictionary.
Step 2.2: acquire dictionary entries in a rule-based manner.
A large-scale open-domain dictionary is selected and indexed against the target-domain text in a rule-based manner, and the words found in the text are added to the target-domain dictionary. Finally, the resulting domain dictionary is reviewed manually to form the final domain vocabulary dictionary.
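As a rough sketch of this semi-supervised construction (before the final manual review), the following assumes an off-the-shelf NER pipeline — the Hugging Face `pipeline` call with its default model is used purely for illustration — and a plain list of open-domain words:

```python
from transformers import pipeline

def build_domain_dictionary(domain_texts, open_domain_words):
    """Semi-supervised target-domain dictionary: NER entities plus
    open-domain words that actually occur in the domain text."""
    domain_dict = set()

    # Step 2.1: named entity recognition over the target-domain documents.
    ner = pipeline("ner", aggregation_strategy="simple")  # default model; the actual NER model is unspecified
    for text in domain_texts:
        for entity in ner(text):
            domain_dict.add(entity["word"])

    # Step 2.2: rule-based indexing of large-scale open-domain words in the domain text.
    corpus = "\n".join(domain_texts)
    for word in open_domain_words:
        if word in corpus:
            domain_dict.add(word)

    return sorted(domain_dict)  # then reviewed manually to form the final dictionary
```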
Step 3: construct a special mask matrix to improve the generation capability of the model
During training, the original text and the target question sentence are concatenated and then fed into the model. Tokens in the first half of the input can attend to text in both directions, while tokens in the second half can only attend to the first half and to the tokens on their left. For example, given the sequence "[SOS] t1 t2 t3 [EOS] t3 t4 t5", the first five tokens can attend only to those first five tokens, while t3 t4 t5 can attend to themselves and to all tokens preceding them.
Referring to FIG. 4, which shows the matrix M: S1 denotes the first half of the concatenated input sequence, and the elements of its block are all set to "0", meaning that every token in S1 can attend to all token information within S1. S2 denotes the second half of the concatenated input sequence, and the elements of the block through which S1 would attend to S2 are set to "-∞", while S2 itself can attend to the information in S1. To improve the text generation capability of the model, the S1 part of matrix M is thus set to attend to both preceding and following information, while the S2 part attends only to preceding information, including itself. For the lower-right submatrix, the upper-triangular elements are set to "-∞" and the remaining elements to "0", indicating that a token cannot attend to the text that follows it. This specially constructed mask matrix is added to the attention score matrix in the encoder to enhance the generation capability of the model.
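A small PyTorch sketch of the mask matrix M in FIG. 4 (the embodiment below runs on PyTorch); the function name and the way the two segment lengths are passed in are illustrative assumptions:

```python
import torch

def build_seq2seq_mask(len_s1: int, len_s2: int) -> torch.Tensor:
    """Seq2Seq mask M: S1 attends bidirectionally within S1, S2 attends to S1
    and to its own preceding tokens (including itself)."""
    n = len_s1 + len_s2
    mask = torch.zeros(n, n)

    # S1 queries must not attend to S2 keys.
    mask[:len_s1, len_s1:] = float("-inf")

    # Lower-right block: upper triangle is -inf, so an S2 token cannot
    # attend to the text that follows it.
    causal = torch.triu(torch.full((len_s2, len_s2), float("-inf")), diagonal=1)
    mask[len_s1:, len_s1:] = causal

    # The S2-to-S1 block stays 0, so S2 can attend to all of S1.
    return mask  # added to the attention score matrix before the softmax

# Example: first half "[SOS] t1 t2 t3 [EOS]" (5 tokens), second half of 3 tokens.
M = build_seq2seq_mask(5, 3)
```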
Step 4: construct a special relative position embedding matrix and, according to the dictionary of step 2, fuse the relative position information of the input characters and words into the Transformer layer
Referring to FIG. 3, at this stage the input sequence is matched against the domain dictionary to find the words it contains; the head and tail indices of each word within the input sequence are recorded, representing the start and end positions of that dictionary word. For a single character, the start and end indices are the same.
Relative position embedding adds the positional relationship between every pair of tokens or words to the attention calculation, strengthening the attention mechanism in the Transformer. The model therefore uses relative position encoding for every token and word in the task pre-training stage. At the same time, relative position encoding expresses the positional information between any two words unambiguously.
Specifically, head[i] and tail[i] are the head and tail indices of the i-th span $sp_i$. Four relative position matrices are computed as follows:

$$d_{ij}^{(hh)} = \mathrm{head}[i] - \mathrm{head}[j]$$

$$d_{ij}^{(ht)} = \mathrm{head}[i] - \mathrm{tail}[j]$$

$$d_{ij}^{(th)} = \mathrm{tail}[i] - \mathrm{head}[j]$$

$$d_{ij}^{(tt)} = \mathrm{tail}[i] - \mathrm{tail}[j]$$

where $d_{ij}^{(hh)}$, $d_{ij}^{(ht)}$, $d_{ij}^{(th)}$ and $d_{ij}^{(tt)}$ all represent distances between the heads and tails of span i and span j. The final relative position information is computed as follows and passed through an activation function:

$$R_{ij} = \mathrm{ReLU}\!\left(W_r\left(p_{d_{ij}^{(hh)}} \oplus p_{d_{ij}^{(ht)}} \oplus p_{d_{ij}^{(th)}} \oplus p_{d_{ij}^{(tt)}}\right)\right)$$

where $p_d$ is computed according to BERT's official absolute position embedding method, $W_r$ is a learnable parameter, and $\oplus$ denotes the tensor concatenation operation; the concatenated vector is projected back to [hidden size], and $R_{ij}$ represents the positional association information between tokens.

In order for the model to learn this association information adequately, the self-attention mechanism in the Transformer is used, defined in the following form:

$$A^{*}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E}\, E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R}\, R_{ij} + u^{\top} W_{k,E}\, E_{x_j} + v^{\top} W_{k,R}\, R_{ij}$$

where $E_{x_i}$ is the embedding of token $x_i$, every $W$ is a weight matrix, and $u$, $v$ are also learnable parameters. The final output is the embedding tensor of each token with the relative position information injected.
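The relative position term above can be sketched as a FLAT-style module consistent with the four head/tail distances; the tensor shapes, the learned embedding table standing in for $p_d$, and all names are simplifying assumptions rather than the patented implementation. The resulting $R_{ij}$ then enters the attention score $A^{*}_{i,j}$ together with the token embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativePositionEncoding(nn.Module):
    """Computes R_ij = ReLU(W_r (p_hh ⊕ p_ht ⊕ p_th ⊕ p_tt)) for lattice spans."""

    def __init__(self, hidden_size: int, max_len: int = 512):
        super().__init__()
        # p_d: absolute position embedding table, indexed by (distance + max_len).
        self.pos_emb = nn.Embedding(2 * max_len, hidden_size)
        self.w_r = nn.Linear(4 * hidden_size, hidden_size)   # learnable W_r

    def forward(self, head: torch.Tensor, tail: torch.Tensor) -> torch.Tensor:
        # head, tail: (seq_len,) start/end indices of each span in the lattice.
        d_hh = head[:, None] - head[None, :]
        d_ht = head[:, None] - tail[None, :]
        d_th = tail[:, None] - head[None, :]
        d_tt = tail[:, None] - tail[None, :]
        offset = self.pos_emb.num_embeddings // 2             # shift distances to valid indices
        parts = [self.pos_emb(d + offset) for d in (d_hh, d_ht, d_th, d_tt)]
        return F.relu(self.w_r(torch.cat(parts, dim=-1)))     # (seq_len, seq_len, hidden_size)

# Example with the lattice indices of "西安理工大学" plus two matched words.
head = torch.tensor([0, 1, 2, 3, 4, 5, 0, 2])
tail = torch.tensor([0, 1, 2, 3, 4, 5, 1, 5])
R = RelativePositionEncoding(hidden_size=768)(head, tail)     # shape (8, 8, 768)
```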
Step 5: the newly built Transformer layer inherits the 12th-layer parameters of the RoBERTa model and performs task pre-training with a large amount of open-domain question-answer data
To save computational resources and adapt to smaller datasets, a transfer scheme based on pre-trained models is needed to provide sufficient general encyclopedic knowledge. The invention therefore transfers encyclopedic knowledge by having the Transformer layer that incorporates the relative position encoding inherit the 12th-layer RoBERTa parameters trained on encyclopedic knowledge.
Because the model has many parameters, task pre-training is added: the model is pre-trained on a large amount of open-domain question-answer data crawled from the web, which strengthens its question generation ability.
In the embodiments, all methods were tested on the PyTorch platform using a GTX 2080 Ti GPU. During pre-training, the maximum sequence length is set to 512. The Adam optimizer parameters are β1 = 0.9 and β2 = 0.99, the learning rate is set to 1e-4, the dropout ratio to 0.1, the weight decay to 1e-3, and the batch size to 2; each scheme is then trained for 200 epochs with a dynamic learning rate and a dropout ratio of 0.2.
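For illustration, the listed optimizer settings could be configured as follows in PyTorch; the placeholder model and the choice of a linear schedule to stand in for the "dynamic learning rate" are assumptions, since the exact schedule is not specified:

```python
import torch

model = torch.nn.Linear(768, 768)   # placeholder for the lattice / relative-position UniLM model
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,                        # learning rate 1e-4
    betas=(0.9, 0.99),              # β1 = 0.9, β2 = 0.99
    weight_decay=1e-3,              # weight decay 1e-3
)
# "Dynamic learning rate" over the 200 training epochs, assumed here to be a linear decay.
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1, total_iters=200
)
```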
Step 6: training and inference for question generation
The question-answer text of the target domain is used to train the UniLM model that incorporates the relative position encoding and has undergone task pre-training; the cross entropy between the model's decoded predictions and the questions given in the original training data is computed, and the resulting gradients are used to optimize the model with the Adam optimizer. Inference mainly uses beam search.
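A compact sketch of the training objective and beam-search inference in this step; the model interface, the `step_fn` callback, and the token-ID conventions are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, input_ids, labels):
    """One optimization step: cross entropy between decoded predictions and the reference question."""
    logits = model(input_ids)                                   # (batch, seq_len, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)  # ignore non-question positions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                            # Adam update
    return loss.item()

def beam_search(step_fn, sos_id, eos_id, beam_size=3, max_len=32):
    """step_fn(prefix) -> log-probabilities over the vocabulary for the next token."""
    beams = [([sos_id], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:                               # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            log_probs = step_fn(seq)                            # (vocab,)
            top_lp, top_id = torch.topk(log_probs, beam_size)
            for lp, tok in zip(top_lp.tolist(), top_id.tolist()):
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_size]
    return beams[0][0]                                          # best-scoring generated question
```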
The unified pre-training method for Chinese question generation based on word lattice and relative position embedding differs from previous absolute position encoding and feeds more positional information into the model: after this computation the model knows not only the positional relationship between adjacent characters but also the positional relationship between every character and every word. Meanwhile, domain pre-training and task pre-training improve the generation capability of the model in the target domain to a certain extent.

Claims (7)

1. A unified pre-training method for Chinese question generation based on word lattice and relative position embedding, characterized in that domain pre-training and task pre-training are used, and a domain dictionary is generated in a semi-supervised, semi-manual manner; in the task pre-training stage, the head and tail position indices of each domain word found in the input are first recorded with respect to the input sequence, and the indexed words are concatenated after the input sequence; the relative positions between the words and characters are then recorded and fed into the last, newly built Transformer module of the UniLM model; the generated question is finally decoded by the decoder. The method specifically comprises the following steps:
step 1, performing domain pre-training on the RoBERTa parameters;
step 2, generating a target-domain dictionary quickly and accurately in a semi-supervised, semi-manual manner;
step 3, constructing a special mask matrix to improve the generation capability of the model;
step 4, constructing a special relative position embedding matrix and, according to the dictionary of step 2, fusing the relative position information of the input Chinese characters and words into a Transformer layer;
step 5, the newly built Transformer layer inherits the 12th-layer parameters of the RoBERTa model and performs task pre-training with a large amount of open-domain question-answer data;
step 6, training and inference for question generation.
2. The unified pre-training method for Chinese question generation based on word lattice and relative position embedding according to claim 1, wherein step 1 comprises the following specific steps:
the initial parameters of the model's Transformer blocks for domain pre-training are taken from a base RoBERTa trained on a Chinese Wikipedia corpus; model pre-training is then carried out on domain text crawled from the internet; pre-training uses RoBERTa's bidirectional masking pre-training mechanism and the whole-word masking mechanism, where the dictionary used for whole-word masking is a publicly available open-domain dictionary, meeting the needs of pre-training; these two mechanisms improve the model's pre-training.
3. The unified pre-training method for Chinese question generation based on word lattice and relative position embedding according to claim 1, wherein step 2 comprises the following specific steps:
to acquire the target-domain dictionary more quickly, a semi-supervised, semi-manual approach is used to speed up dictionary generation; first, electronic documents of the target domain and a large-scale open-domain dictionary are selected manually, the target-domain documents are fed into a named entity recognition deep learning model, and the entities it recognizes are added to the domain dictionary; then, the words of the large-scale open-domain dictionary are indexed against the target-domain text in a rule-based manner, and the words found in the text are added to the target-domain dictionary; finally, the resulting domain dictionary is reviewed manually to form the final domain vocabulary dictionary.
4. The unified pre-training method for Chinese question generation based on word lattice and relative position embedding according to claim 1, wherein step 3 comprises the following specific steps:
during training, the original text and the target question sentence are concatenated and then fed into the model; tokens in the first half of the input can attend to text in both directions, while tokens in the second half can only attend to the first half and to the tokens on their left.
5. The unified pre-training method for Chinese question generation based on word lattice and relative position embedding according to claim 1, wherein step 4 comprises the following specific steps:
relative position embedding adds the positional relationship between every pair of single characters or words to the attention calculation, strengthening the attention mechanism in the Transformer; relative position encoding is therefore used for every single character and word in the task pre-training stage; at the same time, relative position encoding expresses the positional information between any two words unambiguously.
6. The unified pre-training method for Chinese question generation based on word lattice and relative position embedding according to claim 1, wherein step 5 comprises the following specific steps:
to save computational resources and adapt to smaller manually labeled datasets, a transfer scheme based on pre-trained models is needed to provide sufficient general encyclopedic knowledge and domain information; the Transformer layer that integrates the word lattice and relative position encoding therefore inherits the last-layer RoBERTa parameters obtained from the domain pre-training of step 1, transferring both encyclopedic and domain knowledge;
because the model has many parameters and manually labeled question-answer data are usually scarce, task pre-training is added: the model is pre-trained on a large amount of open-domain question-answer data crawled from the web, which strengthens its question generation ability.
7. The unified pre-training method for Chinese question generation based on word lattice and relative position embedding according to claim 1, wherein step 6 comprises the following specific steps:
the question-answer text of the target domain is used to train the UniLM language model whose last encoder module has been replaced and which has undergone task pre-training; the cross entropy between the model's decoded predictions and the questions given in the original training data is computed, and the model is optimized with the Adam optimizer; inference mainly uses beam search.
CN202110814546.0A 2021-07-19 2021-07-19 Chinese question generation unified pre-training method based on word lattice and relative position embedding Pending CN113743095A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110814546.0A CN113743095A (en) 2021-07-19 2021-07-19 Chinese problem generation unified pre-training method based on word lattice and relative position embedding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110814546.0A CN113743095A (en) 2021-07-19 2021-07-19 Chinese problem generation unified pre-training method based on word lattice and relative position embedding

Publications (1)

Publication Number Publication Date
CN113743095A true CN113743095A (en) 2021-12-03

Family

ID=78728839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110814546.0A Pending CN113743095A (en) 2021-07-19 2021-07-19 Chinese problem generation unified pre-training method based on word lattice and relative position embedding

Country Status (1)

Country Link
CN (1) CN113743095A (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070061022A1 (en) * 1991-12-23 2007-03-15 Hoffberg-Borghesani Linda I Adaptive pattern recognition based controller apparatus and method and human-factored interface therefore
JP2020140710A (en) * 2019-02-26 2020-09-03 株式会社リコー Training method for neural machine translation model, apparatus, and storage medium
JP2020140709A (en) * 2019-02-26 2020-09-03 株式会社リコー Training method for neural machine translation model, apparatus, and storage medium
WO2021012519A1 (en) * 2019-07-19 2021-01-28 平安科技(深圳)有限公司 Artificial intelligence-based question and answer method and apparatus, computer device, and storage medium
CN111046179A (en) * 2019-12-03 2020-04-21 哈尔滨工程大学 Text classification method for open network question in specific field
CN111274764A (en) * 2020-01-23 2020-06-12 北京百度网讯科技有限公司 Language generation method and device, computer equipment and storage medium
CN111639163A (en) * 2020-04-29 2020-09-08 深圳壹账通智能科技有限公司 Problem generation model training method, problem generation method and related equipment
KR102194837B1 (en) * 2020-06-30 2020-12-23 건국대학교 산학협력단 Method and apparatus for answering knowledge-based question
CN112270193A (en) * 2020-11-02 2021-01-26 重庆邮电大学 Chinese named entity identification method based on BERT-FLAT
CN112559702A (en) * 2020-11-10 2021-03-26 西安理工大学 Transformer-based natural language problem generation method in civil construction information field
CN112487139A (en) * 2020-11-27 2021-03-12 平安科技(深圳)有限公司 Text-based automatic question setting method and device and computer equipment
CN113011189A (en) * 2021-03-26 2021-06-22 深圳壹账通智能科技有限公司 Method, device and equipment for extracting open entity relationship and storage medium
CN112989834A (en) * 2021-04-15 2021-06-18 杭州一知智能科技有限公司 Named entity identification method and system based on flat grid enhanced linear converter

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235240A (en) * 2023-11-14 2023-12-15 神州医疗科技股份有限公司 Multi-model result fusion question-answering method and system based on asynchronous consumption queue
CN117235240B (en) * 2023-11-14 2024-02-20 神州医疗科技股份有限公司 Multi-model result fusion question-answering method and system based on asynchronous consumption queue

Similar Documents

Publication Publication Date Title
CN110134771B (en) Implementation method of multi-attention-machine-based fusion network question-answering system
CN111310471B (en) Travel named entity identification method based on BBLC model
CN112559702B (en) Method for generating natural language problem in civil construction information field based on Transformer
Xue et al. A better way to attend: Attention with trees for video question answering
CN109992669B (en) Keyword question-answering method based on language model and reinforcement learning
CN110888966A (en) Natural language question-answer
CN111382574B (en) Semantic parsing system combining syntax under virtual reality and augmented reality scenes
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
Wang et al. Knowledge base question answering with attentive pooling for question representation
CN112364132A (en) Similarity calculation model and system based on dependency syntax and method for building system
CN111428104A (en) Epilepsy auxiliary medical intelligent question-answering method based on viewpoint type reading understanding
CN112101044A (en) Intention identification method and device and electronic equipment
CN112309528A (en) Medical image report generation method based on visual question-answering method
CN114781376A (en) News text abstract generation method based on deep learning
CN114757184A (en) Method and system for realizing knowledge question answering in aviation field
CN113743095A (en) Chinese problem generation unified pre-training method based on word lattice and relative position embedding
CN113392656A (en) Neural machine translation method fusing push-and-knock network and character coding
Mathur et al. A scaled‐down neural conversational model for chatbots
Chowanda et al. Generative Indonesian conversation model using recurrent neural network with attention mechanism
Zhao Research and design of automatic scoring algorithm for english composition based on machine learning
CN114328853B (en) Chinese problem generation method based on Unilm optimized language model
CN115309886A (en) Artificial intelligent text creation method based on multi-mode information input
CN114239575B (en) Statement analysis model construction method, statement analysis method, device, medium and computing equipment
CN115759102A (en) Chinese poetry wine culture named entity recognition method
CN117010387A (en) Roberta-BiLSTM-CRF voice dialogue text naming entity recognition system integrating attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination