CN112163410A - Ancient text pre-training system based on deep learning and training method thereof - Google Patents


Info

Publication number
CN112163410A
Authority
CN
China
Prior art keywords: ancient, training, layer, text, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011094231.5A
Other languages
Chinese (zh)
Inventor
吕建成
田荟双
杨可心
屈茜
彭玺
刘权辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.) 2020-10-14
Filing date 2020-10-14
Publication date 2021-01-01
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202011094231.5A priority Critical patent/CN112163410A/en
Publication of CN112163410A publication Critical patent/CN112163410A/en
Pending legal-status Critical Current

Classifications

    • G06F40/20 Natural language analysis (handling natural language data)
    • G06F40/151 Transformation (use of codes for handling textual entities)
    • G06F40/166 Editing, e.g. inserting or deleting (text processing)
    • G06N3/044 Recurrent networks, e.g. Hopfield networks (neural network architecture)
    • G06N3/08 Learning methods (neural networks)


Abstract

The invention provides an ancient-text pre-training system based on deep learning and a training method thereof, belonging to the technical field of ancient Chinese text processing. The system comprises a preprocessing module and a pre-training module. The preprocessing module is used for acquiring ancient-text data for pre-training and preprocessing the data; the pre-training module is used for obtaining an ancient-text pre-training model by training with a cloze (fill-in-the-blank) task on the preprocessed ancient-text data, starting from the BERT-base model. By constructing an unlabelled ancient Chinese corpus for model pre-training in the ancient-text domain, the invention provides AnchiBERT, a pre-trained model for the ancient-text domain. The model is trained on a large monolingual ancient-text corpus on top of the BERT architecture and can effectively improve tasks in the ancient-text domain, including ancient-text understanding tasks and ancient-text generation tasks. The method solves the problems that supervised tasks in the prior art depend excessively on parallel ancient-modern corpus data and that such parallel data are difficult to obtain.

Description

Ancient text pre-training system based on deep learning and training method thereof
Technical Field
The invention belongs to the technical field of ancient Chinese text processing, and particularly relates to an ancient-text pre-training system and training method based on deep learning.
Background
Ancient Chinese (classical Chinese) is the written language of ancient China, formed mainly on the basis of the spoken language of the pre-Qin period. It covers many genres, such as classical prose, poems, ci, qu, eight-legged essays, parallel prose and antithetical couplets, and because it has been in use for thousands of years, a very large body of ancient texts survives. Previous work in the ancient-text field includes translating ancient Chinese into modern Chinese, composing classical poems, generating couplets, and similar tasks. Most of these tasks feed parallel labelled corpora into various machine learning models, including neural network models, so the effectiveness of model training is greatly limited by the scale of the parallel labelled data.
In recent years, much research has applied pre-trained models to natural language processing to improve performance. A pre-trained model first trains a strong neural network on a relatively large dataset, given sufficient computing resources; the pre-trained model is then adapted to different tasks and fine-tuned on each new task's dataset. Therefore, to improve performance on tasks in the ancient-text domain, this application proposes a pre-trained model for the ancient-text domain, which is pre-trained on a large monolingual ancient-text corpus and then applied to downstream ancient-text tasks to improve their results.
Disclosure of Invention
Aiming at the above deficiencies of the prior art, the ancient-text pre-training system based on deep learning and the training method thereof provided by the invention solve the problems that supervised tasks in the prior art depend excessively on parallel ancient-modern corpus data and that such parallel data are difficult to obtain.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The scheme provides an ancient-text pre-training system based on deep learning, which comprises a preprocessing module and a pre-training module.
The preprocessing module is used for acquiring ancient-text data for pre-training and preprocessing the ancient-text data.
The pre-training module is used for obtaining an ancient-text pre-training model by training with a cloze (fill-in-the-blank) task on the preprocessed ancient-text data, starting from the BERT-base model, thereby completing the training of the deep-learning-based ancient-text pre-training system.
Based on the above system, the invention further discloses a training method for the deep-learning-based ancient-text pre-training system, comprising the following steps:
S1, acquiring ancient-text data for pre-training and preprocessing the ancient-text data;
S2, initializing the parameters of the ancient-text pre-training model with the BERT-base model, and using the Chinese vocabulary of the BERT-base model as the vocabulary of the ancient-text pre-training model;
S3, training with the cloze (fill-in-the-blank) task on the preprocessed ancient-text data and the vocabulary to obtain the ancient-text pre-training model, thereby completing the training of the deep-learning-based ancient-text pre-training system.
Further, step S1 comprises the following steps:
S101, acquiring ancient-text data for pre-training;
S102, deleting special symbols in the ancient-text data and converting traditional Chinese characters into simplified Chinese;
S103, deleting titles in the ancient-text data to complete the preprocessing; an illustrative preprocessing sketch is given below.
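The following minimal Python sketch illustrates one possible implementation of steps S101 to S103; it is not part of the original disclosure. It assumes the OpenCC library for traditional-to-simplified conversion, and the regular expression, file path and title heuristic are placeholders.

import re
from opencc import OpenCC  # assumed: OpenCC performs traditional -> simplified conversion

t2s = OpenCC("t2s")

def preprocess_line(line: str) -> str:
    """Step S102: keep Chinese characters and common punctuation, then convert to simplified Chinese."""
    line = re.sub(r"[^\u4e00-\u9fff，。！？；：]", "", line)
    return t2s.convert(line)

def load_corpus(path: str) -> list[str]:
    """Steps S101 and S103: read the raw corpus and skip title lines (crude, purely illustrative heuristic)."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            raw = raw.strip()
            if not raw or raw.startswith("《"):
                continue
            texts.append(preprocess_line(raw))
    return [t for t in texts if t]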
Still further, the structure of the ancient-text pre-training model in step S2 comprises an input layer, 12 Transformer layers of identical structure and an output layer, connected in sequence. Each Transformer layer comprises a multi-head attention layer, a first residual and regularization layer, a feed-forward network layer and a second residual and regularization layer connected in sequence, wherein
the multi-head attention layer of the first Transformer layer is connected to the input layer, the second residual and regularization layer of the twelfth Transformer layer is connected to the output layer, and two adjacent Transformer layers are connected through the second residual and regularization layer of the former and the multi-head attention layer of the latter.
Furthermore, the number of hidden-layer nodes of the multi-head attention layer, the first residual and regularization layer, the feed-forward network layer and the second residual and regularization layer in each Transformer layer is 768; the number of attention heads in each multi-head attention layer is 12; and the number of nodes in each feed-forward network layer is 3072.
Still further, the input layer comprises a word-embedding vector layer and a position vector layer; the word-embedding vector layer has 21128 nodes and the position vector layer has 512 nodes.
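For illustration only, the configuration described above can be expressed with the Hugging Face transformers library as in the following sketch; the numbers mirror the text (12 layers, 768 hidden nodes, 12 heads, 3072 feed-forward nodes, a 21128-entry vocabulary and 512 positions), which is the size of BERT-base.

from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=21128,             # word-embedding vector layer nodes
    hidden_size=768,              # hidden nodes per Transformer layer
    num_hidden_layers=12,         # 12 structurally identical Transformer layers
    num_attention_heads=12,       # attention heads per multi-head attention layer
    intermediate_size=3072,       # feed-forward network layer nodes
    max_position_embeddings=512,  # position vector layer nodes
)
model = BertForMaskedLM(config)   # randomly initialised here; step S2 instead copies the BERT-base weights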
Still further, step S3 comprises the following steps (the masking rule is illustrated in the sketch after this list):
S301, randomly selecting 15% of the characters in the preprocessed ancient-text data to obtain a first text sequence;
S302, replacing 80% of the characters in the first text sequence with the MASK symbol, replacing 10% with characters drawn from the vocabulary, and leaving 10% unchanged, to obtain a second text sequence;
S303, training the ancient-text pre-training model on all the preprocessed ancient-text data, including the first text sequence and the second text sequence, thereby completing the training of the deep-learning-based ancient-text pre-training system.
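The sketch below spells out the 15% / 80-10-10 masking rule of steps S301 and S302; it is for illustration only. In practice the DataCollatorForLanguageModeling class of the transformers library implements the same rule; handling of special tokens is omitted here for brevity.

import torch

def mask_tokens(input_ids, tokenizer, mlm_prob=0.15):
    """Apply the MLM masking rule to a batch of token ids (illustrative)."""
    labels = input_ids.clone()
    masked = torch.bernoulli(torch.full(input_ids.shape, mlm_prob)).bool()  # select 15% of positions
    labels[~masked] = -100                                                  # only masked positions are predicted

    # 80% of the selected positions are replaced with the [MASK] symbol
    replaced = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & masked
    input_ids[replaced] = tokenizer.mask_token_id

    # 10% of the selected positions are replaced with a random vocabulary character
    randomized = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & masked & ~replaced
    random_ids = torch.randint(len(tokenizer), input_ids.shape, dtype=torch.long)
    input_ids[randomized] = random_ids[randomized]

    # the remaining 10% of the selected positions are left unchanged
    return input_ids, labels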
Still further, the objective function of the ancient-text pre-training model in step S303 is expressed as follows:
L(θ; X) = Σ_{x∈X} log P(x_mask | x_\mask; θ)
X = {x_1, x_2, ..., x_n}
where L(θ; X) denotes the objective function of the ancient-text pre-training model, X denotes all the ancient-text training data, x_n denotes the n-th text sequence, θ denotes the parameters of the ancient-text pre-training model, x_mask denotes the masked 15% of the characters in x (the first text sequence), and x_\mask denotes the remaining 85% of the characters in x, i.e. everything except the first text sequence.
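Continuing the sketches above (the BertForMaskedLM model and the mask_tokens helper), the snippet below shows that maximizing L(θ; X) amounts to minimizing the cross-entropy over the masked positions, which BertForMaskedLM returns when the labels tensor holds -100 at every unmasked position. The example sentence and variable names are illustrative.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")     # 21128-character Chinese vocabulary
batch = tokenizer(["晋太元中武陵人捕鱼为业"], return_tensors="pt")  # one illustrative ancient sentence
input_ids, labels = mask_tokens(batch["input_ids"], tokenizer)     # mask_tokens from the sketch above
loss = model(input_ids=input_ids, labels=labels).loss              # cross-entropy over masked characters only
loss.backward()                                                    # minimizing this loss maximizes L(θ; X)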
The beneficial effects of the invention are as follows:
(1) The method constructs an unlabelled ancient Chinese corpus for pre-training a model in the ancient-text domain. The model is based on the BERT architecture and trained on a large monolingual ancient-text corpus, and can effectively improve tasks in the ancient-text domain, including ancient-text understanding tasks and ancient-text generation tasks.
(2) The method updates the model parameters by training with the MLM task: some characters are simply masked, and the model predicts the masked characters from the unmasked surrounding characters.
(3) The input vectors first pass through a multi-head self-attention module, which lets the positions of the sequence exchange information with one another so that each vector incorporates global sequence information. The two sub-layers are then connected through a residual structure followed by a regularization layer, which mitigates vanishing gradients and the overfitting caused by the large number of network parameters.
(4) The parameters of the ancient-text pre-training model are trained by maximizing the objective function, so that the model learns to predict the masked characters from their context. The model therefore learns text representations in advance, and downstream tasks can use its parameters directly instead of random initialization, which improves the convergence quality and speed of the downstream models.
Drawings
FIG. 1 is a schematic diagram of the system of the present invention.
FIG. 2 is a flow chart of the method of the present invention.
Fig. 3 is a schematic diagram of the downstream task implementation in this embodiment.
Fig. 4 is a schematic diagram of poetry generation and couplet generation in this embodiment.
Detailed Description
The following description of the embodiments of the present invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. Various changes that do not depart from the spirit and scope of the invention as defined by the appended claims will be apparent to those skilled in the art, and everything produced using the inventive concept is protected.
Example 1
The invention provides a pre-trained model for the ancient-text domain, aiming to solve the problems that existing supervised tasks depend excessively on parallel ancient-modern corpus data and that such parallel data are difficult to obtain. The model is based on the BERT architecture, pre-trained on a large ancient-text corpus and then fine-tuned on downstream tasks, which improves all downstream ancient-text tasks, including ancient-text understanding tasks and ancient-text generation tasks. As shown in fig. 1, the invention provides an ancient-text pre-training system based on deep learning, which comprises a preprocessing module and a pre-training module. The preprocessing module is used for acquiring ancient-text data for pre-training and preprocessing the data; the pre-training module is used for obtaining an ancient-text pre-training model by training with the cloze (fill-in-the-blank) task on the preprocessed ancient-text data, starting from the BERT-base model, thereby completing the training of the deep-learning-based ancient-text pre-training system.
In this embodiment, an unlabelled ancient Chinese corpus is constructed for pre-training a model in the ancient-text domain. The model is based on the BERT architecture and trained on a large monolingual ancient-text corpus, and can effectively improve tasks in the ancient-text domain, including understanding and generation tasks. The model parameters are updated by training with the MLM task: some characters are simply masked, and the model predicts the masked characters from the unmasked surrounding characters.
Example 2
As shown in fig. 2, the invention further provides a training method for the deep-learning-based ancient-text pre-training system, which is implemented as follows:
S1, acquiring ancient-text data for pre-training and preprocessing the data, which comprises the following steps:
S101, acquiring ancient-text data for pre-training;
S102, deleting special symbols in the ancient-text data and converting traditional Chinese characters into simplified Chinese;
S103, deleting titles in the ancient-text data to complete the preprocessing;
S2, initializing the parameters of the ancient-text pre-training model with the BERT-base model, and using the Chinese vocabulary of the BERT-base model as the vocabulary of the ancient-text pre-training model;
S3, training with the cloze (fill-in-the-blank) task on the preprocessed ancient-text data and the vocabulary to obtain the ancient-text pre-training model, thereby completing the training of the deep-learning-based ancient-text pre-training system, which is implemented as follows:
S301, randomly selecting 15% of the characters in the preprocessed ancient-text data to obtain a first text sequence;
S302, replacing 80% of the characters in the first text sequence with the MASK symbol, replacing 10% with characters drawn from the vocabulary, and leaving 10% unchanged, to obtain a second text sequence;
S303, training the ancient-text pre-training model on all the preprocessed ancient-text data, including the first text sequence and the second text sequence, thereby completing the training of the deep-learning-based ancient-text pre-training system.
In this embodiment, the objective function of the ancient-text pre-training model is expressed as follows:
L(θ; X) = Σ_{x∈X} log P(x_mask | x_\mask; θ)
X = {x_1, x_2, ..., x_n}
where L(θ; X) denotes the objective function of the ancient-text pre-training model, X denotes all the ancient-text training data, x_n denotes the n-th text sequence, θ denotes the parameters of the ancient-text pre-training model, x_mask denotes the masked 15% of the characters in x (the first text sequence), and x_\mask denotes the remaining 85% of the characters in x, i.e. everything except the first text sequence.
As shown in fig. 1, the structure of the ancient-text pre-training model comprises an input layer, 12 Transformer layers of identical structure and an output layer, connected in sequence. Each Transformer layer comprises a multi-head attention layer, a first residual and regularization layer, a feed-forward network layer and a second residual and regularization layer connected in sequence; the multi-head attention layer of the first Transformer layer is connected to the input layer, the second residual and regularization layer of the twelfth Transformer layer is connected to the output layer, and two adjacent Transformer layers are connected through the second residual and regularization layer of the former and the multi-head attention layer of the latter. The number of hidden-layer nodes of the multi-head attention layer, the first residual and regularization layer, the feed-forward network layer and the second residual and regularization layer in each Transformer layer is 768; the number of attention heads in each multi-head attention layer is 12; the number of nodes in each feed-forward network layer is 3072; and the input layer comprises a word-embedding vector layer with 21128 nodes and a position vector layer with 512 nodes.
In this embodiment, the input layer of the model comprises a word-embedding vector layer and a position vector layer: the word-embedding vector layer encodes each input character to obtain its vector representation (21128 nodes), and the position vector layer models the character-order relationships in the text (512 nodes). These are followed by the 12 Transformer layers of identical structure. In each Transformer layer the hidden layer has 768 nodes. In the first sub-layer, the input vectors pass through a multi-head self-attention module with 12 attention heads; the attention mechanism lets the positions of the sequence exchange information with one another so that each vector incorporates global sequence information, and attending several times in parallel forms multiple subspaces, allowing the model to focus on an important character and to attend to different aspects of the information. A regularization and residual layer follows, which mitigates vanishing gradients and the overfitting caused by the large number of network parameters. The second sub-layer is a simple position-wise fully connected feed-forward network with 3072 nodes, again followed by the regularization and residual connection operations, which finally produce the output. Connecting the input layer and the 12 Transformer layers described above in sequence gives the structure of the ancient-text pre-training model.
In this embodiment, the training method continues pre-training on the basis of BERT-base (Chinese version) with a monolingual ancient-text corpus; it does not train from scratch.
In this embodiment, the parameters of the ancient-text pre-training model are first initialized with BERT-base (Chinese version), and the model parameters are then updated by training with the MLM task. The purpose of the masked language model (MLM) is simply to mask some characters and let the model predict the masked characters from the unmasked surrounding characters. Training the model to make these predictions more accurate, i.e. maximizing the value of the training objective function, yields the trained model parameters; the resulting model is the ancient-text pre-trained model AnchiBERT.
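A minimal sketch of this initialization step follows; the checkpoint name uses the Hugging Face naming convention and is an assumption, since the text only specifies "BERT-base (Chinese version)".

from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # the published Chinese vocabulary (21128 characters)
model = BertForMaskedLM.from_pretrained("bert-base-chinese")    # copies all BERT-base parameters
# continued MLM pre-training on the ancient-text corpus then starts from these weights rather than from scratch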
In this embodiment, the application trains the ancient-text pre-trained model AnchiBERT with the masked language model (MLM) proposed in the BERT paper, also known as the cloze (fill-in-the-blank) task, namely predicting characters that have been masked out. The masking (MASK) procedure is as follows: the application randomly selects 15% of the Chinese characters in each text sequence; 80% of these are replaced with the [MASK] symbol, 10% are replaced with a random Chinese character from the vocabulary, and 10% remain unchanged. The MLM task is to predict the masked characters from their context. For the text data X = {x_1, x_2, ..., x_n}, the training objective function of the model is the log-likelihood
L(θ; X) = Σ_{x∈X} log P(x_mask | x_\mask; θ)
and the training goal of the present application is to maximize this function.
In this embodiment, the training hyper-parameters are set as follows: the training uses a learning rate of 1e-4 with the Adam optimizer, the maximum text sequence length is 512 characters, and the training batch size is 15. The application uses the Chinese vocabulary published by the authors of the BERT paper as the vocabulary of AnchiBERT, and the segmentation method splits the ancient corpus into individual Chinese characters.
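One way to reproduce this pre-training setup with the transformers Trainer is sketched below; model and tokenizer are taken from the sketch above, train_dataset stands for an assumed dataset of preprocessed ancient sentences tokenized to at most 512 characters, the epoch count is illustrative, and the Trainer's default AdamW optimizer stands in for the Adam optimizer mentioned in the text.

from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
args = TrainingArguments(
    output_dir="anchibert-pretrain",
    learning_rate=1e-4,                 # learning rate stated in the text
    per_device_train_batch_size=15,     # training batch size 15
    num_train_epochs=3,                 # illustrative; the text does not fix an epoch count
)
trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=train_dataset)
trainer.train()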
In this embodiment, as shown in fig. 4, pre-training serves fine-tuning: the pre-trained model is used to improve the results of other downstream tasks. The improvement comes from the fact that the pre-trained parameters are well learned and already contain previously learned text information, so they do not need to be learned from scratch. The downstream tasks comprise poem classification, ancient-sentence translation, ancient-poem generation and couplet generation; see part B of the experiments.
In this embodiment, for poem classification, the whole poem is fed into AnchiBERT, and the final-layer vector H_S corresponding to the start character '[S]' is fed into a softmax classification layer to obtain the poem's topic category. The text generation tasks (ancient-text translation, ancient-poem generation, couplet generation) are based on an encoder-decoder framework, like the encoder-decoder structure of the Transformer: the application uses AnchiBERT as the encoder to encode the input and a randomly initialized Transformer as the decoder. During training, the ancient-text translation task feeds the ancient sentence into the encoder, the decoder produces the model output, and the corresponding modern sentence serves as the reference answer; ancient-poem generation takes the first sentences of a poem as input and the later sentences as the reference; couplet generation takes the first (upper) line as input and the second (lower) line as the reference.
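The two fine-tuning patterns can be sketched with the transformers library as follows; this is an illustration rather than the exact implementation: "anchibert" is a placeholder for the directory holding the pre-trained weights, a BERT-style decoder stands in for the randomly initialized Transformer decoder described above, and the decoder depth shown is the 4-layer setting used for translation and couplets.

from transformers import (BertConfig, BertForSequenceClassification, BertLMHeadModel,
                          BertModel, EncoderDecoderModel)

# (a) Poem classification: the final-layer vector of the start token feeds a softmax layer (9 topic categories).
classifier = BertForSequenceClassification.from_pretrained("anchibert", num_labels=9)

# (b) Generation tasks: AnchiBERT encoder plus a randomly initialized decoder with cross-attention.
encoder = BertModel.from_pretrained("anchibert")
decoder_config = BertConfig(vocab_size=21128, num_hidden_layers=4,
                            is_decoder=True, add_cross_attention=True)
decoder = BertLMHeadModel(decoder_config)                  # randomly initialized
seq2seq = EncoderDecoderModel(encoder=encoder, decoder=decoder)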
In this embodiment, in order to verify the improvement brought by AnchiBERT on tasks in the ancient-text domain, experiments were carried out on the following tasks:
1) Poem classification. Poems are classified into 9 categories according to their content, such as farewell poems, war poems and so on.
2) Ancient-text translation. Because ancient texts are difficult for modern readers to understand, this task translates ancient Chinese into modern Chinese.
3) Ancient-poem generation. Two experimental settings are used: generating the last three sentences from the first sentence of a poem (1-3), and generating the last two sentences from the first two sentences (2-2).
4) Couplet generation. Given the first (upper) line of a couplet, generate the second (lower) line.
In this embodiment, the poem-classification dataset is a publicly available dataset of 2.8K poems; the four lines of each poem together with some keywords are the input, and the corresponding category is the output. The ancient-text translation dataset consists of 1M ancient-modern sentence pairs, with the ancient sentence as input and the modern sentence as output. The ancient-poem generation dataset is a publicly available collection of 0.23M four-line poems; in the (1-3) task the first line is the input and the last three lines are the reference output, and in the (2-2) task the first two lines are the input and the last two lines are the reference output. The couplet dataset is a publicly available collection of 0.77M upper/lower line pairs, with the upper line as input and the lower line as output.
In this example, the application compares AnchiBERT with several baseline models, as follows. Std-Transformer: in the classification task, Std-Transformer has the same configuration as AnchiBERT, including the model structure, number of layers and vocabulary, but its parameter weights are randomly initialized; in the generation tasks, Std-Transformer is an encoder-decoder framework whose encoder is the Std-Transformer of the classification task and whose decoder is a randomly initialized Transformer decoder. BERT-Base: in the classification task, BERT-Base is the official Chinese BERT-Base release of the BERT paper; in the generation tasks, BERT-Base is an encoder-decoder framework whose encoder is the BERT-Base of the classification task and whose decoder is a randomly initialized Transformer decoder. AnchiBERT: the model of the present application, pre-trained on the ancient corpus and then fine-tuned in the same way. The application uses 3 RTX 2080 Ti GPUs for training, the training time is 3 days, and the code is implemented on the publicly available pytorch-transformers library.
In this embodiment, as shown in fig. 3, AnchiBERT is fine-tuned on the downstream tasks as follows. For poem classification, the final-hidden-layer vector corresponding to the '[S]' character of the poem is fed into the classification layer to obtain the topic label. The text generation tasks (ancient-text translation, ancient-poem generation, couplet generation) are based on the encoder-decoder framework: the application initializes the encoder with AnchiBERT and randomly initializes the Transformer-based decoder. As in most sequence-to-sequence tasks, the training objective is to minimize the negative log-likelihood. As for the training setup, the poem-classification task uses a batch size of 24 and the Adam optimizer. The text generation tasks use the same optimizer as the Transformer together with a warm-up schedule, in which the learning rate increases linearly for a certain number of steps and then decreases linearly as the step count grows past that threshold. Ancient-text translation uses a batch size of 30 and a 4-layer decoder; poem generation uses a batch size of 80 and a 2-layer decoder; couplet generation uses a batch size of 80 and a 4-layer decoder. The dropout rate is always 0.1. The application selects the optimal number of training epochs and the learning rate on the development set. In fig. 3, the start character [S] added before each text sequence indicates the beginning of a sentence, and the end character [E] added after each text sequence indicates the end of a sentence.
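The warm-up schedule described above (linear increase of the learning rate for a fixed number of steps followed by linear decay) can be sketched as follows; seq2seq refers to the encoder-decoder model from the earlier sketch, and the learning rate and step counts are illustrative, since the text states that the actual values are chosen on the development set.

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

optimizer = AdamW(seq2seq.parameters(), lr=1e-4)  # illustrative learning rate
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=4000,        # linear increase until this step (illustrative)
    num_training_steps=100_000,   # linear decay afterwards (illustrative)
)
# inside the training loop, call scheduler.step() after each optimizer.step()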
In the present embodiment, fig. 4 shows examples of ancient-text translation, poem generation and couplet generation. In the generation tasks, the application observes that Std-Transformer has a weak ability to learn language representations, so the sentences it generates lack coherence with the preceding ancient text. BERT-Base learns its language representations from modern Chinese corpora, so it is somewhat weaker at generating ancient Chinese, whereas AnchiBERT is able to generate coherent and meaningful ancient sentences. For example, in ancient-to-modern translation, the ancient expression rendered as 'listen already' should be translated as 'after listening'; Std-Transformer and BERT-Base simply ignore this expression and do not translate it, whereas the AnchiBERT model proposed in this application translates it. In ancient-poem generation (2-2), the original poem describes the pragmatic ideal of the author, but the sentences generated by Std-Transformer do not convey this meaning; the sentences generated by BERT-Base describe the lives of common people, which deviates from the semantics of the original poem and connects poorly with the preceding lines; and the sentences generated by AnchiBERT express a solemn atmosphere and a longing for a prosperous dynasty, which closely matches the theme of the poem. In fig. 4, 'Std-Trans' is an abbreviation of 'Std-Transformer'; Chinese Poem Generation (2-2) means generating the last two sentences from the first two sentences of a poem, and (1-3) means generating the last three sentences from the first sentence.
The invention is the first to provide a pre-trained model for the ancient-text domain; the model is trained on the ancient-text corpus constructed by the applicant. The method constructs an unlabelled ancient Chinese corpus for pre-training a model in the ancient-text domain. The model is based on the BERT architecture and trained on a large monolingual ancient-text corpus, and can effectively improve tasks in the ancient-text domain, including understanding and generation tasks. The invention verifies the effectiveness of AnchiBERT on four downstream ancient-text tasks, covering both ancient-language understanding and language generation. AnchiBERT achieves the best results on all tasks, which also verifies that the pre-trained model can effectively improve task performance in the ancient-text domain. The invention therefore provides a complete approach for integrating pre-trained models into the whole ancient-text field.

Claims (8)

1. An ancient text pre-training system based on deep learning is characterized by comprising a preprocessing module and a pre-training module;
the preprocessing module is used for acquiring pre-trained ancient Chinese data and preprocessing the ancient Chinese data;
and the pre-training module is used for obtaining an ancient text pre-training model by training with a cloze (fill-in-the-blank) task according to the preprocessed ancient text data and the BERT-base model, and finishing the training of the ancient text pre-training system based on deep learning.
2. A training method of the ancient text pre-training system based on deep learning according to claim 1, which comprises the following steps:
s1, obtaining pre-trained ancient Chinese data, and preprocessing the ancient Chinese data;
s2, initializing the parameters of the ancient text pre-training model by using a BERT-base model, and taking a Chinese vocabulary in the BERT-base model as a vocabulary of the ancient text pre-training model;
and S3, according to the preprocessed ancient text data and the vocabulary, training with a cloze (fill-in-the-blank) task to obtain an ancient text pre-training model, and completing training of the ancient text pre-training system based on deep learning.
3. The training method of the ancient text pre-training system based on deep learning of claim 1, wherein the step S1 comprises the steps of:
s101, obtaining pre-trained ancient text data;
s102, deleting special symbols in the ancient text data, and converting traditional Chinese into simplified Chinese;
s103, deleting the title in the ancient text data, and finishing the pretreatment of the ancient text data.
4. The training method of the ancient text pre-training system based on deep learning of claim 1, wherein the ancient text pre-training model in the step S2 has a structure comprising an input layer, 12 Transformer layers with the same structure and an output layer which are connected in sequence; each Transformer layer comprises a multi-head attention layer, a first residual and regularization layer, a feed-forward network layer and a second residual and regularization layer which are connected in sequence, wherein
the multi-head attention layer of the first Transformer layer is connected with the input layer, the second residual and regularization layer of the twelfth Transformer layer is connected with the output layer, and two adjacent Transformer layers are connected through the second residual and regularization layer and the multi-head attention layer.
5. The training method of the ancient text pre-training system based on deep learning of claim 4, wherein the number of hidden-layer nodes of the multi-head attention layer, the first residual and regularization layer, the feed-forward network layer and the second residual and regularization layer in each Transformer layer is 768; the number of attention heads of each multi-head attention layer is 12; and the number of nodes of each feed-forward network layer is 3072.
6. The training method of the ancient text pre-training system based on deep learning of claim 4, wherein the input layer comprises a word embedding vector layer and a position vector layer, the number of nodes of the word embedding vector layer is 21128, and the number of nodes of the position vector layer is 512.
7. The training method of the ancient text pre-training system based on deep learning according to claim 4, wherein the step S3 comprises the following steps:
s301, randomly selecting 15% of ancient characters according to the preprocessed ancient text data to obtain a first text sequence;
s302, respectively covering 80% of ancient characters in the first text sequence by using MASK symbols, replacing 10% of ancient characters by using characters in a vocabulary table, and keeping 10% of ancient characters unchanged to obtain a second text sequence;
s303, training an ancient text pre-training model by using all the preprocessed ancient text data comprising the first text sequence and the second text sequence, and finishing training the ancient text pre-training system based on deep learning.
8. The training method of the ancient text pre-training system based on deep learning of claim 7, wherein the objective function of the ancient text pre-training model in the step S303 is expressed as follows:
L(θ; X) = Σ_{x∈X} log P(x_mask | x_\mask; θ)
X = {x_1, x_2, ..., x_n}
wherein L(θ; X) represents the objective function of the ancient text pre-training model, X represents all the ancient text training data, x_n represents the n-th text sequence, θ represents the parameters of the ancient text pre-training model, x_mask represents the masked 15% of the characters in x (the first text sequence), and x_\mask represents the remaining 85% of the characters in x, i.e. everything except the first text sequence.
CN202011094231.5A 2020-10-14 2020-10-14 Ancient text pre-training system based on deep learning and training method thereof Pending CN112163410A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011094231.5A CN112163410A (en) 2020-10-14 2020-10-14 Ancient text pre-training system based on deep learning and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011094231.5A CN112163410A (en) 2020-10-14 2020-10-14 Ancient text pre-training system based on deep learning and training method thereof

Publications (1)

Publication Number Publication Date
CN112163410A true CN112163410A (en) 2021-01-01

Family

ID=73866797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011094231.5A Pending CN112163410A (en) 2020-10-14 2020-10-14 Ancient text pre-training system based on deep learning and training method thereof

Country Status (1)

Country Link
CN (1) CN112163410A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110032644A (en) * 2019-04-03 2019-07-19 人立方智能科技有限公司 Language model pre-training method
CN110134953A (en) * 2019-05-05 2019-08-16 北京科技大学 Chinese medicine name entity recognition method and identifying system based on Chinese medical book document

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HUISHUANG TIAN et al.: "AnchiBERT: A Pre-Trained Model for Ancient Chinese Language Understanding and Generation", arXiv:2009.11473v1 *
王楠禔: "Research on a text representation model based on improved BERT", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906366A (en) * 2021-01-29 2021-06-04 深圳力维智联技术有限公司 ALBERT-based model construction method, device, system and medium
CN112906366B (en) * 2021-01-29 2023-07-07 深圳力维智联技术有限公司 ALBERT-based model construction method, device, system and medium
CN116796742A (en) * 2023-03-27 2023-09-22 上海交通大学医学院 Method, device, equipment and storage medium for identifying ancient books named entity of traditional Chinese medicine
CN116070643A (en) * 2023-04-03 2023-05-05 武昌理工学院 Fixed style translation method and system from ancient text to English
CN116070643B (en) * 2023-04-03 2023-08-15 武昌理工学院 Fixed style translation method and system from ancient text to English


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210101)