CN112395841B - BERT-based method for automatically filling blank text - Google Patents

BERT-based method for automatically filling blank text

Info

Publication number
CN112395841B
Authority
CN
China
Prior art keywords
word
matrix
words
layer
encoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011291822.1A
Other languages
Chinese (zh)
Other versions
CN112395841A (en)
Inventor
柯逍
卢恺翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University
Priority to CN202011291822.1A
Publication of CN112395841A
Application granted
Publication of CN112395841B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/166 - Editing, e.g. inserting or deleting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/285 - Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Abstract

The invention provides a BERT-based method for automatically filling in blank text, which comprises the following steps. Step S1: take the public CLOTH cloze dataset as the training data base, preprocess the dataset with a tokenizer, and extract the article content and the cloze options. Step S2: pre-train a deep bidirectional representation model by jointly conditioning on the context in all layers of the processed dataset; use the pre-trained model to provide a language model, fine-tune it with an extra output layer, and finally form an encoder by combining the position information of the questions with the language model. Step S3: stack a fully connected layer, a GELU activation layer, a normalization layer and another fully connected layer in sequence to form a decoder, and feed the encoder output into the decoder for decoding. Step S4: predict the word that should appear at each blank from the decoder output. The invention uses artificial intelligence to predict and proofread text containing blanks and assists proofreaders in checking and publishing books.

Description

BERT-based method for automatically filling blank text
Technical Field
The invention relates to the technical field of pattern recognition and natural language processing, and in particular to a BERT-based method for automatically filling in blank text.
Background
In recent years, artificial intelligence has developed rapidly, and using deep learning to handle conversational understanding in everyday life, that is, natural language processing, has become a popular technology. Natural language processing is an important research field within computer science and artificial intelligence; it mainly studies whether machines can correctly understand human language in order to perform functions such as translation and question answering.
The aim of automatically filling in blank text is to use deep learning to automatically complete, or automatically check, the missing or erroneous content of unpublished books containing a large amount of text. By exploiting the BERT model's ability to capture contextual semantics and long-range semantic information, the context of an article can be understood, enabling both automatic completion of blank passages and automatic checking of erroneous content.
Disclosure of Invention
The invention provides a BERT-based method for automatically filling in blank text, which uses artificial intelligence to predict and proofread text containing blanks and assists proofreaders in checking and publishing books.
The invention adopts the following technical scheme.
A BERT-based method for automatically filling in blank text, the method comprising the following steps;
step S1: take the articles in the public CLOTH cloze dataset as the training data base, preprocess the CLOTH dataset with a tokenizer, and extract the article content and the cloze options;
step S2: pre-train a deep bidirectional representation model by jointly conditioning on the context in all layers of the processed dataset; use the pre-trained model to provide a language model, fine-tune it with an extra output layer, and finally form an encoder by combining the position information of the questions with the language model;
step S3: stack a fully connected layer, a GELU activation layer, a normalization layer and another fully connected layer in sequence to form a decoder, and feed the encoder output into the decoder for decoding;
step S4: predict the word that should appear at the blank from the decoder output, i.e. the resulting word probability vector.
Step S1 specifically includes the following steps;
step S11: acquire the public CLOTH cloze dataset;
step S12: use the tokenizer corresponding to the chosen pre-training model to tokenize the articles and candidate options in the CLOTH dataset and convert them into indices in the corresponding dictionary;
step S13: record the position of each blank in the corresponding token sequence, and convert the standard answers from letters into numbers in order;
step S14: after preprocessing, each article in the CLOTH dataset is organized into five fields: sample name, article token IDs, option token IDs, query (blank) locations and answers.
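By way of illustration only, the following Python sketch shows one possible form of the preprocessing in steps S11-S14. It assumes the HuggingFace transformers library and a CLOTH-style sample in which blanks are marked with an underscore; the field names of the resulting dictionary are illustrative assumptions, not prescribed by the method.

```python
# Illustrative preprocessing sketch for steps S11-S14 (field names are assumptions).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A CLOTH-style sample: the article marks each blank with "_".
sample = {
    "article": "Tom was very _ because he passed the exam .",
    "options": [["sad", "happy", "angry", "tired"]],
    "answers": ["B"],
}

# Step S12: tokenize the article and the options and map tokens to dictionary indices.
tokens = tokenizer.tokenize(sample["article"].replace("_", "[MASK]"))
article_ids = tokenizer.convert_tokens_to_ids(tokens)
option_ids = [[tokenizer.convert_tokens_to_ids(tokenizer.tokenize(o)) for o in group]
              for group in sample["options"]]

# Step S13: record the position of each blank and convert letter answers to numbers.
query_locations = [i for i, tid in enumerate(article_ids) if tid == tokenizer.mask_token_id]
answer_ids = [ord(a) - ord("A") for a in sample["answers"]]

# Step S14: the five fields kept for each article after preprocessing (names illustrative).
processed = {
    "sample_name": "sample_0001",
    "article_ids": article_ids,
    "option_ids": option_ids,
    "query_locations": query_locations,
    "answers": answer_ids,
}
print(processed["query_locations"], processed["answers"])  # e.g. [3] [1]
```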
Step S2 specifically includes the following steps;
step S21: obtain the representation vector X of each word of the input sentence, where X is the sum of the word embedding of the word and the embedding of the word's position;
step S22: the self-attention mechanism in the Transformer encoder is implemented with three matrices: a query matrix Q, a key matrix K and a value matrix V; first, the words of the input sentence are embedded into a matrix X in which each row represents one word of the sentence; X is multiplied by the weight matrices W_Q, W_K and W_V used by the pre-training model to obtain the matrices Q, K and V respectively;
step S23: multiply the query matrix Q by the key matrix K to score each word of the sentence against every other word, where a higher score means two words are more closely associated; the scores are then divided by the square root of the key-vector dimension d_k to stabilize the gradients; a softmax function then makes all scores positive and makes them sum to 1; finally, the softmax scores are multiplied by the value matrix V to obtain the output of the self-attention layer at this position, denoted by the matrix Z; this is expressed as:
Z = softmax(Q·K^T / √d_k) · V    (formula one)
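By way of illustration only, the following Python (NumPy) sketch computes formula one for steps S22-S23; the matrix sizes and random weights are illustrative stand-ins for the weights of a pre-trained model.

```python
# Scaled dot-product self-attention sketch for steps S22-S23 (illustrative sizes).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

n, d_model, d_k = 4, 8, 8                      # n words, representation size, key dimension
rng = np.random.default_rng(0)

X = rng.normal(size=(n, d_model))              # word representation matrix (step S21)
W_Q = rng.normal(size=(d_model, d_k))          # stand-ins for the pre-trained weight matrices
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # step S22
scores = Q @ K.T / np.sqrt(d_k)                # word-to-word scores, scaled by sqrt(d_k)
Z = softmax(scores) @ V                        # formula one: Z = softmax(QK^T / sqrt(d_k)) V
print(Z.shape)                                 # (4, 8)
```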
step S24: after Z is obtained, it is passed to the next module of the encoder, a feed-forward neural network; this module has two fully connected layers, the activation function of the first layer is ReLU and the second layer is linear, and it can be expressed as:
FFN(Z) = max(0, Z·W1 + b1)·W2 + b2    (formula two);
where W1 and W2 are weight matrices and b1 and b2 are bias vectors; FFN(Z) is taken as the output of the Transformer encoder;
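By way of illustration only, a short NumPy sketch of formula two for step S24, continuing the attention example above with illustrative dimensions.

```python
# Feed-forward module of step S24 (formula two), applied to the attention output Z.
import numpy as np

def ffn(Z, W1, b1, W2, b2):
    # FFN(Z) = max(0, Z W1 + b1) W2 + b2: a ReLU layer followed by a linear layer.
    return np.maximum(0, Z @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(1)
n, d_k, d_ff = 4, 8, 32
Z = rng.normal(size=(n, d_k))
W1, b1 = rng.normal(size=(d_k, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_k)), np.zeros(d_k)
print(ffn(Z, W1, b1, W2, b2).shape)            # (4, 8): same shape as the input
```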
step S25: use the language model provided by the pre-training model, and then fine-tune this existing language model through an additional output layer so that it better suits the downstream task of filling in blank text;
step S26: the added position information of the questions and the Transformer-based language model together form the encoder of the method for automatically completing blank text;
step S27: the obtained word representation matrix is fed into the encoder of the method, and the encoding information matrix C of all words of the sentence is obtained after 6 encoder modules; the word vector matrix is denoted X with dimensions n × d, where n is the number of words in the sentence and d is the dimension of the representation vector; the matrix dimensions of the output of each encoder module are identical to those of its input.
Step S3 specifically includes the following steps;
step S31: the decoder is formed by stacking, in sequence, a fully connected layer, a GELU activation layer, a normalization layer and another fully connected layer;
step S32: the encoding information matrix C output by the encoder is passed to the decoder, which predicts the (n+1)-th word from the n words analysed so far, in order;
step S33: when predicting the (n+1)-th word, the words after it must be hidden by a mask operation; during training, 15% of the words in each input sequence are masked at random and the model is then asked to predict these masked words;
step S34: to prevent selected words from being masked every time, which would keep the model from ever seeing them during subsequent fine-tuning, a further measure is taken in the masking operation: with 80% probability the selected word is replaced with [MASK], with 10% probability it is replaced with a random word, and with 10% probability it is left unchanged.
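By way of illustration only, the following Python sketch applies the 15% masking rate and the 80/10/10 replacement rule of steps S33-S34; it assumes the HuggingFace transformers tokenizer and, for brevity, does not exclude special tokens from masking.

```python
# Masking sketch for steps S33-S34: mask 15% of tokens with the 80/10/10 replacement rule.
import random
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def mask_tokens(token_ids, mask_prob=0.15):
    labels = [-100] * len(token_ids)           # -100 marks positions the loss ignores
    masked = list(token_ids)
    for i, tid in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = tid                    # the model must recover the original word here
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                masked[i] = tokenizer.mask_token_id
            elif r < 0.9:                      # 10%: replace with a random dictionary word
                masked[i] = random.randrange(tokenizer.vocab_size)
            # remaining 10%: keep the original word unchanged
    return masked, labels

ids = tokenizer.encode("the cat sat on the mat", add_special_tokens=True)
print(mask_tokens(ids))
```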
Step S4 specifically includes the following steps;
step S41: use the word probability vector output by the decoder to predict, against a preset dictionary, the word that should appear at the blank;
step S42: write the predicted word into the blank of the specified text or the original text.
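By way of illustration only, a minimal sketch of steps S41-S42 with an illustrative dictionary and probability vector: the word with the highest probability is selected and written back into the blank.

```python
# Sketch for steps S41-S42: choose the word at a blank from the word probability vector.
import numpy as np

vocab = ["sad", "happy", "angry", "tired", "the", "cat"]   # illustrative preset dictionary
probs = np.array([0.05, 0.70, 0.10, 0.05, 0.05, 0.05])     # decoder output for one blank

predicted_word = vocab[int(np.argmax(probs))]               # step S41: most probable word
text = "Tom was very _ because he passed the exam ."
print(text.replace("_", predicted_word, 1))                 # step S42: write it into the blank
```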
The CLOTH cloze dataset is released by Carnegie Mellon University; its full name is Large-scale Cloze Test Dataset Created by Teachers.
Compared with the prior art, the invention has the following beneficial effects:
1. Compared with other existing methods, the BERT-based method for automatically filling in blank text constructed by the invention benefits from the bidirectional Transformer module, which can effectively understand context semantics.
2. The dataset does not require a large amount of labelled text; a well-performing language model can be trained using the pre-trained model provided by Google together with the CLOTH dataset.
3. The self-attention mechanism in the Transformer module can imitate the attention-focusing behaviour of human perception, so the hidden contextual relations in a sentence can be understood by relating local features or ignoring useless ones.
4. The performance of the language model can be further optimized, and the accuracy further improved, with methods such as data expansion, data augmentation and data ensembling.
The invention provides a BERT-based method to address the problems of large amounts of unlabelled corpora, language models with too many parameters, and the inability to understand context effectively.
The invention uses the self-attention mechanism proposed in BERT, which can imitate the attention-focusing behaviour of human perception, effectively extract hidden connections in the text, reduce the model parameters, and understand the hidden contextual relations in a sentence by relating local features or ignoring useless ones.
Drawings
The invention is described in further detail below with reference to the following figures and detailed description:
fig. 1 is a schematic diagram of the principle of the present invention.
Detailed Description
As shown in fig. 1, a BERT-based method for automatically filling in blank text comprises the following steps;
step S1: take the articles in the public CLOTH cloze dataset as the training data base, preprocess the CLOTH dataset with a tokenizer, and extract the article content and the cloze options;
step S2: pre-train a deep bidirectional representation model by jointly conditioning on the context in all layers of the processed dataset; use the pre-trained model to provide a language model, fine-tune it with an extra output layer, and finally form an encoder by combining the position information of the questions with the language model;
step S3: stack a fully connected layer, a GELU activation layer, a normalization layer and another fully connected layer in sequence to form a decoder, and feed the encoder output into the decoder for decoding;
step S4: predict the word that should appear at the blank from the decoder output, i.e. the resulting word probability vector.
Step S1 specifically includes the following steps;
step S11: acquire the public CLOTH cloze dataset;
step S12: use the tokenizer corresponding to the chosen pre-training model to tokenize the articles and candidate options in the CLOTH dataset and convert them into indices in the corresponding dictionary;
step S13: record the position of each blank in the corresponding token sequence, and convert the standard answers from letters into numbers in order;
step S14: after preprocessing, each article in the CLOTH dataset is organized into five fields: sample name, article token IDs, option token IDs, query (blank) locations and answers.
Step S2 specifically includes the following steps;
step S21: obtain the representation vector X of each word of the input sentence, where X is the sum of the word embedding of the word and the embedding of the word's position;
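By way of illustration only, a NumPy sketch of step S21 with illustrative embedding tables: the representation of each word is the element-wise sum of its word embedding and the embedding of its position.

```python
# Step S21 sketch: X is the word embedding plus the position embedding (illustrative sizes).
import numpy as np

vocab_size, max_len, d_model = 100, 16, 8
rng = np.random.default_rng(2)
word_emb = rng.normal(size=(vocab_size, d_model))    # stand-in word-embedding table
pos_emb = rng.normal(size=(max_len, d_model))        # stand-in position-embedding table

token_ids = [5, 17, 42, 7]                           # an illustrative 4-word sentence
X = word_emb[token_ids] + pos_emb[: len(token_ids)]  # one row per word of the sentence
print(X.shape)                                       # (4, 8)
```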
step S22: the self-attention mechanism in the Transformer encoder is implemented with three matrices: a query matrix Q, a key matrix K and a value matrix V; first, the words of the input sentence are embedded into a matrix X in which each row represents one word of the sentence; X is multiplied by the weight matrices W_Q, W_K and W_V used by the pre-training model to obtain the matrices Q, K and V respectively;
step S23: multiply the query matrix Q by the key matrix K to score each word of the sentence against every other word, where a higher score means two words are more closely associated; the scores are then divided by the square root of the key-vector dimension d_k to stabilize the gradients; a softmax function then makes all scores positive and makes them sum to 1; finally, the softmax scores are multiplied by the value matrix V to obtain the output of the self-attention layer at this position, denoted by the matrix Z; this is expressed as:
Z = softmax(Q·K^T / √d_k) · V    (formula one)
step S24: after Z is obtained, it is passed to the next module of the encoder, a feed-forward neural network; this module has two fully connected layers, the activation function of the first layer is ReLU and the second layer is linear, and it can be expressed as:
FFN(Z) = max(0, Z·W1 + b1)·W2 + b2    (formula two);
where W1 and W2 are weight matrices and b1 and b2 are bias vectors; FFN(Z) is taken as the output of the Transformer encoder;
step S25: use the language model provided by the pre-training model, and then fine-tune this existing language model through an additional output layer so that it better suits the downstream task of filling in blank text;
step S26: the added position information of the questions and the Transformer-based language model together form the encoder of the method for automatically completing blank text;
step S27: the obtained word representation matrix is fed into the encoder of the method, and the encoding information matrix C of all words of the sentence is obtained after 6 encoder modules; the word vector matrix is denoted X with dimensions n × d, where n is the number of words in the sentence and d is the dimension of the representation vector; the matrix dimensions of the output of each encoder module are identical to those of its input.
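By way of illustration only, the following PyTorch sketch stacks six encoder modules as in step S27 and shows that the output keeps the n × d shape of the input; the built-in Transformer encoder is used here as an assumed stand-in for the pre-trained BERT layers.

```python
# Step S27 sketch: six stacked encoder modules, each preserving the (n x d) shape.
import torch
import torch.nn as nn

n, d = 10, 768                                        # n words, representation dimension d
layer = nn.TransformerEncoderLayer(d_model=d, nhead=12, dim_feedforward=3072, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # 6 encoder modules

X = torch.randn(1, n, d)                              # word representation matrix X (n x d)
C = encoder(X)                                        # encoding information matrix C
print(C.shape)                                        # torch.Size([1, 10, 768]), same as the input
```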
Step S3 specifically includes the following steps;
step S31: the decoder is formed by stacking, in sequence, a fully connected layer, a GELU activation layer, a normalization layer and another fully connected layer;
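By way of illustration only, a PyTorch sketch of the decoder of step S31; it assumes the normalization layer is a LayerNorm, as in BERT's masked language modelling head, and the dimensions are illustrative.

```python
# Decoder sketch for step S31: fully connected -> GELU -> normalization -> fully connected.
import torch
import torch.nn as nn

d_model, vocab_size = 768, 30522
decoder = nn.Sequential(
    nn.Linear(d_model, d_model),    # first fully connected layer
    nn.GELU(),                      # GELU activation layer
    nn.LayerNorm(d_model),          # normalization layer (assumed to be LayerNorm)
    nn.Linear(d_model, vocab_size), # second fully connected layer -> word probability logits
)

C = torch.randn(1, 10, d_model)     # encoding information matrix from the encoder
word_logits = decoder(C)            # one score per dictionary word at each position
print(word_logits.shape)            # torch.Size([1, 10, 30522])
```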
step S32: the encoding information matrix C output by the encoder is passed to the decoder, which predicts the (n+1)-th word from the n words analysed so far, in order;
step S33: when predicting the (n+1)-th word, the words after it must be hidden by a mask operation; during training, 15% of the words in each input sequence are masked at random and the model is then asked to predict these masked words;
step S34: to prevent selected words from being masked every time, which would keep the model from ever seeing them during subsequent fine-tuning, a further measure is taken in the masking operation: with 80% probability the selected word is replaced with [MASK], with 10% probability it is replaced with a random word, and with 10% probability it is left unchanged.
Step S4 specifically includes the following steps;
step S41: use the word probability vector output by the decoder to predict, against a preset dictionary, the word that should appear at the blank;
step S42: write the predicted word into the blank of the specified text or the original text.
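By way of illustration only, an end-to-end example of filling a blank with an off-the-shelf pre-trained BERT through the HuggingFace fill-mask pipeline; this is a stand-in for the fine-tuned encoder and decoder described above, not the claimed model itself.

```python
# End-to-end illustration: fill a blank with a generic pre-trained BERT (stand-in model).
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
text = "Tom was very [MASK] because he passed the exam."
for candidate in fill(text, top_k=3):                 # the three most probable dictionary words
    print(candidate["token_str"], round(candidate["score"], 3))
```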
The public CLOTH cloze dataset is released by Carnegie Mellon University; its full name is Large-scale Cloze Test Dataset Created by Teachers.
In particular, this embodiment provides a BERT-based method to address the problems of large amounts of unlabelled corpora, language models with too many parameters, and the inability to understand context effectively; the problem of large amounts of unlabelled corpora is effectively solved by using the pre-training idea proposed by BERT. The invention uses the self-attention mechanism proposed in BERT, which can imitate the attention-focusing behaviour of human perception, effectively extract hidden connections in the text, reduce the model parameters, and understand the hidden contextual relations in a sentence by relating local features or ignoring useless ones.
The above description is only a preferred embodiment of the present invention; all equivalent changes and modifications made in accordance with the claims of the present invention shall fall within the scope of the present invention.

Claims (4)

1. A BERT-based method for automatically filling in blank text, characterized in that the method comprises the following steps;
step S1: take the articles in the public CLOTH cloze dataset as the training data base, preprocess the CLOTH dataset with a tokenizer, and extract the article content and the cloze options;
step S2: pre-train a deep bidirectional representation model by jointly conditioning on the context in all layers of the processed dataset; use the pre-trained model to provide a language model, fine-tune it with an extra output layer, and finally form an encoder by combining the position information of the questions with the language model;
step S3: stack a fully connected layer, a GELU activation layer, a normalization layer and another fully connected layer in sequence to form a decoder, and feed the encoder output into the decoder for decoding;
step S4: predict the word that should appear at the blank from the decoder output, i.e. the resulting word probability vector;
step S2 specifically includes the following steps;
step S21: obtain the representation vector X of each word of the input sentence, where X is the sum of the word embedding of the word and the embedding of the word's position;
step S22: the self-attention mechanism in the Transformer encoder is implemented with three matrices: a query matrix Q, a key matrix K and a value matrix V; first, the words of the input sentence are embedded into a matrix X in which each row represents one word of the sentence; X is multiplied by the weight matrices W_Q, W_K and W_V used by the pre-training model to obtain the matrices Q, K and V respectively;
step S23: multiply the query matrix Q by the key matrix K to score each word of the sentence against every other word, where a higher score means two words are more closely associated; the scores are then divided by the square root of the key-vector dimension d_k to stabilize the gradients; a softmax function then makes all scores positive and makes them sum to 1; finally, the softmax scores are multiplied by the value matrix V to obtain the output of the self-attention layer at this position, denoted by the matrix Z; this is expressed as:
Z = softmax(Q·K^T / √d_k) · V    (formula one)
step S24: after Z is obtained, it is passed to the next module of the encoder, a feed-forward neural network; this module has two fully connected layers, the activation function of the first layer is ReLU and the second layer is linear, expressed as:
FFN(Z) = max(0, Z·W1 + b1)·W2 + b2    (formula two);
where W1 and W2 are weight matrices and b1 and b2 are bias vectors; FFN(Z) is taken as the output of the Transformer encoder;
step S25: use the language model provided by the pre-training model, and then fine-tune this existing language model through an additional output layer so that it suits the downstream task of filling in blank text;
step S26: the added position information of the questions and the Transformer-based language model together form the encoder of the method for automatically completing blank text;
step S27: the obtained word representation matrix is fed into the encoder of the method, and the encoding information matrix C of all words of the sentence is obtained after 6 encoder modules; the word vector matrix is denoted X with dimensions n × d, where n is the number of words in the sentence and d is the dimension of the representation vector; the matrix dimensions of the output of each encoder module are identical to those of its input;
step S3 specifically includes the following steps;
step S31: the decoder is formed by stacking, in sequence, a fully connected layer, a GELU activation layer, a normalization layer and another fully connected layer;
step S32: the encoding information matrix C output by the encoder is passed to the decoder, which predicts the (n+1)-th word from the n words analysed so far, in order;
step S33: when predicting the (n+1)-th word, the words after it are hidden by a mask operation; during training, 15% of the words in each input sequence are masked at random and the model is then asked to predict these masked words;
step S34: to prevent selected words from being masked every time, which would keep the model from ever seeing them during subsequent fine-tuning, a further measure is taken in the masking operation: with 80% probability the selected word is replaced with [MASK], with 10% probability it is replaced with a random word, and with 10% probability it is left unchanged.
2. The BERT-based method for automatically filling in blank text according to claim 1, characterized in that step S1 specifically includes the following steps;
step S11: acquire the public CLOTH cloze dataset;
step S12: use the tokenizer corresponding to the chosen pre-training model to tokenize the articles and candidate options in the CLOTH dataset and convert them into indices in the corresponding dictionary;
step S13: record the position of each blank in the corresponding token sequence, and convert the standard answers from letters into numbers in order;
step S14: after preprocessing, each article in the CLOTH dataset is organized into five fields: sample name, article token IDs, option token IDs, query (blank) locations and answers.
3. The BERT-based method for automatically filling in blank text according to claim 1, characterized in that step S4 specifically includes the following steps;
step S41: use the word probability vector output by the decoder to predict, against a preset dictionary, the word that should appear at the blank;
step S42: write the predicted word into the blank of the specified text or the original text.
4. The BERT-based method for automatically filling in blank text according to claim 1, characterized in that the public CLOTH cloze dataset is released by Carnegie Mellon University and its full name is Large-scale Cloze Test Dataset Created by Teachers.
CN202011291822.1A 2020-11-18 2020-11-18 BERT-based method for automatically filling blank text Active CN112395841B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011291822.1A CN112395841B (en) 2020-11-18 2020-11-18 BERT-based method for automatically filling blank text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011291822.1A CN112395841B (en) 2020-11-18 2020-11-18 BERT-based method for automatically filling blank text

Publications (2)

Publication Number Publication Date
CN112395841A CN112395841A (en) 2021-02-23
CN112395841B true CN112395841B (en) 2022-05-13

Family

ID=74607313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011291822.1A Active CN112395841B (en) 2020-11-18 2020-11-18 BERT-based method for automatically filling blank text

Country Status (1)

Country Link
CN (1) CN112395841B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345574B (en) * 2021-05-26 2022-03-22 复旦大学 Traditional Chinese medicine stomachache health preserving scheme obtaining device based on BERT language model and CNN model
CN113268996A (en) * 2021-06-02 2021-08-17 网易有道信息技术(北京)有限公司 Method for expanding corpus, training method for translation model and product
CN114896986B (en) * 2022-06-07 2024-04-05 北京百度网讯科技有限公司 Method and device for enhancing training data of semantic recognition model
CN117273067B (en) * 2023-11-20 2024-02-02 上海芯联芯智能科技有限公司 Dialogue response method and device based on large language model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502627A (en) * 2019-08-28 2019-11-26 上海海事大学 A kind of answer generation method based on multilayer Transformer polymerization encoder
CN110737769A (en) * 2019-10-21 2020-01-31 南京信息工程大学 pre-training text abstract generation method based on neural topic memory
CN111723547A (en) * 2020-05-25 2020-09-29 河海大学 Text automatic summarization method based on pre-training language model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10747427B2 (en) * 2017-02-01 2020-08-18 Google Llc Keyboard automatic language identification and reconfiguration

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110502627A (en) * 2019-08-28 2019-11-26 上海海事大学 A kind of answer generation method based on multilayer Transformer polymerization encoder
CN110737769A (en) * 2019-10-21 2020-01-31 南京信息工程大学 pre-training text abstract generation method based on neural topic memory
CN111723547A (en) * 2020-05-25 2020-09-29 河海大学 Text automatic summarization method based on pre-training language model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An improved personalized query citation recommendation method; Li Fei et al.; Wanfang Data Journal Database; 2019-10-24; pages 1-5 *

Also Published As

Publication number Publication date
CN112395841A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
CN112395841B (en) BERT-based method for automatically filling blank text
CN110795552B (en) Training sample generation method and device, electronic equipment and storage medium
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN111581350A (en) Multi-task learning, reading and understanding method based on pre-training language model
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN110232113B (en) Method and system for improving question and answer accuracy of knowledge base
CN110276069A (en) A kind of Chinese braille mistake automatic testing method, system and storage medium
CN114881042B (en) Chinese emotion analysis method based on graph-convolution network fusion of syntactic dependency and part of speech
CN109740164A (en) Based on the matched electric power defect rank recognition methods of deep semantic
Agić et al. Baselines and test data for cross-lingual inference
CN113836895A (en) Unsupervised machine reading understanding method based on large-scale problem self-learning
CN111125333A (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN115761753A (en) Retrieval type knowledge prefix guide visual question-answering method fused with knowledge graph
Hämäläinen et al. Revisiting NMT for normalization of early English letters
CN115952263A (en) Question-answering method fusing machine reading understanding
Savci et al. Comparison of pre-trained language models in terms of carbon emissions, time and accuracy in multi-label text classification using AutoML
CN114282592A (en) Deep learning-based industry text matching model method and device
Ajees et al. A named entity recognition system for Malayalam using neural networks
Chowanda et al. Generative Indonesian conversation model using recurrent neural network with attention mechanism
CN112182151A (en) Reading understanding task identification method and device based on multiple languages
CN117131877A (en) Text detection method and system based on contrast learning
CN115455144A (en) Data enhancement method of completion type space filling type for small sample intention recognition
CN114579706A (en) Automatic subjective question evaluation method based on BERT neural network and multitask learning
Jiang et al. Analysis and improvement of external knowledge usage in machine multi-choice reading comprehension tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant