CN114064856A - XLNET-BiGRU-based text error correction method - Google Patents

XLNET-BiGRU-based text error correction method

Info

Publication number
CN114064856A
CN114064856A (application CN202111394371.9A)
Authority
CN
China
Prior art keywords
xlnet
model
embedding
text
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111394371.9A
Other languages
Chinese (zh)
Inventor
王伦
张发雨
王宁
党章
吴兴龙
孟奥
冯立二
杨正云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Future Networks Innovation Institute
Original Assignee
Jiangsu Future Networks Innovation Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Future Networks Innovation Institute filed Critical Jiangsu Future Networks Innovation Institute
Priority to CN202111394371.9A priority Critical patent/CN114064856A/en
Publication of CN114064856A publication Critical patent/CN114064856A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G06F16/3346 - Query execution using probabilistic model
    • G06F16/35 - Clustering; Classification
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 - Parsing using statistical methods
    • G06F40/30 - Semantic analysis
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/047 - Probabilistic or stochastic networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a text error correction method based on XLNet-BiGRU, comprising the following steps: S1, training an XLNet (Generalized Autoregressive Pretraining for Language Understanding) Chinese model on a large-scale unlabeled corpus, the XLNet model mainly comprising three core components: a Permutation Language Model, a Two-Stream Self-Attention mechanism and Transformer-XL; S2, preprocessing and labeling the text error correction corpus data; S3, constructing an XLNet-BiGRU neural network model on top of the XLNet pre-trained Chinese model from S1, the model mainly comprising a detection network and an error correction network, and training it with the labeled data from S2. The invention alleviates the long running time of traditional translation-model-based error correction: the serial, word-by-word generation of a correct sentence is optimized into a parallel process in which the XLNet neural network corrects only the erroneous content.

Description

XLNET-BiGRU-based text error correction method
Technical Field
The invention relates to the field of artificial intelligence and natural language processing, in particular to an XLNet-BiGRU text error correction method.
Background
Text error correction is a natural language processing technique for correcting erroneous content in text; it covers spelling correction, grammar correction, and semantic/pragmatic correction in specific scenarios. Spelling correction does not change the length of the text and only corrects wrongly written characters one by one, whereas grammar correction and semantic/pragmatic correction must handle errors such as extra words, missing words, wrong words and wrong word order, and may change the length of the text.
In recent years, large-scale deep pre-trained language models such as BERT and XLNet have driven rapid progress in natural language processing: they provide better initial semantic representations of text for specific downstream tasks and reduce the time and cost required for model convergence.
Traditional text error correction mainly relies on rule-based methods or translation models. Rule-based methods depend on manually defined replacement dictionaries and can only correct specific errors. Translation-model-based correction is currently the mainstream approach; neural translation models have replaced statistical ones and treat error correction as translation from an erroneous sentence to a correct one. Although this works well and produces fluent sentences, it requires large amounts of training data and is slow at inference time. For spelling correction alone, sequence labeling is typically used; it corrects wrongly written characters quickly but is not suited to other error types.
Disclosure of Invention
Chinese text correction is a challenging task because a model must have near human-level language understanding to achieve satisfactory results, and conventional rule-based or translation-model-based methods struggle to do so. The invention aims to provide a text error correction method based on XLNet-BiGRU that addresses the problems described in the background.
In order to achieve the purpose, the invention adopts the following technical scheme:
an XLNET-BiGRU text error correction method is characterized by comprising the following steps:
s1, training an XLNT (generalized automated training for Language understanding) Chinese Model based on large-scale unlabeled corpus, wherein the XLNT Model mainly comprises a ranking Language Model, a double-Stream Attention machine (Two-Stream Self-Attention) and a Transformer-XL core component;
s2, preprocessing and labeling the text error correction corpus data;
s3, constructing an XLNet-BiGRU neural network model on the basis of the XLNet pre-training Chinese model trained in S1, wherein the model mainly comprises a detection network and an error correction network, and is trained by using the marked data in S2.
The step S1 specifically includes: the permutation language model contained in the XLNet model randomly permutes the Chinese characters of the sentences in the text, so that for a character x_i the characters {x_{i+1}, …, x_n} that originally appear after it may also appear before it; let the text sequence of length T be [1, 2, …, T], let A_T be the set of all its permutations, let a_t be the t-th element of a permutation a, and let a_{<t} denote the elements before position t in that permutation; then for a ∈ A_T the modeling process can be expressed as:

max_θ E_{a ~ A_T} [ Σ_{t=1}^{T} log p_θ( x_{a_t} | x_{a_{<t}} ) ]

where θ denotes the trainable model parameters;
further, XLNet adopts a Two-Stream Self-Attention mechanism, in which the Content Stream Attention is a self-attention stream containing both position information and content information, and the Query Stream Attention is an input stream containing only position information;
when the Query Stream Attention is used to predict the target position, no content information of that position is revealed; the two streams complement each other and better extract features related to the context information; the specific two-stream attention mechanism is:

g_{a_t}^{(m)} = Attention(Q = g_{a_t}^{(m-1)}, KV = h_{a_{<t}}^{(m-1)}; θ)   (Query Stream)
h_{a_t}^{(m)} = Attention(Q = h_{a_t}^{(m-1)}, KV = h_{a_{≤t}}^{(m-1)}; θ)   (Content Stream)

where g_{a_t} carries only the position information of the input text and serves as the Q matrix in self-attention, and h_{a_t} carries the content information of the input text and serves as the K and V matrices in self-attention;
furthermore, the XLNet language model takes the Transformer-XL framework as its core and introduces a recurrence mechanism and relative position encoding, so that context semantic information can be used more effectively and latent relations hidden in the text vectors can be mined; the relative position encoding mechanism is introduced as:

A_{i,j}^{rel} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}

where E_{x_i} and E_{x_j} are the text vectors of words i and j respectively, R_{i-j} is the relative position vector between words i and j, W denotes the weight matrices, and u, v are learned global bias vectors.
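For illustration only (not part of the patent), the following Python sketch shows one way a sampled permutation order of the kind used by the permutation language model can be realized in practice: the word order of the input is never changed, and each position is only allowed to attend to the positions that precede it in the sampled factorization order, which is what lets the model estimate p_θ(x_{a_t} | x_{a_{<t}}). All names are assumptions.

```python
import numpy as np

def permutation_masks(seq_len, rng=np.random.default_rng(0)):
    """Build attention masks for one sampled factorization order.

    mask[i, j] = True means "position i may attend to position j".
    The token order of the input is never changed; only visibility is permuted.
    """
    order = rng.permutation(seq_len)        # sampled factorization order a
    rank = np.empty(seq_len, dtype=int)
    rank[order] = np.arange(seq_len)        # rank[i] = position of token i within the order

    # Query stream at position i sees tokens that come strictly earlier in the order.
    query_mask = rank[None, :] < rank[:, None]
    # Content stream additionally sees the token itself.
    content_mask = query_mask | np.eye(seq_len, dtype=bool)
    return order, query_mask, content_mask

order, q_mask, c_mask = permutation_masks(5)
print(order)                 # one permutation of the 5 positions
print(q_mask.astype(int))    # which positions each query position may see
```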
The step S3 specifically includes:
building an input Embedding sequence E = (e_1, e_2, …, e_n) from the XLNet pre-trained word vectors;
where e_i is the Embedding vector of character x_i, i.e., the sum of the word embedding (word Embedding), position embedding (position Embedding) and segment embedding (segment Embedding) of that character;
the input sequence E is then fed into the detection network, a BiGRU (bidirectional gated recurrent unit) neural network model;
the GRU is in effect a simplification of the LSTM that controls the passing and blocking of information through gates; the specific state update formulas are:

z_t = σ(w_z · [h_{t-1}, x_t])
r_t = σ(w_r · [h_{t-1}, x_t])
h̃_t = tanh(w_h̃ · [r_t * h_{t-1}, x_t])
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t

where σ is the sigmoid function; z_t and r_t are the update gate and reset gate, determined by the current input x_t and the previous hidden state h_{t-1}, and control which information is kept or discarded; the update gate controls how much of the previous state information is carried into the current state, a larger value bringing in more previous state information; the reset gate controls how much of the previous state is written into the current candidate state h̃_t; h_t serves both as the output at the current time step and as the hidden-vector input at the next time step; w_z, w_r and w_h̃ are the weight parameters of the update gate, the reset gate and the candidate state, respectively;
the output vector G = (g_1, g_2, …, g_n) of the BiGRU is a probability label between 0 and 1 for each character, a larger value indicating a higher probability that the corresponding character is erroneous;
further, the error probability of each character in the text sequence is calculated by the following formula:
p_i = P_d(g_i = 1 | X) = σ(W_d · h_i^d + b_d)

where P_d(g_i = 1 | X) is the error probability computed by the detection network, σ is the sigmoid function, W_d is the weight matrix of the fully connected layer, b_d is the bias term, and h_i^d is the hidden state of the last BiGRU layer;
further, the input embedding e_i and the probability p_i computed by the detection network are combined by a weighted sum to construct the soft-masked Embedding:
e'_i = p_i · e_mask + (1 - p_i) · e_i
where e_i is the input Embedding vector and e_mask is the mask Embedding vector; if the error probability is high, e'_i approaches the mask Embedding vector e_mask, otherwise e'_i stays close to the input Embedding vector e_i;
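In code this weighting is a single broadcasted expression; a sketch reusing the tensors from the detection-network example above (mask_embedding is an assumed name for the Embedding vector of the mask token):

```python
# embeddings:     (batch, seq_len, dim)  input Embedding sequence E
# p:              (batch, seq_len)       error probabilities from the detection network
# mask_embedding: (dim,)                 Embedding vector of the mask token
p_ = p.unsqueeze(-1)                                           # (batch, seq_len, 1) for broadcasting
soft_masked = p_ * mask_embedding + (1.0 - p_) * embeddings    # e'_i = p_i·e_mask + (1-p_i)·e_i
```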
Further, the error correction network is an XLNet-based sequence multi-class classification model; its input is the soft-masked Embedding sequence E' = (e'_1, e'_2, …, e'_n) and its output is the sequence Y = (y_1, y_2, …, y_n);
Further, all hidden states h of the last layer Encoder in the XLNet model are takeni cAnd the input Embedding sequence vector eiCorresponding addition is carried out with Residual Connection (Residual Connection) to obtain Residual Connection value h'i
h'_i = h_i^c + e_i

where h_i^c denotes the hidden states of the last encoder layer of the XLNet model; the residual connection value h'_i is then fed into a fully connected layer, which maps it to a vector with the same dimension as the candidate vocabulary; the probability that character i is corrected to candidate character j is then output by the softmax function:

P_c(y_i | X) = softmax(W h'_i + b)[j].
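A sketch of this correction head, under the assumption that the XLNet encoder's last-layer hidden states are available as a tensor; the class name, dimensions and default vocabulary size are illustrative only.

```python
import torch
import torch.nn as nn

class CorrectionHead(nn.Module):
    """Maps residual-connected XLNet hidden states to a distribution over candidate characters."""

    def __init__(self, hidden_dim=768, vocab_size=32000):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, xlnet_hidden, input_embeddings):
        h_res = xlnet_hidden + input_embeddings          # residual connection: h'_i = h_i^c + e_i
        return torch.softmax(self.fc(h_res), dim=-1)     # P_c(y_i | X) over the candidate vocabulary
```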
compared with the prior art, the invention has the beneficial effects that:
the XLNET model used by the invention is based on the fact that unsupervised training is carried out on large-scale label-free data, pre-training can be carried out by combining context semantic information, and the characteristics of word level, syntactic structure and context semantic information are learned, so that the defect that static word embedding cannot represent word ambiguity is solved; the method uses the sequence label for the processing object of text error correction, so that various types of errors can be quickly and accurately corrected by using a sequence label method, and the method is not limited to spelling error correction; the method carries out text error correction based on XLNET, can carry out error correction on error texts in large-scale linguistic data and generate correct texts, simultaneously improves the problem of long time consumption of the traditional error correction method based on a translation model, and optimizes the serial process of generating correct sentences one by one for text error correction into the parallel process of carrying out error correction only on error contents by using the XLNET neural network.
Drawings
FIG. 1 is a flow chart of a text error correction method based on XLNET-BiGRU in the invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it is to be understood that the described embodiments are only a part of the embodiments of the present invention, rather than all embodiments, and all other embodiments obtained by those skilled in the art without any inventive work based on the embodiments of the present invention belong to the protection scope of the present invention.
Example 1
As shown in fig. 1, the text error correction method based on XLNet-BiGRU of the present invention includes the following steps:
s1, training XLNT (generalized automated forecasting for Language understanding) Chinese model based on large-scale unmarked corpus.
The XLNET Model mainly comprises a ranking Language Model (Permutation Language Model), a Two-Stream Attention machine Model (Two-Stream Self-Attention) and a transform-XL core component.
Further, the contained permutation language model in the XLNET model aims to randomly shuffle the Chinese characters of the sentence in the text, for Chinese character xiHan { x } originally appearing behind iti+1,…,xnIt can also appear in front of it, assuming that the text sequence of length T is [1,2, …, T]All combinations of (A)T,atFor the t-th element in the sequence, a < t represents a permutation combination case, i.e. a is equal to ATThe modeling process can be expressed as:
Figure BDA0003369424790000061
where θ is the model parameter with training.
Further, XLNet employs a dual Stream Attention mechanism, in which a Content Stream Attention indicates a Self-Attention mechanism that includes both location information and Content information, and a Query Stream Attention indicates a location-only input Stream. Therefore, when the Query Stream attribute is used for predicting the required predicted position, no content information of the current position is leaked, the Query Stream attribute and the content information supplement each other, and the characteristics related to the context information are better extracted, wherein a specific double-flow attention mechanism is as follows:
Figure BDA0003369424790000062
Figure BDA0003369424790000063
wherein the content of the first and second substances,
Figure BDA0003369424790000064
only the position information of the input text, as the Q matrix in Self-orientation,
Figure BDA0003369424790000065
the content information containing the input text is used as K and V matrixes in the Self-orientation.
Furthermore, the XLNET language model takes a transform framework as a core, introduces a circulation mechanism and relative position coding, and can better utilize context semantic information to dig out potential hidden relations in text vectors. Introducing a relative position coding mechanism formula:
Figure BDA0003369424790000066
wherein
Figure BDA0003369424790000067
Text vectors, R, representing words i, j, respectivelyi-jRepresenting the relative position vector of the words i, j and W representing the weight matrix.
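As a rough illustration (not the patent's implementation), a single-head, single-layer Python sketch of the two-stream update, using visibility masks of the kind built in the earlier permutation-mask sketch; the relative-position terms and multi-head details of the real Transformer-XL layer are omitted, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def two_stream_attention(h, g, content_mask, query_mask, w_q, w_k, w_v):
    """One simplified two-stream self-attention step (single head, no relative positions).

    h: content stream (B, T, d) -- position + content information
    g: query stream   (B, T, d) -- position information only
    mask[i, j] = True means position i may attend to position j (bool tensors of shape (T, T)).
    """
    d = h.size(-1)
    k, v = h @ w_k, h @ w_v              # keys and values always come from the content stream

    def attend(q, mask):
        scores = (q @ k.transpose(-2, -1)) / d ** 0.5
        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v

    # Content stream sees x_{a_<=t} (itself included); query stream sees only x_{a_<t}.
    # Assumes every query-stream row has at least one visible position (in XLNet the cached
    # memory from the previous segment plays this role).
    h_new = attend(h @ w_q, content_mask)
    g_new = attend(g @ w_q, query_mask)
    return h_new, g_new
```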
S2, preprocessing and labeling the text error correction corpus data, where the training data are tuple pairs each consisting of an original sequence and a corrected sequence: (X_1, Y_1), (X_2, Y_2), …, (X_N, Y_N).
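For illustration only, a hypothetical training pair in this format together with the per-character 0/1 error labels that the detection network can be trained on; the sentences are invented examples, not taken from the patent's corpus.

```python
# Hypothetical sample: the last character is a homophone error (裹 -> 果).
sample = ("我今天吃了苹裹", "我今天吃了苹果")   # (original sequence X, corrected sequence Y)

def detection_labels(original, corrected):
    """Derive per-character labels for the detection network (1 = erroneous character)."""
    assert len(original) == len(corrected)
    return [0 if o == c else 1 for o, c in zip(original, corrected)]

print(detection_labels(*sample))   # [0, 0, 0, 0, 0, 0, 1]
```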
S3, constructing an XLNet-BiGRU neural network model on the basis of the XLNet pre-trained Chinese model from S1; the model mainly comprises a detection network and an error correction network and is trained with the labeled data from S2.
S3-1, an input Embedding sequence E = (e_1, e_2, …, e_n) is built from the XLNet pre-trained word vectors, where e_i is the Embedding vector of character x_i, i.e., the sum of the word embedding (word Embedding), position embedding (position Embedding) and segment embedding (segment Embedding) of that character.
S3-2, the input sequence E is fed into the detection network, a BiGRU (bidirectional gated recurrent unit) neural network model. The GRU is a simplification of the LSTM that controls the passing and blocking of information through gates.
The specific state update formulas are as follows:

z_t = σ(w_z · [h_{t-1}, x_t])
r_t = σ(w_r · [h_{t-1}, x_t])
h̃_t = tanh(w_h̃ · [r_t * h_{t-1}, x_t])
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t
where σ is the sigmoid function; z_t and r_t are the update gate and reset gate, determined by the current input x_t and the previous hidden state h_{t-1}, and control which information is kept or discarded. h_t serves both as the output at the current time step and as the hidden-vector input at the next time step.
The output vector G = (g_1, g_2, …, g_n) of the BiGRU is a probability label between 0 and 1 for each character, a larger value indicating a greater likelihood that the corresponding character is erroneous.
Further, the error probability of each character in the text sequence is calculated by the following formula:
p_i = P_d(g_i = 1 | X) = σ(W_d · h_i^d + b_d)

where P_d(g_i = 1 | X) is the error probability computed by the detection network, σ is the sigmoid function, W_d is the weight matrix of the fully connected layer, b_d is the bias term, and h_i^d is the hidden state of the last BiGRU layer.
S3-3, the input embedding e_i and the probability p_i computed by the detection network are combined by a weighted sum to construct the soft-masked Embedding:

e'_i = p_i · e_mask + (1 - p_i) · e_i

where e_i is the input Embedding vector and e_mask is the mask Embedding vector. If the error probability is high, e'_i approaches the mask Embedding vector e_mask; otherwise e'_i stays close to the input Embedding vector e_i.
S3-4, the error correction network is an XLNet-based sequence multi-class classification model. Its input is the soft-masked Embedding sequence E' = (e'_1, e'_2, …, e'_n) and its output is the sequence Y = (y_1, y_2, …, y_n).
Further, all hidden states h_i^c of the last-layer encoder of the XLNet model are added element-wise to the corresponding input Embedding vectors e_i through a residual connection (Residual Connection), giving the residual connection value h'_i:
h'_i = h_i^c + e_i

The residual connection value h'_i is then fed into a fully connected layer, which maps it to a vector with the same dimension as the candidate vocabulary. The probability that character i is corrected to candidate character j is then output by the softmax function:

P_c(y_i | X) = softmax(W h'_i + b)[j]
further, the character with the highest probability is taken to replace the text to be corrected.
According to the method, example preprocessed corpus samples for the XLNet-BiGRU text error correction model are constructed as follows:

[Table of example original/corrected sentence pairs; reproduced only as images in the original filing]
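Putting the steps of this embodiment together, a hedged end-to-end inference sketch; the component objects are the hypothetical ones defined in the earlier sketches, and embed and id_to_char are assumed helpers. It detects errors, soft-masks the embeddings, encodes them with XLNet, then takes the argmax candidate at every position.

```python
def correct_text(text, embed, detector, xlnet_encoder, correction_head, mask_embedding, id_to_char):
    """Illustrative inference loop; all components are assumed to be already trained."""
    e = embed(text)                                   # (1, T, d) input Embedding sequence
    p = detector(e)                                   # (1, T) per-character error probabilities
    p_ = p.unsqueeze(-1)
    e_soft = p_ * mask_embedding + (1 - p_) * e       # soft-masked Embedding e'_i
    hidden = xlnet_encoder(e_soft)                    # (1, T, d) last-layer hidden states h_i^c
    probs = correction_head(hidden, e)                # (1, T, vocab) P_c(y_i | X)
    best = probs.argmax(dim=-1)[0]                    # most probable candidate per position
    return "".join(id_to_char[int(i)] for i in best)
```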
the foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the preferred embodiments of the invention and described in the specification are only preferred and not intended to limit the invention, and that various changes and modifications may be made without departing from the novel spirit and scope of the invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (3)

1. An XLNET-BiGRU text error correction method is characterized by comprising the following steps:
s1, training an XLNT (generalized automated training for Language understanding) Chinese Model based on large-scale unlabeled corpus, wherein the XLNT Model mainly comprises a ranking Language Model, a double-Stream Attention machine (Two-Stream Self-Attention) and a Transformer-XL core component;
s2, preprocessing and labeling the text error correction corpus data;
s3, constructing an XLNet-BiGRU neural network model on the basis of the XLNet pre-training Chinese model trained in S1, wherein the model mainly comprises a detection network and an error correction network, and is trained by using the marked data in S2.
2. The XLNet-BiGRU-based text correction method of claim 1, wherein: the step S1 specifically includes: randomly permuting the Chinese characters of the sentences in the text, so that for a character x_i the characters {x_{i+1}, …, x_n} that originally appear after it may also appear before it; let the text sequence of length T be [1, 2, …, T], let A_T be the set of all its permutations, let a_t be the t-th element of a permutation a, and let a_{<t} denote the elements before position t in that permutation; then for a ∈ A_T the modeling process can be expressed as:

max_θ E_{a ~ A_T} [ Σ_{t=1}^{T} log p_θ( x_{a_t} | x_{a_{<t}} ) ]

where θ denotes the trainable model parameters;
further, XLNet adopts a Two-Stream Self-Attention mechanism, in which the Content Stream Attention is a self-attention stream containing both position information and content information, and the Query Stream Attention is an input stream containing only position information;
when the Query Stream Attention is used to predict the target position, no content information of that position is revealed; the two streams complement each other and better extract features related to the context information; the specific two-stream attention mechanism is:

g_{a_t}^{(m)} = Attention(Q = g_{a_t}^{(m-1)}, KV = h_{a_{<t}}^{(m-1)}; θ)   (Query Stream)
h_{a_t}^{(m)} = Attention(Q = h_{a_t}^{(m-1)}, KV = h_{a_{≤t}}^{(m-1)}; θ)   (Content Stream)

where g_{a_t} carries only the position information of the input text and serves as the Q matrix in self-attention, and h_{a_t} carries the content information of the input text and serves as the K and V matrices in self-attention;
furthermore, the XLNet language model takes the Transformer-XL framework as its core and introduces a recurrence mechanism and relative position encoding, so that context semantic information can be used more effectively and latent relations hidden in the text vectors can be mined; the relative position encoding mechanism is introduced as:

A_{i,j}^{rel} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}

where E_{x_i} and E_{x_j} are the text vectors of words i and j respectively, R_{i-j} is the relative position vector between words i and j, W denotes the weight matrices, and u, v are learned global bias vectors.
3. The XLNet-BiGRU-based text correction method of claim 1, wherein: the step S3 specifically includes:
building an input Embedding sequence E = (e_1, e_2, …, e_n) from the XLNet pre-trained word vectors;
where e_i is the Embedding vector of character x_i, i.e., the sum of the word embedding (word Embedding), position embedding (position Embedding) and segment embedding (segment Embedding) of that character;
the input sequence E is then fed into the detection network, a BiGRU (bidirectional gated recurrent unit) neural network model;
the GRU is in effect a simplification of the LSTM that controls the passing and blocking of information through gates; the specific state update formulas are:

z_t = σ(w_z · [h_{t-1}, x_t])
r_t = σ(w_r · [h_{t-1}, x_t])
h̃_t = tanh(w_h̃ · [r_t * h_{t-1}, x_t])
h_t = (1 - z_t) * h_{t-1} + z_t * h̃_t

where σ is the sigmoid function; z_t and r_t are the update gate and reset gate, determined by the current input x_t and the previous hidden state h_{t-1}, and control which information is kept or discarded; the update gate controls how much of the previous state information is carried into the current state, a larger value bringing in more previous state information; the reset gate controls how much of the previous state is written into the current candidate state h̃_t; h_t serves both as the output at the current time step and as the hidden-vector input at the next time step; w_z, w_r and w_h̃ are the weight parameters of the update gate, the reset gate and the candidate state, respectively;
the output vector G = (g_1, g_2, …, g_n) of the BiGRU is a probability label between 0 and 1 for each character, a larger value indicating a higher probability that the corresponding character is erroneous;
further, the error probability of each character in the text sequence is calculated by the following formula:

p_i = P_d(g_i = 1 | X) = σ(W_d · h_i^d + b_d)

where P_d(g_i = 1 | X) is the error probability computed by the detection network, σ is the sigmoid function, W_d is the weight matrix of the fully connected layer, b_d is the bias term, and h_i^d is the hidden state of the last BiGRU layer;
further, the input embedding e_i and the probability p_i computed by the detection network are combined by a weighted sum to construct the soft-masked Embedding:

e'_i = p_i · e_mask + (1 - p_i) · e_i

where e_i is the input Embedding vector and e_mask is the mask Embedding vector; if the error probability is high, e'_i approaches the mask Embedding vector e_mask, otherwise e'_i stays close to the input Embedding vector e_i;
further, the error correction network is an XLNet-based sequence multi-class classification model; its input is the soft-masked Embedding sequence E' = (e'_1, e'_2, …, e'_n) and its output is the sequence Y = (y_1, y_2, …, y_n);
further, all hidden states h_i^c of the last-layer encoder of the XLNet model are added element-wise to the corresponding input Embedding vectors e_i through a residual connection (Residual Connection), giving the residual connection value h'_i:

h'_i = h_i^c + e_i

where h_i^c denotes the hidden states of the last encoder layer of the XLNet model;
the residual connection value h'_i is then fed into a fully connected layer, which maps it to a vector with the same dimension as the candidate vocabulary; the probability that character i is corrected to candidate character j is then output by the softmax function:

P_c(y_i | X) = softmax(W h'_i + b)[j].
CN202111394371.9A 2021-11-23 2021-11-23 XLNET-BiGRU-based text error correction method Pending CN114064856A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111394371.9A CN114064856A (en) 2021-11-23 2021-11-23 XLNET-BiGRU-based text error correction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111394371.9A CN114064856A (en) 2021-11-23 2021-11-23 XLNET-BiGRU-based text error correction method

Publications (1)

Publication Number Publication Date
CN114064856A true CN114064856A (en) 2022-02-18

Family

ID=80279483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111394371.9A Pending CN114064856A (en) 2021-11-23 2021-11-23 XLNET-BiGRU-based text error correction method

Country Status (1)

Country Link
CN (1) CN114064856A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017891A (en) * 2022-08-04 2022-09-06 海拓仪器(江苏)有限公司 Long text error correction method
CN115204143A (en) * 2022-09-19 2022-10-18 江苏移动信息系统集成有限公司 Method and system for calculating text similarity based on prompt
CN115204143B (en) * 2022-09-19 2022-12-20 江苏移动信息系统集成有限公司 Method and system for calculating text similarity based on prompt

Similar Documents

Publication Publication Date Title
CN112733541A (en) Named entity identification method of BERT-BiGRU-IDCNN-CRF based on attention mechanism
CN107967262A (en) A kind of neutral net covers Chinese machine translation method
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN111666758B (en) Chinese word segmentation method, training device and computer readable storage medium
CN110008469A (en) A kind of multi-level name entity recognition method
CN114757182A (en) BERT short text sentiment analysis method for improving training mode
CN114064856A (en) XLNET-BiGRU-based text error correction method
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
Ren et al. Detecting the scope of negation and speculation in biomedical texts by using recursive neural network
CN111428518B (en) Low-frequency word translation method and device
CN113190656A (en) Chinese named entity extraction method based on multi-label framework and fusion features
Wu et al. An effective approach of named entity recognition for cyber threat intelligence
CN110134950A (en) A kind of text auto-collation that words combines
CN114925170B (en) Text proofreading model training method and device and computing equipment
CN115759042A (en) Sentence-level problem generation method based on syntax perception prompt learning
CN111898337B (en) Automatic generation method of single sentence abstract defect report title based on deep learning
CN111274826B (en) Semantic information fusion-based low-frequency word translation method
CN115730232A (en) Topic-correlation-based heterogeneous graph neural network cross-language text classification method
CN113590745B (en) Interpretable text inference method
Xu Research on neural network machine translation model based on entity tagging improvement
CN114169345A (en) Method and system for day-to-day machine translation using homologous words
CN114417872A (en) Contract text named entity recognition method and system
CN113408267A (en) Word alignment performance improving method based on pre-training model
Chong Design and implementation of English grammar error correction system based on deep learning
Zhang et al. A multi-granularity neural network for answer sentence selection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination