CN113947072A - Text error correction method and text error correction device - Google Patents

Text error correction method and text error correction device Download PDF

Info

Publication number
CN113947072A
CN113947072A
Authority
CN
China
Prior art keywords
text
error correction
language model
training
specific type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111204836.XA
Other languages
Chinese (zh)
Inventor
何友鑫
刘传厚
朱星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shuidi Credit Service Co ltd
Original Assignee
Shanghai Shuidi Credit Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shuidi Credit Service Co ltd filed Critical Shanghai Shuidi Credit Service Co ltd
Priority to CN202111204836.XA
Publication of CN113947072A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a text error correction method and a text error correction device. The method comprises the following steps: a language model construction step of collecting texts of a specific type, parsing them to form a document corpus for that text type, and training a pre-training language model on the corpus to obtain a language model for the specific text type; a text error correction model construction step of constructing a text error correction task and fine-tuning the language model obtained in the language model construction step to build an end-to-end text error correction model; and a text error correction step of inputting the text to be corrected into the text error correction model obtained in the text error correction model construction step to obtain the corrected text and the position information of the errors. The text error correction method and device address the pain point that IPO prospectus verification requires extensive manual work; since the corpora need not be labeled manually, a large amount of labor cost is saved.

Description

Text error correction method and text error correction device
Technical Field
The present invention relates to the field of language processing technologies, and in particular, to a text error correction method and a text error correction apparatus.
Background
With the development of artificial intelligence and natural language processing technology, natural language processing has become an important direction of artificial intelligence. At present, applying these technologies to Chinese error correction systems mainly follows two technical routes: 1. Chinese error correction based on a language model: this approach relies on a statistical language model trained on large-scale corpora to compute a perplexity score for a sentence, with a threshold set to judge whether the sentence is reasonable natural language. 2. Chinese error correction based on a deep neural network model: the mainstream solution to the typo-correction task is an end-to-end neural sequence generation model (Seq2Seq), which treats Chinese error correction as a machine translation process, i.e., translating a wrong sentence into a correct one. The generation model solves the conversion of a source sequence into a target sequence with an encoder-decoder framework: one RNN (the encoder) represents the input sentence as a vector, and another RNN (the decoder) decodes that vector to produce the target output.
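As a non-authoritative illustration of the Seq2Seq route just described, the following minimal PyTorch sketch shows an RNN encoder-decoder that "translates" an erroneous sentence into a corrected one; the class name, vocabulary size, and dimensions are illustrative assumptions, not part of the invention.

import torch
import torch.nn as nn

class Seq2SeqCorrector(nn.Module):
    # Minimal RNN encoder-decoder sketch (hypothetical, for illustration only).
    def __init__(self, vocab_size=8000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.embed(src_ids))           # encode the wrong sentence into a vector
        dec_out, _ = self.decoder(self.embed(tgt_ids), h)  # decode conditioned on that vector
        return self.out(dec_out)                           # per-position logits over the vocabulary

model = Seq2SeqCorrector()
src = torch.randint(0, 8000, (2, 12))   # a batch of 2 erroneous sentences (token ids)
tgt = torch.randint(0, 8000, (2, 12))   # teacher-forced correct sentences
logits = model(src, tgt)                # shape (2, 12, 8000)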
In addition, no text error correction method particularly suited to IPO (Initial Public Offering) prospectuses has been proposed, and the following problems remain to be solved:
(1) an IPO prospectus is long and word-heavy: a standard prospectus typically runs 300 to 400 pages and 300,000 to 400,000 characters;
(2) it contains many professional terms and places high demands on the professionalism of the wording; moreover, errors in the text include but are not limited to wrongly written characters, and many are semantic and contextual-logic errors;
(3) an IPO prospectus is usually drafted collaboratively by multiple staff of the sponsoring securities firm, all of which makes later manual verification very difficult;
(4) common errors mainly include: misused words, typographical errors, contextual conflicts, contextual logic errors, punctuation errors, numeric format errors, financial common-sense errors, and the like.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
In view of the above problems, the present invention is directed to a text error correction method and a text error correction device that can be applied to correcting IPO prospectus texts.
A text error correction method of an aspect of the present invention is characterized in that,
a language model construction step, namely collecting a specific type of text, analyzing the specific type of text to form a document corpus aiming at the specific type of text, and training by utilizing a pre-training language model based on the document corpus to obtain the language model aiming at the specific type of text;
a text error correction model construction step, namely constructing a text error correction task, finely adjusting the language model obtained in the language model construction step, and constructing an end-to-end text error correction model; and
and a text error correction step of inputting the text to be corrected into the text error correction model obtained in the text error correction model construction step to obtain the corrected text and the position information of the error.
Optionally, a Bert-based pre-training language model is adopted in the language model construction step.
Optionally, a transformer model is adopted in the language model building step and the text correction model building step.
Optionally, the language model building step comprises the sub-steps of:
a collecting sub-step of collecting a specific type of text;
an analysis substep, analyzing the text of the specific type, cutting the text into paragraphs according to a specified length, and forming a document corpus aiming at the text of the specific type; and
and a training substep, based on the document corpus, utilizing a Bert pre-training language model and a random Chinese full word masking mode to train, and obtaining a language model based on the specific type of text.
Optionally, the text correction model building step includes the following sub-steps:
a task construction sub-step, namely generating sentence pairs of wrong corpora by adopting a self-supervision mode to construct a training task; and
and a fine tuning training substep, performing fine tuning of a downstream task above the training task, thereby establishing an end-to-end text error correction model.
Optionally, in the task construction substep, the characters masked by the random Chinese whole-word masking in the language model obtained in the language model construction step are randomly replaced with characters of a prescribed type, generating sentence pairs of erroneous corpora to construct the training task.
Optionally, in the task construction substep, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
A text error correction device according to an aspect of the present invention includes:
the language model construction module is used for collecting the text of a specific type, analyzing the text of the specific type to form a document corpus aiming at the text of the specific type, and training by utilizing a pre-training language model based on the document corpus to obtain the language model aiming at the text of the specific type;
the text error correction model building module is used for building a text error correction task and finely tuning the language model obtained by the language model building module so as to build an end-to-end text error correction model; and
and the text error correction module is used for inputting the text to be corrected into the text error correction model obtained in the text error correction model building module so as to obtain the corrected text and the position information of the error.
Optionally, a Bert-based pre-training language model is employed in the language model building module.
Optionally, a transformer model is adopted in the language model building module and the text correction model building module.
Optionally, the language model building module includes:
the collection submodule is used for collecting the specific type of text;
the analysis submodule is used for analyzing the text of the specific type, cutting the text into paragraphs according to the specified length and forming a document corpus aiming at the text of the specific type; and
and the training sub-module is used for training by utilizing a Bert pre-training language model and a random Chinese full word masking mode based on the document corpus to obtain a language model based on the specific type of text.
Optionally, the text correction model building module includes:
the task construction submodule is used for generating sentence pairs of error linguistic data in a self-supervision mode to construct a training task; and
and the fine tuning training submodule is used for fine tuning of a downstream task above the training task so as to establish an end-to-end text error correction model.
Optionally, in the task construction sub-module, the characters masked by the random Chinese whole-word masking in the language model obtained by the language model building module are randomly replaced with characters of a prescribed type, generating sentence pairs of erroneous corpora to construct the training task.
Optionally, in the task building submodule, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
The computer-readable medium of the present invention has a computer program stored thereon, wherein the computer program, when executed by a processor, implements the text error correction method described above.
The computer device of the present invention comprises: a memory; a processor; and a computer program stored on the memory and executable on the processor, wherein execution of the computer program by the processor implements the text error correction method described above.
As described above, compared with traditional statistical error correction models, the text error correction method and the text error correction device of the present invention achieve better precision and recall, cover most error scenarios, address the pain point that IPO prospectus verification requires extensive manual work, and thus have great commercial application value. Moreover, since the method is implemented with a transformer model and the corpora need not be labeled manually, a large amount of labor cost is saved.
Drawings
Fig. 1 is a schematic diagram showing a flow of a text error correction method of the present invention.
Fig. 2 is a schematic diagram showing a specific flow of the language model construction step S100 in the text error correction method of the present invention.
Fig. 3 is a schematic diagram showing a specific flow of the text correction model building step S200 in the text correction method of the present invention.
Fig. 4 is a schematic diagram showing the structure of the text correction device of the present invention.
Detailed Description
The following description is of some of the several embodiments of the invention and is intended to provide a basic understanding of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention.
For the purposes of brevity and explanation, the principles of the present invention are described herein with reference primarily to exemplary embodiments thereof. However, those skilled in the art will readily recognize that the same principles are equally applicable to all types of text correction methods and text correction apparatuses, and that these same principles, as well as any such variations, may be implemented therein without departing from the true spirit and scope of the present patent application.
Moreover, in the following description, reference is made to the accompanying drawings that illustrate certain exemplary embodiments. Electrical, mechanical, logical, and structural changes may be made to these embodiments without departing from the spirit and scope of the invention. In addition, while a feature of the invention may have been disclosed with respect to only one of several implementations/embodiments, such feature may be combined with one or more other features of the other implementations/embodiments as may be desired and/or advantageous for any given or identified function. The following description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.
Terms such as "comprising" and "comprises" mean that, in addition to having elements (modules) and steps that are directly and explicitly stated in the description and claims, the solution of the invention does not exclude the presence of other elements (modules) and steps that are not directly or explicitly stated.
Before describing the text error correction method and the text error correction apparatus of the present invention, some technical terms appearing in the present invention will be briefly described.
(1) Language model
A language model is an abstract mathematical model of a language based on objective linguistic facts, i.e., a correspondence relationship. The relationship between a language model and the objective facts of a language is like the relationship between an abstract straight line in mathematics and a concrete straight line.
A language model can estimate the probability of a piece of text and plays an important role in tasks such as information retrieval, machine translation, and speech recognition. Language models fall into statistical language models and neural network language models.
(2) Self-supervised learning task: Fill-MASK
This task is analogous to a cloze test: several positions in a well-formed passage are masked, and the model automatically learns what the masked content should be.
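For concreteness, a Fill-MASK query can be run in a few lines with the Hugging Face transformers library; this is a generic illustration of the task, not the patent's own code, and "bert-base-chinese" is simply a readily available example model.

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")
# Mask one position of a well-formed sentence and let the model recover it.
for candidate in fill("上海是中国最大的[MASK]市。"):
    print(candidate["token_str"], round(candidate["score"], 3))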
(3)BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding
This paper proposed the MLM (masked language model) pre-training approach built on the transformer model structure, which together form a main cornerstone of current mainstream NLP work.
(4)Revisiting Pre-Trained Models for Chinese Natural Language Processing
This paper proposed a new Chinese pre-training task based on a BERT variant. Its main innovations are Chinese whole-word masking and replacing the [MASK] positions with words whose word vectors are similar to the original Chinese words, with the model's loss based on the similarity between the replacement word vector and the original word.
(5) KenLM language model
A character error correction tool based on a statistical language model.
To determine whether a piece of text is natural language, one can estimate its probability under a probability distribution over texts. The words in a language model are ordered: given m words, whether they form reasonable natural language depends crucially on whether their arrangement is correct. The basic idea of a statistical language model is to compute conditional probabilities. For example, suppose a piece of text consists of the m words w1, w2, ..., wm; to judge whether it is a plausible sentence, its joint probability can be computed by the following formula:
P(w1, w2, ..., wm) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wm|w1, w2, ..., wm-1)
In practice, if the text is long, estimating P(wi|w1, w2, ..., wi-1) becomes very difficult, so a simplified model appeared: the N-gram language model. It assumes the current word depends only on the n-1 words preceding it and is independent of earlier words, so the formula above can be written as:
P(wi|w1, w2, ..., wi-1) ≈ P(wi|wi-(n-1), ..., wi-1)
the classical statistical language model is the N-gram language model.
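A minimal sketch of this statistical idea, using a bigram (n = 2) model without smoothing; real toolkits such as KenLM add smoothing and backoff, and all names here are illustrative.

from collections import defaultdict

pair_counts = defaultdict(int)     # counts of (w1, w2) bigrams
context_counts = defaultdict(int)  # counts of the conditioning word w1

def train(sentences):
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            pair_counts[(w1, w2)] += 1
            context_counts[w1] += 1

def sentence_prob(words):
    # P(w1..wm) ≈ product of P(wi | wi-1), the n-gram factorization above
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= pair_counts[(w1, w2)] / max(context_counts[w1], 1)
    return p

train([["我", "爱", "北京"], ["我", "爱", "上海"]])
print(sentence_prob(["我", "爱", "北京"]))  # 0.5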
(6) Pycorrector
Pycorrector is a Chinese text error correction tool. It can correct wrongly written characters with similar pronunciations or similar shapes (including variant characters), and handles errors from Chinese pinyin and stroke-based input methods. Built on Chinese corpus data, it implements character error correction with KenLM and deep learning models.
The neural network language model solves the data sparsity that an n-gram model suffers when n is large. The Neural Network Language Model (NNLM) models the N-gram language model and estimates the probability P(wi|wi-n+1, wi-n+2, ..., wi-1). Unlike a statistical language model, the NNLM does not estimate the N-gram conditional probabilities by counting but solves them directly with a neural network.
Structurally, the NNLM has three layers: an input layer, a hidden layer, and an output layer. For example, to generate the sentence "the People's Republic of China", suppose the first four characters have been generated and the fifth is to be output conditioned on them. The input layer maps the raw words to vectors; these vectors are obtained by random initialization and are themselves parameters (weights) adjusted during training. The word vectors are concatenated at the input layer and fed into the hidden layer, which applies a nonlinear activation function. The output layer is a vector of dictionary size, representing the probability of each dictionary word being the fifth word generated by the language model.
The NNLM is also an N-gram language model. For example, suppose we train a 7-gram language model: every run of 7 consecutive words in the training corpus becomes a sample, where the first 6 words are the input and the seventh word is the correct output, and the model predicts the probability distribution of the seventh word over the dictionary from the 6 input words. Over the entire corpus, the language model needs to maximize:
∑ wi∈D log P(wi | wi-n+1, ..., wi-1)
the training uses a stochastic gradient descent method (i.e., deriving extrema) to optimize the objective function.
Next, a text error correction method and a text error correction apparatus according to the present invention will be described.
The inventors found that existing solutions target general Chinese grammar errors. In a Chinese text error correction task, the common error types include the following:
(1) homophone errors, such as "with eyes" written for "with glasses";
(2) confusable near-sound words, such as "wandering girl" written for "cowherd girl";
(3) reversed word order, such as "Woody Allen" written as "Allen Woody";
(4) incomplete words, where a character is missing from a phrase;
(5) visually similar characters, such as "jowar" written for "sorghum";
(6) Chinese pinyin full spellings, such as "xingfu" for "happiness";
(7) Chinese pinyin abbreviations, such as "sz" for "Shenzhen"; and
(8) grammatical errors, such as "imagination is hard" for "hard to imagine".
The inventors further found that in the business scenario of IPO prospectuses these low-level errors occur less frequently, while error correction in certain professional domains performs very poorly, for example failing at contextual-logic judgments within documents.
Having identified these technical problems, the inventors propose the text error correction method and text error correction device of the present invention. The main technical idea is to approach error correction more from the semantic and grammatical level: first train a language model on clean, domain-specific text, then automatically generate erroneous/correct sentence pairs, and fine-tune on them to obtain the error correction model. In other words, the main solution of the present invention is to build a language model on a specific type of text (e.g., IPO prospectuses), construct a Seq2Seq (end-to-end) error correction task using, for example, characters with similar word meanings or similar pinyin, and thereby build an error correction model, so that an input text of that specific type can be parsed and corrected by the model.
Fig. 1 is a schematic diagram showing a flow of a text error correction method of the present invention.
As shown in fig. 1, the text error correction method of the present invention includes:
language model construction step S100: collecting a specific type of text, parsing it to form a document corpus for that text type, and, based on the document corpus, training with a pre-training language model to obtain a language model for the specific type of text;
text error correction model construction step S200: constructing a text error correction task, finely adjusting the language model obtained in the language model constructing step S100, and constructing an end-to-end text error correction model; and
text correction step S300: inputting the text to be corrected into the text correction model obtained in the text correction model building step S200 to obtain the corrected text and the position information of the error.
In step S100, a pre-training language model based on Bert is used. A transformer model is used in the language model building step S100 and the text error correction model building step S200.
Fig. 2 is a schematic diagram showing a specific flow of the language model construction step S100 in the text error correction method of the present invention.
Specifically, as shown in fig. 2, the language model building step S100 includes the following sub-steps:
a collecting substep S110 of collecting a specific type of text;
an analysis substep S120, parsing the specific type of text and cutting it into paragraphs of a specified length to form a document corpus for that text type (a minimal chunking sketch follows this list); and
and a training substep S130, based on the document corpus, utilizing a Bert pre-training language model and a random Chinese full word masking mode to train, and obtaining a language model based on the specific type of text.
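The chunking of substep S120 might look like the following sketch; the 510-character limit (leaving room for BERT's [CLS]/[SEP] tokens), the newline-based paragraph split, and the file name are assumptions, since the patent only specifies cutting by a prescribed length.

def split_into_paragraphs(text: str, max_len: int = 510):
    # Merge short paragraphs and slice long ones so every chunk fits max_len.
    chunks, buf = [], ""
    for para in text.split("\n"):
        para = para.strip()
        if not para:
            continue
        if len(buf) + len(para) <= max_len:
            buf += para
            continue
        if buf:
            chunks.append(buf)
        while len(para) > max_len:          # slice paragraphs longer than max_len
            chunks.append(para[:max_len])
            para = para[max_len:]
        buf = para
    if buf:
        chunks.append(buf)
    return chunks

corpus = split_into_paragraphs(open("ipo_document.txt", encoding="utf-8").read())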
Fig. 3 is a schematic diagram showing a specific flow of the text correction model building step S200 in the text correction method of the present invention.
As shown in fig. 3, the text correction model building step S200 includes the following sub-steps:
a task construction substep S210, which generates sentence pairs of erroneous corpora in a self-supervised manner to construct a training task; and
a fine-tuning training substep S220, performing downstream-task fine-tuning on the training task, thereby establishing an end-to-end text error correction model.
In the task construction substep S210, the characters masked by the random Chinese whole-word masking in the language model obtained in the language model construction step are randomly replaced with characters of a prescribed type, generating sentence pairs of erroneous corpora to construct the training task (see the sketch after the list below). The prescribed types of characters include one or more of the following:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
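A hedged sketch of the self-supervised pair generation: corrupt clean sentences by swapping characters using confusion sets of the kinds listed above. The tiny CONFUSION table is a hypothetical stand-in for a real resource built from pinyin tables and word-vector similarity.

import random

CONFUSION = {          # hypothetical confusion sets (homophones / similar shapes)
    "股": ["古", "骨"],
    "司": ["思", "私"],
    "资": ["姿", "兹"],
}

def corrupt(sentence: str, rate: float = 0.15):
    chars = list(sentence)
    for i, ch in enumerate(chars):
        if ch in CONFUSION and random.random() < rate:
            chars[i] = random.choice(CONFUSION[ch])
    return "".join(chars)

correct = "公司拟公开发行股票并募集资金。"
wrong = corrupt(correct)
print(wrong, "->", correct)   # one (wrong, correct) sentence pair for training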
Next, a text error correction method according to an example of the present invention will be described.
A text error correction method of one example of the present invention includes the steps of:
(1) language model construction procedure
Extract text and build a language model for the specific type. As an example, building the language model in the IPO scenario specifically includes: collecting all published IPO prospectus files, parsing the PDF files and converting them into text, cutting the text into paragraphs according to the length requirement to form a basic IPO document corpus, and, starting from the pre-trained parameters of Chinese BERT, training a large-scale IPO-corpus language model using random Chinese whole-word masking and a distributed training scheme.
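This pretraining step could be realized along the following lines with the Hugging Face transformers and datasets libraries; the paths, hyperparameters, and the use of DataCollatorForWholeWordMask are assumptions (for Chinese text, whole-word masking additionally needs word-segmentation reference data, omitted here), and the distributed-training setup is left out for brevity.

from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForWholeWordMask, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")  # start from Chinese BERT weights

dataset = load_dataset("text", data_files="ipo_corpus.txt")["train"]  # one paragraph per line (assumed path)

def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ipo-bert", per_device_train_batch_size=16,
                           num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset.map(encode, batched=True),
)
trainer.train()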
(2) Text error correction model construction step
Construct a text error correction task and train an end-to-end error correction model, specifically: randomly replace the [MASK] characters in the language model with words having similar word vectors or similar pinyin, thereby automatically generating sentence pairs of erroneous corpora, and perform downstream-task fine-tuning (fine-tune) on this task to build an end-to-end Seq2Seq text error correction model.
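A non-authoritative sketch of this fine-tuning step: tie two copies of the domain-pretrained BERT into an encoder-decoder and train on the generated sentence pairs. The EncoderDecoderModel warm start is one plausible realization ("ipo-bert" is the assumed output of the previous step); the patent does not prescribe this exact architecture.

from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("ipo-bert", "ipo-bert")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

wrong, correct = "公思拟公开发行古票。", "公司拟公开发行股票。"  # one self-generated pair
inputs = tokenizer(wrong, return_tensors="pt")
labels = tokenizer(correct, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss   # one fine-tuning step's loss
loss.backward()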
(3) Text error correction procedure
Input the text to be corrected into the trained text error correction model to obtain the corrected text and the position information of the errors. For example, when the client submits a paragraph of text, the server uses the trained model to return the corrected text and the positions of the errors.
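Served behind an HTTP endpoint, this step might look like the following Flask sketch; the route, payload shape, and the run_correction wrapper around the trained model are all hypothetical.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/correct", methods=["POST"])
def correct():
    text = request.get_json()["text"]
    corrected, errors = run_correction(text)   # hypothetical call into the trained model
    return jsonify({
        "corrected": corrected,
        "errors": [{"pos": p, "wrong": w, "right": r} for p, w, r in errors],
    })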
The language model and the end-to-end error correction model in the above steps both use the transformer deep learning model; the invention uses the prior-art transformer model. Like most seq2seq models, the transformer consists of an encoder and a decoder, which are described as follows:
① Encoder
The encoder consists of N = 6 identical layers, each layer comprising two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network.
Each sub-layer is wrapped with a residual connection and layer normalization, so the output of a sub-layer can be expressed as:
sub_layer_output = LayerNorm(x + SubLayer(x))
Multi-head attention projects Q, K, and V through h different linear transformations and finally concatenates the different attention results:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
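The encoder sub-layers above can be condensed into the following PyTorch sketch of multi-head attention with the residual-plus-LayerNorm wrapper; the dimensions follow the common d_model = 512, h = 8 choice and are assumptions here.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    # MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.w_q, self.w_k = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.w_v, self.w_o = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B, L, _ = q.shape
        split = lambda x: x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # scaled dot-product
        heads = torch.softmax(scores, dim=-1) @ v            # h attention results
        return self.w_o(heads.transpose(1, 2).reshape(B, L, -1))  # Concat(...) W^O

x = torch.randn(2, 10, 512)
attn = MultiHeadAttention()
out = nn.LayerNorm(512)(x + attn(x, x, x))   # sub_layer_output = LayerNorm(x + SubLayer(x))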
② decoder
The decoder has almost the same structure as the encoder but contains one additional attention sub-layer. We first specify the decoder's input, output, and decoding process:
Output: the probability distribution of the output word at position i.
Input: the output of the encoder and the decoder's own output at position i-1. The middle attention sub-layer is therefore not self-attention: its K and V come from the encoder, while Q comes from the decoder's output at the previous position.
Decoding: training and prediction differ. During training, all positions are decoded at once, with the ground truth of the previous steps used for prediction (the mask matrix is also changed so that future tokens cannot be seen during decoding); during prediction there is no ground truth, so tokens must be predicted one by one.
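The difference between the two decoding regimes can be made concrete with a causal mask and a greedy loop; greedy_decode and its model argument are hypothetical illustrations, not the patent's code.

import torch

# Training: one parallel pass; a lower-triangular mask hides future tokens.
L = 5
causal_mask = torch.tril(torch.ones(L, L)).bool()   # position i attends only to j <= i

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    # Prediction: no ground truth, so tokens are generated one by one.
    ys = [bos_id]
    for _ in range(max_len):
        logits = model(src, torch.tensor([ys]))     # assumed shape (1, len(ys), vocab)
        nxt = int(logits[0, -1].argmax())
        ys.append(nxt)
        if nxt == eos_id:
            break
    return ys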
The text error correction method of the present invention is explained above, and next, the text error correction device of the present invention is explained.
Fig. 4 is a schematic diagram showing the structure of the text correction device of the present invention.
As shown in fig. 4, the text correction apparatus of the present invention includes:
a language model construction module 100, configured to collect a specific type of text and analyze the specific type of text to form a document corpus for the specific type of text, and perform training by using a pre-training language model based on the document corpus to obtain a language model for the specific type of text;
the text error correction model building module 200 is configured to build a text error correction task and fine-tune the language model obtained by the language model building module 100 to build an end-to-end text error correction model; and
the text error correction module 300 is configured to input a text to be error corrected into the text error correction model obtained in the text error correction model building module 200, so as to obtain an error corrected text.
In this way, according to the text error correction device of the present invention, the corrected text can be obtained by inputting the text to be corrected into the text error correction module 300. Further, the text error correction module 300 can also output the position information of the errors at the same time.
Wherein, the language model building module 100 adopts a pretrained language model based on Bert.
A transformer model is adopted in the language model building module 100 and the text error correction model building module 200.
Specifically, the language model building module 100 includes:
a collecting sub-module 110 for collecting a specific type of text;
the parsing submodule 120 is configured to parse the text of the specific type, and cut the text into paragraphs according to a predetermined length to form a document corpus for the text of the specific type; and
and the training sub-module 130 is configured to perform training by using a Bert pre-training language model and a random chinese full-word masking method based on the document corpus to obtain a language model based on the specific type of text.
The text error correction model building module 200 includes:
the task construction submodule 210, configured to generate sentence pairs of erroneous corpora in a self-supervised manner to construct a training task; and
the fine-tuning training submodule 220, configured to perform downstream-task fine-tuning on the training task, so as to establish an end-to-end text error correction model.
In the task construction submodule 210, the characters masked by the random Chinese whole-word masking in the language model obtained by the language model building module 100 are randomly replaced with characters of a prescribed type, generating sentence pairs of erroneous corpora to construct the training task. As an example, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
As described above, the text error correction method and the text error correction device of the present invention construct sentence pairs from similar word vectors or homophones in the language model's downstream task and complete end-to-end error correction model training. Of course, besides similar word vectors and homophones, as other variants or alternatives, the training task can also be constructed with similar words and the like.
By applying the text error correction method and the text error correction device of the present invention, pre-training and transformer models are applied to the field of IPO prospectuses, so that a language model for this field can be trained, while erroneous text sentence pairs are generated in a self-supervised manner for targeted downstream training tasks.
Compared with traditional statistical error correction models, the text error correction method and the text error correction device of the present invention achieve better precision and recall, cover most error scenarios, address the pain point that IPO prospectus verification requires extensive manual work, and thus have huge commercial application value. Moreover, since the method is implemented with a transformer model and the corpora need not be labeled manually, a large amount of labor cost is saved.
The invention also provides a computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the text error correction method.
The present invention also provides a computer device comprising: a memory; a processor; and a computer program stored on the memory and executable on the processor, wherein execution of the computer program by the processor implements the text error correction method.
The above examples mainly describe the text error correction method and the text error correction apparatus of the present invention. Although only a few embodiments of the present invention have been described in detail, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (16)

1. A text error correction method, comprising:
a language model construction step, namely collecting a specific type of text, analyzing the specific type of text to form a document corpus aiming at the specific type of text, and training by utilizing a pre-training language model based on the document corpus to obtain the language model aiming at the specific type of text;
a text error correction model construction step, namely constructing a text error correction task, and finely adjusting the language model obtained in the language model construction step to construct and obtain an end-to-end text error correction model; and
and a text error correction step, namely inputting the text to be corrected into the text error correction model obtained in the text error correction model building step, and obtaining the corrected text and the error position information.
2. The text correction method of claim 1,
and adopting a pretrained language model based on Bert in the language model construction step.
3. The text correction method of claim 1,
a transformer model is adopted in the language model building step and the text error correction model building step.
4. The text correction method of claim 1, wherein the language model building step comprises:
a collecting sub-step of collecting a specific type of text;
an analysis substep, analyzing the text of the specific type, cutting the text into paragraphs according to a specified length, and forming a document corpus aiming at the text of the specific type; and
and a training substep, based on the document corpus, utilizing a Bert pre-training language model and a random Chinese full word masking mode to train, and obtaining a language model based on the specific type of text.
5. The text correction method of claim 4, wherein the text correction model constructing step comprises:
a task construction sub-step, namely generating sentence pairs of wrong corpora by adopting a self-supervision mode to construct a training task; and
and a fine tuning training substep, performing fine tuning of a downstream task above the training task, thereby establishing an end-to-end text error correction model.
6. The text correction method of claim 5,
in the task construction substep, the characters masked by the random Chinese whole-word masking in the language model obtained in the language model construction step are randomly replaced with characters of a prescribed type, and sentence pairs of erroneous corpora are generated to construct the training task.
7. The text correction method of claim 6,
in the task construction substep, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
8. A text correction apparatus, comprising:
the language model construction module is used for collecting the text of a specific type, analyzing the text of the specific type to form a document corpus aiming at the text of the specific type, and training by utilizing a pre-training language model based on the document corpus to obtain the language model aiming at the text of the specific type;
the text error correction model building module is used for building a text error correction task and finely tuning the language model obtained by the language model building module so as to build an end-to-end text error correction model; and
and the text error correction module is used for inputting the text to be corrected into the text error correction model obtained in the text error correction model building module so as to obtain the corrected text and the position information of the error.
9. The text correction apparatus of claim 8,
and adopting a pretrained language model based on Bert in the language model building module.
10. The text correction apparatus of claim 8,
and a transformer model is adopted in the language model building module and the text error correction model building module.
11. The text correction apparatus of claim 8, wherein the language model building module comprises:
the collection submodule is used for collecting the specific type of text;
the analysis submodule is used for analyzing the text of the specific type, cutting the text into paragraphs according to the specified length and forming a document corpus aiming at the text of the specific type; and
and the training sub-module is used for training by utilizing a Bert pre-training language model and a random Chinese full word masking mode based on the document corpus to obtain a language model based on the specific type of text.
12. The text correction apparatus of claim 11, wherein the text correction model building module comprises:
the task construction submodule is used for generating sentence pairs of error linguistic data in a self-supervision mode to construct a training task; and
and the fine tuning training submodule is used for fine tuning of a downstream task above the training task so as to establish an end-to-end text error correction model.
13. The text correction apparatus of claim 12,
in the task construction sub-module, the characters masked by the random Chinese whole-word masking in the language model obtained by the language model building module are randomly replaced with characters of a prescribed type, and sentence pairs of erroneous corpora are generated to construct the training task.
14. The text correction apparatus of claim 13,
in the task building submodule, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
15. A computer-readable medium, having stored thereon a computer program,
the computer program, when executed by a processor, implements a text correction method as claimed in any one of claims 1 to 7.
16. A computer device, comprising: a memory; a processor; and a computer program stored on the memory and executable on the processor, wherein,
the computer program is operated to enable the processor to realize the text error correction method according to any one of claims 1 to 7 when the computer program is executed.
CN202111204836.XA 2021-10-15 2021-10-15 Text error correction method and text error correction device Pending CN113947072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111204836.XA CN113947072A (en) 2021-10-15 2021-10-15 Text error correction method and text error correction device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111204836.XA CN113947072A (en) 2021-10-15 2021-10-15 Text error correction method and text error correction device

Publications (1)

Publication Number Publication Date
CN113947072A true CN113947072A (en) 2022-01-18

Family

ID=79330689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111204836.XA Pending CN113947072A (en) 2021-10-15 2021-10-15 Text error correction method and text error correction device

Country Status (1)

Country Link
CN (1) CN113947072A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861637A (en) * 2022-05-18 2022-08-05 北京百度网讯科技有限公司 Method and device for generating spelling error correction model and method and device for spelling error correction
US11747970B2 (en) 2021-09-23 2023-09-05 International Business Machines Corporation Interactive graphical display of multiple overlapping hypotheses or document versions


Similar Documents

Publication Publication Date Title
US20210390271A1 (en) Neural machine translation systems
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
US20240161732A1 (en) Multi-dialect and multilingual speech recognition
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
Rastogi et al. Multi-task learning for joint language understanding and dialogue state tracking
CN114787914A (en) System and method for streaming end-to-end speech recognition with asynchronous decoder
CN111310440B (en) Text error correction method, device and system
CN111339278B (en) Method and device for generating training speech generating model and method and device for generating answer speech
CN114580382A (en) Text error correction method and device
CN111831789A (en) Question-answer text matching method based on multilayer semantic feature extraction structure
JP2008165786A (en) Sequence classification for machine translation
Hu et al. Misspelling correction with pre-trained contextual language model
JP2008165783A (en) Discriminative training for model for sequence classification
US11797761B2 (en) Device, method and program for natural language processing
CN111625634A (en) Word slot recognition method and device, computer-readable storage medium and electronic device
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN113947072A (en) Text error correction method and text error correction device
Moeng et al. Canonical and surface morphological segmentation for nguni languages
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
US20090240501A1 (en) Automatically generating new words for letter-to-sound conversion
CN112599129B (en) Speech recognition method, apparatus, device and storage medium
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN111783435A (en) Shared vocabulary selection method and device and storage medium
JP2019204415A (en) Wording generation method, wording device and program
Zhang et al. Character-Aware Sub-Word Level Language Modeling for Uyghur and Turkish ASR

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination