CN113947072A - Text error correction method and text error correction device - Google Patents

Text error correction method and text error correction device Download PDF

Info

Publication number
CN113947072A
CN113947072A
Authority
CN
China
Prior art keywords
text
error correction
language model
training
specific type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111204836.XA
Other languages
Chinese (zh)
Inventor
何友鑫
刘传厚
朱星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Shuidi Credit Service Co ltd
Original Assignee
Shanghai Shuidi Credit Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Shuidi Credit Service Co ltd filed Critical Shanghai Shuidi Credit Service Co ltd
Priority to CN202111204836.XA
Publication of CN113947072A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a text error correction method and a text error correction device. The method comprises the following steps: a language model construction step of collecting texts of a specific type, parsing them to form a document corpus for that text type, and training a pre-training language model on the corpus to obtain a language model for the specific text type; a text error correction model construction step of constructing a text error correction task and fine-tuning the language model obtained in the language model construction step to build an end-to-end text error correction model; and a text error correction step of inputting the text to be corrected into the text error correction model obtained in the text error correction model construction step to obtain the corrected text and the position information of the errors. The text error correction method and device address the pain point that IPO prospectus verification requires extensive manual work; since the corpora need not be labeled manually, a large amount of labor cost is saved.

Description

Text error correction method and text error correction device
Technical Field
The present invention relates to the field of language processing technologies, and in particular, to a text error correction method and a text error correction apparatus.
Background
With the development of artificial intelligence and natural language processing technology, natural language processing has become an important direction of artificial intelligence. At present, applying these technologies to Chinese error correction systems mainly follows two technical routes: 1. Chinese error correction based on a language model: this approach relies on a statistical language model trained on large-scale corpora to compute a perplexity score for a sentence, with a threshold set to judge whether the sentence is reasonable natural language. 2. Chinese error correction based on a deep neural network model: the mainstream solution to the typo-correction task is an end-to-end neural sequence generation model (Seq2Seq), which treats Chinese error correction as a machine translation process, i.e., translating a wrong sentence into a correct one. The generation model solves the conversion of a source sequence into a target sequence with an encoder-decoder framework: one RNN (the encoder) represents the input sentence as a vector, and another RNN (the decoder) decodes that vector to produce the target output.
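As a non-authoritative illustration of the Seq2Seq route just described, the following minimal PyTorch sketch shows an RNN encoder-decoder that "translates" an erroneous sentence into a corrected one; the class name, vocabulary size, and dimensions are illustrative assumptions, not part of the invention.

import torch
import torch.nn as nn

class Seq2SeqCorrector(nn.Module):
    # Minimal RNN encoder-decoder sketch (hypothetical, for illustration only).
    def __init__(self, vocab_size=8000, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.embed(src_ids))           # encode the wrong sentence into a vector
        dec_out, _ = self.decoder(self.embed(tgt_ids), h)  # decode conditioned on that vector
        return self.out(dec_out)                           # per-position logits over the vocabulary

model = Seq2SeqCorrector()
src = torch.randint(0, 8000, (2, 12))   # a batch of 2 erroneous sentences (token ids)
tgt = torch.randint(0, 8000, (2, 12))   # teacher-forced correct sentences
logits = model(src, tgt)                # shape (2, 12, 8000)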
In addition, no text error correction method particularly suited to IPO (Initial Public Offering) prospectuses has been proposed, and the following problems remain to be solved:
(1) an IPO prospectus is long and word-heavy: a standard prospectus typically runs 300 to 400 pages and 300,000 to 400,000 characters;
(2) it contains many professional terms and places high demands on the professionalism of the wording; moreover, errors in the text include but are not limited to wrongly written characters, and many are semantic and contextual-logic errors;
(3) an IPO prospectus is usually drafted collaboratively by multiple staff of the sponsoring securities firm, all of which makes later manual verification very difficult;
(4) common errors mainly include: misused words, typographical errors, contextual conflicts, contextual logic errors, punctuation errors, numeric format errors, financial common-sense errors, and the like.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
In view of the above problems, the present invention is directed to a text error correction method and a text error correction device that can be applied to correcting IPO prospectus texts.
A text error correction method of an aspect of the present invention is characterized in that,
a language model construction step, namely collecting a specific type of text, analyzing the specific type of text to form a document corpus aiming at the specific type of text, and training by utilizing a pre-training language model based on the document corpus to obtain the language model aiming at the specific type of text;
a text error correction model construction step, namely constructing a text error correction task, finely adjusting the language model obtained in the language model construction step, and constructing an end-to-end text error correction model; and
and a text error correction step of inputting the text to be corrected into the text error correction model obtained in the text error correction model construction step to obtain the corrected text and the position information of the error.
Optionally, a Bert-based pre-training language model is adopted in the language model construction step.
Optionally, a transformer model is adopted in the language model building step and the text correction model building step.
Optionally, the language model building step comprises the sub-steps of:
a collecting sub-step of collecting a specific type of text;
an analysis substep, analyzing the text of the specific type, cutting the text into paragraphs according to a specified length, and forming a document corpus aiming at the text of the specific type; and
and a training substep, based on the document corpus, utilizing a Bert pre-training language model and a random Chinese full word masking mode to train, and obtaining a language model based on the specific type of text.
Optionally, the text correction model building step includes the following sub-steps:
a task construction sub-step, namely generating sentence pairs of wrong corpora by adopting a self-supervision mode to construct a training task; and
and a fine tuning training substep, performing fine tuning of a downstream task above the training task, thereby establishing an end-to-end text error correction model.
Optionally, in the task construction substep, the characters masked by the random Chinese whole-word masking in the language model obtained in the language model construction step are randomly replaced with characters of a prescribed type, generating sentence pairs of erroneous corpora to construct the training task.
Optionally, in the task construction substep, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
A text error correction device according to an aspect of the present invention includes:
the language model construction module is used for collecting the text of a specific type, analyzing the text of the specific type to form a document corpus aiming at the text of the specific type, and training by utilizing a pre-training language model based on the document corpus to obtain the language model aiming at the text of the specific type;
the text error correction model building module is used for building a text error correction task and finely tuning the language model obtained by the language model building module so as to build an end-to-end text error correction model; and
and the text error correction module is used for inputting the text to be corrected into the text error correction model obtained in the text error correction model building module so as to obtain the corrected text and the position information of the error.
Optionally, a Bert-based pre-training language model is employed in the language model building module.
Optionally, a transformer model is adopted in the language model building module and the text correction model building module.
Optionally, the language model building module includes:
the collection submodule is used for collecting the specific type of text;
the analysis submodule is used for analyzing the text of the specific type, cutting the text into paragraphs according to the specified length and forming a document corpus aiming at the text of the specific type; and
and the training sub-module is used for training by utilizing a Bert pre-training language model and a random Chinese full word masking mode based on the document corpus to obtain a language model based on the specific type of text.
Optionally, the text correction model building module includes:
the task construction submodule is used for generating sentence pairs of error linguistic data in a self-supervision mode to construct a training task; and
and the fine tuning training submodule is used for fine tuning of a downstream task above the training task so as to establish an end-to-end text error correction model.
Optionally, in the task construction sub-module, the characters masked by the random Chinese whole-word masking in the language model obtained by the language model building module are randomly replaced with characters of a prescribed type, generating sentence pairs of erroneous corpora to construct the training task.
Optionally, in the task building submodule, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
The computer-readable medium of the present invention has a computer program stored thereon, wherein the computer program, when executed by a processor, implements the text error correction method described above.
The computer device of the present invention comprises: a memory; a processor; and a computer program stored on the memory and executable on the processor, wherein execution of the computer program by the processor implements the text error correction method described above.
As described above, compared with traditional statistical error correction models, the text error correction method and the text error correction device of the present invention achieve better precision and recall, cover most error scenarios, address the pain point that IPO prospectus verification requires extensive manual work, and thus have great commercial application value. Moreover, since the method is implemented with a transformer model and the corpora need not be labeled manually, a large amount of labor cost is saved.
Drawings
Fig. 1 is a schematic diagram showing a flow of a text error correction method of the present invention.
Fig. 2 is a schematic diagram showing a specific flow of the language model construction step S100 in the text error correction method of the present invention.
Fig. 3 is a schematic diagram showing a specific flow of the text correction model building step S200 in the text correction method of the present invention.
Fig. 4 is a schematic diagram showing the structure of the text correction device of the present invention.
Detailed Description
The following description is of some of the several embodiments of the invention and is intended to provide a basic understanding of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention.
For the purposes of brevity and explanation, the principles of the present invention are described herein with reference primarily to exemplary embodiments thereof. However, those skilled in the art will readily recognize that the same principles are equally applicable to all types of text correction methods and text correction apparatuses, and that these same principles, as well as any such variations, may be implemented therein without departing from the true spirit and scope of the present patent application.
Moreover, in the following description, reference is made to the accompanying drawings that illustrate certain exemplary embodiments. Electrical, mechanical, logical, and structural changes may be made to these embodiments without departing from the spirit and scope of the invention. In addition, while a feature of the invention may have been disclosed with respect to only one of several implementations/embodiments, such feature may be combined with one or more other features of the other implementations/embodiments as may be desired and/or advantageous for any given or identified function. The following description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined by the appended claims and their equivalents.
Terms such as "comprising" and "comprises" mean that, in addition to having elements (modules) and steps that are directly and explicitly stated in the description and claims, the solution of the invention does not exclude the presence of other elements (modules) and steps that are not directly or explicitly stated.
Before describing the text error correction method and the text error correction apparatus of the present invention, some technical terms appearing in the present invention will be briefly described.
(1) Language model
A language model is an abstract mathematical model of a language based on objective linguistic facts, i.e., a correspondence relationship. The relationship between a language model and the objective facts of a language is like the relationship between an abstract straight line in mathematics and a concrete straight line.
A language model can estimate the probability of a piece of text and plays an important role in tasks such as information retrieval, machine translation, and speech recognition. Language models fall into statistical language models and neural network language models.
(2) Self-supervised learning task: Fill-MASK
This task is analogous to a cloze test: several positions in a well-formed passage are masked, and the model automatically learns what the masked content should be.
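For concreteness, a Fill-MASK query can be run in a few lines with the Hugging Face transformers library; this is a generic illustration of the task, not the patent's own code, and "bert-base-chinese" is simply a readily available example model.

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-chinese")
# Mask one position of a well-formed sentence and let the model recover it.
for candidate in fill("上海是中国最大的[MASK]市。"):
    print(candidate["token_str"], round(candidate["score"], 3))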
(3)BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding
This paper proposed the MLM (masked language model) pre-training approach built on the transformer model structure, which together form a main cornerstone of current mainstream NLP work.
(4)Revisiting Pre-Trained Models for Chinese Natural Language Processing
This paper proposed a new Chinese pre-training task based on a BERT variant. Its main innovations are Chinese whole-word masking and replacing the [MASK] positions with words whose word vectors are similar to the original Chinese words, with the model's loss based on the similarity between the replacement word vector and the original word.
(5) KenLM language model
A character error correction tool based on a statistical language model.
To determine whether a piece of text is natural language, one can estimate its probability under a probability distribution over texts. The words in a language model are ordered: given m words, whether they form reasonable natural language depends crucially on whether their arrangement is correct. The basic idea of a statistical language model is to compute conditional probabilities. For example, suppose a piece of text consists of the m words w1, w2, ..., wm; to judge whether it is a plausible sentence, its joint probability can be computed by the following formula:
P(w1, w2, ..., wm) = P(w1) P(w2|w1) P(w3|w1, w2) ... P(wm|w1, w2, ..., wm-1)
In practice, if the text is long, estimating P(wi|w1, w2, ..., wi-1) becomes very difficult, so a simplified model appeared: the N-gram language model. It assumes the current word depends only on the n-1 words preceding it and is independent of earlier words, so the formula above can be written as:
P(wi|w1, w2, ..., wi-1) ≈ P(wi|wi-(n-1), ..., wi-1)
the classical statistical language model is the N-gram language model.
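A minimal sketch of this statistical idea, using a bigram (n = 2) model without smoothing; real toolkits such as KenLM add smoothing and backoff, and all names here are illustrative.

from collections import defaultdict

pair_counts = defaultdict(int)     # counts of (w1, w2) bigrams
context_counts = defaultdict(int)  # counts of the conditioning word w1

def train(sentences):
    for words in sentences:
        for w1, w2 in zip(words, words[1:]):
            pair_counts[(w1, w2)] += 1
            context_counts[w1] += 1

def sentence_prob(words):
    # P(w1..wm) ≈ product of P(wi | wi-1), the n-gram factorization above
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= pair_counts[(w1, w2)] / max(context_counts[w1], 1)
    return p

train([["我", "爱", "北京"], ["我", "爱", "上海"]])
print(sentence_prob(["我", "爱", "北京"]))  # 0.5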
(6) Pycorrector
Pycorrector is a Chinese text error correction tool. It can correct wrongly written characters with similar pronunciations or similar shapes (including variant characters), and handles errors from Chinese pinyin and stroke-based input methods. Built on Chinese corpus data, it implements character error correction with KenLM and deep learning models.
The neural network language model solves the data sparsity that an n-gram model suffers when n is large. The Neural Network Language Model (NNLM) models the N-gram language model and estimates the probability P(wi|wi-n+1, wi-n+2, ..., wi-1). Unlike a statistical language model, the NNLM does not estimate the N-gram conditional probabilities by counting but solves them directly with a neural network.
Structurally, the NNLM has three layers: an input layer, a hidden layer, and an output layer. For example, to generate the sentence "the People's Republic of China", suppose the first four characters have been generated and the fifth is to be output conditioned on them. The input layer maps the raw words to vectors; these vectors are obtained by random initialization and are themselves parameters (weights) adjusted during training. The word vectors are concatenated at the input layer and fed into the hidden layer, which applies a nonlinear activation function. The output layer is a vector of dictionary size, representing the probability of each dictionary word being the fifth word generated by the language model.
The NNLM is also an N-gram language model. For example, suppose we train a 7-gram language model: every run of 7 consecutive words in the training corpus becomes a sample, where the first 6 words are the input and the seventh word is the correct output, and the model predicts the probability distribution of the seventh word over the dictionary from the 6 input words. Over the entire corpus, the language model needs to maximize:
∑ wi∈D log P(wi | wi-n+1, ..., wi-1)
the training uses a stochastic gradient descent method (i.e., deriving extrema) to optimize the objective function.
Next, a text error correction method and a text error correction apparatus according to the present invention will be described.
The inventors found that existing solutions target general Chinese grammar errors. In a Chinese text error correction task, the common error types include the following:
(1) homophone errors, such as "with eyes" written for "with glasses";
(2) confusable near-sound words, such as "wandering girl" written for "cowherd girl";
(3) reversed word order, such as "Woody Allen" written as "Allen Woody";
(4) incomplete words, where a character is missing from a phrase;
(5) visually similar characters, such as "jowar" written for "sorghum";
(6) Chinese pinyin full spellings, such as "xingfu" for "happiness";
(7) Chinese pinyin abbreviations, such as "sz" for "Shenzhen"; and
(8) grammatical errors, such as "imagination is hard" for "hard to imagine".
The inventors further found that in the business scenario of IPO prospectuses these low-level errors occur less frequently, while error correction in certain professional domains performs very poorly, for example failing at contextual-logic judgments within documents.
Having identified these technical problems, the inventors propose the text error correction method and text error correction device of the present invention. The main technical idea is to approach error correction more from the semantic and grammatical level: first train a language model on clean, domain-specific text, then automatically generate erroneous/correct sentence pairs, and fine-tune on them to obtain the error correction model. In other words, the main solution of the present invention is to build a language model on a specific type of text (e.g., IPO prospectuses), construct a Seq2Seq (end-to-end) error correction task using, for example, characters with similar word meanings or similar pinyin, and thereby build an error correction model, so that an input text of that specific type can be parsed and corrected by the model.
Fig. 1 is a schematic diagram showing a flow of a text error correction method of the present invention.
As shown in fig. 1, the text error correction method of the present invention includes:
language model construction step S100: collecting a specific type of text, parsing it to form a document corpus for that text type, and, based on the document corpus, training with a pre-training language model to obtain a language model for the specific type of text;
text error correction model construction step S200: constructing a text error correction task, finely adjusting the language model obtained in the language model constructing step S100, and constructing an end-to-end text error correction model; and
text correction step S300: inputting the text to be corrected into the text correction model obtained in the text correction model building step S200 to obtain the corrected text and the position information of the error.
In step S100, a pre-training language model based on Bert is used. A transformer model is used in the language model building step S100 and the text error correction model building step S200.
Fig. 2 is a schematic diagram showing a specific flow of the language model construction step S100 in the text error correction method of the present invention.
Specifically, as shown in fig. 2, the language model building step S100 includes the following sub-steps:
a collecting substep S110 of collecting a specific type of text;
an analysis substep S120, parsing the specific type of text and cutting it into paragraphs of a specified length to form a document corpus for that text type (a minimal chunking sketch follows this list); and
and a training substep S130, based on the document corpus, utilizing a Bert pre-training language model and a random Chinese full word masking mode to train, and obtaining a language model based on the specific type of text.
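The chunking of substep S120 might look like the following sketch; the 510-character limit (leaving room for BERT's [CLS]/[SEP] tokens), the newline-based paragraph split, and the file name are assumptions, since the patent only specifies cutting by a prescribed length.

def split_into_paragraphs(text: str, max_len: int = 510):
    # Merge short paragraphs and slice long ones so every chunk fits max_len.
    chunks, buf = [], ""
    for para in text.split("\n"):
        para = para.strip()
        if not para:
            continue
        if len(buf) + len(para) <= max_len:
            buf += para
            continue
        if buf:
            chunks.append(buf)
        while len(para) > max_len:          # slice paragraphs longer than max_len
            chunks.append(para[:max_len])
            para = para[max_len:]
        buf = para
    if buf:
        chunks.append(buf)
    return chunks

corpus = split_into_paragraphs(open("ipo_document.txt", encoding="utf-8").read())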
Fig. 3 is a schematic diagram showing a specific flow of the text correction model building step S200 in the text correction method of the present invention.
As shown in fig. 3, the text correction model building step S200 includes the following sub-steps:
a task construction substep S210, which generates sentence pairs of erroneous corpora in a self-supervised manner to construct a training task; and
a fine-tuning training substep S220, performing downstream-task fine-tuning on the training task, thereby establishing an end-to-end text error correction model.
In the task construction substep S210, the characters masked by the random Chinese whole-word masking in the language model obtained in the language model construction step are randomly replaced with characters of a prescribed type, generating sentence pairs of erroneous corpora to construct the training task (see the sketch after the list below). The prescribed types of characters include one or more of the following:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
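A hedged sketch of the self-supervised pair generation: corrupt clean sentences by swapping characters using confusion sets of the kinds listed above. The tiny CONFUSION table is a hypothetical stand-in for a real resource built from pinyin tables and word-vector similarity.

import random

CONFUSION = {          # hypothetical confusion sets (homophones / similar shapes)
    "股": ["古", "骨"],
    "司": ["思", "私"],
    "资": ["姿", "兹"],
}

def corrupt(sentence: str, rate: float = 0.15):
    chars = list(sentence)
    for i, ch in enumerate(chars):
        if ch in CONFUSION and random.random() < rate:
            chars[i] = random.choice(CONFUSION[ch])
    return "".join(chars)

correct = "公司拟公开发行股票并募集资金。"
wrong = corrupt(correct)
print(wrong, "->", correct)   # one (wrong, correct) sentence pair for training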
Next, a text error correction method according to an example of the present invention will be described.
A text error correction method of one example of the present invention includes the steps of:
(1) language model construction procedure
Extract text and build a language model for the specific type. As an example, building the language model in the IPO scenario specifically includes: collecting all published IPO prospectus files, parsing the PDF files and converting them into text, cutting the text into paragraphs according to the length requirement to form a basic IPO document corpus, and, starting from the pre-trained parameters of Chinese BERT, training a large-scale IPO-corpus language model using random Chinese whole-word masking and a distributed training scheme.
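This pretraining step could be realized along the following lines with the Hugging Face transformers and datasets libraries; the paths, hyperparameters, and the use of DataCollatorForWholeWordMask are assumptions (for Chinese text, whole-word masking additionally needs word-segmentation reference data, omitted here), and the distributed-training setup is left out for brevity.

from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForWholeWordMask, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")  # start from Chinese BERT weights

dataset = load_dataset("text", data_files="ipo_corpus.txt")["train"]  # one paragraph per line (assumed path)

def encode(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ipo-bert", per_device_train_batch_size=16,
                           num_train_epochs=1),
    data_collator=collator,
    train_dataset=dataset.map(encode, batched=True),
)
trainer.train()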
(2) Text error correction model construction step
Construct a text error correction task and train an end-to-end error correction model, specifically: randomly replace the [MASK] characters in the language model with words having similar word vectors or similar pinyin, thereby automatically generating sentence pairs of erroneous corpora, and perform downstream-task fine-tuning (fine-tune) on this task to build an end-to-end Seq2Seq text error correction model.
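A non-authoritative sketch of this fine-tuning step: tie two copies of the domain-pretrained BERT into an encoder-decoder and train on the generated sentence pairs. The EncoderDecoderModel warm start is one plausible realization ("ipo-bert" is the assumed output of the previous step); the patent does not prescribe this exact architecture.

from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = EncoderDecoderModel.from_encoder_decoder_pretrained("ipo-bert", "ipo-bert")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

wrong, correct = "公思拟公开发行古票。", "公司拟公开发行股票。"  # one self-generated pair
inputs = tokenizer(wrong, return_tensors="pt")
labels = tokenizer(correct, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss   # one fine-tuning step's loss
loss.backward()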
(3) Text error correction procedure
Input the text to be corrected into the trained text error correction model to obtain the corrected text and the position information of the errors. For example, when the client submits a paragraph of text, the server uses the trained model to return the corrected text and the positions of the errors.
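Served behind an HTTP endpoint, this step might look like the following Flask sketch; the route, payload shape, and the run_correction wrapper around the trained model are all hypothetical.

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/correct", methods=["POST"])
def correct():
    text = request.get_json()["text"]
    corrected, errors = run_correction(text)   # hypothetical call into the trained model
    return jsonify({
        "corrected": corrected,
        "errors": [{"pos": p, "wrong": w, "right": r} for p, w, r in errors],
    })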
The language model and the end-to-end error correction model in the above steps both use the transformer deep learning model; the invention uses the prior-art transformer model. Like most seq2seq models, the transformer consists of an encoder and a decoder, which are described as follows:
① Encoder
The encoder consists of N = 6 identical layers, each layer comprising two sub-layers: a multi-head self-attention mechanism and a fully connected feed-forward network.
Each sub-layer is wrapped with a residual connection and layer normalization, so the output of a sub-layer can be expressed as:
sub_layer_output = LayerNorm(x + SubLayer(x))
Multi-head attention projects Q, K, and V through h different linear transformations and finally concatenates the different attention results:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
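The encoder sub-layers above can be condensed into the following PyTorch sketch of multi-head attention with the residual-plus-LayerNorm wrapper; the dimensions follow the common d_model = 512, h = 8 choice and are assumptions here.

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    # MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O
    def __init__(self, d_model=512, h=8):
        super().__init__()
        self.h, self.d_k = h, d_model // h
        self.w_q, self.w_k = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.w_v, self.w_o = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        B, L, _ = q.shape
        split = lambda x: x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # scaled dot-product
        heads = torch.softmax(scores, dim=-1) @ v            # h attention results
        return self.w_o(heads.transpose(1, 2).reshape(B, L, -1))  # Concat(...) W^O

x = torch.randn(2, 10, 512)
attn = MultiHeadAttention()
out = nn.LayerNorm(512)(x + attn(x, x, x))   # sub_layer_output = LayerNorm(x + SubLayer(x))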
② decoder
The decoder has almost the same structure as the encoder but contains one additional attention sub-layer. We first specify the decoder's input, output, and decoding process:
Output: the probability distribution of the output word at position i.
Input: the output of the encoder and the decoder's own output at position i-1. The middle attention sub-layer is therefore not self-attention: its K and V come from the encoder, while Q comes from the decoder's output at the previous position.
Decoding: training and prediction differ. During training, all positions are decoded at once, with the ground truth of the previous steps used for prediction (the mask matrix is also changed so that future tokens cannot be seen during decoding); during prediction there is no ground truth, so tokens must be predicted one by one.
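The difference between the two decoding regimes can be made concrete with a causal mask and a greedy loop; greedy_decode and its model argument are hypothetical illustrations, not the patent's code.

import torch

# Training: one parallel pass; a lower-triangular mask hides future tokens.
L = 5
causal_mask = torch.tril(torch.ones(L, L)).bool()   # position i attends only to j <= i

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    # Prediction: no ground truth, so tokens are generated one by one.
    ys = [bos_id]
    for _ in range(max_len):
        logits = model(src, torch.tensor([ys]))     # assumed shape (1, len(ys), vocab)
        nxt = int(logits[0, -1].argmax())
        ys.append(nxt)
        if nxt == eos_id:
            break
    return ys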
The text error correction method of the present invention is explained above, and next, the text error correction device of the present invention is explained.
Fig. 4 is a schematic diagram showing the structure of the text correction device of the present invention.
As shown in fig. 4, the text correction apparatus of the present invention includes:
a language model construction module 100, configured to collect a specific type of text and analyze the specific type of text to form a document corpus for the specific type of text, and perform training by using a pre-training language model based on the document corpus to obtain a language model for the specific type of text;
the text error correction model building module 200 is configured to build a text error correction task and fine-tune the language model obtained by the language model building module 100 to build an end-to-end text error correction model; and
the text error correction module 300 is configured to input a text to be error corrected into the text error correction model obtained in the text error correction model building module 200, so as to obtain an error corrected text.
In this way, according to the text error correction device of the present invention, the corrected text can be obtained by inputting the text to be corrected into the text error correction module 300. Further, the text error correction module 300 can also output the position information of the errors at the same time.
Wherein, the language model building module 100 adopts a pretrained language model based on Bert.
A transformer model is adopted in the language model building module 100 and the text error correction model building module 200.
Specifically, the language model building module 100 includes:
a collecting sub-module 110 for collecting a specific type of text;
the parsing submodule 120 is configured to parse the text of the specific type, and cut the text into paragraphs according to a predetermined length to form a document corpus for the text of the specific type; and
and the training sub-module 130 is configured to perform training by using a Bert pre-training language model and a random chinese full-word masking method based on the document corpus to obtain a language model based on the specific type of text.
The text error correction model building module 200 includes:
the task construction submodule 210, configured to generate sentence pairs of erroneous corpora in a self-supervised manner to construct a training task; and
the fine-tuning training submodule 220, configured to perform downstream-task fine-tuning on the training task, so as to establish an end-to-end text error correction model.
In the task construction submodule 210, the characters masked by the random Chinese whole-word masking in the language model obtained by the language model building module 100 are randomly replaced with characters of a prescribed type, generating sentence pairs of erroneous corpora to construct the training task. As an example, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
As described above, the text error correction method and the text error correction device of the present invention construct sentence pairs from similar word vectors or homophones in the language model's downstream task and complete end-to-end error correction model training. Of course, besides similar word vectors and homophones, as other variants or alternatives, the training task can also be constructed with similar words and the like.
By applying the text error correction method and the text error correction device of the present invention, pre-training and transformer models are applied to the field of IPO prospectuses, so that a language model for this field can be trained, while erroneous text sentence pairs are generated in a self-supervised manner for targeted downstream training tasks.
Compared with traditional statistical error correction models, the text error correction method and the text error correction device of the present invention achieve better precision and recall, cover most error scenarios, address the pain point that IPO prospectus verification requires extensive manual work, and thus have huge commercial application value. Moreover, since the method is implemented with a transformer model and the corpora need not be labeled manually, a large amount of labor cost is saved.
The invention also provides a computer-readable medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the text error correction method.
The present invention also provides a computer device comprising: a memory; a processor; and a computer program stored on the memory and executable on the processor, wherein execution of the computer program by the processor implements the text error correction method.
The above examples mainly describe the text error correction method and the text error correction apparatus of the present invention. Although only a few embodiments of the present invention have been described in detail, those skilled in the art will appreciate that the present invention may be embodied in many other forms without departing from the spirit or scope thereof. Accordingly, the present examples and embodiments are to be considered as illustrative and not restrictive, and various modifications and substitutions may be made therein without departing from the spirit and scope of the present invention as defined by the appended claims.

Claims (16)

1. A text error correction method, comprising:
a language model construction step, namely collecting a specific type of text, analyzing the specific type of text to form a document corpus aiming at the specific type of text, and training by utilizing a pre-training language model based on the document corpus to obtain the language model aiming at the specific type of text;
a text error correction model construction step, namely constructing a text error correction task, and finely adjusting the language model obtained in the language model construction step to construct and obtain an end-to-end text error correction model; and
and a text error correction step, namely inputting the text to be corrected into the text error correction model obtained in the text error correction model building step, and obtaining the corrected text and the error position information.
2. The text correction method of claim 1,
and adopting a pretrained language model based on Bert in the language model construction step.
3. The text correction method of claim 1,
a transformer model is adopted in the language model building step and the text error correction model building step.
4. The text correction method of claim 1, wherein the language model building step comprises:
a collecting sub-step of collecting a specific type of text;
an analysis substep, analyzing the text of the specific type, cutting the text into paragraphs according to a specified length, and forming a document corpus aiming at the text of the specific type; and
and a training substep, based on the document corpus, utilizing a Bert pre-training language model and a random Chinese full word masking mode to train, and obtaining a language model based on the specific type of text.
5. The text correction method of claim 4, wherein the text correction model constructing step comprises:
a task construction sub-step, namely generating sentence pairs of wrong corpora by adopting a self-supervision mode to construct a training task; and
and a fine tuning training substep, performing fine tuning of a downstream task above the training task, thereby establishing an end-to-end text error correction model.
6. The text correction method of claim 5,
in the task construction substep, the characters masked by the random Chinese whole-word masking in the language model obtained in the language model construction step are randomly replaced with characters of a prescribed type, and sentence pairs of erroneous corpora are generated to construct the training task.
7. The text correction method of claim 6,
in the task construction substep, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
8. A text correction apparatus, comprising:
the language model construction module is used for collecting the text of a specific type, analyzing the text of the specific type to form a document corpus aiming at the text of the specific type, and training by utilizing a pre-training language model based on the document corpus to obtain the language model aiming at the text of the specific type;
the text error correction model building module is used for building a text error correction task and finely tuning the language model obtained by the language model building module so as to build an end-to-end text error correction model; and
and the text error correction module is used for inputting the text to be corrected into the text error correction model obtained in the text error correction model building module so as to obtain the corrected text and the position information of the error.
9. The text correction apparatus of claim 8,
and adopting a pretrained language model based on Bert in the language model building module.
10. The text correction apparatus of claim 8,
and a transformer model is adopted in the language model building module and the text error correction model building module.
11. The text correction apparatus of claim 8, wherein the language model building module comprises:
the collection submodule is used for collecting the specific type of text;
the analysis submodule is used for analyzing the text of the specific type, cutting the text into paragraphs according to the specified length and forming a document corpus aiming at the text of the specific type; and
and the training sub-module is used for training by utilizing a Bert pre-training language model and a random Chinese full word masking mode based on the document corpus to obtain a language model based on the specific type of text.
12. The text correction apparatus of claim 11, wherein the text correction model building module comprises:
the task construction submodule is used for generating sentence pairs of error linguistic data in a self-supervision mode to construct a training task; and
and the fine tuning training submodule is used for fine tuning of a downstream task above the training task so as to establish an end-to-end text error correction model.
13. The text correction apparatus of claim 12,
in the task construction sub-module, the characters masked by the random Chinese whole-word masking in the language model obtained by the language model building module are randomly replaced with characters of a prescribed type, and sentence pairs of erroneous corpora are generated to construct the training task.
14. The text correction apparatus of claim 13,
in the task building submodule, the prescribed types of characters include one or more of:
characters with similar word vectors;
homophones;
visually similar characters; and
characters with similar pinyin.
15. A computer-readable medium, having stored thereon a computer program,
the computer program, when executed by a processor, implements a text correction method as claimed in any one of claims 1 to 7.
16. A computer device, comprising: a memory; a processor; and a computer program stored on the memory and executable on the processor, wherein,
the computer program is operated to enable the processor to realize the text error correction method according to any one of claims 1 to 7 when the computer program is executed.
CN202111204836.XA 2021-10-15 2021-10-15 Text error correction method and text error correction device Pending CN113947072A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111204836.XA CN113947072A (en) 2021-10-15 2021-10-15 Text error correction method and text error correction device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111204836.XA CN113947072A (en) 2021-10-15 2021-10-15 Text error correction method and text error correction device

Publications (1)

Publication Number Publication Date
CN113947072A true CN113947072A (en) 2022-01-18

Family

ID=79330689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111204836.XA Pending CN113947072A (en) 2021-10-15 2021-10-15 Text error correction method and text error correction device

Country Status (1)

Country Link
CN (1) CN113947072A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861637A (en) * 2022-05-18 2022-08-05 北京百度网讯科技有限公司 Method and device for generating spelling error correction model and method and device for spelling error correction
US11747970B2 (en) 2021-09-23 2023-09-05 International Business Machines Corporation Interactive graphical display of multiple overlapping hypotheses or document versions


Similar Documents

Publication Publication Date Title
US20210390271A1 (en) Neural machine translation systems
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
US20240161732A1 (en) Multi-dialect and multilingual speech recognition
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
Rastogi et al. Multi-task learning for joint language understanding and dialogue state tracking
CN114787914A (en) System and method for streaming end-to-end speech recognition with asynchronous decoder
CN111310440B (en) Text error correction method, device and system
CN111339278B (en) Method and device for generating training speech generating model and method and device for generating answer speech
CN114580382A (en) Text error correction method and device
CN111831789A (en) Question-answer text matching method based on multilayer semantic feature extraction structure
JP2008165786A (en) Sequence classification for machine translation
Hu et al. Misspelling correction with pre-trained contextual language model
JP2008165783A (en) Discriminative training for model for sequence classification
US11797761B2 (en) Device, method and program for natural language processing
CN111625634A (en) Word slot recognition method and device, computer-readable storage medium and electronic device
CN111401084A (en) Method and device for machine translation and computer readable storage medium
CN113947072A (en) Text error correction method and text error correction device
Moeng et al. Canonical and surface morphological segmentation for nguni languages
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
US20090240501A1 (en) Automatically generating new words for letter-to-sound conversion
CN112599129B (en) Speech recognition method, apparatus, device and storage medium
CN114595700A (en) Zero-pronoun and chapter information fused Hanyue neural machine translation method
CN111783435A (en) Shared vocabulary selection method and device and storage medium
JP2019204415A (en) Wording generation method, wording device and program
Zhang et al. Character-Aware Sub-Word Level Language Modeling for Uyghur and Turkish ASR

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination