CN113449514A - Text error correction method and device suitable for specific vertical field - Google Patents


Info

Publication number
CN113449514A
Authority
CN
China
Prior art keywords
error correction
text
word
pinyin
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110687769.5A
Other languages
Chinese (zh)
Other versions
CN113449514B (en)
Inventor
励建科
陈再蝶
朱晓秋
周杰
樊伟东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kangxu Technology Co ltd
Original Assignee
Zhejiang Kangxu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Kangxu Technology Co ltd filed Critical Zhejiang Kangxu Technology Co ltd
Priority to CN202110687769.5A priority Critical patent/CN113449514B/en
Publication of CN113449514A publication Critical patent/CN113449514A/en
Application granted granted Critical
Publication of CN113449514B publication Critical patent/CN113449514B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/232 — Orthographic correction, e.g. spell checking or vowelisation
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/279 — Recognition of textual entities
    • G06F 40/284 — Lexical analysis, e.g. tokenisation or collocates
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/30 — Semantic analysis
    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a text error correction method and device suitable for a specific vertical field, comprising the following steps: S1, importing the text into a pre-trained Bert error correction model for word-sense error correction; S2, importing the text corrected by the Bert error correction model into a pinyin error correction model for secondary error correction; and S3, importing the text after secondary correction by the pinyin error correction model into a hot word replacement rule model for a third round of error correction. In the invention, text entered by the user is first fed into the Bert error correction model for semantic correction; the once-corrected text is then fed into the pinyin error correction model for secondary correction, so that after the semantics of the text are corrected, the proper nouns of the vertical field are corrected as a reinforcement step, improving the accuracy of error correction. The twice-corrected text is then fed into the hot word replacement rule model for hot word replacement, which converts dialect and other spoken expressions into the corresponding proper nouns and strengthens the error correction effect once more.

Description

Text error correction method and device suitable for specific vertical field
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a text error correction method and device suitable for a specific vertical field.
Background
Natural Language Processing (NLP) is the branch of artificial intelligence concerned with analyzing human language. Modern NLP is a hybrid discipline combining linguistics, computer science, and machine learning. For NLP systems to respond accurately to input text, the text must first be corrected to reduce noise. At present, text error correction focuses mainly on semantic analysis to find and replace wrongly written characters, and the error correction models on the market fall mainly into machine learning and deep learning approaches.
However, firstly, machine learning models often fail to fit the data, yielding low accuracy, while deep learning models require large amounts of accurate corpora and long training times; in a vertical field, corpus noise means that the accuracy of common deep models still needs improvement.
Secondly, a vertical field contains many proper nouns specific to its scenario. Wrongly written characters inside proper nouns are hard to detect by semantic error correction alone, and a model may even change correct characters into wrong ones based on its corpus.
Finally, because of dialects or personal habits, the same thing may have several names. These names introduce noise and make it difficult for NLP to extract correct information, yet they are not strictly wrong, so general error correction rarely reacts to them.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, a text error correction method and a corresponding error correction device suitable for a specific vertical field are provided.
In order to achieve the purpose, the invention adopts the following technical scheme:
a text error correction method suitable for a specific vertical field comprises the following steps:
S1, importing the text into a pre-trained Bert error correction model for word-sense error correction;
S11, segmenting the text into short sentences according to punctuation marks;
S12, masking the first character in the short sentence;
S13, predicting the masked character with the pre-trained Bert error correction model and storing all prediction results in a first list, the results being sorted by prediction score from high to low;
S131, if the masked character is in the first list, regarding the masked character as correct;
S132, if the masked character is not in the first list, obtaining, according to its pinyin, all common characters pronounced the same as the masked character and storing them in a second list;
S1321, if a character appears in both the first list and the second list, regarding the masked character as wrongly written and replacing it with the highest-scoring such character from the first list, thereby achieving error correction;
S1322, if the first list and the second list share no character, regarding the masked character as correct;
S14, after the first character of the short sentence has been judged, masking the next character in the short sentence and repeating step S13 until every Chinese character in the text has been checked and corrected;
S2, importing the text corrected by the Bert error correction model into the pinyin error correction model for secondary error correction;
S21, converting all text corrected by the Bert error correction model into pinyin;
S22, comparing the pinyin of each hot word with the pinyin of the text in order of character count, from fewest to most;
S23, when a hot word's pinyin exactly matches a span of the text's pinyin, replacing that span of the text with the hot word;
S24, repeating steps S22 and S23 until all hot words have been checked;
S3, importing the text after secondary correction by the pinyin error correction model into a hot word replacement rule model for a third round of error correction;
S31, importing the text after secondary correction by the pinyin error correction model into the hot word replacement rule model;
and S32, traversing the text with the key list; when a key (a word needing correction) is detected in the text, replacing it with the corresponding value (the correct word), and outputting the final corrected text.
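The three stages described in S1-S3 can be sketched as a simple pipeline. The stage functions below are hypothetical stand-ins (plain string replacements) for the real Bert model, pinyin model, and rule model; only the chaining order is taken from the method itself.

```python
# Hypothetical sketch of the S1 -> S2 -> S3 correction pipeline.
# Each stage function is a toy stand-in, not the patented model.

def bert_correct(text):
    # stand-in for Bert mask-and-predict semantic correction (S1)
    return text.replace("langage", "language")  # illustrative typo fix

def pinyin_correct(text):
    # stand-in for pinyin-based hot-word matching (S2)
    return text

def hotword_replace(text):
    # stand-in for dictionary-rule replacement (S3)
    return text.replace("private credit", "personal credit")  # illustrative

def correct(text):
    """Apply the three correction stages in order, as the method specifies."""
    for stage in (bert_correct, pinyin_correct, hotword_replace):
        text = stage(text)
    return text
```

A usage note: each stage receives the previous stage's output, so later rule-based stages can repair what the semantic stage missed.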
As a further description of the above technical solution:
the text error correction device comprises a pre-trained Bert error correction model, a pinyin error correction model and a hot word replacement rule model, wherein the Bert error correction model is a Multi-layer bidirectional Transformers encoder, the Embelling of the Bert error correction model is formed by summing three Embelling, the three Embelling are Token Embelling, Segment Embelling and Position Embelling respectively, the Bert error correction model uses Multi _ Head Attenttion for coding, dimension expansion is carried out on the input Embelling, three dimensions of Key, Query and Value are obtained respectively, Multi _ Head division is carried out on each dimension, each divided Head is carried out with other words by self-attribute, new vectors are obtained, the new vectors of each Head are spliced, and linear conversion is carried out through a weight matrix, and a final Multi-Head Attention Value is obtained.
As a further description of the above technical solution:
the pinyin error correction model comprises a database, wherein the database contains hot words in a certain field and corresponding hot word pinyin and word number, and the hot words in the certain field are derived from proper nouns in the field.
As a further description of the above technical solution:
the hot word replacement rule model comprises a dictionary, wherein words needing to be corrected are set as keys in the dictionary, corresponding correct words are set as values, and all the keys are stored in a key list.
As a further description of the above technical solution:
the pre-trained Bert error correction model is pre-trained through two models, wherein the two models comprise a Masked language model and a Next sense prediction;
the Masked language mode inputs randomly covered tokens in a corpus and predicts the randomly covered tokens to pre-train a Bert error correction model;
the Next sense prediction is performed by inputting a sentence a and a sentence B, wherein the sentence B is 50% likely to be the Next sentence of the sentence a and 50% likely to be a random sentence in the corpus, and the Bert error correction model is used for pre-training whether the sentence B is the Next sentence of the sentence a.
As a further description of the above technical solution:
the corpus comprises corpora of the hot words in a vertical field of a certain field.
In summary, due to the adoption of the above technical scheme, the invention has the following beneficial effects. Text entered by the user is first fed into the Bert error correction model for semantic correction; the once-corrected text is then fed into the pinyin error correction model for secondary correction, so that after the semantics are corrected, the proper nouns of the vertical field are corrected as reinforcement, improving the accuracy of error correction. The twice-corrected text is then fed into the hot word replacement rule model for hot word replacement, which converts dialect and other spoken expressions into the corresponding proper nouns and strengthens the correction effect once more. With these three error correction systems, a text not only receives basic semantic correction from context; proper nouns of the vertical field and dialect expressions of the application scenario also receive a degree of replacement correction that a single Bert error correction model can hardly achieve.
Drawings
FIG. 1 is a flowchart illustrating a text error correction method applicable to a specific vertical domain according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating a Bert error correction flow of a text error correction method applicable to a specific vertical domain according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a pinyin error correction flow of a text error correction method applicable to a specific vertical domain according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a flow of hotword replacement rules of a text error correction method applicable to a specific vertical domain according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating a structure of an input part of a Bert error correction model of a text error correction apparatus suitable for a specific vertical domain according to an embodiment of the present invention;
fig. 6 is a schematic flow chart of Multi-Head Attention in the Bert error correction model of a text error correction apparatus suitable for a specific vertical domain according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
Referring to fig. 1-6, the present invention provides a technical solution: a text error correction method suitable for a specific vertical field comprises the following steps:
S1, importing the text into a pre-trained Bert error correction model for word-sense error correction;
S11, segmenting the text into short sentences according to punctuation marks;
S12, masking the first character in the short sentence;
S13, predicting the masked character with the pre-trained Bert error correction model and storing all prediction results in a first list, the results being sorted by prediction score from high to low;
S131, if the masked character is in the first list, regarding the masked character as correct;
S132, if the masked character is not in the first list, obtaining, according to its pinyin, all common characters pronounced the same as the masked character and storing them in a second list;
S1321, if a character appears in both the first list and the second list, regarding the masked character as wrongly written and replacing it with the highest-scoring such character from the first list, thereby achieving error correction;
S1322, if the first list and the second list share no character, regarding the masked character as correct;
S14, after the first character of the short sentence has been judged, masking the next character in the short sentence and repeating step S13 until every Chinese character in the text has been checked and corrected;
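The decision logic of steps S11-S14 can be sketched as follows. Both `predict_top_k` (a stand-in for the real Bert masked-language-model call of S13) and the tiny `homophones` table (a stand-in for the pinyin lookup of S132) are hypothetical; only the branching rules S131-S1322 are taken from the method.

```python
# Minimal sketch of the S13/S131-S1322 per-character decision logic.

def predict_top_k(sentence, pos, k=5):
    # Stand-in: a real implementation would mask sentence[pos] and query a
    # masked language model, returning candidates sorted by score.
    return ["城", "街", "桥"]

# toy homophone table keyed by character (stand-in for a pinyin lookup)
homophones = {"诚": ["城", "成"]}

def correct_char(sentence, pos):
    """Return the corrected character for sentence[pos]."""
    char = sentence[pos]
    first_list = predict_top_k(sentence, pos)      # S13: model predictions
    if char in first_list:                         # S131: already correct
        return char
    second_list = homophones.get(char, [])         # S132: same-pinyin chars
    common = [c for c in first_list if c in second_list]
    if common:                                     # S1321: wrongly written
        return common[0]  # first_list is score-sorted, so take the best
    return char                                    # S1322: regard as correct

def correct_sentence(sentence):
    # S14: slide the mask over every character in turn
    return "".join(correct_char(sentence, i) for i in range(len(sentence)))
```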
S2, importing the text corrected by the Bert error correction model into the pinyin error correction model for secondary error correction, as reinforcement for the vertical field: because many proper nouns specific to this scenario are used, the Bert error correction model may fail to find errors in them, or may even change originally correct characters into wrong ones based on its corpus;
for example, a typing error may replace a character of a bank-card name such as "great wall credit card" with a homophone; semantic correction by the Bert error correction model alone may not perceive such an error, so the pinyin error correction model is used as reinforcement. The proper nouns of the scenario, of which the banking field has all sorts, are stored in a database together with their pinyin and character counts, e.g. ["great wall credit card", "chang+cheng+xin+yong+ka", 5];
S21, converting all text corrected by the Bert error correction model into pinyin;
S22, comparing the pinyin of each hot word with the pinyin of the text in order of character count, from fewest to most;
S23, when a hot word's pinyin exactly matches a span of the text's pinyin, replacing that span of the text with the hot word;
S24, repeating steps S22 and S23 until all hot words have been checked;
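Steps S21-S24 can be sketched as below. The `PINYIN` table is a tiny hand-made stand-in for a real character-to-pinyin conversion (a library such as pypinyin would normally supply this); the hot-word entry follows the database format the text describes.

```python
# Sketch of the pinyin comparison in S21-S24.

# toy character-to-pinyin table (stand-in for a real conversion library)
PINYIN = {"长": "chang", "城": "cheng", "诚": "cheng",
          "信": "xin", "用": "yong", "卡": "ka"}

# hot words stored with pinyin and character count, as in the database
HOTWORDS = [("长城信用卡", "chang+cheng+xin+yong+ka", 5)]

def to_pinyin(text):
    # S21: convert text to pinyin
    return "+".join(PINYIN[c] for c in text)

def pinyin_correct(text):
    # S22: check hot words in order of character count, fewest first
    for word, py, n in sorted(HOTWORDS, key=lambda h: h[2]):
        for i in range(len(text) - n + 1):
            if to_pinyin(text[i:i + n]) == py:   # S23: exact pinyin match
                text = text[:i] + word + text[i + n:]
    return text                                  # S24: all hot words checked
```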
S3, importing the text after secondary correction by the pinyin error correction model into the hot word replacement rule model for a third round of error correction. To further optimize the result, the hot word replacement rule model processes the text after pinyin correction: spoken and dialect expressions may be ignored by the semantic correction of the Bert error correction model, and the pinyin error correction model will also pass over them because their pronunciation differs greatly from that of the proper nouns;
for example, the required text is "personal credit" but the input is "private credit". For the Bert error correction model the semantics of "private credit" are unproblematic, while its pinyin ("si+ren+dai", 3 characters) is obviously different from that of the target ("ge+dai", 2 characters), so pinyin error correction cannot respond either;
for another example, dialects use several different words for the pronoun "I"; neither the Bert error correction model nor the pinyin error correction model can recognize these, so the hot word replacement rule model is used to correct such text and replace the expressions with the words we require;
S31, importing the text after secondary correction by the pinyin error correction model into the hot word replacement rule model;
and S32, traversing the text with the key list; when a key (a word needing correction) is detected in the text, replacing it with the corresponding value (the correct word), and outputting the final corrected text.
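The dictionary replacement of S31-S32 can be sketched as below, reusing the "private credit" → "personal credit" example from the text; the single rule entry is illustrative, not the patent's actual hot-word dictionary.

```python
# Sketch of the S32 rule replacement: words needing correction are keys,
# the canonical forms are values, and all keys sit in a key list.

RULES = {"private credit": "personal credit"}  # key -> value dictionary
KEY_LIST = list(RULES)                         # all keys kept in a key list

def hotword_replace(text):
    for key in KEY_LIST:          # S32: traverse the text with the key list
        if key in text:
            text = text.replace(key, RULES[key])
    return text                   # output the final corrected text
```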
Please refer to fig. 4 and 5. A text error correction device suitable for a specific vertical field comprises a pre-trained Bert error correction model, a pinyin error correction model, and a hot word replacement rule model. The Bert error correction model is a multi-layer bidirectional Transformer encoder whose input embedding is the sum of three embeddings: Token Embedding, Segment Embedding, and Position Embedding. The Bert error correction model encodes with Multi-Head Attention: the input embedding is expanded into the three matrices Key, Query, and Value; each is split into multiple heads; each head performs self-attention against the other tokens to obtain a new vector; the new vectors of all heads are concatenated and linearly transformed by a weight matrix to obtain the final Multi-Head Attention value;
the Bert error correction model relies on Multi-Head Attention and bidirectional encoding to make its unsupervised learning more effective. Because it uses the Transformer, it is more efficient than previous models, can capture longer-distance dependencies, and captures bidirectional context information in the true sense. To make the Bert error correction model perform better in the vertical field, corpora of the relevant vertical field are added to the training corpus, improving the model's recognition capability in that field.
Specifically, the pinyin error correction model comprises a database containing the hot words of a given field together with their pinyin and character counts; the hot words are derived from the proper nouns of that field;
the pinyin error correction model performs secondary correction on text that has already undergone Bert semantic correction, emphasizing the correction of proper nouns of the relevant field. Wrong characters inside proper nouns are hard to detect from context and are therefore likely to be ignored by semantic correction, so the proper nouns are set as hot words: when a hot word's pinyin exactly matches the text's pinyin, the corresponding characters are replaced with the hot word, guaranteeing the correctness of proper-noun text. The approach is also convenient to update: adding or deleting proper nouns in the hot word list completes the update, which saves a great deal of time in fields with frequent product changes, such as banking.
Specifically, the hot word replacement rule model comprises a dictionary in which each word needing correction is set as a key and the corresponding correct word as its value; all keys are stored in a key list.
Specifically, the pre-trained Bert error correction model is pre-trained with two tasks: a Masked Language Model and Next Sentence Prediction;
the Masked Language Model pre-trains the Bert error correction model by randomly masking tokens in the corpus and predicting the masked tokens;
Next Sentence Prediction lets the Bert error correction model pre-train on whether sentence B is the next sentence of sentence A: sentences A and B are input, where B has a 50% probability of being the actual next sentence of A and a 50% probability of being a random sentence from the corpus.
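The construction of Next Sentence Prediction training pairs described above can be sketched as a data-preparation step. This is an illustrative sketch of the 50%/50% sampling rule, not the patent's actual training code; a real implementation would also avoid accidentally sampling the true next sentence in the negative branch.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Build (A, B, is_next) training pairs per the 50%/50% NSP rule."""
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # 50%: B is the actual next sentence of A
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # 50%: B is a random sentence from the corpus
            pairs.append((sentences[i], rng.choice(sentences), False))
    return pairs
```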
Specifically, the corpus includes corpora containing the hot words of the vertical field concerned. Pre-training requires a large amount of corpus support; to improve the Bert error correction model's recognition capability in the vertical field, corpora containing the corresponding field's hot words are added for update training. For example, to use the model in the banking field, corpora containing the banking field's hot words are added for update training.
A third round of error correction is performed on the twice-corrected text with the hot word replacement rule model to reinforce the correction effect. Different people may call the same kind of thing by different names; for NLP this causes noise and reduces task efficiency, yet these words are not, strictly speaking, wrong, so semantic error correction and pinyin error correction are likely to ignore them. The different names are therefore set as hot words: whenever a hot word appears in the text, it is replaced with the word the NLP system requires, minimizing the generation of noise. As with the pinyin error correction part, updating is very simple: only the word needing correction and its corrected counterpart need to be added to the hot word rules;
in the invention, text entered by the user is first fed into the Bert error correction model for semantic correction; the once-corrected text is then fed into the pinyin error correction model for secondary correction, so that after the semantics are corrected, the proper nouns of the vertical field are corrected as reinforcement, improving the accuracy of error correction. The twice-corrected text is then fed into the hot word replacement rule model for hot word replacement, which converts dialect and other spoken expressions into the corresponding proper nouns and strengthens the correction effect once more. With these three error correction systems, a text not only receives basic semantic correction from context; proper nouns of the vertical field and dialect expressions of the application scenario also receive a degree of replacement correction that a single Bert error correction model can hardly achieve.
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto; any equivalent substitution or change made, within the technical scope disclosed by the present invention, by a person skilled in the art according to the technical solution and inventive concept of the present invention shall fall within the protection scope of the present invention.

Claims (6)

1. A text error correction method suitable for a specific vertical field, characterized by comprising the following steps:
S1, importing the text into a pre-trained Bert error correction model for word-sense error correction;
S11, segmenting the text into short sentences according to punctuation marks;
S12, masking the first character in the short sentence;
S13, predicting the masked character with the pre-trained Bert error correction model and storing all prediction results in a first list, the results being sorted by prediction score from high to low;
S131, if the masked character is in the first list, regarding the masked character as correct;
S132, if the masked character is not in the first list, obtaining, according to its pinyin, all common characters pronounced the same as the masked character and storing them in a second list;
S1321, if a character appears in both the first list and the second list, regarding the masked character as wrongly written and replacing it with the highest-scoring such character from the first list, thereby achieving error correction;
S1322, if the first list and the second list share no character, regarding the masked character as correct;
S14, after the first character of the short sentence has been judged, masking the next character in the short sentence and repeating step S13 until every Chinese character in the text has been checked and corrected;
S2, importing the text corrected by the Bert error correction model into the pinyin error correction model for secondary error correction;
S21, converting all text corrected by the Bert error correction model into pinyin;
S22, comparing the pinyin of each hot word with the pinyin of the text in order of character count, from fewest to most;
S23, when a hot word's pinyin exactly matches a span of the text's pinyin, replacing that span of the text with the hot word;
S24, repeating steps S22 and S23 until all hot words have been checked;
S3, importing the text after secondary correction by the pinyin error correction model into a hot word replacement rule model for a third round of error correction;
S31, importing the text after secondary correction by the pinyin error correction model into the hot word replacement rule model;
and S32, traversing the text with the key list; when a key (a word needing correction) is detected in the text, replacing it with the corresponding value (the correct word), and outputting the final corrected text.
2. A text error correction device suitable for a specific vertical field, characterized by comprising a pre-trained Bert error correction model, a pinyin error correction model, and a hot word replacement rule model, wherein the Bert error correction model is a multi-layer bidirectional Transformer encoder whose input embedding is the sum of three embeddings, namely Token Embedding, Segment Embedding, and Position Embedding; the Bert error correction model encodes with Multi-Head Attention, expanding the input embedding into the three matrices Key, Query, and Value, splitting each into multiple heads, performing self-attention between each head and the other tokens to obtain new vectors, concatenating the new vectors of all heads, and linearly transforming them by a weight matrix to obtain the final Multi-Head Attention value.
3. The apparatus of claim 2, wherein the pinyin error correction model comprises a database containing hot words of a given domain together with their pinyin and character counts, the hot words being derived from the proper nouns of that domain.
4. The apparatus of claim 2, wherein the hot word replacement rule model comprises a dictionary that sets each word needing correction as a key and the corresponding correct word as its value, all keys being stored in a key list.
5. The apparatus of claim 2, wherein the pre-trained Bert error correction model is pre-trained with two tasks, the two tasks comprising a Masked Language Model and Next Sentence Prediction;
the Masked Language Model pre-trains the Bert error correction model by randomly masking tokens in the corpus and predicting the masked tokens;
Next Sentence Prediction pre-trains the Bert error correction model by inputting a sentence A and a sentence B, where B has a 50% probability of being the actual next sentence of A and a 50% probability of being a random sentence from the corpus, and the model predicts whether B is the next sentence of A.
6. The apparatus of claim 5, wherein the corpus contains corpora of the hot words of the vertical domain concerned.
CN202110687769.5A 2021-06-21 2021-06-21 Text error correction method and device suitable for vertical field Active CN113449514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110687769.5A CN113449514B (en) 2021-06-21 2021-06-21 Text error correction method and device suitable for vertical field

Publications (2)

Publication Number Publication Date
CN113449514A true CN113449514A (en) 2021-09-28
CN113449514B CN113449514B (en) 2023-10-31

Family

ID=77812053

Country Status (1)

Country Link
CN (1) CN113449514B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020186778A1 (en) * 2019-03-15 2020-09-24 平安科技(深圳)有限公司 Error word correction method and device, computer device, and storage medium
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium
CN112395861A (en) * 2020-11-18 2021-02-23 平安普惠企业管理有限公司 Method and device for correcting Chinese text and computer equipment
WO2021189851A1 (en) * 2020-09-03 2021-09-30 平安科技(深圳)有限公司 Text error correction method, system and device, and readable storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817469B (en) * 2022-04-27 2023-08-08 马上消费金融股份有限公司 Text enhancement method, training method and training device for text enhancement model
CN115168565A (en) * 2022-07-07 2022-10-11 北京数美时代科技有限公司 Cold start method, device, equipment and storage medium for vertical domain language model
CN116975298A (en) * 2023-09-22 2023-10-31 厦门智慧思明数据有限公司 NLP-based modernized society governance scheduling system and method
CN116975298B (en) * 2023-09-22 2023-12-05 厦门智慧思明数据有限公司 NLP-based modernized society governance scheduling system and method

Similar Documents

Publication Publication Date Title
CN113449514B (en) Text error correction method and device suitable for vertical field
CN112183094B (en) Chinese grammar debugging method and system based on multiple text features
CN114580382A (en) Text error correction method and device
CN111599340A (en) Polyphone pronunciation prediction method and device and computer readable storage medium
Alkhatib et al. Deep learning for Arabic error detection and correction
CN113268576B (en) Deep learning-based department semantic information extraction method and device
Abbad et al. Multi-components system for automatic Arabic diacritization
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
CN113239666A (en) Text similarity calculation method and system
KR20230061001A (en) Apparatus and method for correcting text
CN110633456B (en) Language identification method, language identification device, server and storage medium
KR101941692B1 (en) named-entity recognition method and apparatus for korean
CN115204164B (en) Method, system and storage medium for identifying communication sensitive information of power system
Muhamad et al. Proposal: A hybrid dictionary modelling approach for malay tweet normalization
CN111090720B (en) Hot word adding method and device
CN112528003B (en) Multi-item selection question-answering method based on semantic sorting and knowledge correction
CN114896966A (en) Method, system, equipment and medium for positioning grammar error of Chinese text
Nyberg Grammatical error correction for learners of swedish as a second language
Lv et al. StyleBERT: Chinese pretraining by font style information
CN114444492A (en) Non-standard word class distinguishing method and computer readable storage medium
Hasan et al. SweetCoat-2D: Two-Dimensional Bangla Spelling Correction and Suggestion Using Levenshtein Edit Distance and String Matching Algorithm
Saychum et al. Efficient Thai Grapheme-to-Phoneme Conversion Using CRF-Based Joint Sequence Modeling.
CN111428005A (en) Standard question and answer pair determining method and device and electronic equipment
Athanaselis et al. A corpus based technique for repairing ill-formed sentences with word order errors using co-occurrences of n-grams
CN113012685A (en) Audio recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: No. 2-206, No. 1399 Liangmu Road, Cangqian Street, Yuhang District, Hangzhou City, Zhejiang Province, 311100

Patentee after: Kangxu Technology Co.,Ltd.

Country or region after: China

Address before: 310000 2-206, 1399 liangmu Road, Cangqian street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee before: Zhejiang kangxu Technology Co.,Ltd.

Country or region before: China