CN113449514B - Text error correction method and device suitable for vertical field - Google Patents

Text error correction method and device suitable for vertical field

Info

Publication number
CN113449514B
CN113449514B (application CN202110687769.5A)
Authority
CN
China
Prior art keywords
error correction
text
word
model
bert
Prior art date
Legal status
Active
Application number
CN202110687769.5A
Other languages
Chinese (zh)
Other versions
CN113449514A (en)
Inventor
励建科
陈再蝶
朱晓秋
周杰
樊伟东
Current Assignee
Kangxu Technology Co ltd
Original Assignee
Zhejiang Kangxu Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhejiang Kangxu Technology Co ltd filed Critical Zhejiang Kangxu Technology Co ltd
Priority to CN202110687769.5A priority Critical patent/CN113449514B/en
Publication of CN113449514A publication Critical patent/CN113449514A/en
Application granted granted Critical
Publication of CN113449514B publication Critical patent/CN113449514B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/232Orthographic correction, e.g. spell checking or vowelisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a text error correction method and device suitable for the vertical field, comprising the following steps: S1, importing a text into a pretrained Bert error correction model and performing word-sense error correction; S2, importing the text corrected by the Bert error correction model into a pinyin error correction model and performing a second error correction; and S3, importing the text corrected a second time by the pinyin error correction model into a hotword replacement rule model and performing a third error correction. The text input by the user is first fed into the Bert error correction model for semantic correction; the corrected text is then imported into the pinyin error correction model for a second correction, so that proper nouns of the vertical field are reinforced after the semantic pass and the accuracy of text correction improves; finally, the twice-corrected text is fed into the hotword replacement rule model, which replaces hotwords and converts colloquial text such as dialect into proper nouns, strengthening the correction effect once more.

Description

Text error correction method and device suitable for vertical field
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a text error correction method and error correction device applicable to the vertical field.
Background
Natural Language Processing (NLP) is the branch of artificial intelligence devoted to analyzing human language; modern NLP is a hybrid discipline that draws on linguistics, computer science and machine learning. For NLP to respond accurately to input text, the text must first be corrected so as to reduce noise. At present, text error correction focuses mainly on semantic analysis to find and replace wrongly written characters, and the text error correction models on the market fall into two broad categories: machine learning and deep learning.
However, firstly, machine learning models often fail to fit the data well, so their accuracy is low, while deep learning models require a large amount of accurate corpus as well as a great deal of training time; in the vertical field, corpus noise means the accuracy of a general deep model still needs improvement;
secondly, many proper nouns specific to the vertical field appear in such scenarios; wrongly written characters inside proper nouns are difficult to detect by semantic error correction alone, and the model may even change correct characters into wrong ones based on its corpus;
finally, because of dialects or personal habits, the same thing may be referred to in several ways, which causes noise that makes it hard for NLP to extract the correct information; yet these terms are not strictly wrong, so general error correction seldom reacts to them.
Disclosure of Invention
In order to solve the technical problems mentioned in the background art, a text error correction method and an error correction device suitable for the vertical field are provided.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a text error correction method suitable for the vertical field comprises the following steps:
s1, importing a text into a pretrained Bert error correction model, and performing text word sense error correction;
s11, segmenting the text into short sentences according to punctuation marks;
s12, carrying out mask processing on a first word in the short sentence;
s13, predicting the masked word in the short sentence by means of the pretrained Bert error correction model, and storing all prediction results in list one, wherein the results in list one are sorted by prediction score in descending order;
s131, if the masked word is in the first list, the masked word is regarded as correct;
s132, if the masked words are not in the first list, acquiring all common words with the same pronunciation as the masked words according to pinyin and storing the common words in the second list;
s1321, if the same word exists in list one and list two, the masked word is regarded as a wrongly written character, and the word with the highest prediction score in list one replaces the masked word, achieving the error correction;
s1322, if no word in list one is identical to a word in list two, the masked word is regarded as correct;
s14, after judging the first word of the short sentence, carrying out mask processing on the next word in the short sentence, and repeating the step S13 until all Chinese characters in the text are detected and corrected;
s2, importing the text subjected to error correction by the Bert error correction model into a Pinyin error correction model, and performing secondary error correction;
s21, converting all texts subjected to error correction by the Bert error correction model into pinyin;
s22, comparing each hotword's pinyin with the text's pinyin in turn, taking the hotwords in ascending order of word count;
s23, when the hotword pinyin is completely identical to a span of the text pinyin, replacing that part of the text with the hotword;
s24, repeating the step S22 and the step S23 until all hot words are checked.
S3, importing the text subjected to the second error correction by the pinyin error correction model into a hotword replacement rule model, and performing third error correction;
s31, importing the text subjected to the second error correction by the pinyin error correction model into a hotword replacement rule model;
and S32, traversing the text with the key list; when a key, i.e. a word needing correction, is detected in the text, replacing it with the corresponding value, i.e. the correct word, and outputting the final corrected text.
As a further description of the above technical solution:
the text error correction device comprises a pretrained Bert error correction model, a Pinyin error correction model and a hot word replacement rule model, wherein the Bert error correction model is a Multi-layer bidirectional Transformers encoder, the Embedding of the Bert error correction model is formed by summing three Embedding, the three Embedding are Token Embeddings, segment Embeddings and Position Embeddings respectively, the Bert error correction model uses Multi-Head Attention for encoding, three dimensions of Key, query and Value are obtained respectively through dimension expansion of the input Embedding, multi-Head division is carried out on each dimension, each Head divided is then carried out with other words, so that a new vector is obtained, the new vector of each Head is spliced, and a final Multi-Head Attention Value is obtained through linear conversion of a weight matrix.
As a further description of the above technical solution:
the pinyin error correction model comprises a database, wherein the database contains hot words in a certain field and corresponding hot word pinyin and word numbers, and the hot words in the certain field are derived from proper nouns in the field.
As a further description of the above technical solution:
the hot word replacement rule model comprises a dictionary, wherein words to be corrected are set as keys in the dictionary, corresponding correct words are set as values, and all the keys are stored in a key list.
As a further description of the above technical solution:
the pretrained Bert error correction model is pretrained by two models, including Masked language mode and Next sentence prediction;
the Masked language mode pre-trains the Bert error correction model by inputting randomly masked tokens in the corpus and predicting the randomly masked tokens;
the Next sentence prediction is configured to pre-train the Bert error correction model on whether the sentence B is the next sentence of the sentence a by inputting the sentence a and the sentence B, wherein the sentence B is 50% likely to be the next sentence of the sentence a and 50% likely to be a random sentence in the corpus.
As a further description of the above technical solution:
the corpus comprises the corpus of hot words in a vertical field of a certain field.
In summary, due to the adoption of the above technical scheme, the beneficial effects of the invention are as follows: the text input by the user is fed into the Bert error correction model for semantic correction; the corrected text is then imported into the pinyin error correction model for a second correction, so that proper nouns of the vertical field are reinforced after the semantic pass and the accuracy of text correction improves; the twice-corrected text is then fed into the hotword replacement rule model, which replaces hotwords and converts colloquial text such as dialect into proper nouns, strengthening the correction once more. Through these three correction stages, the text receives basic semantic correction from context, plus a degree of replacement correction for vertical-field proper nouns, scenario-specific nouns and dialect slang of the application scenario, which a single Bert error correction model can hardly achieve.
Drawings
Fig. 1 shows a schematic flow chart of a text error correction method applicable to the vertical field according to an embodiment of the present invention;
fig. 2 shows a schematic diagram of a Bert error correction flow of a text error correction method applicable to the vertical field according to an embodiment of the present invention;
fig. 3 shows a schematic diagram of the pinyin error correction flow of a text error correction method applicable to the vertical field according to an embodiment of the present invention;
fig. 4 shows a schematic flow chart of the hotword replacement rule of a text error correction method applicable to the vertical field according to an embodiment of the present invention;
fig. 5 shows a schematic diagram of a Bert error correction model input part of a text error correction device applicable to a specific vertical field according to an embodiment of the present invention;
fig. 6 shows a schematic flow diagram of Multi-Head Attention in the Bert error correction model of a text error correction device suitable for a specific vertical field according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1-6, the present invention provides a technical solution: a text error correction method suitable for the vertical field comprises the following steps:
s1, importing a text into a pretrained Bert error correction model, and performing text word sense error correction;
s11, segmenting the text into short sentences according to punctuation marks;
s12, carrying out mask processing on a first word in the short sentence;
s13, predicting the masked word in the short sentence by means of the pretrained Bert error correction model, and storing all prediction results in list one, wherein the results in list one are sorted by prediction score in descending order;
s131, if the masked word is in the first list, the masked word is regarded as correct;
s132, if the masked words are not in the first list, acquiring all common words with the same pronunciation as the masked words according to pinyin and storing the common words in the second list;
s1321, if the same word exists in list one and list two, the masked word is regarded as a wrongly written character, and the word with the highest prediction score in list one replaces the masked word, achieving the error correction;
s1322, if no word in list one is identical to a word in list two, the masked word is regarded as correct;
s14, after judging the first word of the short sentence, carrying out mask processing on the next word in the short sentence, and repeating the step S13 until all Chinese characters in the text are detected and corrected;
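The S11-S14 loop can be sketched as follows. `predict_top_k` (the Bert model's scored masked-word predictions of step S13) and `same_pinyin_words` (the homophone lookup of step S132) are hypothetical stand-ins for components the patent does not name:

```python
import re

PUNCT = "，。！？；,.!?;"

def correct_clause(clause, predict_top_k, same_pinyin_words):
    """Steps S12-S14: mask each character in turn and judge it."""
    chars = list(clause)
    for i, ch in enumerate(chars):
        masked = "".join(chars[:i]) + "[MASK]" + "".join(chars[i + 1:])
        # S13: list one -- predictions already sorted by descending score
        list_one = [w for w, _score in predict_top_k(masked)]
        if ch in list_one:                 # S131: the character is correct
            continue
        list_two = same_pinyin_words(ch)   # S132: common homophones of ch
        shared = [w for w in list_one if w in list_two]
        if shared:                         # S1321: wrongly written character
            chars[i] = shared[0]           # highest-scoring shared word wins
        # S1322: no shared word -> the character is treated as correct
    return "".join(chars)

def correct_text(text, predict_top_k, same_pinyin_words):
    # S11: split the text into short clauses at punctuation marks
    parts = re.split(r"([%s])" % PUNCT, text)
    return "".join(
        p if (not p or p in PUNCT)
        else correct_clause(p, predict_top_k, same_pinyin_words)
        for p in parts
    )
```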
s2, importing the text corrected by the Bert error correction model into the pinyin error correction model for a second, vertical-field-reinforcing error correction; a number of proper nouns used only in this narrow scenario exist, and the Bert error correction model may fail to find errors in them, or may even change originally correct words into wrong ones based on its corpus;
for example, the text may render a proper noun such as "Great Wall credit card" with a wrong homophonous character, and semantic correction by the Bert error correction model alone may not perceive this error; we therefore use the pinyin error correction model for reinforcement, storing the proper nouns of the narrow scenario, such as the many and varied card names of the banking field, as hotwords in a database together with their pinyin and word count, e.g. [ "great wall credit card", "chang+cheng+xin+yong+ka", 5 ];
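As a sketch, such a database row could be built from any character-to-pinyin function; a package such as `pypinyin` could supply one, though the patent names no library:

```python
def hotword_entry(word, char_to_pinyin):
    """Build one database row (hotword, "+"-joined pinyin, word count),
    matching the ["great wall credit card", "chang+cheng+xin+yong+ka", 5]
    shape above. `char_to_pinyin` maps one character to its pinyin."""
    return [word, "+".join(char_to_pinyin(c) for c in word), len(word)]
```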
s21, converting all texts subjected to error correction by the Bert error correction model into pinyin;
s22, comparing each hotword's pinyin with the text's pinyin in turn, taking the hotwords in ascending order of word count;
s23, when the hotword pinyin is completely identical to a span of the text pinyin, replacing that part of the text with the hotword;
s24, repeating the step S22 and the step S23 until all hot words are checked.
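Steps S21-S24 amount to a sliding-window pinyin comparison; a minimal sketch, again parameterizing the pinyin conversion rather than assuming a particular library:

```python
def pinyin_correct(text, hotwords, to_pinyin):
    """Steps S21-S24: replace spans whose pinyin exactly matches a hotword's.

    `hotwords` is a list of (word, pinyin, length) database rows, e.g.
    ("great wall credit card", "chang+cheng+xin+yong+ka", 5);
    `to_pinyin` converts a string to the same "+"-joined pinyin form.
    """
    # S22: check hotwords in ascending order of word count
    for word, py, n in sorted(hotwords, key=lambda h: h[2]):
        i = 0
        while i + n <= len(text):
            window = text[i:i + n]
            # S23: identical pinyin -> replace the span with the hotword
            if to_pinyin(window) == py:
                text = text[:i] + word + text[i + n:]
            i += 1
    return text  # S24: all hotwords checked
```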
S3, importing the text corrected a second time by the pinyin error correction model into the hotword replacement rule model for a third error correction, which further optimizes the result: owing to colloquial speech and dialect, some text is likely to be ignored by the Bert error correction model's semantic correction, while the pinyin error correction model also disregards it because its pronunciation differs greatly from the proper noun;
for example, the term we need is "credit" ("ge+dai"), but the input text reads "private credit" ("si+ren+dai"); to the Bert error correction model the semantics of "private credit" are unproblematic, and since [ "si+ren+dai", 3 ] differs markedly from [ "ge+dai", 2 ], the pinyin error correction does not respond either;
for another example, the same word may have several dialectal or colloquial variants in Chinese; such variants are recognized by neither the Bert error correction model nor the pinyin error correction model, so we use the hotword replacement rule model to correct these texts, replacing the variants with the words we need;
s31, importing the text subjected to the second error correction by the pinyin error correction model into a hotword replacement rule model;
and S32, traversing the text with the key list; when a key, i.e. a word needing correction, is detected in the text, replacing it with the corresponding value, i.e. the correct word, and outputting the final corrected text.
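Steps S31-S32 reduce to a dictionary-driven replacement pass; a minimal sketch (scanning keys longest-first, an assumption the patent does not state, so that longer phrases win over their substrings):

```python
def rule_replace(text, rules):
    """Steps S31-S32: replace dialect/colloquial keys with their values.

    `rules` maps each word needing correction (key) to its correct
    form (value); the key list is simply the dictionary's keys.
    """
    for key in sorted(rules, key=len, reverse=True):  # traverse the key list
        if key in text:                               # key detected in text
            text = text.replace(key, rules[key])      # swap in the value
    return text                                       # final corrected text
```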
Referring to fig. 4 and 5, a text error correction device suitable for a specific vertical field includes a pretrained Bert error correction model, a pinyin error correction model and a hotword replacement rule model. The Bert error correction model is a multi-layer bidirectional Transformer encoder whose input embedding is the sum of three embeddings, namely Token Embeddings, Segment Embeddings and Position Embeddings. The Bert error correction model encodes with Multi-Head Attention: the input embedding is expanded to obtain the three roles Key, Query and Value, each is split into multiple heads, each head performs self-attention with the other words to obtain a new vector, the new vectors of all heads are spliced together, and the final Multi-Head Attention value is obtained through a linear transformation by a weight matrix;
the Bert error correction model is more effective in unsupervised learning by means of Multi-Head Attention and bidirectional encoding, and because a Transformer is used, the Bert error correction model is more efficient and can capture dependence of a longer distance than a previous model, and can capture bidirectional context information in a true sense.
Specifically, the pinyin error correction model includes a database, wherein the database contains hot words in a certain field and corresponding hot word pinyin and word numbers, and the hot words in the certain field are derived from proper nouns in the field;
the text corrected by the Bert semantic error correction is corrected secondarily by using the Pinyin error correction model, correction of proper nouns in the related field is emphasized, and the proper nouns are difficult to detect through contexts, so that the proper nouns are likely to be ignored by the semantic error correction, the proper nouns are set to be hot words by using the Pinyin error correction model, when the hot word Pinyin is identical to the text Pinyin, corresponding characters are replaced by the hot words, so that the correctness of the proper noun text is ensured, and the method is convenient to update, and updating can be completed only by adding or deleting proper nouns in a hot word list, for example, a great amount of time can be saved in the fields with frequent product changes such as the banking field.
Specifically, the hot word replacement rule model includes a dictionary in which words to be corrected are set as keys, corresponding correct words are set as values, and all the keys are stored in a key list.
Specifically, the pretrained Bert error correction model is pretrained with two tasks, namely Masked Language Model and Next Sentence Prediction;
the Masked Language Model task pretrains the Bert error correction model by randomly masking tokens in the corpus and predicting the masked tokens;
the Next Sentence Prediction task pretrains the Bert error correction model to judge, given input sentences A and B, whether sentence B is the next sentence of sentence A, wherein sentence B is the true next sentence 50% of the time and a random sentence from the corpus 50% of the time.
Specifically, the corpus contains a hotword corpus of the given vertical field. Pretraining requires large-scale corpus support, so to improve the Bert error correction model's recognition ability in the vertical field, the hotword corpus of the corresponding field is added for update training; for example, a model applied in the banking field is update-trained with the banking vertical-field hotword corpus.
The text corrected a second time is corrected a third time by the hotword replacement rule model to strengthen the correction effect. Different people refer to the same thing differently, which creates noise that reduces task efficiency for NLP; strictly speaking, however, these words are not wrong, so both semantic correction and pinyin correction are likely to ignore them. The different designations are therefore registered as hotwords, and whenever a hotword appears in the text it is replaced with the word the NLP needs, minimizing noise. Updating this method is also very simple: it only requires adding a word needing correction and its corrected form to the hotword rules;
according to the text correction method, text input by a user is poured into the Bert correction model for text correction, the corrected text is imported into the Pinyin correction model for secondary correction, so that after the text is subjected to semantic correction, proper nouns in the vertical field are corrected to achieve the enhancement effect, the accuracy of text correction is improved, the text subjected to secondary correction is poured into the hot word replacement rule model for hot word replacement, spoken text such as dialect is converted into proper nouns, the correction effect is enhanced again, through the three correction systems, the text can be subjected to basic correction from the semantic through the context, and the correction can be performed to a certain degree of replacement correction aiming at proper nouns in the vertical field, specific nouns and dialect slang under the application scene environment, which is difficult to achieve by the single Bert correction model.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical scheme and inventive concept of the present invention, shall be covered by the scope of the present invention.

Claims (6)

1. A text error correction method suitable for the vertical field is characterized by comprising the following steps:
s1, importing a text into a pretrained Bert error correction model, and performing text word sense error correction;
s11, segmenting the text into short sentences according to punctuation marks;
s12, carrying out mask processing on a first word in the short sentence;
s13, predicting the masked word in the short sentence by means of the pretrained Bert error correction model, and storing all prediction results in list one, wherein the results in list one are sorted by prediction score in descending order;
s131, if the masked word is in the first list, the masked word is regarded as correct;
s132, if the masked words are not in the first list, acquiring all common words with the same pronunciation as the masked words according to pinyin and storing the common words in the second list;
s1321, if the same word exists in list one and list two, the masked word is regarded as a wrongly written character, and the word with the highest prediction score in list one replaces the masked word, achieving the error correction;
s1322, if no word in list one is identical to a word in list two, the masked word is regarded as correct;
s14, after judging the first word of the short sentence, carrying out mask processing on the next word in the short sentence, and repeating the step S13 until all Chinese characters in the text are detected and corrected;
s2, importing the text subjected to error correction by the Bert error correction model into a Pinyin error correction model, and performing secondary error correction;
s21, converting all texts subjected to error correction by the Bert error correction model into pinyin;
s22, comparing each hotword's pinyin with the text's pinyin in turn, taking the hotwords in ascending order of word count;
s23, when the hotword pinyin is completely identical to a span of the text pinyin, replacing that part of the text with the hotword;
s24, repeating the step S22 and the step S23 until all hot words are checked;
s3, importing the text subjected to the second error correction by the pinyin error correction model into a hotword replacement rule model, and performing third error correction;
s31, importing the text subjected to the second error correction by the pinyin error correction model into a hotword replacement rule model;
and S32, traversing the text with the key list; when a key, i.e. a word needing correction, is detected in the text, replacing it with the corresponding value, i.e. the correct word, and outputting the final corrected text.
2. A text error correction device for implementing the text error correction method applicable to the vertical field as claimed in claim 1, characterized in that the text error correction device comprises a pretrained Bert error correction model, a pinyin error correction model and a hotword replacement rule model; the Bert error correction model is a multi-layer bidirectional Transformer encoder whose input embedding is the sum of three embeddings, namely Token Embeddings, Segment Embeddings and Position Embeddings; the Bert error correction model encodes with Multi-Head Attention: the input embedding is expanded to obtain the three roles Key, Query and Value, each is split into multiple heads, each head performs self-attention with the other words to obtain a new vector, the new vectors of all heads are spliced together, and the final Multi-Head Attention value is obtained through a linear transformation by a weight matrix.
3. The text error correction apparatus of claim 2, wherein the pinyin error correction model includes a database containing hotwords of a domain and corresponding hotword pinyin and word counts, the hotwords of the domain originating from proper nouns of the domain.
4. The text error correction apparatus of claim 2, wherein the hot word replacement rule model includes a dictionary that sets a word to be corrected as a key, sets a corresponding correct word as a value, and stores all keys in a key list.
5. The text error correction apparatus of claim 2, wherein the pretrained Bert error correction model is pretrained with two tasks, the two tasks comprising Masked Language Model and Next Sentence Prediction;
the Masked Language Model task pretrains the Bert error correction model by randomly masking tokens in the corpus and predicting the masked tokens;
the Next Sentence Prediction task pretrains the Bert error correction model to judge, given input sentences A and B, whether sentence B is the next sentence of sentence A, wherein sentence B is the true next sentence 50% of the time and a random sentence from the corpus 50% of the time.
6. The text error correction apparatus of claim 5, wherein the corpus comprises a hotword corpus of the given vertical field.
CN202110687769.5A 2021-06-21 2021-06-21 Text error correction method and device suitable for vertical field Active CN113449514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110687769.5A CN113449514B (en) 2021-06-21 2021-06-21 Text error correction method and device suitable for vertical field


Publications (2)

Publication Number Publication Date
CN113449514A CN113449514A (en) 2021-09-28
CN113449514B true CN113449514B (en) 2023-10-31

Family

ID=77812053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110687769.5A Active CN113449514B (en) 2021-06-21 2021-06-21 Text error correction method and device suitable for vertical field

Country Status (1)

Country Link
CN (1) CN113449514B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114817469B (en) * 2022-04-27 2023-08-08 马上消费金融股份有限公司 Text enhancement method, training method and training device for text enhancement model
CN115168565B (en) * 2022-07-07 2023-01-24 北京数美时代科技有限公司 Cold start method, device, equipment and storage medium for vertical domain language model
CN116975298B (en) * 2023-09-22 2023-12-05 厦门智慧思明数据有限公司 NLP-based modernized society governance scheduling system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020186778A1 (en) * 2019-03-15 2020-09-24 平安科技(深圳)有限公司 Error word correction method and device, computer device, and storage medium
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium
CN112395861A (en) * 2020-11-18 2021-02-23 平安普惠企业管理有限公司 Method and device for correcting Chinese text and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016310A (en) * 2020-09-03 2020-12-01 平安科技(深圳)有限公司 Text error correction method, system, device and readable storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020186778A1 (en) * 2019-03-15 2020-09-24 平安科技(深圳)有限公司 Error word correction method and device, computer device, and storage medium
CN112287670A (en) * 2020-11-18 2021-01-29 北京明略软件系统有限公司 Text error correction method, system, computer device and readable storage medium
CN112395861A (en) * 2020-11-18 2021-02-23 平安普惠企业管理有限公司 Method and device for correcting Chinese text and computer equipment

Also Published As

Publication number Publication date
CN113449514A (en) 2021-09-28

Similar Documents

Publication Publication Date Title
US10796105B2 (en) Device and method for converting dialect into standard language
CN113449514B (en) Text error correction method and device suitable for vertical field
CN114580382A (en) Text error correction method and device
CN112183094A (en) Chinese grammar debugging method and system based on multivariate text features
Alkhatib et al. Deep learning for Arabic error detection and correction
Abbad et al. Multi-components system for automatic Arabic diacritization
KR20230009564A (en) Learning data correction method and apparatus thereof using ensemble score
Abandah et al. Accurate and fast recurrent neural network solution for the automatic diacritization of Arabic text
KR20230061001A (en) Apparatus and method for correcting text
CN112183060B (en) Reference resolution method of multi-round dialogue system
CN113380223A (en) Method, device, system and storage medium for disambiguating polyphone
Chen et al. Integrated semantic and phonetic post-correction for chinese speech recognition
Fang et al. Non-autoregressive Chinese ASR error correction with phonological training
Karim et al. On the training of deep neural networks for automatic Arabic-text diacritization
Alkhatlan et al. Attention-based sequence learning model for Arabic diacritic restoration
Nguyen et al. OCR error correction for Vietnamese handwritten text using neural machine translation
Winata Multilingual transfer learning for code-switched language and speech neural modeling
Nyberg Grammatical error correction for learners of swedish as a second language
CN111090720B (en) Hot word adding method and device
Muaidi Levenberg-Marquardt learning neural network for part-of-speech tagging of Arabic sentences
CN111429886B (en) Voice recognition method and system
CN114444492A (en) Non-standard word class distinguishing method and computer readable storage medium
Lv et al. StyleBERT: Chinese pretraining by font style information
Mijlad et al. Arabic text diacritization: Overview and solution
Matsubara et al. Example-based speech intention understanding and its application to in-car spoken dialogue system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: No. 2-206, No. 1399 Liangmu Road, Cangqian Street, Yuhang District, Hangzhou City, Zhejiang Province, 311100

Patentee after: Kangxu Technology Co.,Ltd.

Country or region after: China

Address before: 310000 2-206, 1399 liangmu Road, Cangqian street, Yuhang District, Hangzhou City, Zhejiang Province

Patentee before: Zhejiang kangxu Technology Co.,Ltd.

Country or region before: China
