CN112560451B

CN112560451B - Wrongly written character proofreading method and device for automatically generating training data

Info

Publication number: CN112560451B
Application number: CN202110190708.8A
Authority: CN
Inventors: 蓝建敏; 池沐霖
Original assignee: Excellence Information Technology Co ltd
Current assignee: Excellence Information Technology Co ltd
Priority date: 2021-02-20
Filing date: 2021-02-20
Publication date: 2021-05-14
Anticipated expiration: 2041-02-20
Also published as: CN112560451A

Abstract

The invention discloses a wrongly written character proofreading method and device that automatically generate training data. The method comprises: performing word segmentation on a given corpus to obtain a plurality of first phrases; generating a plurality of confusable word sets according to the first phrases; selecting a first phrase to be replaced from the plurality of first phrases of the given corpus, and taking the confusable word set whose core phrase is the same as the first phrase to be replaced as a selected word set; replacing the first phrase to be replaced in the given corpus with a similar phrase in the selected word set to generate an error corpus; taking the given corpus and the error corpus as a training data set, and training a wrongly written character proofreading model on the training data set; and proofreading the text to be proofread with the wrongly written character proofreading model. The method and device address the long time consumption and low efficiency of manually collecting error corpora in the prior art.

Description

Wrongly written character proofreading method and device for automatically generating training data

Technical Field

The invention relates to the field of computer technology, and in particular to a wrongly written character proofreading method and device for automatically generating training data.

Background

Wrongly written character proofreading is one of the tasks of text proofreading. With the development of science and technology, building models automatically through machine learning and using them for error correction is becoming popular. Training a model requires a large amount of training data, and existing approaches obtain it by manually collecting users' error corpora and then labeling them to generate training samples. Manually collecting error corpora that contain wrongly written characters is time-consuming, labor-intensive and inefficient.

Disclosure of Invention

The embodiment of the invention provides a wrongly written character proofreading method and device for automatically generating training data, which can automatically generate error corpora containing wrongly written characters, train a wrongly written character proofreading model with the generated error corpora, and finally proofread wrongly written characters with that model, thereby reducing labor consumption and improving efficiency.

An embodiment of the present invention provides a method for correcting wrongly written characters by automatically generating training data, including:

obtaining given linguistic data and performing word segmentation processing on the given linguistic data to obtain a plurality of first phrases;

generating a plurality of confusable word sets according to the first phrases; each confusable word set comprises a core phrase and a plurality of similar phrases corresponding to the core phrase, and the input operation of each similar phrase and the core phrase under the same input method is the same; the core phrase is the first phrase;

selecting a first phrase to be replaced from a plurality of first phrases of the given corpus, and then taking a confusable word set with the same core phrase as the first phrase to be replaced as a selected word set; replacing the first phrase to be replaced in the given corpus with a similar phrase in the selected word set to generate an error corpus;

taking the given corpus and the error corpus as training data sets, and training an error word proofreading model according to the training data sets;

and checking the text to be checked according to the wrongly written character checking model.

Further, after generating a plurality of confusable word sets according to each of the first phrases, the method further includes:

and calculating the cosine distance of the word vector between the core word group in each confusable word set and each similar word group, and eliminating the similar word groups of which the cosine distance of the word vector exceeds a preset threshold value.

Further, the input method comprises any one or a combination of the following: a five-stroke (Wubi) input method, a pinyin input method and a stroke input method.

Further, when the text to be proofread is proofread by the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is used as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.

On the basis of the above method embodiments, the invention correspondingly provides device embodiments;

the invention provides a wrongly written or mispronounced character proofreading device capable of automatically generating training data, which comprises a word segmentation module, an easily confused word set generation module, a wrong corpus generation module, a model training module and a proofreading module;

the word segmentation module is used for acquiring given linguistic data and performing word segmentation processing on the given linguistic data to acquire a plurality of first phrases;

the confusable word set generating module is used for generating a plurality of confusable word sets according to the first phrases; each confusable word set comprises a core phrase and a plurality of similar phrases corresponding to the core phrase, and the input operation of each similar phrase and the core phrase under the same input method is the same; the core phrase is the first phrase;

the wrong corpus generating module is used for selecting a first phrase to be replaced from a plurality of first phrases of the given corpus, and then taking the confusable word set whose core phrase is the same as the first phrase to be replaced as a selected word set; replacing the first phrase to be replaced in the given corpus with a similar phrase in the selected word set to generate an error corpus;

the model training module is used for taking the given corpus and the error corpus as a training data set and training an error word proofreading model according to the training data set;

and the proofreading module is used for proofreading the text to be proofread according to the wrongly written character proofreading model.

Further, the device also comprises a phrase eliminating module; and the phrase eliminating module is used for calculating the word vector cosine distance between the core phrase in each confusable word set and each similar phrase and eliminating the similar phrases of which the word vector cosine distance exceeds a preset threshold value.

Further, when the proofreading module proofreads the text to be proofread with the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is used as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.

The invention has the following beneficial effects:

The method segments a given corpus to obtain a plurality of first phrases, generates a confusable word set corresponding to each first phrase, replaces the selected first phrase in the given corpus according to the confusable word set to automatically generate error corpora, takes the error corpora and the given corpus as the training data set required for model training, trains a wrongly written character proofreading model, and finally proofreads the text to be proofread with that model. Because the error corpora are generated automatically, they need not be collected manually, which reduces labor consumption and improves efficiency.

Drawings

Fig. 1 is a flowchart illustrating a method for correcting wrongly written or mispronounced words of automatically generated training data according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of a wrongly written or mispronounced character checking apparatus for automatically generating training data according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, an embodiment of the present invention provides a method for correcting wrongly written or mispronounced words of automatically generated training data, including:

Step S101: obtaining a given corpus and performing word segmentation on the given corpus to obtain a plurality of first phrases.

Step S102: generating a plurality of confusable word sets according to the first phrases; each confusable word set comprises a core phrase and a plurality of similar phrases corresponding to the core phrase, and the input operation of each similar phrase and the core phrase under the same input method is the same; the core phrase is the first phrase.

Step S103: selecting a first phrase to be replaced from a plurality of first phrases of the given corpus, and then taking a confusable word set with the same core phrase as the first phrase to be replaced as a selected word set; and replacing the first phrase to be replaced in the given corpus with the similar phrase in the selected word set to generate an error corpus.

Step S104: taking the given corpus and the error corpus as a training data set, and training a wrongly written character proofreading model according to the training data set.

Step S105: proofreading the text to be proofread according to the wrongly written character proofreading model.

For step S101, in the present invention, the given corpus may include text produced by users in daily life, such as Word documents, as well as data crawled from the web, such as forums, microblogs and comments. Generally speaking, corpora obtained from users' everyday documents use fairly standardized wording, mostly written language, whereas corpora crawled from forums and online comments tend toward spoken language. When the original corpus is collected, both written and spoken forms of expression are taken into account, so that the subsequently trained wrongly written character proofreading model has a wider range of application.

In step S102, the present invention simulates the wrongly written characters that users produce in order to construct the confusable word sets. Specifically, a first phrase is selected as the core phrase; phrases whose input operation under an existing input method is the same as that of the core phrase are extracted as similar phrases; and the confusable word set corresponding to that first phrase is constructed from the core phrase and the extracted similar phrases. For example, suppose word segmentation of a given corpus yields the first phrase 'propose', which is taken as the core phrase. Phrases with the same input operation as 'propose' under the pinyin input method include 'reject', 'kick out' and 'shave off'; phrases with the same input operation under the five-stroke (Wubi) input method include 'pinch out' and 'pull out'. The confusable word set corresponding to 'propose' is then { propose | reject, kick out, shave off, pinch out, pull out }, in which 'propose' is the core phrase and 'reject', 'kick out', 'shave off', 'pinch out' and 'pull out' are the similar phrases.

It should be noted that "the input operation is the same" means that the order and type of the entered letters are the same. In the above example, under the pinyin input method the letters entered for 'propose' (in the original Chinese) are "tichu", and the letters entered for 'reject', 'kick out' and 'shave off' are also "tichu"; this is what is meant by the same input operation. In addition, the input method may include any one or a combination of the following: a five-stroke (Wubi) input method, a pinyin input method and a stroke input method. That is, the confusable word set of each first phrase may be constructed with a single input method or with a combination of several. For example, if only the pinyin input method is used, the resulting confusable word set is { propose | reject, kick out, shave off }; if both the pinyin and five-stroke input methods are used, it is { propose | reject, kick out, shave off, pinch out, pull out }. Whichever construction is adopted, every similar phrase in the final confusable word set must have the same input operation as the core phrase under the same input method. The confusable word set of each first phrase in the given corpus is constructed in this way to generate the plurality of confusable word sets.
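The construction described for step S102 amounts to grouping phrases by their input code. The following is a minimal Python sketch under stated assumptions: `INPUT_CODES` is a hypothetical lookup table standing in for a real input-method lexicon, the phrases are the English renderings used in the example, and the Wubi code `code-A` is a placeholder rather than a real key sequence.

```python
from collections import defaultdict

# Hypothetical lexicon: phrase -> {input method: key sequence}.
# "code-A" is a placeholder, not a real Wubi code.
INPUT_CODES = {
    "propose":    {"pinyin": "tichu", "wubi": "code-A"},
    "reject":     {"pinyin": "tichu"},
    "kick out":   {"pinyin": "tichu"},
    "shave off":  {"pinyin": "tichu"},
    "pinch out":  {"wubi": "code-A"},
    "pull out":   {"wubi": "code-A"},
    "suggestion": {"pinyin": "jianyi"},
}


def build_confusable_sets(first_phrases, methods):
    """Map each core phrase to the phrases sharing its input code."""
    # Index every lexicon phrase by (input method, key sequence).
    index = defaultdict(list)
    for phrase, codes in INPUT_CODES.items():
        for method, code in codes.items():
            index[(method, code)].append(phrase)

    sets = {}
    for core in first_phrases:
        similar = []
        for method in methods:
            code = INPUT_CODES.get(core, {}).get(method)
            if code is None:
                continue  # core phrase has no code under this method
            for candidate in index[(method, code)]:
                if candidate != core and candidate not in similar:
                    similar.append(candidate)
        sets[core] = similar
    return sets


pinyin_only = build_confusable_sets(["propose"], ["pinyin"])
combined = build_confusable_sets(["propose"], ["pinyin", "wubi"])
```

Passing one method or several mirrors the choice between constructing the set from a single input method or from a combination; a phrase with no homophone in the lexicon simply gets an empty set.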

In step S103, a first phrase to be replaced is selected from the given corpus, and any similar phrase from the confusable word set corresponding to that first phrase is used to replace it. For example, if the given corpus is 'propose suggestion' and the selected first phrase to be replaced is 'propose', the confusable word set whose core phrase is 'propose' is looked up first, giving { propose | reject, kick out, shave off }; any one of 'reject', 'kick out' and 'shave off' is then selected to replace 'propose', yielding the error corpora 'reject suggestion', 'kick out suggestion' and 'shave off suggestion'. It is understood that 'propose' may occur many times in a given corpus. All occurrences may be replaced, or only a proportion of them according to the word frequency, for example at a ratio of 2:1 between total and replaced occurrences: if 'propose' occurs 10 times in the given corpus, 5 of those occurrences are replaced. In addition, the same similar phrase may be used for every replacement, for example replacing 'propose' with 'reject' all 5 times, or a similar phrase may be chosen at random each time, for example replacing 'propose' with 'reject' the first time and with 'kick out' the second time.
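The replacement options in step S103 (all occurrences or a fraction, a fixed or a random similar phrase) can be sketched as follows. This is an illustrative sketch, not the patented implementation: the function name `generate_error_corpus`, its parameters, and the choice to pick the replaced positions at random are assumptions.

```python
import random


def generate_error_corpus(tokens, target, similar_phrases,
                          replace_fraction=0.5, fixed_choice=None, seed=0):
    """Replace a fraction of the occurrences of `target` with similar phrases.

    replace_fraction=0.5 corresponds to the 2:1 example (10 occurrences,
    5 replaced); fixed_choice forces the same similar phrase every time.
    """
    out = list(tokens)
    if not similar_phrases:
        return out  # nothing to substitute with
    rng = random.Random(seed)  # seeded for reproducible output
    positions = [i for i, tok in enumerate(tokens) if tok == target]
    n_replace = int(len(positions) * replace_fraction)
    for i in sorted(rng.sample(positions, n_replace)):
        if fixed_choice is not None:
            out[i] = fixed_choice
        else:
            out[i] = rng.choice(similar_phrases)
    return out


given = ["propose", "suggestion"] * 10
similar = ["reject", "kick out", "shave off"]
error_random = generate_error_corpus(given, "propose", similar)
error_fixed = generate_error_corpus(given, "propose", similar,
                                    replace_fraction=1.0, fixed_choice="reject")
```

Here `error_random` keeps 5 of the 10 occurrences of 'propose' intact, while `error_fixed` replaces every occurrence with 'reject'.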

In an optional embodiment, after generating a plurality of confusable word sets according to each of the first phrases, the method further includes: and calculating the cosine distance of the word vector between the core word group in each confusable word set and each similar word group, and eliminating the similar word groups of which the cosine distance of the word vector exceeds a preset threshold value.

Similar phrases whose meaning is too close to that of the core phrase are removed from each confusable word set by a word-vector distance calculation; the preset threshold may be set to 0.9. The word-vector cosine distance between each similar phrase and the core phrase is calculated one by one (a larger value here indicates closer meaning, i.e. the quantity behaves as a cosine similarity); if it is greater than 0.9, the similar phrase is very close in meaning to the core phrase and is removed. This step mainly eliminates near-synonyms. Chinese expression is flexible, and the same meaning can be expressed by different words, which are called near-synonyms. If a similar phrase that is a near-synonym of the core phrase were used as the replacement, the result would not be a genuine error; to avoid replacing the core phrase with such a phrase as if it were a wrong phrase, similar phrases that are near-synonyms of the core phrase are eliminated by the word-vector distance calculation in this step.
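The elimination step can be illustrated with a small Python sketch. The two-dimensional vectors in `VECTORS` and the phrase 'put forward' are made up for illustration; in practice the vectors would come from a trained word-embedding model, which the text does not specify.

```python
import math

# Hypothetical toy word vectors; real ones would come from an embedding model.
VECTORS = {
    "propose":     [1.0, 0.0],
    "reject":      [0.0, 1.0],
    "kick out":    [0.3, 0.9],
    "put forward": [0.99, 0.1],  # assumed near-synonym of "propose"
}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_near_synonyms(core, similar_phrases, vectors, threshold=0.9):
    """Keep only similar phrases whose score does not exceed the threshold."""
    core_vec = vectors[core]
    return [p for p in similar_phrases
            if cosine_similarity(core_vec, vectors[p]) <= threshold]


kept = prune_near_synonyms("propose",
                           ["reject", "kick out", "put forward"], VECTORS)
```

With these toy vectors, 'put forward' scores about 0.99 against 'propose' and is dropped, while 'reject' and 'kick out' remain in the set.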

For step S104, the given corpus is taken as a positive sample and the automatically generated error corpus as a negative sample, and model training is then performed with a sequence labeling algorithm, such as Bi-LSTM + CRF, to obtain the wrongly written character proofreading model.
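A sequence labeling model needs one tag per token. The text does not state the tag scheme, so the following Python sketch assumes a simple one: `O` for an unchanged phrase and `E` for a replaced one, derived by aligning the error corpus with the given corpus. The function names are hypothetical, and the Bi-LSTM + CRF model itself is outside the scope of the sketch.

```python
def label_tokens(given_tokens, error_tokens):
    """Tag each token of error_tokens: 'O' if unchanged, 'E' if replaced."""
    if len(given_tokens) != len(error_tokens):
        raise ValueError("phrase-level replacement must preserve length")
    return [(err, "O" if err == ok else "E")
            for ok, err in zip(given_tokens, error_tokens)]


def build_training_set(given_tokens, error_corpora):
    """Positive sample (all 'O') followed by one labeled negative per error corpus."""
    samples = [label_tokens(given_tokens, given_tokens)]
    samples += [label_tokens(given_tokens, err) for err in error_corpora]
    return samples


given = ["propose", "suggestion"]
errors = [["reject", "suggestion"], ["kick out", "suggestion"]]
training_set = build_training_set(given, errors)
```

Because replacement is done phrase by phrase, the given corpus and each error corpus have the same number of tokens, which is what makes this position-wise alignment possible.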

For step S105, in a preferred embodiment, when the text to be proofread is proofread by the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is used as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.
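The replace-and-recheck loop of this embodiment can be sketched in Python. The trained model is represented by a hypothetical callable `detect_errors` that returns the positions of wrong phrases; `toy_detector` is a hand-written stand-in, not a trained model, and the `max_rounds` guard is an added assumption to keep the loop bounded.

```python
def correct_text(tokens, detect_errors, confusable_sets, max_rounds=10):
    """Try similar phrases for each flagged phrase until the detector is satisfied.

    Returns (tokens, ok), where ok is True if no wrong phrase remains.
    """
    tokens = list(tokens)
    for _ in range(max_rounds):
        wrong = detect_errors(tokens)
        if not wrong:
            return tokens, True
        i = wrong[0]
        original = tokens[i]
        # Second selected word set: the set whose core phrase is the wrong phrase.
        for candidate in confusable_sets.get(original, []):
            tokens[i] = candidate
            if i not in detect_errors(tokens):
                break  # this candidate passes the re-check
        else:
            tokens[i] = original  # no candidate worked; restore and give up
            return tokens, False
    return tokens, not detect_errors(tokens)


def toy_detector(tokens):
    """Stand-in for the trained model: flags known-bad phrases before 'suggestion'."""
    bad = {"reject", "kick out", "shave off"}
    return [i for i in range(len(tokens) - 1)
            if tokens[i] in bad and tokens[i + 1] == "suggestion"]


sets = {"reject": ["kick out", "shave off", "propose"]}
fixed, ok = correct_text(["reject", "suggestion"], toy_detector, sets)
```

In this toy run the candidates 'kick out' and 'shave off' are still flagged, and the loop stops at 'propose', which the detector accepts.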

Of course, in practice, several models may be trained with different algorithms, each model may be used to judge the correctness of a sentence, and the weighted result of their judgments may be used for error correction.
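One possible reading of "taking the weighted result" is a weighted average of per-model scores. The Python sketch below assumes each model is a callable returning a score in [0, 1] that the sentence is correct; the weights, the 0.5 decision threshold and the function name are illustrative assumptions, not values given in the text.

```python
def weighted_is_correct(sentence, models, weights, threshold=0.5):
    """Combine per-model correctness scores into one weighted decision."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    score = sum(w * model(sentence) for model, w in zip(models, weights)) / total
    return score >= threshold, score


# Hypothetical stand-ins for three separately trained models.
models = [lambda s: 1.0, lambda s: 0.0, lambda s: 1.0]
weights = [0.5, 0.3, 0.2]
decision, score = weighted_is_correct("propose suggestion", models, weights)
```

With these stand-ins, the two models that accept the sentence carry 0.7 of the total weight, so the combined decision is that the sentence is correct.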

By implementing the embodiment of the invention, the training data required for model training can be generated automatically without manual collection, which improves the efficiency of model training and reduces the consumption of manpower and material resources. Moreover, after a user provides his or her own work documents as the given corpus, a training corpus that matches the user's habits can be generated, so that the trained model can correct the wrongly written characters that frequently appear in the user's work.

As shown in fig. 2, the present invention provides an embodiment of an apparatus corresponding to the embodiment of the method;

the embodiment of the invention provides a wrongly written or mispronounced character proofreading device capable of automatically generating training data, which comprises a word segmentation module, an easily confused word set generation module, a wrong corpus generation module, a model training module and a proofreading module, wherein the word segmentation module is used for segmenting words;

In a preferred embodiment, the device further comprises a phrase eliminating module;

and the phrase eliminating module is used for calculating the word vector cosine distance between the core phrase in each confusable word set and each similar phrase and eliminating the similar phrases of which the word vector cosine distance exceeds a preset threshold value.

In a preferred embodiment, when the proofreading module proofreads the text to be proofread with the wrongly written character proofreading model, if a wrong phrase is identified, the confusable word set whose core phrase is the same as the wrong phrase is used as a second selected word set; the wrong phrase is replaced in turn with the similar phrases in the second selected word set, and the replaced text is proofread again, until the wrongly written character proofreading model reports that the text is correct.

It should be noted that the above-mentioned embodiment of the apparatus of the present invention corresponds to the embodiment of the method of the present invention, and the method for correcting wrongly written characters of automatically generated training data of the present invention can be implemented. In addition, the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
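The module structure of the device can be pictured as a composition of interchangeable callables. The Python sketch below is purely illustrative: the class `ProofreadingDevice`, its field names and the stub modules are assumptions, and the optional `prune` field stands for the phrase eliminating module that may or may not be selected.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProofreadingDevice:
    segment: Callable       # word segmentation module
    build_sets: Callable    # confusable word set generation module
    make_errors: Callable   # wrong corpus generation module
    train: Callable         # model training module
    proofread: Callable     # proofreading module
    prune: Optional[Callable] = None  # optional phrase eliminating module

    def fit(self, corpus):
        """Run the modules in order and return the trained model."""
        phrases = self.segment(corpus)
        sets = self.build_sets(phrases)
        if self.prune is not None:
            sets = self.prune(sets)
        errors = self.make_errors(phrases, sets)
        return self.train(phrases, errors)

    def check(self, model, text):
        """Delegate proofreading of `text` to the proofreading module."""
        return self.proofread(model, text)


# Stub modules, for illustration only.
device = ProofreadingDevice(
    segment=str.split,
    build_sets=lambda ps: {p: [p.upper()] for p in ps},
    make_errors=lambda ps, sets: [[sets[p][0] for p in ps]],
    train=lambda ps, errs: {"ok": ps, "bad": errs},
    proofread=lambda model, text: text.split() == model["ok"],
)
model = device.fit("propose suggestion")
```

Because each module is just a callable, any of them could run in a separate process or on a separate machine, which is consistent with the note that the units need not be physically co-located.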

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A wrongly written or mispronounced character proofreading method for automatically generating training data is characterized by comprising the following steps:

2. The method of automatically generating proofreading of wrongly written words of training data according to claim 1, further comprising, after generating a plurality of confusable word sets from each of said first phrases:

3. The method of automatically generating a wrongly written proofreading of training data as set forth in claim 1, wherein the input method includes any one or a combination of the following: a five-stroke (Wubi) input method, a pinyin input method and a stroke input method.

4. The method according to claim 1, wherein when the text to be corrected is corrected by the wrongly written character correcting model, if a wrong phrase is identified, a confusable word set having a core phrase identical to the wrong phrase is used as a second selected word set; and sequentially replacing the wrong word group with a similar word group in the second selected word set, and re-correcting the replaced text to be corrected until the wrong word correction model outputs a result of correct text detection.

5. A wrongly written or mispronounced character proofreading device capable of automatically generating training data is characterized by comprising a word segmentation module, an easily confused word set generation module, a wrong corpus generation module, a model training module and a proofreading module;

6. The apparatus according to claim 5, further comprising a phrase culling module;

7. The apparatus according to claim 5, wherein the proofreading module, when proofreading the text to be proofread through the proofreading model for the mispronounced word, takes a confusable word set having a core word set identical to the incorrect word set as a second selected word set if the incorrect word set is identified; and sequentially replacing the wrong word group with a similar word group in the second selected word set, and re-correcting the replaced text to be corrected until the wrong word correction model outputs a result of correct text detection.