CN115310432A - Wrongly written character detection and correction method - Google Patents

Wrongly written character detection and correction method

Publication number
CN115310432A
Authority
CN
China
Prior art keywords
wrongly written
sentence
character
characters
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210975544.4A
Other languages
Chinese (zh)
Inventor
郑海涛
马仕镕
李映辉
江勇
夏树涛
肖喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University
Priority to CN202210975544.4A
Publication of CN115310432A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for detecting and correcting wrongly written characters, comprising the following steps. Obtaining a contrastive learning model comprising a main module, which is a pre-trained language model, and auxiliary modules: a character-pronunciation encoding module, a character-shape encoding module, and a dictionary encoding module. Model training: the main module is trained on the wrongly-written-character correction task, and contrastive learning tasks are added; positive and negative examples are constructed separately for character pronunciation, character shape, and dictionary knowledge, and the auxiliary modules encode the pronunciation, shape, and dictionary-definition information to guide the main module to learn the pronunciation, shape, word definitions, and common-sense knowledge of Chinese characters, so that the main module contains the knowledge required by the detection and correction task. Model inference: only the main module is retained, to preserve the model's inference efficiency. The invention improves the detection and correction of wrongly written characters, so that errors that are difficult to find with existing methods can be found and then effectively corrected.

Description

Wrongly written character detection and correction method
Technical Field
The invention relates to the field of computer application, in particular to a wrongly written character detection and correction method.
Background
The detection and correction of wrongly written characters refers to technology for automatically detecting and correcting errors that appear when Chinese characters are typed. In recent years, mainstream techniques have achieved good results on this task by using language models pre-trained on large corpora; in particular, Bidirectional Encoder Representations from Transformers (BERT) has been widely used. Some recent works also introduce the pronunciation and glyph (character-shape) information of Chinese characters to help the language model better complete the task.
The prior implementation most similar to the present invention is based on a BERT pre-trained language model. After a sentence containing wrongly written characters is input, the language model extracts the semantic features of each Chinese character in the sentence, while other deep neural networks extract the pronunciation and glyph features of each character. The three kinds of features are fused by a multimodal gated fusion unit built on the Transformer, and the sentence with the wrongly written characters corrected is output. This method outperformed previous mainstream methods on the detection and correction task.
However, the prior art still has the following shortcoming: the ability of the pre-trained language model to detect and correct wrongly written characters remains insufficient, and a considerable portion of such errors are still difficult to detect or correct.
Disclosure of Invention
The invention aims to improve the ability of a pre-trained language model to detect and correct wrongly written characters, and provides a method for detecting and correcting wrongly written characters.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a method for detecting and correcting wrongly written characters, which comprises the following steps:
s1, obtaining a comparison learning model, wherein the comparison learning model comprises the following modules: the main module is a pre-training language model, and the auxiliary module comprises: the character-pronunciation encoding module, the character-shape encoding module and the dictionary encoding module;
s2, model training: directly training a main module by using a wrongly written or mispronounced character correcting task, adding a contrast learning task, constructing positive examples and negative examples required for contrast learning respectively aiming at word sound, character patterns and dictionary knowledge, and coding information of the word sound, the character patterns and the dictionary definitions of the Chinese characters by using an auxiliary module respectively, thereby guiding the main module to learn the word sound, the character patterns, the word definitions and the common knowledge of the Chinese characters, so that the main module already contains the knowledge required by the wrongly written or mispronounced character detecting and correcting task after the training stage is finished;
s3, model reasoning: only the main module is reserved for reasoning so as to ensure the reasoning efficiency of the model.
In some embodiments, the contrastive learning tasks in step S2 include: a character-pronunciation contrastive learning task, which pulls characters with similar pronunciations closer in the model's representation space and pushes characters with different pronunciations apart; a character-shape contrastive learning task, which trains the model to distinguish Chinese characters with similar shapes from those with dissimilar shapes in the representation space; and a dictionary contrastive learning task, which enhances the model's ability to understand word definitions and common-sense knowledge and guides the model to associate relevant definitions and common sense when detecting and correcting spelling errors.
In some embodiments, the training process of the dictionary contrast learning task comprises the following steps:
a1, obtaining a sentence X with wrongly written characters and a correct sentence which is corresponding to the sentence and does not contain the wrongly written characters, and determining a phrase corresponding to the position of the wrongly written characters;
a2, obtaining the phrase inParaphrase sentences in dictionary as positive examples of dictionary comparison learning task
Figure BDA0003798183980000021
Randomly selecting N paraphrase sentences corresponding to other words from the dictionary as negative examples of the task
Figure BDA0003798183980000022
A3, the sentence X with the wrongly written characters obtains the representation D of each character in the corresponding sentence through the encoder of the main module o The paraphrase sentences of positive examples and negative examples respectively obtain the representation D of each character in the sentences through the dictionary coding module in the auxiliary module p And
Figure BDA0003798183980000023
a4: and calculating the similarity between the sentence X with the wrongly written characters and the positive example and the negative example, namely acquiring all indexes { s, s + 1., s + w } of the phrase at the position of the index s where the wrongly written characters are positioned, obtaining sentence-level representations corresponding to the sentence X with the wrongly written characters, the positive example paraphrase sentence and the negative example paraphrase in an average pooling mode, and calculating the cosine similarity as the similarity between the sentence X with the wrongly written characters and the corresponding positive example paraphrase sentence and the negative example sentence.
In some embodiments, the similarity in step A4 is given by:

$$\mathrm{sim}(X, X^{p}) = \cos\big(\mathrm{avg}(D_o^{\,s:s+w}),\ \mathrm{avg}(D_p)\big)$$

$$\mathrm{sim}(X, X^{n_i}) = \cos\big(\mathrm{avg}(D_o^{\,s:s+w}),\ \mathrm{avg}(D_{n_i})\big)$$

where w is the number of characters in the phrase at the position of the wrongly written character, p denotes the positive example in contrastive learning, and $n_i$ denotes the i-th negative example.
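The average-pooling-and-cosine computation in step A4 can be sketched in plain Python. This is an illustrative reconstruction, not the patent's implementation; the function names and the toy list-of-lists representations are assumptions, and the phrase is taken to span indices s through s+w inclusive as stated above:

```python
import math

def mean_pool(vectors):
    """Average a list of equal-length vectors into one sentence-level vector."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def dictionary_similarity(D_o, s, w, D_x):
    """Similarity for the dictionary task: average-pool the character vectors
    of the phrase at indices {s, ..., s+w} of the erroneous sentence,
    average-pool the paraphrase sentence D_x, then take cosine similarity."""
    return cosine(mean_pool(D_o[s:s + w + 1]), mean_pool(D_x))
```

Computing this once against the positive paraphrase and once against each negative paraphrase yields the similarities fed into the contrastive objective.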
In some embodiments, the pronunciation comparison learning task, the glyph comparison learning task and the dictionary comparison learning task all use InfoNCE as an objective function, and the pre-training language model is a BERT pre-training language model.
In some embodiments, the training process of the pronunciation comparison learning task comprises the following steps:
b1, obtaining a sentence X containing wrongly written characters;
b2, replacing the wrongly written characters with characters similar to the pinyin to obtain a new sentence, and taking the sentence as a positive example of a word-pronunciation comparison learning task
Figure BDA0003798183980000033
Replacing wrongly written characters with other random Chinese characters to obtain N negative cases
Figure BDA0003798183980000034
B3, the sentence X with the wrongly written characters passes through an encoder of the main module to obtain the representation P of each character in the corresponding sentence o And constructing positive and negative examples of the task, wherein the paraphrase sentences of the positive and negative examples respectively obtain the representation P of each character in the sentence through the character-pronunciation coding module in the auxiliary module p And
Figure BDA0003798183980000035
and B4, calculating the similarity of the sentence X with the wrongly written characters and the positive examples and all the negative examples expressed at the aspect of the character pitch.
In some embodiments, the similarity in step B4 is represented by the following formula:
$$\mathrm{sim}(X, X^{p}) = (P_o^{s})^{\top} P_p^{s}$$

$$\mathrm{sim}(X, X^{n_i}) = (P_o^{s})^{\top} P_{n_i}^{s}$$

where s is the position of the replaced Chinese character.
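A minimal sketch of this position-restricted similarity in plain Python, with toy list-of-lists representations (the function and argument names are illustrative, not from the patent):

```python
def position_similarity(rep_orig, rep_other, s):
    """Dot product of the two representation vectors at position s, the only
    position considered by the pronunciation- and shape-level similarities."""
    return sum(a * b for a, b in zip(rep_orig[s], rep_other[s]))
```

The same function serves for both the character-pronunciation and character-shape tasks, since only the encoder producing the representations differs.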
In some embodiments, the training process of the glyph contrast learning task comprises the following steps:
c1, obtaining a sentence X containing wrongly written characters;
c2, replacing the wrongly written characters in the sentence X with the wrongly written characters with similar characters, and taking the new sentence as a positive example of a character pattern comparison learning task
Figure BDA0003798183980000041
In addition, N negative examples are obtained by replacing wrongly-written characters with other random Chinese characters
Figure BDA0003798183980000042
C3, respectively obtaining the representation V of each character in the corresponding sentence by the sentence X with the wrongly written characters through the main module encoder, the positive example and the negative example through the font encoding module in the auxiliary module o ,V p And
Figure BDA0003798183980000043
and C4, respectively calculating the similarity of the sentence X with the wrongly written characters and the positive example and the negative example expressed at the character tone level.
In some embodiments, the similarity in step C4 is represented by the following formula:
$$\mathrm{sim}(X, X^{p}) = (V_o^{s})^{\top} V_p^{s}$$

$$\mathrm{sim}(X, X^{n_i}) = (V_o^{s})^{\top} V_{n_i}^{s}$$

where s is the position of the replaced Chinese character.
The invention also discloses a computer readable storage medium, on which a computer program is stored, which, when executed by a processor, is capable of implementing the above-mentioned method for detecting and correcting a wrongly written word.
The invention has the following beneficial effects:
the invention leads the model to jointly learn the character pronunciation, character pattern and dictionary knowledge of the Chinese character by introducing the definition and common knowledge of the Chinese character in the dictionary and adding the contrast learning task in the model training stage, and can enhance the capability of the pre-training language model to detect and correct the wrongly written character, thereby improving the detection and correction effect of the wrongly written character, finding the wrongly written character which is difficult to be found by the existing method, and further effectively correcting the wrongly written character.
Drawings
FIG. 1 is a flow chart of a method for detecting and correcting a wrongly written word according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary in nature and is not intended to limit the scope of the invention or its application.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, "a plurality" means two or more unless specifically limited otherwise.
Correcting and detecting wrongly written characters requires considering not only the pronunciation and visual information of Chinese characters, but often also the semantics and common-sense knowledge they carry. The masked pre-training strategy adopted by existing language models means that the semantic knowledge they learn lies more in collocations among Chinese characters than in the definitions and common sense those characters denote. The prior art has not tried to introduce word definitions and common-sense knowledge to improve the detection and correction task, so some wrongly written characters remain difficult to find or correct.
The embodiments of the invention mainly aim to provide a detection and correction method that combines the multimodal information of Chinese character pronunciation and shape with dictionary knowledge. Two key points are involved: first, the definitions and common-sense knowledge of Chinese words in a dictionary are introduced to enhance the ability of the pre-trained language model to detect and correct wrongly written characters; second, a unified contrastive learning framework is used for the task, with contrastive learning tasks added in the model training stage to guide the model to jointly learn the pronunciation, shape, and dictionary knowledge of Chinese characters. The dictionary contains the definitions and common-sense knowledge of words and can be used to enhance detection and correction performance.
The method provided by the embodiment of the invention is shown in Fig. 1. The method is a unified contrastive learning framework comprising a main module and three auxiliary modules: the main module is a pre-trained language model, specifically a BERT pre-trained language model; the auxiliary modules are a character-pronunciation encoding module, a character-shape encoding module, and a dictionary encoding module. In the model training stage, in addition to training the main module directly on the wrongly-written-character correction task, three contrastive learning tasks are added. Positive and negative examples required for contrastive learning are constructed separately for pronunciation, shape, and dictionary knowledge, and the three auxiliary modules respectively encode the pronunciation, shape, and dictionary-paraphrase information of Chinese characters, guiding the main module to learn the pronunciation, shape, word definitions, common-sense knowledge, and other knowledge of Chinese characters. After the training stage, the main module already contains the various kinds of knowledge required by the detection and correction task, so only the main module is retained in the inference stage, ensuring the model's inference efficiency.
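As a rough structural sketch of the framework just described, in plain Python. All class names, the embedding scheme, and the dimensions below are illustrative stand-ins, not the patent's implementation; a real system would use a BERT encoder and trained auxiliary networks:

```python
class ToyEncoder:
    """Stand-in for a neural encoder: maps each character of a sentence to a
    fixed-size vector. Deterministic pseudo-embeddings for illustration only."""
    def __init__(self, dim=4, seed=1):
        self.dim, self.seed = dim, seed

    def encode(self, sentence):
        # One vector per character, derived from the character code point.
        return [[((ord(ch) * self.seed * (d + 1)) % 97) / 97
                 for d in range(self.dim)]
                for ch in sentence]

class ContrastiveFramework:
    """Main module (pre-trained language model) plus three auxiliary encoders.
    The auxiliary encoders are used only during training; per step S3, only
    the main module is kept for inference."""
    def __init__(self):
        self.main = ToyEncoder(seed=1)           # pre-trained language model
        self.pronunciation = ToyEncoder(seed=2)  # character-pronunciation encoder
        self.glyph = ToyEncoder(seed=3)          # character-shape encoder
        self.dictionary = ToyEncoder(seed=5)     # dictionary encoder

    def inference_model(self):
        # Model-inference stage: discard the auxiliary modules.
        return self.main
```

The design point the sketch captures is that the auxiliary modules add cost only at training time; inference uses a single encoder.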
Example:
s1, obtaining a comparison learning model:
the comparative learning model comprises the following modules: the main module is a pre-training language model; the auxiliary module comprises: the character-pronunciation encoding module, the character-shape encoding module and the dictionary encoding module.
The method for detecting and correcting wrongly written characters provided by the embodiment of the invention is divided into two parts, namely model training and model reasoning.
S2, model training:
the model training part consists of four tasks: the method comprises the following steps of correcting wrongly written characters, comparing and learning words and pronunciations, comparing and learning words and dictionaries, and specifically describing four tasks of a model training part as follows:
wrongly written character correcting training task
Given a sentence $X = \{x_1, x_2, \dots, x_n\}$ of length n containing wrongly written characters, where $x_i$ is the i-th character of X, the main module (the BERT pre-trained language model) encodes the sentence and predicts the corrected sentence $Y = \{y_1, y_2, \dots, y_n\}$, where $y_i$ is the i-th character of Y. In addition, the corresponding correct sentence without wrongly written characters, $L = \{l_1, l_2, \dots, l_n\}$, is given, where $l_i$ is the i-th character of L. The objective function of the correction training task maximizes the probability that the predicted sentence matches the correct sentence:

$$\mathcal{L}_{c} = -\sum_{i=1}^{n} \log p(y_i = l_i \mid X)$$
where p represents the conditional probability.
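The correction objective above reads as a per-position negative log-likelihood of the correct sentence. A hedged sketch in plain Python; the dict-of-probabilities interface below is an illustrative simplification of a softmax output, not the patent's code:

```python
import math

def correction_loss(char_probs, correct_sentence):
    """Negative log-likelihood of the correct sentence L under the model's
    per-position output distributions; minimising this maximises the
    probability that the predicted sentence matches L."""
    return -sum(math.log(char_probs[i][ch])
                for i, ch in enumerate(correct_sentence))
```

For example, if the model assigns probability 0.5 and 0.25 to the two correct characters, the loss is ln 8.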
Word and pronunciation comparison learning task
The pronunciation of Chinese characters is expressed in pinyin. To enable the model to better detect and correct wrongly written characters with similar pronunciations, a character-pronunciation contrastive learning task is proposed. Its aim is to shorten the distance between characters with similar pronunciations in the model's representation space and to push apart characters with different pronunciations, so that when handling a spelling error the model preferentially associates it with similarly pronounced characters.
Specifically, for a training sentence X containing a wrongly written character, the wrongly written character is replaced with a character of similar pinyin to obtain a new sentence, which serves as the positive example $X^{p}$ of the character-pronunciation contrastive learning task. In addition, N negative examples $\{X^{n_1}, \dots, X^{n_N}\}$ are obtained by replacing the wrongly written character with other random Chinese characters. The original sentence X is passed through the encoder of the main module to obtain the representation $P_o$ of each character, and the constructed positive and negative examples are passed through the character-pronunciation encoding module of the auxiliary module to obtain the representations $P_p$ and $\{P_{n_1}, \dots, P_{n_N}\}$ respectively.
The similarities between the pronunciation-level representations of the original sentence and the positive example and all negative examples are computed as:

$$\mathrm{sim}(X, X^{p}) = (P_o^{s})^{\top} P_p^{s}$$

$$\mathrm{sim}(X, X^{n_i}) = (P_o^{s})^{\top} P_{n_i}^{s}$$

where s is the position of the replaced Chinese character and $\top$ denotes the transpose of a representation vector. Only the representation vector at the position of the replaced character is considered when computing the similarity.
The InfoNCE function is used as the optimization objective of the character-pronunciation contrastive learning task. Its purpose is to pull the original sentence and the positive example together, and push the original sentence and the negative examples apart, in the representation space of the main module:

$$\mathcal{L}_{p} = -\log \frac{\exp\big(\mathrm{sim}(X, X^{p})\big)}{\exp\big(\mathrm{sim}(X, X^{p})\big) + \sum_{i=1}^{N} \exp\big(\mathrm{sim}(X, X^{n_i})\big)}$$
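The InfoNCE objective can be sketched in plain Python as follows. This is a generic formulation; with the default temperature of 1 it matches the formula as reconstructed above, and any other temperature value is an assumption, not specified in the patent:

```python
import math

def info_nce(sim_pos, sim_negs, temperature=1.0):
    """InfoNCE loss for one sentence: the positive similarity competes in a
    softmax against the N negative similarities, so minimising the loss pulls
    the positive example close and pushes the negatives away."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

The same function serves as the objective for all three contrastive tasks; only the similarity inputs differ.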
font comparison learning task
Similar to the character-pronunciation task, the character-shape contrastive learning task is proposed to train the model to distinguish, in the representation space, characters with similar shapes from those with dissimilar shapes, thereby improving the model's ability to detect and correct shape-confusion errors.
Specifically, and largely parallel to the previous task, for a training sentence X containing a wrongly written character, the wrongly written character is replaced with a character of similar shape, and the new sentence serves as the positive example $X^{p}$ of the character-shape contrastive learning task. In addition, N negative examples $\{X^{n_1}, \dots, X^{n_N}\}$ are obtained by replacing the wrongly written character with other random Chinese characters. The original sentence X is passed through the encoder of the main module, and the positive and negative examples are passed through the character-shape encoding module of the auxiliary module, to obtain the representations $V_o$, $V_p$, and $\{V_{n_1}, \dots, V_{n_N}\}$ respectively.
The similarities between the shape-level representations of the original sentence and the positive example and all negative examples are computed as:

$$\mathrm{sim}(X, X^{p}) = (V_o^{s})^{\top} V_p^{s}$$

$$\mathrm{sim}(X, X^{n_i}) = (V_o^{s})^{\top} V_{n_i}^{s}$$

where s is the position of the replaced Chinese character and $\top$ denotes the transpose of a representation vector.
InfoNCE is also used as the optimization objective of the character-shape contrastive learning task:

$$\mathcal{L}_{v} = -\log \frac{\exp\big(\mathrm{sim}(X, X^{p})\big)}{\exp\big(\mathrm{sim}(X, X^{p})\big) + \sum_{i=1}^{N} \exp\big(\mathrm{sim}(X, X^{n_i})\big)}$$
dictionary comparison learning task
When wrongly written characters cannot be corrected from pronunciation and shape information alone, the word definitions and common-sense knowledge contained in a dictionary are very useful for correcting spelling errors. The dictionary contrastive learning task is proposed to enhance the model's ability to understand word meanings and related common sense, and to guide the model to associate relevant word meanings and make appropriate corrections when a spelling error is found.
Specifically, given a training sentence X containing a wrongly written character, the corresponding correct sentence without errors is found, and the phrase at the position of the wrongly written character is taken from the correct sentence. The paraphrase of this phrase is looked up in the dictionary as the positive example $X^{p}$ of the dictionary contrastive learning task. In addition, N paraphrase sentences of other words are randomly selected from the dictionary as negative examples $\{X^{n_1}, \dots, X^{n_N}\}$. The original sentence X is passed through the encoder of the main module to obtain the representation $D_o$ of each character, and the positive and negative paraphrase sentences are passed through the dictionary encoding module of the auxiliary module to obtain the representations $D_p$ and $\{D_{n_1}, \dots, D_{n_N}\}$ respectively.
When computing the similarity between the original sentence and the positive and negative examples, all indices $\{s, s+1, \dots, s+w\}$ of the phrase starting at the index s of the wrongly written character are obtained, where w is the number of characters in the phrase, p denotes the positive example, and $n_i$ denotes the i-th negative example. Sentence-level representations of the original sentence, the positive paraphrase sentence, and the negative paraphrase sentences are obtained by average pooling, and cosine similarity is computed between the original sentence and each paraphrase sentence:

$$\mathrm{sim}(X, X^{p}) = \cos\big(\mathrm{avg}(D_o^{\,s:s+w}),\ \mathrm{avg}(D_p)\big)$$

$$\mathrm{sim}(X, X^{n_i}) = \cos\big(\mathrm{avg}(D_o^{\,s:s+w}),\ \mathrm{avg}(D_{n_i})\big)$$
InfoNCE is again used as the objective of the dictionary contrastive learning task:

$$\mathcal{L}_{d} = -\log \frac{\exp\big(\mathrm{sim}(X, X^{p})\big)}{\exp\big(\mathrm{sim}(X, X^{p})\big) + \sum_{i=1}^{N} \exp\big(\mathrm{sim}(X, X^{n_i})\big)}$$
the four training tasks can be synchronously applied to the training process of the model, are used for guiding the model to learn the pronunciation, the font and the dictionary knowledge of the Chinese characters, and can complete the detection and correction tasks of wrongly-written characters. The objective functions to be optimized of all tasks are weighted and summed to be used as the total objective function of model training, and the expression is as follows:
Figure BDA0003798183980000092
in the above formula, λ 1234 And respectively representing the weights of the target functions corresponding to the four adjustable training tasks in the total target function.
S3, model reasoning:
in the model reasoning phase, only the main modules which already contain various required knowledge are reserved for detecting and correcting wrongly written characters. Similar to the task of correcting wrongly written words, a sentence X = { X ] containing wrongly written words is input to the model 1 ,x 2 ,...,x n The model predicts and outputs a sentence Y = { Y) after the correction of the wrongly written characters is finished 1 ,y 2 ,...,y n And (5) regarding Chinese characters different from the Chinese characters in the original sentence as detected wrongly-written characters, so as to complete the detection and correction of wrongly-written characters.
A comparison of the experimental results of the proposed method and existing methods on the wrongly-written-character detection and correction task is shown in Table 1:
TABLE 1
[Table 1 appears as an image in the original publication; it reports detection and correction precision, recall, and F1 on SIGHAN13, SIGHAN14, and SIGHAN15.]
Here SIGHAN13, SIGHAN14, and SIGHAN15 are three widely used evaluation datasets for wrongly-written-character detection and correction, and LEAD is the method provided by the present invention. The table reports precision, recall, and the F1 score for both detection and correction; the F1 score, the harmonic mean of precision and recall, reflects overall performance most comprehensively. The results in Table 1 show that on all three widely used evaluation datasets, the F1 scores of the proposed method exceed those of all existing methods. On SIGHAN13 and SIGHAN15, the correction precision, recall, and F1 of the proposed method all exceed the most advanced prior methods, indicating that the embodiment of the invention achieves results surpassing all existing techniques.
The method for detecting and correcting the wrongly-written characters provided by the embodiment of the invention enhances the capability of detecting and correcting the wrongly-written characters of the pre-training language model, and the effect of the method for detecting and correcting the wrongly-written characters on the task of detecting and correcting the wrongly-written characters is better than that of the existing various methods on the premise of not reducing the reasoning efficiency.
Experimental example
Take the following sentence containing a wrongly written character as an example: "meet each and every difficulty and overcome it." In this sentence, the miswritten word (glossed as "fixed-difficult") should be replaced with the word for "difficulty". However, the erroneous character ("fixed") and the correct character ("difficult") differ greatly in both shape and pronunciation, so without the assistance of other knowledge the miswritten word would instead be replaced with a word whose pronunciation is closer (glossed as "hard-difficult"). Because the method provided by the embodiment of the invention introduces the paraphrase information of dictionary words in the training stage, a model trained with the method can, while correcting the sentence, associate the dictionary paraphrase of "difficulty": "(noun) a problem or obstacle difficult to solve in work or life; to overcome ...", link it to the word "overcome" appearing in the sentence, and thereby make the correct judgement.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is a further detailed description of the invention in connection with specific/preferred embodiments, and it is not intended to limit the invention to the specific embodiments described. It will be apparent to those skilled in the art that numerous alterations and modifications can be made to the described embodiments without departing from the inventive concepts herein, and such alterations and modifications are to be considered as within the scope of the invention. In the description herein, references to the terms "one embodiment," "some embodiments," "preferred embodiment," "an example," "a specific example," "some examples" and the like mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Those skilled in the art may combine the features of the different embodiments or examples described in this specification, provided they do not contradict one another. Although the embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the application.

Claims (10)

1. A method for detecting and correcting wrongly written characters is characterized by comprising the following steps:
s1, obtaining a contrastive learning model, wherein the contrastive learning model comprises the following modules: a main module, which is a pre-trained language model, and an auxiliary module, which comprises a character-pronunciation encoding module, a character-shape encoding module and a dictionary encoding module;
s2, model training: training the main module directly on the wrongly written character correction task while adding contrastive learning tasks, constructing the positive and negative examples required for contrastive learning separately for character pronunciation, character shape and dictionary knowledge, and using the auxiliary module to encode the pronunciation, shape and dictionary-definition information of Chinese characters respectively, thereby guiding the main module to learn the pronunciation and shape of Chinese characters as well as word definitions and common knowledge, so that after the training stage the main module already contains the knowledge required by the wrongly written character detection and correction task;
s3, model inference: retaining only the main module for inference, so as to ensure the inference efficiency of the model.
2. The method for detecting and correcting wrongly written characters as claimed in claim 1, wherein the contrastive learning in step S2 comprises: a character-pronunciation contrastive learning task, which pulls characters with similar pronunciations closer together in the model's representation space and pushes characters with different pronunciations apart; a character-shape contrastive learning task, which trains the model to distinguish, in the representation space, Chinese characters with similar shapes from those with dissimilar shapes; and a dictionary contrastive learning task, which enhances the model's ability to understand word definitions and common knowledge and guides the model to link to the relevant word definitions and common knowledge when detecting and correcting spelling errors.
3. The method for detecting and correcting wrongly written characters as claimed in claim 2, wherein the training process of the dictionary contrastive learning task comprises the following steps:
a1, obtaining a sentence X containing a wrongly written character and the corresponding correct sentence without the wrongly written character, and determining the phrase corresponding to the position of the wrongly written character;
a2, obtaining the dictionary paraphrase of the phrase as the positive example p of the dictionary contrastive learning task, and randomly selecting the paraphrase sentences corresponding to N other words in the dictionary as the negative examples {n_1, n_2, ..., n_N} of the task;
a3, passing the sentence X containing the wrongly written character through the encoder of the main module to obtain the representation D_o of each character in the sentence, and passing the positive-example and negative-example paraphrase sentences through the dictionary encoding module in the auxiliary module to obtain the representations D_p and {D_{n_1}, ..., D_{n_N}} of each character in those sentences;
a4, calculating the similarity between the sentence X containing the wrongly written character and the positive and negative examples: obtaining all indexes {s, s+1, ..., s+w} of the phrase located at the index s of the wrongly written character, obtaining sentence-level representations of the sentence X, the positive-example paraphrase sentence and the negative-example paraphrase sentences by average pooling, and taking the cosine similarity as the similarity between the sentence X and the corresponding positive-example and negative-example paraphrase sentences.
4. The method for detecting and correcting wrongly written characters as claimed in claim 3, wherein the similarity in step A4 is computed as follows:

$$\mathrm{sim}(X, p) = \cos\Big(\underset{j \in \{s, \ldots, s+w\}}{\mathrm{avg}} D_o^{j},\ \mathrm{avg}(D_p)\Big)$$

$$\mathrm{sim}(X, n_i) = \cos\Big(\underset{j \in \{s, \ldots, s+w\}}{\mathrm{avg}} D_o^{j},\ \mathrm{avg}(D_{n_i})\Big)$$

wherein w represents the number of characters contained in the phrase at the position of the wrongly written character, p represents the positive example in contrastive learning, and n_i represents the i-th negative example.
5. The method as claimed in claim 2, wherein the character-pronunciation contrastive learning task, the character-shape contrastive learning task and the dictionary contrastive learning task all use InfoNCE as the objective function, and the pre-trained language model is a BERT pre-trained language model.
6. The method for detecting and correcting wrongly written characters as claimed in claim 2, wherein the training process of the character-pronunciation contrastive learning task comprises the steps of:
b1, obtaining a sentence X containing wrongly written characters;
b2, replacing the wrongly written character with a character of similar pinyin to obtain a new sentence, taking this sentence as the positive example p of the character-pronunciation contrastive learning task, and replacing the wrongly written character with other random Chinese characters to obtain N negative examples {n_1, ..., n_N};
b3, passing the sentence X containing the wrongly written character through the encoder of the main module to obtain the representation P_o of each character in the sentence, and passing the constructed positive and negative example sentences through the character-pronunciation encoding module in the auxiliary module to obtain the representations P_p and {P_{n_1}, ..., P_{n_N}} of each character in those sentences;
b4, calculating the similarity, at the character-pronunciation level, between the sentence X containing the wrongly written character and the positive example and all negative examples.
7. The method for detecting and correcting wrongly written characters as claimed in claim 6, wherein the similarity in step B4 is computed as follows:

$$\mathrm{sim}(X, p) = \cos\big(P_o^{s},\ P_p^{s}\big)$$

$$\mathrm{sim}(X, n_i) = \cos\big(P_o^{s},\ P_{n_i}^{s}\big)$$

wherein s represents the position of the replaced Chinese character.
8. The method for detecting and correcting wrongly written characters as claimed in claim 2, wherein the training process of the character-shape contrastive learning task comprises the steps of:
c1, obtaining a sentence X containing wrongly written characters;
c2, replacing the wrongly written character in the sentence X with a character of similar shape and taking the new sentence as the positive example p of the character-shape contrastive learning task, and in addition replacing the wrongly written character with other random Chinese characters to obtain N negative examples {n_1, ..., n_N};
c3, passing the sentence X containing the wrongly written character through the encoder of the main module, and the positive and negative examples through the character-shape encoding module in the auxiliary module, to obtain the representations V_o, V_p and {V_{n_1}, ..., V_{n_N}} of each character in the corresponding sentences;
c4, calculating the similarity, at the character-shape level, between the sentence X containing the wrongly written character and the positive and negative examples respectively.
9. The method for detecting and correcting wrongly written characters as claimed in claim 8, wherein the similarity in step C4 is computed as follows:

$$\mathrm{sim}(X, p) = \cos\big(V_o^{s},\ V_p^{s}\big)$$

$$\mathrm{sim}(X, n_i) = \cos\big(V_o^{s},\ V_{n_i}^{s}\big)$$

wherein s represents the position of the replaced Chinese character.
10. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method for detecting and correcting wrongly written characters as claimed in any one of claims 1 to 9.
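The InfoNCE objective named in claim 5, applied to the similarity scores defined in claims 4, 7 and 9, can be sketched as follows. This is an illustrative sketch only: the temperature value is an assumption (the claims do not specify one), and the similarity inputs are hand-picked numbers rather than real model outputs.

```python
import math

def info_nce(sim_pos, sim_negs, temperature=0.07):
    """InfoNCE loss: -log( exp(s+/t) / (exp(s+/t) + sum_i exp(s_i-/t)) ).
    Minimizing it pushes the positive similarity up relative to the negatives.
    The temperature t is a hyperparameter (value assumed here)."""
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    m = max(logits)  # log-sum-exp stabilization to avoid overflow
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - sim_pos / temperature

# When the positive example already outscores the negatives, the loss is small;
# when a negative outscores the positive, the loss is large.
well_separated = info_nce(0.9, [0.1, 0.0, -0.2])
confused = info_nce(0.1, [0.9, 0.8, 0.7])
print("well separated:", well_separated)
print("confused:", confused)
```

The same function serves all three contrastive tasks; only the similarity definitions (phrase-level with average pooling for the dictionary task, position s only for the pronunciation and shape tasks) differ.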
CN202210975544.4A 2022-08-15 2022-08-15 Wrongly written character detection and correction method Pending CN115310432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210975544.4A CN115310432A (en) 2022-08-15 2022-08-15 Wrongly written character detection and correction method


Publications (1)

Publication Number Publication Date
CN115310432A true CN115310432A (en) 2022-11-08

Family

ID=83863224


Country Status (1)

Country Link
CN (1) CN115310432A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127953A (en) * 2023-04-18 2023-05-16 之江实验室 Chinese spelling error correction method, device and medium based on contrast learning



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination