CN117574882A - Progressive multitasking Chinese mispronounced character correcting method - Google Patents

Progressive multitasking Chinese mispronounced character correcting method Download PDF

Info

Publication number
CN117574882A
Authority
CN
China
Prior art keywords
subtask
chinese
wrongly written
reasoning
progressive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311600238.3A
Other languages
Chinese (zh)
Inventor
郑海涛
李映辉
黄浩靖
江勇
夏树涛
肖喜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University
Priority to CN202311600238.3A
Publication of CN117574882A
Legal status: Pending (Current)


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/232: Orthographic correction, e.g. spell checking or vowelisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/048: Activation functions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00: Computing arrangements using knowledge-based models
    • G06N 5/04: Inference or reasoning models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A progressive multi-task method for correcting misspelled Chinese characters includes the following steps: S1, a detection subtask identifies which characters in a sentence are misspelled; S2, a reasoning subtask infers, for each misspelled character found by the detection subtask, whether the error is caused by phonetic similarity (near-sound) or by visual similarity (near-shape); and S3, a search subtask looks up the correct character in the corresponding external confusion set according to the error positions and error types provided by the detection and reasoning subtasks. By decomposing the correction task into three progressively harder subtasks that respectively identify misspelled characters, attribute the cause of each error, and introduce external knowledge for correction, the method improves the performance of existing error correction models. In addition, the trained module can be combined with any non-autoregressive correction model in a plug-and-play manner, directly improving model performance and saving training time.

Description

Progressive multitasking Chinese mispronounced character correcting method
Technical Field
The invention relates to computer text-processing technology, and in particular to a progressive multi-task method for correcting misspelled Chinese characters.
Background
Chinese spelling error detection and correction is the task of automatically detecting and correcting spelling errors among the Chinese characters of a sentence. The task has important applications in search engines, intelligent writing assistants, speech recognition, optical character recognition, and related fields, and has therefore attracted extensive attention from researchers. The current mainstream approach relies on Bidirectional Encoder Representations from Transformers (BERT): a model is first pre-trained without supervision on a large-scale text corpus and then fine-tuned with supervision on a specific Chinese spelling correction dataset, achieving excellent correction results. Some recent works additionally use confusion sets in the pre-training stage so that the model learns the glyph and pinyin information of Chinese characters, further improving detection and correction of misspelled characters.
One existing implementation is based on a BERT pre-trained language model. It obtains semantic information for each Chinese character in the input sentence, encodes Chinese character strokes and pinyin sequences with neural networks to extract the phonetic and glyph information of the sentence, fuses the three kinds of character information with an adaptive gating mechanism, and finally outputs the corrected sentence, thereby improving the language understanding and spelling correction ability of the model and surpassing the previously most advanced methods.
However, when external knowledge (e.g., confusion sets or dictionary information) is introduced, it is currently injected implicitly into the Chinese spelling correction model in the form of embeddings, which lacks interpretability and is inefficient.
It should be noted that the information disclosed in the above background section is only intended to aid understanding of the background of the present application and may therefore include information that does not constitute prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention aims to overcome the above deficiencies of the background art and to provide a progressive multi-task method for correcting misspelled Chinese characters.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
A progressive multi-task Chinese misspelled character correction method includes the following steps:
S1, a detection subtask identifies which characters in a sentence are misspelled;
S2, a reasoning subtask infers, for each misspelled character detected by the detection subtask, whether the error is caused by phonetic similarity (near-sound) or by visual similarity (near-shape);
S3, a search subtask looks up the correct character in the corresponding external confusion set according to the error positions and error types provided by the detection subtask and the reasoning subtask.
Further:
In step S1, a given sentence X is input into the pre-trained encoder E to obtain the sentence representation H, and the probability that the i-th Chinese character in the sentence is erroneous is then computed:
H = E(X) = {h_1, h_2, ..., h_T}
p_i^d = softmax(W_D·h_i + b_D)
where h_i ∈ R^hidden, hidden is the hidden-layer dimension of E, and W_D ∈ R^(2×hidden) and b_D ∈ R^2 are learnable parameters; p_i^d is a two-dimensional vector, from which the prediction of the detection subtask is obtained:
ŷ_i^d = argmax(p_i^d)
where ŷ_i^d takes values in {0, 1}: 0 indicates that the Chinese character is correct and 1 indicates that it is erroneous.
For training the detection subtask, a cross-entropy loss function is used to compute its loss L_D.
In step S2, the reasoning prediction probability p_i^r of the i-th Chinese character in the sentence is computed and the reasoning prediction ŷ_i^r is obtained as follows:
p_i^r = softmax(W_R·h_i + b_R)
ŷ_i^r = argmax(p_i^r)
where W_R ∈ R^(2×hidden) and b_R ∈ R^2 are learnable parameters; ŷ_i^r = 1 indicates that the character is a near-sound (phonetic) error, and ŷ_i^r = 0 indicates a near-shape (graphic) error.
For training the reasoning subtask, a cross-entropy loss function is used to compute its loss L_R.
In step S3, for each Chinese character in X, the probability that it should be each character in the pre-trained vocabulary is predicted:
p_i^s = softmax(W_S·h_i + b_S)
where W_S ∈ R^(vocab×hidden) and b_S ∈ R^vocab are learnable parameters and vocab is the size of the pre-trained vocabulary;
the search subtask uses the error-position information from the detection subtask and the error-type information from the reasoning subtask, together with the external confusion sets, to construct a search matrix:
C = {c_1, c_2, ..., c_T}, c_i ∈ R^vocab, C ∈ R^(T×vocab)
where the vectors c_i^p and c_i^s have the vocabulary size as their dimension and take values in {0, 1}: in c_i^p, the elements at the positions of the characters in the near-sound (phonetic) confusion set of x_i are set to 1 and all other elements are 0; in c_i^s, the elements at the positions of the characters in the near-shape (graphic) confusion set of x_i are set to 1 and all other elements are 0;
after the search matrix C is obtained, the corrected probability matrix P is obtained from the initial probability matrix P_s and C:
P = P_s ⊙ C = {p_1, p_2, ..., p_T}
where ⊙ denotes element-wise multiplication.
In step S3, a cross-entropy loss function is also used for the search subtask, giving its loss L_S.
The three training tasks of the detection subtask, the reasoning subtask, and the search subtask are applied synchronously in the training process of the model, and the weighted sum of the loss functions to be optimized for all tasks is taken as the total loss of model training:
L = λ_D·L_D + λ_R·L_R + λ_S·L_S
where λ_D, λ_R, and λ_S are the loss weights.
In the model inference stage, the trained encoder E together with the detection subtask and the reasoning subtask is combined with a new Chinese misspelled character correction model or pre-trained model E', where the text is input into both encoders E and E':
H = E(X), H' = E'(X)
H serves as the input of the detection subtask and the reasoning subtask to produce their results, and the search matrix C is then constructed in the manner described for the search subtask; H' is input into the output layer of the Chinese misspelled character correction model to obtain the output probability matrix P' of the new Chinese spelling correction model or pre-trained model:
C = D-R Module(H)
P' = Output Layer(H')
The initial probability matrix P' can then be enhanced by the search matrix C.
A computer-readable storage medium stores a computer program which, when executed by a processor, implements the progressive multi-task Chinese misspelled character correction method described above.
The invention has the following beneficial effects:
the invention provides a plug-and-play progressive multitask Chinese wrongly written and wrongly written word correcting method, which is characterized in that an error correcting task is decomposed into three difficult subtasks, wrongly written and wrongly written words are respectively identified, wrongly written and wrongly written words are attributed, and external knowledge is introduced to correct, so that the performance of an existing error correcting model is improved, and the method can combine a trained module with any non-autoregressive error correcting model to realize plug-and-play property, directly improve the model performance and save training time. The advantages are that:
1. The conventional Chinese misspelled character correction task is divided into three subtasks of progressively increasing difficulty; each subtask helps the model learn more knowledge from the data, and external knowledge is introduced naturally, which improves interpretability. 2. The method is plug-and-play: the trained module can be combined with other Chinese misspelled character correction models to directly improve their correction performance.
Experiments show that decomposing the Chinese misspelled character correction task into three subtasks of progressively increasing difficulty and introducing external knowledge through the search subtask significantly improves the performance of the detection and correction model, surpassing a variety of existing methods. This result is of significance for improving Chinese spelling error detection and correction in the field of natural language processing.
Other advantages of embodiments of the present invention are further described below.
Drawings
FIG. 1 is a flowchart of a Chinese misspelled character correction method according to an embodiment of the present invention.
FIG. 2 is a framework diagram of the plug-and-play progressive multi-task Chinese misspelled character correction method according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in detail. It should be emphasized that the following description is merely exemplary in nature and is in no way intended to limit the scope of the invention or its applications.
Referring to FIG. 1, an embodiment of the invention provides a plug-and-play progressive multi-task Chinese misspelled character correction method, comprising the following steps:
S1, a detection subtask identifies which characters in a sentence are misspelled;
S2, a reasoning subtask infers, for each misspelled character detected by the detection subtask, whether the error is caused by phonetic similarity (near-sound) or by visual similarity (near-shape);
S3, a search subtask looks up the correct character in the corresponding external confusion set according to the error positions and error types provided by the detection subtask and the reasoning subtask.
The invention decomposes the misspelled character correction task into several subtasks and explicitly introduces external knowledge to enhance the model's ability to detect and correct misspelled Chinese characters.
The invention decomposes the Chinese misspelled character correction task into three subtasks of progressively increasing difficulty to enhance the performance of the correction model, and naturally introduces external knowledge as an aid in the search subtask; this explicit approach avoids the lack of interpretability of implicit approaches. The proposed method can be combined with existing Chinese spelling correction models: a trained module can be attached to a new model to strengthen its detection and correction ability, reflecting its plug-and-play nature and reducing extra training overhead. On the misspelled character detection and correction task, the proposed scheme outperforms a variety of existing methods.
FIG. 2 illustrates the framework of the plug-and-play progressive multi-task Chinese misspelled character correction method according to an embodiment of the invention. When correcting misspelled characters in a sentence, it is important both to identify which character is wrong and to determine the type of the error; owing to the characteristics of Chinese, misspelled characters are usually classified as near-sound (phonetically similar) errors or near-shape (visually similar) errors. The method decomposes the conventional correction task into three subtasks of progressively increasing difficulty: the detection subtask, the reasoning subtask, and the search subtask. They respectively answer the three questions "Which character is misspelled?", "Why did this error occur?", and "Which is the correct character?". The detection subtask identifies which characters in the sentence are misspelled; the reasoning subtask infers, for each misspelled character detected, whether the error is caused by phonetic or visual similarity; and the search subtask looks up the correct character in the corresponding confusion set according to the error positions and error types provided by the first two subtasks. In the search subtask, confusion sets, a form of external knowledge commonly used in this field, are employed to reduce the search space of the model. Because most Chinese spelling correction models are non-autoregressive, the trained module can be inserted directly into a new correction model at the inference stage, which directly improves its performance, saves training time, and realizes plug-and-play use.
As shown in FIG. 2, the embodiment of the invention provides a plug-and-play progressive multi-task Chinese misspelled character correction method for detecting and correcting misspelled characters. The framework consists of three subtask modules of progressively increasing difficulty: the detection subtask, the reasoning subtask, and the search subtask. The detection subtask identifies which characters in the sentence are misspelled; the reasoning subtask infers whether each detected error is caused by phonetic or visual similarity; and the search subtask looks up the correct character in the corresponding confusion set according to the error positions and error types provided by the first two subtasks.
Task definition: given a sentence X = {x_1, x_2, ..., x_T} of length T containing misspelled characters, the model predicts over the sentence and outputs the corrected sentence Y = {y_1, y_2, ..., y_T}.
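For illustration, a minimal sketch of this task setup in Python is given below; the example sentence and its typo are hypothetical and not taken from the original text.

    # Input and output have the same length T; correction is a per-position substitution.
    X = list("他明天要去北经")   # hypothetical input: "经" is a near-sound typo for "京"
    Y = list("他明天要去北京")   # corrected output sentence
    assert len(X) == len(Y)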
Detection subtask
The detection subtask detects the erroneous Chinese characters in the sentence, i.e., predicts whether each Chinese character in the sentence is correct or wrong. To this end, we input a given sentence X into the pre-trained encoder E to obtain the sentence representation H, and then compute the probability that the i-th Chinese character in the sentence is erroneous:
H = E(X) = {h_1, h_2, ..., h_T}
p_i^d = softmax(W_D·h_i + b_D)
where h_i ∈ R^hidden, hidden is the hidden-layer dimension of E, and W_D ∈ R^(2×hidden) and b_D ∈ R^2 are learnable parameters. p_i^d is a two-dimensional vector, from which the prediction of the detection subtask is obtained:
ŷ_i^d = argmax(p_i^d)
where ŷ_i^d takes values in {0, 1}: 0 indicates that the Chinese character is correct and 1 indicates that it is erroneous. For training of the detection subtask, we use the cross-entropy loss function to compute its loss L_D.
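The following is a minimal PyTorch-style sketch of such a detection head; it follows the formulas above, but the class and variable names (and the batch dimension) are assumptions for illustration rather than part of the original text.

    import torch
    import torch.nn as nn

    class DetectionHead(nn.Module):
        """Per-character binary classifier: 0 = correct, 1 = erroneous."""
        def __init__(self, hidden: int):
            super().__init__()
            self.proj = nn.Linear(hidden, 2)      # W_D and b_D

        def forward(self, H: torch.Tensor):
            # H: (batch, T, hidden), the sentence representation from encoder E
            logits = self.proj(H)                 # (batch, T, 2)
            probs = logits.softmax(dim=-1)        # p_i^d
            preds = probs.argmax(dim=-1)          # predicted labels in {0, 1}
            return logits, probs, preds

During training, the loss L_D can be computed with nn.CrossEntropyLoss() over the logits and the per-character 0/1 labels.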
Reasoning subtask
After the detection subtask, we know which Chinese characters in the sentence are wrong; the reasoning subtask then predicts whether each error is caused by phonetic (near-sound) or visual (near-shape) similarity. Analogously to the detection subtask, we compute the reasoning prediction probability p_i^r of the i-th Chinese character in the sentence and obtain the reasoning prediction ŷ_i^r as follows:
p_i^r = softmax(W_R·h_i + b_R)
ŷ_i^r = argmax(p_i^r)
where W_R ∈ R^(2×hidden) and b_R ∈ R^2 are learnable parameters. ŷ_i^r = 1 indicates that the character is a near-sound error, and ŷ_i^r = 0 indicates a near-shape error. For training of the reasoning subtask, we use the cross-entropy loss function to compute its loss L_R.
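Since the reasoning head has the same structure as the detection head (a separate two-way linear classifier W_R, b_R over h_i), a sketch can simply reuse the class above with its own parameters; the hidden size below is an assumed example value.

    # 1 = near-sound (phonetic) error, 0 = near-shape (graphic) error
    reasoning_head = DetectionHead(hidden=768)     # same architecture, separate weights
    # logits_r, probs_r, preds_r = reasoning_head(H)

Its loss L_R is computed with cross-entropy in the same way as L_D.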
Search subtask
The search subtask finds the correct Chinese character in the corresponding confusion set according to the results of the detection subtask and the reasoning subtask, combined with the information of the external confusion sets. Specifically, for each Chinese character in X, we predict the probability that it should be each character in the pre-trained vocabulary:
p_i^s = softmax(W_S·h_i + b_S)
where W_S ∈ R^(vocab×hidden) and b_S ∈ R^vocab are learnable parameters and vocab is the size of the pre-trained vocabulary. The main innovation of the search subtask is to use the error-position information from the detection subtask and the error-type information from the reasoning subtask, together with the external confusion sets, to construct a finer-grained search matrix that reduces the search space:
C = {c_1, c_2, ..., c_T}, c_i ∈ R^vocab, C ∈ R^(T×vocab)
where the vectors c_i^p and c_i^s have the vocabulary size as their dimension and take values in {0, 1}: in c_i^p, the elements at the positions of the characters in the near-sound (phonetic) confusion set of x_i are set to 1 and all other elements are 0; in c_i^s, the elements at the positions of the characters in the near-shape (graphic) confusion set of x_i are set to 1 and all other elements are 0.
After obtaining the search matrix C, we obtain the corrected probability matrix P from the initial probability matrix P_s and C:
P = P_s ⊙ C = {p_1, p_2, ..., p_T}
where ⊙ denotes element-wise multiplication. Through this operation, for a Chinese character that has been determined to be a near-sound error, the probability that it is predicted to be a character dissimilar in pronunciation is set to 0, and likewise for near-shape errors. The search matrix thus strengthens the probabilities of the more suitable candidate characters in the search task while also reducing the search space of candidate characters. We also use a cross-entropy loss L_S in the search task.
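The sketch below illustrates, under stated assumptions, how such a search matrix might be assembled and applied; the dictionaries phon_conf and graph_conf (mapping a character id to the vocabulary ids of its phonetic or graphic confusion set) and the handling of characters judged correct are assumptions for illustration, not details given in the text.

    import torch

    def build_search_matrix(input_ids, det_preds, rea_preds, phon_conf, graph_conf, vocab_size):
        """input_ids, det_preds, rea_preds: 1-D tensors of length T."""
        T = input_ids.size(0)
        C = torch.zeros(T, vocab_size)
        for i in range(T):
            cid = int(input_ids[i])
            if det_preds[i] == 0:
                C[i, :] = 1.0                  # judged correct: row left unrestricted (assumption)
            elif rea_preds[i] == 1:
                C[i, phon_conf[cid]] = 1.0     # near-sound error: phonetic confusion set
            else:
                C[i, graph_conf[cid]] = 1.0    # near-shape error: graphic confusion set
        return C                               # (T, vocab_size), entries in {0, 1}

    # P = P_s ⊙ C: element-wise multiplication zeroes out candidates outside the set.
    # P = P_s * C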
multitasking learning
The three training tasks are applied synchronously in the training process of the model and together accomplish the detection and correction of misspelled characters. The weighted sum of the loss functions to be optimized for all tasks is taken as the total loss of model training:
L = λ_D·L_D + λ_R·L_R + λ_S·L_S
where λ_D, λ_R, and λ_S are the loss weights.
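A minimal sketch of this weighted total loss follows; the default weight values are placeholders, since the text does not specify them.

    def total_loss(loss_d, loss_r, loss_s, w_d=1.0, w_r=1.0, w_s=1.0):
        """Weighted sum of the detection, reasoning, and search losses."""
        return w_d * loss_d + w_r * loss_r + w_s * loss_s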
plug-and-play model reasoning
As shown in FIG. 2, in the model inference stage, we can combine the trained encoder E and the detection and reasoning sub-modules with a new Chinese misspelled character correction model or pre-trained model E', without having to retrain the detection and reasoning modules. This is done by inputting the text into the two encoders E and E':
H = E(X), H' = E'(X)
H serves as the input of the detection and reasoning sub-modules to produce their results, and the search matrix C is then constructed in the manner described for the search subtask. H' is input into the output layer of the Chinese misspelled character correction model to obtain the output probability matrix P' of the new Chinese spelling correction model or pre-trained model:
C = D-R Module(H)
P' = Output Layer(H')
Finally, the initial probability matrix P' can be enhanced by the search matrix C.
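A sketch of this plug-and-play inference step is shown below; dr_module (wrapping the trained encoder E with the detection and reasoning heads and returning C) and the encoder/output_layer attributes of the new model are assumed interfaces for illustration.

    import torch

    @torch.no_grad()
    def plug_and_play_correct(input_ids, dr_module, new_model):
        C = dr_module(input_ids)                    # search matrix C = D-R Module(H), H = E(X)
        H_prime = new_model.encoder(input_ids)      # H' = E'(X)
        P_prime = new_model.output_layer(H_prime)   # initial probability matrix P'
        P = P_prime * C                             # enhance P' with the search matrix
        return P.argmax(dim=-1)                     # corrected character ids per position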
In summary, the invention provides a plug-and-play, progressive multi-task Chinese misspelled character correction method. By decomposing the correction task into three progressively harder subtasks that respectively identify misspelled characters, attribute the cause of each error, and introduce external knowledge for correction, the method improves the performance of existing error correction models; in addition, the trained module can be combined with any non-autoregressive correction model in a plug-and-play manner, directly improving model performance and saving training time. The advantages are as follows:
1. The conventional Chinese misspelled character correction task is divided into three subtasks of progressively increasing difficulty; each subtask helps the model learn more knowledge from the data, and external knowledge is introduced naturally, which improves interpretability. 2. The method is plug-and-play: the trained module can be combined with other Chinese misspelled character correction models to directly improve their correction performance.
Experiments show that decomposing the Chinese misspelled character correction task into three subtasks of progressively increasing difficulty and introducing external knowledge through the search subtask significantly improves the performance of the detection and correction model, surpassing a variety of existing methods. This result is of significance for improving Chinese spelling error detection and correction in the field of natural language processing.
Specific application scenarios of the present invention include, but are not limited to:
can be used for developing a writing assistant product and correcting spelling errors in user literal works.
Can be used to correct user input errors in search engines and recall more accurate search results to users.
Can be used for correcting recognition errors generated by a voice recognition (ASR) system and an Optical Character Recognition (OCR) system, and improving the accuracy of the ASR and the OCR.
The embodiments of the present invention also provide a storage medium storing a computer program which, when executed, performs at least the method as described above.
The embodiment of the invention also provides a control device, which comprises a processor and a storage medium for storing a computer program; wherein the processor is adapted to perform at least the method as described above when executing said computer program.
An embodiment of the present invention also provides a processor that executes a computer program to perform at least the method described above.
The storage medium may be implemented by any type of volatile or non-volatile storage device, or a combination thereof. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disk, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The storage media described in embodiments of the present invention are intended to comprise, without being limited to, these and any other suitable types of memory.
In the several embodiments provided by the present invention, it should be understood that the disclosed systems and methods may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other divisions in practice, such as: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or of other forms.
The units described as separate units may or may not be physically separate, and units displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or each unit may be separately used as one unit, or two or more units may be integrated in one unit; the integrated units may be implemented in hardware or in hardware plus software functional units.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware associated with program instructions. The foregoing program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the aforementioned storage medium includes: a removable storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or another medium capable of storing program code.
Alternatively, the above-described integrated units of the present invention may be stored in a computer-readable storage medium if implemented in the form of software functional modules and sold or used as separate products. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in essence or a part contributing to the prior art in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, ROM, RAM, magnetic or optical disk, or other medium capable of storing program code.
The methods disclosed in the method embodiments provided by the invention can be arbitrarily combined under the condition of no conflict to obtain a new method embodiment.
The features disclosed in the several product embodiments provided by the invention can be combined arbitrarily under the condition of no conflict to obtain new product embodiments.
The features disclosed in the embodiments of the method or the apparatus provided by the invention can be arbitrarily combined without conflict to obtain new embodiments of the method or the apparatus.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several equivalent substitutions and obvious modifications can be made without departing from the spirit of the invention, and the same should be considered to be within the scope of the invention.

Claims (10)

1. A progressive multi-task Chinese misspelled character correction method, characterized by comprising the following steps:
S1, a detection subtask identifies which characters in a sentence are misspelled;
S2, a reasoning subtask infers, for each misspelled character detected by the detection subtask, whether the error is caused by phonetic similarity (near-sound) or by visual similarity (near-shape);
S3, a search subtask looks up the correct character in the corresponding external confusion set according to the error positions and error types provided by the detection subtask and the reasoning subtask.
2. The progressive multi-task Chinese misspelled character correction method according to claim 1, characterized in that in step S1, a given sentence X is input into a pre-trained encoder E to obtain the sentence representation H, and the probability that the i-th Chinese character in the sentence is erroneous is then computed:
H = E(X) = {h_1, h_2, ..., h_T}
p_i^d = softmax(W_D·h_i + b_D)
where h_i ∈ R^hidden, hidden is the hidden-layer dimension of E, and W_D ∈ R^(2×hidden) and b_D ∈ R^2 are learnable parameters; p_i^d is a two-dimensional vector, from which the prediction of the detection subtask is obtained:
ŷ_i^d = argmax(p_i^d)
where ŷ_i^d takes values in {0, 1}: 0 indicates that the Chinese character is correct and 1 indicates that it is erroneous.
3. The progressive multi-task Chinese misspelled character correction method according to claim 2, characterized in that for training of the detection subtask, a cross-entropy loss function is used to compute its loss L_D.
4. The progressive multi-task Chinese misspelled character correction method according to any one of claims 1 to 3, characterized in that in step S2, the reasoning prediction probability p_i^r of the i-th Chinese character in the sentence is computed and the reasoning prediction ŷ_i^r is obtained as follows:
p_i^r = softmax(W_R·h_i + b_R)
ŷ_i^r = argmax(p_i^r)
where W_R ∈ R^(2×hidden) and b_R ∈ R^2 are learnable parameters; ŷ_i^r = 1 indicates that the character is a near-sound error, and ŷ_i^r = 0 indicates a near-shape error.
5. The progressive multi-task Chinese misspelled character correction method according to claim 4, characterized in that for training of the reasoning subtask, a cross-entropy loss function is used to compute its loss L_R.
6. The progressive multi-task Chinese misspelled character correction method according to any one of claims 1 to 3, characterized in that in step S3, for each Chinese character in X, the probability that it should be each character in the pre-trained vocabulary is predicted:
p_i^s = softmax(W_S·h_i + b_S)
where W_S ∈ R^(vocab×hidden) and b_S ∈ R^vocab are learnable parameters and vocab is the size of the pre-trained vocabulary;
the search subtask uses the error-position information from the detection subtask and the error-type information from the reasoning subtask, together with the external confusion sets, to construct a search matrix:
C = {c_1, c_2, ..., c_T}, c_i ∈ R^vocab, C ∈ R^(T×vocab)
where the vectors c_i^p and c_i^s have the vocabulary size as their dimension and take values in {0, 1}: in c_i^p, the elements at the positions of the characters in the near-sound (phonetic) confusion set of x_i are set to 1 and all other elements are 0; in c_i^s, the elements at the positions of the characters in the near-shape (graphic) confusion set of x_i are set to 1 and all other elements are 0;
after the search matrix C is obtained, the corrected probability matrix P is obtained from the initial probability matrix P_s and C:
P = P_s ⊙ C = {p_1, p_2, ..., p_T}
where ⊙ denotes element-wise multiplication.
7. The progressive multi-task Chinese misspelled character correction method according to any one of claims 1 to 3, characterized in that in step S3, a cross-entropy loss L_S is used in the search task.
8. The progressive multi-task Chinese misspelled character correction method according to any one of claims 1 to 7, characterized in that the three training tasks of the detection subtask, the reasoning subtask, and the search subtask are applied synchronously in the training process of the model, and the weighted sum of the loss functions to be optimized for all tasks is taken as the total loss of model training: L = λ_D·L_D + λ_R·L_R + λ_S·L_S, where λ_D, λ_R, and λ_S are the loss weights.
9. The progressive multi-task Chinese misspelled character correction method according to any one of claims 1 to 8, characterized in that in the model inference stage, the trained encoder E together with the detection subtask and the reasoning subtask is combined with a new Chinese misspelled character correction model or pre-trained model E', wherein the text is input into both encoders E and E':
H = E(X), H' = E'(X)
H serves as the input of the detection subtask and the reasoning subtask to produce their results, and the search matrix C is then constructed in the manner described for the search subtask; H' is input into the output layer of the Chinese misspelled character correction model to obtain the output probability matrix P' of the new Chinese spelling correction model or pre-trained model:
C = D-R Module(H)
P' = Output Layer(H')
The initial probability matrix P' can be enhanced by the search matrix C.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the progressive multi-task Chinese misspelled character correction method according to any one of claims 1 to 9.
CN202311600238.3A 2023-11-28 2023-11-28 Progressive multitasking Chinese mispronounced character correcting method Pending CN117574882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311600238.3A CN117574882A (en) 2023-11-28 2023-11-28 Progressive multitasking Chinese mispronounced character correcting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311600238.3A CN117574882A (en) 2023-11-28 2023-11-28 Progressive multitasking Chinese mispronounced character correcting method

Publications (1)

Publication Number Publication Date
CN117574882A true CN117574882A (en) 2024-02-20

Family

ID=89895210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311600238.3A Pending CN117574882A (en) 2023-11-28 2023-11-28 Progressive multitasking Chinese mispronounced character correcting method

Country Status (1)

Country Link
CN (1) CN117574882A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination