CN117973372A - Chinese grammar error correction method based on pinyin constraint - Google Patents
Chinese grammar error correction method based on pinyin constraint
- Publication number
- CN117973372A (application CN202410144119.XA)
- Authority
- CN
- China
- Prior art keywords
- model
- sound
- error correction
- confusion
- bart
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention relates to a Chinese grammar error correction method based on pinyin constraints, and belongs to the technical field of natural language processing. The invention first constructs an end-to-end basic grammar error correction model on the original BART model; this model makes full use of the strong representation ability of the pre-trained language model itself to improve error correction performance. Then, a detection layer is added after the BART encoder to alleviate over-correction through effective error detection. Next, a phonetic confusion matrix is constructed from the characters' similar-sound confusion sets and fused with the output of the detection layer to obtain phonetic information for the erroneous characters in the input sentence. Finally, this phonetic information constrains the output probabilities at the decoder, yielding more accurate correction results.
Description
Technical Field
The invention relates to a Chinese grammar error correction method based on pinyin constraints, and belongs to the technical field of natural language processing.
Background Art
Chinese grammatical error correction is a key task in natural language processing whose goal is to identify and correct grammatical errors in Chinese text. These errors include, but are not limited to, incorrect word order, mismatched parts of speech, and structural faults, all of which can weaken the clarity and comprehensibility of a text. In view of this, demand for Chinese grammar correction technology is growing steadily.
To improve both the accuracy and the processing speed of Chinese grammar correction, it is crucial to develop high-performance correction models that can automatically detect and correct grammatical problems in text. Furthermore, as Chinese text is used in ever more fields and contexts, Chinese grammar correction models must adapt to different professional domains and changing usage environments to serve a wide range of users and application scenarios.
The goal of Chinese grammatical error detection is to automatically detect grammatical errors in natural Chinese sentences, such as missing or redundant components and improper word order. Detection tasks generally cover whether an error exists, its type, and its location. Used properly, grammatical error detection can effectively improve correction performance.
In short, Chinese grammar correction technology plays a vital role in improving text quality, enhancing user experience, and meeting diverse application needs. Grammatical error detection is one of the key technical means to this end: it helps ensure the correctness and professionalism of text and reduces misunderstanding and communication barriers. It can also improve the writing quality of non-native speakers, promote language learning, and, within natural language processing, enhance the accuracy of technologies such as machine translation and speech recognition. As the technology advances, grammatical error detection will continue to drive the development of Chinese grammar correction.
Summary of the Invention
The present invention provides a Chinese grammar error correction method based on pinyin constraints to address the low precision of Chinese grammar correction; the invention achieves good experimental results on the MagicData task.
The technical solution of the present invention is a Chinese grammar error correction method based on pinyin constraints, whose specific steps are as follows:
Step 1: Construct a sequence-to-sequence basic grammar error correction model on the pre-trained BART model. The basic model uses multi-layer, multi-head attention as both encoder and decoder to capture context effectively, while exploiting the strong representation ability of the BART pre-trained language model to strengthen error correction.

Step 2: Add a detection layer at the encoder side of the BART-based basic model, so the detection module can filter out correct sentences and leave them uncorrected, thereby alleviating the over-correction problem.

Step 3: Build a phonetic confusion matrix from the characters' similar-sound confusion sets, and fuse it with the output of the detection layer to obtain phonetic information for the erroneous characters in the input sentence.

Step 4: Use this phonetic information to constrain the output probabilities at the decoder, yielding more accurate correction results.
As a further aspect of the present invention, the specific steps of Step 1 are as follows:
Step 1.1: Obtain the open-source, already pre-trained Chinese BART model as the pre-trained language model.

Step 1.2: Since the BART pre-trained language model builds its encoder and decoder from multi-layer, multi-head attention, modify the downloaded BART model to fit the grammatical error correction task, thereby exploiting the strong representation ability of the pre-trained language model to enhance the correction model.

Step 1.3: Obtain the openly accessible MagicData speech recognition dataset and download a speech recognition model from the Internet.

Step 1.4: Use the speech recognition model to automatically generate sentences containing phonetic errors, and combine them with the corresponding correct sentences from the original dataset to form correct-incorrect sentence pairs, yielding the basic dataset, MagicData, for the grammatical error correction task.
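Step 1.4 can be sketched as follows. This is a minimal illustration, assuming an ASR model wrapped in a `run_asr` function; here it is stubbed with canned hypotheses, since the patent does not name a concrete speech recognition system, and the example sentences are invented.

```python
def run_asr(audio_id):
    # Stand-in for a real ASR model transcribing the audio file.
    canned = {
        "utt1": "我想听一首哥",  # contains a phonetic substitution error
        "utt2": "今天天气很好",  # recognized correctly
    }
    return canned[audio_id]

def build_gec_pairs(references):
    """references: list of (audio_id, correct_text) from the original corpus.
    Returns (incorrect, correct) pairs wherever the ASR hypothesis differs."""
    pairs = []
    for audio_id, correct in references:
        hypothesis = run_asr(audio_id)
        if hypothesis != correct:  # keep only sentences containing errors
            pairs.append((hypothesis, correct))
    return pairs

refs = [("utt1", "我想听一首歌"), ("utt2", "今天天气很好")]
pairs = build_gec_pairs(refs)
```

Correctly recognized sentences drop out, so the resulting pairs contain exactly the phonetic errors the correction model is meant to learn.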
As a further aspect of the present invention, Step 2 includes the following:

Process the MagicData text dataset, construct classification labels for it, and add a detection layer at the encoder side of the basic model to alleviate over-correction.

The specific steps are as follows:

Preprocess the labeled sentences in the training and validation sets and classify them into substitution errors versus non-substitution errors. A substitution error is one of the four error types in grammatical error correction; the four types are substitution, word order, redundancy, and missing errors. The binary labels are 0 and 1; the label is 1 only when a substitution error occurs, and 0 otherwise.

The specific steps of Step 2 are as follows:

Step 2.1: Add a detection layer after the encoder of the BART-based basic model. Grammatical error detection is treated as a simple binary classification task: the output is 1 if a substitution error exists at the current position, and 0 otherwise.

Step 2.2: Preprocess the labeled sentences in the training and validation sets; mark the current word 1 if it contains a substitution error, otherwise 0.

Step 2.3: Take the weighted sum of the detection loss and the correction loss as the final loss, and update all model parameters by minimizing it.
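The loss combination of Step 2.3 can be sketched as below, assuming the per-task losses are already computed; the weights `lambda_det` and `lambda_cor` and the illustrative loss values are hypothetical, since the patent only specifies "a weighted sum".

```python
import math

def cross_entropy(token_probs):
    # token_probs: probability assigned to each gold token, P(y_t | y_<t, x)
    return -sum(math.log(p) for p in token_probs)

def total_loss(loss_cor, loss_det, lambda_cor=1.0, lambda_det=1.0):
    # Weighted sum of the correction loss and the detection-layer loss.
    return lambda_det * loss_det + lambda_cor * loss_cor

l_cor = cross_entropy([0.9, 0.8, 0.95])  # correction loss over gold tokens
l_det = 0.3                              # detection loss (illustrative value)
l_total = total_loss(l_cor, l_det, lambda_cor=1.0, lambda_det=0.5)
```

Minimizing `l_total` updates both the detection head and the sequence-to-sequence model jointly, as described above.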
As a further aspect of the present invention, the specific steps of Step 3 are as follows:

Step 3.1: Download similar-sound and similar-shape character confusion sets from open-source websites; deduplicate and merge all of them into a final confusion set covering both similar-sounding and similar-looking characters.

Step 3.2: Preprocess the confusion set: use the tokenizer shipped with the BART-large model to convert the characters into ids, take the first character's id as the key and the ids of similar-sounding characters as the value, and save the result as a dictionary-format file.

Step 3.3: Read this dictionary during model training; when the model tokenizes the input sentence into ids, map each token id through the dictionary to obtain its similar-sound character information, and build a phonetic confusion matrix from it.

Step 3.4: Multiply the phonetic confusion matrix from the previous step by the model's detection result to obtain the phonetic-information confusion matrix. Since the detection result is binary and equals 1 only where a substitution error occurs, phonetic information is retained only at positions where the current character contains a substitution error.
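Steps 3.3 and 3.4 can be illustrated with a tiny hypothetical vocabulary; the token ids and the confusion dictionary below are made up and do not come from the real BART tokenizer.

```python
CONFUSION = {101: [201, 202], 102: [203]}  # token id -> similar-sounding ids
VOCAB_SIZE = 300

def confusion_matrix(input_ids):
    # One vocab-sized row per input position; 1 marks a phonetic candidate.
    rows = []
    for tok in input_ids:
        row = [0] * VOCAB_SIZE
        for cand in CONFUSION.get(tok, []):
            row[cand] = 1
        rows.append(row)
    return rows

def fuse_with_detection(rows, detect):
    # detect[i] is the detector's binary output; multiplying by it keeps
    # phonetic information only where a substitution error was flagged.
    return [[v * d for v in row] for row, d in zip(rows, detect)]

rows = confusion_matrix([101, 102, 105])
fused = fuse_with_detection(rows, [1, 0, 0])
```

Only the first position, flagged by the detector, keeps its similar-sound candidates; the others are zeroed out, which is exactly the gating behavior described in Step 3.4.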
As a further aspect of the present invention, the specific steps of Step 4 are as follows:

Step 4.1: Map the phonetic-information confusion matrix obtained in Step 3.4 to a matrix with the dimensions of the output probabilities; according to the phonetic information present in the matrix, set the corresponding vocabulary positions to 1 and all other positions to 0.

Step 4.2: Add the matrix obtained in Step 4.1 to the final output probabilities, explicitly increasing the probability of characters that sound similar to the current character.
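The addition in Step 4.2 can be sketched as follows; plain lists stand in for the model's probability tensors, and the 0/1 phonetic mask gives similar-sounding candidates a fixed boost.

```python
def constrain_output(scores, phonetic_mask):
    # scores, phonetic_mask: one vocab-sized row per decoding position
    return [[s + m for s, m in zip(row_s, row_m)]
            for row_s, row_m in zip(scores, phonetic_mask)]

scores = [[0.1, 0.2, 0.3], [0.5, 0.4, 0.1]]
mask = [[0, 1, 0], [0, 0, 0]]   # position 0 has one phonetic candidate
boosted = constrain_output(scores, mask)
```

Positions without phonetic candidates are left untouched, so the constraint only reranks decoding at suspected substitution errors.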
The beneficial effects of the present invention are:

1. The invention integrates common similar-sound and similar-shape confusion sets, providing a basic confusion set for future research.

2. The invention uses a detection model to alleviate both over-correction and under-correction.

3. The invention uses phonetic information to effectively improve the accuracy of substitution-error correction.

4. The invention constructs edit sequences from incorrect and correct sentences, and uses them to construct labels for the detection model.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method of the present invention.
Detailed Description of the Embodiments
Embodiment 1: As shown in Fig. 1, a Chinese grammar error correction method based on pinyin constraints comprises the following specific steps:

Step 1: Construct a sequence-to-sequence basic grammar error correction model on the pre-trained BART model. The basic model uses multi-layer, multi-head attention as both encoder and decoder to capture context effectively, while exploiting the strong representation ability of the BART pre-trained language model to strengthen error correction.
Download the speech recognition model, the training set for preprocessing, and the Chinese BART pre-trained model.
The specific steps of Step 1 are as follows:

Step 1.1: Obtain the open-source, already pre-trained Chinese BART model as the basic pre-trained language model.

Step 1.2: Since the BART pre-trained language model builds its encoder and decoder from multi-layer, multi-head attention, modify the downloaded BART model to fit the grammatical error correction task, thereby exploiting the strong representation ability of the pre-trained language model to enhance the correction model's performance.

Step 1.3: Download the speech recognition model and the MagicData speech recognition dataset.

Step 1.4: Use the speech recognition model to automatically generate sentences containing phonetic errors, and combine them with the corresponding correct sentences from the original dataset to form correct-incorrect sentence pairs, yielding the basic dataset, MagicData, for the grammatical error correction task.
Specifically, the downloaded MagicData speech recognition dataset and speech recognition model are used to generate the textual MagicData grammatical error correction dataset. The generated data are preprocessed, mainly by deduplication and removal of sentence pairs that are too long or too short, finally yielding 363,658 relatively standardized error correction sentence pairs; the dataset sizes are shown in Table 1.

Table 1. Sizes of the MagicData dataset
Step 2: Add a detection layer at the encoder side of the BART-based basic model, so the detection module can filter out correct sentences and leave them uncorrected, thereby alleviating the over-correction problem.

First, the MagicData dataset is processed to construct its classification labels; then a detection layer is added to the base-model encoder to alleviate over-correction.
The specific steps of Step 2 are as follows:

Step 2.1: Add a detection layer after the encoder of the BART-based basic model. Grammatical error detection is treated as a simple binary classification task: the output is 1 if a substitution error exists at the current position, and 0 otherwise.

Step 2.2: Preprocess the labeled sentences in the training and validation sets; mark the current word 1 if it contains a substitution error, otherwise 0.

Step 2.3: Take the weighted sum of the detection loss and the correction loss as the final loss, and update all model parameters by minimizing it.
The MagicData dataset is processed by generating, from each pair of incorrect and correct sentences, text carrying an edit sequence, in the style "$STARTSEPL|||SEPR$KEEP播SEPL|||SEPR$KEEP放SEPL|||SEPR$KEEP工SEPL|||SEPR$REPLACE_宫则SEPL|||SEPR$REPLACE_泽里SEPL|||SEPR$REPLACE_理会SEPL|||SEPR$REPLACE_惠的SEPL|||SEPR$KEEP歌SEPL|||SEPR$KEEP". In this edit sequence, "KEEP" means the character at the current position is left unchanged, and "REPLACE_宫" means the current character is replaced with "宫". Using the edit sequences, the dataset is labeled for binary classification.
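Deriving the binary detection labels from such an edit sequence can be sketched as follows; the split-based parsing is a simplification that assumes one tag per delimiter-separated segment, and the short sequence used here is constructed for illustration.

```python
DELIM = "SEPL|||SEPR"

def labels_from_edit_sequence(seq):
    labels = []
    for segment in seq.split(DELIM):
        if "$REPLACE_" in segment:
            labels.append(1)      # substitution error at this position
        elif "$KEEP" in segment:
            labels.append(0)      # character kept unchanged
    return labels

seq = ("$START" + DELIM + "$KEEP播" + DELIM + "$KEEP放" + DELIM
       + "$REPLACE_宫则" + DELIM + "$KEEP歌")
labels = labels_from_edit_sequence(seq)
```

The resulting 0/1 sequence is exactly the supervision signal the detection layer of Step 2 is trained on.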
The basic model contains two components, an encoder and a decoder, which work together toward the error correction goal. The encoder uses multi-head self-attention to model the context of each word in the source sentence and produce rich contextual representations. The decoder has a similar structure but adds a masked multi-head self-attention module to better capture the information of already-generated words and ensure the output sentence is grammatically and semantically accurate. A pre-trained language model is used throughout to strengthen the model's contextual representation ability.
During training, the objective is to minimize the cross-entropy loss:

L_cor(θ) = -Σ_{t=1}^{n} log P(y_t | y_{<t}, x; θ)

where θ denotes the trainable model parameters, x is the source sentence, y = {y_1, y_2, …, y_n} is the correct sentence of n words, and y_{<t} = {y_1, y_2, …, y_{t-1}} are the words visible at time step t.

The modification to the base model introduces a detection layer at the end of the model. This detection layer has its own loss function, which is added to the loss of the original task with a certain weight to form the total loss. After computing the total loss, gradients are obtained by back-propagation and the model parameters are updated with optimization algorithms such as gradient descent so as to minimize it. The total loss is

L = λ_det · L_det + λ_cor · L_cor

where L_det is the detection-layer loss, L_cor is the error correction loss, and λ_det and λ_cor are the weighting coefficients.

In the prediction stage, beam-search decoding finds an optimal sequence y* by maximizing the conditional probability P(y* | x; θ).
Step 3: Build a phonetic confusion matrix from the characters' similar-sound confusion sets, and fuse it with the output of the detection layer to obtain phonetic information for the erroneous characters in the input sentence.
The specific steps of Step 3 are as follows:

Step 3.1: Download similar-sound character confusion sets from open-source websites; deduplicate and merge them, arrange all characters in the confusion sets, and place similar-sounding characters on the same line.

Step 3.2: Preprocess the confusion set: use the tokenizer shipped with the BART model to convert the characters into ids, take the first character's id as the key and the ids of the other similar-sounding characters as the value, and save the result as a dictionary-format file.
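Step 3.2 can be sketched with a toy character-to-id table standing in for the BART tokenizer's vocabulary; the ids below are invented for illustration.

```python
import json

CHAR_TO_ID = {"他": 100, "她": 101, "塔": 102, "的": 103, "地": 104}

def build_confusion_dict(groups):
    """groups: lines of the merged confusion set, one group of
    similar-sounding characters per line."""
    mapping = {}
    for line in groups:
        ids = [CHAR_TO_ID[c] for c in line if c in CHAR_TO_ID]
        if len(ids) > 1:
            mapping[ids[0]] = ids[1:]   # first id is the key, rest the value
    return mapping

conf = build_confusion_dict(["他她塔", "的地"])
serialized = json.dumps(conf)           # saved as a dictionary-format file
```

The saved file can then be reloaded at training time, as described in Step 3.3, to look up phonetic candidates by token id.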
Step 3.3: Read this dictionary during model training; when the model tokenizes the input sentence into ids, map each token id through the dictionary to obtain its similar-sound character information, and build a phonetic confusion matrix from it.

Step 3.4: Multiply the phonetic confusion matrix from the previous step by the model's detection result to obtain the phonetic-information confusion matrix. Since the detection result is binary and equals 1 only where a substitution error occurs, phonetic information is retained only at positions where the current character contains a substitution error.
Step 4: Use the phonetic information to constrain the output probabilities at the decoder, yielding more accurate correction results.
The final results are shown in Table 2. On top of the basic model trained on MagicData, the detection module and the pinyin-constraint module were introduced, and the score gap between them is significant: compared with the basic model, the F value rises by 1 point with the detection module and by about 3 points with the pinyin-constraint module. This shows both that there is room to improve the accuracy on substitution errors in current grammatical error correction models and that the pinyin-constraint method adopted here is effective.
Character-based Chinese grammar evaluation metrics are used: precision P, recall R, and F_0.5, computed as

P = TP / (TP + FP)
R = TP / (TP + FN)
F_0.5 = (1 + 0.5²) · P · R / (0.5² · P + R)

where:
TP (true positive): erroneous characters that the model corrected.

FP (false positive): correct characters that the model changed into wrong ones.

FN (false negative): erroneous characters that the model failed to correct.
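The character-level metrics defined above can be computed directly from the TP/FP/FN counts; the counts used in the example below are illustrative.

```python
def precision_recall_f(tp, fp, fn, beta=0.5):
    # Precision, recall, and F_beta from raw counts, guarding against
    # division by zero when a count combination is empty.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    f = (1 + b2) * p * r / (b2 * p + r) if p > 0 and r > 0 else 0.0
    return p, r, f

p, r, f05 = precision_recall_f(tp=8, fp=2, fn=4)
```

With beta = 0.5, precision is weighted more heavily than recall, matching the F_0.5 convention used for grammatical error correction.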
The constructed MagicData text dataset is used as the training set for the detection model; hyperparameters are tuned to obtain the best-performing detection model.
Table 2. Test results of the basic model and the detection model
The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments; various changes can be made within the knowledge of a person of ordinary skill in the art without departing from the purpose of the present invention.
Claims (5)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410144119.XA CN117973372A (en) | 2024-02-01 | 2024-02-01 | Chinese grammar error correction method based on pinyin constraint |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117973372A true CN117973372A (en) | 2024-05-03 |
Family
ID=90864448
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410144119.XA Pending CN117973372A (en) | 2024-02-01 | 2024-02-01 | Chinese grammar error correction method based on pinyin constraint |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117973372A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118278394A (en) * | 2024-05-28 | 2024-07-02 | 华东交通大学 | A Chinese spelling correction method |
CN119005174A (en) * | 2024-10-25 | 2024-11-22 | 小语智能信息科技(云南)有限公司 | Vietnam grammar error correction method based on large language model detection and phoneme enhancement |
- 2024-02-01: Application CN202410144119.XA filed in China; published as CN117973372A; status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Sproat et al. | RNN approaches to text normalization: A challenge | |
CN112712804A (en) | Speech recognition method, system, medium, computer device, terminal and application | |
CN108052499B (en) | Text error correction method and device based on artificial intelligence and computer readable medium | |
CN117973372A (en) | Chinese grammar error correction method based on pinyin constraint | |
CN113076739A (en) | Method and system for realizing cross-domain Chinese text error correction | |
CN115033659B (en) | Clause-level automatic abstract model system based on deep learning and abstract generation method | |
CN114662476B (en) | Character sequence recognition method integrating dictionary and character features | |
CN112183094A (en) | A Chinese grammar error checking method and system based on multiple text features | |
CN114707492B (en) | Vietnam grammar error correction method and device integrating multi-granularity features | |
CN113657123A (en) | Mongolian Aspect-Level Sentiment Analysis Method Based on Target Template Guidance and Relation Head Coding | |
CN114021549B (en) | Chinese named entity recognition method and device based on vocabulary enhancement and multi-features | |
CN112818698A (en) | Fine-grained user comment sentiment analysis method based on dual-channel model | |
CN115293138A (en) | Text error correction method and computer equipment | |
CN117407051B (en) | Code automatic abstracting method based on structure position sensing | |
Yuan | Grammatical error correction in non-native English | |
CN115563959A (en) | Self-supervised pre-training method, system and medium for Chinese Pinyin spelling error correction | |
CN115658898A (en) | A method, system and device for extracting entity relationship between Chinese and English text | |
Chaudhary et al. | The ariel-cmu systems for lorehlt18 | |
CN115688703B (en) | Text error correction method, storage medium and device in specific field | |
CN111274826A (en) | Semantic information fusion-based low-frequency word translation method | |
CN115757325A (en) | Intelligent conversion method and system for XES logs | |
Lv et al. | StyleBERT: Chinese pretraining by font style information | |
CN115169331B (en) | Chinese spelling error correction method incorporating word information | |
CN118194854B (en) | A Chinese text error correction method based on full-word masking and dependency masking | |
Zhao et al. | Ime-spell: Chinese spelling check based on input method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||