CN117973372A - Chinese grammar error correction method based on pinyin constraint - Google Patents


Publication number
CN117973372A
Authority
CN
China
Prior art keywords: model, sound, error correction, confusion, bart
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410144119.XA
Other languages
Chinese (zh)
Inventor
李英
朱世昌
余正涛
黄于欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN202410144119.XA
Publication of CN117973372A

Classifications

    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/30 Semantic analysis
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The present invention relates to a Chinese grammar error correction method based on pinyin constraints, and belongs to the technical field of natural language processing. First, an end-to-end grammatical error correction base model is built on the original BART model; this model fully exploits the strong representation ability of the pre-trained language model to improve correction performance. Second, a detection layer is added after BART encoding, and effective error detection alleviates the over-correction problem. Third, a sound-like confusion matrix is built from the characters' sound-like confusion sets and fused with the output of the detection layer, yielding sound-like information for the erroneous characters in the input sentence. Finally, this sound-like information constrains the output probabilities at the decoding end, producing more accurate correction results.

Description

A Chinese grammar error correction method based on pinyin constraints

Technical Field

The present invention relates to a Chinese grammar error correction method based on pinyin constraints, and belongs to the technical field of natural language processing.

Background Art

Chinese grammatical error correction is a key task in natural language processing whose goal is to identify and correct grammatical errors in Chinese text. These errors include, but are not limited to, incorrect word order, mismatched parts of speech, and structural faults, all of which can weaken the clarity and comprehensibility of the text. Accordingly, demand for Chinese grammar correction technology is growing steadily.

To improve both the accuracy and the processing speed of Chinese grammatical error correction, developing high-performance correction models is essential. Such models automatically detect and correct grammatical problems in text. Furthermore, as Chinese text is used across many domains and contexts, correction models must adapt to different professional fields and varied usage environments to serve a wide range of users and application scenarios.

The goal of Chinese grammatical error detection is to automatically find grammatical errors in natural Chinese sentences, such as missing or redundant components and improper word order. Detection tasks generally cover whether an error exists, its type, and its location. Used well, error detection can effectively improve correction performance.

In short, Chinese grammatical error correction plays a vital role in improving text quality, enhancing user experience, and meeting diverse application needs. Error detection is one of the key technical means toward this goal: it helps ensure textual correctness and professionalism and reduces misunderstanding and communication barriers. It can also improve the writing quality of non-native speakers, support language learning, and, within natural language processing, strengthen the accuracy of technologies such as machine translation and speech recognition. As the technology advances, grammatical error detection will continue to drive progress in Chinese grammatical error correction.

Summary of the Invention

The present invention provides a Chinese grammar error correction method based on pinyin constraints to address the low precision of Chinese grammatical error correction; the method achieves good experimental results on the MagicData task.

The technical solution of the present invention is a Chinese grammar error correction method based on pinyin constraints, whose specific steps are as follows:

Step1. Construct a sequence-to-sequence grammatical error correction base model from the pre-trained BART model. The base model uses multi-layer multi-head attention as both encoder and decoder to capture contextual information effectively, while fully exploiting the strong representation ability of the pre-trained BART language model to enhance correction.

Step2. Add a detection layer at the encoding end of the BART-based base model, so that the detection module can filter out correct sentences and leave them uncorrected, alleviating the over-correction problem.

Step3. Build a sound-like confusion matrix from the characters' sound-like confusion sets and fuse it with the output of the detection layer, obtaining sound-like information for the erroneous characters in the input sentence.

Step4. Use the sound-like information to constrain the output probabilities at the decoding end, yielding more accurate correction results.

As a further aspect of the present invention, the specific steps of Step1 are as follows:

Step1.1. Obtain an open-source, pre-trained Chinese BART model as the pre-trained language model.

Step1.2. Since the BART pre-trained language model builds its encoder and decoder with multi-layer multi-head attention, modify the downloaded BART model to fit the grammatical error correction task, so that the strong representation ability of the pre-trained language model enhances the correction model.

Step1.3. Obtain the openly accessible MagicData speech recognition dataset, and download a speech recognition model from the Internet.

Step1.4. Use the speech recognition model to automatically generate sentences containing sound-like errors, and combine them with the correct sentences of the original dataset to form correct-incorrect sentence pairs, thereby constructing MagicData, the base dataset for the grammatical error correction task.

As a further aspect of the present invention, Step2 comprises the following:

Process the MagicData text dataset, construct classification labels for it, and add a detection layer at the encoding end of the base model to alleviate over-correction.

The specific steps are as follows:

Preprocess the labeled sentences of the training and validation sets, classifying positions into substitution errors and non-substitution errors. A substitution error is one of the four error types in grammatical error correction, the four types being substitution, word order, redundancy, and omission. The two class labels are 1 and 0: a position is labeled 1 only when a substitution error occurs, and 0 otherwise.

The specific steps of Step2 are as follows:

Step2.1. Add a detection layer after the encoder of the BART-based base model. Grammatical error detection is treated as a simple binary classification task: output 1 if the current position contains a substitution error, and 0 otherwise.

Step2.2. Preprocess the labeled sentences of the training and validation sets: label a position 1 if the current word contains a substitution error, and 0 otherwise.

Step2.3. Take the weighted sum of the detection loss and the correction loss as the final loss, and update all model parameters by minimizing it.

As a further aspect of the present invention, the specific steps of Step3 are as follows:

Step3.1. Download sound-like and shape-like character confusion sets from open-source websites. Deduplicate and merge all downloaded sets to obtain a final confusion set covering both sound-like and shape-like characters.

Step3.2. Preprocess the confusion set: convert its characters to ids with the tokenizer of the BART-large model, use the id of the first character as the key and the ids of its sound-like characters as the value, and save the result as a dictionary file.

Step3.3. Read this dictionary during model training; when the model tokenizes the input sentence into ids, map each token id through the dictionary to obtain its sound-like character information, and build the sound-like confusion matrix from that information.

Step3.4. Multiply the confusion matrix from the previous step by the detection results to obtain the confusion matrix of the sound-like information. Because detection is binary and outputs 1 only when a substitution error occurs, the sound-like information is preserved only at positions where the current character contains a substitution error.

As a further aspect of the present invention, the specific steps of Step4 are as follows:

Step4.1. Map the sound-like information confusion matrix obtained in Step3.4 to a matrix with the dimensionality of the output probabilities; according to the sound-like information present in the confusion matrix, set the corresponding vocabulary positions to 1 and all other positions to 0.

Step4.2. Add the matrix obtained in Step4.1 to the final output probabilities, explicitly raising the probability of characters that sound like the current character.

The beneficial effects of the present invention are:

1. The invention integrates common sound-like and shape-like confusion sets, providing a basic confusion set for future research.

2. The invention uses a detection model to alleviate both over-correction and under-correction.

3. The invention uses sound-like information to effectively improve the model's accuracy in correcting substitution errors.

4. The invention constructs edit sequences from incorrect and correct sentences, and uses these edit sequences to build labels for the detection model.

Brief Description of the Drawings

Fig. 1 is a flow chart of the present invention.

Detailed Description

Embodiment 1: As shown in Fig. 1, a Chinese grammar error correction method based on pinyin constraints comprises the following specific steps:

Step1. Construct a sequence-to-sequence grammatical error correction base model from the pre-trained BART model. The base model uses multi-layer multi-head attention as both encoder and decoder to capture contextual information effectively, while fully exploiting the strong representation ability of the pre-trained BART language model to enhance correction.

Download the speech recognition model, the training set for preprocessing, and the Chinese BART pre-trained model.

The specific steps of Step1 are as follows:

Step1.1. Obtain an open-source, pre-trained Chinese BART model as the basic pre-trained language model.

Step1.2. Since the BART pre-trained language model builds its encoder and decoder with multi-layer multi-head attention, modify the downloaded BART model to fit the grammatical error correction task, so that the strong representation ability of the pre-trained language model enhances the performance of the correction model.

Step1.3. Download the speech recognition model and the MagicData speech recognition dataset.

Step1.4. Use the speech recognition model to automatically generate sentences containing sound-like errors, and combine them with the correct sentences of the original dataset to form correct-incorrect sentence pairs, thereby constructing MagicData, the base dataset for the grammatical error correction task.

Specifically, the downloaded MagicData speech recognition dataset and speech recognition model are used to generate the text-based MagicData grammatical error correction dataset. The generated data are preprocessed, mainly by deduplication and removal of sentence pairs that are too long or too short, finally yielding 363,658 relatively well-formed correction sentence pairs; the dataset sizes are shown in Table 1.

Table 1. Size of the MagicData dataset

Dataset name        Sentence pairs
MagicData-train     342,758
MagicData-dev       6,896
MagicData-test      14,004
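The pair construction and filtering described above can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation: the function name, the length thresholds, and the example sentences are all assumptions; the idea is simply that each ASR hypothesis (which may contain sound-like errors) is paired with its reference transcript, then the pairs are deduplicated and length-filtered.

```python
def build_gec_pairs(asr_hypotheses, references, min_len=2, max_len=100):
    """Pair erroneous ASR output with the correct reference sentence,
    deduplicating and dropping over-/under-length pairs (thresholds assumed)."""
    pairs, seen = [], set()
    for hyp, ref in zip(asr_hypotheses, references):
        if not (min_len <= len(ref) <= max_len):
            continue            # drop sentences that are too short or too long
        if (hyp, ref) in seen:
            continue            # deduplicate
        seen.add((hyp, ref))
        pairs.append({"source": hyp, "target": ref})
    return pairs

# Invented toy example: a duplicated pair and a too-short pair are removed.
hyps = ["我想听一首哥", "我想听一首哥", "短"]
refs = ["我想听一首歌", "我想听一首歌", "短"]
print(len(build_gec_pairs(hyps, refs)))  # 1
```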

Step2. Add a detection layer at the encoding end of the BART-based base model, so that the detection module can filter out correct sentences and leave them uncorrected, alleviating the over-correction problem.

First, the MagicData dataset is processed to construct its classification labels; then a detection layer is added to the base model's encoder to alleviate over-correction.

The specific steps of Step2 are as follows:

Step2.1. Add a detection layer after the encoder of the BART-based base model. Grammatical error detection is treated as a simple binary classification task: output 1 if the current position contains a substitution error, and 0 otherwise.

Step2.2. Preprocess the labeled sentences of the training and validation sets: label a position 1 if the current word contains a substitution error, and 0 otherwise.

Step2.3. Take the weighted sum of the detection loss and the correction loss as the final loss, and update all model parameters by minimizing it.
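A minimal numeric sketch of the joint objective in Step2.3, under stated assumptions: the detection loss is written as a per-token binary cross-entropy, the correction (seq2seq) loss is a placeholder constant, and the weighting coefficient `lam` is invented for illustration — none of these values come from the patent.

```python
import math

def binary_cross_entropy(probs, labels):
    """Mean BCE over token positions; probs[i] = P(substitution error at i)."""
    eps = 1e-12
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for p, y in zip(probs, labels)
    ) / len(labels)

# Detection probabilities for a 5-token sentence; label 1 marks a substitution error.
det_probs = [0.1, 0.2, 0.9, 0.1, 0.05]
det_labels = [0, 0, 1, 0, 0]

det_loss = binary_cross_entropy(det_probs, det_labels)
cor_loss = 2.5      # placeholder for the seq2seq correction loss
lam = 0.3           # illustrative weighting coefficient
total_loss = lam * det_loss + (1 - lam) * cor_loss  # weighted sum, as in Step2.3
print(total_loss)
```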

The MagicData dataset is processed so that each incorrect-correct sentence pair yields a text with an edit sequence, in the style of "$STARTSEPL|||SEPR$KEEP播SEPL|||SEPR$KEEP放SEPL|||SEPR$KEEP工SEPL|||SEPR$REPLACE_宫则SEPL|||SEPR$REPLACE_泽里SEPL|||SEPR$REPLACE_理会SEPL|||SEPR$REPLACE_惠的SEPL|||SEPR$KEEP歌SEPL|||SEPR$KEEP". In this edit sequence, "$KEEP" means the character at the current position stays unchanged, and "$REPLACE_宫" means the current character is replaced with "宫". The edit sequence is then used to derive the binary classification of the dataset.
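The mapping from edit tags to the binary detection labels of Step2 can be sketched in a few lines. The parser name and the simplified tag list (separators stripped) are illustrative assumptions: a `$REPLACE_*` tag marks a substitution error (label 1), anything else is label 0.

```python
def edit_tags_to_labels(tags):
    """Binary detection labels: 1 for substitution edits, 0 otherwise."""
    return [1 if t.startswith("$REPLACE_") else 0 for t in tags]

# Simplified tag list drawn from the example edit sequence above.
tags = ["$KEEP", "$KEEP", "$REPLACE_宫", "$REPLACE_泽", "$KEEP"]
print(edit_tags_to_labels(tags))  # [0, 0, 1, 1, 0]
```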

This base model comprises two components, an encoder and a decoder, which work together toward grammatical error correction. The encoder uses multi-head self-attention to model the context of each word in the source sentence and produce rich contextual representations. The decoder has a similar structure and additionally introduces a masked multi-head self-attention module to better capture information about the words generated so far, ensuring that the output sentence is accurate both grammatically and semantically. A pre-trained language model is used throughout to strengthen the model's contextual representation ability.

During training, the objective is to minimize the cross-entropy loss:

$$\mathcal{L}_{cor}(\theta) = -\sum_{t=1}^{n} \log P\left(y_t \mid y_{<t}, x; \theta\right)$$

where $\theta$ denotes the trainable model parameters, $x$ is the source sentence, $y = \{y_1, y_2, \dots, y_n\}$ is the correct sentence of $n$ words, and $y_{<t} = \{y_1, y_2, \dots, y_{t-1}\}$ are the words visible at time step $t$.

The modification to the base model introduces a detection layer at the end of the model. This detection layer has its own loss function, which is added to the loss of the original task with a certain weight to form the total loss. After the total loss is computed, gradients are obtained by back-propagation and the model parameters are updated with an optimizer such as gradient descent to minimize it. The total loss of the model is the weighted sum

$$\mathcal{L} = \lambda_{1}\,\mathcal{L}_{det} + \lambda_{2}\,\mathcal{L}_{cor}$$

where $\mathcal{L}_{det}$ is the detection-layer loss and $\mathcal{L}_{cor}$ is the error correction loss.

In the prediction stage, beam search decoding finds an optimal sequence $y^{*}$ by maximizing the conditional probability $P(y^{*} \mid x; \theta)$.
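Beam search as used in the prediction stage can be illustrated with a toy decoder over a fixed per-step distribution. This is a didactic sketch only — the vocabulary, probabilities, and beam size are invented; a real system would score BART decoder outputs — but it shows the core mechanism: keep the $k$ highest-scoring prefixes by summed log-probability at every step.

```python
import math

def beam_search(step_probs, beam_size=2):
    """step_probs: list of {token: prob} dicts, one per time step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
    for probs in step_probs:
        candidates = [
            (seq + [tok], score + math.log(p))
            for seq, score in beams
            for tok, p in probs.items()
        ]
        # Keep the beam_size highest-scoring prefixes.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]  # highest-probability sequence

steps = [{"a": 0.6, "b": 0.4}, {"c": 0.7, "d": 0.3}, {"e": 0.9, "f": 0.1}]
print(beam_search(steps))  # ['a', 'c', 'e']
```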

Step3. Build a sound-like confusion matrix from the characters' sound-like confusion sets and fuse it with the output of the detection layer, obtaining sound-like information for the erroneous characters in the input sentence.

The specific steps of Step3 are as follows:

Step3.1. Download sound-like character confusion sets from open-source websites. Deduplicate and merge all downloaded sets, arrange all their characters, and place sound-like characters on the same row.

Step3.2. Preprocess the confusion set: convert its characters to ids with the tokenizer of the BART model, use the id of the first character as the key and the ids of the other characters that sound like it as the value, and save the result as a dictionary file.
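The dictionary construction of Step3.2 can be sketched as below. A toy character-to-id map stands in for the BART tokenizer (which would supply the real token ids), and the confusion rows are invented; only the key/value layout — first character's id as key, its sound-like characters' ids as value — follows the step above.

```python
# Stand-in for BART tokenizer ids (illustrative values).
toy_vocab = {"他": 101, "她": 102, "塔": 103, "的": 104, "地": 105}

def build_confusion_dict(confusion_rows, vocab):
    """Each row lists a character followed by its sound-like characters.
    Key: id of the first character; Value: ids of its confusables."""
    table = {}
    for row in confusion_rows:
        head, *rest = row
        table[vocab[head]] = [vocab[c] for c in rest if c in vocab]
    return table

rows = [["他", "她", "塔"], ["的", "地"]]
print(build_confusion_dict(rows, toy_vocab))  # {101: [102, 103], 104: [105]}
```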

Step3.3. Read this dictionary during model training; when the model tokenizes the input sentence into ids, map each token id through the dictionary to obtain its sound-like character information, and build the sound-like confusion matrix from that information.

Step3.4. Multiply the confusion matrix from the previous step by the detection results to obtain the confusion matrix of the sound-like information. Because detection is binary and outputs 1 only when a substitution error occurs, the sound-like information is preserved only at positions where the current character contains a substitution error.
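The masking in Step3.4 can be shown numerically. The matrix shapes and values here are illustrative assumptions: each row of the confusion matrix is a 0/1 vector over a toy vocabulary, and multiplying by the per-token detection flag zeroes the row wherever no substitution error was detected.

```python
def mask_confusion(confusion_matrix, detection):
    """confusion_matrix: seq_len x vocab rows of 0/1; detection: per-token 0/1 flags."""
    return [
        [cell * flag for cell in row]
        for row, flag in zip(confusion_matrix, detection)
    ]

conf = [
    [0, 1, 1, 0],   # token 0: sound-like candidates at vocab ids 1 and 2
    [1, 0, 0, 1],   # token 1: candidates at vocab ids 0 and 3
]
det = [1, 0]        # only token 0 is flagged as a substitution error
print(mask_confusion(conf, det))  # [[0, 1, 1, 0], [0, 0, 0, 0]]
```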

Step4. Use the sound-like information to constrain the output probabilities at the decoding end, yielding more accurate correction results.
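The output constraint of Step4 (mapping the masked confusion row into the vocabulary dimension and adding it to the decoder's scores, per Step4.1 and Step4.2) can be sketched as below. The `boost` scale factor and the toy scores are assumptions; the point is that sound-like candidates receive an explicit additive bonus before the final prediction.

```python
def constrain_logits(logits, confusion_row, boost=2.0):
    """Add `boost` to every vocab position marked 1 in the confusion row."""
    return [score + boost * c for score, c in zip(logits, confusion_row)]

logits = [1.0, 0.5, 0.9, 0.2]   # decoder scores over a 4-token toy vocabulary
confusion_row = [0, 1, 0, 0]    # vocab id 1 is a sound-like candidate of the input char
boosted = constrain_logits(logits, confusion_row)
print(boosted.index(max(boosted)))  # prediction shifts from id 0 to id 1
```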

The final results are shown in Table 2. On top of the base model trained on MagicData, we introduced the detection module and the pinyin constraint module, whose score gains differ markedly. Compared with the base model, the detection module raises the F0.5 score by about 1 point, while the pinyin constraint module raises it by about 3 points. This shows both that the accuracy on substitution errors in current correction models leaves room for improvement and that the pinyin constraint method adopted here is effective.

Character-level Chinese grammar evaluation metrics are used: precision P, recall R, and F0.5, computed as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_{0.5} = \frac{(1 + 0.5^{2})\,P\,R}{0.5^{2}\,P + R}$$

where:

TP (True Positive): erroneous characters that the model corrected.

FP (False Positive): correct characters that the model changed into erroneous ones.

FN (False Negative): erroneous characters that the model failed to correct.
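The metric computation above is standard and can be written directly; the TP/FP/FN counts in the example are invented for illustration.

```python
def gec_metrics(tp, fp, fn, beta=0.5):
    """Character-level precision, recall, and F_beta (beta=0.5 by default)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (1 + beta**2) * p * r / (beta**2 * p + r)
    return p, r, f

p, r, f05 = gec_metrics(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f05, 3))  # 0.8 0.667 0.769
```

Note how F0.5 weights precision more heavily than recall, matching the field's preference for not introducing new errors over catching every existing one.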

The constructed MagicData text dataset is used as the training set to train the detection model; hyperparameters are tuned to obtain the best-performing detection model.

Table 2. Test results of the base model and the detection model

The specific embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments; various changes may be made within the knowledge of a person of ordinary skill in the art without departing from the spirit of the present invention.

Claims (5)

1. A Chinese grammar error correction method based on pinyin constraints, characterized in that the method comprises the following specific steps:
Step1, constructing a sequence-to-sequence grammatical error correction base model from the pre-trained BART model; the base model uses multi-layer multi-head attention as both encoder and decoder to capture contextual information effectively, while the strong representation ability of the pre-trained BART language model is fully utilized to enhance correction;
Step2, adding a detection layer at the encoding end of the BART-based base model, so that the detection module filters out correct sentences and leaves them uncorrected, alleviating the over-correction problem;
Step3, building a sound-like confusion matrix from the characters' sound-like confusion sets, and fusing it with the output of the detection layer to obtain sound-like information for the erroneous characters in the input sentence;
Step4, using the sound-like information to constrain the output probabilities of the decoding end, thereby obtaining a more accurate correction result.
2. The pinyin-constraint-based Chinese grammar error correction method of claim 1, wherein the specific steps of Step1 are as follows:
Step1.1, obtaining an open-source, pre-trained Chinese BART model as the basic pre-trained language model;
Step1.2, since the BART pre-trained language model builds its encoder and decoder with multi-layer multi-head attention, modifying the downloaded BART model to fit the grammatical error correction task, so that the strong representation ability of the pre-trained language model enhances the performance of the correction model;
Step1.3, obtaining the openly accessible MagicData speech recognition dataset and downloading a speech recognition model from the Internet;
Step1.4, using the speech recognition model to automatically generate sentences containing sound-like errors, and combining these sentences with the correct sentences of the original dataset to form correct-incorrect sentence pairs, thereby constructing MagicData, the base dataset for the grammatical error correction task.
3. The pinyin constraint-based Chinese grammar error correction method of claim 1, wherein the specific steps of Step2 are as follows:
Step2.1, adding a detection layer after the encoder of the BART-based grammar error correction basic model; grammar error detection is treated as a simple binary classification task that outputs 1 if the current position contains a replacement error and 0 otherwise;
Step2.2, preprocessing the labelled sentences in the training and validation sets, marking a position as 1 if the current word contains a replacement error and as 0 if it does not;
Step2.3, taking a weighted sum of the losses of the grammar error detection task and the grammar error correction task as the final loss, and updating all model parameters by minimizing this final loss.
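The joint objective of Step2.3 can be sketched as follows, assuming PyTorch. The weight `alpha`, the tensor shapes, and the toy vocabulary size are illustrative assumptions, since the claim only specifies a weighted summation of the two losses:

```python
import torch
import torch.nn as nn

# Stand-in logits; in the patent, detection logits come from the layer added
# after the BART encoder, and correction logits from the BART decoder.
detect_logits = torch.randn(2, 5, 2, requires_grad=True)    # (batch, seq, 2): binary detection
detect_labels = torch.randint(0, 2, (2, 5))                 # 1 = replacement error at position
correct_logits = torch.randn(2, 5, 100, requires_grad=True) # (batch, seq, vocab)
correct_labels = torch.randint(0, 100, (2, 5))

ce = nn.CrossEntropyLoss()
loss_detect = ce(detect_logits.reshape(-1, 2), detect_labels.reshape(-1))
loss_correct = ce(correct_logits.reshape(-1, 100), correct_labels.reshape(-1))

alpha = 0.5  # illustrative weight; the claim states only "weighted summation"
loss = alpha * loss_detect + (1 - alpha) * loss_correct
loss.backward()  # minimising this joint loss updates all model parameters
```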
4. The pinyin constraint-based Chinese grammar error correction method of claim 1, wherein the specific steps of Step3 are as follows:
Step3.1, downloading sound-like character confusion sets from open-source websites, deduplicating and merging all downloaded sets, and arranging the characters so that similar-sounding characters are placed in the same row;
Step3.2, preprocessing the confusion set: converting its characters to ids with the tokenizer of the BART model, taking the id of the first character as the key and the ids of the other characters that sound like it as the values, and saving the result as a dictionary-format file;
Step3.3, reading the dictionary during model training; when the model converts an input sentence into token ids, looking up each token id in the dictionary to obtain its sound-like character information and constructing a sound-like confusion matrix from this information;
Step3.4, multiplying the sound-like confusion matrix obtained in the previous step by the model's detection result to obtain the sound-like information confusion matrix; because detection is binary, the detection result is 1 only where a replacement error occurs, which ensures that sound-like information is preserved only when the current character contains a replacement error.
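Steps 3.2 to 3.4 can be sketched with plain Python. The token ids and the tiny vocabulary below are illustrative; the patent uses the real ids produced by the BART tokenizer:

```python
# Step3.2 (result): confusion dictionary keyed by token id, values are the
# ids of sound-like characters. Ids here are made up for illustration.
confusion = {
    101: [102, 103],  # token 101 sounds like tokens 102 and 103
    104: [105],
}

input_ids = [101, 104, 106]  # tokenised input sentence
detect = [1, 0, 0]           # detection output: 1 = replacement error at this position

vocab_size = 110
# Step3.3: one confusion row per position over the vocabulary
confusion_matrix = [[0] * vocab_size for _ in input_ids]
for pos, tok in enumerate(input_ids):
    for similar_id in confusion.get(tok, []):
        confusion_matrix[pos][similar_id] = 1

# Step3.4: multiply by the binary detection result, so sound-like
# information survives only where an error was detected
gated = [[cell * detect[pos] for cell in row]
         for pos, row in enumerate(confusion_matrix)]
```

Here position 0 keeps its sound-like candidates (102, 103) because an error was detected there, while position 1 loses candidate 105 because its detection result is 0.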
5. The pinyin constraint-based Chinese grammar error correction method of claim 4, wherein the specific steps of Step4 are as follows:
Step4.1, mapping the sound-like information confusion matrix obtained in Step3.4 into a matrix with the same dimension as the output probability, setting the dictionary positions corresponding to the current sound-like information to 1 and all other positions to 0;
Step4.2, adding the matrix obtained in Step4.1 to the final output probability, explicitly increasing the probability of characters that sound like the current character.
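The decoding-end constraint of Step4 can be sketched as follows, assuming PyTorch; shapes, token ids, and the toy vocabulary are illustrative:

```python
import torch

vocab_size = 110
# Stand-in decoder output, (batch, seq, vocab); in the patent this is the
# BART decoder's output distribution.
logits = torch.randn(1, 3, vocab_size)

# Step4.1: 0/1 mask with the same dimension as the output probability;
# ids 102 and 103 are assumed sound-like candidates at position 0.
mask = torch.zeros(1, 3, vocab_size)
mask[0, 0, 102] = 1.0
mask[0, 0, 103] = 1.0

probs = torch.softmax(logits, dim=-1)
constrained = probs + mask          # Step4.2: explicit probability boost
pred = constrained.argmax(dim=-1)   # decoding now favours sound-like candidates
```

Because each softmax probability is below 1, adding 1 at the masked positions guarantees that, where a replacement error was detected, the predicted character is drawn from the sound-like candidate set.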
CN202410144119.XA 2024-02-01 2024-02-01 Chinese grammar error correction method based on pinyin constraint Pending CN117973372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410144119.XA CN117973372A (en) 2024-02-01 2024-02-01 Chinese grammar error correction method based on pinyin constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410144119.XA CN117973372A (en) 2024-02-01 2024-02-01 Chinese grammar error correction method based on pinyin constraint

Publications (1)

Publication Number Publication Date
CN117973372A true CN117973372A (en) 2024-05-03

Family

ID=90864448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410144119.XA Pending CN117973372A (en) 2024-02-01 2024-02-01 Chinese grammar error correction method based on pinyin constraint

Country Status (1)

Country Link
CN (1) CN117973372A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118278394A (en) * 2024-05-28 2024-07-02 华东交通大学 A Chinese spelling correction method
CN119005174A (en) * 2024-10-25 2024-11-22 小语智能信息科技(云南)有限公司 Vietnam grammar error correction method based on large language model detection and phoneme enhancement
CN119005174B (en) * 2024-10-25 2025-01-24 小语智能信息科技(云南)有限公司 A Vietnamese grammar correction method based on large language model detection and phoneme enhancement

Similar Documents

Publication Publication Date Title
Sproat et al. RNN approaches to text normalization: A challenge
CN112712804A (en) Speech recognition method, system, medium, computer device, terminal and application
CN108052499B (en) Text error correction method and device based on artificial intelligence and computer readable medium
CN117973372A (en) Chinese grammar error correction method based on pinyin constraint
CN113076739A (en) Method and system for realizing cross-domain Chinese text error correction
CN115033659B (en) Clause-level automatic abstract model system based on deep learning and abstract generation method
CN114662476B (en) Character sequence recognition method integrating dictionary and character features
CN112183094A (en) A Chinese grammar error checking method and system based on multiple text features
CN114707492B (en) Vietnam grammar error correction method and device integrating multi-granularity features
CN113657123A (en) Mongolian Aspect-Level Sentiment Analysis Method Based on Target Template Guidance and Relation Head Coding
CN114021549B (en) Chinese named entity recognition method and device based on vocabulary enhancement and multi-features
CN112818698A (en) Fine-grained user comment sentiment analysis method based on dual-channel model
CN115293138A (en) Text error correction method and computer equipment
CN117407051B (en) Code automatic abstracting method based on structure position sensing
Yuan Grammatical error correction in non-native English
CN115563959A (en) Self-supervised pre-training method, system and medium for Chinese Pinyin spelling error correction
CN115658898A (en) A method, system and device for extracting entity relationship between Chinese and English text
Chaudhary et al. The ariel-cmu systems for lorehlt18
CN115688703B (en) Text error correction method, storage medium and device in specific field
CN111274826A (en) Semantic information fusion-based low-frequency word translation method
CN115757325A (en) Intelligent conversion method and system for XES logs
Lv et al. StyleBERT: Chinese pretraining by font style information
CN115169331B (en) Chinese spelling error correction method incorporating word information
CN118194854B (en) A Chinese text error correction method based on full-word masking and dependency masking
Zhao et al. Ime-spell: Chinese spelling check based on input method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination