WO2021189851A1 - Text error correction method, system, device, and readable storage medium - Google Patents

Text error correction method, system, device, and readable storage medium

Info

Publication number
WO2021189851A1
WO2021189851A1 (PCT/CN2020/125011)
Authority
WO
WIPO (PCT)
Prior art keywords
word
target
corrected
text
sequence
Prior art date
Application number
PCT/CN2020/125011
Other languages
English (en)
French (fr)
Inventor
回艳菲
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021189851A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/10 - Text processing
    • G06F 40/166 - Editing, e.g. inserting or deleting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/226 - Validation

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a text error correction method, system, device, and computer-readable storage medium.
  • The inventor realized that traditional Chinese text error correction has two main problems.
  • One is that the parallel corpora for Chinese text error correction are insufficient.
  • The other is that, when confusion sets are used for error correction, the confusion sets are manually preset, and different business application scenarios require manually constructing different confusion sets; their flexibility is therefore limited, so current Chinese grammar error correction models generally perform poorly.
  • A text error correction method includes the following steps:
  • obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
  • generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
  • screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  • a text error correction system includes:
  • a target word determination module, configured to obtain a text sequence to be corrected, recognize the text sequence to be corrected through a BERT-based masked language model, and determine the target words in the text sequence to be corrected that need correction;
  • a candidate word generation module, configured to generate a candidate word set of the target word according to the target word and the text sequence to be corrected;
  • a replacement module, configured to screen the candidate word set of the target word according to a preset screening rule, determine the target replacement word of the target word, and generate a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  • A text error correction device includes a processor, a memory, and a text error correction program stored on the memory and executable by the processor, where the text error correction program, when executed by the processor, implements the following steps:
  • obtaining a text sequence to be corrected, recognizing it through a BERT-based masked language model, and determining the target words that need correction; generating a candidate word set of the target word according to the target word and the text sequence to be corrected; and screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  • A computer-readable storage medium stores a text error correction program which, when executed by a processor, implements the same steps: obtaining a text sequence to be corrected and determining the target words that need correction; generating a candidate word set of the target word; and screening the candidate word set according to the preset screening rule to determine the target replacement word and generate a replacement text sequence.
  • This application dynamically generates candidate words based on the context of the target word, avoiding the inflexible candidate generation caused by the confusion sets used in the prior art; moreover, this application does not need to generate candidate words for every word in the text sequence to be corrected, which greatly saves computing resources.
  • FIG. 1 is a schematic diagram of the hardware structure of a text error correction device involved in a solution of an embodiment of the application;
  • FIG. 2 is a schematic flowchart of a first embodiment of the text error correction method of this application;
  • FIG. 3 is a schematic diagram of the functional modules of the first embodiment of the text error correction system of this application.
  • the text error correction method involved in the embodiments of the present application is mainly applied to text error correction devices, which may be devices with display and processing functions such as PCs, portable computers, and mobile terminals.
  • FIG. 1 is a schematic diagram of the hardware structure of the text error correction device involved in the solution of the embodiment of the application.
  • The text error correction device may include a processor 1001 (for example, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005.
  • The communication bus 1002 is used to realize connection and communication between these components;
  • the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard);
  • the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface);
  • the memory 1005 may be a high-speed RAM memory or a stable non-volatile memory, such as a disk memory;
  • the memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.
  • Those skilled in the art can understand that the hardware structure shown in FIG. 1 does not constitute a limitation on the text error correction device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
  • the memory 1005 as a computer-readable storage medium in FIG. 1 may include an operating system, a network communication module, and a text error correction program.
  • The network communication module is mainly used to connect to a server and perform data communication with the server; and the processor 1001 can call the text error correction program stored in the memory 1005 and perform the following operations:
  • obtaining a text sequence to be corrected, recognizing it through the BERT-based masked language model, and determining the target words that need correction; generating a candidate word set of the target word according to the target word and the text sequence to be corrected; and screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  • To solve the above problems, this application provides a text error correction method: a pre-trained language model that has already been pre-trained on a large number of normal samples is adopted, and only a small amount of business-related training data is needed to fine-tune it into a BERT-based masked language model, thereby avoiding the over-fitting caused in the prior art by insufficient parallel corpora for Chinese text error correction. By generating candidate words based on the target word and the text sequence to be corrected, candidate words are generated dynamically from the target word's context, avoiding the insufficiently flexible candidate generation caused by the confusion sets used in the prior art. Moreover, this application does not need to generate candidate words for every word in the text sequence to be corrected, which greatly saves computing resources.
  • FIG. 2 is a schematic flowchart of a first embodiment of a text error correction method of this application.
  • the first embodiment of the present application provides a text error correction method.
  • the text error correction method includes the following steps:
  • Step S10: Obtain a text sequence to be corrected, recognize the text sequence to be corrected through a BERT-based masked language model, and determine the target words in the text sequence to be corrected that need correction;
  • The text error correction method in this embodiment is implemented by a text error correction device, which may be a server, a personal computer, a notebook computer, or a similar device; a server is taken as an example in the description.
  • The Masked Language Model (MLM) used in this embodiment is obtained by performing FINE-TUNE (fine-tuning) on the Chinese pre-trained language model provided by Google.
  • A language model predicts what a word is based on its context and can learn rich semantic knowledge from unrestricted large-scale monolingual corpora.
  • The masked language model used in this embodiment may be based on the BERT language model (Bidirectional Encoder Representations from Transformers). The BERT language model includes Transformer encoders; because of the self-attention mechanism, the upper and lower layers of the model are all directly connected to each other, so all layers of the model can be considered bidirectional.
  • The model input is composed jointly of token embeddings, segment embeddings, and position embeddings, as sketched below. When BERT is pre-trained, it performs two tasks, Masked LM and Next Sentence Prediction, and the samples used for pre-training can be unlabeled corpora, such as corpus text crawled from the web.
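  • As a minimal illustrative sketch (not part of the patent), the composition of the three input embeddings can be inspected with the Hugging Face transformers library; the library and the bert-base-chinese checkpoint are assumptions introduced here for illustration only:

    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")

    enc = tokenizer("病人咳嗽三天", return_tensors="pt")
    emb = model.embeddings  # BertEmbeddings module

    token_emb = emb.word_embeddings(enc["input_ids"])           # token embedding
    seg_emb = emb.token_type_embeddings(enc["token_type_ids"])  # segment embedding
    pos_ids = torch.arange(enc["input_ids"].size(1)).unsqueeze(0)
    pos_emb = emb.position_embeddings(pos_ids)                  # position embedding

    # BERT's input representation is essentially the sum of the three
    # (followed by LayerNorm and dropout inside the model).
    combined = token_emb + seg_emb + pos_emb
    print(combined.shape)  # (1, sequence_length, hidden_size)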
  • The masked language model is constructed by the FINE-TUNE (fine-tuning) transfer-learning approach on the Chinese pre-trained language model provided by Google, which ensures that good results can be obtained even with a limited dataset and helps reduce the negative impact of insufficient training samples.
  • Fine-tuning starts from the existing parameters of the pre-trained language model and performs transfer learning (training) with labeled training data, so that some parameters are slightly adjusted to obtain a model that meets actual usage requirements; building the model by task fine-tuning helps ensure the accuracy of the model's results while reducing model construction costs and improving construction efficiency.
  • The text sequence to be corrected refers to the text that needs correction; it may also be a sentence obtained by splitting that text according to punctuation and sentence boundaries.
  • The text sequence to be corrected retains its context from the original text.
  • After the text sequence to be corrected is obtained, it is input into the BERT-based masked language model, which recognizes each word in the text sequence to be corrected and determines the target words that may be erroneous and therefore need correction.
  • In one embodiment, the above step S10 includes: determining the context confidence of each word in the text sequence to be corrected through the masked language model, and taking the words whose context confidence is below a preset threshold as the target words.
  • The masked language model can compute the context confidence of the word at each position in the text sequence to be corrected, and the words whose context confidence is below the preset threshold are then taken as the target words that need correction. The preset threshold can be set according to the accuracy requirements of the business scenario: the higher the accuracy requirement, the higher the preset threshold.
  • Alternatively, the above step S10 includes: determining the context confidence of each word in the text sequence to be corrected through the masked language model, sorting the words by context confidence, and taking the preset number of words with the lowest context confidence as the target words.
  • After the masked language model computes the context confidence of the word at each position in the text sequence to be corrected, the words can be sorted by their context confidence, and the preset number of words with the lowest context confidence are taken as the target words that need correction.
  • The preset number can be set according to the accuracy requirements of the business scenario, the computing-resource constraints of the text error correction device, and the required correction time, which this embodiment does not specifically limit.
  • The context confidence of each word reflects the probability, determined from the word's contextual semantics in the text sequence to be corrected, that the word appears at its position: the higher a word's context confidence, the lower the probability that it is a target word needing correction, and the lower a word's context confidence, the higher that probability. A sketch of this scoring idea follows.
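  • Below is a minimal sketch of the context-confidence idea, an illustration under assumptions (the Hugging Face transformers library and the bert-base-chinese checkpoint), not the patented implementation: each position is masked in turn and the model's probability of the original character at that position is read off.

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertForMaskedLM.from_pretrained("bert-base-chinese")
    model.eval()

    def context_confidences(text):
        ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
        scores = []
        for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id   # mask one position
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0, i]
            probs = torch.softmax(logits, dim=-1)
            scores.append((tokenizer.decode([int(ids[i])]), probs[ids[i]].item()))
        return scores

    # Characters whose confidence falls below a preset threshold (or the
    # preset number of lowest-scoring characters) become the target words.
    for char, conf in context_confidences("今天天汽真好"):
        print(char, round(conf, 4))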
  • Step S20: generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
  • After the target words that need correction are determined, a candidate word set of each target word can be generated from the target word's context. Understandably, the target word at each position has a corresponding candidate word set, and the number of candidate words in the candidate word set can be set as needed.
  • In one embodiment, the target words in the text sequence to be corrected can be annotated to obtain an annotated text sequence; the annotated text sequence is input into the masked language model, which processes it and outputs the candidate word set of each target word.
  • In another embodiment, after a target word that needs correction is determined, the historical error-correction records can be searched for a corrected historical replacement word corresponding to the target word. If one exists, the historical replacement word is taken as a candidate word of the target word, and one or more candidate words form the candidate word set; if none exists, the candidate word set of the target word is generated from the target word's confusion set.
  • Step S30: screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  • The preset screening rule may be a similarity-context confidence screening rule; in that case, the candidate word set of the target word also includes the context confidence of each candidate word of the target word.
  • The specific screening rule is: calculate the similarity between each candidate word and the corresponding target word, and determine the target replacement word of the target word from the candidate word set based on the context confidence and similarity of each candidate word and a preset filter curve, where the abscissa of the preset filter curve is the context confidence and the ordinate is the similarity.
  • In one embodiment, the preset screening rule may also be a phonetic (character-sound) similarity screening rule and/or a glyph (character-shape) similarity screening rule.
  • The phonetic similarity screening rule calculates the phonetic similarity between each candidate word and the corresponding target word and takes the candidate word with the highest phonetic similarity to the target word as the target replacement word of the target word.
  • The glyph similarity screening rule calculates the glyph similarity between each candidate character and the corresponding target character and takes the candidate character with the highest glyph similarity to the target character as the target replacement character.
  • When the phonetic and glyph similarity screening rules are combined, the user's historical usage frequency of the pinyin input method and of the stroke input method are computed in advance. A phonetic weight coefficient P is set for the phonetic similarity according to the usage frequency of the pinyin input method, and a glyph weight coefficient V is set for the glyph similarity according to the usage frequency of the stroke input method; the higher the usage frequency, the larger the corresponding coefficient.
  • When the candidate word set is then screened, the weight value of each candidate word is computed as: weight value = phonetic similarity * P + glyph similarity * V.
  • The candidate words of the target word are sorted by weight value, and the candidate word with the largest weight value is determined to be the target replacement word of the target word (see the sketch below). Understandably, the number of target replacement words of a target word can be greater than one.
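  • A minimal sketch of this weighted screening rule follows; the similarity functions and the usage-frequency weights are placeholder assumptions, not values from the patent.

    def pick_replacement(candidates, target, phonetic_sim, glyph_sim,
                         pinyin_freq=0.7, stroke_freq=0.3):
        # Coefficients P and V track the user's input-method usage:
        # the more an input method is used, the larger its coefficient.
        P, V = pinyin_freq, stroke_freq
        scored = [(P * phonetic_sim(c, target) + V * glyph_sim(c, target), c)
                  for c in candidates]
        scored.sort(reverse=True)       # largest weight value first
        return scored[0][1]             # candidate with the largest weight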
  • In this embodiment, a text sequence to be corrected is obtained and recognized through the BERT-based masked language model to determine the target words in it; a candidate word set of the target word is generated according to the target word and the text sequence to be corrected; the candidate word set of the target word is screened according to a preset screening rule to determine the target replacement word of the target word; and a replacement text sequence is generated according to the target replacement word and the text sequence to be corrected.
  • This application adopts a pre-trained language model that has already been pre-trained on a large number of normal samples; only a small amount of business-related training data is needed to fine-tune it into a BERT-based masked language model, thereby avoiding the over-fitting caused in the prior art by insufficient parallel corpora for Chinese text error correction. By generating candidate words based on the target word and the text sequence to be corrected, candidate words are generated dynamically from the target word's context, avoiding the insufficiently flexible candidate generation caused by the confusion sets used in the prior art. Moreover, this application does not need to generate candidate words for every character in the text sequence to be corrected, which greatly saves computing resources.
  • the method further includes:
  • Step A1: Obtain labeled training data, where the labeled training data includes sentences without erroneous words, sentences with erroneous words, and the correct sentences corresponding to the sentences with erroneous words;
  • Step A2: Perform FINE-TUNE fine-tuning on the BERT-based pre-trained language model based on the labeled training data to obtain the BERT-based masked language model.
  • The BERT-based masked language model is obtained by fine-tuning the parameters of a BERT-based pre-trained language model with labeled training data, where the labeled training data is text data related to the business scenario; different business scenarios may have different labeled training data.
  • Step A2 includes: masking the sentences without erroneous words in the labeled training data according to a preset BERT masking method to obtain first mask data, with the predicted word of each masked word set to the original word itself; masking the erroneous words in the sentences with erroneous words with the original word to obtain second mask data, with the predicted word of each masked word set to the corresponding correct word; and fine-tuning the BERT-based pre-trained language model based on the first mask data, the second mask data, and their corresponding predicted words to obtain the BERT-based masked language model.
  • The labeled training data includes sentences without erroneous words, which can be used as the first training data. The first training data is masked according to the preset BERT masking method, which means that a preset proportion of the words in the first training data are masked to obtain the first mask data; the first mask data is also associated with the corresponding correct words, that is, the predicted words, and the predicted word of the first mask data is the word itself.
  • The specific masking method is: 80% of the selected words in the first training data are replaced with the [MASK] token, so that the model learns to predict masked words from the context (cloze); 10% of the selected words are replaced with random words, so that the model learns how to correct wrong words; and 10% of the selected words keep the original word, so that the model learns to detect whether a word is wrong. A sketch of this strategy follows.
  • The preset proportion is less than or equal to 20%; for example, 10%, 15%, or 20% can be selected.
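  • A minimal sketch (assumed details, for illustration only) of the 80%/10%/10% strategy over a preset proportion of positions in an error-free sentence:

    import random

    MASK = "[MASK]"

    def make_first_mask_data(chars, vocab, proportion=0.15):
        """Return (masked_chars, labels); the labels are the original characters."""
        labels = list(chars)
        out = list(chars)
        n_pick = max(1, int(len(chars) * proportion))
        for i in random.sample(range(len(chars)), n_pick):
            r = random.random()
            if r < 0.8:
                out[i] = MASK                  # learn cloze prediction
            elif r < 0.9:
                out[i] = random.choice(vocab)  # learn to correct wrong characters
            # else: keep the original character, to learn error detection
        return out, labels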
  • The labeled training data also includes sentences with erroneous words, which can be used as the second training data. The erroneous words in the second training data are masked with the original word, that is, the original words are kept, to obtain the second mask data; the second mask data is also associated with the corresponding correct words, that is, the predicted words.
  • After the first mask data, the second mask data, and their corresponding predicted words are obtained, these data are input into the BERT-based pre-trained language model, and training the pre-trained language model yields the BERT-based masked language model.
  • To further prevent over-fitting, some correct words in the second training data can also be masked with the original word to obtain third mask data, which is likewise associated with the corresponding predicted words.
  • The proportion of correct words in the second training data masked with the original word may be the same as the proportion of erroneous words in the second training data so masked.
  • In that case, the first mask data, the second mask data, the third mask data, and their corresponding predicted words are input into the BERT-based pre-trained language model, and training it yields the BERT-based masked language model.
  • This embodiment adopts a pre-trained language model that has already been pre-trained on a large number of normal samples; only a small amount of business-related training data is needed to fine-tune it into a BERT-based masked language model, thereby avoiding the over-fitting caused in the prior art by insufficient parallel corpora for Chinese text error correction.
  • The candidate word set of the target word includes the context confidence of each candidate word of the target word, and the above step S30 includes:
  • Step S31: Calculate the similarity between each candidate word and the corresponding target word;
  • Step S32: Determine the target replacement word of the target word from the candidate word set based on the context confidence and similarity of each candidate word and a preset filter curve, where the abscissa of the preset filter curve is the context confidence and the ordinate is the similarity.
  • Here the preset screening rule is the similarity-context confidence screening rule, where the similarity between a candidate word and the corresponding target word is derived from the glyph similarity and the phonetic similarity between them.
  • The preset filter curve is a function constructed from the context confidence and similarity of the labeled training data; the independent variable of the function is the context confidence (Confidence), and the dependent variable is the similarity (Similarity).
  • When the candidate word set of the target word is screened according to the similarity-context confidence screening rule, the selected target replacement word is not necessarily the candidate whose similarity to the target word and whose context confidence are both the highest.
  • Step S31 includes the following computations.
  • The phonetic similarity between a candidate word and the corresponding target word is calculated as follows: the candidate word and the corresponding target word are each analyzed to obtain their pronunciation information in Mandarin Chinese Pinyin.
  • The pronunciation information is the pinyin including the tone.
  • After the pronunciation information of the candidate word and the corresponding target word is determined, the phonetic sequences can be constructed: a first phonetic sequence is built from the pronunciation information of the candidate word, and a second phonetic sequence is built from the pronunciation information of the corresponding target word; a phonetic sequence includes the pinyin and the tone.
  • The characters in a phonetic sequence may be ordered with the pinyin first and the tone second, or with the tone first and the pinyin second. For example, the phonetic sequence of the candidate word "吴" is "wu2", where "wu" is the pinyin and "2" indicates the second tone; the phonetic sequence of the target word "昊" is "hao4", where "hao" is the pinyin and "4" indicates the fourth tone.
  • The phonetic edit distance between the candidate word and the target word can then be calculated from the phonetic sequences.
  • The edit distance is the number of characters that must be deleted, inserted, or modified to turn the first phonetic sequence of the candidate word into the second phonetic sequence of the target word; the phonetic similarity is then (L_MAX - phonetic edit distance) / L_MAX, where L_MAX is the greater of the lengths of the two phonetic sequences.
  • The glyph similarity between a candidate character and the corresponding target character is calculated as follows: the candidate character and the corresponding target character are each analyzed to obtain their stroke order under standard Chinese writing rules.
  • After the stroke orders are determined, the stroke sequences can be constructed: a first stroke sequence is built from the stroke order of the candidate character, and a second stroke sequence is built from the stroke order of the corresponding target character.
  • The glyph edit distance between the candidate character and the target character can then be calculated from the stroke sequences.
  • The edit distance is the number of characters that must be deleted, inserted, or modified to turn the first stroke sequence of the candidate character into the second stroke sequence of the target character; the glyph similarity is then (L_MAX - glyph edit distance) / L_MAX, where L_MAX is the greater of the lengths of the two stroke sequences.
  • The phonetic similarity and the glyph similarity between the candidate word and the corresponding target word are thus calculated from the word's pronunciation and shape respectively, and the average of the two is taken as the similarity between the candidate word and the corresponding target word.
  • The information of the word itself is thereby used to determine the similarity between the candidate word and the target word from both phonetic and glyph factors, making the factors involved in the candidate word's similarity more comprehensive and flexible.
  • an embodiment of the present application also provides a text error correction system.
  • the text error correction system includes:
  • a target word determination module, configured to obtain a text sequence to be corrected, recognize the text sequence to be corrected through a BERT-based masked language model, and determine the target words in the text sequence to be corrected that need correction;
  • a candidate word generation module, configured to generate a candidate word set of the target word according to the target word and the text sequence to be corrected;
  • a replacement module, configured to screen the candidate word set of the target word according to a preset screening rule, determine the target replacement word of the target word, and generate a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  • each module in the above text error correction system corresponds to each step in the above embodiment of the text error correction method, and its functions and implementation processes will not be repeated here.
  • This application also provides a text error correction device.
  • The text error correction device includes a processor, a memory, and a text error correction program stored on the memory and executable on the processor.
  • When the text error correction program is executed by the processor, the following steps are implemented:
  • obtaining a text sequence to be corrected, recognizing it through the BERT-based masked language model, and determining the target words that need correction; generating a candidate word set of the target word according to the target word and the text sequence to be corrected; and screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • A text error correction program is stored on the computer-readable storage medium of this application. When the text error correction program is executed by a processor, the following steps are implemented:
  • obtaining a text sequence to be corrected, recognizing it through the BERT-based masked language model, and determining the target words that need correction; generating a candidate word set of the target word according to the target word and the text sequence to be corrected; and screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  • the method implemented when the text error correction program is executed can refer to the various embodiments of the text error correction method of this application, which will not be repeated here.
  • In another embodiment, to further ensure the privacy and security of all the data mentioned above, the text error correction method provided by this application may also store all of the above data in a node of a blockchain.
  • For example, the target replacement words and the candidate word sets can all be stored in blockchain nodes.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • The technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A text error correction method, system, device, and computer-readable storage medium, relating to the field of artificial intelligence. The method obtains a text sequence to be corrected, recognizes the text sequence to be corrected through a BERT-based masked language model, and determines the target words in the text sequence to be corrected that need correction; generates a candidate word set of the target word according to the target word and the text sequence to be corrected; and screens the candidate word set of the target word according to a preset screening rule, determines the target replacement word of the target word, and generates a replacement text sequence according to the target replacement word and the text sequence to be corrected. The use of a BERT-based masked language model avoids the over-fitting caused by insufficient parallel corpora for Chinese text error correction; dynamically generating candidate words from the target word's context avoids the insufficiently flexible candidate generation caused by the confusion sets used in the prior art.

Description

Text error correction method, system, device, and readable storage medium
This application claims priority to the Chinese patent application filed with the China Patent Office on September 3, 2020, with application number CN202010925578.3 and the invention title "Text error correction method, system, device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to a text error correction method, system, device, and computer-readable storage medium.
Background
Since the beginning of the 21st century, important documents in the medical field have gradually shifted from handwritten files to electronic documents; important documents such as medical records are manually entered into computers by doctors for storage, so the correctness of the entered information in this process is critical. Mistyped keys and input-method errors during entry introduce a certain proportion of grammatical errors, and such errors are an extremely serious problem in the medical field, so every effort must be made to eliminate them.
Technical Problem
The inventor realized that traditional Chinese text error correction has two main problems: first, the parallel corpora for Chinese text error correction are insufficient; second, when confusion sets are used for error correction, the confusion sets are manually preset and different business application scenarios require manually constructing different confusion sets, so their flexibility is limited, which makes current Chinese grammar error correction models generally perform poorly.
Technical Solution
A text error correction method, including the following steps:
obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
A text error correction system, including:
a target word determination module, configured to obtain a text sequence to be corrected, recognize the text sequence to be corrected through a BERT-based masked language model, and determine the target words in the text sequence to be corrected that need correction;
a candidate word generation module, configured to generate a candidate word set of the target word according to the target word and the text sequence to be corrected;
a replacement module, configured to screen the candidate word set of the target word according to a preset screening rule, determine the target replacement word of the target word, and generate a replacement text sequence according to the target replacement word and the text sequence to be corrected.
A text error correction device, including a processor, a memory, and a text error correction program stored on the memory and executable by the processor, where the text error correction program, when executed by the processor, implements the following steps:
obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
A computer-readable storage medium storing a text error correction program, where the text error correction program, when executed by a processor, implements the following steps:
obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
This application dynamically generates candidate words based on the context of the target word, avoiding the insufficiently flexible candidate generation caused by the confusion sets used in the prior art; moreover, this application does not need to generate candidate words for every word in the text sequence to be corrected, which greatly saves computing resources.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of the hardware structure of the text error correction device involved in the solutions of the embodiments of this application;
FIG. 2 is a schematic flowchart of the first embodiment of the text error correction method of this application;
FIG. 3 is a schematic diagram of the functional modules of the first embodiment of the text error correction system of this application.
The realization of the purpose, functional features, and advantages of this application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Embodiments of the Invention
It should be understood that the specific embodiments described here are only intended to explain this application and are not intended to limit it.
The text error correction method in the embodiments of this application is mainly applied to a text error correction device, which may be a device with display and processing functions, such as a PC, a portable computer, or a mobile terminal.
Referring to FIG. 1, FIG. 1 is a schematic diagram of the hardware structure of the text error correction device involved in the solutions of the embodiments of this application. In the embodiments of this application, the text error correction device may include a processor 1001 (for example, a CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to realize connection and communication between these components; the user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard); the network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wi-Fi interface); the memory 1005 may be a high-speed RAM memory or a stable non-volatile memory, such as a disk memory, and may optionally also be a storage device independent of the aforementioned processor 1001.
Those skilled in the art can understand that the hardware structure shown in FIG. 1 does not constitute a limitation on the text error correction device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
Still referring to FIG. 1, the memory 1005 in FIG. 1, as a computer-readable storage medium, may include an operating system, a network communication module, and a text error correction program.
In FIG. 1, the network communication module is mainly used to connect to a server and perform data communication with the server, and the processor 1001 can call the text error correction program stored in the memory 1005 and perform the following operations:
obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
Based on the above hardware structure, the embodiments of the text error correction method of this application are proposed.
Since the beginning of the 21st century, important documents in the medical field have gradually shifted from handwritten files to electronic documents; important documents such as medical records are manually entered into computers by doctors for storage, so the correctness of the entered information in this process is critical. Mistyped keys and input-method errors during entry introduce a certain proportion of grammatical errors, and such errors are an extremely serious problem in the medical field, so every effort must be made to eliminate them.
Traditional Chinese text error correction has two main problems: first, the parallel corpora for Chinese text error correction are insufficient; second, when confusion sets are used for error correction, the confusion sets are manually preset and different business application scenarios require manually constructing different confusion sets, so their flexibility is limited, which makes current Chinese grammar error correction models generally perform poorly.
To solve the above problems, this application provides a text error correction method: a pre-trained language model that has already been pre-trained on a large number of normal samples is adopted, and only a small amount of business-related training data is needed to fine-tune it on the basis of the pre-trained language model into a BERT-based masked language model, thereby avoiding the over-fitting caused in the prior art by insufficient parallel corpora for Chinese text error correction. By generating candidate words based on the target word and the text sequence to be corrected, candidate words are dynamically generated from the target word's context, avoiding the insufficiently flexible candidate generation caused by the confusion sets used in the prior art. Moreover, this application does not need to generate candidate words for every word in the text sequence to be corrected, which greatly saves computing resources.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of the first embodiment of the text error correction method of this application.
The first embodiment of this application provides a text error correction method, which includes the following steps:
Step S10: obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
The text error correction method in this embodiment is implemented by a text error correction device, which may be a server, a personal computer, a notebook computer, or a similar device; a server is taken as an example in this embodiment. Before performing text error correction, this embodiment first needs to obtain (construct) a language model for recognizing the text to be corrected; training such a language model from scratch requires a large amount of training data, computing time, and computing resources, and risks under-optimized parameters, low accuracy, and over-fitting. The Masked Language Model (MLM) used in this embodiment is therefore obtained by performing FINE-TUNE (fine-tuning) on the Chinese pre-trained language model provided by Google. A language model predicts what a word is based on its context and can learn rich semantic knowledge from unrestricted large-scale monolingual corpora. The masked language model used in this embodiment may be implemented based on the BERT language model (Bidirectional Encoder Representations from Transformers). The BERT language model includes Transformer encoders; because of the self-attention mechanism, the upper and lower layers of the model are all directly connected to each other, so all layers of the model can be considered bidirectional. The model input is composed jointly of token embeddings, segment embeddings, and position embeddings. When BERT is pre-trained, it performs two tasks, Masked LM and Next Sentence Prediction, and the samples used for pre-training can be unlabeled corpora, such as corpus text crawled from the web.
Further, the masked language model is constructed by the FINE-TUNE (fine-tuning) transfer-learning approach on the Chinese pre-trained language model provided by Google, which ensures that good results can be obtained even with a limited dataset and helps reduce the negative impact of insufficient training samples. Fine-tuning starts from the existing parameters of the pre-trained language model and performs transfer learning (training) with labeled training data, so that some parameters are slightly adjusted to obtain a model that meets actual usage requirements; building the model by task fine-tuning helps ensure the accuracy of the model's results while reducing model construction costs and improving construction efficiency.
In this embodiment, the text sequence to be corrected refers to the text that needs correction; it may also be a sentence obtained by splitting that text according to punctuation and sentence boundaries, and the text sequence to be corrected retains its context from the original text. After the text sequence to be corrected is obtained, it is input into the BERT-based masked language model, which recognizes each word in the text sequence to be corrected and determines the target words that may be erroneous and therefore need correction.
Further, in one embodiment, the above step S10 includes: determining the context confidence of each word in the text sequence to be corrected through the masked language model, and taking the words whose context confidence is below a preset threshold as the target words. The masked language model can compute the context confidence of the word at each position in the text sequence to be corrected, and the words whose context confidence is below the preset threshold are then taken as the target words that need correction; the preset threshold can be set according to the accuracy requirements of the business scenario, and the higher the accuracy requirement, the higher the preset threshold.
Alternatively, the above step S10 includes: determining the context confidence of each word in the text sequence to be corrected through the masked language model, sorting the words by context confidence, and taking the preset number of words with the lowest context confidence as the target words. After the masked language model computes the context confidence of the word at each position in the text sequence to be corrected, the words can be sorted by their context confidence, and the preset number of words with the lowest context confidence are taken as the target words that need correction. The preset number can be set according to the accuracy requirements of the business scenario, the computing-resource constraints of the text error correction device, and the required correction time, which this embodiment does not specifically limit.
The context confidence of each word reflects the probability, determined from the word's contextual semantics in the text sequence to be corrected, that the word appears at its position: the higher a word's context confidence, the lower the probability that it is a target word needing correction, and the lower a word's context confidence, the higher that probability.
Step S20: generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
In this embodiment, after the target words that need correction are determined, a candidate word set of each target word can be generated from the target word's context. Understandably, the target word at each position has a corresponding candidate word set, and the number of candidate words in the candidate word set can be set as needed.
Further, in one embodiment, after the target words that need correction are determined, the target words in the text sequence to be corrected can be annotated to obtain an annotated text sequence; the annotated text sequence is input into the masked language model, which processes it and outputs the candidate word set of each target word (a sketch of this follows).
Further, in one embodiment, after a target word that needs correction is determined, the historical error-correction records can be searched for a corrected historical replacement word corresponding to the target word. If one exists, the historical replacement word is taken as a candidate word of the target word, and one or more candidate words form the candidate word set; if none exists, the candidate word set of the target word is generated from the target word's confusion set.
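The following is a minimal sketch of candidate generation with a masked language model, an illustration under assumptions (the Hugging Face transformers library and the bert-base-chinese checkpoint), not the patented implementation: the target position is masked and the model's top-k predictions, together with their context confidences, form the candidate word set.

    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertForMaskedLM.from_pretrained("bert-base-chinese")
    model.eval()

    def candidate_set(text, position, k=5):
        """Top-k candidates (with confidences) for the character at `position`."""
        ids = tokenizer(text, return_tensors="pt")["input_ids"]
        ids[0, position + 1] = tokenizer.mask_token_id   # +1 skips [CLS]
        with torch.no_grad():
            logits = model(ids).logits[0, position + 1]
        probs = torch.softmax(logits, dim=-1)
        top = torch.topk(probs, k)
        return [(tokenizer.decode([int(i)]), p.item())
                for i, p in zip(top.indices, top.values)]

    print(candidate_set("今天天汽真好", position=3))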
Step S30: screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
In this embodiment, the preset screening rule may be a similarity-context confidence screening rule. When the preset screening rule is the similarity-context confidence screening rule, the candidate word set of the target word also includes the context confidence of each candidate word of the target word, and the specific screening rule is: calculate the similarity between each candidate word and the corresponding target word; and determine the target replacement word of the target word from the candidate word set based on the context confidence and similarity of each candidate word and a preset filter curve, where the abscissa of the preset filter curve is the context confidence and the ordinate is the similarity.
Further, in one embodiment, the preset screening rule may also be a phonetic similarity screening rule and/or a glyph similarity screening rule. Specifically, the phonetic similarity screening rule calculates the phonetic similarity between each candidate word and the corresponding target word, and takes the candidate word with the highest phonetic similarity to the target word as the target replacement word of the target word; the glyph similarity screening rule calculates the glyph similarity between each candidate character and the corresponding target character, and takes the candidate character with the highest glyph similarity to the target character as the target replacement character. When the phonetic similarity screening rule and the glyph similarity screening rule are combined for screening, the user's usage frequency of the pinyin input method and of the stroke input method during historical typing is computed in advance; a phonetic weight coefficient P is set for the phonetic similarity according to the usage frequency of the pinyin input method, and a glyph weight coefficient V is set for the glyph similarity according to the usage frequency of the stroke input method, where the higher the usage frequency, the larger the corresponding weight coefficient. Then, when the candidate word set is screened, the weight value of each candidate word of the target word is calculated as: weight value = phonetic similarity * P + glyph similarity * V; the candidate words of the target word are sorted by weight value, and the candidate word with the largest weight value is determined to be the target replacement word of the target word. Understandably, the number of target replacement words of a target word can be greater than one.
In this embodiment, a text sequence to be corrected is obtained and recognized through a BERT-based masked language model to determine the target words in the text sequence to be corrected that need correction; a candidate word set of the target word is generated according to the target word and the text sequence to be corrected; the candidate word set of the target word is screened according to a preset screening rule to determine the target replacement word of the target word; and a replacement text sequence is generated according to the target replacement word and the text sequence to be corrected.
In this way, this application adopts a pre-trained language model that has already been pre-trained on a large number of normal samples; only a small amount of business-related training data is needed to fine-tune it on the basis of the pre-trained language model into a BERT-based masked language model, thereby avoiding the over-fitting caused in the prior art by insufficient parallel corpora for Chinese text error correction. By generating candidate words based on the target word and the text sequence to be corrected, candidate words are dynamically generated from the target word's context, avoiding the insufficiently flexible candidate generation caused by the confusion sets used in the prior art. Moreover, this application does not need to generate candidate words for every word in the text sequence to be corrected, which greatly saves computing resources.
Further (not shown in the figures), based on the first embodiment shown in FIG. 2, a second embodiment of the text error correction method of this application is proposed. In this embodiment, before step S10, the method further includes:
Step A1: obtaining labeled training data, where the labeled training data includes sentences without erroneous words, sentences with erroneous words, and the correct sentences corresponding to the sentences with erroneous words;
Step A2: performing FINE-TUNE fine-tuning on a BERT-based pre-trained language model based on the labeled training data to obtain the BERT-based masked language model.
In this embodiment, the BERT-based masked language model is obtained by fine-tuning the parameters of the BERT-based pre-trained language model with the labeled training data, where the labeled training data is text data related to the business scenario, and different business scenarios may have different labeled training data.
Further, the above step A2 includes:
masking the sentences without erroneous words in the labeled training data according to a preset BERT masking method to obtain first mask data, and setting the predicted word of each masked word to the original word itself;
masking the erroneous words in the sentences with erroneous words in the labeled training data with the original word to obtain second mask data, and setting the predicted word of each masked word to the corresponding correct word;
fine-tuning the BERT-based pre-trained language model based on the first mask data, the second mask data, and their corresponding predicted words to obtain the BERT-based masked language model.
In this embodiment, the labeled training data includes sentences without erroneous words, which can be used as the first training data. The first training data is masked according to the preset BERT masking method, which means that a preset proportion of the words in the first training data are masked to obtain the first mask data; the first mask data is also associated with the corresponding correct words, that is, the predicted words, and the predicted word of the first mask data is the word itself. The specific masking method is: 80% of the selected words in the first training data are replaced with the [MASK] token, so that the model learns to predict the masked words in the text from the context (cloze); 10% of the selected words are replaced with random words, so that the model learns how to correct wrong words; and 10% of the selected words keep the original word, so that the model learns to detect whether a word is wrong. The preset proportion is less than or equal to 20%; for example, 10%, 15%, or 20% can be selected.
The labeled training data also includes sentences with erroneous words, which can be used as the second training data. The erroneous words in the second training data are masked with the original word, that is, the original words are kept, to obtain the second mask data; the second mask data is also associated with the corresponding correct words, that is, the predicted words.
After the first mask data, the second mask data, and their corresponding predicted words are obtained, these data are input into the BERT-based pre-trained language model, and training the pre-trained language model yields the BERT-based masked language model.
Further, to further prevent over-fitting, some correct words in the second training data can also be masked with the original word to obtain third mask data, which is likewise associated with the corresponding predicted words, namely the words themselves; the proportion of correct words in the second training data masked with the original word may be the same as the proportion of erroneous words so masked. Correspondingly, after the first mask data, the second mask data, the third mask data, and their corresponding predicted words are obtained, these data are input into the BERT-based pre-trained language model, and training it yields the BERT-based masked language model.
This embodiment adopts a pre-trained language model that has already been pre-trained on a large number of normal samples; only a small amount of business-related training data is needed to fine-tune it into a BERT-based masked language model, thereby avoiding the over-fitting caused in the prior art by insufficient parallel corpora for Chinese text error correction.
Further, based on the first embodiment shown in FIG. 2 and the second embodiment, a third embodiment of the text error correction method of this application is proposed.
The candidate word set of the target word includes the context confidence of each candidate word of the target word, and the above step S30 includes:
Step S31: calculating the similarity between each candidate word and the corresponding target word;
Step S32: determining the target replacement word of the target word from the candidate word set based on the context confidence and similarity of each candidate word and a preset filter curve, where the abscissa of the preset filter curve is the context confidence and the ordinate is the similarity.
In this embodiment, the preset screening rule is the similarity-context confidence screening rule, where the similarity between a candidate word and the corresponding target word is obtained from the glyph similarity and the phonetic similarity between the candidate word and the corresponding target word.
In this embodiment, the preset filter curve is a function constructed from the context confidence and similarity of the labeled training data; the independent variable of the function is the context confidence (Confidence), and the dependent variable is the similarity (Similarity). After the preset filter curve is constructed, each candidate word of the target word is marked in the coordinate system of the preset filter curve as a coordinate point, with its Confidence as the abscissa and its Similarity as the ordinate. The preset filter curve is a manually chosen curve that guarantees that the candidate words above the curve are all words of high accuracy; therefore, when the candidate words are screened, all candidates lying above the curve can be taken as target replacement words of the corresponding target word, as in the sketch below.
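As a minimal sketch of the filter-curve screening (the curve below is an illustrative assumption, not the curve used in the patent), each candidate is treated as a point in the confidence-similarity plane and is accepted when it lies above the curve:

    def above_curve(confidence, similarity):
        # Illustrative hand-chosen curve: the lower the context confidence,
        # the higher the similarity a candidate must have to be accepted.
        required_similarity = 0.9 - 0.5 * confidence
        return similarity > required_similarity

    # (candidate, context confidence, similarity) triples; made-up values
    candidates = [("吴", 0.62, 0.83), ("无", 0.40, 0.55)]
    replacements = [c for c, conf, sim in candidates if above_curve(conf, sim)]
    print(replacements)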
Note that when the candidate word set of the target word is screened according to the similarity-context confidence screening rule, the selected target replacement word is not necessarily the candidate whose similarity to the target word and whose context confidence are both the highest.
Further, the above step S31 includes:
constructing a first phonetic sequence based on the pronunciation information of the candidate word, and constructing a second phonetic sequence based on the pronunciation information of the target word corresponding to the candidate word;
calculating the phonetic edit distance between the first phonetic sequence and the second phonetic sequence, and determining the phonetic similarity between the candidate word and the corresponding target word based on the phonetic edit distance;
constructing a first stroke sequence based on the stroke order of the candidate word, and constructing a second stroke sequence based on the stroke order of the target word;
calculating the stroke edit distance between the first stroke sequence and the second stroke sequence, and determining the glyph similarity between the candidate word and the corresponding target word based on the stroke edit distance;
calculating the average of the phonetic similarity and the glyph similarity between the candidate word and the corresponding target word as the similarity between the candidate word and the corresponding target word.
In this embodiment, the phonetic similarity between a candidate word and the corresponding target word is calculated as follows: the candidate word and the corresponding target word are each analyzed to obtain their pronunciation information in Mandarin Chinese Pinyin, where the pronunciation information is the pinyin including the tone. After the pronunciation information of the candidate word and the corresponding target word is determined, the phonetic sequences can be constructed: a first phonetic sequence is built from the pronunciation information of the candidate word, and a second phonetic sequence is built from the pronunciation information of the corresponding target word. A phonetic sequence includes the pinyin and the tone, and the characters in it may be ordered with the pinyin first and the tone second, or with the tone first and the pinyin second. For example, the phonetic sequence of the candidate word "吴" is "wu2", where "wu" is the pinyin and "2" indicates the second tone; the phonetic sequence of the target word "昊" is "hao4", where "hao" is the pinyin and "4" indicates the fourth tone.
After the phonetic sequences are determined, the phonetic edit distance between the candidate word and the target word can be calculated from them; the edit distance is the number of characters that must be deleted, inserted, or modified to turn the candidate word's first phonetic sequence into the target word's second phonetic sequence.
After the phonetic edit distance between the candidate word and the target word is determined, the phonetic similarity between them can be calculated according to the following formula: phonetic similarity = (L_MAX - phonetic edit distance) / L_MAX, where L_MAX is the greater of the length of the candidate word's first phonetic sequence and the length of the target word's second phonetic sequence. A sketch of this computation follows.
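A minimal sketch of this computation: a standard Levenshtein edit distance over the tone-annotated pinyin sequences, then similarity = (L_MAX - edit distance) / L_MAX as in the formula above. The pinyin strings are written by hand here; deriving them automatically (for example, with a pinyin library) is an assumption outside the patent text. The same routine applied to stroke sequences yields the glyph similarity described next.

    def edit_distance(a, b):
        """Levenshtein distance computed with a single rolling row."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # delete
                                         dp[j - 1] + 1,      # insert
                                         prev + (ca != cb))  # modify
        return dp[len(b)]

    def sequence_similarity(seq1, seq2):
        l_max = max(len(seq1), len(seq2))
        return (l_max - edit_distance(seq1, seq2)) / l_max

    print(sequence_similarity("wu2", "hao4"))  # phonetic similarity of 吴 vs 昊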
In this embodiment, the glyph similarity between a candidate character and the corresponding target character is calculated as follows: the candidate character and the corresponding target character are each analyzed to obtain their stroke order under standard Chinese writing rules. After the stroke orders of the candidate character and the corresponding target character are determined, the stroke sequences can be constructed: a first stroke sequence is built from the stroke order of the candidate character, and a second stroke sequence is built from the stroke order of the corresponding target character.
After the stroke sequences are determined, the glyph edit distance between the candidate character and the target character can be calculated from them; the edit distance is the number of characters that must be deleted, inserted, or modified to turn the candidate character's first stroke sequence into the target character's second stroke sequence.
After the glyph edit distance between the candidate character and the target character is determined, the glyph similarity between them can be calculated according to the following formula: glyph similarity = (L_MAX - glyph edit distance) / L_MAX, where L_MAX is the greater of the length of the candidate character's first stroke sequence and the length of the target character's second stroke sequence.
In this embodiment, the phonetic similarity and the glyph similarity between the candidate word and the corresponding target word are calculated based on the candidate word's pronunciation and shape respectively, and the average of the phonetic similarity and the glyph similarity is taken as the similarity between the candidate word and the corresponding target word. The information of the target word itself is thus used to determine the similarity between the candidate word and the target word from both phonetic and glyph factors, making the factors involved in the candidate word's similarity more comprehensive and flexible.
In addition, as shown in FIG. 3, an embodiment of this application further provides a text error correction system.
In this embodiment, the text error correction system includes:
a target word determination module, configured to obtain a text sequence to be corrected, recognize the text sequence to be corrected through a BERT-based masked language model, and determine the target words in the text sequence to be corrected that need correction;
a candidate word generation module, configured to generate a candidate word set of the target word according to the target word and the text sequence to be corrected;
a replacement module, configured to screen the candidate word set of the target word according to a preset screening rule, determine the target replacement word of the target word, and generate a replacement text sequence according to the target replacement word and the text sequence to be corrected.
The modules of the above text error correction system correspond to the steps of the above embodiments of the text error correction method, and their functions and implementation processes are not repeated here.
This application further provides a text error correction device.
The text error correction device includes a processor, a memory, and a text error correction program stored on the memory and executable on the processor, where the text error correction program, when executed by the processor, implements the following steps:
obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
For the method implemented when the text error correction program is executed, refer to the embodiments of the text error correction method of this application; it is not repeated here.
In addition, an embodiment of this application further provides a computer-readable storage medium, which may be volatile or non-volatile.
A text error correction program is stored on the computer-readable storage medium of this application, where the text error correction program, when executed by a processor, implements the following steps:
obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
For the method implemented when the text error correction program is executed, refer to the embodiments of the text error correction method of this application; it is not repeated here.
In another embodiment, in the text error correction method provided by this application, to further ensure the privacy and security of all the data mentioned above, all of the above data may also be stored in a node of a blockchain. For example, the target replacement words and the candidate word sets can all be stored in blockchain nodes.
It should be noted that the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms.
It should be noted that, in this document, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or system that includes that element.
The above serial numbers of the embodiments of this application are for description only and do not represent the merits of the embodiments.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods described in the embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit the patent scope of this application; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A text error correction method, wherein the text error correction method includes the following steps:
    obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
    generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
    screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  2. The text error correction method according to claim 1, wherein the step of recognizing the text sequence to be corrected through a BERT-based masked language model and determining the target words in the text sequence to be corrected that need correction includes:
    determining the context confidence of each word in the text sequence to be corrected through the masked language model, and taking the words whose context confidence is below a preset threshold as the target words; or sorting the words by context confidence and taking the preset number of words with the lowest context confidence as the target words.
  3. The text error correction method according to claim 2, wherein the step of generating a candidate word set of the target word according to the target word and the text sequence to be corrected includes:
    annotating the target word in the text sequence to be corrected to obtain an annotated text sequence;
    inputting the annotated text sequence into the masked language model to obtain the candidate word set of the target word output by the masked language model.
  4. The text error correction method according to claim 3, wherein the candidate word set of the target word includes the context confidence of each candidate word of the target word;
    the step of screening the candidate word set of the target word according to a preset screening rule and determining the target replacement word of the target word includes:
    calculating the similarity between each candidate word and the corresponding target word;
    determining the target replacement word of the target word from the candidate word set based on the context confidence and similarity of each candidate word and a preset filter curve, where the abscissa of the preset filter curve is the context confidence and the ordinate is the similarity.
  5. The text error correction method according to claim 4, wherein the step of calculating the similarity between each candidate word and the corresponding target word includes:
    constructing a first phonetic sequence based on the pronunciation information of the candidate word, and constructing a second phonetic sequence based on the pronunciation information of the target word corresponding to the candidate word;
    calculating the phonetic edit distance between the first phonetic sequence and the second phonetic sequence, and determining the phonetic similarity between the candidate word and the corresponding target word based on the phonetic edit distance;
    constructing a first stroke sequence based on the stroke order of the candidate word, and constructing a second stroke sequence based on the stroke order of the target word;
    calculating the stroke edit distance between the first stroke sequence and the second stroke sequence, and determining the glyph similarity between the candidate word and the corresponding target word based on the stroke edit distance;
    calculating the average of the phonetic similarity and the glyph similarity between the candidate word and the corresponding target word as the similarity between the candidate word and the corresponding target word.
  6. The text error correction method according to claim 1, wherein before the step of recognizing the text sequence to be corrected through a BERT-based masked language model and determining the target words in the text sequence to be corrected that need correction, the method further includes:
    obtaining labeled training data, where the labeled training data includes sentences without erroneous words, sentences with erroneous words, and the correct sentences corresponding to the sentences with erroneous words;
    performing FINE-TUNE fine-tuning on a BERT-based pre-trained language model based on the labeled training data to obtain the BERT-based masked language model.
  7. The text error correction method according to claim 6, wherein the step of performing FINE-TUNE fine-tuning on a BERT-based pre-trained language model based on the labeled training data to obtain the BERT-based masked language model includes:
    masking the sentences without erroneous words in the labeled training data according to a preset BERT masking method to obtain first mask data, and setting the predicted word of each masked word to the original word itself;
    masking the erroneous words in the sentences with erroneous words in the labeled training data with the original word to obtain second mask data, and setting the predicted word of each masked word to the corresponding correct word;
    fine-tuning the BERT-based pre-trained language model based on the first mask data, the second mask data, and their corresponding predicted words to obtain the BERT-based masked language model.
  8. A text error correction system, wherein the text error correction system includes:
    a target word determination module, configured to obtain a text sequence to be corrected, recognize the text sequence to be corrected through a BERT-based masked language model, and determine the target words in the text sequence to be corrected that need correction;
    a candidate word generation module, configured to generate a candidate word set of the target word according to the target word and the text sequence to be corrected;
    a replacement module, configured to screen the candidate word set of the target word according to a preset screening rule, determine the target replacement word of the target word, and generate a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  9. A text error correction device, wherein the text error correction device includes a processor, a memory, and a text error correction program stored on the memory and executable by the processor, where the text error correction program, when executed by the processor, implements the following steps:
    obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
    generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
    screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  10. The text error correction device according to claim 9, wherein the step of recognizing the text sequence to be corrected through a BERT-based masked language model and determining the target words in the text sequence to be corrected that need correction includes:
    determining the context confidence of each word in the text sequence to be corrected through the masked language model, and taking the words whose context confidence is below a preset threshold as the target words; or sorting the words by context confidence and taking the preset number of words with the lowest context confidence as the target words.
  11. The text error correction device according to claim 10, wherein the step of generating a candidate word set of the target word according to the target word and the text sequence to be corrected includes:
    annotating the target word in the text sequence to be corrected to obtain an annotated text sequence;
    inputting the annotated text sequence into the masked language model to obtain the candidate word set of the target word output by the masked language model.
  12. The text error correction device according to claim 11, wherein the candidate word set of the target word includes the context confidence of each candidate word of the target word;
    the step of screening the candidate word set of the target word according to a preset screening rule and determining the target replacement word of the target word includes:
    calculating the similarity between each candidate word and the corresponding target word;
    determining the target replacement word of the target word from the candidate word set based on the context confidence and similarity of each candidate word and a preset filter curve, where the abscissa of the preset filter curve is the context confidence and the ordinate is the similarity.
  13. The text error correction device according to claim 12, wherein the step of calculating the similarity between each candidate word and the corresponding target word includes:
    constructing a first phonetic sequence based on the pronunciation information of the candidate word, and constructing a second phonetic sequence based on the pronunciation information of the target word corresponding to the candidate word;
    calculating the phonetic edit distance between the first phonetic sequence and the second phonetic sequence, and determining the phonetic similarity between the candidate word and the corresponding target word based on the phonetic edit distance;
    constructing a first stroke sequence based on the stroke order of the candidate word, and constructing a second stroke sequence based on the stroke order of the target word;
    calculating the stroke edit distance between the first stroke sequence and the second stroke sequence, and determining the glyph similarity between the candidate word and the corresponding target word based on the stroke edit distance;
    calculating the average of the phonetic similarity and the glyph similarity between the candidate word and the corresponding target word as the similarity between the candidate word and the corresponding target word.
  14. The text error correction device according to claim 9, wherein before the step of recognizing the text sequence to be corrected through a BERT-based masked language model and determining the target words in the text sequence to be corrected that need correction, the text error correction program, when executed by the processor, further implements the following steps:
    obtaining labeled training data, where the labeled training data includes sentences without erroneous words, sentences with erroneous words, and the correct sentences corresponding to the sentences with erroneous words;
    performing FINE-TUNE fine-tuning on a BERT-based pre-trained language model based on the labeled training data to obtain the BERT-based masked language model.
  15. The text error correction device according to claim 14, wherein the step of performing FINE-TUNE fine-tuning on a BERT-based pre-trained language model based on the labeled training data to obtain the BERT-based masked language model includes:
    masking the sentences without erroneous words in the labeled training data according to a preset BERT masking method to obtain first mask data, and setting the predicted word of each masked word to the original word itself;
    masking the erroneous words in the sentences with erroneous words in the labeled training data with the original word to obtain second mask data, and setting the predicted word of each masked word to the corresponding correct word;
    fine-tuning the BERT-based pre-trained language model based on the first mask data, the second mask data, and their corresponding predicted words to obtain the BERT-based masked language model.
  16. A computer-readable storage medium, wherein a text error correction program is stored on the computer-readable storage medium, and when the text error correction program is executed by a processor, the following steps are implemented:
    obtaining a text sequence to be corrected, recognizing the text sequence to be corrected through a BERT-based masked language model, and determining the target words in the text sequence to be corrected that need correction;
    generating a candidate word set of the target word according to the target word and the text sequence to be corrected;
    screening the candidate word set of the target word according to a preset screening rule, determining the target replacement word of the target word, and generating a replacement text sequence according to the target replacement word and the text sequence to be corrected.
  17. The computer-readable storage medium according to claim 16, wherein the step of recognizing the text sequence to be corrected through a BERT-based masked language model and determining the target words in the text sequence to be corrected that need correction includes:
    determining the context confidence of each word in the text sequence to be corrected through the masked language model, and taking the words whose context confidence is below a preset threshold as the target words; or sorting the words by context confidence and taking the preset number of words with the lowest context confidence as the target words.
  18. The computer-readable storage medium according to claim 17, wherein the step of generating a candidate word set of the target word according to the target word and the text sequence to be corrected includes:
    annotating the target word in the text sequence to be corrected to obtain an annotated text sequence;
    inputting the annotated text sequence into the masked language model to obtain the candidate word set of the target word output by the masked language model.
  19. The computer-readable storage medium according to claim 18, wherein the candidate word set of the target word includes the context confidence of each candidate word of the target word;
    the step of screening the candidate word set of the target word according to a preset screening rule and determining the target replacement word of the target word includes:
    calculating the similarity between each candidate word and the corresponding target word;
    determining the target replacement word of the target word from the candidate word set based on the context confidence and similarity of each candidate word and a preset filter curve, where the abscissa of the preset filter curve is the context confidence and the ordinate is the similarity.
  20. The computer-readable storage medium according to claim 19, wherein the step of calculating the similarity between each candidate word and the corresponding target word includes:
    constructing a first phonetic sequence based on the pronunciation information of the candidate word, and constructing a second phonetic sequence based on the pronunciation information of the target word corresponding to the candidate word;
    calculating the phonetic edit distance between the first phonetic sequence and the second phonetic sequence, and determining the phonetic similarity between the candidate word and the corresponding target word based on the phonetic edit distance;
    constructing a first stroke sequence based on the stroke order of the candidate word, and constructing a second stroke sequence based on the stroke order of the target word;
    calculating the stroke edit distance between the first stroke sequence and the second stroke sequence, and determining the glyph similarity between the candidate word and the corresponding target word based on the stroke edit distance;
    calculating the average of the phonetic similarity and the glyph similarity between the candidate word and the corresponding target word as the similarity between the candidate word and the corresponding target word.
PCT/CN2020/125011 2020-09-03 2020-10-30 Text error correction method, system, device, and readable storage medium WO2021189851A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010925578.3 2020-09-03
CN202010925578.3A CN112016310A (zh) Text error correction method, system, device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2021189851A1 (zh)

Family

ID=73515401

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/125011 WO2021189851A1 (zh) 2020-09-03 2020-10-30 Text error correction method, system, device, and readable storage medium

Country Status (2)

Country Link
CN (1) CN112016310A (zh)
WO (1) WO2021189851A1 (zh)


Also Published As

Publication number Publication date
CN112016310A (zh) 2020-12-01


Legal Events

121 (EP): the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20927770; Country of ref document: EP; Kind code of ref document: A1.
NENP: Non-entry into the national phase. Ref country code: DE.
122 (EP): PCT application non-entry in European phase. Ref document number: 20927770; Country of ref document: EP; Kind code of ref document: A1.