WO2021218329A1 - Parallel corpus generation method, apparatus and device, and storage medium - Google Patents

Parallel corpus generation method, apparatus and device, and storage medium

Info

Publication number
WO2021218329A1
WO2021218329A1 PCT/CN2021/078059 CN2021078059W WO2021218329A1 WO 2021218329 A1 WO2021218329 A1 WO 2021218329A1 CN 2021078059 W CN2021078059 W CN 2021078059W WO 2021218329 A1 WO2021218329 A1 WO 2021218329A1
Authority
WO
WIPO (PCT)
Prior art keywords
words
word
replaced
corpus
replacement
Prior art date
Application number
PCT/CN2021/078059
Other languages
French (fr)
Chinese (zh)
Inventor
邱煜
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2021218329A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Definitions

  • This application relates to the field of natural language processing, and in particular to a method, device, equipment and storage medium for generating parallel corpus.
  • the embodiments of this application provide a parallel corpus generation method, device, equipment, and storage medium.
  • the professional vocabulary can be flexibly constructed according to the professional dictionary corresponding to the application scenario.
  • the wrong-character candidates and/or wrong-word candidates are used for replacement according to character frequency or word frequency to obtain the corresponding parallel sentences, so that correct-error parallel corpora can be constructed flexibly for different application scenarios and the scenario migration cost of a wrong-word correction system based on the parallel training corpus is reduced.
  • the first embodiment of the parallel corpus generation method in the embodiment of the present application includes:
  • A1 to A26 represent the coding fields corresponding to the sequential 26 initials in the initials table
  • the phonetic-shape code comprises eleven types of encoding fields. For each type of field that differs between the phonetic-shape code of the pre-replacement character and that of the common character, the edit distance is increased by 1; otherwise it is left unchanged. If every type of field matches, the two characters have the highest similarity and the edit distance between them is 0. If every type of field differs, the two characters have the lowest similarity and the edit distance between them is 11. The edit distance between a pre-replacement character and a common character therefore lies between 0 and 11.
  • the edit distance between the phonetic-shape codes of the pre-replacement character and the common character is calculated to compare how similar the two characters are in shape and sound.
  • the similarity is higher.
  • the common character can be used as a wrong-character candidate for the pre-replacement character.
  • by calculating in turn the edit distance between the pre-replacement character and each common character, all wrong-character candidates of the pre-replacement character can be obtained, which improves the similarity between the pre-replacement character and its wrong-character candidates and thereby increases the practicability of the subsequently generated correct-error parallel corpus.
  • the wrong word candidates corresponding to "Gong Xi Fa Cai” include “Gong Xi”, “Combination”, and “Gong Xi”
  • the wrong word candidates corresponding to "Fa Cai” include " ⁇ ”
  • the wrong word candidates for "Gong Xi Fa Cai” include "Gong Xi Fa Cai”, “Song Xi Fa Cai”, “Gong Xi Fa Cai”, and "Gong Xi Fa Cai”.
  • the typo generation module 503 is used to construct the phonetic shape code of the pre-replaced word in the preset common word dictionary; based on the phonetic shape code of the pre-replaced word, filter out the corresponding common words from the preset common word dictionary Typo candidate
  • the corpus generation module 507 is configured to replace the corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain the corresponding parallel corpus.
  • the correct corpus to be processed is obtained and segmented into words to determine error-prone pre-replacement characters and pre-replacement words; the phonetic-shape codes of the pre-replacement characters and of the common characters are constructed; based on these phonetic-shape codes, the wrong-character candidates corresponding to the pre-replacement characters are screened out from the common characters; the application scenario corresponding to the correct corpus is obtained and a plurality of scenarios parallel to it are selected; the dictionaries corresponding to the plurality of scenarios are crawled and similar-sounding words corresponding to the pre-replacement words are screened as wrong-word candidates; the occurrence frequencies of the wrong-character candidates and the wrong-word candidates in the expression sentences of the plurality of scenarios are counted; on this basis, the pre-replacement characters and pre-replacement words in the correct corpus are replaced to obtain the corresponding parallel corpus.
  • the correct-error parallel corpus for different application scenarios can be constructed flexibly, and the scene migration cost of the wrong word correction system based on the parallel training corpus can be reduced.
  • the calculation module 606 is configured to separately count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence;
  • the word segmentation module 602 includes: a segmentation unit 6021, a probability calculation unit 6022, and a screening unit 6023, wherein:
  • a shape code of the pre-replaced word in a preset common word dictionary includes a Chinese character structure code, a five-digit four-corner code, and the number of strokes;
  • the corresponding frequently-used word is a typo candidate of the pre-replaced word.
  • the wrong word generation module 604 is also used for:
  • the blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain, essentially a decentralized database, is a series of data blocks linked to one another by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Abstract

A parallel corpus generation method, apparatus and device, and a storage medium. The method comprises: obtaining a correct corpus to be processed (101) and performing word segmentation processing, and determining error-prone pre-replacement characters and pre-replacement words (102); constructing phonetic and morphological codes of the pre-replacement characters (103), and on the basis of the phonetic and morphological codes of the pre-replacement characters and common characters, screening out wrong character candidates corresponding to the pre-replacement characters from the common characters (104); obtaining an application scene corresponding to the correct corpus and selecting a plurality of scenes parallel to the application scene (105); crawling dictionaries corresponding to the plurality of scenes, and screening similar phonetic words corresponding to the pre-replacement words to serve as wrong word candidates (106); counting occurrence frequencies of wrong characters and wrong words of the wrong character candidates and the wrong word candidates in expression statements in the plurality of scenes (108); and on this basis, replacing the pre-replacement characters and the pre-replacement words in the correct corpus to obtain a corresponding parallel corpus (109), so as to flexibly construct correct-wrong parallel corpora of different application scenes. The solution also relates to a blockchain technology, and the correct corpus can be stored in a blockchain node.

Description

Parallel corpus generation method, apparatus, device and storage medium
This application claims priority to Chinese patent application No. 202010351250.5, filed with the China Patent Office on April 28, 2020 and entitled "Parallel corpus generation method, apparatus, device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of natural language processing, and in particular to a parallel corpus generation method, apparatus, device and storage medium.
Background
In recent years, with the rapid development of information processing technology and the Internet, traditional work such as recording, processing, organizing and summarizing text information has gradually been taken over by computers, and e-books, electronic office documents, electronic newspapers and online social platforms have gradually become an indispensable part of daily life. However, traditional text is produced by handwriting, whereas electronic text is produced indirectly through various input rules and input tools, which raises the input error rate. Moreover, the volume of electronic text data is generally huge, which makes text correction even more challenging.
Erroneous characters and words are common in today's electronic text. They are mainly corrected by a wrong-word correction model trained with a neural network, and training such a model requires a large amount of correct-error parallel corpus. The inventor realized that parallel corpora are currently obtained mainly by manual correction, which is time-consuming, laborious and costly; in addition, some error-prone characters and words are hard to correct, which degrades model performance. Furthermore, electronic text correction is closely tied to its application scenario: a parallel training corpus obtained in one application scenario does not necessarily perform well in another, so different application scenarios need their own correct-error parallel corpora, which greatly increases the scenario migration cost of a wrong-word correction system based on such a parallel training corpus.
Summary of the Invention
The main purpose of this application is to solve the problem that the prior art generates parallel training corpora inflexibly.
To achieve the above purpose, a first aspect of this application provides a parallel corpus generation method, comprising: obtaining a correct corpus to be processed; performing word segmentation on the correct corpus and determining error-prone pre-replacement characters and pre-replacement words; constructing the phonetic-shape code of each pre-replacement character in a preset common-character dictionary; based on the phonetic-shape code of the pre-replacement character, screening out the corresponding wrong-character candidates from the common characters of the preset common-character dictionary; obtaining the application scenario corresponding to the correct corpus and, based on the application scenario, selecting a plurality of scenarios parallel to it; crawling the dictionaries corresponding to the plurality of scenarios and screening similar-sounding words corresponding to the pre-replacement words as wrong-word candidates; obtaining expression sentences of the wrong-character candidates and the wrong-word candidates in the plurality of scenarios; separately counting the occurrence frequencies of the wrong-character candidates and the wrong-word candidates in the expression sentences; and, based on the occurrence frequencies, replacing the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus.
A second aspect of this application provides a parallel corpus generation device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps: obtaining a correct corpus to be processed; performing word segmentation on the correct corpus and determining error-prone pre-replacement characters and pre-replacement words; constructing the phonetic-shape code of each pre-replacement character in a preset common-character dictionary; based on the phonetic-shape code of the pre-replacement character, screening out the corresponding wrong-character candidates from the common characters of the preset common-character dictionary; obtaining the application scenario corresponding to the correct corpus and, based on the application scenario, selecting a plurality of scenarios parallel to it; crawling the dictionaries corresponding to the plurality of scenarios and screening similar-sounding words corresponding to the pre-replacement words as wrong-word candidates; obtaining expression sentences of the wrong-character candidates and the wrong-word candidates in the plurality of scenarios; separately counting the occurrence frequencies of the wrong-character candidates and the wrong-word candidates in the expression sentences; and, based on the occurrence frequencies, replacing the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus.
A third aspect of this application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps: obtaining a correct corpus to be processed; performing word segmentation on the correct corpus and determining error-prone pre-replacement characters and pre-replacement words; constructing the phonetic-shape code of each pre-replacement character in a preset common-character dictionary; based on the phonetic-shape code of the pre-replacement character, screening out the corresponding wrong-character candidates from the common characters of the preset common-character dictionary; obtaining the application scenario corresponding to the correct corpus and, based on the application scenario, selecting a plurality of scenarios parallel to it; crawling the dictionaries corresponding to the plurality of scenarios and screening similar-sounding words corresponding to the pre-replacement words as wrong-word candidates; obtaining expression sentences of the wrong-character candidates and the wrong-word candidates in the plurality of scenarios; separately counting the occurrence frequencies of the wrong-character candidates and the wrong-word candidates in the expression sentences; and, based on the occurrence frequencies, replacing the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus.
A fourth aspect of this application provides a parallel corpus generation apparatus, comprising: a corpus acquisition module, configured to obtain a correct corpus to be processed; a word segmentation module, configured to perform word segmentation on the correct corpus and determine error-prone pre-replacement characters and pre-replacement words; a wrong-character generation module, configured to construct the phonetic-shape code of each pre-replacement character in a preset common-character dictionary and, based on the phonetic-shape code of the pre-replacement character, screen out the corresponding wrong-character candidates from the common characters of the preset common-character dictionary; a wrong-word generation module, configured to obtain the application scenario corresponding to the correct corpus, select a plurality of scenarios parallel to it based on the application scenario, crawl the dictionaries corresponding to the plurality of scenarios, and screen similar-sounding words corresponding to the pre-replacement words as wrong-word candidates; a sentence acquisition module, configured to obtain expression sentences of the wrong-character candidates and the wrong-word candidates in the plurality of scenarios; a calculation module, configured to separately count the occurrence frequencies of the wrong-character candidates and the wrong-word candidates in the expression sentences; and a corpus generation module, configured to replace, based on the occurrence frequencies, the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus.
In the technical solution provided by this application, a correct corpus to be processed is obtained; word segmentation is performed on the correct corpus and error-prone pre-replacement characters and pre-replacement words are determined; the phonetic-shape code of each pre-replacement character in a preset common-character dictionary is constructed; based on the phonetic-shape code of the pre-replacement character and the phonetic-shape codes of the common characters, the wrong-character candidates corresponding to the pre-replacement character are screened out from the common characters; the application scenario corresponding to the correct corpus is obtained and a plurality of scenarios parallel to it are selected based on the application scenario; the dictionaries corresponding to the plurality of scenarios are crawled and similar-sounding words corresponding to the pre-replacement words are screened as wrong-word candidates; expression sentences of the wrong-character candidates and the wrong-word candidates in the plurality of scenarios are obtained; the occurrence frequencies of the wrong-character candidates and the wrong-word candidates in the expression sentences are counted separately; and, based on the occurrence frequencies, the corresponding pre-replacement characters and pre-replacement words in the correct corpus are replaced to obtain the corresponding parallel corpus.
In the embodiments of the invention, the correct corpus corresponding to an application scenario is obtained and segmented into words to obtain a random number of pre-replacement characters and pre-replacement words. The corresponding wrong-character candidates are screened out from the common-character dictionary according to the phonetic-shape code of each pre-replacement character, and the corresponding wrong-word candidates are obtained, according to the pronunciation of each pre-replacement word, from the dictionaries of a plurality of scenarios parallel to the application scenario. Finally, according to how frequently the wrong-character candidates and wrong-word candidates occur in the expression sentences of those scenarios, a preset number of wrong characters and wrong words are selected to replace the pre-replacement characters and pre-replacement words in the correct corpus, generating the corresponding parallel corpus. Correct-error parallel corpora for different application scenarios can thus be constructed flexibly, and the scenario migration cost of a wrong-word correction system based on the parallel training corpus is reduced.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of a first embodiment of the parallel corpus generation method in the embodiments of this application;
Fig. 2 is a schematic diagram of a second embodiment of the parallel corpus generation method in the embodiments of this application;
Fig. 3 is a schematic diagram of a third embodiment of the parallel corpus generation method in the embodiments of this application;
Fig. 4 is a schematic diagram of a fourth embodiment of the parallel corpus generation method in the embodiments of this application;
Fig. 5 is a schematic diagram of one embodiment of the parallel corpus generation apparatus in the embodiments of this application;
Fig. 6 is a schematic diagram of another embodiment of the parallel corpus generation apparatus in the embodiments of this application;
Fig. 7 is a schematic diagram of one embodiment of the parallel corpus generation device in the embodiments of this application.
Detailed Description
The embodiments of this application provide a parallel corpus generation method, apparatus, device and storage medium. A wrong-character candidate set for common characters and a wrong-word candidate set for common words are constructed in advance, and a wrong-word candidate set for specialized vocabulary is constructed flexibly from the specialized dictionary corresponding to the application scenario. The occurrence frequencies of common characters, common words and specialized words are counted using a general corpus and the given specialized corpus. Pre-replacement characters and/or pre-replacement words are then selected from the input specialized sentences and replaced with the corresponding wrong-character candidates and/or wrong-word candidates according to character frequency or word frequency, yielding the corresponding parallel sentences. In this way, correct-error parallel corpora for different application scenarios are constructed flexibly, and the scenario migration cost of a wrong-word correction system based on the parallel training corpus is reduced.
The terms "first", "second", "third", "fourth" and the like (if any) in the description, the claims and the above drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "including" and "having", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
For ease of understanding, the specific flow of the embodiments of this application is described below. Referring to Fig. 1, a first embodiment of the parallel corpus generation method in the embodiments of this application includes:
101. Obtain the correct corpus to be processed;
It can be understood that the execution subject of this application may be a parallel corpus generation apparatus, or a terminal or a server, which is not specifically limited here. The embodiments of this application are described with a server as the execution subject.
In this embodiment, building a wrong-word correction system requires a large amount of correct-error parallel corpus as training samples, and the requirements on the training samples differ from one application field to another. Here, the correct corpus to be processed is the correct half of the correct-error parallel corpus for the application field; after it is processed by the method of this application, the corresponding erroneous corpus can be generated in batches.
102. Perform word segmentation on the correct corpus, and determine error-prone pre-replacement characters and pre-replacement words;
In this embodiment, generating parallel sentences from a specialized sentence does not require considering every character and word in the sentence or generating every possible parallel sentence. It is enough to randomly select a few pre-replacement characters and/or pre-replacement words from the specialized sentence according to a preset probability distribution.
For example, the specialized sentence “小明去超市购买了氢化物” ("Xiao Ming went to the supermarket and bought hydride") segments into six tokens: “小明”, “去”, “超市”, “购买”, “了” and “氢化物”. Random selection may then yield the two tokens “购买” ("buy") and “氢化物” ("hydride") as the pre-replacement words.
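The random selection in step 102 can be sketched as weighted sampling without replacement over the segmented tokens. This is only an illustrative sketch: the function name `pick_pre_replacements`, the `weights` parameter (standing in for the "preset probability distribution") and the choice of `k` are assumptions, since the text does not specify how the distribution is defined.

```python
import random


def pick_pre_replacements(tokens, weights=None, k=2, seed=None):
    """Pick up to k distinct tokens as pre-replacement items.

    weights is a hypothetical per-token selection weight; if omitted,
    every token is equally likely to be chosen.
    """
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * len(tokens)
    if len(weights) != len(tokens):
        raise ValueError("weights must have one entry per token")

    pool = list(zip(tokens, weights))
    chosen = []
    for _ in range(min(k, len(pool))):
        total = sum(w for _, w in pool)
        if total <= 0:
            break  # only zero-weight tokens remain
        r = rng.random() * total  # r lies in [0, total)
        acc = 0.0
        for i, (tok, w) in enumerate(pool):
            acc += w
            if r < acc:
                chosen.append(tok)
                pool.pop(i)  # sample without replacement
                break
    return chosen


tokens = ["小明", "去", "超市", "购买", "了", "氢化物"]
# Give non-zero weight only to the two tokens used in the example above.
print(pick_pre_replacements(tokens, weights=[0, 0, 0, 1, 0, 1], k=2, seed=0))
```

With the weights shown, only “购买” and “氢化物” can be selected; with uniform weights any subset of the six tokens may come back.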
103、构造所述预替换字在预置常用字字典中的音形码;103. Construct the phonetic shape code of the pre-replaced word in the preset common word dictionary;
本实施例中,对于不同应用场景的正确-错误平行语料中,不仅包含该应用场景内的专用词汇,亦包含常用字与常用词汇,比如“我”、“把”、“快”等字,介词、量词、连词等常见类型的词。若从正确语料中选中预替换字,首先根据正确-错误平行语料应用场景的需求读取不同涵盖范围的常用字表,如《现代汉语常用字表》内一级字表的2500个常用字,若有需要则可增加其1000个次常用字表,根据不同地区,亦可选用《常用国字标准字体表》、《常用字字形表》等。In this embodiment, the correct-error parallel corpus for different application scenarios includes not only the dedicated vocabulary in the application scenario, but also common words and common vocabulary, such as "我", "巴", "快", etc. Common types of words such as prepositions, quantifiers, and conjunctions. If you select the pre-replaced words from the correct corpus, first read the common word lists of different coverages according to the requirements of the correct-error parallel corpus application scenarios, such as the 2500 common words in the first-level word list in the "Modern Chinese Common Word List". If necessary, the list of 1000 frequently used characters can be added. According to different regions, the "Standard Font Table of Commonly Used Chinese Characters" and "Table of Commonly Used Characters" and so on can also be selected.
而预替换字与常用字均有其特殊的字音与字形组合,通过对两者的字音与字形进行编码并进行比较,以确定两者的相似程度。其中,对预替换字拼音与常用字拼音的声母、韵母、韵母补码、声调进行数字编码,得到其常用字字音的四位数字编码;对预替换字与常用字的汉字结构、五个四角码、笔画数量进行编码,得到其常用字字形的7位数字编码;两者组合即可形成预替换字与常用字特有的11位音形码。Pre-replaced characters and commonly used characters have their special phonetic and glyph combinations. The phonetic and glyph of the two are coded and compared to determine the degree of similarity between the two. Among them, digital coding is performed on the initials, vowels, vowel complements, and tones of the pre-replaced word pinyin and common word pinyin to obtain the four-digit digital code of the common word phonetic; the Chinese character structure, five four corners of the pre-replaced word and common word The code and the number of strokes are coded to obtain the 7-bit digital code of the commonly used characters; the combination of the two can form the pre-replaced character and the unique 11-bit phonetic code of the commonly used characters.
若以A1至A26代表声母表中顺序的26个声母对应的编码字段;If A1 to A26 represent the coding fields corresponding to the sequential 26 initials in the initials table;
以B1至B39代表韵母表中顺序的39个韵母对应的编码字段;Let B1 to B39 represent the coding fields corresponding to the 39 vowels in the sequence in the vowel table;
以C1至C39代表韵母表中顺序的39个韵母对应的韵母补码对应的编码字段;Let C1 to C39 represent the coding fields corresponding to the complements of the vowels corresponding to the 39 vowels in the sequence in the vowel table;
以D1至D4代表声调一声至四声对应的编码字段;Let D1 to D4 represent the code fields corresponding to tones from one to four;
则“花”字的字音码编码信息A11B13C13D1。Then the phonetic code encoding information of the "flower" character is A11B13C13D1.
Let E1 to E7 denote the code fields for the left-right, top-bottom, left-middle-right, top-middle-bottom, semi-enclosed, fully enclosed, and inlaid structures of a common character, respectively;
let F0 to F9, G0 to G9, H0 to H9, J0 to J9, and K0 to K9 denote the code fields for the ten stroke-shape classes at the upper-left corner, upper-right corner, lower-left corner, lower-right corner, and supplementary position (附号) of a common character, respectively;
let Li (where i is the stroke count, a positive integer) denote the code field for the stroke count;
then the shape code of "花" is E2F4G4H2J1K4L7, and the complete sound-shape code of "花" is A11B13C13D1E2F4G4H2J1K4L7.
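As an illustration of how the eleven fields fit together, the following Python sketch assembles a sound-shape code from its parts. The class name `SoundShapeCode`, its attribute names, and the `encode` method are hypothetical and do not come from the application; the field values for "花" are the ones given above.

```python
from dataclasses import astuple, dataclass

# Field prefixes in code order; the letter "I" is skipped, as in the text.
PREFIXES = "ABCDEFGHJKL"


@dataclass(frozen=True)
class SoundShapeCode:
    """Hypothetical container for the 11 fields of a sound-shape code."""
    initial: int           # A: position in the initials table
    final: int             # B: position in the finals table
    final_complement: int  # C: final complement
    tone: int              # D: tone, 1 to 4
    structure: int         # E: character structure, 1 to 7
    upper_left: int        # F: stroke class at the upper-left corner
    upper_right: int       # G: stroke class at the upper-right corner
    lower_left: int        # H: stroke class at the lower-left corner
    lower_right: int       # J: stroke class at the lower-right corner
    supplement: int        # K: stroke class at the supplementary position
    strokes: int           # L: stroke count

    def encode(self) -> str:
        """Concatenate prefix and value for every field, in order."""
        return "".join(f"{p}{v}" for p, v in zip(PREFIXES, astuple(self)))


hua = SoundShapeCode(11, 13, 13, 1, 2, 4, 4, 2, 1, 4, 7)
print(hua.encode())  # A11B13C13D1E2F4G4H2J1K4L7
```

Keeping the fields as separate attributes rather than one string makes the later field-by-field comparison straightforward.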
104. Based on the sound-shape code of the pre-replacement character, select the corresponding wrong-character candidates from the common characters in a preset common-character dictionary;
In this embodiment, the sound-shape code of the pre-replacement character is compared with that of each common character. The more code fields of the same type that differ, the lower the similarity of the two characters; the fewer that differ, the higher the similarity. When the number of differing same-type fields is less than a preset number, the common character is a wrong-character candidate for the pre-replacement character.
For example, the sound-shape code of "花" is A11B13C13D1E2F4G4H2J1K4L7, that of "黄" is A11B34C34D2E4F4G4H8J0K6L12, and that of "华" is A11B13C13D2E2F2G4H4J0K1L8. The number of differing same-type fields is 8 between "花" and "黄", 6 between "花" and "华", and 7 between "黄" and "华" (fields B, C, E, F, H, K, and L). If the preset number is 7, only "花" and "华" are wrong-character candidates for each other, because theirs is the only count less than 7.
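The comparison in this example can be sketched in Python as follows. The function names `split_fields`, `mismatch_count`, and `typo_candidates` are hypothetical; the codes and the preset number 7 are taken from the example, and the three-character table stands in for a full common-character dictionary.

```python
import re


def split_fields(code):
    """Split a code such as 'A11B13...' into {'A': 11, 'B': 13, ...}."""
    return {letter: int(num) for letter, num in re.findall(r"([A-Z])(\d+)", code)}


def mismatch_count(code_a, code_b):
    """Count the field types whose values differ between two codes."""
    fields_a, fields_b = split_fields(code_a), split_fields(code_b)
    return sum(1 for key in fields_a if fields_a[key] != fields_b.get(key))


# Toy dictionary holding only the three characters from the example.
CODES = {
    "花": "A11B13C13D1E2F4G4H2J1K4L7",
    "黄": "A11B34C34D2E4F4G4H8J0K6L12",
    "华": "A11B13C13D2E2F2G4H4J0K1L8",
}
PRESET_NUMBER = 7


def typo_candidates(char):
    """Return the characters whose mismatch count is below the preset number."""
    return [
        other
        for other in CODES
        if other != char and mismatch_count(CODES[char], CODES[other]) < PRESET_NUMBER
    ]


print(mismatch_count(CODES["花"], CODES["黄"]))  # 8
print(mismatch_count(CODES["花"], CODES["华"]))  # 6
print(typo_candidates("花"))  # ['华']
```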
105. Obtain the application scenario corresponding to the correct corpus, and select multiple scenarios parallel to it based on that application scenario;
In this embodiment, generating a parallel erroneous corpus from the correct corpus requires considering not only the application scenario of the correct corpus but also several related scenarios. The simplest combination is the application scenario plus the general scenario; another is the application scenario plus application scenarios in neighboring fields plus the general scenario. For example, to obtain a correct-error parallel corpus for the medical field, besides everyday usage one may add neighboring fields such as nursing, biochemistry, physical chemistry, and life sciences, which makes the parallel corpus more reliable and comprehensive.
106. Crawl the dictionaries corresponding to the multiple scenarios, and select words that sound similar to the pre-replacement word as wrong-word candidates;
In this embodiment, a common-word dictionary is selected according to user needs and a specialized dictionary is selected according to the application scenario. Common words and specialized terms with similar pinyin are crawled from the two dictionaries and treated as wrong-word candidates for one another. A given word may appear in more than one wrong-word candidate group. Words with similar pinyin include homophones, words with the same syllables but different tones, and fuzzy-sound words.
For example, the wrong-word candidates for "氢化物" (hydride) may include the homophone "氰化物" (cyanide), the fuzzy-sound word "请法务", and the same-syllable, different-tone word "青花物".
107. Obtain expression sentences containing the wrong-character candidates and the wrong-word candidates in the multiple scenarios;
In this embodiment, the multiple scenarios may include the application scenario of the correct corpus, application scenarios in neighboring fields, and everyday scenarios. Sources of expression sentences for everyday scenarios include news articles, blog posts, electronic newspapers, general e-books, and encyclopedia entries. Sources for the application scenario of the correct corpus and for neighboring fields include scientific papers, journals, and technical books in those fields.
108. Count the frequency with which each wrong-character candidate and each wrong-word candidate occurs in the expression sentences;
In this embodiment, the expression sentences are a mixture of general and domain-specific sentences. Candidates that occur rarely in this mixture have little practical value and are generally not considered; an erroneous parallel corpus is useful in the field's application scenario only when it is generated from candidates that occur frequently. The occurrence frequency of each wrong-character and wrong-word candidate in the mixed expression sentences is therefore counted first.
For example, suppose the wrong-word candidates for "氰化物" (cyanide) are "氢化物" (hydride), "请法务", and "青花物", and the application scenario is biochemistry. The frequency of "氢化物" is clearly the highest: if its frequency is 280 while those of "请法务" and "青花物" are 3 and 43, then "氢化物" is the preferred wrong-word candidate for replacing "氰化物", followed by "青花物". "请法务" occurs so rarely in biochemistry that it has no value as a replacement for "氰化物" when generating the erroneous parallel corpus, so it is not considered.
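A minimal Python sketch of the counting and filtering described above follows. The function names and the cut-off `min_freq=10` are assumptions made for illustration; the text only states that rarely occurring candidates are not considered. The frequencies 280, 3, and 43 are those of the example.

```python
def count_frequencies(candidates, sentences):
    """Count how often each candidate occurs across the expression sentences."""
    return {c: sum(s.count(c) for s in sentences) for c in candidates}


def rank_candidates(freqs, min_freq):
    """Drop candidates below min_freq and sort the rest by descending frequency."""
    kept = [(word, freq) for word, freq in freqs.items() if freq >= min_freq]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in kept]


# Frequencies from the example; min_freq is a hypothetical cut-off.
example_freqs = {"氢化物": 280, "请法务": 3, "青花物": 43}
print(rank_candidates(example_freqs, min_freq=10))  # ['氢化物', '青花物']
```

In practice `count_frequencies` would be run over the expression sentences collected for the multiple scenarios, and its result passed to `rank_candidates`.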
109. Based on the occurrence frequencies, replace the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus.
In this embodiment, a certain number of wrong characters and wrong words are selected from the candidates in descending order of occurrence frequency, and each selected wrong character or wrong word replaces its corresponding pre-replacement character or word, one at a time, to obtain the corresponding parallel corpus. Here, the parallel corpus means the erroneous half of the correct-error parallel corpus.
For example, from the domain-specific sentence "小明去超市购买了氢化物" ("Xiao Ming went to the supermarket and bought hydride"), "购买" and "氢化物" are randomly selected as pre-replacement words. The most frequent wrong-word candidates for "购买" are "够买" and "购满", and the most frequent candidate for "氢化物" is "氰化物". The generated parallel sentences are then "小明去超市购买了氰化物", "小明去超市够买了氢化物", and "小明去超市购满了氢化物".
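The replacement step in this example might look like the sketch below, which keeps the top `k` candidates per token by frequency and substitutes them one at a time. The function names, the value `k=2`, and all frequency values are illustrative assumptions; only the sentence and the candidate words come from the example ("勾买" is an extra made-up low-frequency candidate that shows the cut).

```python
def top_k(freqs, k):
    """Return the k most frequent candidates, highest frequency first."""
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:k]]


def generate_error_sentences(tokens, candidate_freqs, k):
    """Build one erroneous sentence per (token, selected candidate) pair."""
    sentences = []
    for i, token in enumerate(tokens):
        for wrong in top_k(candidate_freqs.get(token, {}), k):
            sentences.append("".join(tokens[:i] + [wrong] + tokens[i + 1:]))
    return sentences


tokens = ["小明", "去", "超市", "购买", "了", "氢化物"]
# Hypothetical frequencies, chosen only to demonstrate the ranking.
candidate_freqs = {
    "购买": {"够买": 12, "购满": 9, "勾买": 1},
    "氢化物": {"氰化物": 50},
}
errors = generate_error_sentences(tokens, candidate_freqs, k=2)
for sentence in errors:
    print(sentence)
```

Each output sentence differs from the correct sentence in exactly one token, so every (correct, erroneous) pair can be used directly as a training example.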
In this embodiment of the application, the correct corpus to be processed is obtained and segmented, and the error-prone pre-replacement characters and words are determined; the sound-shape code of each pre-replacement character is constructed; based on the sound-shape codes of the pre-replacement character and of the common characters, wrong-character candidates for the pre-replacement character are selected from the common characters; the application scenario of the correct corpus is obtained and multiple parallel scenarios are selected; the dictionaries for those scenarios are crawled and similar-sounding words are selected as wrong-word candidates for the pre-replacement word; the frequency with which the wrong-character and wrong-word candidates occur in expression sentences from those scenarios is counted; and the pre-replacement characters and words in the correct corpus are replaced accordingly to obtain the corresponding parallel corpus. This makes it possible to build correct-error parallel corpora flexibly for different application scenarios and lowers the cost of migrating a wrong-word correction system trained on such a corpus to a new scenario.
Referring to FIG. 2, a second embodiment of the parallel corpus generation method in the embodiments of this application includes:
201. Obtain the correct corpus to be processed;
202. Segment the correct corpus to obtain multiple single characters and words;
In this embodiment, the correct corpus is split into single characters and/or words, so that parallel sentences are generated at the character and word level.
For example, "小明去超市购买了氢化物和甲醛" ("Xiao Ming went to the supermarket and bought hydride and formaldehyde") can be split into "小明", "去", "超市", "购买", "了", "氢化物", "和", and "甲醛": three single characters and five words.
203. Based on a preset probability distribution over the number of wrong words, determine the distribution of the number of single characters and words that may be mistyped;
In this embodiment, the maximum number of wrong words that can occur in a domain-specific sentence is first determined from the maximum wrong-word ratio set by the user; any count above this maximum has probability zero. Then, using the developer-preset upper limit of the probability distribution over wrong-word counts, the number of wrong words that may occur in each domain-specific sentence in a real application scenario, and the probability of each count, are simulated. The probability of a given count is inversely related to the count; the relationship is given by:
Figure PCTCN2021078059-appb-000001
where Pi is the probability that the number of wrong words in a domain-specific sentence is i, i is a positive integer, Pmax is the upper limit of the probability distribution over wrong-word counts, and n is the maximum number of wrong words that can occur in the sentence.
For example, if domain-specific sentence A can be split into eight single characters and/or words and the maximum wrong-word ratio is 50%, the maximum number of wrong words is 4. If the developer-preset upper limit of the probability distribution is 40%, the probabilities that sentence A contains 1, 2, 3, or 4 wrong words are 40%, 30%, 20%, and 10%, respectively.
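The formula itself survives here only as a figure reference. One linear form that reproduces the numbers in this example is Pi = Pmax × (n − i + 1) / n; the Python sketch below uses that form as an assumption inferred from the values 40%, 30%, 20%, and 10%, not as a transcription of the figure. Rounding the maximum count down with `math.floor` and all function names are likewise assumptions.

```python
import math
import random


def max_wrong_words(num_tokens, max_ratio):
    """Largest wrong-word count allowed by the user-set ratio (rounded down)."""
    return math.floor(num_tokens * max_ratio)


def wrong_word_distribution(n, p_max):
    """Assumed linear form: P_i = p_max * (n - i + 1) / n for i = 1..n."""
    return {i: p_max * (n - i + 1) / n for i in range(1, n + 1)}


def sample_wrong_word_count(dist, rng):
    """Draw one wrong-word count with probability proportional to dist."""
    counts = list(dist)
    weights = [dist[c] for c in counts]
    return rng.choices(counts, weights=weights, k=1)[0]


n = max_wrong_words(8, 0.5)
dist = wrong_word_distribution(n, 0.4)
print(n)  # 4
print({i: round(p, 2) for i, p in dist.items()})  # {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}
print(sample_wrong_word_count(dist, random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which is convenient when regenerating a corpus with the same parameters.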
204. Based on the distribution of the number of single characters and words that may be mistyped, randomly select that number of single characters and words as pre-replacement characters and pre-replacement words;
In this embodiment, to simulate the mistyped characters and/or words that can occur in practice, a wrong-word count is drawn at random from the distribution obtained above, that many single characters and/or words are selected, and they are replaced with the corresponding wrong-character and/or wrong-word candidates.
For example, for "小明去超市购买了氢化物和甲醛", if the count drawn from the distribution is 1, one single character or word is picked at random from the sentence as the pre-replacement character or word; it may be "小明", "去", "超市", "购买", "了", "氢化物", "和", or "甲醛".
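The random pick in this example can be sketched as follows. The function name `pick_pre_replacements` is hypothetical, and the token list is the segmentation given earlier for the same sentence.

```python
import random


def pick_pre_replacements(tokens, count, rng):
    """Pick `count` distinct token positions at random, returned in sentence order."""
    positions = sorted(rng.sample(range(len(tokens)), count))
    return [(pos, tokens[pos]) for pos in positions]


tokens = ["小明", "去", "超市", "购买", "了", "氢化物", "和", "甲醛"]
picked = pick_pre_replacements(tokens, 1, random.Random(42))
print(picked)  # one (position, token) pair
```

Sampling positions rather than token strings keeps the choice unambiguous when the same character or word occurs more than once in a sentence.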
205. Construct the sound-shape code of the pre-replacement character in a preset common-character dictionary;
206. Based on the sound-shape code of the pre-replacement character, select the corresponding wrong-character candidates from the common characters in the preset common-character dictionary;
207. Obtain the application scenario corresponding to the correct corpus, and select multiple scenarios parallel to it based on that application scenario;
208. Crawl the dictionaries corresponding to the multiple scenarios, and select words that sound similar to the pre-replacement word as wrong-word candidates;
209. Obtain expression sentences containing the wrong-character candidates and the wrong-word candidates in the multiple scenarios;
210. Count the frequency with which each wrong-character candidate and each wrong-word candidate occurs in the expression sentences;
211. Based on the occurrence frequencies, replace the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus.
In this embodiment of the application, domain-specific sentences related to the application scenario are obtained and split into single characters and words, and a preset probability distribution over wrong-word counts is used to randomly select a certain number of them as pre-replacement characters and words. This simulates the probability with which characters and words are mistyped in real-world typing, so the generated errors better fit the forms in which wrong characters and wrong words actually appear. The number of wrong characters and words in each generated erroneous sentence can also be controlled through parameters, which makes the generation algorithm more controllable.
Referring to FIG. 3, a third embodiment of the parallel corpus generation method in the embodiments of this application includes:
301. Obtain the correct corpus to be processed;
302. Segment the correct corpus and determine error-prone pre-replacement characters and pre-replacement words;
303. Construct the sound-shape code of the pre-replacement character in a preset common-character dictionary;
304. Compare the same-type code fields of the sound-shape code of the pre-replacement character with those of the common characters in the preset common-character dictionary;
In this embodiment, the sound-shape codes of the pre-replacement character and the common characters are 11-field numeric codes consisting of an initial field, a final field, a final-complement field, a tone field, a character-structure field, five four-corner fields, and a stroke-count field. The similarity of two characters is quantified by counting the fields in which their sound-shape codes differ.
305. Based on the comparison result, count the number of same-type code fields that differ between the pre-replacement character and the common character;
306. Based on the count, determine the edit distance between the pre-replacement character and the common character;
In this embodiment, the sound-shape code has eleven types of code field. For each field type in which the codes of the pre-replacement character and the common character differ, the edit distance increases by 1; otherwise it is unchanged. If all fields match, the two characters have the highest similarity and their edit distance is 0; if no fields match, they have the lowest similarity and their edit distance is 11. The edit distance between a pre-replacement character and a common character therefore lies between 0 and 11.
307. If the edit distance between the pre-replacement character and the common character is less than a preset edit distance, the common character is a wrong-character candidate for the pre-replacement character;
In this embodiment, the edit distance quantifies the similarity between the pre-replacement character and a common character: the smaller the distance, the higher the similarity. The user can therefore set, as the preset edit distance, the maximum distance at which a common character still qualifies as a wrong-character candidate. Common characters whose distance is less than the preset edit distance become wrong-character candidates for the pre-replacement character; those with lower similarity, that is, larger distance, are filtered out.
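Because each field contributes either 0 or 1, the edit distance described here is a field-wise mismatch count over eleven positions (a Hamming distance). A minimal Python sketch of steps 304 to 307 under that reading follows; the function names are hypothetical, and the field tuples and the preset distance 7 are borrowed from the "花", "黄", "华" example in the first embodiment.

```python
def edit_distance(code_a, code_b):
    """Add 1 for every one of the 11 field positions whose values differ."""
    if len(code_a) != 11 or len(code_b) != 11:
        raise ValueError("a sound-shape code must have exactly 11 fields")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def ranked_candidates(char, table, preset_distance):
    """Return (character, distance) pairs below the preset distance, closest first."""
    scored = [
        (other, edit_distance(table[char], code))
        for other, code in table.items()
        if other != char
    ]
    return sorted(
        (pair for pair in scored if pair[1] < preset_distance),
        key=lambda pair: pair[1],
    )


# Fields in order A, B, C, D, E, F, G, H, J, K, L.
TABLE = {
    "花": (11, 13, 13, 1, 2, 4, 4, 2, 1, 4, 7),
    "黄": (11, 34, 34, 2, 4, 4, 4, 8, 0, 6, 12),
    "华": (11, 13, 13, 2, 2, 2, 4, 4, 0, 1, 8),
}
print(ranked_candidates("花", TABLE, preset_distance=7))  # [('华', 6)]
```

Returning the distance alongside each candidate lets a later stage prefer closer characters when several pass the threshold.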
308. Obtain the application scenario corresponding to the correct corpus, and select multiple scenarios parallel to it based on that application scenario;
309. Crawl the dictionaries corresponding to the multiple scenarios, and select words that sound similar to the pre-replacement word as wrong-word candidates;
310. Obtain expression sentences containing the wrong-character candidates and the wrong-word candidates in the multiple scenarios;
311. Count the frequency with which each wrong-character candidate and each wrong-word candidate occurs in the expression sentences;
312. Based on the occurrence frequencies, replace the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus.
In this embodiment of the application, the edit distance between the sound-shape codes of the pre-replacement character and a common character is computed to compare the two characters in shape and pronunciation. The smaller the edit distance, the higher the similarity; if the distance is less than the preset edit distance, the common character becomes a wrong-character candidate for the pre-replacement character. Computing the edit distance between the pre-replacement character and every common character in turn yields all of its wrong-character candidates. This raises the similarity between the pre-replacement character and its candidates and makes the correct-error parallel corpus generated afterwards more practical.
Referring to FIG. 4, a fourth embodiment of the parallel corpus generation method in the embodiments of this application includes:
401. Obtain the correct corpus to be processed;
402. Segment the correct corpus and determine error-prone pre-replacement characters and pre-replacement words;
403. Construct the sound-shape code of the pre-replacement character in a preset common-character dictionary;
404. Based on the sound-shape code of the pre-replacement character, select the corresponding wrong-character candidates from the common characters in the preset common-character dictionary;
405. Obtain the application scenario corresponding to the correct corpus, and select multiple scenarios parallel to it based on that application scenario;
406. Crawl the dictionaries corresponding to the multiple scenarios, and select words that sound similar to the pre-replacement word as wrong-word candidates;
407. Determine whether the pre-replacement word contains multiple sub-words;
In this embodiment, a pre-replacement word made up of several sub-words may be hard to match to any wrong-word candidate as a whole, while its sub-words may still be mistyped in real application scenarios. It is therefore necessary to first determine whether the pre-replacement word contains multiple sub-words.
408. If the pre-replacement word contains multiple sub-words, segment the pre-replacement word to obtain the sub-words;
In this embodiment, to preserve the value of the correct-error parallel corpus in real application scenarios, wrong-word candidates are not selected for the multi-sub-word pre-replacement word as a whole. Instead, the word is split into sub-words, wrong-word candidates are selected for each sub-word, and the sub-word candidates are combined to form the wrong-word candidates for the pre-replacement word.
For example, "恭喜发财" contains the two sub-words "恭喜" and "发财". In real application scenarios each sub-word has many similar-sounding forms that are commonly mistyped. Crawling the dictionary for similar-sounding words using "恭喜发财" directly would miss many of the input errors that can occur in practice, so sub-word wrong-word candidates are selected for "恭喜" and "发财" separately.
409. Based on the multiple sub-words, select the wrong-word candidates for each sub-word and substitute them one at a time to generate the wrong-word candidates for the pre-replacement word;
In this embodiment, selecting candidates for each sub-word gives a more complete list of the ways the pre-replacement word may be mistyped in real application scenarios.
For example, when selecting wrong-word candidates for "恭喜发财", the candidates for the sub-word "恭喜" include "工细", "共栖", and "龚袭", and the candidates for "发财" include "发菜". The wrong-word candidates for "恭喜发财" then include "工细发财", "共栖发财", "龚袭发财", and "恭喜发菜".
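The combination in this example replaces one sub-word at a time while leaving the others intact. A short Python sketch of that reading follows; the function name `combine_subword_candidates` is hypothetical, and the sub-words and candidates are those of the example.

```python
def combine_subword_candidates(subwords, sub_candidates):
    """Substitute each sub-word candidate in turn, keeping the other sub-words."""
    combined = []
    for i, subword in enumerate(subwords):
        for wrong in sub_candidates.get(subword, []):
            combined.append("".join(subwords[:i] + [wrong] + subwords[i + 1:]))
    return combined


subwords = ["恭喜", "发财"]
sub_candidates = {
    "恭喜": ["工细", "共栖", "龚袭"],
    "发财": ["发菜"],
}
print(combine_subword_candidates(subwords, sub_candidates))
# ['工细发财', '共栖发财', '龚袭发财', '恭喜发菜']
```

Replacing several sub-words at once (for instance "工细发菜") would be a different design; the example in the text lists only single-substitution results, so the sketch stops there.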
410. Obtain expression sentences containing the wrong-character candidates and the wrong-word candidates in the multiple scenarios;
411. Count the frequency with which each wrong-character candidate and each wrong-word candidate occurs in the expression sentences;
412. Based on the occurrence frequencies, replace the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus.
In this embodiment of the application, some pre-replacement words in compound form may, as a whole, occur rarely or be mistyped rarely in the field's application scenario, while the sub-words that make them up occur frequently. In that case the word is split into sub-words, wrong-word candidates are selected for each sub-word, and these are combined to form the wrong-word candidates for the pre-replacement word. The resulting correct-error parallel corpus has broader coverage, is more practical, and fits real application scenarios more closely.
The parallel corpus generation method in the embodiments of this application is described above; the parallel corpus generation apparatus is described below. Referring to FIG. 5, one embodiment of the parallel corpus generation apparatus in the embodiments of this application includes:
a corpus acquisition module 501, configured to obtain the correct corpus to be processed;
a word segmentation module 502, configured to segment the correct corpus and determine error-prone pre-replacement characters and pre-replacement words;
a wrong-character generation module 503, configured to construct the sound-shape code of the pre-replacement character in a preset common-character dictionary, and, based on that sound-shape code, to select the corresponding wrong-character candidates from the common characters in the preset common-character dictionary;
a wrong-word generation module 504, configured to obtain the application scenario corresponding to the correct corpus and select multiple scenarios parallel to it based on that application scenario, and to crawl the dictionaries corresponding to the multiple scenarios and select words that sound similar to the pre-replacement word as wrong-word candidates;
a sentence acquisition module 505, configured to obtain expression sentences containing the wrong-character candidates and the wrong-word candidates in the multiple scenarios;
a calculation module 506, configured to count the frequency with which each wrong-character candidate and each wrong-word candidate occurs in the expression sentences;
a corpus generation module 507, configured to replace, based on the occurrence frequencies, the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus.
In this embodiment of the application, the correct corpus to be processed is obtained and segmented, and the error-prone pre-replacement characters and words are determined; the sound-shape codes of the pre-replacement character and of the common characters are constructed; based on those sound-shape codes, wrong-character candidates for the pre-replacement character are selected from the common characters; the application scenario of the correct corpus is obtained and multiple parallel scenarios are selected; the dictionaries for those scenarios are crawled and similar-sounding words are selected as wrong-word candidates for the pre-replacement word; the frequency with which the wrong-character and wrong-word candidates occur in expression sentences from those scenarios is counted; and the pre-replacement characters and words in the correct corpus are replaced accordingly to obtain the corresponding parallel corpus. This makes it possible to build correct-error parallel corpora flexibly for different application scenarios and lowers the cost of migrating a wrong-word correction system trained on such a corpus to a new scenario.
Referring to FIG. 6, another embodiment of the parallel corpus generation apparatus in the embodiments of the present application includes:
The corpus acquisition module 601 is configured to acquire a correct corpus to be processed;
The word segmentation module 602 is configured to perform word segmentation on the correct corpus and determine error-prone pre-replacement characters and pre-replacement words;
The typo generation module 603 is configured to construct the sound-shape code of the pre-replacement character in a preset common-character dictionary, and to screen out, based on the sound-shape code of the pre-replacement character, the corresponding typo candidates from the common characters of the preset common-character dictionary;
The wrong-word generation module 604 is configured to acquire the application scenario corresponding to the correct corpus and select, based on the application scenario, a plurality of scenarios parallel to it, and to crawl the dictionaries corresponding to the plurality of scenarios and screen similar-sounding words corresponding to the pre-replacement word as wrong-word candidates;
The sentence acquisition module 605 is configured to acquire expression sentences of the typo candidates and the wrong-word candidates in the plurality of scenarios;
The calculation module 606 is configured to separately count the occurrence frequencies of the typo candidates and the wrong-word candidates in the expression sentences;
The corpus generation module 607 is configured to replace, based on the occurrence frequencies, the corresponding pre-replacement characters and pre-replacement words in the correct corpus, to obtain the corresponding parallel corpus.
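As a non-limiting illustration, the frequency counting attributed to the calculation module 606 might be sketched as follows. The function name `count_candidate_frequencies` and the sample sentences are hypothetical and are not taken from the application; plain substring counting is assumed here, whereas an actual embodiment may count over segmented tokens instead.

```python
from collections import Counter


def count_candidate_frequencies(candidates, sentences):
    """Count how often each candidate string occurs across all sentences.

    Assumption: occurrences are counted as non-overlapping substrings.
    """
    freq = Counter()
    for cand in candidates:
        freq[cand] = sum(sentence.count(cand) for sentence in sentences)
    return freq


# Hypothetical expression sentences gathered from parallel scenarios.
sentences = ["我想查询帐户余额", "帐户已经冻结", "请登录账户"]
freq = count_candidate_frequencies(["帐户", "帐护"], sentences)
```

The resulting counts can then be used to rank the candidates before replacement.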
Specifically, the word segmentation module 602 includes a segmentation unit 6021, a probability calculation unit 6022, and a screening unit 6023, wherein:
The segmentation unit 6021 is configured to perform word segmentation on the correct corpus to obtain a plurality of single characters and words;
The probability calculation unit 6022 is configured to determine, based on a preset probability distribution of the number of wrong words, the distribution of the number of single characters and words that may be entered incorrectly;
The screening unit 6023 is configured to randomly select, based on the distribution of the number of single characters and words that may be entered incorrectly, a number of the single characters and words as the pre-replacement characters and pre-replacement words.
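By way of example only, the selection performed by units 6022 and 6023 may be sketched as below. The sketch assumes that the corpus has already been segmented into tokens, and that the preset distribution is supplied as a mapping from an error count to its probability; the function name and the sample distribution are hypothetical.

```python
import random


def pick_pre_replacement_units(tokens, count_distribution, rng=None):
    """Draw an error count, then randomly pick that many tokens.

    count_distribution: assumed mapping {number_of_errors: probability}.
    Returns (chars, words) as lists of (position, token) pairs, where
    single-character tokens are pre-replacement characters and longer
    tokens are pre-replacement words.
    """
    if rng is None:
        rng = random.Random()
    counts = list(count_distribution.keys())
    weights = list(count_distribution.values())
    n_errors = rng.choices(counts, weights=weights, k=1)[0]
    n_errors = min(n_errors, len(tokens))  # cannot pick more than exist
    positions = sorted(rng.sample(range(len(tokens)), n_errors))
    chars = [(i, tokens[i]) for i in positions if len(tokens[i]) == 1]
    words = [(i, tokens[i]) for i in positions if len(tokens[i]) > 1]
    return chars, words


tokens = ["我", "想", "查询", "账户", "余额"]  # assumed segmentation result
chars, words = pick_pre_replacement_units(
    tokens, {1: 0.6, 2: 0.3, 3: 0.1}, random.Random(0)
)
```

Passing a seeded `random.Random` keeps the selection reproducible, which may be convenient when regenerating a corpus.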
Specifically, the typo generation module 603 includes a sound-shape code construction unit 6031, which is specifically configured to:
Construct, based on the pronunciation of the pre-replacement character, the sound code of the pre-replacement character in the preset common-character dictionary, the sound code comprising an initial, a final, a final complement code, and a tone;
Construct, based on the glyph of the pre-replacement character, the shape code of the pre-replacement character in the preset common-character dictionary, the shape code comprising a Chinese character structure code, a five-digit four-corner code, and a stroke count;
Determine, based on the sound code and the shape code, the sound-shape code of the pre-replacement character in the preset common-character dictionary.
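One possible in-memory layout for such a sound-shape code is sketched below, for illustration only. The class and function names are hypothetical, the final complement is assumed here to hold a medial (if any), and the four-corner value `"12345"` is a placeholder rather than verified dictionary data; a real embodiment would look these fields up in a pronunciation source and a four-corner dictionary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundShapeCode:
    # Sound code fields
    initial: str
    final: str
    final_complement: str
    tone: str
    # Shape code fields
    structure: str
    four_corner: str
    strokes: str

    def fields(self):
        """Return the seven fields in a fixed order for comparison."""
        return (self.initial, self.final, self.final_complement, self.tone,
                self.structure, self.four_corner, self.strokes)


def build_sound_shape_code(sound_parts, shape_parts):
    """Combine a sound code and a shape code into one sound-shape code."""
    initial, final, complement, tone = sound_parts
    structure, four_corner, strokes = shape_parts
    if len(four_corner) != 5 or not four_corner.isdigit():
        raise ValueError("four-corner code must be five digits")
    return SoundShapeCode(initial, final, complement, str(tone),
                          structure, four_corner, str(strokes))


# "LR" (left-right structure) and "12345" are illustrative placeholders.
code = build_sound_shape_code(("zh", "ang", "", 4), ("LR", "12345", 7))
```

Keeping the fields in a fixed order allows two codes to be compared field by field.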
Specifically, the typo generation module 603 further includes a typo screening unit 6032, which is specifically configured to:
Compare the coding fields of the same type in the sound-shape code of the pre-replacement character and in the sound-shape codes of the common characters in the preset common-character dictionary;
Count, based on the comparison result, the number of same-type coding fields that differ between the pre-replacement character and the common character;
Determine, based on the counting result, the edit distance between the pre-replacement character and the common character;
If the edit distance between the pre-replacement character and the common character is less than a preset edit distance, take the corresponding common character as a typo candidate of the pre-replacement character.
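The comparison described above can be sketched as follows, purely as an illustration. The edit distance is taken to be the number of differing same-type fields, as the text states; the function names are hypothetical, and the four-corner values in the sample table are placeholders, not verified dictionary data.

```python
def field_distance(code_a, code_b):
    """Count same-type fields that differ between two sound-shape codes."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same number of fields")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def typo_candidates(target, codes, max_distance):
    """Return common characters whose distance to target is below the limit."""
    target_code = codes[target]
    return [
        ch for ch, code in codes.items()
        if ch != target and field_distance(target_code, code) < max_distance
    ]


# Fields: initial, final, final complement, tone, structure,
# four-corner (placeholder values), stroke count.
codes = {
    "账": ("zh", "ang", "", "4", "LR", "11111", "8"),
    "帐": ("zh", "ang", "", "4", "LR", "22222", "7"),
    "涨": ("zh", "ang", "", "3", "LR", "33333", "10"),
    "好": ("h", "ao", "", "3", "LR", "44444", "6"),
}
candidates = typo_candidates("账", codes, max_distance=3)
```

With these sample codes, only characters differing from the target in fewer than three fields are retained.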
Specifically, the wrong-word generation module 604 is further configured to:
Determine whether the pre-replacement word contains a plurality of sub-words;
If the pre-replacement word contains a plurality of sub-words, perform word segmentation on the pre-replacement word to obtain the plurality of sub-words;
Based on the plurality of sub-words, screen the wrong-word candidates corresponding to each sub-word and substitute them one by one, to generate the wrong-word candidates corresponding to the pre-replacement word.
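A minimal sketch of the sub-word substitution follows, under the assumption that "one by one" means replacing a single sub-word at a time while the others stay unchanged. The function name and the sample candidates are hypothetical.

```python
def compound_candidates(sub_words, sub_word_candidates):
    """Build wrong-word candidates by replacing one sub-word at a time.

    sub_word_candidates: assumed mapping from a sub-word to its own
    wrong-word candidates (for example, similar-sounding words).
    """
    sub_words = list(sub_words)
    results = []
    for i, sub in enumerate(sub_words):
        for cand in sub_word_candidates.get(sub, []):
            # Replace only position i; keep the remaining sub-words intact.
            results.append("".join(sub_words[:i] + [cand] + sub_words[i + 1:]))
    return results


wrong_words = compound_candidates(
    ["账户", "余额"],
    {"账户": ["帐户"], "余额": ["余鹅"]},  # illustrative candidates only
)
```

Replacing one sub-word at a time keeps each generated candidate close to the original compound.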
Specifically, the parallel corpus generation apparatus further includes a splicing module 608, which is specifically configured to:
Determine whether the pre-replacement character in the correct corpus is followed by one single character or a plurality of single characters;
If the pre-replacement character is followed by one single character or a plurality of single characters, splice the pre-replacement character with the one single character or the plurality of single characters to obtain a spliced pre-replacement word;
Select, in descending order of word frequency, a preset number of wrong-word candidates corresponding to the spliced pre-replacement word;
Substitute each selected wrong-word candidate for the spliced pre-replacement word in turn, and generate the corresponding parallel corpus.
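The splicing step might look like the following sketch, which assumes the correct corpus is held as a list of segmented tokens and that the pre-replacement character is identified by its position. The function name is hypothetical.

```python
def splice_following_chars(tokens, index):
    """Join the token at index with the run of single characters after it.

    Returns the spliced pre-replacement word, or None when the next token
    is not a single character (or there is no next token).
    """
    end = index + 1
    while end < len(tokens) and len(tokens[end]) == 1:
        end += 1
    if end == index + 1:
        return None  # nothing to splice
    return "".join(tokens[index:end])


tokens = ["请", "登", "录", "账户"]  # assumed segmentation result
spliced = splice_following_chars(tokens, 1)
```

The spliced word can then be looked up for wrong-word candidates in the same way as any other pre-replacement word.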
Specifically, the corpus generation module 607 is specifically configured to:
Select, in descending order of word frequency, a preset number of typo candidates corresponding to the pre-replacement character, and select, in descending order of word frequency, a preset number of wrong-word candidates corresponding to the pre-replacement word;
Substitute each selected typo candidate and wrong-word candidate for the corresponding pre-replacement character and pre-replacement word in the correct corpus in turn, and generate the corresponding parallel corpus.
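For illustration, the ranking and substitution can be sketched as below. The sketch assumes the candidate frequencies have already been counted, replaces only the first occurrence of the target in the sentence, and emits (erroneous sentence, correct sentence) pairs; these choices, the function name, and the sample frequencies are assumptions rather than details given in the application.

```python
def generate_parallel_pairs(correct, target, candidate_freq, top_n):
    """Pair the correct sentence with top-N erroneous variants.

    candidate_freq: assumed mapping from a candidate to its occurrence
    frequency in the expression sentences.
    """
    ranked = sorted(candidate_freq.items(), key=lambda kv: kv[1], reverse=True)
    pairs = []
    for cand, _ in ranked[:top_n]:
        wrong = correct.replace(target, cand, 1)  # first occurrence only
        pairs.append((wrong, correct))
    return pairs


pairs = generate_parallel_pairs(
    "请登录账户", "账户",
    {"帐户": 5, "仗户": 1, "帐护": 3},  # illustrative frequencies
    top_n=2,
)
```

Each pair supplies one training example for a wrong-word correction model, with the erroneous sentence as input and the correct sentence as target.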
In this embodiment of the present application, the correct corpus corresponding to an application scenario is acquired and segmented to obtain a random number of pre-replacement characters and pre-replacement words; the corresponding typo candidates are screened out from the common-character dictionary according to the sound-shape codes of the pre-replacement characters; the corresponding wrong-word candidates are obtained, according to the pronunciation of the pre-replacement words, from the dictionaries corresponding to a plurality of scenarios parallel to the application scenario; finally, according to the frequencies with which the typo candidates and wrong-word candidates occur in the expression sentences corresponding to the plurality of scenarios, a preset number of typos and wrong words are selected to replace the pre-replacement characters and pre-replacement words in the correct corpus, so as to generate the corresponding parallel corpus. Correct-error parallel corpora can thus be constructed flexibly for different application scenarios, and the scenario migration cost of a wrong-word correction system trained on the parallel corpus is reduced.
FIG. 5 and FIG. 6 above describe the parallel corpus generation apparatus in the embodiments of the present application in detail from the perspective of modular functional entities. The parallel corpus generation device in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 7 is a schematic structural diagram of a parallel corpus generation device provided by an embodiment of the present application. The parallel corpus generation device 700 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 710 (for example, one or more processors), a memory 720, and one or more storage media 730 (for example, one or more mass storage devices) storing application programs 733 or data 732. The memory 720 and the storage medium 730 may provide transient or persistent storage. A program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the parallel corpus generation device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and to execute, on the parallel corpus generation device 700, the series of instruction operations in the storage medium 730.
The parallel corpus generation device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the structure shown in FIG. 7 does not limit the parallel corpus generation device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to execute the following steps:
Acquiring a correct corpus to be processed;
Performing word segmentation on the correct corpus, and determining error-prone pre-replacement characters and pre-replacement words;
Constructing the sound-shape code of the pre-replacement character in a preset common-character dictionary;
Screening out, based on the sound-shape code of the pre-replacement character, the corresponding typo candidates from the common characters of the preset common-character dictionary;
Acquiring the application scenario corresponding to the correct corpus, and selecting, based on the application scenario, a plurality of scenarios parallel to it;
Crawling the dictionaries corresponding to the plurality of scenarios, and screening similar-sounding words corresponding to the pre-replacement word as wrong-word candidates;
Acquiring expression sentences of the typo candidates and the wrong-word candidates in the plurality of scenarios;
Separately counting the occurrence frequencies of the typo candidates and the wrong-word candidates in the expression sentences;
Replacing, based on the occurrence frequencies, the corresponding pre-replacement characters and pre-replacement words in the correct corpus, to obtain the corresponding parallel corpus.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments, and are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A parallel corpus generation method, comprising:
    acquiring a correct corpus to be processed;
    performing word segmentation on the correct corpus, and determining error-prone pre-replacement characters and pre-replacement words;
    constructing the sound-shape code of the pre-replacement character in a preset common-character dictionary;
    screening out, based on the sound-shape code of the pre-replacement character, the corresponding typo candidates from the common characters of the preset common-character dictionary;
    acquiring the application scenario corresponding to the correct corpus, and selecting, based on the application scenario, a plurality of scenarios parallel to it;
    crawling the dictionaries corresponding to the plurality of scenarios, and screening similar-sounding words corresponding to the pre-replacement word as wrong-word candidates;
    acquiring expression sentences of the typo candidates and the wrong-word candidates in the plurality of scenarios;
    separately counting the occurrence frequencies of the typo candidates and the wrong-word candidates in the expression sentences;
    replacing, based on the occurrence frequencies, the corresponding pre-replacement characters and pre-replacement words in the correct corpus, to obtain the corresponding parallel corpus.
  2. The parallel corpus generation method according to claim 1, wherein the performing word segmentation on the correct corpus and determining error-prone pre-replacement characters and pre-replacement words comprises:
    performing word segmentation on the correct corpus to obtain a plurality of single characters and words;
    determining, based on a preset probability distribution of the number of wrong words, the distribution of the number of single characters and words that may be entered incorrectly;
    randomly selecting, based on the distribution of the number of single characters and words that may be entered incorrectly, a number of the single characters and words as the pre-replacement characters and pre-replacement words.
  3. The parallel corpus generation method according to claim 1, wherein the constructing the sound-shape code of the pre-replacement character in a preset common-character dictionary comprises:
    constructing, based on the pronunciation of the pre-replacement character, the sound code of the pre-replacement character in the preset common-character dictionary, the sound code comprising an initial, a final, a final complement code, and a tone;
    constructing, based on the glyph of the pre-replacement character, the shape code of the pre-replacement character in the preset common-character dictionary, the shape code comprising a Chinese character structure code, a five-digit four-corner code, and a stroke count;
    determining, based on the sound code and the shape code, the sound-shape code of the pre-replacement character in the preset common-character dictionary.
  4. The parallel corpus generation method according to claim 1 or 3, wherein the screening out, based on the sound-shape code of the pre-replacement character, the corresponding typo candidates from the common characters of the preset common-character dictionary comprises:
    comparing the coding fields of the same type in the sound-shape code of the pre-replacement character and in the sound-shape codes of the common characters in the preset common-character dictionary;
    counting, based on the comparison result, the number of same-type coding fields that differ between the pre-replacement character and the common character;
    determining, based on the counting result, the edit distance between the pre-replacement character and the common character;
    if the edit distance between the pre-replacement character and the common character is less than a preset edit distance, taking the corresponding common character as a typo candidate of the pre-replacement character.
  5. The parallel corpus generation method according to claim 1, wherein, after the crawling the dictionaries corresponding to the plurality of scenarios and screening similar-sounding words corresponding to the pre-replacement word as wrong-word candidates, the method further comprises:
    determining whether the pre-replacement word contains a plurality of sub-words;
    if the pre-replacement word contains a plurality of sub-words, performing word segmentation on the pre-replacement word to obtain the plurality of sub-words;
    based on the plurality of sub-words, screening the wrong-word candidates corresponding to each sub-word and substituting them one by one, to generate the wrong-word candidates corresponding to the pre-replacement word.
  6. The parallel corpus generation method according to claim 1, wherein, after the separately counting the occurrence frequencies of the typo candidates and the wrong-word candidates in the expression sentences, the method further comprises:
    determining whether the pre-replacement character in the correct corpus is followed by one single character or a plurality of single characters;
    if the pre-replacement character is followed by one single character or a plurality of single characters, splicing the pre-replacement character with the one single character or the plurality of single characters to obtain a spliced pre-replacement word;
    selecting, in descending order of word frequency, a preset number of wrong-word candidates corresponding to the spliced pre-replacement word;
    substituting each selected wrong-word candidate for the spliced pre-replacement word in turn, and generating the corresponding parallel corpus.
  7. The parallel corpus generation method according to claim 1 or 6, wherein the replacing, based on the occurrence frequencies, the corresponding pre-replacement characters and pre-replacement words in the correct corpus to obtain the corresponding parallel corpus comprises:
    selecting, in descending order of word frequency, a preset number of typo candidates corresponding to the pre-replacement character, and selecting, in descending order of word frequency, a preset number of wrong-word candidates corresponding to the pre-replacement word;
    substituting each selected typo candidate and wrong-word candidate for the corresponding pre-replacement character and pre-replacement word in the correct corpus in turn, and generating the corresponding parallel corpus.
  8. A parallel corpus generation device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring a correct corpus to be processed;
    performing word segmentation on the correct corpus, and determining error-prone pre-replacement characters and pre-replacement words;
    constructing the sound-shape code of the pre-replacement character in a preset common-character dictionary;
    screening out, based on the sound-shape code of the pre-replacement character, the corresponding typo candidates from the common characters of the preset common-character dictionary;
    acquiring the application scenario corresponding to the correct corpus, and selecting, based on the application scenario, a plurality of scenarios parallel to it;
    crawling the dictionaries corresponding to the plurality of scenarios, and screening similar-sounding words corresponding to the pre-replacement word as wrong-word candidates;
    acquiring expression sentences of the typo candidates and the wrong-word candidates in the plurality of scenarios;
    separately counting the occurrence frequencies of the typo candidates and the wrong-word candidates in the expression sentences;
    replacing, based on the occurrence frequencies, the corresponding pre-replacement characters and pre-replacement words in the correct corpus, to obtain the corresponding parallel corpus.
  9. The parallel corpus generation device according to claim 8, wherein the processor, when executing the computer program, further implements the following steps:
    performing word segmentation on the correct corpus to obtain a plurality of single characters and words;
    determining, based on a preset probability distribution of the number of wrong words, the distribution of the number of single characters and words that may be entered incorrectly;
    randomly selecting, based on the distribution of the number of single characters and words that may be entered incorrectly, a number of the single characters and words as the pre-replacement characters and pre-replacement words.
  10. The parallel corpus generation device according to claim 8, wherein the processor, when executing the computer program, further implements the following steps:
    constructing, based on the pronunciation of the pre-replacement character, the sound code of the pre-replacement character in the preset common-character dictionary, the sound code comprising an initial, a final, a final complement code, and a tone;
    constructing, based on the glyph of the pre-replacement character, the shape code of the pre-replacement character in the preset common-character dictionary, the shape code comprising a Chinese character structure code, a five-digit four-corner code, and a stroke count;
    determining, based on the sound code and the shape code, the sound-shape code of the pre-replacement character in the preset common-character dictionary.
  11. The parallel corpus generation device according to claim 8 or 10, wherein the processor, when executing the computer program, further implements the following steps:
    comparing the coding fields of the same type in the sound-shape code of the pre-replacement character and in the sound-shape codes of the common characters in the preset common-character dictionary;
    counting, based on the comparison result, the number of same-type coding fields that differ between the pre-replacement character and the common character;
    determining, based on the counting result, the edit distance between the pre-replacement character and the common character;
    if the edit distance between the pre-replacement character and the common character is less than a preset edit distance, taking the corresponding common character as a typo candidate of the pre-replacement character.
  12. The parallel corpus generation device according to claim 8, wherein the processor, when executing the computer program, further implements the following steps:
    determining whether the pre-replacement word contains a plurality of sub-words;
    if the pre-replacement word contains a plurality of sub-words, performing word segmentation on the pre-replacement word to obtain the plurality of sub-words;
    based on the plurality of sub-words, screening the wrong-word candidates corresponding to each sub-word and substituting them one by one, to generate the wrong-word candidates corresponding to the pre-replacement word.
  13. The parallel corpus generation device according to claim 8, wherein the processor, when executing the computer program, further implements the following steps:
    determining whether the pre-replacement character in the correct corpus is followed by one single character or a plurality of single characters;
    if the pre-replacement character is followed by one single character or a plurality of single characters, splicing the pre-replacement character with the one single character or the plurality of single characters to obtain a spliced pre-replacement word;
    selecting, in descending order of word frequency, a preset number of wrong-word candidates corresponding to the spliced pre-replacement word;
    substituting each selected wrong-word candidate for the spliced pre-replacement word in turn, and generating the corresponding parallel corpus.
  14. The parallel corpus generation device according to claim 8 or 13, wherein the processor, when executing the computer program, further implements the following steps:
    selecting, in descending order of word frequency, a preset number of typo candidates corresponding to the pre-replacement character, and selecting, in descending order of word frequency, a preset number of wrong-word candidates corresponding to the pre-replacement word;
    substituting each selected typo candidate and wrong-word candidate for the corresponding pre-replacement character and pre-replacement word in the correct corpus in turn, and generating the corresponding parallel corpus.
  15. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    acquiring a correct corpus to be processed;
    performing word segmentation on the correct corpus, and determining error-prone pre-replacement characters and pre-replacement words;
    constructing a sound-shape code of each pre-replacement character in a preset common-character dictionary;
    based on the sound-shape code of the pre-replacement character, selecting corresponding wrong-character candidates from the common characters in the preset common-character dictionary;
    acquiring an application scenario corresponding to the correct corpus, and selecting, based on the application scenario, multiple scenarios parallel to it;
    crawling dictionaries corresponding to the multiple scenarios, and selecting similar-sounding words corresponding to the pre-replacement words as wrong-word candidates;
    acquiring expression sentences in which the wrong-character candidates and the wrong-word candidates occur in the multiple scenarios;
    separately counting the occurrence frequencies of the wrong-character candidates and the wrong-word candidates in the expression sentences;
    replacing the corresponding pre-replacement characters and pre-replacement words in the correct corpus based on the occurrence frequencies, to obtain the corresponding parallel corpus.
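The frequency-counting step above could look roughly like the sketch below. It assumes the expression sentences are already available as plain strings and counts non-overlapping substring occurrences; the claim does not specify how occurrences are counted, and the function name is hypothetical.

```python
from collections import Counter


def candidate_frequencies(candidates, sentences):
    """Count how often each candidate string occurs across the sentences.

    Uses str.count, i.e. non-overlapping substring matches (an assumption).
    """
    counts = Counter()
    for sentence in sentences:
        for cand in candidates:
            counts[cand] += sentence.count(cand)
    return counts


if __name__ == "__main__":
    freqs = candidate_frequencies(["引航", "音行"], ["引航员上船", "他负责引航"])
    print(freqs.most_common())
```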
  16. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on the computer, cause the computer to further perform the following steps:
    performing word segmentation on the correct corpus to obtain multiple single characters and words;
    based on a preset probability distribution of the number of wrong words, determining the distribution of the number of single characters and words that may have been entered incorrectly;
    based on the distribution of the number of single characters and words that may have been entered incorrectly, randomly selecting a number of the single characters and words as the pre-replacement characters and pre-replacement words.
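One way to read claim 16 is: draw an error count from the preset distribution, then sample that many tokens at random. The sketch below follows that reading. The `count_dist` mapping (error count to probability), the function name, and the rule that one-character tokens count as characters and longer tokens as words are illustrative assumptions.

```python
import random


def pick_pre_replacements(tokens, count_dist, rng=None):
    """Sample tokens to corrupt from a segmented sentence.

    tokens:     list of segmented tokens (single characters or words)
    count_dist: hypothetical mapping {number_of_errors: probability}
    Returns (characters, words), each a list of (position, token) pairs.
    """
    rng = rng or random.Random()
    counts = list(count_dist.keys())
    weights = list(count_dist.values())
    # Draw how many tokens to corrupt, capped at the sentence length.
    k = min(rng.choices(counts, weights=weights, k=1)[0], len(tokens))
    positions = sorted(rng.sample(range(len(tokens)), k))
    chars = [(i, tokens[i]) for i in positions if len(tokens[i]) == 1]
    words = [(i, tokens[i]) for i in positions if len(tokens[i]) > 1]
    return chars, words


if __name__ == "__main__":
    tokens = ["我", "在", "银行", "办理", "业务"]
    print(pick_pre_replacements(tokens, {0: 0.2, 1: 0.5, 2: 0.3}, random.Random(0)))
```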
  17. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on the computer, cause the computer to further perform the following steps:
    based on the pronunciation of the pre-replacement character, constructing a sound code of the pre-replacement character in the preset common-character dictionary, the sound code comprising an initial, a final, a final complement code, and a tone;
    based on the glyph of the pre-replacement character, constructing a shape code of the pre-replacement character in the preset common-character dictionary, the shape code comprising a Chinese character structure code, a five-digit four-corner code, and a stroke count;
    based on the sound code and the shape code, determining the sound-shape code of the pre-replacement character in the preset common-character dictionary.
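The sound-shape code of claim 17 can be pictured as a fixed record of sound fields followed by shape fields. The sketch below is illustrative only: the field names, the string representation of each field, and the two lookup tables are assumptions, and the sample values in the usage lines are placeholders rather than real dictionary data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoundShapeCode:
    # Sound code fields
    initial: str
    final: str
    final_complement: str
    tone: str
    # Shape code fields
    structure: str
    four_corner: str  # five-digit four-corner code
    strokes: str

    def sound_code(self):
        return (self.initial, self.final, self.final_complement, self.tone)

    def shape_code(self):
        return (self.structure, self.four_corner, self.strokes)

    def fields(self):
        """Sound fields followed by shape fields, in a fixed order."""
        return self.sound_code() + self.shape_code()


def build_code(char, sound_table, shape_table):
    """Combine per-character sound and shape entries (hypothetical tables)."""
    return SoundShapeCode(*sound_table[char], *shape_table[char])


if __name__ == "__main__":
    # Placeholder entries, not real phonetic or four-corner data.
    sound_table = {"X": ("zh", "ang", "", "1")}
    shape_table = {"X": ("LR", "00000", "7")}
    print(build_code("X", sound_table, shape_table).fields())
```

Keeping every field in a fixed position makes it straightforward to compare two characters field by field.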
  18. The computer-readable storage medium according to claim 15 or 17, wherein the computer instructions, when run on the computer, cause the computer to further perform the following steps:
    comparing the code fields of the same type in the sound-shape code of the pre-replacement character and in the sound-shape codes of the common characters in the preset common-character dictionary;
    based on the comparison result, counting the number of same-type code fields that differ between the pre-replacement character and each common character;
    based on the counting result, determining the edit distance between the pre-replacement character and the common character;
    if the edit distance between the pre-replacement character and the common character is less than a preset edit distance, taking that common character as a wrong-character candidate for the pre-replacement character.
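Read literally, the distance in claim 18 is the number of same-type code fields that differ, which amounts to a Hamming-style distance over the fields. A minimal sketch under that reading follows. Codes are represented as equal-length tuples, the sample codes are placeholders, and excluding the character itself from its own candidates is an assumption.

```python
def field_distance(code_a, code_b):
    """Count positions at which two equal-length field tuples differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same number of fields")
    return sum(1 for a, b in zip(code_a, code_b) if a != b)


def typo_candidates(target, codes, max_distance):
    """Return characters whose code is strictly closer than max_distance.

    codes: hypothetical mapping {character: tuple_of_code_fields}
    """
    target_code = codes[target]
    return [
        ch
        for ch, code in codes.items()
        if ch != target and field_distance(target_code, code) < max_distance
    ]


if __name__ == "__main__":
    # Placeholder codes: (initial, final, complement, tone, structure, four-corner, strokes)
    codes = {
        "A": ("zh", "ang", "", "1", "LR", "00000", "7"),
        "B": ("zh", "ang", "", "4", "LR", "00000", "8"),
        "C": ("m", "a", "", "3", "S", "11111", "3"),
    }
    print(typo_candidates("A", codes, max_distance=3))  # prints ['B']
```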
  19. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on the computer, cause the computer to further perform the following steps:
    determining whether the pre-replacement word contains multiple sub-words;
    if the pre-replacement word contains multiple sub-words, segmenting the pre-replacement word to obtain the multiple sub-words;
    based on the multiple sub-words, selecting the wrong-word candidates corresponding to each sub-word and substituting them one by one, to generate the wrong-word candidates corresponding to the pre-replacement word.
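The sub-word substitution of claim 19 might be sketched as below. The sketch interprets "one by one" as replacing a single sub-word at a time while leaving the others unchanged, which is an assumption. The `candidates_for` mapping and the function name are hypothetical, and the segmentation into sub-words is taken as given.

```python
def compound_candidates(subwords, candidates_for):
    """Generate wrong-word candidates for a compound word.

    subwords:       list of sub-words, in order
    candidates_for: hypothetical mapping {sub_word: [wrong-word candidates]}
    Each result replaces exactly one sub-word and keeps the rest unchanged.
    """
    results = []
    for i, sub in enumerate(subwords):
        for cand in candidates_for.get(sub, []):
            results.append("".join(subwords[:i] + [cand] + subwords[i + 1:]))
    return results


if __name__ == "__main__":
    print(compound_candidates(["银行", "账户"], {"银行": ["引航"], "账户": ["帐户"]}))
```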
  20. A parallel corpus generation apparatus, wherein the parallel corpus generation apparatus comprises:
    a corpus acquisition module, configured to acquire a correct corpus to be processed;
    a word segmentation module, configured to perform word segmentation on the correct corpus and determine error-prone pre-replacement characters and pre-replacement words;
    a wrong-character generation module, configured to construct a sound-shape code of each pre-replacement character in a preset common-character dictionary, and, based on the sound-shape code of the pre-replacement character, select corresponding wrong-character candidates from the common characters in the preset common-character dictionary;
    a wrong-word generation module, configured to acquire an application scenario corresponding to the correct corpus and select, based on the application scenario, multiple scenarios parallel to it, and to crawl dictionaries corresponding to the multiple scenarios and select similar-sounding words corresponding to the pre-replacement words as wrong-word candidates;
    a sentence acquisition module, configured to acquire expression sentences in which the wrong-character candidates and the wrong-word candidates occur in the multiple scenarios;
    a calculation module, configured to separately count the occurrence frequencies of the wrong-character candidates and the wrong-word candidates in the expression sentences;
    a corpus generation module, configured to replace the corresponding pre-replacement characters and pre-replacement words in the correct corpus based on the occurrence frequencies, to obtain the corresponding parallel corpus.
PCT/CN2021/078059 2020-04-28 2021-02-26 Parallel corpus generation method, apparatus and device, and storage medium WO2021218329A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010351250.5A CN111639495A (en) 2020-04-28 2020-04-28 Parallel corpus generation method, device, equipment and storage medium
CN202010351250.5 2020-04-28

Publications (1)

Publication Number Publication Date
WO2021218329A1 (en) 2021-11-04

Family

ID=72329865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/078059 WO2021218329A1 (en) 2020-04-28 2021-02-26 Parallel corpus generation method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN111639495A (en)
WO (1) WO2021218329A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438650A (en) * 2022-11-08 2022-12-06 深圳擎盾信息科技有限公司 Contract text error correction method, system, equipment and medium fusing multi-source characteristics

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639495A (en) * 2020-04-28 2020-09-08 深圳壹账通智能科技有限公司 Parallel corpus generation method, device, equipment and storage medium
CN112329447B (en) * 2020-10-29 2024-03-26 语联网(武汉)信息技术有限公司 Training method of Chinese error correction model, chinese error correction method and device
CN112560452B (en) * 2021-02-25 2021-05-18 智者四海(北京)技术有限公司 Method and system for automatically generating error correction corpus
CN113204966B (en) * 2021-06-08 2023-03-28 重庆度小满优扬科技有限公司 Corpus augmentation method, apparatus, device and storage medium
CN113536776A (en) * 2021-06-22 2021-10-22 深圳价值在线信息科技股份有限公司 Confusion statement generation method, terminal device and computer-readable storage medium
CN113343674B (en) * 2021-07-09 2022-04-01 北京海泰方圆科技股份有限公司 Method, device, equipment and medium for generating text error correction model training corpus

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140298168A1 (en) * 2013-03-28 2014-10-02 Est Soft Corp. System and method for spelling correction of misspelled keyword
CN106598939A (en) * 2016-10-21 2017-04-26 北京三快在线科技有限公司 Method and device for text error correction, server and storage medium
CN106708893A (en) * 2015-11-17 2017-05-24 华为技术有限公司 Error correction method and device for search query term
CN107665190A (en) * 2017-09-29 2018-02-06 李晓妮 A kind of method for automatically constructing and device of text proofreading mistake dictionary
CN107729321A (en) * 2017-10-23 2018-02-23 上海百芝龙网络科技有限公司 A kind of method for correcting error of voice identification result
CN108959250A (en) * 2018-06-27 2018-12-07 众安信息技术服务有限公司 A kind of error correction method and its system based on language model and word feature
CN111639495A (en) * 2020-04-28 2020-09-08 深圳壹账通智能科技有限公司 Parallel corpus generation method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN MING;DU QING-ZHI;SHAO YU-BIN;LONG HUA: "Chinese Characters Similarity Comparison Algorithm Based on Phonetic Code and Shape Code", INFORMATION TECHNOLOGY, 20 November 2018 (2018-11-20), pages 73 - 75, XP055861762, ISSN: 1009-2552, DOI: 10.13274/j.cnki.hdzj.2018.11.016 *
WEIXIN_34258782: "Chinese String Similarity Algorithm Based on Phonetic Code (Rev)", 7 May 2018 (2018-05-07), XP055861759, Retrieved from the Internet <URL:CSDN blog-https://www.cnblogs.com/Dennis-mi/articles/9001600.html> *

Also Published As

Publication number Publication date
CN111639495A (en) 2020-09-08

Similar Documents

Publication Publication Date Title
WO2021218329A1 (en) Parallel corpus generation method, apparatus and device, and storage medium
CN110110041B (en) Wrong word correcting method, wrong word correcting device, computer device and storage medium
US11636264B2 (en) Stylistic text rewriting for a target author
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN106649783B (en) Synonym mining method and device
CN107704102B (en) Text input method and device
US20120262461A1 (en) System and Method for the Normalization of Text
CN110276071B (en) Text matching method and device, computer equipment and storage medium
CN104239289B (en) Syllabification method and syllabification equipment
Abbad et al. Multi-components system for automatic Arabic diacritization
Bawden et al. Automatic normalisation of early Modern French
de Silva et al. Singlish to sinhala transliteration using rule-based approach
CN112328621A (en) SQL conversion method and device, computer equipment and computer readable storage medium
Pirinen Weighted finite-state methods for spell-checking and correction
US20220229990A1 (en) System and method for lookup source segmentation scoring in a natural language understanding (nlu) framework
US20220229998A1 (en) Lookup source framework for a natural language understanding (nlu) framework
JP5106431B2 (en) Machine translation apparatus, program and method
WO2020170804A1 (en) Synonym extraction device, synonym extraction method, and synonym extraction program
Andrews Digital Techniques for Critical Edition
Wroblewska et al. Dependency parsing of Polish
RU2817524C1 (en) Method and system for generating text
CN115080603B (en) Database query language conversion method, device, equipment and storage medium
RU2796208C1 (en) Method and system for digital assistant text generation
KR102300457B1 (en) Electronic device that supports efficient typing practice by presenting words by level according to phoneme classification and operating method thereof
Akmuradov et al. Text Analyzing Algorithm for Speech Synthesizer of Uzbek Language

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21796416

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 20/02/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21796416

Country of ref document: EP

Kind code of ref document: A1