WO2021218329A1

WO2021218329A1 - Parallel corpus generation method, apparatus and device, and storage medium

Info

Publication number: WO2021218329A1
Application number: PCT/CN2021/078059
Authority: WO
Inventors: 邱煜
Original assignee: 深圳壹账通智能科技有限公司
Priority date: 2020-04-28
Filing date: 2021-02-26
Publication date: 2021-11-04
Also published as: CN111639495A

Abstract

A parallel corpus generation method, apparatus and device, and a storage medium. The method comprises: obtaining a correct corpus to be processed (101) and performing word segmentation processing, and determining error-prone pre-replacement characters and pre-replacement words (102); constructing phonetic and morphological codes of the pre-replacement characters (103), and on the basis of the phonetic and morphological codes of the pre-replacement characters and common characters, screening out wrong character candidates corresponding to the pre-replacement characters from the common characters (104); obtaining an application scene corresponding to the correct corpus and selecting a plurality of scenes parallel to the application scene (105); crawling dictionaries corresponding to the plurality of scenes, and screening similar phonetic words corresponding to the pre-replacement words to serve as wrong word candidates (106); counting occurrence frequencies of wrong characters and wrong words of the wrong character candidates and the wrong word candidates in expression statements in the plurality of scenes (108); and on this basis, replacing the pre-replacement characters and the pre-replacement words in the correct corpus to obtain a corresponding parallel corpus (109), so as to flexibly construct correct-wrong parallel corpora of different application scenes. The solution also relates to a blockchain technology, and the correct corpus can be stored in a blockchain node.

Description

Parallel corpus generation method, device, equipment and storage medium

This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on April 28, 2020, the application number is 202010351250.5, and the invention title is "Parallel corpus generation method, device, equipment and storage medium", the entire content of which is incorporated by reference In application.

Technical field

This application relates to the field of natural language processing, and in particular to a method, device, equipment and storage medium for generating parallel corpus.

Background technique

In recent years, with the rapid development of information processing technology and the Internet, traditional text information recording, processing, sorting, summarizing and other tasks have gradually been replaced by computers, and e-books, e-office documents, e-newspapers, and online social platforms have gradually become An essential part of our daily lives. However, traditional text information is obtained by handwriting, and electronic text information needs to be converted indirectly with the help of different input rules and input tools, which increases the error rate of electronic text information input, and generally the amount of electronic text data is huge, which makes text correction work. Brings greater challenges.

Nowadays, there are often wrong words in electronic text information. The error correction model is mainly trained through neural network, and when training the wrong word correction model, a large amount of correct-error parallel corpus is needed. The inventor realizes that nowadays, the manual correction method is mainly used to obtain parallel corpus, which is time-consuming and laborious, and requires a high correction cost. Moreover, some erroneous words are difficult to be corrected, which affects the performance of the model; and electronic text information correction and its application The scenarios are closely related. The parallel training corpus obtained in one application scenario may not show better results in other application scenarios. Therefore, different application scenarios need to be equipped with different correct-error parallel corpora, resulting in errors based on the parallel training corpus. The scene migration cost of the word correction system is greatly increased.

Summary of the invention

The main purpose of this application is to solve the problem of inflexibility in generating parallel training corpus in the prior art.

To achieve the above objective, the first aspect of the present application provides a parallel corpus generation method, including: obtaining the correct corpus to be processed; performing word segmentation processing on the correct corpus, and determining error-prone pre-replacement words and pre-replacement words; Construct the phonetic shape code of the pre-replaced word in the preset common word dictionary; based on the phonetic shape code of the pre-replaced word, filter out the corresponding typo candidates from the common words in the preset common word dictionary; obtain the correctness Select the application scenarios corresponding to the corpus, and select multiple parallel scenarios based on the application scenarios; crawl the dictionaries corresponding to the multiple scenarios, and screen the similar-sounding words corresponding to the pre-replaced words as the wrong word candidates; obtain the Wrong word candidates and expression sentences of the wrong word candidates in the multiple scenes; respectively count the occurrence frequencies of the wrong word candidates and the wrong word candidates in the expression sentence; replace the correct corpus based on the occurrence frequency Corresponding pre-replacement words and pre-replacement words to obtain corresponding parallel corpus.

The second aspect of the present application provides a parallel corpus generation device, including a memory, a processor, and computer-readable instructions stored in the memory and running on the processor, and the processor executes the computer When reading instructions, the following steps are implemented: obtain the correct corpus to be processed; perform word segmentation on the correct corpus, and determine the error-prone pre-replacement words and pre-replacement words; construct the pre-replacement words in the preset common word dictionary Phonetic shape code; based on the phonetic shape code of the pre-replaced word, filter out the corresponding typo candidates from the commonly used words in the preset common word dictionary; obtain the application scenario corresponding to the correct corpus, and select the corresponding typo based on the application scenario Parallel multiple scenes; crawl the dictionaries corresponding to the multiple scenes, and screen the similar-sounding words corresponding to the pre-replaced words as the wrong word candidates; obtain the wrong word candidates and the wrong word candidates in the multiple scenes Respectively count the occurrence frequency of the typo candidate and the wrong word candidate in the expression sentence; replace the corresponding pre-replacement word and pre-replacement word in the correct corpus based on the occurrence frequency to obtain the corresponding parallel Corpus.

The third aspect of the present application provides a computer-readable storage medium, the computer-readable storage medium stores computer instructions, and when the computer instructions run on the computer, the computer executes the following steps: obtain the correctness to be processed Corpus; perform word segmentation processing on the correct corpus, and determine the error-prone pre-replacement words and pre-replacement words; construct the phonetic shape code of the pre-replacement word in the preset common word dictionary; based on the phonetics of the pre-replacement word The shape code is used to screen out the corresponding typo candidates from the commonly used words in the preset common word dictionary; obtain the application scenario corresponding to the correct corpus, and select multiple scenarios parallel to the correct corpus based on the application scenario; crawl the multiple The dictionary corresponding to the scene, and screen the similar-sounding words corresponding to the pre-replacement word as the wrong word candidates; obtain the wrong word candidate and the expression sentences of the wrong word candidate in the multiple scenes; respectively count the wrong word candidates and The occurrence frequency of the wrong word candidate in the expression sentence; and the corresponding pre-replacement word and the pre-replacement word in the correct corpus are replaced based on the occurrence frequency to obtain a corresponding parallel corpus.

The fourth aspect of the present application provides a parallel corpus generation device, including: a corpus acquisition module for acquiring the correct corpus to be processed; a word segmentation module for segmenting the correct corpus and determining error-prone pre-replacement Words, pre-replaced words; typo generation module, used to construct the phonetic shape codes of the pre-replaced words in the preset common word dictionary; based on the phonetic shape codes of the pre-replaced words, from the common words of the preset common word dictionary The corresponding typo candidates are screened out; the wrong word generation module is used to obtain the application scenario corresponding to the correct corpus, and select multiple scenarios parallel to it based on the application scenario; crawl the dictionary corresponding to the multiple scenarios, Screening similar-sounding words corresponding to the pre-replacement word as the wrong word candidates; a sentence acquisition module for obtaining the wrong word candidates and the expression sentences of the wrong word candidates in the multiple scenes; a calculation module for respectively Count the occurrence frequency of the typos candidate and the typos candidate in the expression sentence; a corpus generation module is used to replace the corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain the corresponding parallel Corpus.

In the technical solution provided by this application, obtain the correct corpus to be processed; perform word segmentation on the correct corpus, and determine error-prone pre-replacement words and pre-replacement words; construct the pre-replacement words in a preset common word dictionary The phonetic shape code; based on the phonetic shape code of the pre-replaced word and the phonetic shape code of the commonly-used word, filter the typo candidates corresponding to the pre-replaced word from the commonly-used words; obtain the application corresponding to the correct corpus Scene, and select multiple parallel scenes based on the application scene; crawl the dictionaries corresponding to the multiple scenes, and screen the similar-sound words corresponding to the pre-replaced words as the wrong word candidates; obtain the wrong word candidates and the corresponding words The expression sentences of the wrong word candidates in the multiple scenes; respectively count the occurrence frequencies of the wrong word candidates and the wrong word candidates in the expression sentence; replace the corresponding pre-replacement in the correct corpus based on the occurrence frequency Words and pre-replaced words to get the corresponding parallel corpus. In the embodiment of the invention, the correct corpus corresponding to the application scenario is obtained and word segmentation is performed to obtain a random number of pre-replaced words and pre-replaced words; the corresponding typo candidates are selected from the dictionary of commonly used words according to the phoneme code of the pre-replaced words, and according to the prediction The pronunciation of the replacement word is obtained from the dictionary corresponding to multiple scenarios parallel to the application scenario, and the corresponding typo candidate is obtained. Finally, the preset is filtered according to the word frequency of the typo candidate and the wrong word candidate in the expression sentences corresponding to the multiple application scenarios A large number of typos and wrong words replace the pre-replaced words and pre-replaced words in the correct corpus to generate corresponding parallel predictions, realize the flexible construction of correct-error parallel corpora for different application scenarios, and reduce the wrong word correction system based on the parallel training corpus The scene is migrated into.

Description of the drawings

Fig. 1 is a schematic diagram of a first embodiment of a parallel corpus generation method in an embodiment of this application;

2 is a schematic diagram of a second embodiment of a parallel corpus generation method in an embodiment of this application;

3 is a schematic diagram of a third embodiment of a parallel corpus generation method in an embodiment of this application;

4 is a schematic diagram of a fourth embodiment of a parallel corpus generation method in an embodiment of this application;

FIG. 5 is a schematic diagram of an embodiment of a parallel corpus generating device in an embodiment of this application;

FIG. 6 is a schematic diagram of another embodiment of a parallel corpus generating device in an embodiment of this application;

Fig. 7 is a schematic diagram of an embodiment of a parallel corpus generating device in an embodiment of this application.

Detailed ways

The embodiments of this application provide a parallel corpus generation method, device, equipment, and storage medium. By pre-constructing the wrong word candidate set of common words and the wrong word candidate set of common words, the professional vocabulary can be flexibly constructed according to the professional dictionary corresponding to the application scenario. Candidate sets of wrong words between time; and use common corpus and given professional corpus to count the frequency of common words, common words, and professional words; then filter the pre-replaced words and/or pre-replaced words from the input professional sentences to correspond The wrong word candidates and/or wrong word candidates are replaced according to word frequency or word frequency to obtain corresponding parallel sentences, flexibly construct correct-error parallel corpora for different application scenarios, and reduce the scene migration cost of the wrong word correction system based on the parallel training corpus. .

The terms "first", "second", "third", "fourth", etc. (if any) in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects, without having to use To describe a specific order or sequence. It should be understood that the data used in this way can be interchanged under appropriate circumstances so that the embodiments described herein can be implemented in a sequence other than the content illustrated or described herein. In addition, the terms "including" or "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those clearly listed. Steps or units, but may include other steps or units that are not clearly listed or are inherent to these processes, methods, products, or equipment.

For ease of understanding, the following describes the specific process of the embodiment of the present application. Please refer to FIG. 1. The first embodiment of the parallel corpus generation method in the embodiment of the present application includes:

101. Obtain the correct corpus to be processed;

It can be understood that the execution subject of this application may be a parallel corpus generating device, or a terminal or a server, which is not specifically limited here. The embodiment of the present application takes the server as the execution subject as an example for description.

In this embodiment, when performing the wrong word correction system, a large number of correct-error parallel corpora needs to be used as training samples. According to different application fields, the requirements for training samples are also different. The correct corpus to be processed here refers to The correct corpus in the correct-error parallel corpus in the application field can be processed by the method of this application to generate the corresponding error corpus in batches.

102. Perform word segmentation processing on the correct corpus, and determine error-prone pre-replacement words and pre-replacement words;

In this embodiment, the corresponding parallel sentences are generated from the professional sentences without considering all the words and words in the professional sentences, and all possible parallel sentences are generated. It is only necessary to randomly select a number of professional sentences with a preset probability distribution. Pre-replacement words and/or pre-replacement words are sufficient.

For example, the professional sentence is "Xiao Ming went to the supermarket to buy hydride", there are a total of six words "Xiao Ming", "go", "supermarket", "buy", "le", and "hydride", which were randomly selected The two pre-replacement words are "purchase" and "hydride" as the pre-replacement words.

103. Construct the phonetic shape code of the pre-replaced word in the preset common word dictionary;

In this embodiment, the correct-error parallel corpus for different application scenarios includes not only the dedicated vocabulary in the application scenario, but also common words and common vocabulary, such as "我", "巴", "快", etc. Common types of words such as prepositions, quantifiers, and conjunctions. If you select the pre-replaced words from the correct corpus, first read the common word lists of different coverages according to the requirements of the correct-error parallel corpus application scenarios, such as the 2500 common words in the first-level word list in the "Modern Chinese Common Word List". If necessary, the list of 1000 frequently used characters can be added. According to different regions, the "Standard Font Table of Commonly Used Chinese Characters" and "Table of Commonly Used Characters" and so on can also be selected.

Pre-replaced characters and commonly used characters have their special phonetic and glyph combinations. The phonetic and glyph of the two are coded and compared to determine the degree of similarity between the two. Among them, digital coding is performed on the initials, vowels, vowel complements, and tones of the pre-replaced word pinyin and common word pinyin to obtain the four-digit digital code of the common word phonetic; the Chinese character structure, five four corners of the pre-replaced word and common word The code and the number of strokes are coded to obtain the 7-bit digital code of the commonly used characters; the combination of the two can form the pre-replaced character and the unique 11-bit phonetic code of the commonly used characters.

If A1 to A26 represent the coding fields corresponding to the sequential 26 initials in the initials table;

Let B1 to B39 represent the coding fields corresponding to the 39 vowels in the sequence in the vowel table;

Let C1 to C39 represent the coding fields corresponding to the complements of the vowels corresponding to the 39 vowels in the sequence in the vowel table;

Let D1 to D4 represent the code fields corresponding to tones from one to four;

Then the phonetic code encoding information of the "flower" character is A11B13C13D1.

If E1 to E7 respectively represent the coding fields corresponding to the left-right structure, upper-lower structure, left-middle-right structure, upper-middle-lower structure, half-enclosed structure, full-enclosed structure, and mosaic structure of commonly used characters;

F0 to F9, G0 to G9, H0 to H9, J0 to J9, K0 to K9 represent the coding fields corresponding to the ten types of pen shapes corresponding to the upper left corner, upper right corner, lower left corner, and lower right corner of common characters, respectively;

Let Li (i is the number of strokes and i is a positive integer) representing the code field corresponding to the number of strokes;

Then the font coding information of the "flower" character is E2F4G4H2J1K4L7, so the commonly used character coding information of the "flower" character is A11B13C13D1E2F4G4H2J1K4L7.

104. Based on the phoneme code of the pre-replaced word, filter out the corresponding typo candidate from the common words in the preset common word dictionary;

In this embodiment, the phonetic shape codes of the pre-replaced words and the commonly used characters are compared. If the number of inconsistent encoding fields of the same type is greater, the similarity of the corresponding two commonly used characters is lower, otherwise, the similarity is higher. When the number of inconsistent coding fields of the same type in the phonetic shape codes corresponding to the common words is less than the preset number, the common word is a typo candidate for the pre-replaced word.

For example, the common word encoding information of the "flower" is A11B13C13D1E2F4G4H2J1K4L7, the common word encoding information of the "yellow" word is A11B34C34D2E4F4G4H8J0K6L12, and the common word encoding information of the "hua" word is the same type of A11B13C13D2E2F2G4H4L8 "flower" and "Huang Zhe". The number of inconsistent fields is 8, "花" and "华" are 6, and "黄" and "华" are 8. If the preset number is 7, only "花" and "华" are mutually exclusive. Candidate for typo.

105. Obtain an application scenario corresponding to the correct corpus, and select multiple scenarios parallel to the application scenario based on the application scenario;

In this embodiment, generating parallel error corpus from correct corpus not only needs to consider the application scenario corresponding to the correct prediction, but also involves multiple scenarios related to the application scenario at the same time. The simplest is the combination of application scenario + common scenario. It can be a combination of application scenarios + application scenarios in similar fields + general scenarios. For example, it is necessary to obtain correct-error parallel corpus in the medical field. In addition to considering daily application scenarios, it can also be added to the nursing field and biology similar to the medical field. The fields of chemistry, physical chemistry, life sciences, etc., to increase the reliability and comprehensiveness of parallel corpus.

106. Crawling the dictionaries corresponding to the multiple scenes, and selecting similar-sounding words corresponding to the pre-replaced words as wrong word candidates;

In this embodiment, the common word dictionary is selected according to user needs, and the corresponding special word dictionary is selected according to the application scenario; common words and special words that are similar in pinyin in the common word dictionary and the special word dictionary are mutually wrong candidates for each other. Every common word and special word are not limited to appear in a wrong word candidate group, and words with similar pinyin include homophones, words with homophones but different tones, fuzzy words, etc.

For example, the wrong word candidates for "hydride" may include the homophone word "cyanide", the fuzzy word "please legal affairs", and the homophone word "blue and white things" with different tones.

107. Acquire the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

In this embodiment, multiple application scenarios may include application scenarios of correct corpus, application scenarios in similar fields, daily application scenarios, etc. The sources of expression sentences for daily application scenarios include news releases, blog documents, electronic newspapers, general e-books, Encyclopedia documents, etc., the application scenarios of the correct corpus and the expression sources of application scenarios in similar fields include scientific papers, publications, and technical books in the application field.

108. Count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence respectively;

In this embodiment, in a mixed expression sentence combining a common expression sentence and a professional expression sentence, the typo candidates and the typos candidates with a lower frequency of occurrence are not practical enough and generally not considered; the typos candidates and the typos candidates with a higher occurrence frequency are not considered Generating the wrong parallel corpus corresponding to professional sentences has its application value in application scenarios in this field. Therefore, we first count the occurrence frequency of typo candidates and wrong word candidates in mixed expression sentences.

For example, the wrong word candidates for "cyanide" include "hydride", "please legal affairs", "please legal affairs", and the application scenario is in the field of biochemistry, then the word frequency of "hydride" is obviously higher, if "hydride" The frequency of the word is 280, and the word frequencies of "Please legal affairs" and "Blue and white things" are 3 and 43 respectively. Therefore, "hydride" is preferred as a wrong word candidate to replace "cyanide", followed by "blue and white things", and "Please legal affairs" "The frequency of occurrence in this field of biochemistry is relatively low, and it loses its application value to replace "hydride" to generate parallel error corpus, so it is not considered.

109. Replace corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain a corresponding parallel corpus.

In this embodiment, a certain number of typos and typos are selected from the typos and typos in descending order of occurrence frequency, and the typos and typos obtained through the screening are combined with the corresponding pre-replacement and pre-replacement words. One replacement, the corresponding parallel corpus is obtained, where the parallel corpus refers to the wrong corpus in the correct and wrong parallel corpus.

For example, from the professional sentence "Xiao Ming went to the supermarket to buy hydride" randomly selected "buy" and "hydride" as the pre-replacement words, and the wrong word candidates that replaced "buy" were selected to get the most frequent occurrence of "enough to buy". , "Buy full", replace "hydride" in the wrong word candidates, the most frequent occurrence of "cyanide" is selected, and the parallel sentences generated include: "Xiao Ming went to the supermarket to buy cyanide", "Xiao Ming went to the supermarket to buy cyanide" I bought hydride", "Xiao Ming went to the supermarket to buy full hydride."

In the embodiment of this application, the correct corpus to be processed is obtained and word segmentation is performed to determine the error-prone pre-replacement words and pre-replacement words; construct the phonetic shape codes of the pre-replaced words; based on the phonetic shape codes of the pre-replaced words and common words, Filter out the typo candidates corresponding to the pre-replaced words from the commonly used words; obtain the correct corpus corresponding to the application scenario and select multiple scenarios parallel to it; crawl the dictionaries corresponding to multiple scenarios, and filter the similar-sounding words corresponding to the pre-replaced words as the wrong words Candidates; count the occurrence frequency of typos and typos of typos and typos in multiple scenarios in expression sentences; according to this, replace the pre-replaced words and pre-replaced words in the correct corpus to obtain the corresponding parallel corpus for flexible construction Correct-error parallel corpus for different application scenarios, reducing the scene migration cost of the wrong word correction system based on the parallel training corpus.

Please refer to FIG. 2. The second embodiment of the parallel corpus generation method in the embodiment of the present application includes:

201. Obtain the correct corpus to be processed;

202. Perform word segmentation processing on the correct corpus to obtain multiple words and words;

In this embodiment, the correct corpus is split into multiple words and/or words, and parallel sentences are generated at the word and word level.

For example, "Xiao Ming went to the supermarket to buy hydride and formaldehyde" can be divided into: "Xiao Ming", "Go", "Supermarket", "Buy", "Liao", "Hydride", "He" and "Formaldehyde". One word and five words.

203. Based on the preset probability distribution of the number of wrong words, determine the number of words and words that may be incorrectly inputted;

In this embodiment, first, according to the maximum wrong word ratio of professional sentences set by the user, the maximum number of wrong words that may appear in the professional sentence is determined. The upper limit of the probability distribution of the number of possible wrong words, simulating the number of possible wrong words in each professional sentence in the actual application scenario and the corresponding probability of occurrence. Among them, the number of wrong words that may appear in a professional sentence is inversely proportional to the corresponding probability of occurrence. The specific number of wrong words and the corresponding probability of occurrence are expressed as follows:

Pi is the probability of occurrence when the number of wrong words that may occur in a professional sentence is i, i is the number of wrong words that may occur in a professional sentence and i is a positive integer, Pmax is the upper limit of the probability distribution of the number of wrong words, and n is the probability that a professional sentence may occur The maximum number of wrong words.

For example, professional sentence A can be divided into eight single words and/or words, and the maximum percentage of wrong words is 50%, and the maximum number of wrong words is 4; if the upper limit of the probability distribution of the number of wrong words preset by the developer is 40%, then the professional The number of possible wrong words 1, 2, 3, and 4 in sentence A correspond to the occurrence probability of 40%, 30%, 20%, and 10%, respectively.

204. Based on the distribution of the number of words and words that may be entered incorrectly, randomly select a number of words and words as pre-replacement words and pre-replacement words;

In this embodiment, in order to simulate the wrong input words and/or words that may occur in practical applications, one of the words and/or words with the number of wrong words is randomly selected from the obtained distribution of the number of wrong words, and the corresponding wrong word candidates and/or words are selected. Or replace the wrong word candidate.

For example, "Xiao Ming went to the supermarket to buy hydride and formaldehyde", if the number of wrong words that may occur in the professional sentence is selected from the distribution of the number of wrong words, then a single word or word is randomly selected from the professional sentence as a pre-replacement The word or pre-replacement word can be "xiaoming", "go", "supermarket", "buy", "le", "hydride", "and" or "formaldehyde".

205. Construct the phonetic shape code of the pre-replaced word in a preset common word dictionary;

206. Based on the phoneme code of the pre-replaced word, filter out corresponding typo candidates from commonly used words in a preset commonly used word dictionary;

207. Obtain an application scenario corresponding to the correct corpus, and select multiple scenarios parallel to the application scenario based on the application scenario;

208. Crawling the dictionaries corresponding to the multiple scenes, and selecting similar-sounding words corresponding to the pre-replaced words as wrong word candidates;

209. Obtain the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

210. Count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence respectively;

211. Replace corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain a corresponding parallel corpus.

In the embodiment of this application, by obtaining professional sentences related to the application scenario, splitting them into multiple words and words, and by presetting the probability distribution of the number of wrong words, a certain number of words and words are randomly selected from the multiple words and words. As pre-replaced words and pre-replaced words, to simulate the occurrence of incorrectly typed words and words when typing in real situations, it can better fit the form of typos and wrong words in reality, and generate typos in wrong sentences The number of wrong words can be controlled by parameters to improve the controllability of the generation algorithm.

Referring to Fig. 3, the third embodiment of the parallel corpus generation method in the embodiment of the present application includes:

301. Obtain the correct corpus to be processed;

302. Perform word segmentation processing on the correct corpus, and determine error-prone pre-replacement words and pre-replacement words;

303. Construct the phonetic shape code of the pre-replaced word in a preset common word dictionary;

304. Compare the phonetic shape code of the pre-replaced word with the same type of encoding field of the phonetic shape code of the commonly used word preset in the common word dictionary;

In this embodiment, the phonetic shape codes of pre-replaced characters and commonly used characters include: initials encoding field, finals encoding field, finals complement encoding field, tone encoding field, Chinese character structure encoding field, five four-corner encoding fields, and stroke number encoding field Composed of 11-bit digital codes, the similarity of the two is quantified by comparing the number of different coding fields corresponding to the pre-replaced phonological shape code and the commonly used phonological shape code.

305. Based on the comparison result, count the number of inconsistencies in the same type of coding fields in the pre-replaced words and the commonly used words;

306. Determine the edit distance between the pre-replaced word and the commonly used word based on the statistical result.

In this embodiment, the phonetic shape code includes eleven types of encoding fields. If the same type of encoding field is different between the pre-replaced phonetic shape code and the commonly used phonetic shape code, the edit distance is increased by 1, otherwise the original value is maintained. If all types of coding fields between the pre-replaced phonogram code and the commonly used phonogram code are consistent, it means that the two commonly used characters have the highest similarity, and the edit distance between the two is 0. The encoding fields of all types between the shape codes are inconsistent, which means that the two commonly used characters have the lowest similarity, so the edit distance between the two is 11, so the edit distance between the pre-replaced characters and the commonly used characters is between 0-11 between.

307. If the editing distance between the pre-replaced word and the commonly-used word is less than the preset editing distance, the corresponding commonly-used word is a typo candidate of the pre-replaced word;

In this embodiment, since the editing distance is a quantified value of the similarity between the pre-replaced word and the commonly used word, and the smaller the editing distance, the higher the similarity, so the user can set the editing when the commonly used word is a typo candidate for the pre-replaced word The maximum distance threshold is used as the preset editing distance to filter common words with high similarity to the pre-replaced words as typo candidates, and common words less than the preset editing distance are used as typo candidates for the pre-replaced words. If the similarity of words is lower, that is, if the editing distance is larger, they are filtered out.

308. Obtain an application scenario corresponding to the correct corpus, and select multiple scenarios parallel to the application scenario based on the application scenario.

309. Crawl the dictionaries corresponding to the multiple scenes, and screen similar-sounding words corresponding to the pre-replaced words as wrong word candidates.

310. Obtain the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

311. Count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence respectively;

312. Replace corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain a corresponding parallel corpus.

In the embodiment of this application, the editing distance between the pre-replaced word and the commonly used character code is calculated to compare the similarity of the two characters in the shape and sound. When the editing distance between the pre-replaced word and the commonly used word is smaller, The similarity is higher. If the editing distance is less than the preset editing distance, the common word can be used as a typo candidate for the pre-replaced word. The editing distance between the pre-replaced word and each common word in turn can be the pre-replaced word All typo candidates are used to improve the similarity between the pre-replaced words and the corresponding typo candidates, thereby increasing the practicability of subsequent correct-error parallel corpus generation.

Referring to Fig. 4, the fourth embodiment of the parallel corpus generation method in the embodiment of the present application includes:

401. Obtain the correct corpus to be processed;

402. Perform word segmentation processing on the correct corpus, and determine error-prone pre-replacement words and pre-replacement words;

403. Construct the phonetic shape code of the pre-replaced word in a preset common word dictionary;

404. Based on the phonetic shape code of the pre-replaced word, filter out corresponding typo candidates from commonly used words in a preset commonly used word dictionary;

405. Obtain an application scenario corresponding to the correct corpus, and select multiple scenarios parallel to the application scenario based on the application scenario.

406. Crawl the dictionaries corresponding to the multiple scenes, and screen similar-sounding words corresponding to the pre-replaced words as wrong word candidates.

407. Determine whether the pre-replacement word contains multiple subwords.

In this embodiment, it may be difficult for the pre-replacement word containing multiple sub-words to match the corresponding wrong word candidate, and the sub-word of the pre-replacement word may be incorrectly input in actual application scenarios. Therefore, it is necessary to first determine whether the pre-replacement word contains multiple words.

408. If the pre-replacement word contains multiple sub-words, perform word segmentation on the pre-replacement word to obtain multiple sub-words.

In this embodiment, in order to ensure the application value of the correct-error parallel corpus in actual application scenarios, the pre-replacement word containing multiple sub-words is not used to filter the corresponding wrong word candidates, but is split into multiple sub-words and used Based on sub-words, the corresponding sub-word wrong word candidates are screened, and then according to different sub-word wrong word candidate combinations, the wrong word candidates corresponding to the pre-replacement word are spliced together.

For example, "Gong Xi Fa Cai" contains two sub-words of "Gong Xi" and "Fa Cai". In actual application scenarios, the sub-words "Gong Xi" and "Fa Cai" often have a large number of similar sounds entered incorrectly. If you directly use "Gong Xi Fa Cai" Crawling similar-sounding words from the dictionary ignores various other mistyped situations that may occur in actual application scenarios, so the corresponding sub-word wrong word candidates are selected from "Gongxi, "Fa Cai".

409. Based on the multiple sub-words, filter the wrong word candidates corresponding to each sub-word and replace them one by one to generate the wrong word candidates corresponding to the pre-replaced words.

In this embodiment, filtering the corresponding sub-word wrong word candidates through each sub-word can more comprehensively list the possible misinput of the corresponding pre-replacement word in actual application scenarios.

For example, when selecting the wrong word candidates corresponding to "Gong Xi Fa Cai", the wrong word candidates corresponding to the sub-word "Gong Xi" include "Gong Xi", "Combination", and "Gong Xi", and the wrong word candidates corresponding to "Fa Cai" include "发菜" , The wrong word candidates for "Gong Xi Fa Cai" include "Gong Xi Fa Cai", "Song Xi Fa Cai", "Gong Xi Fa Cai", and "Gong Xi Fa Cai".

410. Obtain the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

411. Count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence respectively;

412. Replace corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain a corresponding parallel corpus.

In the embodiments of the present application, some pre-replacement words in the form of compound words may appear frequently or have a low error rate in application scenarios in this field, but multiple sub-words that make up the pre-replacement word may appear frequently. In this case Divide into multiple sub-words and then filter the wrong word candidates corresponding to each sub-word. Finally, the wrong word candidates corresponding to the sub-words are combined to form the wrong word candidates of the pre-replaced word, so that the final correct-error parallel corpus generated covers more Comprehensive, more practical, and higher degree of fit with actual application scenarios.

The parallel corpus generation method in the embodiment of the application is described above, and the parallel corpus generation device in the embodiment of the application is described below. Please refer to FIG. 5. An embodiment of the parallel corpus generation device in the embodiment of the application includes:

The corpus acquisition module 501 is used to acquire the correct corpus to be processed;

The word segmentation module 502 is used to perform word segmentation processing on the correct corpus and determine error-prone pre-replacement words and pre-replacement words;

The typo generation module 503 is used to construct the phonetic shape code of the pre-replaced word in the preset common word dictionary; based on the phonetic shape code of the pre-replaced word, filter out the corresponding common words from the preset common word dictionary Typo candidate

The wrong word generation module 504 is configured to obtain the application scenario corresponding to the correct corpus, and select multiple scenarios parallel to the correct corpus based on the application scenario; crawl the dictionary corresponding to the multiple scenarios, and filter the corresponding pre-replacement word Words with similar sounds as wrong word candidates;

The sentence acquisition module 505 is configured to acquire the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

The calculation module 506 is configured to respectively count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence;

The corpus generation module 507 is configured to replace the corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain the corresponding parallel corpus.

In the embodiment of this application, the correct corpus to be processed is obtained and word segmentation is performed to determine error-prone pre-replacement characters and pre-replacement words; construct pre-replacement characters and phonological codes of common characters; based on the pronunciation of pre-replacement characters and common characters Shape code, filter out the typo candidates corresponding to the pre-replaced words from the commonly used words; obtain the correct corpus corresponding to the application scenario and select multiple scenarios parallel to it; crawl the dictionary corresponding to multiple scenarios, and filter the similar-sounding words corresponding to the pre-replaced words As a typo candidate; count the occurrence frequency of the typo and the typo of the typo candidate and the typo candidate in multiple scenarios in the expression sentence; according to this, replace the pre-replaced words and pre-replaced words in the correct corpus to obtain the corresponding parallel corpus, The correct-error parallel corpus for different application scenarios can be constructed flexibly, and the scene migration cost of the wrong word correction system based on the parallel training corpus can be reduced.

Please refer to FIG. 6, another embodiment of the parallel corpus generating device in the embodiment of the present application includes:

The corpus acquisition module 601 is used to acquire the correct corpus to be processed;

The word segmentation module 602 is used to perform word segmentation processing on the correct corpus and determine error-prone pre-replacement words and pre-replacement words;

The typo generation module 603 is used to construct the phonetic shape code of the pre-replaced word in the preset common word dictionary; based on the phonetic shape code of the pre-replaced word, filter out the corresponding common words from the preset common word dictionary Typo candidate

The wrong word generation module 604 is configured to obtain the application scenario corresponding to the correct corpus, and select multiple scenarios parallel to the correct corpus based on the application scenario; crawl the dictionary corresponding to the multiple scenarios, and filter the corresponding pre-replacement word Words with similar sounds as wrong word candidates;

The sentence acquisition module 605 is configured to acquire the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

The calculation module 606 is configured to separately count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence;

The corpus generation module 607 is configured to replace the corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain the corresponding parallel corpus.

Specifically, the word segmentation module 602 includes: a segmentation unit 6021, a probability calculation unit 6022, and a screening unit 6023, wherein:

The segmentation unit 6021 is configured to perform word segmentation processing on the correct corpus to obtain multiple words and words;

The probability calculation unit 6022 is configured to determine the number distribution of words and words that may have incorrect input based on the preset probability distribution of the number of wrong words;

The screening unit 6023 is configured to randomly select a number of single characters and words as pre-replacement words and pre-replacement words based on the number of words and word distributions that may be incorrectly input.

Specifically, the typo generation module 603 includes a phonetic shape code construction unit 6031, which is specifically used for:

Based on the phonetic sounds of the pre-replaced characters, construct the phonetic codes of the pre-replaced words in a preset common word dictionary, where the phonetic codes include initials, vowels, vowels, complements, and tones;

Based on the glyph of the pre-replaced character, construct a shape code of the pre-replaced word in a preset common word dictionary, and the shape code includes a Chinese character structure code, a five-digit four-corner code, and the number of strokes;

Based on the phonetic code and the shape code, the phonetic shape code of the pre-replaced word in a preset common word dictionary is determined.

Specifically, the typos generating module 603 further includes a typos screening unit 6032, which is specifically used for:

Comparing the phonetic shape code of the pre-replaced word with the same type of encoding field of the phonetic shape code of the commonly used word preset in the dictionary of commonly used words;

Based on the comparison result, count the number of inconsistencies between the pre-replaced word and the common word in the same type of coding field;

Determine the edit distance between the pre-replaced word and the commonly used word based on the statistical result;

If the editing distance between the pre-replaced word and the frequently-used word is less than the preset editing distance, the corresponding frequently-used word is a typo candidate of the pre-replaced word.

Specifically, the wrong word generation module 604 is also used for:

Judging whether the pre-replacement word contains multiple sub-words;

If the pre-replacement word contains multiple sub-words, perform word segmentation on the pre-replacement word to obtain multiple sub-words;

Based on the multiple sub-words, the wrong word candidates corresponding to each sub-word are screened and replaced one by one to generate the wrong word candidates corresponding to the pre-replaced words.

Specifically, the parallel corpus generating device further includes a splicing module 608, which is specifically used for:

Determine whether the pre-replaced word in the correct corpus is followed by a single word or multiple words;

If the pre-replacement word is followed by a single word or multiple words, splicing the pre-replacement word and the single word or the multiple words to obtain a spliced pre-replacement word;

Screening a preset number of wrong word candidates corresponding to the spliced pre-replacement words in descending order of word frequency;

Replace the spliced pre-replaced words one by one with the selected wrong word candidates, and generate corresponding parallel corpus.

Specifically, the corpus generation module 607 is specifically configured to:

Filtering a preset number of typo candidates corresponding to the pre-replaced words in descending order of word frequency, and filtering a preset number of erroneous word candidates corresponding to the pre-replaced words in descending order of word frequency;

Replace the corresponding pre-replaced words and pre-replaced words in the correct corpus one by one with the selected wrong character candidates and wrong word candidates, and generate corresponding parallel corpus.

In the embodiment of this application, the correct corpus corresponding to the application scenario is obtained and word segmentation is performed to obtain a random number of pre-replaced words and pre-replaced words; according to the phonetic code of the pre-replaced words, the corresponding typo candidates are screened out from the dictionary of commonly used words, according to The pronunciation of pre-replaced words is obtained from the dictionaries corresponding to multiple scenarios parallel to the application scenario, and the corresponding typo candidates are obtained. Finally, the pre-replacement words are filtered according to the frequency of the typo candidates and the word candidates in the expression sentences corresponding to the multiple application scenarios. Set up a number of typos and wrong words to replace the pre-replaced words and pre-replaced words in the correct corpus to generate corresponding parallel predictions to achieve flexible construction of correct-error parallel corpora for different application scenarios, and reduce the correction of wrong words based on the parallel training corpus The scene of the system is migrated to.

The above figures 5 and 6 describe the parallel corpus generating device in the embodiment of the present application in detail from the perspective of modular functional entities, and the following describes the parallel corpus generating device in the embodiment of the present application in detail from the perspective of hardware processing.

FIG. 7 is a schematic structural diagram of a parallel corpus generation device provided by an embodiment of the present application. The parallel corpus generation device 700 may have relatively large differences due to different configurations or performance, and may include one or more processors (central processing units). , A CPU 710 (for example, one or more processors) and a memory 720, and one or more storage media 730 (for example, one or more storage devices) storing application programs 733 or data 732. Among them, the memory 720 and the storage medium 730 may be short-term storage or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the parallel corpus generation device 700. Further, the processor 710 may be configured to communicate with the storage medium 730, and execute a series of instruction operations in the storage medium 730 on the parallel corpus generation device 700.

The parallel corpus generation device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input and output interfaces 760, and/or one or more operating systems 731, such as Windows Serve , Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art can understand that the structure of the parallel corpus generation device shown in FIG. 7 does not constitute a limitation on the parallel corpus generation device, and may include more or fewer components than shown in the figure, or a combination of certain components, or different components. Component arrangement.

This application also provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium, and the computer-readable storage medium may also be a volatile computer-readable storage medium. The computer-readable storage medium stores instructions, and when the instructions are executed on the computer, the computer executes the following steps:

Obtain the correct corpus to be processed;

Perform word segmentation processing on the correct corpus, and determine error-prone pre-replacement words and pre-replacement words;

Constructing the phonetic shape code of the pre-replaced word in the preset common word dictionary;

Based on the phonetic shape code of the pre-replaced word, filter out the corresponding typo candidate from the common words in the preset common word dictionary;

Acquiring an application scenario corresponding to the correct corpus, and selecting multiple scenarios parallel to the application scenario based on the application scenario;

Crawling the dictionaries corresponding to the multiple scenes, and selecting similar-sounding words corresponding to the pre-replaced words as wrong word candidates;

Acquiring the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

Separately count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence;

The corresponding pre-replaced words and pre-replaced words in the correct corpus are replaced based on the occurrence frequency to obtain the corresponding parallel corpus.

Those skilled in the art can clearly understand that, for the convenience and conciseness of the description, the specific working process of the above-described system, device, and unit can refer to the corresponding process in the foregoing method embodiment, which will not be repeated here.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , Including several instructions to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (read-only memory, ROM), random access memory (random access memory, RAM), magnetic disks or optical disks and other media that can store program codes. .

The blockchain referred to in this application is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm. Blockchain, essentially a decentralized database, is a series of data blocks associated with cryptographic methods. Each data block contains a batch of network transaction information for verification. The validity of the information (anti-counterfeiting) and the generation of the next block. The blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

As mentioned above, the above embodiments are only used to illustrate the technical solutions of the present application, but not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: The technical solutions recorded in the embodiments are modified, or some of the technical features are equivalently replaced; and these modifications or replacements do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

A parallel corpus generation method, which includes:

Obtain the correct corpus to be processed;

Perform word segmentation processing on the correct corpus, and determine error-prone pre-replacement words and pre-replacement words;

Constructing the phonetic shape code of the pre-replaced word in the preset common word dictionary;

Based on the phonetic shape code of the pre-replaced word, filter out the corresponding typo candidate from the common words in the preset common word dictionary;

Acquiring an application scenario corresponding to the correct corpus, and selecting multiple scenarios parallel to the application scenario based on the application scenario;

Crawling the dictionaries corresponding to the multiple scenes, and selecting similar-sounding words corresponding to the pre-replaced words as wrong word candidates;

Acquiring the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

Separately count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence;

The corresponding pre-replaced words and pre-replaced words in the correct corpus are replaced based on the occurrence frequency to obtain the corresponding parallel corpus.
The parallel corpus generation method according to claim 1, wherein said performing word segmentation processing on the correct corpus and determining error-prone pre-replacement words and pre-replacement words comprises:

Perform word segmentation processing on the correct corpus to obtain multiple words and phrases;

Based on the preset probability distribution of the number of wrong words, determine the number of words and words that may be mistyped;

Based on the distribution of the number of words and words that may be entered incorrectly, a number of words and words are randomly selected as pre-replacement words and pre-replacement words.
The parallel corpus generation method according to claim 1, wherein said constructing the phonetic shape code of the pre-replaced word in a preset common word dictionary comprises:

Based on the phonetic sounds of the pre-replaced characters, construct the phonetic codes of the pre-replaced words in a preset common word dictionary, where the phonetic codes include initials, vowels, vowels, complements, and tones;

Based on the glyph of the pre-replaced character, construct a shape code of the pre-replaced word in a preset common word dictionary, and the shape code includes a Chinese character structure code, a five-digit four-corner code, and the number of strokes;

Based on the phonetic code and the shape code, the phonetic shape code of the pre-replaced word in a preset common word dictionary is determined.
The parallel corpus generation method according to claim 1 or 3, wherein said screening out corresponding typo candidates from common words in a preset common word dictionary based on the phoneme code of the pre-replaced word comprises:

Comparing the phonetic shape code of the pre-replaced word with the same type of encoding field of the phonetic shape code of the commonly used word preset in the dictionary of commonly used words;

Based on the comparison result, count the number of inconsistencies between the pre-replaced word and the common word in the same type of coding field;

Determine the edit distance between the pre-replaced word and the commonly used word based on the statistical result;

If the editing distance between the pre-replaced word and the frequently-used word is less than the preset editing distance, the corresponding frequently-used word is a typo candidate of the pre-replaced word.
The parallel corpus generation method according to claim 1, wherein after the crawling the dictionaries corresponding to the multiple scenes and screening the similar-sounding words corresponding to the pre-replaced words as wrong word candidates, the method further comprises:

Judging whether the pre-replacement word contains multiple sub-words;

If the pre-replacement word contains multiple sub-words, perform word segmentation on the pre-replacement word to obtain multiple sub-words;

Based on the multiple sub-words, the wrong word candidates corresponding to each sub-word are screened and replaced one by one to generate the wrong word candidates corresponding to the pre-replaced words.
The method for generating parallel corpus according to claim 1, wherein after said separately counting the frequency of occurrence of the typos candidate and the typos candidate in the expression sentence, the method further comprises:

Determine whether the pre-replaced word in the correct corpus is followed by a single word or multiple words;

If the pre-replacement word is followed by a single word or multiple words, splicing the pre-replacement word and the single word or the multiple words to obtain a spliced pre-replacement word;

Screening a preset number of wrong word candidates corresponding to the spliced pre-replacement words in descending order of word frequency;

Replace the spliced pre-replaced words one by one with the selected wrong word candidates, and generate corresponding parallel corpus.
The parallel corpus generation method according to claim 1 or 6, wherein said replacing the corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain the corresponding parallel corpus comprises:

Filtering a preset number of typo candidates corresponding to the pre-replaced words in descending order of word frequency, and filtering a preset number of erroneous word candidates corresponding to the pre-replaced words in descending order of word frequency;

Replace the corresponding pre-replaced words and pre-replaced words in the correct corpus one by one with the selected wrong character candidates and wrong word candidates, and generate corresponding parallel corpus.
A parallel corpus generation device includes a memory, a processor, and computer-readable instructions stored on the memory and capable of running on the processor, and the processor implements the following steps when the processor executes the computer-readable instructions:

Obtain the correct corpus to be processed;

Perform word segmentation processing on the correct corpus, and determine error-prone pre-replacement words and pre-replacement words;

Constructing the phonetic shape code of the pre-replaced word in the preset common word dictionary;

Based on the phonetic shape code of the pre-replaced word, filter out the corresponding typo candidate from the common words in the preset common word dictionary;

Acquiring an application scenario corresponding to the correct corpus, and selecting multiple scenarios parallel to the application scenario based on the application scenario;

Crawling the dictionaries corresponding to the multiple scenes, and selecting similar-sounding words corresponding to the pre-replaced words as wrong word candidates;

Acquiring the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

Separately count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence;

The corresponding pre-replaced words and pre-replaced words in the correct corpus are replaced based on the occurrence frequency to obtain the corresponding parallel corpus.
According to the parallel corpus generating device according to claim 8, the processor further implements the following steps when executing the computer program:

Perform word segmentation processing on the correct corpus to obtain multiple words and phrases;

Based on the preset probability distribution of the number of wrong words, determine the number of words and words that may be mistyped;

Based on the distribution of the number of words and words that may be entered incorrectly, a number of words and words are randomly selected as pre-replacement words and pre-replacement words.
According to the parallel corpus generating device according to claim 8, the processor further implements the following steps when executing the computer program:

Based on the phonetic sounds of the pre-replaced characters, construct the phonetic codes of the pre-replaced words in a preset common word dictionary, the phonetic codes include initials, vowels, vowels, complements, and tones;

Based on the glyph of the pre-replaced character, construct a shape code of the pre-replaced word in a preset common word dictionary, where the shape code includes a Chinese character structure code, a five-digit four-corner code, and the number of strokes;

Based on the phonetic code and the shape code, the phonetic shape code of the pre-replaced word in a preset common word dictionary is determined.
According to the parallel corpus generation device of claim 8 or 10, the processor further implements the following steps when executing the computer program:

Comparing the phonetic shape code of the pre-replaced word with the same type of encoding field of the phonetic shape code of the commonly used word preset in the dictionary of commonly used words;

Based on the comparison result, count the number of inconsistencies between the pre-replaced word and the common word in the same type of coding field;

Determine the edit distance between the pre-replaced word and the commonly used word based on the statistical result;

If the editing distance between the pre-replaced word and the frequently-used word is less than the preset editing distance, the corresponding frequently-used word is a typo candidate of the pre-replaced word.
According to the parallel corpus generating device according to claim 8, the processor further implements the following steps when executing the computer program:

Judging whether the pre-replacement word contains multiple sub-words;

If the pre-replacement word contains multiple sub-words, perform word segmentation on the pre-replacement word to obtain multiple sub-words;

Based on the multiple sub-words, the wrong word candidates corresponding to each sub-word are screened and replaced one by one to generate the wrong word candidates corresponding to the pre-replaced words.
According to the parallel corpus generating device according to claim 8, the processor further implements the following steps when executing the computer program:

Determine whether the pre-replaced word in the correct corpus is followed by a single word or multiple words;

If the pre-replacement word is followed by a single word or multiple words, splicing the pre-replacement word and the single word or the multiple words to obtain a spliced pre-replacement word;

Screening a preset number of wrong word candidates corresponding to the spliced pre-replacement words in descending order of word frequency;

Replace the spliced pre-replaced words one by one with the selected wrong word candidates, and generate corresponding parallel corpus.
According to the parallel corpus generating device according to claim 8 or 13, the processor further implements the following steps when executing the computer program:

Filtering a preset number of typos candidates corresponding to the pre-replaced words in descending order of word frequency, and filtering a preset number of erroneous word candidates corresponding to the pre-replaced words in descending order of word frequency;

Replace the corresponding pre-replaced words and pre-replaced words in the correct corpus one by one with the selected wrong character candidates and wrong word candidates, and generate corresponding parallel corpus.
A computer-readable storage medium in which computer instructions are stored, and when the computer instructions are executed on a computer, the computer executes the following steps:

Obtain the correct corpus to be processed;

Perform word segmentation processing on the correct corpus, and determine error-prone pre-replacement words and pre-replacement words;

Constructing the phonetic shape code of the pre-replaced word in the preset common word dictionary;

Based on the phonetic shape code of the pre-replaced word, filter out the corresponding typo candidate from the common words in the preset common word dictionary;

Acquiring an application scenario corresponding to the correct corpus, and selecting multiple scenarios parallel to the application scenario based on the application scenario;

Crawling the dictionaries corresponding to the multiple scenes, and selecting similar-sounding words corresponding to the pre-replaced words as wrong word candidates;

Acquiring the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

Separately count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence;

The corresponding pre-replaced words and pre-replaced words in the correct corpus are replaced based on the occurrence frequency to obtain the corresponding parallel corpus.
The computer-readable storage medium according to claim 15, when the computer instructions are executed on the computer, the computer is caused to further perform the following steps:

Perform word segmentation processing on the correct corpus to obtain multiple words and phrases;

Based on the preset probability distribution of the number of wrong words, determine the number of words and words that may be mistyped;

Based on the distribution of the number of words and words that may be entered incorrectly, a number of words and words are randomly selected as pre-replacement words and pre-replacement words.
The computer-readable storage medium according to claim 15, when the computer instructions are executed on the computer, the computer is caused to further perform the following steps:

Based on the phonetic sounds of the pre-replaced characters, construct the phonetic codes of the pre-replaced words in a preset common word dictionary, where the phonetic codes include initials, vowels, vowels, complements, and tones;

Based on the glyph of the pre-replaced character, construct a shape code of the pre-replaced word in a preset common word dictionary, and the shape code includes a Chinese character structure code, a five-digit four-corner code, and the number of strokes;

Based on the phonetic code and the shape code, the phonetic shape code of the pre-replaced word in a preset common word dictionary is determined.
The computer-readable storage medium according to claim 15 or 17, when the computer instructions are executed on the computer, the computer is caused to further execute the following steps:

Comparing the phonetic shape code of the pre-replaced word with the same type of encoding field of the phonetic shape code of the commonly used word preset in the dictionary of commonly used words;

Based on the comparison result, count the number of inconsistencies between the pre-replaced word and the common word in the same type of coding field;

Determine the edit distance between the pre-replaced word and the commonly used word based on the statistical result;

If the editing distance between the pre-replaced word and the frequently-used word is less than the preset editing distance, the corresponding frequently-used word is a typo candidate of the pre-replaced word.
The computer-readable storage medium according to claim 15, when the computer instructions are executed on the computer to execute the following steps, the computer is caused to further execute the following steps:

Judging whether the pre-replacement word contains multiple sub-words;

If the pre-replacement word contains multiple sub-words, perform word segmentation on the pre-replacement word to obtain multiple sub-words;

Based on the multiple sub-words, the wrong word candidates corresponding to each sub-word are screened and replaced one by one to generate the wrong word candidates corresponding to the pre-replaced words.
A parallel corpus generating device, wherein the parallel corpus generating device includes:

The corpus acquisition module is used to acquire the correct corpus to be processed;

The word segmentation module is used to segment the correct corpus and determine the error-prone pre-replacement words and pre-replacement words;

The typo generation module is used to construct the phonetic shape code of the pre-replaced word in the preset common word dictionary; based on the phonetic shape code of the pre-replaced word, filter out the corresponding typo from the common words in the preset common word dictionary Candidate

The wrong word generation module is used to obtain the application scenario corresponding to the correct corpus, and select multiple scenarios parallel to it based on the application scenario; crawl the dictionaries corresponding to the multiple scenarios, and filter the corresponding pre-replacement words Words with similar sounds are used as candidates for wrong words;

A sentence acquisition module, configured to acquire the typos candidate and the expression sentences of the typos candidate in the multiple scenes;

A calculation module, configured to separately count the occurrence frequencies of the typos candidate and the typos candidate in the expression sentence;

The corpus generation module is used to replace the corresponding pre-replaced words and pre-replaced words in the correct corpus based on the occurrence frequency to obtain the corresponding parallel corpus.