CN115169328A - High-accuracy Chinese spelling check method, system and medium - Google Patents


Info

Publication number
CN115169328A
CN115169328A (application CN202210573678.3A)
Authority
CN
China
Prior art keywords
error
sound
word
corpus
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210573678.3A
Other languages
Chinese (zh)
Inventor
王重阳 (Wang Chongyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Moduo Information Technology Co., Ltd.
Original Assignee
Suzhou Moduo Information Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Moduo Information Technology Co ltd filed Critical Suzhou Moduo Information Technology Co ltd
Priority to CN202210573678.3A priority Critical patent/CN115169328A/en
Publication of CN115169328A publication Critical patent/CN115169328A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 — Handling natural language data
    • G06F 40/20 — Natural language analysis
    • G06F 40/205 — Parsing
    • G06F 40/216 — Parsing using statistical methods
    • G06F 40/232 — Orthographic correction, e.g. spell checking or vowelisation
    • G06F 40/279 — Recognition of textual entities
    • G06F 40/289 — Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a high-accuracy Chinese spelling check method, system, and medium. The method comprises the following steps: obtaining a corpus to be used, configuring a substitution table to be used, and generating a near-sound error sample to be used and a near-shape error sample to be used based on the corpus to be used; training a decision model to be used based on the corpus to be used, and training an inspection model to be used based on the near-sound and near-shape error samples to be used; and performing a text inspection and error correction operation on the acquired text to be inspected based on the substitution table to be used, the trained decision model, and the trained inspection model. The invention can efficiently generate training samples of high quality and high authenticity, detects errors in Chinese text with an independently configured trained error detection model and post-processing decision model, and makes high-quality error correction judgments after detection, thereby further reducing the model's mis-correction rate, improving correction quality, and opening up a new approach to Chinese spelling error detection.

Description

High-accuracy Chinese spelling checking method, system and medium
Technical Field
The invention relates to the technical field of intelligent error correction, and in particular to a high-accuracy Chinese spelling check method, system, and medium.
Background
With the rapid development of the internet, online input has become a necessary function of every terminal and application, and because online input is mostly manual, input errors are inevitable. In informal scenarios, wrongly written characters may not affect the intended meaning of the information; in a search scenario, however, erroneous input causes search results to mismatch the user's needs, and in more rigorous scenarios such as government announcements and official statements, erroneous input can cause serious political accidents. At present, online input mostly uses spelling (pinyin) input, so error correction of spelling input errors is an urgent need of the current era.
In the prior art, spelling input error correction is performed with artificial intelligence learning models, involving technical directions such as sample configuration, error detection, and error correction.
For sample configuration, current methods use random generation or manual labeling, which yields poor authenticity, poor sample quality, and high labor cost.
For error detection and correction, either a multi-stage processing model or a seq2seq model is used. The multi-stage approach, on the one hand, solidifies the pipeline, and on the other hand creates strong dependencies between processing stages that make systematic knowledge sharing difficult; it also requires strong engineering support and is costly to use. The seq2seq approach, on the one hand, places extremely high demands on sample quality, producing a high mis-correction rate and severe limitations, and on the other hand requires heavy compute support, making it relatively expensive to use.
In view of the foregoing, there is a need to develop a Chinese spell check and error correction method that can efficiently generate high-quality, high-authenticity samples and that has strong recall capability and a low mis-correction rate.
Disclosure of Invention
The invention aims to provide a Chinese spell check and error correction method that can efficiently generate samples of high quality and high authenticity and that has strong recall capability and a low mis-correction rate, thereby making up for the defects of the prior art and solving the problems described above.
To achieve this aim, the invention adopts the following technical scheme: a high-accuracy Chinese spell checking method is provided, comprising the following steps:
and a corpus obtaining step:
screening a target website, and crawling data from the target website to obtain a first corpus; configuring a first open source project; setting cleaning characteristic data, symbol reservation cutting logic, cutting identifier and length interval; performing corpus cleaning operation on the first corpus based on the first open source project, the cleaning feature data, the symbol reservation cutting logic, the cutting identifier and the length interval to obtain a corpus to be used;
and a replacement table configuration step:
setting a first word frequency value, a second word frequency value and a first character frequency value; configuring front and back nasal sound disambiguation logic, compound word segmentation logic, a second open source project, a first simplified-traditional word list and a plurality of first near-sound sample words; executing a replacement word list generation operation based on the corpus to be used, the first word frequency value, the second word frequency value and the first character frequency value to obtain a word list to be replaced; performing a near-sound near-shape table generation operation based on the front and back nasal sound disambiguation logic, the compound word segmentation logic, the second open source project and the plurality of first near-sound sample words to obtain a near-sound near-shape table; setting the first simplified-traditional word list as a Chinese traditional word list, and integrating the word list to be replaced, the near-sound near-shape table and the Chinese traditional word list to obtain a substitution table to be used;
an error sample generation step:
setting a near-sound error probability and a near-shape error probability; screening, in the corpus to be used, the corpus to be near-sound processed and the corpus to be near-shape processed based on the near-sound error probability and the near-shape error probability; configuring a random position recall algorithm, a probability limit recall algorithm, a multi-error recall algorithm and a sample random selection algorithm; executing a near-sound error sample generation operation based on the substitution table to be used, the corpus to be near-sound processed, the random position recall algorithm, the probability limit recall algorithm, the multi-error recall algorithm and the sample random selection algorithm to generate a near-sound error sample to be used; and generating a near-shape error sample to be used by applying the operation logic of the near-sound error sample generation operation to the corpus to be near-shape processed;
model configuration step:
configuring a first decision model and a first open source package; training the first decision model based on the corpus to be used and the first open source package to obtain a decision model to be used; configuring a first inspection model, and setting an error detection network in a model architecture of the first inspection model to obtain a second inspection model; training the second inspection model based on the to-be-used near-sound error sample and the to-be-used near-shape error sample to obtain an inspection model to be used; configuring a post-processing prediction model;
text checking:
acquiring a text to be checked; setting a probability threshold, error screening characteristics and error length characteristic values; and executing text inspection error correction operation on the text to be inspected based on the substitution table to be used, the decision model to be used, the inspection model to be used, the post-processing prediction model, the probability threshold, the error screening characteristic and the error length characteristic value.
As an improvement, the corpus cleaning operation includes:
performing corpus cleaning on the first corpus based on a regular expression and the cleaning characteristic data to obtain a second corpus; cutting the second corpus based on the symbol reservation cutting logic and the cutting identifier to obtain a third corpus; filtering the corpus with the corpus length outside the length interval in the third corpus to obtain a fourth corpus; and performing word segmentation processing on the fourth corpus based on the first open source project to obtain the corpus to be used.
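The cleaning pipeline above can be sketched in Python as follows. This is a simplified stand-in, not the patent's implementation: the regular expression and the cut identifiers (。！？) are illustrative assumptions, the length interval [8, 126] is taken from embodiment 1, and the pyltp word segmentation step and the paired-punctuation constraint of the symbol reservation cutting logic are omitted for brevity.

```python
import re

def clean_corpus(raw_texts, min_len=8, max_len=126):
    """Sketch: regex cleaning, cutting on end-of-sentence identifiers,
    then filtering sentences whose length falls outside [min_len, max_len]."""
    cleaned = []
    for text in raw_texts:
        # Remove irregular symbols (the "cleaning feature data") -- illustrative pattern
        # that keeps CJK characters, alphanumerics, and common Chinese punctuation.
        text = re.sub(r"[^\u4e00-\u9fff0-9A-Za-z，。！？；：、]", "", text)
        # Cut after each end identifier (。！？), keeping the identifier itself.
        for sent in re.split(r"(?<=[。！？])", text):
            sent = sent.strip()
            if min_len <= len(sent) <= max_len:
                cleaned.append(sent)
    return cleaned
```

A sentence shorter than the lower bound, such as "短句。", is dropped, while an in-range sentence survives intact.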
As an improvement, the replacement word list generation operation comprises:
performing word frequency statistics and character frequency statistics on the corpus to be used to obtain a word frequency ranking sequence and a character frequency ranking sequence; in the word frequency ranking sequence, selecting a first word corpus corresponding to the first word frequency value quantity in a first direction, and selecting a second word corpus corresponding to the second word frequency value quantity in a second direction; selecting, in the character frequency ranking sequence, a first character corpus corresponding to the first character frequency value quantity in the first direction; and constructing the word list to be replaced based on the first word corpus, the second word corpus and the first character corpus.
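The frequency-based selection can be sketched as follows. The function and parameter names are assumptions for illustration; the sketch ranks words and characters by frequency and takes slices from the top (first direction) and bottom (second direction) of the ranking, per the first/second frequency-value quantities.

```python
from collections import Counter

def build_replace_vocab(corpus_words, first_word_n=3, second_word_n=2, first_char_n=3):
    """Sketch of the replacement word list generation: frequency-rank
    words and characters, then slice from both ends of the rankings."""
    word_freq = Counter(corpus_words)
    char_freq = Counter(ch for w in corpus_words for ch in w)
    ranked_words = [w for w, _ in word_freq.most_common()]
    ranked_chars = [c for c, _ in char_freq.most_common()]
    return {
        "words_high": ranked_words[:first_word_n],    # first direction: most frequent
        "words_low": ranked_words[-second_word_n:],   # second direction: least frequent
        "chars_high": ranked_chars[:first_char_n],    # character-frequency ranking
    }
```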
As an improvement, the near-sound near-shape table generating operation includes:
creating a plurality of pinyin mappings respectively matched with the first near-sound sample words, and sorting the first near-sound sample words and the pinyin mappings to obtain a first near-sound word list; performing word segmentation on the plurality of first near-sound sample words in the first near-sound word list based on the compound word segmentation logic to obtain a plurality of second near-sound sample words; updating the plurality of pinyin mappings based on the plurality of second near-sound sample words to obtain a second near-sound word list; disambiguating the pinyin mappings respectively corresponding to the second near-sound sample words in the second near-sound word list based on the front and back nasal sound disambiguation logic to obtain a near-sound word list to be used; acquiring a near-shape word list to be used based on the second open source project; and integrating the near-sound word list to be used and the near-shape word list to be used to obtain the near-sound near-shape table.
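A minimal illustration of a near-sound table with front/back nasal sound disambiguation (en/eng, in/ing) follows. The tiny hand-rolled pinyin map is an assumption for demonstration only; a production system would derive the pinyin mappings from a library such as pypinyin or from the Unihan database mentioned later in this description.

```python
# Hand-rolled toy pinyin map -- an assumption standing in for a real mapping source.
PINYIN = {"分": "fen", "风": "feng", "心": "xin", "星": "xing"}

def nasal_variants(py):
    """Return the set of pinyin keys confusable under front/back nasal merging."""
    for front, back in (("en", "eng"), ("in", "ing")):
        if py.endswith(back):
            return {py, py[: -len(back)] + front}
        if py.endswith(front):
            return {py, py[: -len(front)] + back}
    return {py}

def near_sound_table(chars):
    """Characters sharing any confusable pinyin key are near-sound neighbours."""
    table = {}
    for ch in chars:
        keys = nasal_variants(PINYIN[ch])
        table[ch] = [c for c in chars if c != ch and keys & nasal_variants(PINYIN[c])]
    return table
```

Here 分/风 (fen/feng) and 心/星 (xin/xing) pair up, while cross pairs such as 分/心 do not.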
As an improvement, the near-tone error sample generation operation includes:
creating a part-of-speech set to be filtered and a first word recall set; setting a first length threshold and a first Cartesian product limiting formula; calling the random position recall algorithm based on the part-of-speech set to be filtered, the first word recall set, the first length threshold, the first Cartesian product limiting formula, the corpus to be near-sound processed, the near-sound word list to be used and the Chinese traditional word list to obtain a plurality of first near-sound error texts;
setting a single text recall probability and sequence recall positions; setting recall limiting logic based on the word list to be replaced; calling the probability limit recall algorithm based on the single text recall probability, the sequence recall positions, the recall limiting logic, the corpus to be near-sound processed and the near-sound word list to be used to obtain a plurality of second near-sound error texts;
setting a second length threshold and a multi-dislocation setting algorithm; calling the multi-error recall algorithm based on the second length threshold, the multi-dislocation setting algorithm, the corpus to be near-sound processed and the near-sound word list to be used to obtain a third multi-error sample;
and calling the sample random selection algorithm based on the third multi-error sample, the plurality of first near-sound error texts and the plurality of second near-sound error texts to obtain the near-sound error sample to be used.
As an improved solution, the random location recall algorithm is:
selecting a first text sequence from the corpus to be near-sound processed, and selecting a first segment from the first text sequence; performing data recall processing on the first word recall set based on the first length threshold, the first Cartesian product limiting formula, the near-sound word list to be used, and the first segment; screening the first word recall set subjected to the data recall processing based on the Chinese traditional word list and the part-of-speech set to be filtered to obtain a second word recall set; and generating a plurality of the first near-sound error texts based on sample words in the second word recall set and the first text sequence;
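A simplified sketch of the random position recall idea, assuming a single-character segment; the part-of-speech and traditional-character filters are omitted, and `max_candidates` stands in for the first Cartesian product limiting formula.

```python
import random

def random_position_recall(sequence, near_sound_table, max_candidates=5, seed=0):
    """Sketch: pick a random replaceable position in the text sequence
    and recall near-sound substitutes for the character there."""
    rng = random.Random(seed)
    positions = [i for i, ch in enumerate(sequence) if ch in near_sound_table]
    if not positions:
        return []
    pos = rng.choice(positions)
    # Cap the candidate fan-out, standing in for the Cartesian product limit.
    return [sequence[:pos] + sub + sequence[pos + 1:]
            for sub in near_sound_table[sequence[pos]][:max_candidates]]
```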
the probability limit recall algorithm is as follows:
selecting a second text sequence from the corpus to be near-sound processed; recalling texts located at the sequence recall positions in the second text sequence according to the near-sound word list to be used and the single text recall probability to obtain a plurality of fourth recall sample words; screening the plurality of fourth recall sample words based on the recall limiting logic to obtain a plurality of fifth recall sample words; and generating a plurality of second near-sound error texts based on the plurality of fifth recall sample words and the second text sequence;
the multi-error recall algorithm is as follows:
selecting a third text sequence from the corpus to be near-sound processed based on the second length threshold; setting a plurality of error replacement positions in the third text sequence based on the multi-dislocation setting algorithm; acquiring a plurality of sixth candidate sample words related to the plurality of error replacement positions based on the near-sound word list to be used; and generating the third multi-error sample based on the plurality of error replacement positions, the plurality of sixth candidate sample words, and the third text sequence;
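The multi-error recall can be sketched as follows; random position sampling stands in for the multi-dislocation setting algorithm, and the parameter names are assumptions.

```python
import random

def multi_error_recall(sequence, near_sound_table, n_errors=2, seed=1):
    """Sketch: place several near-sound substitutions at distinct
    positions of a sufficiently long text sequence."""
    rng = random.Random(seed)
    positions = [i for i, ch in enumerate(sequence) if ch in near_sound_table]
    chosen = rng.sample(positions, min(n_errors, len(positions)))
    chars = list(sequence)
    for pos in chosen:
        chars[pos] = rng.choice(near_sound_table[sequence[pos]])
    return "".join(chars)
```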
the sample random selection algorithm comprises the following steps:
setting a selected quantity calculation formula; counting a first quantity sum value of the third multi-error sample, the plurality of first near-sound error texts and the plurality of second near-sound error texts; substituting the first quantity sum value into the selected quantity calculation formula to obtain a selected quantity value; and selecting a plurality of near-sound error samples to be used from the third multi-error sample, the first near-sound error texts and the second near-sound error texts according to the selected quantity value.
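The sample random selection step can be sketched as follows. The patent does not disclose the selected quantity calculation formula, so a simple ratio of the pooled count is used here as an assumption.

```python
import random

def select_samples(multi_error, first_texts, second_texts, ratio=0.5, seed=0):
    """Sketch: pool the three error sources, apply a selected-quantity
    formula (here a ratio -- an assumption), and draw that many at random."""
    pool = list(multi_error) + list(first_texts) + list(second_texts)
    selected_count = max(1, int(len(pool) * ratio))
    rng = random.Random(seed)
    return rng.sample(pool, selected_count)
```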
As an improvement, the text checking and error correcting operation includes:
inputting the text to be checked into the checking model to be used to obtain predicted suspicious errors and error probability; comparing the error probability with the probability threshold;
if the error probability reaches the probability threshold, performing error correction processing on the predicted suspicious error based on the word list to be replaced;
if the error probability does not reach the probability threshold, performing adjacent error combination processing on the predicted suspicious error to obtain a first error to be judged; performing feature filtering processing on the first to-be-judged error based on the error screening feature, and judging whether filtering is performed to obtain a second to-be-judged error;
if the second to-be-determined error is obtained, performing advanced decision processing on the second to-be-determined error; and if the second to-be-judged error is not obtained, carrying out error reservation processing on the predicted suspicious error.
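The branching logic described so far — immediate correction at or above the probability threshold, adjacent error combination below it — can be sketched as follows. Representing each predicted suspicious error as a (position, probability) pair is an assumed encoding, not the patent's.

```python
def check_text(errors, probability_threshold=0.9):
    """Sketch: split predicted suspicious errors into those corrected
    immediately and merged runs passed to the downstream decision stage.
    `errors` is a list of (position, probability) pairs -- an assumption."""
    correct_now, to_decide = [], []
    for pos, prob in sorted(errors):
        if prob >= probability_threshold:
            correct_now.append(pos)           # confident: correct immediately
        elif to_decide and to_decide[-1][-1] == pos - 1:
            to_decide[-1].append(pos)         # merge adjacent suspicious positions
        else:
            to_decide.append([pos])
    return correct_now, to_decide
```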
As an improved solution, the advanced decision processing includes:
identifying the error length value of the second error to be judged; if the error length value is larger than the error length characteristic value, performing gram scoring decision processing based on the second error to be judged and the predicted suspicious error; if the error length value is not larger than the error length characteristic value, performing prediction processing on the second error to be judged based on the post-processing prediction model to obtain a third error to be judged, and performing gram scoring decision processing based on the third error to be judged and the predicted suspicious error;
the gram scoring decision processing includes:
calling the decision model to be used to calculate a first error correction index of the predicted suspicious error and a second error correction index of the second error to be judged or a third error correction index of the third error to be judged; comparing the first error correction index with the second error correction index or the third error correction index; if the first error correction index is smaller than the second error correction index or the third error correction index, performing the error correction processing on the predicted suspicious error based on the word list to be replaced; and if the first error correction index is not smaller than the second error correction index or the third error correction index, performing error retention processing on the predicted suspicious error.
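The gram scoring decision can be illustrated with a toy bigram language model standing in for the trained decision model to be used: the candidate correction is kept only when it scores better than the original, mirroring the error correction index comparison. The log-probability table below is illustrative, not trained.

```python
def ngram_score(sentence, bigram_logprob, unk=-10.0):
    """Sum bigram log-probabilities; higher (less negative) means more fluent."""
    return sum(bigram_logprob.get(sentence[i:i + 2], unk)
               for i in range(len(sentence) - 1))

def gram_scoring_decision(original, candidate, bigram_logprob):
    """Keep the candidate only when it is strictly more fluent than the
    original (error correction); otherwise retain the original (error retention)."""
    if ngram_score(candidate, bigram_logprob) > ngram_score(original, bigram_logprob):
        return candidate
    return original
```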
The invention also provides a high-accuracy Chinese spell checking system, which comprises:
the system comprises a corpus acquisition module, a substitution table configuration module, an error sample generation module, a model configuration module and a text inspection module;
the corpus obtaining module is used for screening a target website and crawling data from the target website to obtain a first corpus; the corpus obtaining module is further used for configuring a first open-source project; the corpus acquiring module is also used for setting cleaning characteristic data, symbol reservation cutting logic, cutting identifiers and length intervals; the corpus obtaining module executes corpus cleaning operation on the first corpus based on the first open source project, the cleaning feature data, the symbol reservation cutting logic, the cutting identifier and the length interval to obtain a corpus to be used;
the substitution table configuration module is used for setting a first word frequency value, a second word frequency value and a first character frequency value; the substitution table configuration module is also used for configuring front and back nasal sound disambiguation logic, compound word segmentation logic, a second open source project, a first simplified-traditional word list and a plurality of first near-sound sample words; the substitution table configuration module executes a replacement word list generation operation based on the corpus to be used, the first word frequency value, the second word frequency value and the first character frequency value to obtain a word list to be replaced; the substitution table configuration module executes a near-sound near-shape table generation operation based on the front and back nasal sound disambiguation logic, the compound word segmentation logic, the second open source project and the first near-sound sample words to obtain a near-sound near-shape table; the substitution table configuration module sets the first simplified-traditional word list as a Chinese traditional word list, and integrates the word list to be replaced, the near-sound near-shape table and the Chinese traditional word list to obtain a substitution table to be used;
the error sample generation module is used for setting a near sound error probability and a near shape error probability; the error sample generation module screens the linguistic data to be processed by the near pronunciation and the linguistic data to be processed by the near shape in the linguistic data to be used based on the near pronunciation error probability and the near shape error probability; the error sample generation module is also used for configuring a random position recall algorithm, a probability limit recall algorithm, a multi-error recall algorithm and a sample random selection algorithm; the error sample generation module executes a near sound error sample generation operation based on the to-be-used substitution table, the to-be-near-sound processing corpus, the random position recall algorithm, the probability limit recall algorithm, the multiple error recall algorithm and the sample random selection algorithm to generate a to-be-used near sound error sample; the error sample generation module generates a to-be-used near-shape error sample by adopting operation logic of the near-sound error sample generation operation based on the to-be-near-shape processing corpus;
the model configuration module is used for configuring a first decision model and a first open source package; the model configuration module trains the first decision model based on the corpus to be used and the first open source package to obtain a decision model to be used; the model configuration module is further used for configuring a first inspection model, and sets an error detection network in the model architecture of the first inspection model to obtain a second inspection model; the model configuration module trains the second inspection model based on the near-sound error sample to be used and the near-shape error sample to be used to obtain an inspection model to be used; the model configuration module is further used for configuring a post-processing prediction model;
the text inspection module is used for acquiring a text to be inspected; the text checking module is also used for setting a probability threshold value, an error screening characteristic and an error length characteristic value; the text inspection module executes text inspection error correction operation on the text to be inspected based on the substitution table to be used, the decision model to be used, the inspection model to be used, the post-processing prediction model, the probability threshold, the error screening characteristic and the error length characteristic value.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the high-accuracy Chinese spell checking method.
The invention has the beneficial effects that:
1. The high-accuracy Chinese spelling check method can efficiently generate training samples of high quality and high authenticity, improving the accuracy of the error detection model from the sample configuration side, and its sample generation approach is simple and convenient to extend and apply. The method independently configures a trained error detection model and a post-processing decision model to detect errors in Chinese text and performs high-quality error correction judgment after detection, further reducing the model's mis-correction rate and improving correction quality. It thereby makes up for the defects of the prior art: the overall Chinese text inspection is efficient and highly extensible, applies model combination learning, opens up a new approach to Chinese spelling error detection, and has extremely high market and application value when deployed in unsupervised scenarios.
2. The high-accuracy Chinese spelling check system, through the cooperation of the corpus acquisition module, the substitution table configuration module, the error sample generation module, the model configuration module and the text inspection module, can efficiently generate training samples of high quality and high authenticity, improving the accuracy of the error detection model from the sample configuration side, with a sample generation approach that is simple and easy to extend. The system independently configures the trained error detection model and post-processing decision model to detect errors in Chinese text and performs high-quality error correction judgment after detection, further reducing the model's mis-correction rate, improving correction quality, and making up for the defects of the prior art; the overall Chinese text inspection and correction is efficient and highly extensible, applies model combination learning, opens up a new approach to Chinese spelling error detection, and can be executed in unsupervised scenarios, giving it extremely high market and application value.
3. The computer-readable storage medium realizes the cooperation of the corpus acquisition module, the substitution table configuration module, the error sample generation module, the model configuration module and the text inspection module, and thus realizes the generation of high-quality, high-authenticity training samples; it detects errors in Chinese text with an independently configured trained error detection model and post-processing decision model, performs high-quality error correction judgment after detection, further reduces the model's mis-correction rate, improves correction quality, opens up a new approach to Chinese spelling error detection, and effectively improves the operability of the high-accuracy Chinese spelling check method.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; other drawings can be obtained by those skilled in the art from these drawings without creative effort.
FIG. 1 is a flowchart of the high-accuracy Chinese spell checking method according to embodiment 1 of the present invention;
FIG. 2 is a flowchart illustrating step S100 of the high-accuracy Chinese spell checking method according to embodiment 1 of the present invention;
FIG. 3 is a flowchart illustrating step S200 of the high-accuracy Chinese spell checking method according to embodiment 1 of the present invention;
FIG. 4 is a flowchart illustrating step S300 of the high-accuracy Chinese spell checking method according to embodiment 1 of the present invention;
FIG. 5 is a flowchart illustrating step S400 of the high-accuracy Chinese spell checking method according to embodiment 1 of the present invention;
FIG. 6 is a flowchart illustrating step S500 of the high-accuracy Chinese spell checking method according to embodiment 1 of the present invention;
FIG. 7 is an architecture diagram of the high-accuracy Chinese spell checking system of embodiment 2 of the present invention.
Detailed Description
The following detailed description of preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the invention easier for those skilled in the art to understand, and will thus define the protection scope of the invention more clearly.
In the description of the present invention, it should be noted that the described embodiments of the present invention are a part of the embodiments of the present invention, and not all embodiments; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", "third", "fourth", "fifth" and "sixth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, unless otherwise specified or limited, the terms "target website", "cleaning feature data", "symbol reservation cutting logic", "cutting identifier", "corpus cleaning operation", "corpus to be used", "word frequency value", "character frequency value", "front and back nasal sound disambiguation logic", "compound word segmentation logic", "simplified-traditional word list", "near-sound sample word", "word list to be replaced", "near-sound near-shape table", "substitution table to be used", "corpus to be near-sound processed", "corpus to be near-shape processed", "random position recall algorithm", "probability limit recall algorithm", "multi-error recall algorithm", "sample random selection algorithm", "near-sound error sample to be used", "near-shape error sample to be used", "decision model", "inspection model", "error detection network", "post-processing prediction model", "text to be inspected", "error screening feature", "error length characteristic value", "text inspection error correction operation", "regular expression", "word frequency ranking sequence", "character frequency ranking sequence", "pinyin mapping", "part-of-speech set to be filtered", "word recall set", "Cartesian product limiting formula", "single text recall probability", "sequence recall position", "recall limiting logic", "multi-dislocation setting algorithm", "selected quantity calculation formula", "error correction index", "corpus acquisition module", "substitution table configuration module", "error sample generation module", "model configuration module" and "text inspection module" should be understood in a broad sense. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific circumstances.
In the description of the present invention, it should be noted that: seq2seq is a natural language processing model; Python is a programming language; Unihan is a Chinese character database; n-gram is a statistical language model; MacBERT is a Chinese pre-trained language model; RoBERTa is a pre-trained language model.
Example 1
The embodiment provides a high-accuracy Chinese spell checking method, as shown in fig. 1 to 6, including the following steps:
s100, a corpus obtaining step, which specifically comprises the following steps:
S110, screening a target website, and crawling data from the target website to obtain a first corpus. For any algorithm system and processing model, the quality and characteristics of the underlying data have an important influence on processing quality and model accuracy; in this embodiment, step S100 therefore performs the primary configuration of that underlying data, namely the corpus. Taking data quality as the starting point, basic data acquisition channels are screened: source platforms of government documents, or other platforms with high requirements on text quality, are selected as the target websites, because such texts have a low error probability, high sentence fluency and high data quality. Python libraries such as requests are used to crawl the data, and the open-source project ProxyPool is adopted as the proxy pool, yielding the primary first corpus;
S120, configuring a first open source project, and setting cleaning feature data, symbol-preserving cut logic, a cut identifier and a length interval. Correspondingly, in this embodiment the first open source project is pyltp; the cleaning feature data are features of irregular symbols; the symbol-preserving cut logic keeps paired punctuation marks in the corpus intact when the corpus is cut; the cut identifiers are sentence-end marks in the corpus such as question marks, periods and exclamation marks; and the length interval in this embodiment is [8, 126];
S130, performing a corpus cleaning operation on the first corpus based on the first open source project, the cleaning feature data, the symbol-preserving cut logic, the cut identifier and the length interval, to obtain a corpus to be used. Correspondingly, the corpus cleaning operation cleans the first corpus obtained in step S110 according to the content configured in step S120, further improving the normalization and quality of the corpus;
specifically, the corpus cleaning operation includes:
The first corpus is cleaned based on a regular expression and the cleaning feature data to obtain a second corpus; that is, irregular symbols in the first corpus are removed with the regular expression, yielding a second corpus of higher quality. The second corpus is then cut based on the symbol-preserving cut logic and the cut identifier to obtain a third corpus. Sentence cutting is a common corpus-processing step, but to ensure that each cut sentence carries its own meaning without requiring contextual support, thereby improving corpus authenticity and reducing ambiguity, this embodiment uses the previously configured cut logic: each sentence is cut at its sentence-end marks (period, question mark or exclamation mark) while paired punctuation inside the sentence is preserved. After cutting, filtering is required: sentences whose length falls outside the length interval are removed from the third corpus, giving a fourth corpus whose texts have lengths between 8 and 126 inclusive. Finally, word segmentation is performed on the fourth corpus based on the first open source project to obtain the corpus to be used. The corpus to be used is the first corpus after the whole cleaning flow, and can be put into subsequent model training and algorithm use.
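The cleaning flow above can be sketched as follows. This is a minimal illustration, not the patented implementation: the irregular-symbol pattern and the set of paired punctuation marks are assumptions, since the patent names the logic but not the exact characters.

```python
import re

# Hypothetical "cleaning feature data": a pattern of irregular symbols to strip.
IRREGULAR = re.compile(r"[#@*▲■]+")
# Cut identifiers from step S120: period, question mark, exclamation mark.
CUT_MARKS = "。？！"

def clean_and_cut(text, min_len=8, max_len=126):
    """Clean a raw corpus string, cut it at sentence-end marks while keeping
    paired punctuation intact, then keep sentences with length in [8, 126]."""
    text = IRREGULAR.sub("", text)          # corpus cleaning via regex
    pairs = {"（": "）", "“": "”", "《": "》"}   # assumed paired-symbol set
    closers = set(pairs.values())
    sentences, buf, depth = [], "", 0
    for ch in text:
        buf += ch
        if ch in pairs:
            depth += 1                      # inside a paired symbol: do not cut
        elif ch in closers and depth > 0:
            depth -= 1
        elif ch in CUT_MARKS and depth == 0:
            sentences.append(buf)           # cut at a sentence-end mark
            buf = ""
    if buf:
        sentences.append(buf)
    # length-interval filter of step S130
    return [s for s in sentences if min_len <= len(s) <= max_len]
```

A short sentence such as "短。" is filtered out by the length interval, while a sentence containing a period inside brackets is not split there.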
S200, a replacement table configuration step, which specifically comprises:
S210, setting a first character frequency value, a second character frequency value and a first word frequency value. In this embodiment, step S200 also configures bottom-layer data support, specifically the substitution table; the generated substitution table is used to produce high-quality, highly realistic samples for subsequent model training. Correspondingly, tests based on the intelligent-recommendation behaviour of input methods show that a user's input errors usually arise because the intended character or word is not at or near the top of the input method's recommendation list; that is, the erroneous segment is entered in the input method more often than the correct one. A second cause is that the erroneous segment is unfamiliar or uncommon to the user, so the inputter mistypes it. Taking this as the starting point, the substitution table is configured according to the frequency of characters and words in the corpus. Furthermore, because the substitution table is used to generate error samples, building it from analysis of real user input greatly improves the authenticity of the error samples and hence the error-correction accuracy of the model. In this embodiment, the first character frequency value, the second character frequency value and the first word frequency value are all percentages: the first character frequency value is 50%, the second character frequency value is 10%, and the first word frequency value is 30%. The content of step S210 is used to generate the word list to be replaced;
S220, configuring front/back nasal sound disambiguation logic, compound word segmentation logic, a second open source project, a first simplified-traditional word list and a plurality of first near-sound sample words. In this embodiment, the front/back nasal sound disambiguation logic applies an additional nasal-final adjustment to the pinyin mapping of words containing nasal sounds. The compound word segmentation logic performs word segmentation on long phrases composed of several words; for example, "error sample" is composed of the two words "error" and "sample", and this segmentation logic further improves the coverage of the substitution table. The second open source project is pycorrector in this embodiment. The first simplified-traditional word list consists of the traditional characters in the Unihan simplified-traditional table and is used to filter out traditional-character text. In this embodiment, the first near-sound sample words are about 2,000,000 word samples obtained by merging the jieba lexicon, the Sogou lexicon and Baidu's open-source language-understanding knowledge base. Correspondingly, the content configured in step S220 is used to generate the near-sound near-shape table;
S230, performing a replacement word list generation operation based on the corpus to be used, the first character frequency value, the second character frequency value and the first word frequency value, to obtain the word list to be replaced;
Specifically, the replacement word list generation operation includes: performing character frequency statistics and word frequency statistics on the corpus to be used, where character frequency and word frequency are the input occurrence probabilities of characters and words respectively; after the mapping from each character or word to its probability is obtained, the characters and words are sorted in descending order of probability to obtain a character frequency ranking sequence and a word frequency ranking sequence. In the character frequency ranking sequence, a first character corpus covering the first character frequency value is selected in a first direction, and a second character corpus covering the second character frequency value is selected in a second direction; in this embodiment the first direction is top-down and the second direction is bottom-up, so the first character corpus comprises the top 50% of characters in the ranking and the second character corpus the bottom 10%. In the word frequency ranking sequence, a first word corpus covering the first word frequency value is selected in the first direction, i.e. the top 30% of words. The word list to be replaced is then constructed from the first character corpus, the second character corpus and the first word corpus, i.e. these characters and words are sorted and integrated into a word list to be replaced that can be put into use.
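The frequency-ranking selection above can be sketched as a short routine. The percentages are those of step S210; the tie-breaking order within equal frequencies is an implementation detail not fixed by the patent.

```python
from collections import Counter

def build_replacement_wordlist(segmented_corpus,
                               char_top=0.5, char_bottom=0.1, word_top=0.3):
    """Build the word list to be replaced from a word-segmented corpus:
    top-50% characters (first direction), bottom-10% characters (second
    direction), and top-30% multi-character words by frequency."""
    words = [w for sent in segmented_corpus for w in sent]
    chars = [c for w in words for c in w]
    # descending-frequency ranking sequences
    char_rank = [c for c, _ in Counter(chars).most_common()]
    word_rank = [w for w, _ in Counter(words).most_common() if len(w) > 1]
    n_c, n_w = len(char_rank), len(word_rank)
    first_chars = char_rank[: max(1, int(n_c * char_top))]          # top-down
    second_chars = char_rank[n_c - max(1, int(n_c * char_bottom)):]  # bottom-up
    first_words = word_rank[: max(1, int(n_w * word_top))]
    return set(first_chars) | set(second_chars) | set(first_words)
```

Both very frequent items (often mistyped via input-method recommendation) and very rare characters (unfamiliar to the user) end up in the list, matching the two error causes described in S210.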
S240, performing a near-sound near-shape table generating operation based on the front and back nasal sound disambiguation logic, the compound word segmentation logic, the second open source item and the first near-sound sample words to obtain a near-sound near-shape table;
specifically, the operation of generating the near-sound near-shape table includes:
Creating a plurality of pinyin mappings respectively matched with the first near-sound sample words, i.e. one mapping from each first near-sound sample word to its pinyin, and sorting the first near-sound sample words together with their pinyin mappings to obtain a first near-sound word list. Word segmentation is then performed on the first near-sound sample words in the first near-sound word list based on the compound word segmentation logic to obtain a plurality of second near-sound sample words; that is, compound words are segmented to further expand the sample words. The pinyin mappings are updated based on the second near-sound sample words to obtain a second near-sound word list, i.e. the segmented sample words are re-mapped to pinyin to keep the word list correct. The pinyin mappings of the second near-sound sample words in the second near-sound word list are then disambiguated based on the front/back nasal sound disambiguation logic to obtain the near-sound word list to be used. For example, if a second near-sound sample word maps to the pinyin yinggai, the word contains a nasal final, so the front/back nasal sound disambiguation logic also registers the front-nasal variant yingai for it; after this nasal disambiguation the usable near-sound word list to be used is obtained. For the near-shape table, the near-shape table of the second open source project is directly merged and repaired to obtain the near-shape word list to be used. Finally, the near-sound word list to be used and the near-shape word list to be used are integrated to obtain the near-sound near-shape table.
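The nasal disambiguation step can be sketched as below. The two-entry table is toy data standing in for the roughly 2,000,000-word list, and the simple `-ng` → `-n` substitution is an assumed reading of the front/back nasal logic.

```python
# Toy pinyin mapping (hypothetical data; the real table is built from the
# jieba/Sogou/Baidu lexicons described in S220).
PINYIN = {"应该": "yinggai", "银行": "yinhang"}

def nasal_disambiguate(mapping):
    """For each pinyin containing a back-nasal final (-ng), also register the
    front-nasal variant (-n), so yinggai and yingai both map to 应该."""
    out = {}
    for word, py in mapping.items():
        variants = {py}
        if "ng" in py:
            variants.add(py.replace("ng", "n"))  # yinggai -> yingai
        out[word] = variants
    return out

def build_recall_index(disambiguated):
    """Invert word -> pinyins into pinyin -> words for near-sound recall."""
    index = {}
    for word, pys in disambiguated.items():
        for py in pys:
            index.setdefault(py, []).append(word)
    return index
```

With this index, a user who types the front-nasal pinyin yingai still recalls the word 应该, which is exactly what the disambiguation is for.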
S250, setting the first simplified and traditional word list as a Chinese traditional word list, and integrating the word list to be replaced, the near-sound near-form list and the Chinese traditional word list to obtain a replacement list to be used; the substitution table to be used is used for subsequent model training and error detection processing.
S300, an error sample generating step, which specifically comprises the following steps:
S310, setting a near-sound error probability and a near-shape error probability. Correspondingly, step S300 is an important step of the method: combined with the corpus to be used and the substitution table to be used obtained in steps S100 and S200, it generates error samples of high quality and high authenticity that match the real environment. Based on analysis of real scenarios, a near-sound error is most likely produced by an inputter using a pinyin input method, while a near-shape error is most likely produced by one using a five-stroke (wubi) input method, so this step mainly generates near-sound error samples and near-shape error samples. Further analysis shows that, for the same sentence, the inputter essentially never mixes pinyin input and five-stroke input, so this step does not generate mixed samples containing both near-sound and near-shape errors: for each sentence, a decision is made according to the near-sound error probability and the near-shape error probability as to whether a near-sound or a near-shape error sample is generated, in full agreement with the real environment. In this embodiment, since pinyin input methods have many more users, the near-sound error probability is set to 0.7 and the near-shape error probability to 0.3;
S320, screening the corpus to be near-sound processed and the corpus to be near-shape processed from the corpus to be used, based on the near-sound error probability and the near-shape error probability. The number of corpora to be used is multiplied by the near-sound error probability and by the near-shape error probability, and the corpus to be used is divided accordingly into the corpus to be near-sound processed and the corpus to be near-shape processed. For example, if there are 10 corpus sentences, 7 of them are used to generate near-sound error samples (the corpus to be near-sound processed) and 3 to generate near-shape error samples (the corpus to be near-shape processed);
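The partition in S320 reduces to multiplying the corpus size by each probability; a minimal sketch (the shuffle-then-split strategy is an assumption, since the patent does not say how sentences are assigned to each pool):

```python
import random

def split_corpus(sentences, p_near_sound=0.7, seed=0):
    """Partition the corpus to be used into a near-sound pool and a
    near-shape pool; pool sizes are corpus size times each probability."""
    rng = random.Random(seed)
    pool = list(sentences)
    rng.shuffle(pool)
    k = round(len(pool) * p_near_sound)   # e.g. 10 sentences -> 7 near-sound
    return pool[:k], pool[k:]             # (near-sound pool, near-shape pool)
```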
S330, configuring a random position recall algorithm, a probability-limiting recall algorithm, a multiple-error recall algorithm and a sample random selection algorithm. In this embodiment, the random position recall algorithm, the probability-limiting recall algorithm and the multiple-error recall algorithm are generation logics for different kinds of error samples, producing, respectively, error samples with random error positions, error samples recalled under a probability constraint, and error samples containing several errors in one sentence; the sample random selection algorithm is a random sample extraction logic that reduces the number of samples, prevents the sample count from becoming excessive, and ensures sample validity;
s340, executing a near-sound error sample generating operation based on the to-be-used substitution table, the to-be-near-sound processing corpus, the random position recall algorithm, the probability limit recall algorithm, the multiple error recall algorithm and the sample random selection algorithm, and generating a to-be-used near-sound error sample;
specifically, the operation of generating the near-tone error sample includes:
In this embodiment, to prevent the model from over-fitting, entity person names and organization names must be filtered out during error-sample generation and not used as error samples, which further improves error-correction accuracy; therefore a part-of-speech set to be filtered, i.e. entity person names and organization names, is created by methods including but not limited to part-of-speech tagging. A first word recall set is created; in this embodiment it is an empty candidate set used to store recalled texts to be replaced. A first length threshold and a first Cartesian product restriction formula are set; in this embodiment the first length threshold is 1, and the first Cartesian product restriction formula is: average recall number per pinyin = number of word-level recalled candidate words × word-level Cartesian-product recall probability (0.3 in this embodiment) / number of pinyins in the selected text + 1. The first Cartesian product restriction formula balances the number of recalled words while improving randomness. The random position recall algorithm is then called with the part-of-speech set to be filtered, the first word recall set, the first length threshold, the first Cartesian product restriction formula, the corpus to be near-sound processed, the near-sound word list to be used and the Chinese traditional word list, to obtain a plurality of first near-sound error texts;
A single-text recall probability and sequence recall positions are set; in this embodiment the single-text recall probability is 0.03 and the sequence recall positions are all positions of the text sequence. A recall-limiting logic is set based on the word list to be replaced; in this embodiment the logic is that a fragment to be recalled must exist in the word list to be replaced, which further restricts recall and improves its quality. The probability-limiting recall algorithm is then called with the single-text recall probability, the sequence recall positions, the recall-limiting logic, the corpus to be near-sound processed and the near-sound word list to be used, to obtain a plurality of second near-sound error texts;
A second length threshold and a multi-dislocation setting algorithm are set, the second length threshold being 32 in this embodiment; the multiple-error recall algorithm is called with the second length threshold, the multi-dislocation setting algorithm, the corpus to be near-sound processed and the near-sound word list to be used, to obtain a plurality of third multiple-error samples;
in order to reduce the number of samples and improve the effectiveness of the samples, the sample random selection algorithm is called based on the third multiple error samples, the first near-sound error texts and the second near-sound error texts to obtain a plurality of near-sound error samples to be used.
Specifically, the random location recall algorithm is as follows:
A first text sequence is selected from the corpus to be near-sound processed, and a first segment is selected at random from the first text sequence. Data recall processing is performed on the first word recall set based on the first length threshold, the first Cartesian product restriction formula, the near-sound word list to be used and the first segment; after the data recall, the first word recall set stores the sample words for the error samples to be generated. The first word recall set is then screened with the Chinese traditional word list and the part-of-speech set to be filtered, i.e. traditional characters and entity names are filtered out of its sample words, giving a second word recall set. A plurality of first near-sound error texts are then generated from the sample words in the second word recall set and the first text sequence: the part of the first text sequence corresponding to the first segment is replaced by each of the corresponding words in the second word recall set, yielding a plurality of different near-sound error samples;
specifically, the data recall processing includes:
identifying a first length value for the first segment;
if the first length value matches the first length threshold and the first segment is a Chinese character, the first segment is a single character; its first pinyin is therefore identified, and first sample words related to the first pinyin are recalled into the first word recall set based on the near-sound word list to be used, i.e. the characters or words whose pinyin mapping corresponds to the first pinyin in the near-sound word list to be used are recalled into the first word recall set;
if the first length value does not match the first length threshold, the first segment is a word or phrase, so a first pinyin combination of the first segment, i.e. the pinyin of the whole segment, is identified. A plurality of second pinyins are set based on the first pinyin combination, the second pinyins being the pinyin of each character in the first pinyin combination; the logic is to perform near-sound candidate recall both on the whole pinyin of the phrase and on each character in it, further improving recall quality. Second sample words related to the first pinyin combination are therefore recalled into the first word recall set based on the near-sound word list to be used, and third sample words related to the second pinyins are recalled into the first word recall set based on the first Cartesian product restriction formula and the near-sound word list to be used.
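The two recall branches can be sketched as follows. The near-sound table here is toy data, and the grouping of the first Cartesian product restriction formula is an assumption (read as: per-pinyin average = word-level candidates × 0.3 / pinyin count + 1), since the original phrasing is ambiguous.

```python
# Toy near-sound table: pinyin -> candidate words (hypothetical entries).
NEAR_SOUND = {
    "ji": ["机", "级", "即"],
    "qi": ["其", "期", "起"],
    "jiqi": ["机器"],
}

def recall_candidates(segment, seg_pinyins, table=NEAR_SOUND, cartesian_p=0.3):
    """Data recall of S194-S197: a single character recalls by its pinyin;
    a longer segment recalls by its whole-pinyin string, plus a capped
    per-character recall governed by the Cartesian restriction formula."""
    recalled = []
    if len(segment) == 1:                       # single character branch
        recalled += table.get(seg_pinyins[0], [])
    else:                                       # word/phrase branch
        whole = "".join(seg_pinyins)
        whole_hits = table.get(whole, [])
        recalled += whole_hits                  # whole-pinyin recall
        # assumed reading of the first Cartesian product restriction formula
        per_pinyin = int(len(whole_hits) * cartesian_p / len(seg_pinyins)) + 1
        for py in seg_pinyins:                  # capped per-character recall
            recalled += table.get(py, [])[:per_pinyin]
    return recalled
```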
Specifically, the probability limit recall algorithm is as follows:
In this embodiment, to improve error-correction accuracy, several different near-sound samples may be generated for the same text sequence, so the second text sequence may be the same as the first text sequence or a different one. Based on the near-sound word list to be used, text at the sequence recall positions of the second text sequence is recalled with the single-text recall probability to obtain a plurality of fourth recall sample words. In this embodiment the single-text recall probability, applied at every sequence recall position, is the recall probability of each character position in a text sequence; with a single-text recall probability of 0.03, the probability that nothing is recalled in a text of length T is (1 − 0.03)^T, i.e. 0.97^T. The longer a text is, therefore, the more likely it is to contain a recalled character or word, and in this embodiment a text whose length exceeds 32 is treated as certain to contain one. This logic matches the real situation, namely that the longer the text, the more likely the user is to make an input error, and it is the core of the algorithm. On this basis, the corresponding words to be recalled are confirmed in the near-sound word list to be used for the recalled positions of the second text sequence, giving the fourth recall sample words. The algorithm imposes a further restriction: the fourth recall sample words are screened with the recall-limiting logic, i.e. the text fragment corresponding to each recalled sample must exist in the word list to be replaced, and the screening yields a plurality of fifth recall sample words. Finally, using the same replacement logic as in the random position recall algorithm, a plurality of second near-sound error texts are generated from the fifth recall sample words and the second text sequence;
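The per-position recall with its word-list restriction can be sketched in a few lines; this is an illustrative reading, not the patented implementation.

```python
import random

def probability_limit_recall(tokens, word_list_to_replace, p=0.03, seed=None):
    """S198-S200 sketch: every position of the sequence is recalled with
    probability p, and a recalled fragment must already exist in the word
    list to be replaced (the recall-limiting logic). With p = 0.03, a
    length-T text escapes recall with probability 0.97**T, so longer texts
    are more likely to receive an injected error."""
    rng = random.Random(seed)
    return [i for i, tok in enumerate(tokens)
            if rng.random() < p and tok in word_list_to_replace]
```

Note that 0.97**32 is about 0.38, so even at length 32 a sizeable fraction of texts would escape recall under this probability alone; the multiple-error recall algorithm below additionally bounds the error count by sentence length.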
specifically, the multi-error recall algorithm is as follows:
A third text sequence is selected from the corpus to be near-sound processed based on the second length threshold; in this embodiment the third text sequence is a sequence whose text length is greater than 32 (the second length threshold). A plurality of error replacements are set in the third text sequence based on the multi-dislocation setting algorithm. In this embodiment, the multi-dislocation setting algorithm randomly selects several to-be-recalled positions of the third text sequence as error replacement positions, with the following logic. First, the to-be-recalled text positions in the third text sequence are determined with the single-text recall probability of the probability-limiting recall algorithm; since the second length threshold is 32 and the third text sequence is necessarily longer than 32, at least one candidate recall position will arise in it. Let d be the total number of candidate positions so determined. To limit the number of multiple errors, an error_count calculation formula is set based on the second length threshold 32 and the length l of the third text sequence; an error_count value is computed from l with this formula, compared with d, and the smaller of the two is taken as the number of error replacement positions to generate in the third text sequence. For example, when d > error_count, error_count is taken as the number of error replacement positions, and error_count positions are then chosen at random from the d candidates. The random-selection logic of the multi-dislocation setting algorithm thus guarantees randomness on one hand and limits the number of errors per sentence on the other, matching the actual situation;
In this embodiment, the error_count calculation formula is a function of the sentence length l and the second length threshold 32 (the formula is given as an image, DEST_PATH_IMAGE001, in the original publication).
Correspondingly, the plurality of error replacement positions are the minimum-number candidate replacement positions selected in the third text sequence as confirmed by the multi-dislocation setting algorithm. A plurality of sixth candidate sample words related to the error replacement positions are obtained from the near-sound word list to be used, i.e. the sixth candidate sample words for each error replacement position are confirmed by the pinyin-mapping screening principle. A third multiple-error sample is generated from the error replacement positions, the sixth candidate sample words and the third text sequence: each error replacement position in the third text sequence is replaced with its corresponding sixth candidate sample word, yielding a third multiple-error sample;
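The position-selection logic can be sketched as follows. Since the error_count formula is published only as an image, `text_len // threshold` is used here as an assumed stand-in that grows with sentence length, which is all the surrounding text guarantees about the real formula.

```python
import random

def multi_error_positions(text_len, candidate_positions, threshold=32, seed=2):
    """S201-S206 sketch: take min(d, error_count) error replacement
    positions, drawn at random from the d candidate positions."""
    error_count = text_len // threshold   # ASSUMED formula, not the patent's
    d = len(candidate_positions)
    k = min(d, error_count)               # the smaller of d and error_count
    rng = random.Random(seed)
    return sorted(rng.sample(candidate_positions, k))
```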
specifically, the sample random selection algorithm is as follows:
Setting a selected-quantity calculation formula, namely: selected quantity value = first quantity sum value / 2. The first quantity sum value is counted as the total of the number of third multiple-error samples, the number of first near-sound error texts and the number of second near-sound error texts; since it is the total sample count, substituting it into the selected-quantity calculation formula gives the selected quantity value. According to the selected quantity value, that many error samples are then chosen at random from the third multiple-error samples, the first near-sound error texts and the second near-sound error texts as the near-sound error samples to be used.
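The selected-quantity rule amounts to keeping half of the pooled samples at random; a minimal sketch:

```python
import random

def select_samples(multi_error, near_sound_1, near_sound_2, seed=3):
    """S207-S208: pool all generated error samples, then keep
    selected quantity = first quantity sum / 2 of them at random."""
    pool = list(multi_error) + list(near_sound_1) + list(near_sound_2)
    k = len(pool) // 2                 # selected quantity value = sum / 2
    return random.Random(seed).sample(pool, k)
```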
S350, generating near-shape error samples to be used from the corpus to be near-shape processed, using the operation logic of the near-sound error sample generation operation. Specifically, the near-shape error samples to be used are generated with the same operation logic, except that every replacement based on the near-sound word list to be used is converted into a replacement based on the near-shape word list to be used, and near-shape sample words are recalled by glyph (character shape) rather than by pinyin. Correspondingly, the near-sound error samples to be used and the near-shape error samples to be used generated in step S300 have extremely high authenticity and are consistent with the real situation.
S400, configuring the model, specifically comprising:
S410, configuring a first decision model and a first open source package, and training the first decision model based on the corpus to be used and the first open source package to obtain the decision model to be used. In this embodiment the first decision model is an n-gram language model and the first open source package is the kenlm package; training the first decision model means training the n-gram language model at word granularity with kenlm, yielding the decision model to be used. The decision model to be used further adjudicates the error-correction result, improving error-correction accuracy and reducing the false-correction rate;
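The decision rule (a corrected sentence should score no worse than the original under a language model) can be illustrated with a tiny add-one-smoothed bigram model. This pure-Python class is only a stand-in for the kenlm-trained model of S410, which would be used in practice.

```python
import math
from collections import Counter

class BigramLM:
    """A minimal word-level bigram language model with add-one smoothing,
    illustrating the n-gram decision model's scoring role."""
    def __init__(self, segmented_sentences):
        self.uni, self.bi = Counter(), Counter()
        for sent in segmented_sentences:
            toks = ["<s>"] + list(sent) + ["</s>"]
            self.uni.update(toks[:-1])          # context (history) counts
            self.bi.update(zip(toks, toks[1:]))  # bigram counts
        self.vocab = len(self.uni) + 1

    def score(self, sent):
        """log10 probability of a segmented sentence; the decision model
        keeps a correction only if it scores at least as well as the
        original sentence."""
        toks = ["<s>"] + list(sent) + ["</s>"]
        return sum(math.log10((self.bi[(a, b)] + 1) /
                              (self.uni[a] + self.vocab))
                   for a, b in zip(toks, toks[1:]))
```

A fluent word order from the training corpus scores higher than a scrambled one, which is exactly the signal used to veto bad corrections.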
S420, configuring a first inspection model, and adding an error detection network to the model architecture of the first inspection model to obtain a second inspection model; the second inspection model is trained on the near-sound error samples to be used and the near-shape error samples to be used to obtain the inspection model to be used. In this embodiment the first inspection model is a MacBERT pre-trained language model and the error detection network is a neural network based on a binary classification model. Correspondingly, in this embodiment, adding the error detection network to adjust the MacBERT pre-trained language model further improves the model's joint learning and makes full use of the semantic knowledge of the pre-trained language model. The inspection model to be used inspects the text to be checked and outputs the corresponding predicted errors;
S430, configuring a post-processing prediction model; in this embodiment the post-processing prediction model is a RoBERTa model used for subsequent decision processing. Step S400 mainly configures these models and trains them with the data obtained in steps S100, S200 and S300, so that usable models are available for the subsequent process.
S500, text checking, which specifically comprises the following steps:
s510, acquiring a text to be checked; the text to be checked is a Chinese text which needs to be subjected to error correction analysis and is input by a user;
S520, setting a probability threshold, error screening characteristics and an error length characteristic value; in this embodiment, the probability threshold is the criterion for judging the error probability of a predicted error output by the MacBERT pre-trained language model; the error screening characteristics are preset whitelist characteristics, that is, some detected errors actually belong to proper nouns, special terms, common expressions and the like, and such errors must be further screened out to prevent false corrections; the error length characteristic value is set to 1 in this embodiment;
S530, performing a text inspection and error correction operation on the text to be checked based on the substitution table to be used, the decision model to be used, the inspection model to be used, the post-processing prediction model, the probability threshold, the error screening characteristics and the error length characteristic value;
specifically, the text checking and error correcting operation includes:
inputting the text to be checked into the inspection model to be used to obtain predicted suspicious errors and their error probabilities; after the text to be checked is input, it is processed by the internal architecture of the inspection model to be used, analyzed by the error detection network and the prediction network respectively, and the loss is computed; the model finally outputs the error portions detected in the text to be checked, namely the predicted suspicious errors and their error probabilities;
comparing the error probability with the probability threshold; if the error probability reaches the probability threshold, the predicted suspicious error is considered reliable and should be corrected, so error correction processing is performed on it based on the word list to be replaced, that is, the predicted suspicious error in the text to be checked is replaced according to the mapping in the word list to be replaced, generating a new, correct text;
if the error probability does not reach the probability threshold, the predicted suspicious error may be inaccurate and needs further analysis and decision; in this embodiment, considering that the predicted suspicious errors output by the inspection model to be used are all single characters, each a single-character error that may nevertheless combine into a correct specific word, adjacent-error combination processing is performed, that is, adjacent predicted suspicious errors are merged to obtain a first error to be judged; feature filtering is then performed on the first error to be judged based on the error screening characteristics, and the filtering result determines whether a second error to be judged is obtained;
if a second error to be judged is obtained, that is, the merged first error to be judged contains characters or words that do not belong to the specific characters or words in the error screening characteristics, the second error to be judged must be judged again, so advanced decision processing is performed on it; if no second error to be judged is obtained, the merged first error to be judged belongs to specific characters or words in the error screening characteristics and needs no correction, so error retention processing is performed on the predicted suspicious error;
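The adjacent-error combination and whitelist filtering described above can be sketched as follows. The positions and whitelist entries are invented examples for illustration, not data from this embodiment:

```python
# Single-character suspicious errors at neighbouring positions are merged
# into one span, then spans matching the whitelist (error screening
# characteristics) are dropped so they are never "corrected".
def merge_adjacent(errors):
    """errors: sorted list of (position, char). Returns merged (start, text) spans."""
    spans = []
    for pos, ch in errors:
        if spans and pos == spans[-1][0] + len(spans[-1][1]):
            # Contiguous with the previous span: extend it.
            spans[-1] = (spans[-1][0], spans[-1][1] + ch)
        else:
            spans.append((pos, ch))
    return spans

def filter_whitelist(spans, whitelist):
    # Keep only spans that are NOT whitelisted proper nouns / common expressions.
    return [s for s in spans if s[1] not in whitelist]

errors = [(2, "巴"), (3, "黎"), (7, "明")]
spans = merge_adjacent(errors)
print(spans)                              # [(2, '巴黎'), (7, '明')]
remaining = filter_whitelist(spans, {"巴黎"})
print(remaining)                          # [(7, '明')]
```

Here the two single-character "errors" at positions 2 and 3 merge into the proper noun 巴黎, which the whitelist then removes from further correction, while position 7 survives as a second error to be judged.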
specifically, the advanced decision processing includes:
identifying the error length value of the second error to be judged, that is, the length of the text in the second error to be judged; if the error length value is greater than the error length characteristic value, gram scoring decision processing is performed based on the second error to be judged and the predicted suspicious error; if the error length value is not greater than the error length characteristic value, a second prediction with a model is needed, so prediction processing is performed on the second error to be judged based on the post-processing prediction model to obtain a third error to be judged, and gram scoring decision processing is then performed based on the third error to be judged and the predicted suspicious error; the gram scoring decision processing scores the errors with the trained n-gram model and judges from the scores whether a correction should be made;
specifically, the gram scoring decision processing includes: calling the decision model to be used to calculate a first error correction index for the predicted suspicious error and a second error correction index for the second error to be judged (or a third error correction index for the third error to be judged); comparing the first error correction index with the second (or third) error correction index; if the first error correction index is smaller, the predicted suspicious error should indeed be corrected and the judgment of the inspection model to be used was right, so error correction processing is performed on the predicted suspicious error based on the word list to be replaced; if the first error correction index is not smaller, the predicted suspicious error should not be corrected because the inspection model to be used misjudged it, so error retention processing is performed on the predicted suspicious error; error retention means the predicted suspicious error in the text to be checked is neither modified nor adjusted but kept as-is.
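The gram scoring decision reduces to comparing language-model scores with and without the candidate correction applied. The sketch below simplifies the two error correction indices to sentence-level scores and uses a hand-written toy scoring function as a stand-in for the kenlm-trained decision model:

```python
# The correction is accepted only when the corrected sentence scores better
# under the (stand-in) n-gram decision model than the original sentence.
def gram_decision(score_fn, original, corrected):
    """Return the sentence the decision model prefers."""
    return corrected if score_fn(corrected) > score_fn(original) else original

# Toy score: rewards sentences containing a known-good word. This is an
# invented stand-in, not an n-gram model.
toy_score = lambda s: (1.0 if "苹果" in s else 0.0) - 0.01 * len(s)

print(gram_decision(toy_score, "我喜欢平果", "我喜欢苹果"))  # 我喜欢苹果
```

When the scores do not favour the correction, the function returns the original sentence, which corresponds to the error retention processing above.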
Embodiment 2
The present embodiment provides a high-accuracy Chinese spell checking system based on the same inventive concept as the high-accuracy Chinese spell checking method described in embodiment 1; as shown in fig. 7, it includes: a corpus acquisition module, a substitution table configuration module, an error sample generation module, a model configuration module and a text inspection module;
in the high-accuracy Chinese spelling check system, the corpus acquisition module is used for screening a target website and crawling data from it to obtain a first corpus; the corpus acquisition module is further used for configuring a first open source project and for setting cleaning feature data, symbol reservation cutting logic, a cutting identifier and a length interval; the corpus acquisition module performs a corpus cleaning operation on the first corpus based on the first open source project, the cleaning feature data, the symbol reservation cutting logic, the cutting identifier and the length interval to obtain a corpus to be used;
specifically, the corpus cleaning operation includes: the corpus acquisition module performs corpus cleaning on the first corpus based on a regular expression and the cleaning feature data to obtain a second corpus; the corpus acquisition module cuts the second corpus based on the symbol reservation cutting logic and the cutting identifier to obtain a third corpus; the corpus acquisition module filters out, from the third corpus, corpora whose length falls outside the length interval to obtain a fourth corpus; and the corpus acquisition module performs word segmentation on the fourth corpus based on the first open source project to obtain the corpus to be used.
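A minimal sketch of this corpus cleaning operation is given below. The cleaning features (markup and URL stripping), the sentence-final cut identifiers and the length interval are invented examples, and the final word segmentation step (done with an open-source segmenter in the system) is omitted:

```python
import re

# Regex cleaning, symbol-preserving sentence cutting, and length filtering.
def clean_corpus(raw_texts, min_len=4, max_len=128):
    cleaned = []
    for text in raw_texts:
        text = re.sub(r"<[^>]+>", "", text)        # strip markup (cleaning feature)
        text = re.sub(r"https?://\S+", "", text)   # strip URLs (cleaning feature)
        # Cut on Chinese sentence-final punctuation, keeping the symbol.
        sentences = re.findall(r"[^。！？]+[。！？]?", text)
        for s in sentences:
            s = s.strip()
            if min_len <= len(s) <= max_len:        # length-interval filter
                cleaned.append(s)
    return cleaned

raw = ["<p>今天天气很好。明天呢？参见 https://example.com </p>", "短。"]
print(clean_corpus(raw))  # → ['今天天气很好。', '明天呢？']
```

Fragments that are too short after cleaning (such as 参见 and 短。 above) are dropped by the length-interval filter, matching the fourth-corpus step.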
In the high-accuracy Chinese spelling check system, the substitution table configuration module is used for setting a first character frequency value, a second character frequency value and a first word frequency value; the substitution table configuration module is also used for configuring front/back nasal sound disambiguation logic, compound word segmentation logic, a second open source project, a first simplified-traditional character table and a plurality of first near-sound sample words; the substitution table configuration module executes a replacement word table generation operation based on the corpus to be used, the first character frequency value, the second character frequency value and the first word frequency value to obtain a word table to be replaced; the substitution table configuration module executes a near-sound near-shape table generation operation based on the front/back nasal sound disambiguation logic, the compound word segmentation logic, the second open source project and the first near-sound sample words to obtain a near-sound near-shape table; the substitution table configuration module sets the first simplified-traditional character table as a Chinese traditional character table, and integrates the word table to be replaced, the near-sound near-shape table and the Chinese traditional character table to obtain a substitution table to be used;
specifically, the replacement word table generation operation includes: the substitution table configuration module performs character frequency statistics and word frequency statistics on the corpus to be used to obtain a character frequency ranking sequence and a word frequency ranking sequence; in the character frequency ranking sequence, the substitution table configuration module selects first character corpora in a quantity corresponding to the first character frequency value in a first direction, and selects second character corpora in a quantity corresponding to the second character frequency value in a second direction; in the word frequency ranking sequence, the substitution table configuration module selects first word corpora in a quantity corresponding to the first word frequency value in the first direction; and the substitution table configuration module constructs the word table to be replaced based on the first character corpora, the second character corpora and the first word corpora.
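The frequency-ranked selection can be sketched as follows. For brevity the sketch only takes the top of each ranking (one "direction"); the operation above additionally selects from the opposite end of the character ranking. The corpus and cut-off values are invented examples:

```python
from collections import Counter

# Rank characters and words by frequency over the segmented corpus and keep
# the top-N of each; N stands in for the character/word frequency values.
def build_replace_table(segmented_corpus, top_chars=3, top_words=2):
    char_freq, word_freq = Counter(), Counter()
    for words in segmented_corpus:
        word_freq.update(words)
        for w in words:
            char_freq.update(w)  # count individual characters inside each word
    chars = [c for c, _ in char_freq.most_common(top_chars)]
    words = [w for w, _ in word_freq.most_common(top_words)]
    return chars, words

corpus = [["喜欢", "苹果"], ["喜欢", "香蕉"], ["苹果", "好吃"]]
chars, words = build_replace_table(corpus)
print(words)  # the two most frequent words: ['喜欢', '苹果']
```

High-frequency characters and words selected this way populate the word table to be replaced, which later constrains which substitutions are considered plausible.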
Specifically, the near-sound near-shape table generation operation includes: the substitution table configuration module creates a plurality of pinyin mappings respectively matched with the first near-sound sample words, and arranges the first near-sound sample words and the pinyin mappings to obtain a first near-sound word list; the substitution table configuration module performs word segmentation on the first near-sound sample words in the first near-sound word list based on the compound word segmentation logic to obtain a plurality of second near-sound sample words; the substitution table configuration module updates the pinyin mappings based on the second near-sound sample words to obtain a second near-sound word list; the substitution table configuration module disambiguates the pinyin mappings corresponding to the second near-sound sample words in the second near-sound word list based on the front/back nasal sound disambiguation logic to obtain a near-sound word list to be used; the substitution table configuration module acquires a near-form word list to be used based on the second open source project; and the substitution table configuration module integrates the near-sound word list to be used and the near-form word list to be used to obtain the near-sound near-shape table.
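The front/back nasal sound disambiguation idea can be sketched with a tiny hand-written mapping: commonly confused pinyin finals (in/ing, en/eng, an/ang) are collapsed to one canonical key so that near-sound candidates with either final collide in the lookup table. The pinyin spellings here are supplied by hand for illustration; a real system would derive them with a pinyin library:

```python
# Collapse back-nasal finals onto their front-nasal counterparts so that
# 心 (xin) and 星 (xing) land in the same near-sound bucket.
NASAL_MERGE = {"ing": "in", "eng": "en", "ang": "an"}

def canonical_pinyin(syllable):
    for back, front in NASAL_MERGE.items():
        if syllable.endswith(back):
            return syllable[: -len(back)] + front
    return syllable

near_sound_table = {}
for char, py in [("心", "xin"), ("星", "xing"), ("新", "xin")]:
    near_sound_table.setdefault(canonical_pinyin(py), []).append(char)

print(near_sound_table["xin"])  # ['心', '星', '新']
```

Grouping under one canonical key is what lets the near-sound word list retrieve candidates across the front/back nasal distinction, a very common source of Chinese typing errors.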
In the high-accuracy Chinese spelling check system, an error sample generation module is used for setting a near-sound error probability and a near-shape error probability; the error sample generation module screens out, from the corpus to be used, the corpus to be processed for near-sound errors and the corpus to be processed for near-shape errors based on the near-sound error probability and the near-shape error probability; the error sample generation module is also used for configuring a random position recall algorithm, a probability limit recall algorithm, a multi-error recall algorithm and a sample random selection algorithm; the error sample generation module executes a near-sound error sample generation operation based on the substitution table to be used, the corpus to be processed for near-sound errors, the random position recall algorithm, the probability limit recall algorithm, the multi-error recall algorithm and the sample random selection algorithm to generate near-sound error samples to be used; and the error sample generation module generates near-shape error samples to be used by applying the same operation logic to the corpus to be processed for near-shape errors;
specifically, the near-sound error sample generation operation includes: the error sample generation module creates a part-of-speech set to be filtered and a first word recall set; the error sample generation module sets a first length threshold and a first Cartesian product limiting formula; the error sample generation module calls the random position recall algorithm based on the part-of-speech set to be filtered, the first word recall set, the first length threshold, the first Cartesian product limiting formula, the corpus to be processed for near-sound errors, the near-sound word list to be used and the Chinese traditional character table to obtain a plurality of first near-sound error texts; the error sample generation module sets a single-text recall probability and a sequence recall position; the error sample generation module sets recall limiting logic based on the word table to be replaced; the error sample generation module calls the probability limit recall algorithm based on the single-text recall probability, the sequence recall position, the recall limiting logic, the corpus to be processed for near-sound errors and the near-sound word list to be used to obtain a plurality of second near-sound error texts; the error sample generation module sets a second length threshold and a multi-error-position setting algorithm; the error sample generation module calls the multi-error recall algorithm based on the second length threshold, the multi-error-position setting algorithm, the corpus to be processed for near-sound errors and the near-sound word list to be used to obtain third multi-error samples; and the error sample generation module calls the sample random selection algorithm based on the third multi-error samples, the first near-sound error texts and the second near-sound error texts to obtain the near-sound error samples to be used.
Specifically, the random position recall algorithm is as follows: the error sample generation module selects a first text sequence from the corpus to be processed for near-sound errors, and selects a first segment from the first text sequence; the error sample generation module performs data recall processing on the first word recall set based on the first length threshold, the first Cartesian product limiting formula, the near-sound word list to be used and the first segment; the error sample generation module screens the first word recall set after data recall processing based on the Chinese traditional character table and the part-of-speech set to be filtered to obtain a second word recall set; and the error sample generation module generates the plurality of first near-sound error texts based on the sample words in the second word recall set and the first text sequence;
specifically, the probability limit recall algorithm is as follows: the error sample generation module selects a second text sequence from the corpus to be processed for near-sound errors; the error sample generation module recalls, with the single-text recall probability, the text located at the sequence recall position in the second text sequence based on the near-sound word list to be used, obtaining a plurality of fourth recall sample words; the error sample generation module screens the fourth recall sample words based on the recall limiting logic to obtain a plurality of fifth recall sample words; and the error sample generation module generates the plurality of second near-sound error texts based on the fifth recall sample words and the second text sequence;
specifically, the multi-error recall algorithm is as follows: the error sample generation module selects a third text sequence from the corpus to be processed for near-sound errors based on the second length threshold; the error sample generation module sets a plurality of error replacement positions in the third text sequence based on the multi-error-position setting algorithm; the error sample generation module acquires a plurality of sixth candidate sample words for the error replacement positions based on the near-sound word list to be used; and the error sample generation module generates the third multi-error samples based on the error replacement positions, the sixth candidate sample words and the third text sequence;
specifically, the sample random selection algorithm is as follows: the error sample generation module sets a selection quantity calculation formula; the error sample generation module counts the total number of the third multi-error samples, the first near-sound error texts and the second near-sound error texts; the error sample generation module substitutes this total into the selection quantity calculation formula to obtain a selection quantity value; and the error sample generation module selects the near-sound error samples to be used from the third multi-error samples, the first near-sound error texts and the second near-sound error texts according to the selection quantity value.
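The core of these recall algorithms is substituting a near-sound character at a chosen position of a clean sentence to yield a (wrong, correct) training pair. The toy sketch below shows the random-position variant only; the near-sound table and sentence are invented examples, and the Cartesian-product limit, part-of-speech filtering, probability-limited recall, multi-error recall and sample random selection described above are omitted:

```python
import random

# Pick a random position whose character has near-sound neighbours and
# substitute one of them, producing a synthetic spelling error.
def make_near_sound_error(sentence, near_sound, rng):
    candidates = [i for i, ch in enumerate(sentence) if ch in near_sound]
    if not candidates:
        return sentence  # nothing replaceable: return the clean sentence
    i = rng.choice(candidates)
    wrong = rng.choice(near_sound[sentence[i]])
    return sentence[:i] + wrong + sentence[i + 1:]

# Invented near-sound table: 苹 (píng) ~ 平/评, 好 (hǎo) ~ 号 (hào).
near_sound = {"苹": ["平", "评"], "好": ["号"]}
rng = random.Random(0)  # seeded for reproducibility
noisy = make_near_sound_error("苹果很好吃", near_sound, rng)
print(noisy)
```

Each call yields a sentence of the same length differing from the original in exactly one character, which is precisely the (error text, clean text) pair shape the inspection model is trained on.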
In the high-accuracy Chinese spell checking system, a model configuration module is used for configuring a first decision model and a first open source package; the model configuration module trains the first decision model based on the corpus to be used and the first open source package to obtain a decision model to be used; the model configuration module is further used for configuring a first inspection model, and adds an error detection network to the model architecture of the first inspection model to obtain a second inspection model; the model configuration module trains the second inspection model based on the near-sound error samples to be used and the near-shape error samples to be used to obtain an inspection model to be used; and the model configuration module configures a post-processing prediction model.
In the high-accuracy Chinese spelling check system, a text inspection module is used for acquiring a text to be checked; the text inspection module is also used for setting a probability threshold, error screening characteristics and an error length characteristic value; the text inspection module performs a text inspection and error correction operation on the text to be checked based on the substitution table to be used, the decision model to be used, the inspection model to be used, the post-processing prediction model, the probability threshold, the error screening characteristics and the error length characteristic value;
specifically, the text inspection and error correction operation includes: the text inspection module inputs the text to be checked into the inspection model to be used to obtain predicted suspicious errors and error probabilities; the text inspection module compares the error probability with the probability threshold; if the error probability reaches the probability threshold, the text inspection module performs error correction processing on the predicted suspicious error based on the word list to be replaced; if the error probability does not reach the probability threshold, the text inspection module performs adjacent-error combination processing on the predicted suspicious errors to obtain a first error to be judged; the text inspection module performs feature filtering on the first error to be judged based on the error screening characteristics, and determines from the filtering result whether a second error to be judged is obtained; if a second error to be judged is obtained, the text inspection module performs advanced decision processing on it; and if no second error to be judged is obtained, the text inspection module performs error retention processing on the predicted suspicious error.
Specifically, the advanced decision processing includes: the text inspection module identifies the error length value of the second error to be judged; if the error length value is greater than the error length characteristic value, the text inspection module performs gram scoring decision processing based on the second error to be judged and the predicted suspicious error; if the error length value is not greater than the error length characteristic value, the text inspection module performs prediction processing on the second error to be judged based on the post-processing prediction model to obtain a third error to be judged, and then performs gram scoring decision processing based on the third error to be judged and the predicted suspicious error;
specifically, the gram scoring decision processing includes: the text inspection module calls the decision model to be used to calculate a first error correction index of the predicted suspicious error and a second error correction index of the second error to be judged or a third error correction index of the third error to be judged; the text checking module compares the first error correction index with the second error correction index or the third error correction index; if the first error correction index is smaller than the second error correction index or the third error correction index, the text inspection module performs error correction processing on the predicted suspicious errors based on the word list to be replaced; and if the first error correction index is not smaller than the second error correction index or the third error correction index, the text inspection module performs error retention processing on the predicted suspicious errors.
Embodiment 3
The present embodiment provides a computer-readable storage medium, wherein:
the storage medium stores computer software instructions implementing the high-accuracy Chinese spell checking method of embodiment 1, including a program for executing the method; specifically, the executable program may be built into the high-accuracy Chinese spell checking system of embodiment 2, so that the system implements the method of embodiment 1 by executing the built-in executable program.
Furthermore, the computer-readable storage medium of the present embodiments may take any combination of one or more readable storage media, where a readable storage medium includes an electronic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
Compared with the prior art, the high-accuracy Chinese spelling check method, system and medium can efficiently generate high-quality, highly realistic training samples, improving the accuracy of the error detection model from the sample side; the sample generation scheme is simple and easy to extend and apply; meanwhile, by separately configuring a trained error detection model and a post-processing decision model, the method detects errors in Chinese text and performs high-quality correction and judgment after detection.
The serial numbers of the embodiments disclosed above are for description only and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware, the program being stored in a computer-readable storage medium; the storage medium may be a read-only memory, a magnetic disk or an optical disk.
The above description is only an embodiment of the present invention and is not intended to limit its scope; any equivalent structure or equivalent process derived from the description of the present invention, whether applied directly or indirectly in other related technical fields, falls within the protection scope of the present invention.

Claims (10)

1. A high-accuracy Chinese spell checking method is characterized by comprising the following steps:
and a corpus obtaining step:
screening a target website, and crawling data from the target website to obtain a first corpus; configuring a first open source project; setting cleaning characteristic data, symbol reservation cutting logic, cutting identifier and length interval; performing corpus cleaning operation on the first corpus based on the first open source project, the cleaning feature data, the symbol reservation cutting logic, the cutting identifier and the length interval to obtain a corpus to be used;
a replacement table configuration step:
setting a first character frequency value, a second character frequency value and a first word frequency value; configuring front/back nasal sound disambiguation logic, compound word segmentation logic, a second open source project, a first simplified-traditional character table and a plurality of first near-sound sample words; executing a replacement word table generation operation based on the corpus to be used, the first character frequency value, the second character frequency value and the first word frequency value to obtain a word table to be replaced; executing a near-sound near-shape table generation operation based on the front/back nasal sound disambiguation logic, the compound word segmentation logic, the second open source project and the plurality of first near-sound sample words to obtain a near-sound near-shape table; and setting the first simplified-traditional character table as a Chinese traditional character table, and integrating the word table to be replaced, the near-sound near-shape table and the Chinese traditional character table to obtain a substitution table to be used;
an error sample generation step:
setting a near-sound error probability and a near-shape error probability; screening out, from the corpus to be used, the corpus to be processed for near-sound errors and the corpus to be processed for near-shape errors based on the near-sound error probability and the near-shape error probability; configuring a random position recall algorithm, a probability limit recall algorithm, a multi-error recall algorithm and a sample random selection algorithm; executing a near-sound error sample generation operation based on the substitution table to be used, the corpus to be processed for near-sound errors, the random position recall algorithm, the probability limit recall algorithm, the multi-error recall algorithm and the sample random selection algorithm to generate near-sound error samples to be used; and generating near-shape error samples to be used by applying the same operation logic to the corpus to be processed for near-shape errors;
model configuration step:
configuring a first decision model and a first open source package; training the first decision model based on the corpus to be used and the first open source package to obtain a decision model to be used; configuring a first inspection model, and setting an error detection network in a model architecture of the first inspection model to obtain a second inspection model; training the second inspection model based on the to-be-used near-sound error sample and the to-be-used near-shape error sample to obtain an inspection model to be used; configuring a post-processing prediction model;
a text checking step:
acquiring a text to be checked; setting a probability threshold value, an error screening characteristic and an error length characteristic value; and executing text inspection error correction operation on the text to be inspected based on the substitution table to be used, the decision model to be used, the inspection model to be used, the post-processing prediction model, the probability threshold, the error screening characteristic and the error length characteristic value.
2. The method of claim 1, wherein the method further comprises:
the corpus cleaning operation comprises:
performing corpus cleaning on the first corpus based on a regular expression and the cleaning feature data to obtain a second corpus; cutting the second corpus based on the symbol reservation cutting logic and the cutting identifier to obtain a third corpus; filtering the corpus with the corpus length outside the length interval in the third corpus to obtain a fourth corpus; and performing word segmentation processing on the fourth corpus based on the first open source project to obtain the corpus to be used.
3. The method of claim 1, wherein the method further comprises:
the replacement word list generation operation comprises:
performing character frequency statistics and word frequency statistics on the corpus to be used to obtain a character frequency ranking sequence and a word frequency ranking sequence; in the character frequency ranking sequence, selecting first character corpora in a quantity corresponding to the first character frequency value in a first direction, and selecting second character corpora in a quantity corresponding to the second character frequency value in a second direction; in the word frequency ranking sequence, selecting first word corpora in a quantity corresponding to the first word frequency value in the first direction; and constructing the word table to be replaced based on the first character corpora, the second character corpora and the first word corpora.
4. The method of claim 1, wherein the method further comprises:
the near-sound near-shape table generating operation comprises:
creating a plurality of pinyin mappings respectively matched with the first near-sound sample words, and sorting the first near-sound sample words and the pinyin mappings to obtain a first near-sound word list; performing word segmentation on the plurality of first near-sound sample words in the first near-sound word list based on the compound word segmentation logic to obtain a plurality of second near-sound sample words; updating the pinyin mappings based on the plurality of second near-sound sample words to obtain a second near-sound word list; disambiguating the pinyin mappings respectively corresponding to the plurality of second near-sound sample words in the second near-sound word list based on the front/back nasal sound disambiguation logic to obtain a near-sound word list to be used; acquiring a near-form word list to be used based on the second open source project; and integrating the near-sound word list to be used and the near-form word list to be used to obtain the near-sound near-shape table.
5. The method of claim 4, wherein the method further comprises:
the near-tone error sample generation operation comprises:
creating a part-of-speech set to be filtered and a first word recall set; setting a first length threshold and a first Cartesian product limiting formula; and calling the random position recall algorithm based on the part-of-speech set to be filtered, the first word recall set, the first length threshold, the first Cartesian product limiting formula, the corpus to be processed for near-sound errors, the near-sound word list to be used and the Chinese traditional character table to obtain a plurality of first near-sound error texts;
setting a single text recall probability and a sequence recall position; setting recall limit logic based on the word list to be replaced; calling the probability limit recall algorithm based on the single text recall probability, the sequence recall position, the recall limit logic, the corpus to be subjected to near-sound processing and the near-sound word list to be used to obtain a plurality of second near-sound error texts;
setting a second length threshold and a multiple-error-position setting algorithm; calling the multi-error recall algorithm based on the second length threshold, the multiple-error-position setting algorithm, the corpus to be subjected to near-sound processing and the near-sound word list to be used to obtain a third multi-error sample;
and calling the sample random selection algorithm based on the third multi-error sample, the plurality of first near-sound error texts and the plurality of second near-sound error texts to obtain the near-sound error sample to be used.
6. The method of claim 5, wherein the method further comprises:
the random position recall algorithm is as follows:
selecting a first text sequence from the corpus to be subjected to near-sound processing, and selecting a first fragment from the first text sequence; performing data recall processing on the first word recall set based on the first length threshold, the first Cartesian product limiting formula, the near-sound word list to be used and the first fragment; screening the first word recall set subjected to the data recall processing based on the Chinese traditional word list and the part-of-speech set to be filtered to obtain a second word recall set; and generating a plurality of the first near-sound error texts based on sample words in the second word recall set and the first text sequence;
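A minimal sketch of the random position recall follows. The single-character fragment, the fixed candidate cap standing in for the Cartesian product limit, the `pos_of` lookup, and all names are illustrative assumptions.

```python
import random

def random_position_recall(sentences, near_sound, trad_chars=frozenset(),
                           filtered_pos=frozenset(), pos_of=lambda c: "n",
                           max_len=128, max_candidates=20):
    # Pick a text sequence and a random single-character fragment in it.
    sent = random.choice([s for s in sentences if len(s) <= max_len])
    i = random.randrange(len(sent))
    frag = sent[i]
    # Recall near-sound substitutes for the fragment and cap how many
    # variants one sentence may spawn (the "Cartesian product limit").
    candidates = sorted(near_sound.get(frag, set()))[:max_candidates]
    # Screen out traditional characters and filtered parts of speech.
    kept = [c for c in candidates
            if c not in trad_chars and pos_of(c) not in filtered_pos]
    return [sent[:i] + c + sent[i + 1:] for c in kept]
```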
the probability limit recall algorithm is as follows:
selecting a second text sequence from the corpus to be subjected to near-sound processing; recalling texts located at the sequence recall position in the second text sequence according to the near-sound word list to be used and the single text recall probability to obtain a plurality of fourth recall sample words; screening the plurality of fourth recall sample words based on the recall limit logic to obtain a plurality of fifth recall sample words; and generating a plurality of the second near-sound error texts based on the plurality of fifth recall sample words and the second text sequence;
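The probability-limited recall can be sketched as below; the probability value, the `forbidden` set standing in for the recall limit logic, and the function name are assumptions for illustration.

```python
import random

def probability_limit_recall(sentence, near_sound, p=0.15,
                             positions=None, forbidden=frozenset()):
    # Walk the sequence; at each eligible recall position, replace the
    # character with a near-sound candidate with probability p, skipping
    # candidates ruled out by the recall limit logic (`forbidden`).
    out = []
    for i, ch in enumerate(sentence):
        eligible = positions is None or i in positions
        cands = [c for c in near_sound.get(ch, ()) if c not in forbidden]
        if eligible and cands and random.random() < p:
            out.append(random.choice(cands))
        else:
            out.append(ch)
    return "".join(out)
```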
the multi-error recall algorithm is as follows:
selecting a third text sequence from the corpus to be subjected to near-sound processing based on the second length threshold; setting a plurality of error replacement positions in the third text sequence based on the multiple-error-position setting algorithm; acquiring a plurality of sixth candidate sample words related to the plurality of error replacement positions based on the near-sound word list to be used; and generating the third multi-error sample based on the plurality of error replacement positions, the plurality of sixth candidate sample words and the third text sequence;
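The multi-error recall can be sketched as follows; the length threshold, the number of errors, and the candidate selection are illustrative assumptions rather than the patent's parameters.

```python
import random

def multi_error_recall(sentence, near_sound, n_errors=2, min_len=8):
    # Only sequences above the length threshold receive multiple errors.
    if len(sentence) < min_len:
        return None
    positions = [i for i, ch in enumerate(sentence) if ch in near_sound]
    if len(positions) < n_errors:
        return None
    chars = list(sentence)
    # Choose distinct error replacement positions and substitute a
    # near-sound candidate at each one.
    for i in random.sample(positions, n_errors):
        chars[i] = random.choice(sorted(near_sound[chars[i]]))
    return "".join(chars)
```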
the sample random selection algorithm comprises the following steps:
setting a selected quantity calculation formula; counting a first quantity sum value of the third multi-error sample, the plurality of first near-sound error texts and the plurality of second near-sound error texts; substituting the first quantity sum value into the selected quantity calculation formula to obtain a selected quantity value; and selecting, according to the selected quantity value, a plurality of near-sound error samples to be used from the third multi-error sample, the first near-sound error texts and the second near-sound error texts.
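The random sample selection can be sketched as below. The patent does not disclose the selected quantity calculation formula, so a fixed ratio of the total count is used here purely as a stand-in.

```python
import random

def select_samples(*pools, ratio=0.5):
    # Count the generated error texts across all pools, derive the number
    # to keep from the selection formula (here an illustrative fixed ratio
    # of the total), and sample without replacement.
    merged = [s for pool in pools for s in pool]
    n_select = max(1, int(len(merged) * ratio))
    return random.sample(merged, n_select)
```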
7. The method of claim 1, wherein the method further comprises:
the text checking and correcting operation comprises the following steps:
inputting the text to be checked into the inspection model to be used to obtain a predicted suspicious error and an error probability; comparing the error probability with the probability threshold;
if the error probability reaches the probability threshold, performing error correction processing on the predicted suspicious error based on the word list to be replaced;
if the error probability does not reach the probability threshold, performing adjacent error combination processing on the predicted suspicious error to obtain a first error to be judged; performing feature filtering processing on the first error to be judged based on the error screening feature, and judging whether the first error to be judged passes the filtering to obtain a second error to be judged;
if the second error to be judged is obtained, performing advanced decision processing on the second error to be judged; and if the second error to be judged is not obtained, performing error retention processing on the predicted suspicious error.
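The triage in the check-and-correct flow can be sketched as follows. The `(start, end, text, prob)` span shape for detector output and the merge rule are assumptions made for illustration.

```python
def triage_errors(errors, p_threshold=0.9, screening=frozenset()):
    # High-probability spans go straight to table-based correction; the
    # rest are merged with adjacent spans and feature-filtered, and the
    # survivors are passed on to the advanced decision step.
    correct_now = [e for e in errors if e[3] >= p_threshold]
    low = sorted(e for e in errors if e[3] < p_threshold)
    merged = []
    for start, end, text, prob in low:
        if merged and start <= merged[-1][1]:   # adjacent or overlapping
            s, e_, t, p = merged.pop()
            merged.append((s, max(e_, end), t + text, max(p, prob)))
        else:
            merged.append((start, end, text, prob))
    decide_later = [e for e in merged if e[2] not in screening]
    retained = [e for e in merged if e[2] in screening]
    return correct_now, decide_later, retained
```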
8. The method of claim 7, wherein the method further comprises:
the advanced decision processing comprises:
identifying an error length value of the second error to be judged; if the error length value is greater than the error length characteristic value, performing gram scoring decision processing based on the second error to be judged and the predicted suspicious error; if the error length value is not greater than the error length characteristic value, performing prediction processing on the second error to be judged based on the post-processing prediction model to obtain a third error to be judged, and performing gram scoring decision processing based on the third error to be judged and the predicted suspicious error;
the gram scoring decision processing includes:
calling the decision model to be used to calculate a first error correction index of the predicted suspicious error and a second error correction index of the second error to be judged or a third error correction index of the third error to be judged; comparing the first error correction index with the second error correction index or the third error correction index; if the first error correction index is smaller than the second error correction index or the third error correction index, performing the error correction processing on the predicted suspicious error based on the word list to be replaced; and if the first error correction index is not smaller than the second error correction index or the third error correction index, performing error retention processing on the predicted suspicious error.
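The gram scoring decision can be sketched as below. A tiny bigram count model stands in for the decision model's error-correction index (higher meaning more fluent); the patent does not disclose the actual scoring function, so all of this is illustrative.

```python
from collections import Counter

def make_bigram_scorer(corpus_lines):
    # Tiny bigram count model standing in for the decision model's
    # "error correction index"; higher means more fluent.
    bigrams = Counter()
    for line in corpus_lines:
        for a, b in zip(line, line[1:]):
            bigrams[a + b] += 1
    def score(text):
        return sum(bigrams[a + b] for a, b in zip(text, text[1:]))
    return score

def gram_score_decision(score_fn, original, candidate):
    # Correct only when the candidate scores strictly higher; on a tie or
    # a lower score, the predicted error is retained.
    return "correct" if score_fn(original) < score_fn(candidate) else "keep"
```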
9. A high-accuracy Chinese spell checking system implementing the high-accuracy Chinese spell checking method according to any one of claims 1 to 8, comprising: a corpus acquisition module, a substitution table configuration module, an error sample generation module, a model configuration module and a text inspection module;
the corpus acquisition module is used for screening a target website and crawling data from the target website to obtain a first corpus; the corpus acquisition module is further used for configuring a first open source project; the corpus acquisition module is further used for setting cleaning feature data, symbol reservation cutting logic, a cutting identifier and a length interval; and the corpus acquisition module executes a corpus cleaning operation on the first corpus based on the first open source project, the cleaning feature data, the symbol reservation cutting logic, the cutting identifier and the length interval to obtain a corpus to be used;
the substitution table configuration module is used for setting a first character frequency value, a second character frequency value and a first word frequency value; the substitution table configuration module is further used for configuring front and back nasal sound disambiguation logic, compound word segmentation logic, a second open source project, a first simplified and traditional word table and a plurality of first near-sound sample words; the substitution table configuration module executes a replacement word table generation operation based on the corpus to be used, the first character frequency value, the second character frequency value and the first word frequency value to obtain a word list to be replaced; the substitution table configuration module executes a near-sound near-shape table generation operation based on the front and back nasal sound disambiguation logic, the compound word segmentation logic, the second open source project and the first near-sound sample words to obtain a near-sound near-shape table; and the substitution table configuration module sets the first simplified and traditional word table as a Chinese traditional word table, and integrates the word list to be replaced, the near-sound near-shape table and the Chinese traditional word table to obtain a substitution table to be used;
the error sample generation module is used for setting a near-sound error probability and a near-shape error probability; the error sample generation module screens, from the corpus to be used, a corpus to be subjected to near-sound processing and a corpus to be subjected to near-shape processing based on the near-sound error probability and the near-shape error probability; the error sample generation module is further used for configuring a random position recall algorithm, a probability limit recall algorithm, a multi-error recall algorithm and a sample random selection algorithm; the error sample generation module executes a near-sound error sample generation operation based on the substitution table to be used, the corpus to be subjected to near-sound processing, the random position recall algorithm, the probability limit recall algorithm, the multi-error recall algorithm and the sample random selection algorithm to generate a near-sound error sample to be used; and the error sample generation module generates a near-shape error sample to be used by applying the operation logic of the near-sound error sample generation operation to the corpus to be subjected to near-shape processing;
the model configuration module is used for configuring a first decision model and a first open source package; the model configuration module trains the first decision model based on the corpus to be used and the first open source package to obtain a decision model to be used; the model configuration module is further used for configuring a first inspection model, and sets an error detection network in a model architecture of the first inspection model to obtain a second inspection model; the model configuration module trains the second inspection model based on the near-sound error sample to be used and the near-shape error sample to be used to obtain an inspection model to be used; and the model configuration module is further used for configuring a post-processing prediction model;
the text inspection module is used for acquiring a text to be inspected; the text checking module is also used for setting a probability threshold value, an error screening characteristic and an error length characteristic value; the text inspection module executes text inspection error correction operation on the text to be inspected based on the substitution table to be used, the decision model to be used, the inspection model to be used, the post-processing prediction model, the probability threshold, the error screening characteristic and the error length characteristic value.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the steps of the high-accuracy Chinese spell checking method according to any one of claims 1 to 8.
CN202210573678.3A 2022-05-25 2022-05-25 High-accuracy Chinese spelling check method, system and medium Pending CN115169328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210573678.3A CN115169328A (en) 2022-05-25 2022-05-25 High-accuracy Chinese spelling check method, system and medium


Publications (1)

Publication Number Publication Date
CN115169328A true CN115169328A (en) 2022-10-11

Family

ID=83484065

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210573678.3A Pending CN115169328A (en) 2022-05-25 2022-05-25 High-accuracy Chinese spelling check method, system and medium

Country Status (1)

Country Link
CN (1) CN115169328A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117077664A (en) * 2022-12-29 2023-11-17 广东南方网络信息科技有限公司 Method and device for constructing text error correction data and storage medium
CN117077664B (en) * 2022-12-29 2024-04-12 广东南方网络信息科技有限公司 Method and device for constructing text error correction data and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination