CN109597983B - Spelling error correction method and device - Google Patents

Spelling error correction method and device

Info

Publication number
CN109597983B
Authority
CN
China
Prior art keywords
pinyin
corrected
pinyins
similar
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710928606.5A
Other languages
Chinese (zh)
Other versions
CN109597983A (en)
Inventor
陈克凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201710928606.5A
Publication of CN109597983A
Application granted
Publication of CN109597983B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a spelling error correction method and device, relating to the technical field of data processing. It aims to solve the problem that existing spelling error correction performs association and correction only according to the positions of letters on the keyboard, which is one-sided and results in low accuracy. The method comprises: obtaining a pinyin to be corrected; extracting, from a preset database, a plurality of similar pinyins corresponding to the pinyin to be corrected; calculating the error correction probability of each similar pinyin with respect to the pinyin to be corrected according to a keyboard edit distance set based on Chinese pinyin rules; and outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probabilities. The invention is suitable for spelling error correction.

Description

Spelling error correction method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a spelling error correction method and apparatus.
Background
With the continuous development of network technology, more and more people work, shop, and search for information through computers in daily life, and interaction between users and intelligent devices is very frequent. In general, people enter information into a device by keyboard input, touch-screen handwriting input, voice input, and the like. However, when typing on a keyboard, a user may hit the wrong key while spelling. For example, when the user wants to input the Chinese word for "phone", the correct pinyin "dianhua" may be mistyped as "dianhia"; the erroneous pinyin input by the user then has to be corrected and associated with the corresponding characters.
At present, spelling errors are corrected mainly by association and correction according to the position of each letter on the keyboard. This approach is relatively one-sided, so the accuracy of spelling error correction is low.
Disclosure of Invention
In view of the above problems, the present invention provides a spelling error correction method and apparatus, whose main purpose is to correct the pinyin input by a user in combination with the rules of Chinese pinyin.
In order to solve the above technical problem, in a first aspect, the present invention provides a method for spell correction, including:
obtaining pinyin to be corrected, wherein the pinyin to be corrected is a plurality of characters input by a user;
extracting a plurality of similar pinyins corresponding to the pinyin to be corrected from a preset database, wherein the correct spelling pinyins corresponding to all words and characters are stored in the preset database, and the similar pinyins are the pinyins which have a difference of a preset number of characters from the pinyin to be corrected;
calculating error correction probabilities respectively corresponding to the similar pinyins and the pinyins to be corrected according to a keyboard editing distance set based on a Chinese pinyin rule;
and outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability.
Optionally, the method further includes:
and acquiring and extracting the accumulated search quantity respectively corresponding to each similar pinyin from the preset database.
Optionally, the calculating the error correction probability corresponding to each similar pinyin and the pinyin to be error-corrected respectively includes:
calculating final editing distances respectively corresponding to the similar pinyins and the pinyins to be corrected according to the keyboard editing distance and preset coefficients set based on the Chinese pinyin rule, wherein the preset coefficients are set according to the number of characters with the difference between the similar pinyins and the pinyins to be corrected;
and determining the product of the accumulated search quantity corresponding to each similar pinyin and the reciprocal of the final editing distance as the error correction probability of each similar pinyin.
Optionally, the outputting the pinyin corresponding to the pinyin to be error-corrected according to the error correction probability includes:
sequencing the similar pinyins according to the error correction probability;
extracting similar pinyins corresponding to the error correction probability exceeding a preset probability threshold;
and outputting the extracted similar pinyin according to a preset rule.
Optionally, before obtaining the pinyin to be corrected, the method further includes:
acquiring pinyin to be detected;
detecting whether a correct spelling pinyin corresponding to the pinyin to be detected exists in the preset database;
if yes, outputting the correct spelling pinyin corresponding to the pinyin to be detected;
and if not, determining the pinyin to be detected as the pinyin to be corrected.
In a second aspect, the present invention further provides a spelling error correction apparatus, comprising:
an acquisition unit, configured to acquire a pinyin to be corrected, wherein the pinyin to be corrected is a plurality of characters input by a user;
the extraction unit is used for extracting a plurality of similar pinyins corresponding to the pinyins to be corrected from a preset database, the correct spelling pinyins corresponding to all words and characters are stored in the preset database, and the similar pinyins are the pinyins with the difference of a preset number of characters from the pinyins to be corrected;
the calculation unit is used for calculating the error correction probability corresponding to each similar pinyin and the pinyin to be corrected respectively according to the keyboard editing distance set based on the pinyin rule;
and the output unit is used for outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability.
Optionally, the obtaining unit is further configured to obtain, from the preset database, cumulative search quantities respectively corresponding to the similar pinyins;
the extracting unit is further configured to extract accumulated search quantities respectively corresponding to the similar pinyins acquired by the acquiring unit.
Optionally, the computing unit includes:
the calculation module is used for calculating the final editing distance corresponding to each similar pinyin and the pinyin to be corrected according to the keyboard editing distance and the preset coefficient set based on the Chinese pinyin rule, wherein the preset coefficient is set according to the number of characters of the difference between the similar pinyin and the pinyin to be corrected;
and the determining module is used for determining the product of the accumulated search quantity corresponding to each similar pinyin and the reciprocal of the final editing distance as the error correction probability of each similar pinyin.
Optionally, the output unit includes:
the sequencing module is used for sequencing all the similar pinyins according to the error correction probability;
the extraction module is used for extracting similar pinyin corresponding to the error correction probability exceeding a preset probability threshold;
and the output module is used for outputting the extracted similar pinyin according to a preset rule.
Optionally, the apparatus further comprises a detection unit, wherein:
the acquisition unit is also used for acquiring the pinyin to be detected;
the detection unit is used for detecting whether the correct spelling pinyin corresponding to the pinyin to be detected exists in the preset database;
the output unit is also used for outputting the correct spelling pinyin corresponding to the pinyin to be detected if the correct spelling pinyin exists;
and the determining unit is also used for determining the pinyin to be detected as the pinyin to be corrected if the pinyin does not exist.
In order to achieve the above object, according to a third aspect of the present invention, there is provided a storage medium including a stored program, wherein when the program runs, a device on which the storage medium is located is controlled to execute the above spell correction method.
In order to achieve the above object, according to a fourth aspect of the present invention, there is provided a processor for executing a program, wherein the program executes to execute the above spell correction method.
By means of the above technical solution, the spelling error correction method and device provided by the invention address the following shortcoming of the prior art: when pinyin input by a user is corrected, association and error correction are performed mainly according to the position of each letter on the keyboard, so the error correction is relatively one-sided. In contrast, the invention calculates the error correction probability of each similar pinyin according to a keyboard edit distance set based on Chinese pinyin rules, so that the pinyin input by the user is corrected in combination with the rules of Chinese pinyin.
The above description is only an overview of the technical solutions of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the description, and so that the above and other objects, features, and advantages of the present invention are more readily apparent, specific embodiments of the invention are set out below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for spell correction according to an embodiment of the invention;
FIG. 2 is a flow chart of another spell correction method provided by an embodiment of the invention;
FIG. 3 is a block diagram illustrating an exemplary spell correction apparatus;
FIG. 4 is a block diagram illustrating another spelling error correction apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to improve the accuracy of the spell correction, an embodiment of the present invention provides a method for spell correction, as shown in fig. 1, the method including:
101. Obtaining the pinyin to be corrected.
The pinyin to be corrected is a plurality of characters input by a user. It may be the pinyin characters corresponding to a single Chinese character, the pinyin characters corresponding to a phrase, the characters of an English word, and the like. The application scenario of the embodiment of the present invention may be a browser page or a page in an app, which is not specifically limited in the embodiment of the present invention.
It should be noted that the execution subject of the embodiment of the present invention may be a spelling error correction device configured in a web page. When the device detects that the user has entered a pinyin in the web page that cannot be recognized, the pinyin needs to be corrected, and an acquisition instruction is triggered so that the pinyin can be corrected.
102. Extracting a plurality of similar pinyins corresponding to the pinyin to be corrected from a preset database.
The preset database stores the correctly spelled pinyins corresponding to all words and characters, and the similar pinyins are pinyins that differ from the pinyin to be corrected by a preset number of characters. The number of similar pinyins corresponding to the pinyin to be corrected may be 3, 10, 20, and so on. A pinyin that differs by a preset number of characters may have one or more characters that differ from those of the pinyin to be corrected, or may have one or more characters more or fewer than the pinyin to be corrected. For example, if the pinyin to be corrected input by the user is "kaihin", the similar pinyins corresponding to it may include {kaixin, kaibin, kaiyin, kuaijin}.
Specifically, step 102 may directly traverse a pre-created database to search for the similar pinyins corresponding to the obtained pinyin to be corrected. Alternatively, it may first generate all pinyins that differ from the pinyin to be corrected by the preset number of characters and then compare them with the pinyins stored in the database to obtain the final similar pinyins corresponding to the pinyin to be corrected.
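The traversal variant of step 102 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "differs by a preset number of characters" can be measured with a plain Levenshtein distance, and the names `edit_count`, `similar_pinyins`, `max_diff`, and the contents of `DATABASE` are hypothetical.

```python
def edit_count(a: str, b: str) -> int:
    """Plain Levenshtein distance: characters inserted, deleted, or replaced."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca with cb (free if equal)
            ))
        prev = curr
    return prev[-1]


def similar_pinyins(to_correct: str, database: set[str], max_diff: int = 2) -> list[str]:
    """Traverse the database and keep entries within max_diff characters of the input."""
    return sorted(p for p in database if 0 < edit_count(to_correct, p) <= max_diff)


# Hypothetical database contents, for illustration only.
DATABASE = {"wangba", "wanha", "wanna", "rana", "wanga", "dianhua", "kaixin"}
print(similar_pinyins("wanba", DATABASE))
```

A full traversal costs one distance computation per database entry, which is why the text also mentions generating candidates first and looking them up.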
103. Calculating the error correction probability of each similar pinyin with respect to the pinyin to be corrected according to the keyboard edit distance set based on Chinese pinyin rules.
The keyboard edit distance set based on Chinese pinyin rules comprises keyboard edit distances set according to different Chinese pinyin rules. For example, according to the flat-tongue and retroflex (curled-tongue) rule in Chinese pinyin, the keyboard edit distance between c and ch, between s and sh, and the like is set to 0.5; according to the initial and final rule in Chinese pinyin, the keyboard edit distance of "bu" is set to 0.5 and that of "bt" is set to 1; and according to the regional accent rules of Chinese pinyin, the keyboard edit distance between lu and lv is set to 0.5, the keyboard edit distance between mo and me is set to 0.5, and so on. It should be noted that, in the embodiment of the present invention, all pinyins covered by a special Chinese pinyin rule, together with their keyboard edit distances, may be stored in the database in advance; for the remaining pinyins, to which no special rule applies, the keyboard edit distances may be calculated by a prior-art method and stored. Together these form the keyboard edit distance set based on Chinese pinyin rules used in this step. Therefore, in this step, the pinyin to be corrected can be traversed and compared against the keyboard edit distances stored in the database to obtain the keyboard edit distance of each similar pinyin.
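One way to picture the stored rule-based distances is a symmetric lookup table with a fallback. The sketch below is an assumption about layout only: the 0.5 values come from the examples in the text, while the names `PINYIN_RULE_DISTANCES` and `keyboard_edit_distance` and the constant fallback of 1.0 (standing in for the prior-art calculation) are hypothetical.

```python
# Distances taken from the examples in the text; the table layout is an assumption.
PINYIN_RULE_DISTANCES = {
    frozenset(("c", "ch")): 0.5,   # flat-tongue vs. retroflex initials
    frozenset(("s", "sh")): 0.5,
    frozenset(("lu", "lv")): 0.5,  # regional accent rule
    frozenset(("mo", "me")): 0.5,
}


def keyboard_edit_distance(a: str, b: str, fallback: float = 1.0) -> float:
    """Look up the rule-based distance between two pinyin fragments.

    The fallback is a placeholder for the prior-art distance calculation,
    which the text does not spell out.
    """
    if a == b:
        return 0.0
    return PINYIN_RULE_DISTANCES.get(frozenset((a, b)), fallback)


print(keyboard_edit_distance("sh", "s"))   # rule applies, order does not matter
print(keyboard_edit_distance("b", "t"))    # no rule, fallback is used
```

Using `frozenset` keys makes the lookup order-independent, so "s" to "sh" and "sh" to "s" share one entry.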
Further, the error correction probability identifies the probability that the erroneous pinyin input by the user is associated with each similar pinyin in the pinyin set. The higher the calculated error correction probability of a similar pinyin, the more likely that pinyin is associated with the pinyin to be corrected input by the user; conversely, the lower the calculated error correction probability, the less likely it is.
104. Outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability.
The corrected pinyin is the correct pinyin offered to the user for selection. The output of this step may be a single corrected pinyin, or several corrected pinyins arranged in descending order of error correction probability for the user to choose from. As described in step 103, the error correction probability identifies the probability that each similar pinyin is associated with the pinyin to be corrected; therefore, once the error correction probability of every similar pinyin has been calculated, the processing result for the pinyin to be corrected can be output according to those probabilities.
In the prior art, pinyin input by a user is corrected mainly by association and correction according to the position of each letter on the keyboard, which makes the error correction relatively one-sided. By contrast, the spelling error correction method provided by the embodiment of the invention calculates the error correction probability of each similar pinyin according to a keyboard edit distance set based on Chinese pinyin rules, so that the pinyin input by the user is corrected in combination with the rules of Chinese pinyin.
Further, as a refinement and an extension of the embodiment shown in fig. 1, another pinyin error correction method is provided in the embodiment of the present invention, as shown in fig. 2.
201. Obtaining the pinyin to be corrected.
The pinyin to be corrected is a plurality of characters input by a user. For the specific explanation of the pinyin to be corrected and the concept of the character, reference may be made to the corresponding description in step 101, and details are not repeated here.
To avoid resource waste, for the embodiment of the present invention, before the step 201, the method may further include: obtaining a pinyin to be detected; detecting whether a correct spelling pinyin corresponding to the pinyin to be detected exists in the preset database; if yes, outputting correct spelling pinyin corresponding to the pinyin to be detected; and if not, determining the pinyin to be detected as the pinyin to be corrected. The preset database stores the correct spelling pinyins corresponding to all words and characters and the accumulated searching quantity identification information corresponding to each correct spelling pinyin respectively, wherein the accumulated searching quantity identification information is used for identifying the searching times of the user on each correct spelling pinyin.
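The pre-check described above can be sketched as a simple membership test. The function name `check_input`, the status labels, and the sample database (pinyin mapped to its accumulated search count) are hypothetical; the text does not prescribe a data structure.

```python
def check_input(pinyin: str, database: dict[str, int]) -> tuple[str, str]:
    """Return ("correct", pinyin) when the pinyin is already a correct spelling
    in the database, otherwise mark it as the pinyin to be corrected."""
    if pinyin in database:
        return ("correct", pinyin)
    return ("to_correct", pinyin)


# Hypothetical database: correct pinyin -> accumulated search count.
DATABASE = {"wangba": 200, "wanha": 5, "dianhua": 37}

print(check_input("wangba", DATABASE))  # found, no correction needed
print(check_input("wanba", DATABASE))   # not found, goes on to step 201
```

Only inputs that fail this check reach the more expensive similarity search, which is the resource saving the text refers to.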
It should be noted that, in the embodiment of the present invention, a separate preset database may be configured for each web page. For each web page, the database stores the correctly spelled pinyin data corresponding to all content contained in the page; each time a user searches for content in the page, the search is recorded, and the statistical result is stored together with the pinyin corresponding to that content. Alternatively, a database may be configured per category for the web pages of each category, and the number of searches for the content counted, so that the user search count serves as a reference factor in pinyin correction.
202. Extracting a plurality of similar pinyins corresponding to the pinyin to be corrected from a preset database.
The similar pinyins corresponding to the pinyins to be error-corrected are pinyins that differ from the pinyins to be error-corrected by a preset number of characters, and the specific conceptual explanation of the pinyins that differ from the pinyins to be error-corrected by the preset number of characters may refer to the corresponding description in step 102, which is not described herein again.
Specifically, step 202 may first generate a corresponding lookup function according to the obtained pinyin to be corrected and then search the preset database with that function, although it is not limited to this approach. For example, when the obtained pinyin to be corrected is "niulai", the letters it contains can be replaced one at a time to generate the corresponding lookup functions, such as those for "niula_" and "niu_ai"; letters can also be added to or removed from the pinyin. These functions are then used to search the database for all similar pinyins that differ from the pinyin to be corrected by the preset number of characters.
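The wildcard lookup for "niulai" can be illustrated as below. This is a sketch under assumptions: it handles a difference of exactly one character, treats "_" as "any lowercase letter", and uses regular expressions in place of whatever lookup function the database would actually provide. The names `lookup_patterns`, `search`, and the `DATABASE` contents are hypothetical.

```python
import re


def lookup_patterns(pinyin: str) -> set[str]:
    """Generate one-character variants, with "_" standing for any single letter."""
    patterns = set()
    for i in range(len(pinyin)):
        patterns.add(pinyin[:i] + "_" + pinyin[i + 1:])  # replace letter i
        patterns.add(pinyin[:i] + pinyin[i + 1:])        # remove letter i
    for i in range(len(pinyin) + 1):
        patterns.add(pinyin[:i] + "_" + pinyin[i:])      # add a letter at position i
    return patterns


def search(pinyin: str, database: set[str]) -> list[str]:
    """Return database entries that match any generated pattern."""
    regexes = [
        re.compile("".join("[a-z]" if ch == "_" else re.escape(ch) for ch in pattern))
        for pattern in lookup_patterns(pinyin)
    ]
    return sorted(
        entry for entry in database
        if entry != pinyin and any(r.fullmatch(entry) for r in regexes)
    )


# Hypothetical database contents, for illustration only.
DATABASE = {"niunai", "niulan", "niula", "dianhua"}
print(search("niulai", DATABASE))
```

Here "niu_ai" matches "niunai", "niula_" matches "niulan", and removing the last letter yields "niula".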
In the embodiment of the present invention, after this step, the method may further include: acquiring and extracting, from the preset database, the accumulated search count corresponding to each similar pinyin. After step 202, all candidate corrected pinyins for the pinyin to be corrected have been obtained. Extracting the accumulated search count of each similar pinyin at this point, ready for further processing, avoids the time wasted on cumbersome operations when every pinyin that differs from the pinyin to be corrected by the preset number of characters has to be searched for and extracted separately, and thus improves the efficiency of pinyin correction.
Specifically, each similar pinyin and its accumulated search count may be written into a data table, or all extracted similar pinyins and their accumulated search counts may be stored directly, with the similar pinyins separated by a preset separator; the embodiment of the present invention does not limit this.
For example, the obtained pinyin to be corrected is wanba, and the similar pinyin data table obtained by writing each similar pinyin and the corresponding accumulated search number into the data table is shown as table one:
Table One

| Pinyin data | Accumulated search count |
| --- | --- |
| wanha | 5 |
| wanna | 1 |
| rana | 2 |
| wanga | 1 |
| wangba | 200 |

According to the pinyin data table shown in Table One, there are 5 similar pinyins corresponding to the pinyin to be corrected; the accumulated user search count for wangba is 200, the accumulated user search count for wanga is 1, and so on. Because all pinyins that differ from the pinyin to be corrected by the preset number of characters are stored together, the pinyin data can be extracted directly from the data table during further processing. This avoids the extraction errors and cumbersome operations caused by extracting entries one by one from a large amount of unordered data, and improves the accuracy and convenience of pinyin correction.
In the embodiment of the invention, after this step, the pinyin to be corrected and the created data table can be stored. When the same pinyin to be corrected is input again, the previously stored data table can be obtained directly, without extracting the pinyins and creating the set again. This saves time and resources and improves the efficiency of pinyin correction.
203. Calculating the final edit distance between each similar pinyin and the pinyin to be corrected according to the keyboard edit distance set based on Chinese pinyin rules and a preset coefficient.
The preset coefficient is set according to the number of characters by which the similar pinyin differs from the pinyin to be corrected. When one character is added or deleted between the similar pinyin and the pinyin to be corrected, the coefficient is 1; when N characters are added or deleted, the coefficient is N. Similarly, when the similar pinyin differs from the pinyin to be corrected in one character, the coefficient is 1, and when N characters differ, the coefficient is N.
Specifically, in step 203, the keyboard edit distance set based on Chinese pinyin rules may be multiplied by the preset coefficient, and the product is taken as the final edit distance between each similar pinyin and the pinyin to be corrected. For an explanation of the keyboard edit distance set based on Chinese pinyin rules, refer to the corresponding description in step 103; details are not repeated here.
Continuing the example of step 202, in which the pinyin to be corrected is wanba, take the similar pinyin wangba as an example of calculating the final edit distance. Compared with the pinyin to be corrected wanba, wangba has one additional character, so the coefficient set according to the preset rule is 1. The keyboard edit distance between ang and an is set in advance to 0.5 based on Chinese pinyin rules, so the final edit distance between wangba and wanba is 1 × 0.5 = 0.5. The final edit distances between the other similar pinyins and the pinyin to be corrected are obtained in the same way: 1 × 1 = 1 for wanha, 1 × 1 = 1 for wanna, 1 × 2 = 2 for rana, and 1 × 1 = 1 for wanga.
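The arithmetic of step 203 for the wanba example can be reproduced with a few lines. The numbers are the ones given in the text; the function name `final_edit_distance` and the table `CANDIDATES` are hypothetical, and reading "1 × 2" for rana as keyboard edit distance 1 times coefficient 2 is an assumption.

```python
def final_edit_distance(keyboard_distance: float, coefficient: int) -> float:
    """Final edit distance = rule-based keyboard edit distance x preset coefficient."""
    return keyboard_distance * coefficient


# (keyboard edit distance, preset coefficient) per similar pinyin for input "wanba".
# The split for rana (distance 1.0, coefficient 2) is an assumed reading of "1 x 2".
CANDIDATES = {
    "wangba": (0.5, 1),  # an vs. ang costs 0.5, one added character
    "wanha": (1.0, 1),
    "wanna": (1.0, 1),
    "rana": (1.0, 2),
    "wanga": (1.0, 1),
}

FINAL = {p: final_edit_distance(d, c) for p, (d, c) in CANDIDATES.items()}
print(FINAL)
```

The rule-based 0.5 is what lets wangba end up closer to wanba than the other one-character candidates.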
204. Determining the product of the accumulated search count of each similar pinyin and the reciprocal of the final edit distance as the error correction probability of that similar pinyin.
The error correction probability identifies the probability that the erroneous pinyin input by the user is associated with each similar pinyin. Specifically, step 204 calculates: error correction probability = accumulated search count × (1 / final edit distance). Using the final edit distances from step 203 and the accumulated search count of each similar pinyin, the error correction probabilities are calculated in turn as follows: 200 × (1/0.5) = 400 for wangba, 5 × (1/1) = 5 for wanha, 1 × (1/1) = 1 for wanna, 2 × (1/2) = 1 for rana, and 1 × (1/1) = 1 for wanga.
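The scoring in step 204 can be checked with the values from Table One and step 203. This is an illustrative sketch: the names `error_correction_probability`, `TABLE`, and `SCORES` are hypothetical, and the guard against a non-positive distance is an added assumption, since the text only considers candidates that differ from the input.

```python
def error_correction_probability(search_count: int, final_distance: float) -> float:
    """Score = accumulated search count x (1 / final edit distance)."""
    if final_distance <= 0:
        raise ValueError("final edit distance must be positive")  # assumed guard
    return search_count * (1 / final_distance)


# similar pinyin -> (accumulated search count, final edit distance), from the text.
TABLE = {
    "wangba": (200, 0.5),
    "wanha": (5, 1.0),
    "wanna": (1, 1.0),
    "rana": (2, 2.0),
    "wanga": (1, 1.0),
}

SCORES = {p: error_correction_probability(n, d) for p, (n, d) in TABLE.items()}
best = max(SCORES, key=SCORES.get)
print(SCORES, best)
```

Both factors push in the same direction here: wangba is searched most often and is also the closest candidate.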
205. Outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability.
For an explanation of the corrected pinyin, refer to the corresponding description in step 104; details are not repeated here.
Specifically, step 205 may include: sorting the similar pinyins by error correction probability; extracting the similar pinyins whose error correction probability exceeds a preset probability threshold; and outputting the extracted similar pinyins according to a preset rule. The preset probability threshold may be 90%, 85%, 70%, and so on, and is not specifically limited in the embodiment of the present invention. For example, suppose the similar pinyins corresponding to the pinyin to be corrected are the 5 pinyins {pinyin 1, pinyin 2, pinyin 3, pinyin 4, pinyin 5}, and their calculated error correction probabilities are 98%, 23%, 3%, 67%, and 85% respectively. Sorting them from high to low gives {pinyin 1, pinyin 5, pinyin 4, pinyin 2, pinyin 3}. If the preset probability threshold is 80%, the pinyins whose error correction probability exceeds the threshold are pinyin 1 and pinyin 5, which are extracted and output in descending order of error correction probability: pinyin 1, pinyin 5.
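The sort-then-filter logic of step 205 can be sketched with the five probabilities from the example. The function name `select_corrections` is hypothetical, and the probabilities are written as fractions (0.98 for 98%) for convenience.

```python
def select_corrections(probabilities: dict[str, float], threshold: float) -> list[str]:
    """Sort candidates by probability, high to low, and keep those above the threshold."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return [p for p in ranked if probabilities[p] > threshold]


# Probabilities from the example in the text.
PROBABILITIES = {
    "pinyin 1": 0.98,
    "pinyin 2": 0.23,
    "pinyin 3": 0.03,
    "pinyin 4": 0.67,
    "pinyin 5": 0.85,
}

print(select_corrections(PROBABILITIES, threshold=0.80))
```

Lowering the threshold to 0.60 would also admit pinyin 4, so the threshold directly controls how many options the user is shown.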
It should be noted that, in the embodiment of the present invention, a count threshold may also be set. When the accumulated search count of a similar pinyin exceeds this threshold, the ranking of the similar pinyins may be adjusted accordingly; for example, the pinyin that exceeds the threshold is placed first in the output. In the embodiment of the invention, the pinyins that meet the condition are obtained by setting a preset probability threshold and screening against it, and all of them are output for the user to choose from. This avoids the one-sidedness of offering only a single corrected pinyin, makes pinyin error correction more comprehensive, and improves the user experience.
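The rank adjustment by search count might look like the following. This is only one possible reading of "ranked at the first position": a stable re-sort that moves every pinyin above the count threshold to the front while keeping the existing order otherwise. The name `promote_popular`, the incoming ranking, and the threshold of 100 are hypothetical; the counts are those of Table One.

```python
def promote_popular(ranked: list[str], search_counts: dict[str, int],
                    count_threshold: int) -> list[str]:
    """Move pinyins whose accumulated search count exceeds the threshold to the
    front, preserving the existing relative order within each group."""
    # sorted() is stable and False sorts before True, so entries that
    # exceed the threshold (key False) come first.
    return sorted(ranked, key=lambda p: search_counts.get(p, 0) <= count_threshold)


# Counts from Table One; the incoming ranking is made up for illustration.
SEARCH_COUNTS = {"wanha": 5, "wangba": 200, "wanna": 1}
RANKED = ["wanha", "wangba", "wanna"]

print(promote_popular(RANKED, SEARCH_COUNTS, count_threshold=100))
```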
Further, as an implementation of the method shown in fig. 1, an embodiment of the present invention further provides a spelling error correction apparatus, which is used for implementing the method shown in fig. 1. The embodiment of the apparatus corresponds to the embodiment of the method, and for convenience of reading, details in the embodiment of the apparatus are not repeated one by one, but it should be clear that the apparatus in the embodiment can correspondingly implement all the contents in the embodiment of the method. As shown in fig. 3, the apparatus includes: an acquisition unit 31, an extraction unit 32, a calculation unit 33, an output unit 34, wherein
The obtaining unit 31 may be configured to obtain a pinyin to be error-corrected, where the pinyin to be error-corrected is a plurality of characters input by a user.
The extracting unit 32 may be configured to extract a plurality of similar pinyins corresponding to the pinyin to be corrected, which is obtained by the obtaining unit 31, from a preset database, where correct spelling pinyins corresponding to all words and characters are stored in the preset database, and the similar pinyins are pinyins that differ from the pinyin to be corrected by a preset number of characters.
The calculating unit 33 may be configured to calculate, according to the keyboard edit distance set based on the Chinese pinyin rule, the error correction probability of each similar pinyin extracted by the extracting unit 32 with respect to the pinyin to be corrected.
The output unit 34 may be configured to output a corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability calculated by the calculation unit 33.
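As an illustration of what the extracting unit does, the sketch below selects database entries that differ from the input by a bounded number of characters. It makes several assumptions: the character difference is measured with the plain Levenshtein distance (which is not the keyboard edit distance used later for scoring), "a preset number of characters" is read as "at most `max_difference` characters", and the names `char_difference` and `extract_similar` are hypothetical.

```python
def char_difference(a, b):
    """Plain Levenshtein distance: each insertion, deletion, or substitution costs 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def extract_similar(to_correct, database, max_difference):
    """Return correctly spelled pinyins within max_difference characters of the input."""
    return [p for p in database if 0 < char_difference(to_correct, p) <= max_difference]


database = ["beijing", "beijin", "nanjing", "shanghai"]
print(extract_similar("beijng", database, max_difference=2))  # ['beijing', 'beijin']
```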
Further, as an implementation of the method shown in fig. 2, an embodiment of the present invention further provides another spelling error correction apparatus for implementing the method shown in fig. 2. The apparatus embodiment corresponds to the foregoing method embodiment. For ease of reading, the details of the method embodiment are not repeated one by one here, but it should be clear that the apparatus in this embodiment can implement everything in the method embodiment. As shown in fig. 4, the apparatus includes: an obtaining unit 41, an extracting unit 42, a calculating unit 43, and an output unit 44.
The obtaining unit 41 may be configured to obtain a pinyin to be error-corrected, where the pinyin to be error-corrected is a plurality of characters input by a user.
The extracting unit 42 may be configured to extract a plurality of similar pinyins corresponding to the pinyin to be corrected, which is obtained by the obtaining unit 41, from a preset database, where correct spelling pinyins corresponding to all words and characters are stored in the preset database, and the similar pinyins are pinyins having a difference from the pinyin to be corrected by a preset number of characters.
The calculating unit 43 may be configured to calculate, according to the keyboard edit distance set based on the Chinese pinyin rule, the error correction probability of each similar pinyin extracted by the extracting unit 42 with respect to the pinyin to be corrected.
The output unit 44 may be configured to output a corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability calculated by the calculating unit 43.
Further, the apparatus further comprises: a detection unit 45 and a determination unit 46.
The obtaining unit 41 may be further configured to obtain a pinyin to be detected.
The detecting unit 45 may be configured to detect whether a correctly spelled pinyin corresponding to the pinyin to be detected exists in the preset database.
The output unit 44 may be further configured to output a correctly spelled pinyin corresponding to the pinyin to be detected, if the correctly spelled pinyin exists.
The determining unit 46 may be configured to determine the pinyin to be detected as a pinyin to be corrected if the pinyin does not exist.
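The detection step performed by the detecting unit and the determining unit can be sketched as a membership check. This is a minimal illustration under stated assumptions: the preset database is modeled as a Python set of correctly spelled pinyins, and the function name `detect` and the status labels are hypothetical.

```python
def detect(pinyin, database):
    """Classify an input pinyin before any error correction is attempted.

    Returns ("correct", pinyin) when the database already holds this spelling,
    otherwise ("to_correct", pinyin) so that error correction can proceed.
    """
    if pinyin in database:
        return ("correct", pinyin)
    return ("to_correct", pinyin)


database = {"beijing", "nanjing", "shanghai"}
print(detect("beijing", database))  # ('correct', 'beijing')
print(detect("beijng", database))   # ('to_correct', 'beijng')
```

Only inputs labeled "to_correct" would be passed on to candidate extraction and scoring, which is how the apparatus avoids spending resources on pinyin that is already correct.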
Further, the calculating unit 43 may be specifically configured to calculate the final edit distance between each similar pinyin and the pinyin to be corrected according to the keyboard edit distance set based on the Chinese pinyin rule and a preset coefficient.
The determining unit 46 may be further configured to determine, as the error correction probability of each similar pinyin, the product of the accumulated search quantity corresponding to that similar pinyin and the reciprocal of the final edit distance.
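The scoring rule can be written out as a small sketch. Two points are assumptions: the final edit distance is taken to be the keyboard edit distance multiplied by the preset coefficient (the text says only that both are used), and the coefficient is set equal to the number of differing characters, following the claims (1 for one character, N for N characters). The function names are hypothetical, and the result is a raw score that is not normalized to a percentage here.

```python
def final_edit_distance(keyboard_distance, differing_chars):
    """Combine the keyboard edit distance with the preset coefficient (assumed product)."""
    coefficient = differing_chars  # 1 for one differing character, N for N characters
    return keyboard_distance * coefficient


def error_correction_score(search_count, keyboard_distance, differing_chars):
    """Accumulated search quantity times the reciprocal of the final edit distance."""
    distance = final_edit_distance(keyboard_distance, differing_chars)
    if distance <= 0:
        raise ValueError("final edit distance must be positive")
    return search_count * (1.0 / distance)


# One differing character, keyboard distance 0.5, searched 5000 times.
print(error_correction_score(5000, 0.5, 1))  # 10000.0
# Two differing characters, keyboard distance 1.0, searched 8000 times.
print(error_correction_score(8000, 1.0, 2))  # 4000.0
```

Under this reading, a frequently searched candidate can still rank below a rarer one if its final edit distance is large enough.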
Further, the output unit 44 includes:
the sorting module 4401 may be configured to sort the similar pinyins according to the error correction probability.
The extracting module 4402 may be configured to extract similar pinyins corresponding to error correction probabilities exceeding a preset probability threshold.
The output module 4403 may be configured to output the extracted similar pinyin according to a preset rule.
The embodiment of the invention provides another spelling error correction device. The device comprises an obtaining unit, an extracting unit, a calculating unit, and an output unit. In the prior art, when the pinyin input by a user is corrected, association and correction rely mainly on the positions of the letters on the keyboard, so the correction is relatively one-sided. In addition, after the pinyin input by the user is obtained, the device detects whether content corresponding to that pinyin exists in the preset database, so that error correction is performed only when no such content exists in the database. This avoids wasting resources on correcting pinyin that the user entered correctly.
The spelling error correction device includes a processor and a memory. The obtaining unit 31, the extracting unit 32, the calculating unit 33, the output unit 34, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor includes a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of spelling error correction is improved by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium on which a program is stored, the program implementing the spell correction method when executed by a processor.
The embodiment of the invention provides a processor, which is used for running a program, wherein the spell correction method is executed when the program runs.
The embodiment of the invention provides a device, which comprises a processor, a memory, and a program that is stored in the memory and can run on the processor, wherein the processor implements the following steps when executing the program: obtaining a pinyin to be corrected, wherein the pinyin to be corrected is a plurality of characters input by a user; extracting a plurality of similar pinyins corresponding to the pinyin to be corrected from a preset database, wherein the correctly spelled pinyins corresponding to all words and characters are stored in the preset database, and the similar pinyins are pinyins that differ from the pinyin to be corrected by a preset number of characters; calculating the error correction probabilities respectively corresponding to the similar pinyins and the pinyin to be corrected according to a keyboard edit distance set based on a Chinese pinyin rule; and outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probabilities.
Further, the method further comprises:
obtaining, from the preset database, the accumulated search quantities respectively corresponding to the similar pinyins.
Further, calculating the error correction probabilities respectively corresponding to the similar pinyins and the pinyin to be corrected comprises:
calculating the final edit distance between each similar pinyin and the pinyin to be corrected according to the keyboard edit distance set based on the Chinese pinyin rule and a preset coefficient, wherein the preset coefficient is set according to the number of characters by which the similar pinyin differs from the pinyin to be corrected; and
determining the product of the accumulated search quantity corresponding to each similar pinyin and the reciprocal of the final edit distance as the error correction probability of that similar pinyin.
Further, outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability comprises:
sorting the similar pinyins according to the error correction probability;
extracting the similar pinyins whose error correction probability exceeds a preset probability threshold; and
outputting the extracted similar pinyins according to a preset rule.
Further, before obtaining the pinyin to be corrected, the method further comprises:
obtaining a pinyin to be detected;
detecting whether a correctly spelled pinyin corresponding to the pinyin to be detected exists in the preset database;
if so, outputting the correctly spelled pinyin corresponding to the pinyin to be detected; and
if not, determining the pinyin to be detected as the pinyin to be corrected.
The device in the embodiment of the invention can be a server, a PC, a PAD, a mobile phone and the like.
An embodiment of the present invention further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: obtaining a pinyin to be corrected, wherein the pinyin to be corrected is a plurality of characters input by a user; extracting a plurality of similar pinyins corresponding to the pinyin to be corrected from a preset database, wherein the correctly spelled pinyins corresponding to all words and characters are stored in the preset database, and the similar pinyins are pinyins that differ from the pinyin to be corrected by a preset number of characters; calculating the error correction probabilities respectively corresponding to the similar pinyins and the pinyin to be corrected according to a keyboard edit distance set based on a Chinese pinyin rule; and outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probabilities.
Further, the method further comprises:
obtaining, from the preset database, the accumulated search quantities respectively corresponding to the similar pinyins.
Further, calculating the error correction probabilities respectively corresponding to the similar pinyins and the pinyin to be corrected comprises:
calculating the final edit distance between each similar pinyin and the pinyin to be corrected according to the keyboard edit distance set based on the Chinese pinyin rule and a preset coefficient, wherein the preset coefficient is set according to the number of characters by which the similar pinyin differs from the pinyin to be corrected; and
determining the product of the accumulated search quantity corresponding to each similar pinyin and the reciprocal of the final edit distance as the error correction probability of that similar pinyin.
Further, outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability comprises:
sorting the similar pinyins according to the error correction probability;
extracting the similar pinyins whose error correction probability exceeds a preset probability threshold; and
outputting the extracted similar pinyins according to a preset rule.
Further, before obtaining the pinyin to be corrected, the method further comprises:
obtaining a pinyin to be detected;
detecting whether a correctly spelled pinyin corresponding to the pinyin to be detected exists in the preset database;
if so, outputting the correctly spelled pinyin corresponding to the pinyin to be detected; and
if not, determining the pinyin to be detected as the pinyin to be corrected.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise limited, an element introduced by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art to which the present application pertains. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (9)

1. A method of spell correction, the method comprising:
obtaining pinyin to be corrected, wherein the pinyin to be corrected is a plurality of characters input by a user;
extracting a plurality of similar pinyins corresponding to the pinyin to be corrected from a preset database, wherein the correct spelling pinyins corresponding to all words and characters are stored in the preset database, and the similar pinyins are the pinyins which have a difference of a preset number of characters from the pinyin to be corrected;
calculating error correction probabilities respectively corresponding to the similar pinyins and the pinyin to be corrected according to a keyboard edit distance set based on a Chinese pinyin rule, wherein the calculating comprises: calculating final edit distances respectively corresponding to the similar pinyins and the pinyin to be corrected according to the keyboard edit distance set based on the Chinese pinyin rule and a preset coefficient, wherein the preset coefficient is set according to the number of characters by which the similar pinyin differs from the pinyin to be corrected; and determining the product of the accumulated search quantity corresponding to each similar pinyin and the reciprocal of the final edit distance as the error correction probability of each similar pinyin; wherein the keyboard edit distance is set according to different Chinese pinyin rules, the preset coefficient is 1 when the similar pinyin differs from the pinyin to be corrected by one character, and the preset coefficient is N when the difference is N characters;
and outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability.
2. The method of claim 1, further comprising:
obtaining, from the preset database, the accumulated search quantities respectively corresponding to the similar pinyins.
3. The method of claim 1, wherein the outputting the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability comprises:
sorting the similar pinyins according to the error correction probability;
extracting similar pinyins corresponding to the error correction probability exceeding a preset probability threshold;
and outputting the extracted similar pinyin according to a preset rule.
4. The method of claim 1, wherein before the obtaining the pinyin to be corrected, the method further comprises:
obtaining a pinyin to be detected;
detecting whether a correct spelling pinyin corresponding to the pinyin to be detected exists in the preset database;
if yes, outputting correct spelling pinyin corresponding to the pinyin to be detected;
and if not, determining the pinyin to be detected as the pinyin to be corrected.
5. An apparatus for spell correction, the apparatus comprising:
an obtaining unit, configured to obtain a pinyin to be corrected, wherein the pinyin to be corrected is a plurality of characters input by a user;
an extraction unit, configured to extract a plurality of similar pinyins corresponding to the pinyin to be corrected from a preset database, wherein the correctly spelled pinyins corresponding to all words and characters are stored in the preset database, and the similar pinyins are pinyins that differ from the pinyin to be corrected by a preset number of characters;
a calculation unit, configured to calculate the error correction probabilities respectively corresponding to the similar pinyins and the pinyin to be corrected according to a keyboard edit distance set based on a Chinese pinyin rule, including: calculating final edit distances respectively corresponding to the similar pinyins and the pinyin to be corrected according to the keyboard edit distance set based on the Chinese pinyin rule and a preset coefficient, wherein the preset coefficient is set according to the number of characters by which the similar pinyin differs from the pinyin to be corrected; and determining the product of the accumulated search quantity corresponding to each similar pinyin and the reciprocal of the final edit distance as the error correction probability of each similar pinyin; wherein the keyboard edit distance is set according to different Chinese pinyin rules, the preset coefficient is 1 when the similar pinyin differs from the pinyin to be corrected by one character, and the preset coefficient is N when the difference is N characters; and
an output unit, configured to output the corrected pinyin corresponding to the pinyin to be corrected according to the error correction probability.
6. The apparatus of claim 5, wherein
the obtaining unit is further configured to obtain, from the preset database, the accumulated search quantities respectively corresponding to the similar pinyins;
the extraction unit is further configured to extract the accumulated search quantities respectively corresponding to the similar pinyins obtained by the obtaining unit.
7. The apparatus of claim 6, wherein the calculation unit comprises:
the calculation module is used for calculating the final editing distance corresponding to each similar pinyin and the pinyin to be corrected according to the keyboard editing distance and the preset coefficient set based on the Chinese pinyin rule, wherein the preset coefficient is set according to the number of characters of the difference between the similar pinyin and the pinyin to be corrected;
and the determining module is used for determining the product of the accumulated search quantity corresponding to each similar pinyin and the reciprocal of the final editing distance as the error correction probability of each similar pinyin.
8. A storage medium comprising a stored program, wherein the apparatus on which the storage medium is located is controlled to perform the spell correction method of any one of claims 1 to 4 when the program is run.
9. A processor, configured to run a program, wherein the program when running performs the spell correction method of any of claims 1 to 4.
CN201710928606.5A 2017-09-30 2017-09-30 Spelling error correction method and device Active CN109597983B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710928606.5A CN109597983B (en) 2017-09-30 2017-09-30 Spelling error correction method and device

Publications (2)

Publication Number Publication Date
CN109597983A CN109597983A (en) 2019-04-09
CN109597983B true CN109597983B (en) 2022-11-04

Family

ID=65956394

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710928606.5A Active CN109597983B (en) 2017-09-30 2017-09-30 Spelling error correction method and device

Country Status (1)

Country Link
CN (1) CN109597983B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739514B (en) * 2019-07-31 2023-11-14 北京京东尚科信息技术有限公司 Voice recognition method, device, equipment and medium
CN111028834B (en) * 2019-10-30 2023-01-20 蚂蚁财富(上海)金融信息服务有限公司 Voice message reminding method and device, server and voice message reminding equipment
CN111694985B (en) * 2020-06-17 2022-03-01 北京字节跳动网络技术有限公司 Search method, search device, electronic equipment and computer-readable storage medium
CN112765231A (en) * 2021-01-04 2021-05-07 珠海格力电器股份有限公司 Data processing method and device and computer readable storage medium
CN112560452B (en) * 2021-02-25 2021-05-18 智者四海(北京)技术有限公司 Method and system for automatically generating error correction corpus
CN115437511B (en) * 2022-11-07 2023-02-21 北京澜舟科技有限公司 Pinyin Chinese character conversion method, conversion model training method and storage medium
CN115905297B (en) * 2023-01-04 2023-12-15 脉策(上海)智能科技有限公司 Method, apparatus and medium for retrieving data

Citations (3)

Publication number Priority date Publication date Assignee Title
CN102831177A (en) * 2012-07-31 2012-12-19 聚熵信息技术(上海)有限公司 Statement error correction method and system
CN106202153A (en) * 2016-06-21 2016-12-07 广州智索信息科技有限公司 The spelling error correction method of a kind of ES search engine and system
CN106847288A (en) * 2017-02-17 2017-06-13 上海创米科技有限公司 The error correction method and device of speech recognition text

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
KR101483433B1 (en) * 2013-03-28 2015-01-16 (주)이스트소프트 System and Method for Spelling Correction of Misspelled Keyword

Non-Patent Citations (1)

Title
自动拼写校对的算法设计和系统实现;郑文曦等;《科技和产业》;20130225(第02期);全文 *

Also Published As

Publication number Publication date
CN109597983A (en) 2019-04-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A

Applicant before: BEIJING GRIDSUM TECHNOLOGY Co.,Ltd.

GR01 Patent grant