CN112001168A - Word error correction method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112001168A
Authority
CN
China
Prior art keywords
word
words
feature
user input
error correction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010675640.8A
Other languages
Chinese (zh)
Inventor
高岩峰
周冰
任化强
李敏
李东晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, MIGU Culture Technology Co Ltd
Priority to CN202010675640.8A
Publication of CN112001168A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233 Character input methods
    • G06F3/0237 Character input methods using prediction or retrieval techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

Embodiments of the invention provide a word error correction method and device, electronic equipment, and a storage medium. The method comprises: determining alternative words corresponding to a user input word; determining feature words according to the user input word and the alternative words; constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set from the feature word tree; and obtaining an error correction result for the user input word according to the scoring and ranking of the composite feature words in the composite feature word set. By building a feature word tree for the feature words, determining the composite feature words from the tree, and then ranking the composite feature words by score, the method, device, electronic equipment, and storage medium achieve automatic error correction of compound words with high execution efficiency.

Description

Word error correction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of text processing technologies, and in particular, to a method and an apparatus for word error correction, an electronic device, and a storage medium.
Background
When a user enters text on an electronic device, wrong words are often entered. For example, when entering Chinese text with a pinyin input method, the user may replace a character in a word with a homophone; with a shape-based input method, the user may replace a character with a similar-looking character; extra characters, missing characters, and wrong characters also occur. Similar problems arise when a user enters pinyin or foreign-language text.
Errors in user input of text information can affect subsequent information processing flows. For example, when a user searches by using a search system, if an input search word is wrong, the accuracy of a search result is affected, and even a desired search result cannot be searched.
To this end, methods capable of automatically correcting errors of words are provided in the prior art, including edit distance algorithms and weighted edit distance algorithms.
The edit distance algorithm quantifies the difference between two strings (e.g., English text) as the minimum number of edit operations needed to change one string into the other. Edit distance is used in natural language processing, for example in spell checking, where the edit distances between a misspelled word and the correct words determine which correct word(s) the user most likely intended.
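The edit distance described above can be sketched in Python as follows. This is a minimal illustration of the classic Levenshtein distance with unit costs for insertion, deletion, and substitution; the function names and the small dictionary in the usage lines are assumptions made for illustration and are not taken from the prior art referred to here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest_words(misspelled: str, dictionary: list[str]) -> list[str]:
    """Return the dictionary words with the smallest edit distance."""
    best = min(edit_distance(misspelled, w) for w in dictionary)
    return [w for w in dictionary if edit_distance(misspelled, w) == best]


# Hypothetical usage: a spell checker picks the nearest correct word.
suggestions = closest_words("speling", ["spelling", "spewing", "sailing"])
```

Note that each call compares the input with a single dictionary word, which is the limitation with respect to compound words discussed in this section.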
The weighted edit distance algorithm calculates a weighted edit distance between a search word and pre-acquired hot words. When computing the operations that convert the search word into a hot word, different weights are assigned to inserting a string, deleting a string, replacing a character with a similar-looking or similar-sounding character, replacing a character with a dissimilar character, and swapping characters. A preset number of hot words are then selected for error correction prompts according to the weighted edit distance and the popularity of the hot words.
However, both the edit distance algorithm and the weighted edit distance algorithm compare the user input word with one candidate word at a time; neither compares it with two or more candidate words simultaneously, so neither can directly support automatic error correction of compound words.
Disclosure of Invention
The embodiment of the invention provides a word error correction method, a word error correction device, electronic equipment and a storage medium, which are used for solving the defect that the automatic error correction of a compound word cannot be realized by the word error correction method in the prior art.
An embodiment of a first aspect of the present invention provides a word error correction method, including:
determining alternative words corresponding to the user input words; wherein the user input words are combined words;
determining feature words according to the user input words and the alternative words; wherein the feature word is a maximum similar sub-character string of the user input word and the alternative word; the characteristic words comprise: the characteristic word text and the position information of the characteristic word text in the user input word;
constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree; the composite characteristic words in the composite characteristic word set are combinations of the characteristic words;
and obtaining an error correction result of the user input word according to the grading and sorting result of the composite characteristic words in the composite characteristic word set.
In the above technical solution, the constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word includes:
creating tree nodes according to the position information of the feature word text in the user input words, and storing the feature words with the same position information of the feature word text in the user input words under the same tree nodes; the position information of the feature word text in the user input word comprises the following steps: start position information, end position information;
determining the parent-child node relationship among the tree nodes according to the position information of the feature word text in the user input word, wherein the method comprises the following steps: if the starting position information of the first tree node is greater than the ending position information of the second tree node, the first tree node is a child node of the second tree node.
In the above technical solution, the obtaining a composite feature word set according to the feature word tree includes:
performing a depth-first traversal of the feature word tree to generate paths;
and, for each path, taking the Cartesian product of the feature words contained in the tree nodes of the path, in the order of the tree nodes along the path, to obtain the composite feature word set.
In the above technical solution, the obtaining a composite feature word set according to the feature word tree further includes:
and setting marks for the tree nodes according to the content of the feature words stored in the tree nodes in the feature word tree, and deleting paths which do not meet preset rules in the feature word tree according to the marks of the tree nodes.
In the above technical solution, the obtaining an error correction result of the user input word according to the score ranking result of the composite feature words in the composite feature word set includes:
performing accuracy scoring on the composite feature words;
comparing the accuracy grading result of the composite feature word with a preset accuracy threshold value, and removing the composite feature word of which the accuracy grading result does not meet the accuracy threshold value;
and sorting the composite feature words meeting the accuracy threshold according to the accuracy grading result, and taking the composite feature words with the first accuracy sorting as the error correction result of the user input words.
In the above technical solution, the obtaining an error correction result of the user input word according to the score ranking result of the composite feature words in the composite feature word set further includes:
when several composite feature words tie for first place in the accuracy ranking, ranking them a second time according to their similarity scores, and taking the composite feature word ranked first by similarity as the error correction result of the user input word; wherein the similarity score of a composite feature word is obtained from the similarity scores of the feature words that form it.
In the above technical solution, the determining the alternative word corresponding to the user input word includes:
obtaining, according to the user input word, the alternative words corresponding to the user input word from a hash index table generated based on an error correction dictionary; wherein
the hash index table is generated from keywords, and the keywords are obtained by splitting each correct word in the error correction dictionary into single characters and then joining adjacent characters into overlapping two-character or multi-character strings.
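One possible reading of the hash index table described above is an inverted index keyed by overlapping character n-grams. The following Python sketch illustrates that reading under stated assumptions: the function names, the default of two-character keywords, and the rule that any shared keyword makes a dictionary word an alternative word are all hypothetical and are not specified by the text.

```python
from collections import defaultdict


def ngram_keys(word: str, n: int = 2) -> list[str]:
    """Overlapping n-character keywords of a word (assumed n = 2)."""
    if len(word) < n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def build_index(dictionary: list[str], n: int = 2) -> dict[str, set[str]]:
    """Map every keyword to the correct words that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for word in dictionary:
        for key in ngram_keys(word, n):
            index[key].add(word)
    return index


def lookup_alternatives(index: dict[str, set[str]], user_input: str,
                        n: int = 2) -> set[str]:
    """Collect every correct word that shares a keyword with the input."""
    found: set[str] = set()
    for key in ngram_keys(user_input, n):
        found |= index.get(key, set())
    return found


index = build_index(["ABZC", "XBCD", "CD", "DE"])
alternatives = lookup_alternatives(index, "ABCDE")
```

Because the lookup is a set of dictionary accesses rather than a scan of the whole error correction dictionary, several alternative words can be retrieved for one compound input in a single pass.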
An embodiment of a second aspect of the present invention provides a word error correction apparatus, including:
the alternative word determining module is used for determining alternative words corresponding to the input words of the user; wherein the user input words are combined words;
the characteristic word determining module is used for determining characteristic words according to the user input words and the alternative words; wherein the feature word is a maximum similar sub-character string of the user input word and the alternative word; the characteristic words comprise: the characteristic word text and the position information of the characteristic word text in the user input word;
the composite feature word generation module is used for constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree; the composite characteristic words in the composite characteristic word set are combinations of the characteristic words;
and the error correction result generation module is used for obtaining the error correction result of the user input word according to the grading and sorting result of the composite characteristic words in the composite characteristic word set.
An embodiment of a third aspect of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the word error correction method according to the embodiment of the first aspect of the present invention are implemented.
An embodiment of a fourth aspect of the present invention provides a non-transitory computer readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the word error correction method according to the embodiment of the first aspect of the present invention.
According to the word error correction method, the word error correction device, the electronic equipment and the storage medium, the feature word tree is created for the feature words, the composite feature word set is determined according to the feature word tree, and then the error correction result of the user input words is determined according to the grading sorting result of the plurality of composite feature words in the set, so that automatic error correction of the composite words is achieved, and high execution efficiency is achieved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a word error correction method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a word error correction apparatus according to an embodiment of the present invention;
FIG. 3 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the embodiments of the invention, a compound word is a phrase formed by concatenating two or more words that each have an independent meaning. For example, "射雕英雄传黄日华" (the title "射雕英雄传", The Legend of the Condor Heroes, followed by the name "黄日华", Huang Rihua) is a compound word.
Neither the edit distance algorithm nor the weighted edit distance algorithm of the prior art can be applied directly to automatic error correction of compound words. The prior art therefore adapts these algorithms as follows: the possible combinations of the words that make up compound words are added to the error correction dictionary in advance as character strings, and the edit distance algorithm or weighted edit distance algorithm is then used to correct the compound word entered by the user.
For example, to support combined error correction of "射雕英雄传" and "黄日华", the error correction dictionary needs the following 4 entries:
射雕英雄传;
黄日华;
射雕英雄传黄日华;
黄日华射雕英雄传.
Compound words can be formed in a great many ways, such as combinations of several person names, of a movie name and a year, of a TV series name and a season, and so on. If all combinations were enumerated, the number of entries in the error correction dictionary would grow geometrically and exceed the performance limits of the algorithm. Compound word error correction based on an adapted edit distance algorithm or weighted edit distance algorithm can therefore only be used for a small amount of popular data and is not generally applicable.
The word error correction method provided by the embodiment of the invention can realize error correction of the compound words.
Before describing the word error correction method provided by the embodiment of the invention in detail, a unified description is first made on related concepts involved in the method.
User input word: a character string entered by the user;
Alternative word: a candidate character string; the correct character string that the user intended to enter is obtained from the alternative words. Alternative words may be character strings stored in the error correction dictionary, or may be obtained by expanding the character strings stored in the error correction dictionary;
Feature word: the maximum similar substring of the user input word and an alternative word;
Composite feature word: a character string composed of two or more feature words.
Fig. 1 is a flowchart of a word error correction method according to an embodiment of the present invention, and as shown in fig. 1, the word error correction method according to the embodiment of the present invention includes:
Step 101: determining alternative words corresponding to the user input word.
In the embodiment of the present invention, the user input word is generally a compound word, that is, the user input word includes two or more words having independent meanings. The type of the input words of the user is not limited, such as Chinese words, Chinese pinyin and foreign language words.
The words input by the user are influenced by the subjective intention of the user, and the possibility of errors exists, so the words error correction method provided by the embodiment of the invention needs to be adopted for error correction.
The method for determining the candidate word according to the user input word has various implementation modes, such as determining the candidate word by comparing an error correction dictionary, and determining the candidate word by comparing a hash index table generated based on the error correction dictionary. In the embodiment of the present invention, a specific implementation manner is not limited.
Multiple alternative words are typically available based on a single user input word.
Step 102: extracting feature words according to the alternative words and the user input word.
A feature word is the maximum similar substring of the user input word and an alternative word. Feature words can be extracted from the alternative words and the user input word with existing methods, such as the longest common subsequence (LCS) algorithm; the implementation details are not described further here.
The extracted feature words include the following information:
(1) text: the feature word text, which is a substring of the alternative word;
(2) sPosBegin: the start position, in the user input word, corresponding to the feature word text;
(3) sPosEnd: the end position, in the user input word, corresponding to the feature word text;
(4) score: the similarity score of the feature word.
The start position is the position of the first character in the user input word that matches a character in the feature word text; the end position is the position of the last character in the user input word that matches a character in the feature word text. The similarity score reflects how similar the user input word is to the alternative word: it is calculated from the characters that the two have in common, and can be obtained with an edit distance algorithm or a weighted edit distance algorithm of the prior art.
For example, assume that the user input word is "射英雄" and the alternative word is "射雕英雄传" (The Legend of the Condor Heroes). The extracted feature word then includes the following information:
text: 射雕英雄;
sPosBegin: 0;
sPosEnd: 2;
score: 2.6.
Since the character "射" in the feature word "射雕英雄" is the same as the first character of the user input word "射英雄", sPosBegin is 0 (positions are counted from 0); since the character "雄" in the feature word is the same as the third character of the user input word, sPosEnd is 2 (counting from 0, position 2 is the third character). The similarity score reflects the similarity between the user input word "射英雄" and the alternative word "射雕英雄传". Taking the LCS algorithm as an example, the extracted common subsequence is "射英雄"; compared with "射雕英雄传", the characters "雕" and "传" are missing. Under the scoring rule, characters missing at the head or tail are not penalized, while each character missing in the middle is penalized by 0.4. Each character that appears in both the user input word and the alternative word adds 1, so the score is 1 × 3 − 0.4 × 1 = 2.6.
Step 103: constructing a feature word tree from the feature words, and obtaining a composite feature word set from the feature word tree.
In principle, composite feature words could be obtained by enumerating all permutations and combinations of the feature words. When there are many feature words, however, exhaustive enumeration may generate millions of composite feature words and seriously degrade the efficiency of the error correction method. The embodiment of the invention therefore generates composite feature words by constructing a feature word tree, which greatly reduces the number of composite feature words and helps improve the efficiency of the error correction method.
Specifically, when the feature word tree is created, a tree node TreeItem is created for each two-tuple (sPosBegin, sPosEnd) of the feature words, and feature words with the same two-tuple (sPosBegin, sPosEnd) are recorded in the same tree node TreeItem.
Parent-child relationships between TreeItem nodes are established from sPosBegin and sPosEnd by the following rule: if the sPosBegin of a first tree node TreeItem is greater than the sPosEnd of a second tree node TreeItem, the first tree node is a child node of the second tree node. A tree node that has no child node is a leaf node.
After construction, the feature word tree consists of a root node and the tree nodes below it, down to the leaf nodes. The composite feature word set is obtained from the tree as follows: a depth-first traversal of the feature word tree generates paths; each node on a path holds a feature word list (containing one or more feature words); and the Cartesian product of the feature word lists along each path yields the set of composite feature words.
For example, if the alternative words include ABZC, XBCD, CD, and DE, and the user input word is ABCDE, the extracted feature words and their (sPosBegin, sPosEnd) are: ABZC (0,2), BCD (1,3), CD (2,3), and DE (3,4).
The created feature word tree is:
root
|---ABZC(0,2)
|   |---DE(3,4)
|---BCD(1,3)
|---CD(2,3)
Here, sPosBegin of the feature word DE (3,4) is 3 and sPosEnd of the feature word ABZC (0,2) is 2; since 3 is greater than 2, the node of DE (3,4) is a child node of the node of ABZC (0,2).
From this feature word tree, the composite feature words ABZCDE, BCD, and CD are obtained.
Further, assuming that in the above example, the alternative words further include ABDC, ABEC, and ABFC, the extracted feature words further include ABDC (0,2), ABEC (0,2), and ABFC (0, 2). The two-tuple (sPosBegin, sPosEnd) of the three feature words is (0,2), so when creating the feature word tree, the three feature words are under the same node as the feature word ABZC (0, 2). And in the corresponding path, carrying out Cartesian product on the three feature words and the feature word DE, so that the final composite feature word set also comprises ABDCDE, ABECDE and ABFCDE.
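The tree construction, depth-first traversal, and Cartesian product of this example can be sketched in Python as follows. The sketch is illustrative only: the function names are hypothetical, and the choice of attaching to the root every node that is not a child of any other node is an assumption inferred from the example tree, since the text does not state how the children of the root are chosen.

```python
from itertools import product

Span = tuple[int, int]  # (sPosBegin, sPosEnd)


def build_nodes(features: list[tuple[str, int, int]]) -> dict[Span, list[str]]:
    """Group feature words that share the same (sPosBegin, sPosEnd)."""
    nodes: dict[Span, list[str]] = {}
    for text, begin, end in features:
        nodes.setdefault((begin, end), []).append(text)
    return nodes


def enumerate_paths(nodes: dict[Span, list[str]]) -> list[list[Span]]:
    """Depth-first traversal from the root children down to the leaves."""
    def children(span: Span) -> list[Span]:
        # Rule from the text: a node is a child if its begin > parent's end.
        return sorted(s for s in nodes if s[0] > span[1])

    # Assumption: root children are the nodes that have no parent.
    roots = sorted(s for s in nodes
                   if not any(s[0] > other[1] for other in nodes))
    paths: list[list[Span]] = []

    def dfs(span: Span, prefix: list[Span]) -> None:
        path = prefix + [span]
        kids = children(span)
        if not kids:
            paths.append(path)  # reached a leaf node
        for kid in kids:
            dfs(kid, path)

    for root in roots:
        dfs(root, [])
    return paths


def composite_feature_words(features: list[tuple[str, int, int]]) -> list[str]:
    """Cartesian product of the feature word lists along each path."""
    nodes = build_nodes(features)
    words = []
    for path in enumerate_paths(nodes):
        for combo in product(*(nodes[span] for span in path)):
            words.append("".join(combo))
    return words


features = [("ABZC", 0, 2), ("BCD", 1, 3), ("CD", 2, 3), ("DE", 3, 4),
            ("ABDC", 0, 2), ("ABEC", 0, 2), ("ABFC", 0, 2)]
result = composite_feature_words(features)
```

With the seven feature words above, the sketch yields ABZCDE, ABDCDE, ABECDE, ABFCDE, BCD, and CD. Only feature words whose positions do not overlap are combined, which is what keeps the set far smaller than an exhaustive enumeration of all combinations.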
Step 104: obtaining the error correction result of the user input word according to the scoring and ranking of the composite feature words in the composite feature word set.
After the composite feature word set is obtained from the feature word tree, all composite feature words in the set are ranked by score, and the top-ranked composite feature word is used as the error correction result of the user input word.
Composite feature words can be scored in several ways, such as accuracy scoring and similarity scoring. The embodiment of the present invention does not limit the specific scoring method.
The word error correction method provided by the embodiment of the invention creates a feature word tree for the feature words, determines the composite feature word set from the feature word tree, and then determines the error correction result of the user input word according to the scoring and ranking of the composite feature words in the set, thereby achieving automatic error correction of compound words with high execution efficiency.
Based on any one of the above embodiments, in an embodiment of the present invention, the obtaining a composite feature word set according to the feature word tree further includes:
and setting marks for the tree nodes according to the content of the feature words stored in the tree nodes in the feature word tree, and deleting paths which do not meet preset rules in the feature word tree according to the marks of the tree nodes.
After the feature word tree is constructed from the feature words, it is subject to preset rules, and in some cases not every path in the feature word tree satisfies those rules. Deleting the paths that do not satisfy the preset rules reduces the amount of computation in subsequent steps.
For example, suppose a preset rule specifies that a composite feature word cannot contain two movie names. Under this rule, each tree node of the feature word tree records a flag indicating whether all the feature words it stores are movie names. If the flags of two tree nodes on a path are both true, the path is judged illegal and is deleted.
The word error correction method provided by the embodiment of the invention reduces the number of paths in the feature word tree by filtering the feature word tree, is beneficial to reducing the calculation amount of subsequent steps and improves the calculation efficiency.
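The flag-based path filtering can be sketched in Python as follows, using the "no two movie names" rule as the example. The function names, the representation of a node as its (sPosBegin, sPosEnd) two-tuple, and the set of movie names are hypothetical and serve only to illustrate the idea.

```python
Span = tuple[int, int]  # (sPosBegin, sPosEnd) identifies a tree node


def node_flags(nodes: dict[Span, list[str]],
               movie_names: set[str]) -> dict[Span, bool]:
    """Flag a node as True when all feature words it stores are movie names."""
    return {span: all(word in movie_names for word in words)
            for span, words in nodes.items()}


def prune_paths(paths: list[list[Span]], flags: dict[Span, bool],
                max_flagged: int = 1) -> list[list[Span]]:
    """Delete paths on which more than max_flagged nodes carry the flag."""
    return [path for path in paths
            if sum(1 for span in path if flags[span]) <= max_flagged]


# Hypothetical data: two movie-name nodes sit on the first path.
nodes = {(0, 2): ["ABZC"], (3, 4): ["DE"], (1, 3): ["BCD"]}
paths = [[(0, 2), (3, 4)], [(1, 3)]]
flags = node_flags(nodes, movie_names={"ABZC", "DE"})
legal_paths = prune_paths(paths, flags)
```

Because the check runs on node flags before the Cartesian product is taken, an illegal path is discarded once rather than once per composite feature word it would have produced.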
Based on any one of the above embodiments, in an embodiment of the present invention, the obtaining an error correction result of the user input word according to a score ranking result of the composite feature words in the composite feature word set includes:
performing accuracy scoring on the composite feature words;
comparing the accuracy grading result of the composite feature word with a preset accuracy threshold value, and removing the composite feature word of which the accuracy grading result does not meet the accuracy threshold value;
and sorting the composite feature words meeting the accuracy threshold according to the accuracy grading result, and taking the composite feature words with the first accuracy sorting as the error correction result of the user input words.
The accuracy score is used for evaluating the accuracy of the compound feature word.
The accuracy scoring rule can be determined by business logic, and in the embodiment of the invention, the accuracy scoring mode for the compound feature words is as follows:
$$p = \sum_{i=1}^{n} a_i - b \cdot d - c \cdot s$$
where p represents the accuracy score of the composite feature word, a_i represents the similarity score of the i-th feature word in the composite feature word, n represents the number of feature words contained in the composite feature word, b represents the weight for extra characters, d represents the number of extra characters, c represents the weight for missing characters, and s represents the number of missing characters. An extra character is a character that is present in the user input word but absent from the composite feature word; a missing character is a character that is absent from the user input word but present in the composite feature word.
For example, suppose the user input word is "big hero Huang Rihua" and its corresponding composite feature word is "shoot carving hero Huang Rihua". Here "big" is an extra character and "carving" is a missing character. Let the extra-character weight be 0.6 and the missing-character weight be 0.5.
The similarity score of "shoot carving hero" is known to be 2.5, and the similarity score of "Huang Rihua" is known to be 3. Substituting into the formula:
the accuracy score of the composite feature word "shoot carving hero Huang Rihua" is 2.5 + 3 - 0.6 × 1 = 4.9.
In this example, no points are deducted for the missing character "carving" when calculating the accuracy score, because that character has already been penalized in the similarity score of "shoot carving hero" and must not be penalized twice. When calculating the accuracy score, whether missing characters need to be deducted can be determined by checking whether the positions (sPosBegin, sPosEnd) of the feature words "shoot carving hero" and "Huang Rihua", obtained by splitting the composite feature word "shoot carving hero Huang Rihua", are contiguous. Specifically, missing characters at the head or tail of a feature word are not deducted. For example, if the correct word is "shoot carving hero pass" and the user input word is "carving hero", the missing "shoot" at the head of the feature word and the missing "pass" at its tail are not deducted. Missing characters inside a feature word are deducted. For example, if the correct word is "shoot carving hero pass" and the user input word is "shoot hero", the missing character "carving" is deducted.
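The scoring rule and the head/tail rule for missing characters can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, the formula is taken as the sum of the similarity scores minus the weighted counts of extra and missing characters (consistent with the worked example, 2.5 + 3 - 0.6 × 1 = 4.9), and Python's `difflib` alignment stands in for the position comparison based on (sPosBegin, sPosEnd). The Chinese strings are assumed back-translations of "shoot carving hero pass".

```python
from difflib import SequenceMatcher
from typing import Sequence


def accuracy_score(similarities: Sequence[float],
                   extra_weight: float, extra_count: int,
                   missing_weight: float, missing_count: int) -> float:
    """p = sum(a_i) - b*d - c*s, with b/d for extra and c/s for missing characters."""
    return sum(similarities) - extra_weight * extra_count - missing_weight * missing_count


def interior_missing_count(user_input: str, correct_word: str) -> int:
    """Count characters missing from the input that are not at its head or tail.

    Assumption: only pure insertions found by difflib are counted;
    substitutions ('replace' opcodes) are ignored in this sketch.
    """
    matcher = SequenceMatcher(None, user_input, correct_word)
    count = 0
    for tag, i1, _i2, j1, j2 in matcher.get_opcodes():
        at_head = i1 == 0
        at_tail = i1 == len(user_input)
        if tag == "insert" and not at_head and not at_tail:
            count += j2 - j1
    return count


# Worked example from the text: one extra character, no missing-character deduction.
p = accuracy_score([2.5, 3.0], extra_weight=0.6, extra_count=1,
                   missing_weight=0.5, missing_count=0)
print(round(p, 2))
```

With the assumed correct word "射雕英雄传", the input "雕英雄" lacks characters only at the head and tail, so nothing is counted, while the input "射英雄" lacks one interior character.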
After the accuracy scores of the composite feature words are obtained, each score is compared with a preset accuracy threshold, and the composite feature words whose accuracy scores do not meet the threshold are removed. This prevents a low-accuracy composite feature word from being used to over-correct the user input word.
Finally, the composite feature words that meet the accuracy threshold are sorted by accuracy score, and the composite feature word ranked first is taken as the error correction result of the user input word.
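The threshold filter and first-place selection can be sketched as follows. This is illustrative only: `pick_correction` is a hypothetical name, "meeting the threshold" is assumed to mean a score greater than or equal to it, and returning `None` when no candidate survives is an assumption, since the text does not say what happens in that case.

```python
from typing import Dict, Optional


def pick_correction(scores: Dict[str, float], threshold: float) -> Optional[str]:
    """Drop candidates below the threshold, then return the highest-scoring one."""
    kept = {word: p for word, p in scores.items() if p >= threshold}
    if not kept:
        return None  # assumption: no correction is offered
    return max(kept, key=kept.get)


# Hypothetical candidates and scores, for illustration only.
candidates = {"candidate A": 4.9, "candidate B": 3.1, "candidate C": 1.2}
print(pick_correction(candidates, threshold=2.0))
```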
The word error correction method provided by the embodiment of the invention selects the composite characteristic word as the error correction result of the user input word according to the accuracy grade of the composite characteristic word, and has the advantage of good error correction effect.
Based on any one of the above embodiments, in an embodiment of the present invention, the obtaining an error correction result of the user input word according to a score ranking result of the composite feature words in the composite feature word set further includes:
when multiple composite feature words tie for first place in the accuracy ranking, performing a secondary ranking of those composite feature words according to their similarity scores, and taking the composite feature word ranked first by similarity as the error correction result of the user input word; wherein the similarity score of a composite feature word is obtained from the similarity scores of the feature words that form it.
Specifically, if more than one compound feature word has the same accuracy score, the compound feature words with the same accuracy score are ranked according to the similarity score, and the compound feature word with the first ranked similarity is used as the error correction result of the user input word.
The similarity score of the compound feature word can be obtained by summing the feature word similarity scores of all the feature words forming the compound feature word. The calculation of the similarity score of each feature word can be implemented by the prior art, and is also described in the previous step 102.
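The secondary ranking can be expressed as a sort on a two-part key, as in the sketch below. The tuple layout and the name `rank` are assumptions made for illustration; the similarity score of a composite feature word is taken as the sum of its feature-word similarity scores, as described above.

```python
from typing import List, Tuple

# (text, accuracy score, similarity scores of the constituent feature words)
Composite = Tuple[str, float, List[float]]


def rank(composites: List[Composite]) -> List[Composite]:
    """Sort by accuracy score first, then by summed similarity, both descending."""
    return sorted(composites, key=lambda c: (c[1], sum(c[2])), reverse=True)


# Hypothetical data: X and Y tie on accuracy, Y has the higher similarity sum.
composites = [
    ("X", 4.9, [2.5, 3.0]),
    ("Y", 4.9, [3.0, 3.0]),
    ("Z", 4.0, [2.0, 3.0]),
]
print(rank(composites)[0][0])
```

Sorting on the pair (accuracy, similarity sum) gives the same first element as ranking by accuracy and then re-ranking only the tied leaders by similarity.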
The word error correction method provided by the embodiment of the invention has the advantages that on the premise that a plurality of compound characteristic words have the same accuracy scores, the compound characteristic words are subjected to secondary sequencing according to the similarity, and the error correction result of the user input word is obtained according to the secondary sequencing result, so that the error correction effect is good.
Based on any one of the above embodiments, in an embodiment of the present invention, the determining an alternative word corresponding to a word input by a user includes:
according to the user input word, obtaining an alternative word corresponding to the user input word from a hash index table generated based on an error correction dictionary; wherein
the hash index table generated based on the error correction dictionary is a hash index table generated from keywords, and the keywords are obtained by segmenting the correct words in the error correction dictionary into single characters and then combining the resulting characters with a two-character overlap or a multi-character overlap.
The correction dictionary typically contains confirmed correctly written words. However, the number and range of words contained in the error correction dictionary are limited, and if the corresponding candidate words are searched for the user input words directly according to the error correction dictionary, the search result may not be complete. Therefore, the hash index table generated based on the error correction dictionary is adopted in the embodiment of the invention.
The error correction dictionary stores correctly written words. A character segmentation operation is performed on each correct word to obtain a list whose basic unit is the character. A character in the embodiments of the present invention refers to a single Chinese character, a word in a foreign language, or a character representing an Arabic numeral or a punctuation mark. For example, if the error correction dictionary stores the Chinese word "变形金刚" ("Transformers"), the segmentation result is a list including "变", "形", "金", and "刚". If the object to be segmented is in a foreign language (such as English), the basic unit of segmentation is a foreign word; if the objects to be segmented are Arabic numerals and punctuation marks, the basic unit of segmentation is a character. In a preferred implementation, the segmentation result includes pronunciation information (e.g., the pinyin of a Chinese character) in addition to the character itself.
Keywords are then generated from the result of the character segmentation. Keywords can be generated in various ways, one of which is 2-character overlap: adjacent characters in the segmentation result are combined to form a keyword. For example, from "变", "形", "金", "刚" the following keywords can be generated: "变形", "形金", "金刚". When the segmentation result contains few characters, for example 4 or fewer, keywords can also be generated by skipping one character. For example, the keywords "变金" and "形刚" can be generated from "变", "形", "金", "刚". Besides 2-character overlap, 3-character overlap or overlap of more characters can also be used. The number of overlapping characters affects both the error correction capability and the operating efficiency of the method: a larger number of overlapping characters reduces the number of generated keywords, which improves operating efficiency; a smaller number increases the number of generated keywords, which improves error correction capability. Therefore, the manner of generating keywords needs to be chosen according to the specific situation.
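Keyword generation by overlapping characters can be sketched as follows. The function name and parameters are hypothetical, the input is assumed to be a Chinese word in which each character is one unit (foreign-language words would first need word-level segmentation), and "变形金刚" is used as the assumed original of the "Transformers" example.

```python
from typing import List


def generate_keywords(word: str, n: int = 2, short_limit: int = 4) -> List[str]:
    """Build overlapping n-character keywords, plus skip-one pairs for short words."""
    chars = list(word)  # one Chinese character per unit
    keywords = ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]
    if n == 2 and len(chars) <= short_limit:
        # Skip one character: combine character i with character i + 2.
        keywords += [chars[i] + chars[i + 2] for i in range(len(chars) - 2)]
    return keywords


print(generate_keywords("变形金刚"))       # 2-character overlap plus skip-one pairs
print(generate_keywords("变形金刚", n=3))  # 3-character overlap yields fewer keywords
```

The 3-character variant yields two keywords instead of five for the same word, which illustrates the trade-off between the number of keywords and error correction capability.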
After the keywords are obtained, a hash index table can be built for them. When the hash index table is built, keywords with the same or similar characteristics are assigned to the same index. For example, keywords with the same or similar pronunciation are assigned to the same index. As another example, keywords with the same or similar glyphs are assigned to the same index. In a preferred implementation, the hash index table includes several types of index, such as a pronunciation index and a glyph index, so that one keyword can belong to different indexes at the same time.
After a Hash index table generated based on the error correction dictionary is established, the user input words are compared with the Hash index table, and alternative words corresponding to the user input words are obtained.
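Building the index and retrieving alternative words can be sketched as follows. This is a simplified illustration, not the patent's data structure: the function names, the sample dictionary, and the one-entry homophone map are all assumptions, and a `normalize` function stands in for the pronunciation index by mapping characters with the same pronunciation to one representative.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def bigrams(word: str) -> List[str]:
    """2-character overlapping keywords of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def build_index(dictionary: Iterable[str],
                normalize: Callable[[str], str] = lambda k: k) -> Dict[str, Set[str]]:
    """Map each (normalized) keyword to the dictionary words that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for word in dictionary:
        for key in bigrams(word):
            index[normalize(key)].add(word)
    return index


def lookup(index: Dict[str, Set[str]], user_input: str,
           normalize: Callable[[str], str] = lambda k: k) -> Set[str]:
    """Collect every dictionary word that shares a keyword with the input."""
    found: Set[str] = set()
    for key in bigrams(user_input):
        found |= index.get(normalize(key), set())
    return found


# Hypothetical homophone table: both characters are pronounced "gang".
HOMOPHONES = str.maketrans({"钢": "刚"})


def by_sound(key: str) -> str:
    return key.translate(HOMOPHONES)


dictionary = ["变形金刚", "金刚狼"]
exact = lookup(build_index(dictionary), "变形金钢")
fuzzy = lookup(build_index(dictionary, by_sound), "变形金钢", by_sound)
print(sorted(exact), sorted(fuzzy))
```

With exact keywords only the word sharing "变形" or "形金" is retrieved; once homophones share an index, the misspelled keyword "金钢" also reaches words containing "金刚".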
Multiple alternative words are typically available based on a single user input word. For example, the user input word is "transformers", and its corresponding alternatives include: "transformers" (retrievable by both the keywords "transformers" and "diamonds"), "transformers" (retrievable by both the keywords "transformers" and "diamonds").
The word error correction method provided by the embodiment of the invention can realize automatic error correction of the combined word, does not need to expand an error correction dictionary, and has higher execution efficiency.
Fig. 2 is a schematic diagram of a word error correction apparatus according to an embodiment of the present invention, and as shown in fig. 2, the word error correction apparatus according to the embodiment of the present invention includes:
an alternative word determining module 201, configured to determine an alternative word corresponding to a word input by a user;
a feature word determining module 202, configured to determine a feature word according to the user input word and the candidate word; wherein the feature word is a maximum similar sub-character string of the user input word and the alternative word; the characteristic words comprise: the characteristic word text and the position information of the characteristic word text in the user input word;
the composite feature word generation module 203 is configured to construct a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtain a composite feature word set according to the feature word tree; the composite characteristic words in the composite characteristic word set are combinations of the characteristic words;
and the error correction result generating module 204 is configured to obtain an error correction result of the user input word according to the scoring and sorting result of the composite feature words in the composite feature word set.
The word error correction device provided by the embodiment of the invention realizes automatic error correction of combined words by creating the feature word tree for the feature words, determining the composite feature word set according to the feature word tree and then determining the error correction result of the user input words according to the grading and sorting result of a plurality of composite feature words in the set, thereby having higher execution efficiency.
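The division of the apparatus into four modules can be sketched as a simple pipeline. The class, field names, and type aliases below are hypothetical and only show how the output of each module feeds the next; the behavior of each module is supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (feature word text, start position, end position) within the user input word
FeatureWord = Tuple[str, int, int]
Composite = List[FeatureWord]


@dataclass
class WordCorrector:
    find_candidates: Callable[[str], List[str]]                        # module 201
    find_feature_words: Callable[[str, List[str]], List[FeatureWord]]  # module 202
    build_composites: Callable[[List[FeatureWord]], List[Composite]]   # module 203
    pick_result: Callable[[str, List[Composite]], str]                 # module 204

    def correct(self, user_input: str) -> str:
        candidates = self.find_candidates(user_input)
        features = self.find_feature_words(user_input, candidates)
        composites = self.build_composites(features)
        return self.pick_result(user_input, composites)


# Stub modules, for illustration only.
corrector = WordCorrector(
    find_candidates=lambda text: ["abc"],
    find_feature_words=lambda text, cands: [(c, 0, len(c)) for c in cands],
    build_composites=lambda feats: [[f] for f in feats],
    pick_result=lambda text, comps: comps[0][0][0] if comps else text,
)
print(corrector.correct("abd"))
```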
Fig. 3 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in Fig. 3, the electronic device may include a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340. The processor 310 may call logic instructions in the memory 330 to perform the following method: determining alternative words corresponding to the user input words, wherein the user input words are combined words; determining feature words according to the user input words and the alternative words, wherein a feature word is a maximum similar substring of the user input word and the alternative word, and the feature word comprises the feature word text and the position information of the feature word text in the user input word; constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree, wherein the composite feature words in the composite feature word set are combinations of the feature words; and obtaining an error correction result of the user input word according to the score ranking result of the composite feature words in the composite feature word set.
It should be noted that, when being implemented specifically, the electronic device in this embodiment may be a server, a PC, or other devices, as long as the structure includes the processor 310, the communication interface 320, the memory 330, and the communication bus 340 shown in fig. 3, where the processor 310, the communication interface 320, and the memory 330 complete mutual communication through the communication bus 340, and the processor 310 may call the logic instruction in the memory 330 to execute the above method. The embodiment does not limit the specific implementation form of the electronic device.
In addition, the logic instructions in the memory 330 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Further, an embodiment of the present invention discloses a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above method embodiments, for example: determining alternative words corresponding to the user input words, wherein the user input words are combined words; determining feature words according to the user input words and the alternative words, wherein a feature word is a maximum similar substring of the user input word and the alternative word, and the feature word comprises the feature word text and the position information of the feature word text in the user input word; constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree, wherein the composite feature words in the composite feature word set are combinations of the feature words; and obtaining an error correction result of the user input word according to the score ranking result of the composite feature words in the composite feature word set.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, performs the method provided by the foregoing embodiments, for example: determining alternative words corresponding to the user input words, wherein the user input words are combined words; determining feature words according to the user input words and the alternative words, wherein a feature word is a maximum similar substring of the user input word and the alternative word, and the feature word comprises the feature word text and the position information of the feature word text in the user input word; constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree, wherein the composite feature words in the composite feature word set are combinations of the feature words; and obtaining an error correction result of the user input word according to the score ranking result of the composite feature words in the composite feature word set.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of word error correction, comprising:
determining alternative words corresponding to the user input words; wherein the user input words are combined words;
determining feature words according to the user input words and the alternative words; wherein the feature word is a maximum similar sub-character string of the user input word and the alternative word; the characteristic words comprise: the characteristic word text and the position information of the characteristic word text in the user input word;
constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree; the composite characteristic words in the composite characteristic word set are combinations of the characteristic words;
and obtaining an error correction result of the user input word according to the grading and sorting result of the composite characteristic words in the composite characteristic word set.
2. The word error correction method according to claim 1, wherein the constructing a feature word tree according to the feature word text and position information of the feature word text in the user input word comprises:
creating tree nodes according to the position information of the feature word text in the user input words, and storing the feature words with the same position information of the feature word text in the user input words under the same tree nodes; the position information of the feature word text in the user input word comprises the following steps: start position information, end position information;
determining the parent-child node relationship among the tree nodes according to the position information of the feature word text in the user input word, wherein the method comprises the following steps: if the starting position information of the first tree node is greater than the ending position information of the second tree node, the first tree node is a child node of the second tree node.
3. The word error correction method according to claim 1, wherein the obtaining a composite feature word set according to the feature word tree includes:
performing a depth-first traversal of the feature word tree to generate paths;
and according to the position sequence of the tree nodes in the paths, carrying out Cartesian product on all the feature words contained in each tree node in each path to obtain a composite feature word set.
4. The word error correction method according to claim 3, wherein the obtaining of the composite feature word set according to the feature word tree further comprises:
and setting marks for the tree nodes according to the content of the feature words stored in the tree nodes in the feature word tree, and deleting paths which do not meet preset rules in the feature word tree according to the marks of the tree nodes.
5. The word error correction method according to claim 1, wherein obtaining the error correction result of the user input word according to the score sorting result of the compound feature words in the compound feature word set comprises:
performing accuracy scoring on the composite feature words;
comparing the accuracy grading result of the composite feature word with a preset accuracy threshold value, and removing the composite feature word of which the accuracy grading result does not meet the accuracy threshold value;
and sorting the composite feature words meeting the accuracy threshold according to the accuracy grading result, and taking the composite feature words with the first accuracy sorting as the error correction result of the user input words.
6. The word error correction method according to claim 5, wherein the obtaining of the error correction result of the user input word according to the ranking result of the scores of the compound feature words in the compound feature word set further comprises:
when multiple composite feature words tie for first place in the accuracy ranking, performing a secondary ranking of those composite feature words according to their similarity scores, and taking the composite feature word ranked first by similarity as the error correction result of the user input word; wherein the similarity score of a composite feature word is obtained from the similarity scores of the feature words that form it.
7. The word error correction method of claim 1, wherein the determining the candidate word corresponding to the user input word comprises:
according to the user input word, obtaining an alternative word corresponding to the user input word from a hash index table generated based on an error correction dictionary; wherein
the hash index table generated based on the error correction dictionary is a hash index table generated from keywords, and the keywords are obtained by segmenting the correct words in the error correction dictionary into single characters and then combining the resulting characters with a two-character overlap or a multi-character overlap.
8. A word error correction apparatus, comprising:
the alternative word determining module is used for determining alternative words corresponding to the input words of the user; wherein the user input words are combined words;
the characteristic word determining module is used for determining characteristic words according to the user input words and the alternative words; wherein the feature word is a maximum similar sub-character string of the user input word and the alternative word; the characteristic words comprise: the characteristic word text and the position information of the characteristic word text in the user input word;
the composite feature word generation module is used for constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree; the composite characteristic words in the composite characteristic word set are combinations of the characteristic words;
and the error correction result generation module is used for obtaining the error correction result of the user input word according to the grading and sorting result of the composite characteristic words in the composite characteristic word set.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the word correction method according to any one of claims 1 to 7 are implemented when the processor executes the program.
10. A non-transitory computer readable storage medium, on which a computer program is stored, the computer program, when being executed by a processor, implementing the steps of the word correction method according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010675640.8A CN112001168A (en) 2020-07-14 2020-07-14 Word error correction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112001168A true CN112001168A (en) 2020-11-27

Family

ID=73466942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010675640.8A Pending CN112001168A (en) 2020-07-14 2020-07-14 Word error correction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112001168A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01316863A (en) * 1988-06-17 1989-12-21 Nippon Telegr & Teleph Corp <Ntt> Automatic qualifying and correcting device for error in japanese language text
CN105975625A (en) * 2016-05-26 2016-09-28 同方知网数字出版技术股份有限公司 Chinglish inquiring correcting method and system oriented to English search engine
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN108595437A (en) * 2018-05-04 2018-09-28 和美(深圳)信息技术股份有限公司 Text query error correction method, device, computer equipment and storage medium
CN110362824A (en) * 2019-06-24 2019-10-22 广州多益网络股份有限公司 A kind of method, apparatus of automatic error-correcting, terminal device and storage medium
CN110795617A (en) * 2019-08-12 2020-02-14 腾讯科技(深圳)有限公司 Error correction method and related device for search terms
US20200082808A1 (en) * 2018-09-12 2020-03-12 Kika Tech (Cayman) Holdings Co., Limited Speech recognition error correction method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘亮亮;曹存根;: "基于局部上下文特征的组合的中文真词错误自动校对研究", 计算机科学, vol. 43, no. 12, pages 30 - 35 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination