CN112001168B - Word error correction method, device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN112001168B
CN112001168B CN202010675640.8A CN202010675640A CN112001168B CN 112001168 B CN112001168 B CN 112001168B CN 202010675640 A CN202010675640 A CN 202010675640A CN 112001168 B CN112001168 B CN 112001168B
Authority
CN
China
Prior art keywords
word
words
feature
user input
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010675640.8A
Other languages
Chinese (zh)
Other versions
CN112001168A (en
Inventor
高岩峰
周冰
任化强
李敏
李东晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
MIGU Culture Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and MIGU Culture Technology Co Ltd
Priority to CN202010675640.8A
Publication of CN112001168A
Application granted
Publication of CN112001168B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/02 Input arrangements using manually operated switches, e.g. using keyboards or dials
    • G06F3/023 Arrangements for converting discrete items of information into a coded form, e.g. arrangements for interpreting keyboard generated codes as alphanumeric codes, operand codes or instruction codes
    • G06F3/0233 Character input methods
    • G06F3/0237 Character input methods using prediction or retrieval techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Abstract

Embodiments of the invention provide a word error correction method, a word error correction device, electronic equipment and a storage medium. The method comprises: determining candidate words corresponding to a user input word; determining feature words according to the user input word and the candidate words; constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set from the feature word tree; and obtaining an error correction result for the user input word according to the score ranking of the composite feature words in the composite feature word set. By building a feature word tree from the feature words, deriving composite feature words from that tree, and then selecting the error correction result from the score ranking of the composite feature words, the method, device, electronic equipment and storage medium achieve automatic error correction of combined words with high execution efficiency.

Description

Word error correction method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of text processing technologies, and in particular, to a word error correction method, a device, an electronic apparatus, and a storage medium.
Background
When a user enters text on an electronic device, wrong words are often entered. For example, when entering Chinese text with a pinyin input method, the user may replace a character in a word with a homophone; with a shape-based input method, the user may replace a character with a similar-looking one; the user may also type extra characters, omit characters, or type wrong characters. Similar problems occur when a user enters pinyin or foreign-language text.
Errors in the text entered by the user affect the subsequent information processing flow. For example, when a user searches with a search system and the search term is wrong, the accuracy of the search results suffers, and the desired result may not be found at all.
To this end, the prior art provides methods for automatically correcting words, including edit distance algorithms and weighted edit distance algorithms.
The edit distance algorithm quantifies the difference between two strings (for example, English words) by counting the minimum number of edit operations needed to change one string into the other. Edit distance is used in natural language processing, for example in spell checking, to determine which correct words are most likely intended based on their edit distance from a misspelled word.
The weighted edit distance algorithm calculates a weighted edit distance between a search word and pre-acquired hot words. In this calculation, different weights are assigned to the operations that convert the search word into a hot word: inserting a character string, deleting a character string, replacing a character with a similar-looking or similar-sounding one, replacing a character with one that is neither, and swapping characters. A preset number of hot words is then selected for the error correction prompt according to the weighted edit distance and the popularity of the hot words.
However, both the edit distance algorithm and the weighted edit distance algorithm compare the user input word with one candidate word at a time and have no operation for comparing it with two or more candidate words simultaneously, so neither can directly support automatic error correction of combined words.
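The two prior-art approaches can be illustrated with a minimal sketch. The function name `edit_distance` and the per-operation cost parameters are illustrative assumptions, not taken from the cited prior art; unit costs give the plain edit distance, and non-unit costs give a simple weighted variant. Note that each call compares the input with exactly one candidate, which is the limitation described above.

```python
def edit_distance(source: str, target: str,
                  insert_cost: float = 1.0,
                  delete_cost: float = 1.0,
                  replace_cost: float = 1.0) -> float:
    """Dynamic-programming edit distance with configurable operation costs."""
    # previous[j] holds the cost of turning the processed prefix of source
    # into the first j characters of target.
    previous = [j * insert_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * delete_cost]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (0.0 if s_char == t_char else replace_cost)
            current.append(min(previous[j] + delete_cost,      # delete s_char
                               current[j - 1] + insert_cost,   # insert t_char
                               substitution))                  # keep or replace
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting"))             # 3.0
print(edit_distance("abc", "abcd", insert_cost=0.5))  # 0.5
```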
Disclosure of Invention
Embodiments of the invention provide a word error correction method, a device, electronic equipment and a storage medium, which address the defect that prior-art word error correction methods cannot automatically correct combined words.
An embodiment of a first aspect of the present invention provides a word error correction method, including:
determining candidate words corresponding to a user input word, wherein the user input word is a combined word;
determining feature words according to the user input word and the candidate words, wherein a feature word is the maximum similar substring of the user input word and a candidate word, and each feature word includes the feature word text and the position information of the feature word text in the user input word;
constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set from the feature word tree, wherein the composite feature words in the composite feature word set are combinations of feature words;
and obtaining an error correction result for the user input word according to the score ranking of the composite feature words in the composite feature word set.
In the above technical solution, the constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word includes:
creating tree nodes according to the position information of the feature word text in the user input word, and storing feature words that have the same position information under the same tree node, wherein the position information of the feature word text in the user input word comprises start position information and end position information;
determining parent-child relations between tree nodes according to the position information of the feature word text in the user input word, comprising: if the start position information of a first tree node is greater than the end position information of a second tree node, the first tree node is a child node of the second tree node.
In the above technical solution, the obtaining the composite feature word set according to the feature word tree includes:
traversing the feature word tree depth-first to generate paths;
and, following the positional order of the tree nodes in each path, taking the Cartesian product of the feature words contained in the tree nodes of that path to obtain the composite feature word set.
In the above technical solution, the obtaining the composite feature word set according to the feature word tree further includes:
setting a mark on each tree node according to the content of the feature words stored in that tree node, and deleting from the feature word tree, according to the marks of the tree nodes, the paths that do not satisfy a preset rule.
In the above technical solution, the obtaining the error correction result of the user input word according to the score sorting result of the compound feature words in the compound feature word set includes:
scoring the accuracy of the composite feature words;
comparing the accuracy score of each composite feature word with a preset accuracy threshold, and removing the composite feature words whose accuracy scores do not meet the accuracy threshold;
and ranking the composite feature words that meet the accuracy threshold by accuracy score, and taking the composite feature word ranked first by accuracy as the error correction result for the user input word.
In the above technical solution, the obtaining the error correction result of the user input word according to the score sorting result of the compound feature words in the compound feature word set further includes:
when several composite feature words are tied for first place by accuracy, ranking those composite feature words a second time by similarity score, and taking the composite feature word ranked first by similarity as the error correction result for the user input word, wherein the similarity score of a composite feature word is obtained from the similarity scores of the feature words that form it.
In the above technical solution, the determining the candidate word corresponding to the user input word includes:
obtaining, according to the user input word, the candidate words corresponding to the user input word from a hash index table generated from an error correction dictionary; wherein
the hash index table is generated as follows: the correct words in the error correction dictionary are split into single characters, the characters are overlapped two at a time or several at a time to obtain keys, and the hash index table is built from those keys.
An embodiment of a second aspect of the present invention provides a word error correction apparatus, including:
a candidate word determining module, configured to determine candidate words corresponding to a user input word, wherein the user input word is a combined word;
a feature word determining module, configured to determine feature words according to the user input word and the candidate words, wherein a feature word is the maximum similar substring of the user input word and a candidate word, and each feature word includes the feature word text and the position information of the feature word text in the user input word;
a composite feature word generation module, configured to construct a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and to obtain a composite feature word set from the feature word tree, wherein the composite feature words in the composite feature word set are combinations of feature words;
and an error correction result generation module, configured to obtain the error correction result for the user input word according to the score ranking of the composite feature words in the composite feature word set.
An embodiment of the third aspect of the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the word error correction method according to the embodiment of the first aspect of the present invention when the program is executed.
An embodiment of a fourth aspect of the present invention provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a word error correction method according to an embodiment of the first aspect of the present invention.
According to the word error correction method, the device, the electronic equipment and the storage medium, the feature word tree is created for the feature words, the composite feature word set is determined according to the feature word tree, and then the error correction result of the user input word is determined according to the scoring and sorting results of the multiple composite feature words in the set, so that automatic error correction of the combined words is realized, and higher execution efficiency is achieved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a word error correction method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a word error correction device according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the embodiments of the invention, a combined word is a phrase formed by concatenating two or more words that each have an independent meaning. For example, "jet carving hero transmission yellow somewhere" is a combined word; the phrase is a character-by-character rendering of the Chinese title 射雕英雄传 followed by an anonymized personal name with the surname Huang (黄).
The prior-art edit distance algorithm and weighted edit distance algorithm cannot be applied directly to automatic error correction of combined words. They must first be adapted, namely: the possible combinations of the words that make up combined words are added to the error correction dictionary in advance as character strings, and the combined word entered by the user is then corrected with the edit distance algorithm or the weighted edit distance algorithm.
For example, to correct combinations of "jet carving hero transmission" and "yellow somewhere", the error correction dictionary needs the following 4 entries:
jet carving hero transmission;
yellow somewhere;
jet carving hero transmission yellow somewhere;
yellow somewhere jet carving hero transmission.
Combined words can be formed in many ways, such as several personal names, a movie title and a year, or a drama title and a season number. If all combinations are enumerated, the number of entries in the error correction dictionary grows geometrically and exceeds the performance limit of the algorithm. Adapting the edit distance algorithm or the weighted edit distance algorithm in this way therefore works only for a small amount of popular data and is not a general solution.
The word error correction method provided by the embodiment of the invention can realize error correction of the combined word.
Before the word error correction method provided by the embodiment of the invention is described in detail, the concepts it relies on are defined.
User input word: a character string entered by the user;
Candidate word: a candidate character string; the correct character string that the user intended to enter is obtained from the candidate words. Candidate words may be character strings stored in the error correction dictionary, or may be obtained by expanding the character strings stored in the error correction dictionary;
Feature word: the maximum similar substring of the user input word and a candidate word;
Composite feature word: a character string composed of two or more feature words.
FIG. 1 is a flowchart of a word error correction method provided by an embodiment of the present invention. As shown in FIG. 1, the method includes:
Step 101, determining candidate words corresponding to the user input word.
In embodiments of the present invention, the user input word is typically a combined word, that is, it contains two or more words that each have an independent meaning. The type of user input word is not limited: it may be a Chinese word, Chinese pinyin, or a foreign-language word.
Because user input depends on what the user happens to type, it is prone to errors and needs to be corrected with the word error correction method provided by the embodiment of the invention.
Candidate words can be determined from the user input word in various ways, for example by comparison with an error correction dictionary, or by lookup in a hash index table generated from the error correction dictionary. The embodiments of the present invention do not limit the specific implementation.
Multiple candidate words are typically obtained for one user input word.
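The hash-index lookup mentioned above can be sketched as follows, using the overlapping two-character keys described in the disclosure. The function names `build_index` and `lookup_candidates`, and the rule that a dictionary word becomes a candidate as soon as it shares one key with the input, are illustrative assumptions rather than details stated in the patent.

```python
from collections import defaultdict


def build_index(dictionary_words, n=2):
    """Map every overlapping n-character key to the dictionary words containing it."""
    index = defaultdict(set)
    for word in dictionary_words:
        for i in range(len(word) - n + 1):
            index[word[i:i + n]].add(word)
    return index


def lookup_candidates(user_input, index, n=2):
    """Collect dictionary words sharing at least one n-character key with the input."""
    candidates = set()
    for i in range(len(user_input) - n + 1):
        candidates |= index.get(user_input[i:i + n], set())
    return candidates


# Hypothetical error correction dictionary.
index = build_index(["ABZC", "XBCD", "CD", "DE", "QRS"])
print(sorted(lookup_candidates("ABCDE", index)))  # ['ABZC', 'CD', 'DE', 'XBCD']
```

Because each key lookup is a constant-time hash access, the cost of candidate retrieval in this sketch depends on the length of the input rather than on the size of the dictionary.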
Step 102, extracting feature words according to the candidate words and the user input word.
A feature word is the maximum similar substring of the user input word and a candidate word. Feature words can be extracted from the candidate word and the user input word with existing methods, such as the longest common subsequence (LCS) algorithm; the implementation details are not repeated here.
The extracted feature words include the following information:
(1) text: the feature word text, which is a substring of the candidate word;
(2) sPosBegin: the start position in the user input word that corresponds to the feature word text;
(3) sPosEnd: the end position in the user input word that corresponds to the feature word text;
(4) score: the feature word similarity score.
The start position is the position, in the user input word, of the first character that also appears in the feature word text; the end position is the position, in the user input word, of the last character that also appears in the feature word text. The feature word similarity score reflects the similarity between the user input word and the candidate word: the feature word captures the characters that the user input word and the candidate word have in common, and the score is calculated from their similarity, for example with a prior-art edit distance algorithm or weighted edit distance algorithm.
For example, assume that the user input word is "jet hero" (射英雄, three characters) and the candidate word is "jet carving hero transmission" (射雕英雄传). The extracted feature word includes the following information:
text: jet carving hero;
sPosBegin: 0;
sPosEnd: 2;
score: 2.6.
Because the character "jet" (射) in the feature word "jet carving hero" is the same as the first character of the user input word "jet hero", sPosBegin is 0 (positions are counted from 0); because the character "male" (雄) in the feature word is the same as the third character of the user input word, sPosEnd is 2. The feature word similarity score reflects the similarity between the user input word "jet hero" and the candidate word "jet carving hero transmission". Taking the LCS algorithm as an example, the extracted common subsequence is "jet hero"; compared with the candidate word, the characters "carving" (雕) and "transmission" (传) are missing. Under the scoring rule, characters missing at the head or tail are not penalized, each character missing in the middle costs 0.4 points, and each character that appears in both the user input word and the candidate word earns 1 point. The score is therefore 1 × 3 - 0.4 × 1 = 2.6.
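The extraction and scoring in this example can be reproduced with a short sketch. It assumes an LCS alignment and the scoring rule stated in the example (1 point per shared character, 0.4 points off per character missing in the middle, nothing off at the head or tail). The `FeatureWord` class and the helper names are hypothetical, and other similarity measures, such as a weighted edit distance, could be substituted.

```python
from dataclasses import dataclass


@dataclass
class FeatureWord:
    text: str          # substring of the candidate word
    s_pos_begin: int   # position in the user input of the first matched character
    s_pos_end: int     # position in the user input of the last matched character
    score: float       # similarity score


def lcs_pairs(user_input, candidate):
    """Return (input_index, candidate_index) pairs of one longest common subsequence."""
    rows, cols = len(user_input), len(candidate)
    # table[i][j] = LCS length of user_input[i:] and candidate[j:]
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if user_input[i] == candidate[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    pairs, i, j = [], 0, 0
    while i < rows and j < cols:
        if user_input[i] == candidate[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def extract_feature(user_input, candidate, missing_penalty=0.4):
    """Build a FeatureWord from the LCS alignment, or None if nothing matches."""
    pairs = lcs_pairs(user_input, candidate)
    if not pairs:
        return None
    (first_in, first_cand), (last_in, last_cand) = pairs[0], pairs[-1]
    text = candidate[first_cand:last_cand + 1]
    missing_inside = len(text) - len(pairs)  # candidate characters skipped mid-match
    score = len(pairs) * 1.0 - missing_penalty * missing_inside
    return FeatureWord(text, first_in, last_in, round(score, 2))


print(extract_feature("射英雄", "射雕英雄传"))
# FeatureWord(text='射雕英雄', s_pos_begin=0, s_pos_end=2, score=2.6)
```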
Step 103, constructing a feature word tree from the feature words and obtaining a composite feature word set from the feature word tree.
In principle, composite feature words could be obtained by enumerating permutations and combinations of the feature words. However, when there are many feature words, exhausting all possible combinations may generate millions of composite feature words, which seriously degrades the efficiency of the error correction method. The embodiment of the invention therefore generates composite feature words by constructing a feature word tree, which greatly reduces the number of composite feature words and improves the efficiency of the error correction method.
Specifically, when the feature word tree is created, a tree node TreeItem is created for each two-tuple (sPosBegin, sPosEnd) of the feature words, and feature words with the same two-tuple (sPosBegin, sPosEnd) are recorded in the same tree node TreeItem.
Parent-child relations between TreeItem nodes are established from sPosBegin and sPosEnd by the following rule: if sPosBegin of a first tree node TreeItem is greater than sPosEnd of a second tree node TreeItem, the first tree node is a child of the second tree node. A tree node without child nodes is a leaf node.
After construction, the feature word tree is a tree structure running from a root node through the tree nodes to the leaf nodes. The composite feature word set is obtained from it as follows: the tree is traversed depth-first to generate paths; each node on a path holds a feature word list (one or more feature words); and the Cartesian product of the feature word lists along each path yields the set of composite feature words.
For example, if the candidate words are ABZC, XBCD, CD and DE and the user input word is ABCDE, the extracted feature words with their (sPosBegin, sPosEnd) are: ABZC (0, 2), BCD (1, 3), CD (2, 3) and DE (3, 4).
The created feature word tree is:
Root
|---ABZC(0,2)
|   |---DE(3,4)
|---BCD(1,3)
|---CD(2,3)
Here sPosBegin of the feature word DE (3, 4) is 3 and sPosEnd of the feature word ABZC (0, 2) is 2; since 3 is greater than 2, the node for DE (3, 4) is a child of the node for ABZC (0, 2).
According to the feature word tree, the obtained composite feature words are ABZCDE, BCD and CD.
Further, assume that in the above example the candidate words also include ABDC, ABEC and ABFC, so that the extracted feature words also include ABDC (0, 2), ABEC (0, 2) and ABFC (0, 2). The two-tuple (sPosBegin, sPosEnd) of these three feature words is (0, 2), so when the feature word tree is created they are stored under the same node as the feature word ABZC (0, 2). On the corresponding path, the three feature words are combined with the feature word DE by Cartesian product, so ABDCDE, ABECDE and ABFCDE are also included in the final composite feature word set.
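The ABCDE example can be reproduced with a minimal sketch of the node grouping, the parent-child rule, the depth-first traversal and the Cartesian product. The function names are hypothetical. The sketch also assumes that the top-level nodes are those with no possible parent, and that a node qualifying as a child of several nodes is visited once under each of them; the patent text does not spell out either point.

```python
from collections import defaultdict
from itertools import product


def build_nodes(features):
    """Group feature texts by (sPosBegin, sPosEnd); each group is one tree node."""
    nodes = defaultdict(list)
    for text, begin, end in features:
        nodes[(begin, end)].append(text)
    return dict(nodes)


def children_of(span, nodes):
    """A node is a child of `span` when its start position exceeds span's end."""
    return sorted(other for other in nodes if other[0] > span[1])


def root_to_leaf_paths(nodes):
    """Depth-first traversal from the root; yields each path as a list of spans."""
    has_parent = {child for span in nodes for child in children_of(span, nodes)}
    top_level = sorted(span for span in nodes if span not in has_parent)

    def walk(span, path):
        path = path + [span]
        kids = children_of(span, nodes)
        if not kids:          # leaf node: the path is complete
            yield path
        for kid in kids:
            yield from walk(kid, path)

    for span in top_level:
        yield from walk(span, [])


def composite_feature_words(features):
    """Cartesian product of the feature word lists along every root-to-leaf path."""
    nodes = build_nodes(features)
    result = []
    for path in root_to_leaf_paths(nodes):
        for combo in product(*(nodes[span] for span in path)):
            result.append("".join(combo))
    return result


features = [("ABZC", 0, 2), ("BCD", 1, 3), ("CD", 2, 3), ("DE", 3, 4)]
print(composite_feature_words(features))  # ['ABZCDE', 'BCD', 'CD']
```

Adding ABDC, ABEC and ABFC with the span (0, 2) to `features` places them in the same node as ABZC, and the product with DE then also yields ABDCDE, ABECDE and ABFCDE.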
Step 104, obtaining an error correction result for the user input word according to the score ranking of the composite feature words in the composite feature word set.
After the composite feature word set is obtained from the feature word tree, all composite feature words in the set are ranked by score, and the top-ranked composite feature word is taken as the error correction result for the user input word.
Composite feature words can be scored in various ways, such as accuracy scoring and similarity scoring. The embodiments of the present invention do not limit the specific scoring method.
In the word error correction method provided by the embodiment of the invention, a feature word tree is created from the feature words, the composite feature word set is derived from the feature word tree, and the error correction result for the user input word is then determined from the score ranking of the composite feature words in the set, so that combined words are corrected automatically and with high execution efficiency.
Based on any of the foregoing embodiments, in an embodiment of the present invention, the obtaining a composite feature word set according to the feature word tree further includes:
setting a mark on each tree node according to the content of the feature words stored in that tree node, and deleting from the feature word tree, according to the marks of the tree nodes, the paths that do not satisfy a preset rule.
Once the feature word tree has been built from the feature words, preset rules may apply to it, and not every path in the tree necessarily satisfies them. Deleting the paths that violate the preset rules reduces the computation in the subsequent steps.
For example, a preset rule may specify that a composite feature word cannot contain two movie names. Under this rule, each tree node of the feature word tree records with a specific mark whether the feature words it stores are movie names. If the marks of two tree nodes on a path are both true, the path is judged invalid and is deleted.
By filtering the feature word tree, the word error correction method provided by the embodiment of the invention reduces the number of paths in the tree, which reduces the computation in the subsequent steps and improves efficiency.
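The movie-name rule can be sketched as a filter over the generated paths. The representation of a path as a list of (sPosBegin, sPosEnd) spans, the `is_movie_name` lookup and the example flag values are all hypothetical, chosen only to illustrate mark-based pruning.

```python
def prune_paths(paths, is_movie_name, max_movie_nodes=1):
    """Keep only the paths with at most `max_movie_nodes` nodes marked as movie names."""
    kept = []
    for path in paths:
        flagged = sum(1 for node in path if is_movie_name[node])
        if flagged <= max_movie_nodes:
            kept.append(path)
    return kept


# Hypothetical marks keyed by node span (sPosBegin, sPosEnd).
is_movie_name = {(0, 3): True, (4, 6): False, (4, 7): True}
paths = [[(0, 3), (4, 6)], [(0, 3), (4, 7)]]
print(prune_paths(paths, is_movie_name))  # [[(0, 3), (4, 6)]]
```

Pruning before the Cartesian product is taken avoids generating composite feature words that would be rejected anyway.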
Based on any one of the foregoing embodiments, in an embodiment of the present invention, the obtaining, according to the score ranking result of the compound feature words in the compound feature word set, an error correction result of the user input word includes:
scoring the accuracy of the composite feature words;
comparing the accuracy score of each composite feature word with a preset accuracy threshold, and removing the composite feature words whose accuracy scores do not meet the accuracy threshold;
and ranking the composite feature words that meet the accuracy threshold by accuracy score, and taking the composite feature word ranked first by accuracy as the error correction result for the user input word.
The accuracy score is used to evaluate the accuracy of the compound feature word.
The accuracy scoring rule may be determined by business logic, and in the embodiment of the present invention, the accuracy scoring method for the compound feature word is as follows:
Wherein p represents the accuracy score of the compound feature words, a i represents the similarity score of the ith feature word in the compound feature words, n represents the number of feature words contained in the compound feature words, b represents the multiple word weight, d represents the number of multiple words in the compound feature words, c represents the few word weight, and s represents the number of few words in the compound feature words. Multiple words refer to words that are not in the compound feature words and that are not in the user input words, and few words refer to words that are not in the compound feature words and that are not in the user input words.
For example, in one example, the user input word is "large jet hero yellow somewhere", and the corresponding compound feature word is "jet carving hero yellow somewhere". Wherein the large word is a multiple word and the carving word is a few word. Let the multi-word weight be 0.6 and the few-word weight be 0.5.
Similarity score for known "jet carving hero" =2.5, similarity score for known "yellow somewhere" =3; substitution formula:
the accuracy score of the composite feature word "jet carving hero yellow somehow" = 2.5+3-0.6 x1 = 4.9.
In this example, in calculating the accuracy score, there is no score for the few words "carved" because it has been scored within the similarity of jet carved hero and is not repeatable. When calculating the accuracy score, whether the few characters need to be separated or not can be determined according to whether the compound feature words (the jet carving hero and the yellow hero) are separated or not (sPosBegin, sPosEnd). Specifically, when few words appear at the head and tail of the feature word, they are not detained. For example, the correct word is "shoot carved hero pass", the user input word is "carved hero", the absence of "shoot" of the feature word head and "pass" of the feature word tail, no withhold. The few characters appear in the characteristic words to be deducted, for example, the correct word is 'jet carving hero transmission', the user inputs the word to be 'jet hero', and the missing 'carving' characters are to be deducted.
After the accuracy score of each compound feature word is obtained, each score is compared with a preset accuracy threshold, and the compound feature words whose accuracy scores do not meet the accuracy threshold are removed. This avoids the situation in which correction makes the user input word more wrong rather than more correct.
Finally, the compound feature words that meet the accuracy threshold are sorted by accuracy score to obtain the corresponding ranking result, and the compound feature word ranked first by accuracy is taken as the error correction result of the user input word.
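The filtering and ranking step might look like the following sketch. It assumes that "meeting" the threshold means a score greater than or equal to it, and that no correction is returned when every compound feature word is filtered out; both are assumptions, and `select_correction` is a hypothetical name.

```python
def select_correction(scored_words, threshold):
    """Pick the best compound feature word from (text, accuracy) pairs.

    Words scoring below the threshold are dropped; None is returned when
    nothing survives, meaning the input is left uncorrected.
    """
    kept = [(word, score) for word, score in scored_words
            if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[0][0] if kept else None


best = select_correction([("a", 4.9), ("b", 3.1), ("c", 1.0)], threshold=3.0)
```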
The word error correction method provided by the embodiment of the invention selects the composite feature words as error correction results of the words input by the user according to the accuracy scores of the composite feature words, and has the advantage of good error correction effect.
Based on any one of the foregoing embodiments, in an embodiment of the present invention, the obtaining the error correction result of the user input word according to the score ranking result of the compound feature words in the compound feature word set further includes:
When a plurality of compound feature words are ranked first by accuracy, performing a secondary ranking of those compound feature words according to their similarity scores, and taking the compound feature word ranked first by similarity as the error correction result of the user input word; wherein the similarity score of a compound feature word is obtained from the similarity scores of the feature words that form it.
Specifically, if the accuracy scores of more than one compound feature word are the same, the compound feature words with the same accuracy scores are ranked according to the similarity scores, and the compound feature word with the first similarity ranking is used as the error correction result of the user input word.
The similarity score for a compound feature word may be derived from a sum of feature word similarity scores for all of the feature words comprising the compound feature word. The calculation of the similarity scores for the individual feature words may be performed using prior art techniques, as described in connection with step 102 above.
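The secondary ranking can be expressed as a sort on a two-part key. The sketch below assumes each candidate is a tuple of (text, accuracy score, list of feature word similarity scores) and uses the sum of the feature word similarity scores as the compound similarity score, as described above. The tuple layout and the name `rank_compound_words` are hypothetical.

```python
def rank_compound_words(candidates):
    """Sort by accuracy first, then by summed feature word similarity."""
    return sorted(candidates,
                  key=lambda c: (c[1], sum(c[2])),
                  reverse=True)


ranked = rank_compound_words([
    ("x", 4.9, [2.5, 3.0]),
    ("y", 4.9, [3.0, 3.0]),
    ("z", 4.0, [5.0, 5.0]),
])
```

Here "x" and "y" tie on accuracy, so the higher summed similarity of "y" places it first, while "z" stays last despite its high similarity.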
According to the word error correction method provided by the embodiment of the invention, on the premise that a plurality of composite feature words have the same accuracy scores, the composite feature words are subjected to secondary sorting according to the similarity, and the error correction result of the user input word is obtained according to the secondary sorting result, so that the word error correction method has the advantage of good error correction effect.
Based on any of the foregoing embodiments, in an embodiment of the present invention, the determining an alternative word corresponding to the user input word includes:
according to the user input words, obtaining alternative words corresponding to the user input words from a hash index table generated based on an error correction dictionary; wherein,
The hash index table generated based on the error correction dictionary is obtained by splitting the correct words in the error correction dictionary into single characters, combining the resulting characters by two-character overlapping or multi-character overlapping to obtain keywords, and generating the hash index table from the keywords.
Error correction dictionaries typically contain confirmed correctly written words. However, the number and the range of words contained in the error correction dictionary are limited, and if the corresponding candidate word is searched for the user input word directly according to the error correction dictionary, the problem of insufficient search result may be caused. Therefore, in the embodiment of the invention, the hash index table generated based on the error correction dictionary is adopted.
The error correction dictionary stores correctly written words. A segmentation operation is performed on each correct word to obtain a list whose basic unit is the "word". In the embodiment of the invention, a "word" in this sense is a single character in Chinese, a word in a foreign language, or a character representing an Arabic numeral or a punctuation mark. For example, the error correction dictionary stores the Chinese word "变形金刚" ("Transformers"), and the result of segmentation is a list containing "变", "形", "金" and "刚". If the object to be segmented is in a foreign language (such as English), the basic unit of segmentation is a foreign word; if the object to be segmented consists of Arabic numerals and punctuation marks, the basic unit of segmentation is a character. As a preferred implementation, the segmentation result includes, in addition to each unit itself, its pronunciation information (such as the pinyin of a Chinese character).
Keywords are then generated from the result of the segmentation operation. Keywords can be generated in several ways; one is 2-word overlapping, in which adjacent units in the segmentation result are combined to form keywords. For example, from the characters "变", "形", "金" and "刚" of "变形金刚" ("Transformers"), the keywords "变形", "形金" and "金刚" may be generated. When the segmentation result contains few units, for example 4 or fewer, keywords can also be generated by skipping one unit. For example, the keywords "变金" and "形刚" may also be generated from "变", "形", "金" and "刚". Besides 2-word overlapping, 3-word overlapping or overlapping of more units can be used. The number of overlapping units affects both the error correction capability and the running efficiency of the method: a larger number reduces the number of generated keywords, which helps running efficiency; a smaller number increases the number of generated keywords, which helps error correction capability. The keyword generation method should therefore be chosen according to the specific situation.
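Keyword generation by overlapping can be sketched as follows. The sample input "变形金刚" is assumed here to be the Chinese original of the "transformers" example. The function name `generate_keywords`, its parameters, and the choice to apply skip-one keywords only in the 2-unit mode are assumptions made for illustration.

```python
def generate_keywords(units, n=2, skip_limit=4):
    """Build keywords from a list of segmented units.

    Adjacent n-unit windows are always produced. For 2-unit overlapping
    on short inputs (at most skip_limit units), keywords that skip one
    unit are added as well.
    """
    keywords = ["".join(units[i:i + n]) for i in range(len(units) - n + 1)]
    if n == 2 and len(units) <= skip_limit:
        keywords += [units[i] + units[i + 2] for i in range(len(units) - 2)]
    return keywords


two_word = generate_keywords(list("变形金刚"))
three_word = generate_keywords(list("变形金刚"), n=3)
```

With 3-unit overlapping the same input yields two keywords instead of five, which illustrates the trade-off between index size and error correction capability described above.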
After the keywords are obtained, a hash index table is built for them. When the hash index table is built, keywords with the same or similar characteristics are assigned to the same index. For example, keywords with the same or similar pronunciation are assigned to the same index. As another example, keywords with the same or similar glyphs are assigned to the same index. As a preferred implementation, the hash index table includes several types of index, such as a pronunciation index and a glyph index. This allows one keyword to be assigned to several indexes at the same time.
After a hash index table generated based on the error correction dictionary is established, the user input words are compared with the hash index table, and candidate words corresponding to the user input words are obtained.
Multiple candidate words can usually be obtained for a single user input word. For example, if the user input word is "transformers", its corresponding candidate words include "transformers" (retrievable via the keywords "gold" and "diamond"), "buddha's warriors" (retrievable via the keyword "diamond"), and "mutans" (retrievable via the keywords "gold" and "diamond").
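The index construction and lookup can be sketched with a Python dictionary, which is itself a hash table. Everything below is illustrative and not taken from the source: the function names, the sample dictionary, and the `SAME_SOUND` table are hypothetical. `SAME_SOUND` is a hand-written stand-in for a real pronunciation normalization such as mapping characters to pinyin.

```python
from collections import defaultdict


def two_char_keywords(word):
    """Adjacent two-character keywords; a short word is its own keyword."""
    if len(word) <= 2:
        return [word]
    return [word[i:i + 2] for i in range(len(word) - 1)]


def build_index(dictionary_words, normalize):
    """Map each normalized keyword to the dictionary words containing it."""
    index = defaultdict(set)
    for word in dictionary_words:
        for keyword in two_char_keywords(word):
            index[normalize(keyword)].add(word)
    return index


def find_candidates(index, user_input, normalize):
    """Collect every dictionary word sharing a normalized keyword."""
    candidates = set()
    for keyword in two_char_keywords(user_input):
        candidates |= index.get(normalize(keyword), set())
    return candidates


# Hypothetical table of characters treated as sounding alike.
SAME_SOUND = {"型": "形"}


def normalize(keyword):
    return "".join(SAME_SOUND.get(ch, ch) for ch in keyword)


index = build_index(["变形金刚", "金刚"], normalize)
candidates = find_candidates(index, "变型金刚", normalize)
```

A second index keyed by glyph similarity could be built the same way with a different `normalize` function, which is one way to let a keyword belong to several indexes at once.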
The word error correction method provided by the embodiment of the invention not only can realize automatic error correction of the combined words, but also does not need to expand an error correction dictionary, and has higher execution efficiency.
Fig. 2 is a schematic diagram of a word error correction device provided by an embodiment of the present invention, where, as shown in fig. 2, the word error correction device provided by the embodiment of the present invention includes:
An alternative word determining module 201, configured to determine an alternative word corresponding to a word input by a user;
A feature word determining module 202, configured to determine a feature word according to the user input word and the candidate word; wherein the feature word is the maximum similar substring of the user input word and the candidate word; the feature words include: the characteristic word text, the position information of the characteristic word text in the user input word;
The compound feature word generating module 203 is configured to construct a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtain a compound feature word set according to the feature word tree; wherein, the compound feature words in the compound feature word set are the combination of the feature words;
And the error correction result generating module 204 is configured to obtain an error correction result of the user input word according to the score ranking result of the compound feature words in the compound feature word set.
The word error correction device provided by the embodiment of the invention realizes automatic error correction of the combined words by creating the feature word tree for the feature words, determining the composite feature word set according to the feature word tree, and then determining the error correction result of the user input words according to the grading and sorting results of a plurality of composite feature words in the set.
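The work of the compound feature word generating module can be sketched as below. This is a simplified reading of the construction rule stated in the claims, not the actual implementation: feature words with the same (start, end) span share a node, a node is a child of another when its start position is greater than the other's end position, root-to-leaf paths are enumerated depth-first from a virtual root, and the Cartesian product is taken along each path. All names are hypothetical, and the pruning of paths by node marks is omitted.

```python
from collections import defaultdict
from itertools import product


def build_nodes(feature_words):
    """Group (text, begin, end) feature words by their (begin, end) span."""
    nodes = defaultdict(list)
    for text, begin, end in feature_words:
        nodes[(begin, end)].append(text)
    return dict(nodes)


def root_to_leaf_paths(nodes):
    """Depth-first enumeration of span paths from a virtual root."""
    spans = sorted(nodes)
    paths = []

    def dfs(path, last_end):
        children = [s for s in spans if s[0] > last_end]
        if not children:
            if path:
                paths.append(list(path))
            return
        for span in children:
            path.append(span)
            dfs(path, span[1])
            path.pop()

    dfs([], -1)  # the virtual root ends before position 0
    return paths


def compound_feature_words(feature_words):
    """Cartesian product of the feature words along every path."""
    nodes = build_nodes(feature_words)
    result = []
    for path in root_to_leaf_paths(nodes):
        result.extend(product(*(nodes[span] for span in path)))
    return result


compounds = compound_feature_words(
    [("AB", 0, 1), ("AX", 0, 1), ("CD", 2, 3)]
)
```

In this sample the two feature words sharing span (0, 1) each combine with "CD", and "CD" alone also forms a path because its node is a child of the virtual root.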
Fig. 3 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention, where, as shown in fig. 3, the electronic device may include: processor 310, communication interface (Communications Interface) 320, memory 330 and communication bus 340, wherein processor 310, communication interface 320 and memory 330 communicate with each other via communication bus 340. The processor 310 may call logic instructions in the memory 330 to perform the following method: determining alternative words corresponding to the user input words; wherein the user input words are combined words; determining feature words according to the user input words and the alternative words; wherein the feature word is the maximum similar substring of the user input word and the candidate word; the feature words include: the characteristic word text, the position information of the characteristic word text in the user input word; constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree; wherein, the compound feature words in the compound feature word set are the combination of the feature words; and obtaining an error correction result of the user input word according to the grading and sorting result of the compound feature words in the compound feature word set.
It should be noted that, in this embodiment, the electronic device may be a server, a PC, or other devices in the specific implementation, so long as the structure of the electronic device includes the processor 310, the communication interface 320, the memory 330, and the communication bus 340 as shown in fig. 3, where the processor 310, the communication interface 320, and the memory 330 perform communication with each other through the communication bus 340, and the processor 310 may call logic instructions in the memory 330 to execute the above method. The embodiment does not limit a specific implementation form of the electronic device.
Further, the logic instructions in the memory 330 described above may be implemented in the form of software functional units and, when sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Further, embodiments of the present invention disclose a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the methods provided by the above-described method embodiments, for example comprising: determining alternative words corresponding to the user input words; wherein the user input words are combined words; determining feature words according to the user input words and the alternative words; wherein the feature word is the maximum similar substring of the user input word and the candidate word; the feature words include: the characteristic word text, the position information of the characteristic word text in the user input word; constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree; wherein, the compound feature words in the compound feature word set are the combination of the feature words; and obtaining an error correction result of the user input word according to the grading and sorting result of the compound feature words in the compound feature word set.
In another aspect, embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor is implemented to perform the method provided in the above embodiments, for example, including: determining alternative words corresponding to the user input words; wherein the user input words are combined words; determining feature words according to the user input words and the alternative words; wherein the feature word is the maximum similar substring of the user input word and the candidate word; the feature words include: the characteristic word text, the position information of the characteristic word text in the user input word; constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree; wherein, the compound feature words in the compound feature word set are the combination of the feature words; and obtaining an error correction result of the user input word according to the grading and sorting result of the compound feature words in the compound feature word set.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of word correction, comprising:
determining alternative words corresponding to the user input words; wherein the user input words are combined words;
determining feature words according to the user input words and the alternative words; wherein the feature word is the maximum similar substring of the user input word and the candidate word; the feature words include: the feature word text and the position information of the feature word text in the user input word;
constructing a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and obtaining a composite feature word set according to the feature word tree; wherein, the compound feature words in the compound feature word set are the combination of the feature words;
obtaining an error correction result of the user input word according to the score ranking result of the compound feature words in the compound feature word set;
wherein constructing the feature word tree according to the feature word text and the position information of the feature word text in the user input word comprises:
creating tree nodes according to the position information of the feature word text in the user input word, and storing the feature words whose feature word text has the same position information in the user input word under the same tree node; wherein the position information of the feature word text in the user input word comprises: start position information and end position information;
if the start position information of a first tree node is greater than the end position information of a second tree node, the first tree node is a child node of the second tree node;
wherein obtaining the compound feature word set according to the feature word tree comprises:
performing a depth-first traversal of the feature word tree to generate paths;
and, for each path, taking the Cartesian product of the feature words contained in the tree nodes of the path, in the position order of the tree nodes in the path, to obtain the compound feature word set.
2. The word correction method of claim 1, wherein the obtaining a composite feature word set from the feature word tree further comprises:
setting a mark for the tree node according to the content of the feature words stored by the tree node in the feature word tree, and deleting paths which do not meet the preset rule in the feature word tree according to the mark of the tree node.
3. The word error correction method according to claim 1, wherein the step of obtaining the error correction result of the user input word according to the score ranking result of the compound feature words in the compound feature word set includes:
scoring the accuracy of the compound feature words;
comparing the accuracy score of each compound feature word with a preset accuracy threshold, and removing the compound feature words whose accuracy scores do not meet the accuracy threshold;
sorting the compound feature words that meet the accuracy threshold according to their accuracy scores, and taking the compound feature word ranked first by accuracy as the error correction result of the user input word.
4. The word error correction method according to claim 3, wherein the step of obtaining the error correction result of the user input word according to the score ranking result of the compound feature words in the compound feature word set further comprises:
When a plurality of compound feature words with the first accuracy are ranked, performing secondary ranking on the compound feature words according to similarity scores, wherein the compound feature words with the first similarity ranking are used as error correction results of words input by a user; the similarity score of the composite feature words is obtained according to the similarity scores of the feature words forming the composite feature words.
5. The word correction method of claim 1, wherein the determining the candidate word corresponding to the user input word comprises:
according to the user input words, obtaining alternative words corresponding to the user input words from a hash index table generated based on an error correction dictionary; wherein,
The hash index table generated based on the error correction dictionary is a hash index table generated according to the key words by dividing correct words in the error correction dictionary into single words and then performing two-word overlapping or multi-word overlapping on the single words obtained by dividing to obtain the key words.
6. A word error correction apparatus, comprising:
a candidate word determining module, configured to determine candidate words corresponding to the user input word; wherein the user input word is a combined word;
a feature word determining module, configured to determine feature words according to the user input word and the candidate words; wherein the feature word is the maximum similar substring of the user input word and the candidate word; the feature words include: the feature word text and the position information of the feature word text in the user input word;
a compound feature word generating module, configured to construct a feature word tree according to the feature word text and the position information of the feature word text in the user input word, and to obtain a compound feature word set according to the feature word tree; wherein the compound feature words in the compound feature word set are combinations of the feature words;
an error correction result generating module, configured to obtain an error correction result of the user input word according to the score ranking result of the compound feature words in the compound feature word set;
wherein constructing the feature word tree according to the feature word text and the position information of the feature word text in the user input word comprises:
creating tree nodes according to the position information of the feature word text in the user input word, and storing the feature words whose feature word text has the same position information in the user input word under the same tree node; wherein the position information of the feature word text in the user input word comprises: start position information and end position information;
if the start position information of a first tree node is greater than the end position information of a second tree node, the first tree node is a child node of the second tree node;
wherein obtaining the compound feature word set according to the feature word tree comprises:
performing a depth-first traversal of the feature word tree to generate paths;
and, for each path, taking the Cartesian product of the feature words contained in the tree nodes of the path, in the position order of the tree nodes in the path, to obtain the compound feature word set.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the word error correction method of any one of claims 1 to 5 when the program is executed.
8. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the word error correction method according to any of claims 1 to 5.
CN202010675640.8A 2020-07-14 2020-07-14 Word error correction method, device, electronic equipment and storage medium Active CN112001168B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010675640.8A CN112001168B (en) 2020-07-14 2020-07-14 Word error correction method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112001168A CN112001168A (en) 2020-11-27
CN112001168B true CN112001168B (en) 2024-05-03

Family

ID=73466942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010675640.8A Active CN112001168B (en) 2020-07-14 2020-07-14 Word error correction method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112001168B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH01316863A (en) * 1988-06-17 1989-12-21 Nippon Telegr & Teleph Corp <Ntt> Automatic qualifying and correcting device for error in japanese language text
CN105975625A (en) * 2016-05-26 2016-09-28 同方知网数字出版技术股份有限公司 Chinglish inquiring correcting method and system oriented to English search engine
CN107045496A (en) * 2017-04-19 2017-08-15 畅捷通信息技术股份有限公司 The error correction method and error correction device of text after speech recognition
CN108595437A (en) * 2018-05-04 2018-09-28 和美(深圳)信息技术股份有限公司 Text query error correction method, device, computer equipment and storage medium
CN110362824A (en) * 2019-06-24 2019-10-22 广州多益网络股份有限公司 A kind of method, apparatus of automatic error-correcting, terminal device and storage medium
CN110795617A (en) * 2019-08-12 2020-02-14 腾讯科技(深圳)有限公司 Error correction method and related device for search terms

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200082808A1 (en) * 2018-09-12 2020-03-12 Kika Tech (Cayman) Holdings Co., Limited Speech recognition error correction method and apparatus

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
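Research on automatic proofreading of Chinese real-word errors based on a combination of local context features; Liu Liangliang; Cao Cungen; Computer Science; Vol. 43, No. 12; pp. 30-35 *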

Also Published As

Publication number Publication date
CN112001168A (en) 2020-11-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant