CN110991166B - Chinese wrongly-written character recognition method and system based on pattern matching

Info

Publication number
CN110991166B
CN110991166B CN201911219533.8A CN201911219533A CN110991166B CN 110991166 B CN110991166 B CN 110991166B CN 201911219533 A CN201911219533 A CN 201911219533A CN 110991166 B CN110991166 B CN 110991166B
Authority
CN
China
Prior art keywords
word
wrongly
character recognition
matching
sentence
Legal status
Active
Application number
CN201911219533.8A
Other languages
Chinese (zh)
Other versions
CN110991166A (en)
Inventor
曹馨宇
王海涛
刘亮亮
付雪
赵静
张帆
赵超
吴刚
丁文兴
周长青
Current Assignee
China National Institute of Standardization
Original Assignee
China National Institute of Standardization
Application filed by China National Institute of Standardization
Priority to CN201911219533.8A
Publication of CN110991166A
Application granted
Publication of CN110991166B

Landscapes

  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a wrongly written character recognition method based on pattern matching, comprising the following steps: S1, defining a wrongly written character recognition pattern according to the structural characteristics of a language; S2, establishing an index of the wrongly written character recognition patterns based on a graph storage structure; and S3, automatically checking and correcting errors in the text to be checked through the index of the wrongly written character recognition patterns. By defining wrongly written character recognition patterns, the method combines grammatical restrictions with conditional-function collocation to recognize wrongly written characters, can effectively target errors that violate local or long-distance grammatical constraints, and has good accuracy. The system defines the recognition patterns and builds the index by means of a program, and uses the pattern index to automatically check and correct errors in the text to be checked. The index is built on the flexibility of the graph storage structure, which supports breadth-first and depth-first search of the data, so that a complete database (matching library) is constructed and the accuracy of wrongly written character recognition is improved.

Description

Chinese wrongly-written character recognition method and system based on pattern matching
Technical Field
The invention relates to the technical field of natural language processing in artificial intelligence, and in particular to a Chinese wrongly written character recognition method and system based on pattern matching.
Background
Automatic proofreading of Chinese text is one of the main applications of natural language processing, and is also a difficult problem in natural language understanding. With the advent of the big data age, errors in Chinese texts are increasing. Some wrongly written characters can be found and automatically corrected effectively by statistical and machine learning methods, but others arise from violations of local or long-distance grammatical or semantic constraints. Such errors are difficult to find and correct from context alone, and handling them requires grammatical rules and semantic collocation. For example, common words such as "that" and "where", or "and" and "ground", are often confused and frequently misused. General automatic proofreading methods either fail to find these errors or have a particularly high false-correction rate, and a single context or collocation check is not sufficient to decide whether an error has occurred.
In view of the above, the present invention is particularly proposed.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a Chinese wrongly-written character recognition method and system based on pattern matching, and the recognition accuracy is improved.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A wrongly written character recognition method based on pattern matching comprises the following steps:
S1, defining a wrongly written character recognition pattern according to the structural characteristics of a language;
S2, establishing an index of the wrongly written character recognition patterns based on a graph storage structure;
S3, automatically checking and correcting errors in the text to be checked through the index of the wrongly written character recognition patterns.
Further, in the above method, step S1 establishes the wrongly written character recognition pattern according to the grammatical structure and semantic restriction features of Chinese, including: setting recognition matching conditions and associating semantic operations with them as recognition rules, thereby forming the wrongly written character recognition pattern.
Further, in the above method, the recognition matching condition in step S11 is formed by combining restriction functions; the restriction functions include:
NOTCONTAIN (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked contains the target word "W" or the word class "WORDCLASS1"; returns TRUE if it does not, otherwise FALSE;
NOTENDWITH (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked ends with the target word "W" or the word class "WORDCLASS1"; returns TRUE if it does not, otherwise FALSE;
MATCHED (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked matches the target word "W" or the word class "WORDCLASS1"; returns TRUE if the match succeeds, otherwise FALSE;
the restriction functions are combined by connectors.
Further, in the method for identifying wrongly written words based on pattern matching, the semantic operation includes:
OK (<target word>): indicates that the target word is correct if the sentence to be checked satisfies the recognition matching condition;
MARK (<target word>): indicates that the target word may be wrong, and marks it, if the sentence to be checked satisfies the recognition matching condition;
REWRITE (<target word>, <correct word>): indicates that, if the sentence to be checked satisfies the recognition matching condition, the target word is wrong (it contains a wrongly written character) and is automatically replaced by the corresponding correct word.
Further, in the above method, establishing the index of the wrongly written character recognition patterns based on the graph storage structure in step S2 includes:
S21, defining a graph structure in code;
S22, defining the parameters of the graph structure in code.
Further, in the above method, automatically checking and correcting errors in the text to be checked through the index of the wrongly written character recognition patterns in step S3 includes:
S31, segmenting the sentence to be checked into words and marking the word at each position;
S32, scanning the words of the sentence to be checked in sequence; if the end of the sentence is reached, exiting the check; otherwise going to S33;
S33, matching the words of the sentence to be checked against the wrongly written character recognition pattern index, and, if a match succeeds, putting the matching result into a temporary array;
S34, intersecting the results in the temporary array, judging whether the number of successfully matched elements equals the length of the matching rule, and putting the index numbers of the rules for which it does into the final array;
S35, traversing each rule in the final array in turn and judging whether the order of the matched elements is consistent with the rule; if so, the match succeeds;
S36, after a successful match, executing the semantic operation given by the consequent of the wrongly written character recognition pattern;
S37, outputting the checking result and finishing the check of the current sentence.
In another aspect, the present invention further relates to a system for identifying wrongly written words based on pattern matching, which includes a processor and a memory, wherein the memory stores a program, and when the program is executed by the processor, the program performs the following steps:
D1. defining a wrongly written character recognition pattern according to the structural characteristics of the language;
D2. establishing an index of the wrongly written character recognition patterns based on a graph storage structure;
D3. automatically checking and correcting errors in the text to be checked through the established index structure.
Further, in the above system, step D1 establishes the wrongly written character recognition pattern according to the grammatical structure and semantic restriction features of Chinese, including: setting recognition matching conditions and associating semantic operations with them as recognition rules, thereby forming the wrongly written character recognition pattern.
Further, in the above system, the recognition matching condition is formed by combining restriction functions; the restriction functions include:
NOTCONTAIN (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked contains the target word "W" or the word class "WORDCLASS1"; returns TRUE if it does not, otherwise FALSE;
NOTENDWITH (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked ends with the target word "W" or the word class "WORDCLASS1"; returns TRUE if it does not, otherwise FALSE;
MATCHED (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked matches the target word "W" or the word class "WORDCLASS1"; returns TRUE if the match succeeds, otherwise FALSE;
the restriction functions are combined by connectors.
Further, in the above system for recognizing wrongly written words based on pattern matching, the setting of the recognition matching condition and the semantic operation may include:
OK (<target word>): indicates that the target word is correct if the sentence to be checked satisfies the recognition matching condition;
MARK (<target word>): indicates that the target word may be wrong, and marks it, if the sentence to be checked satisfies the recognition matching condition;
REWRITE (<target word>, <correct word>): indicates that, if the sentence to be checked satisfies the recognition matching condition, the target word is wrong (it contains a wrongly written character) and is automatically replaced by the corresponding correct word.
Compared with the prior art, the invention has the beneficial effects that:
By defining wrongly written character recognition patterns, the method of the invention combines grammatical restrictions with conditional-function collocation and applies them to wrongly written character recognition; it can effectively target errors that violate local or long-distance grammatical constraints, and has good accuracy and a certain practicability. The system of the invention implements the method: it defines the recognition patterns and builds the index by means of a program, and uses the pattern index to automatically check and correct errors in the text to be checked. The index is built on the flexibility of the graph storage structure, which supports breadth-first and depth-first search of the data, so that a complete database (matching library) is constructed and the accuracy of wrongly written character recognition is improved.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a flow chart of an embodiment of the wrongly written character recognition method based on pattern matching according to the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby.
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
Example 1
As shown in FIG. 1, a wrongly written character recognition method based on pattern matching includes the steps of:
S1, defining a wrongly written character recognition pattern according to the structural characteristics of a language;
S2, establishing an index of the wrongly written character recognition patterns based on a graph storage structure;
S3, automatically checking and correcting errors in the text to be checked through the index of the wrongly written character recognition patterns.
The method is particularly suitable for recognizing wrongly written characters in Chinese text. It defines wrongly written character recognition patterns using features of Chinese such as syntactic structure and semantic restrictions, combines syntactic structures and conditional restrictions within a pattern, and matches the defined patterns against the text of the sentence to be checked in order to find and correct errors.
Specifically, in one embodiment provided by the invention, step S1 establishes the wrongly written character recognition pattern according to the grammatical structure and semantic restriction features of Chinese, which specifically includes: setting recognition matching conditions and associating semantic operations with them as recognition rules, thereby forming the wrongly written character recognition pattern.
In this embodiment, a wrongly written character recognition pattern serves as a recognition rule. Its structure comprises a recognition matching condition and a semantic operation associated with that condition, so that the semantic operation is applied to sentences that satisfy the condition; the recognition matching condition expresses grammatical structure and semantic restriction features by means of conditional functions (restriction functions). In the example given by the invention, the patterns have the following structure:
Rule 1: NOTCONTAIN (S, <target word>) & NOTENDWITH (S, <!punctuation>) & MATCHED (S, <!certain type 1> <!certain type 2>) → OK (<target word>);
Rule 2: NOTCONTAIN (S, <word class 1>) & NOTENDWITH (S, <target word> <!word class 2>) → MARK (<target word>);
Rule 3: NOTCONTAIN (S, <word class 1>) & MATCHED (S, <target word> <!word class 2>) → REWRITE (<target word>, <correct word>).
The wildcards used in the above patterns have their conventional meanings: for example, "+" indicates that any number of characters may intervene, and "&" means "and". The symbol "→" indicates that the matching condition before it is associated with the semantic operation after it.
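As an illustration only, the rule notation above can be split into its matching conditions and its semantic operation with a few lines of C++. The sketch below is not the implementation described by the invention: the names ParsedRule, trim, parseRule, and ruleLength are hypothetical, and a rule is assumed to be a single line of text in which "&" joins the conditions and "→" precedes the semantic operation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for one parsed rule (names are illustrative only).
struct ParsedRule {
    std::vector<std::string> conditions;  // restriction-function calls joined by "&"
    std::string action;                   // semantic operation that follows the arrow
};

// Remove leading and trailing spaces and tabs.
static std::string trim(const std::string& s) {
    const auto first = s.find_first_not_of(" \t");
    if (first == std::string::npos) return "";
    const auto last = s.find_last_not_of(" \t");
    return s.substr(first, last - first + 1);
}

// Split "C1 & C2 & ... → ACTION" into its conditions and its action.
// Returns an empty ParsedRule when the arrow is missing.
ParsedRule parseRule(const std::string& text) {
    const std::string arrow = "→";  // assumed to be UTF-8 encoded, like the input
    ParsedRule rule;
    const auto arrowPos = text.find(arrow);
    if (arrowPos == std::string::npos) return rule;
    rule.action = trim(text.substr(arrowPos + arrow.size()));
    const std::string antecedent = text.substr(0, arrowPos);
    std::size_t start = 0;
    while (true) {
        const auto amp = antecedent.find('&', start);
        if (amp == std::string::npos) {
            rule.conditions.push_back(trim(antecedent.substr(start)));
            break;
        }
        rule.conditions.push_back(trim(antecedent.substr(start, amp - start)));
        start = amp + 1;
    }
    return rule;
}

// Rule length: the number of "&" separators plus one.
std::size_t ruleLength(const ParsedRule& rule) { return rule.conditions.size(); }
```

Applied to a rule with two "&" connectors, such as Rule 1, parseRule yields three conditions, so ruleLength returns 3.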
The restriction functions in the above patterns are defined as follows:
NOTCONTAIN (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked contains the target word "W" or the word class "WORDCLASS1"; returns TRUE if it does not, otherwise FALSE;
NOTENDWITH (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked ends with the target word "W" or the word class "WORDCLASS1"; returns TRUE if it does not, otherwise FALSE;
MATCHED (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked matches the target word "W" or the word class "WORDCLASS1"; returns TRUE if the match succeeds, otherwise FALSE.
In these functions, "S" denotes the sentence to be checked, "W" denotes the target word, and "WORDCLASS" denotes a word class (part of speech).
Chinese parts of speech include nouns, verbs, adjectives, numerals, measure words, pronouns, distinguishing words, adverbs, prepositions, conjunctions, auxiliary words, interjections, modal particles, and onomatopoeia. In this embodiment, a word class is defined as <!WORDCLASS1> = <W1|W2|...|Wn>, where each Wi is a specific word or phrase.
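The three restriction functions can be sketched over a segmented sentence as follows. This is a minimal illustration under stated assumptions, not the invention's code: a word class is modelled as a set of words (a single target word is a one-element set), the names Tokens, WordClass, notContain, notEndWith, and matched are hypothetical, and MATCHED is assumed here to mean "the target word is immediately followed by a member of the class", which the text does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;  // a segmented sentence W1 ... Wn
using WordClass = std::set<std::string>;  // <!WORDCLASS> = <W1|W2|...|Wn>

// TRUE when no word of the sentence belongs to the word class.
bool notContain(const Tokens& s, const WordClass& cls) {
    for (const auto& w : s)
        if (cls.count(w)) return false;
    return true;
}

// TRUE when the sentence does not end with a member of the word class.
bool notEndWith(const Tokens& s, const WordClass& cls) {
    return s.empty() || cls.count(s.back()) == 0;
}

// TRUE when the target word is immediately followed by a member of the class.
// Assumption: adjacency; the exact semantics of MATCHED are not given in the text.
bool matched(const Tokens& s, const std::string& target, const WordClass& cls) {
    for (std::size_t i = 0; i + 1 < s.size(); ++i)
        if (s[i] == target && cls.count(s[i + 1])) return true;
    return false;
}
```

Modelling a word class as a set keeps each membership test logarithmic in the class size; any other lookup structure would serve equally well for the purposes of this sketch.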
In this embodiment, the semantic operation of the wrongly written word recognition mode includes three types, which are respectively defined as follows:
OK (<target word>): indicates that the target word is correct if the sentence satisfies the pattern;
MARK (<target word>): indicates that the target word may be wrong, and marks it, if the sentence satisfies the pattern;
REWRITE (<target word>, <correct word>): indicates that, if the sentence satisfies the pattern, the target word is wrong (it contains a wrongly written character); <correct word> is the corresponding correct word, which automatically replaces it to complete the proofreading.
In the method, the wrongly written character recognition patterns are segmented and their index is stored in a graph, giving a graph-based index structure. A graph storage structure ("graph structure" for short) consists of nodes that can be connected to one another to form a network; among computer data structures, the graph is one of the most flexible. The invention stores the index of the wrongly written character recognition patterns in a graph structure so that both breadth-first and depth-first search can be performed.
Step S2 comprises the following steps:
S21, defining the graph structure in code, including the number of edges, the vertices, the in-degrees, the nodes, and the labels of the graph. One specific code definition of the graph structure given by the invention is as follows:
static int nEdge; // number of edges
static vector<gtype> G[W];
static int nRu[W]; // in-degree
static int nType[W]; // 1: word, 2: part of speech
static int nBelong[W]; // rule class the node belongs to; initially -1; if not -1, nType must be 4 (a rule node)
// index (global)
static int nSum; // number of elements in FindID, i.e. the total number of graph nodes
static map<string, int> FindID; // node number in the graph for each name
static map<int, string> FindName; // mapping from node number to word
S22, defining the structure of the rules (a rule being a wrongly written character recognition pattern) in code:
static int nRuleClass;
static vector<RuleClassType> RuleClass;
The graph structure is thus defined in correspondence with the structure of the wrongly written character recognition patterns; that is, the pattern index based on the graph storage structure is established.
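To make the role of these declarations more concrete, the following sketch builds a small graph index and searches it breadth-first. It is an assumption-laden illustration rather than the invention's code: GraphIndex, node, addEdge, and rulesReachableFrom are hypothetical names, growable vectors replace the fixed-size arrays of the listing, and the edge layout (word to word class to rule node) is only one plausible way to connect pattern elements to the rules that use them. The type codes 1, 2, and 4 follow the comments in the listing.

```cpp
#include <cassert>
#include <map>
#include <queue>
#include <string>
#include <vector>

// Node kinds: 1 and 2 follow the nType comment, 4 marks a rule node.
enum NodeType { kWord = 1, kWordClass = 2, kRule = 4 };

struct GraphIndex {
    std::map<std::string, int> findId;  // name -> node number (cf. FindID)
    std::vector<std::string> findName;  // node number -> name (cf. FindName)
    std::vector<std::vector<int>> adj;  // outgoing edges (cf. G)
    std::vector<int> inDegree;          // cf. nRu
    std::vector<int> type;              // cf. nType

    // Return the node number for a name, creating the node on first use.
    int node(const std::string& name, NodeType t) {
        auto it = findId.find(name);
        if (it != findId.end()) return it->second;
        const int id = static_cast<int>(findName.size());
        findId[name] = id;
        findName.push_back(name);
        adj.emplace_back();
        inDegree.push_back(0);
        type.push_back(t);
        return id;
    }

    // Add a directed edge and update the in-degree of its head.
    void addEdge(int from, int to) {
        adj[from].push_back(to);
        ++inDegree[to];
    }

    // Breadth-first search: rule nodes reachable from the given word.
    std::vector<int> rulesReachableFrom(const std::string& word) const {
        std::vector<int> rules;
        auto it = findId.find(word);
        if (it == findId.end()) return rules;
        std::vector<bool> seen(findName.size(), false);
        std::queue<int> q;
        q.push(it->second);
        seen[it->second] = true;
        while (!q.empty()) {
            const int u = q.front();
            q.pop();
            if (type[u] == kRule) rules.push_back(u);
            for (int v : adj[u])
                if (!seen[v]) { seen[v] = true; q.push(v); }
        }
        return rules;
    }
};
```

Replacing the queue with a stack would turn the same traversal into a depth-first search, which is the other traversal the graph structure is said to support.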
After the index is established, the text to be checked (a Chinese sentence) is pattern-matched through the index structure, and the semantic operation of each matched wrongly written character recognition pattern is carried out, thereby achieving automatic error checking and correction.
Step S3 comprises:
S31, segmenting the sentence to be checked into words and marking the word at each position;
in this step, the segmented sentence to be checked is W1 W2 … Wn, and a tag array Status[n] is used to mark the word Wi at each position:
initial state: Status[i] = 0 (1 <= i <= n);
S32, scanning the words Wi of the sentence S to be checked in sequence; if the end of the sentence S is reached, exiting the check and going to S37; otherwise going to S33;
S33, matching the word Wi of the sentence S to be checked against the wrongly written character recognition pattern index, and, if the match succeeds, putting the matching result into the array vecTempResult (the temporary array);
S34, intersecting the results in vecTempResult, judging whether the number of successfully matched elements equals the length of the matching rule, and putting the index numbers of such rules (that is, their labels in the code-defined graph structure) into the array vecResult (the final array). The length of a rule is determined by using "&" as the separator; for example, Rule 1 contains two "&" separators, so its length is 3;
S35, traversing each wrongly written character recognition pattern in vecResult in turn and checking whether the order of the successfully matched elements is consistent with the matching conditions of the pattern; if so, the matching rule is valid, that is, the match succeeds;
for example, take the sentence to be checked "are these children this is to do that?";
the wrongly written character recognition patterns include the rule: NOTCONTAIN (S, <!all interrogatives>) & MATCHED (S, <that> <!non-interrogative auxiliary words>) → MARK (that);
the matching process is as follows:
NOTCONTAIN (S, <!all interrogatives>) = TRUE
MATCHED (S, <that> <!non-interrogative auxiliary words>) = TRUE
The match succeeds, so the consequent of the rule is executed and "that" in the sentence is marked as possibly wrong;
S36, on a successful match: if the consequent is MARK, Status[i] of the current target word is set to 1, indicating that the word has an error; if the consequent is REWRITE, Status[i] of the current target word is set to 2, indicating that the word has an error, and the word is replaced with the correct word given in the consequent of the wrongly written character recognition pattern;
S37, outputting the checking result and finishing the check of the current sentence.
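The flow of steps S31 to S37 can be sketched in simplified form as below. This is only an illustration under explicit assumptions: the types Rule, CheckResult, and Action and the function checkSentence are hypothetical, the conditions are plain callables standing in for the restriction functions, a linear scan over the rules replaces the graph index lookup and the intersection of vecTempResult, the order check of S35 is omitted, and the first matching rule is assumed to win.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;
enum class Action { Ok, Mark, Rewrite };

// Hypothetical in-memory rule; conditions stand in for the restriction functions.
struct Rule {
    std::string target;                                          // <target word>
    std::vector<std::function<bool(const Tokens&)>> conditions;  // joined by "&"
    Action action;                                               // semantic operation
    std::string correct;                                         // used by REWRITE only
};

struct CheckResult {
    Tokens tokens;            // sentence after any automatic replacement
    std::vector<int> status;  // 0 = initial state, 1 = marked, 2 = rewritten
};

CheckResult checkSentence(const Tokens& sentence, const std::vector<Rule>& rules) {
    // S31: one status entry per word, all initially 0.
    CheckResult out{sentence, std::vector<int>(sentence.size(), 0)};
    for (std::size_t i = 0; i < sentence.size(); ++i) {  // S32: scan in sequence
        for (const Rule& rule : rules) {
            if (rule.target != sentence[i]) continue;    // S33: simplified lookup
            std::size_t satisfied = 0;
            for (const auto& cond : rule.conditions)
                if (cond(sentence)) ++satisfied;
            // S34: keep the rule only if every condition matched.
            if (satisfied != rule.conditions.size()) continue;
            if (rule.action == Action::Mark) {           // S36: apply the operation
                out.status[i] = 1;
            } else if (rule.action == Action::Rewrite) {
                out.status[i] = 2;
                out.tokens[i] = rule.correct;
            }
            break;  // assumption of this sketch: the first matching rule wins
        }
    }
    return out;  // S37: the result for the current sentence
}
```

Conditions are evaluated against the original sentence rather than the partially rewritten one, so an earlier replacement cannot change whether a later rule matches; that choice is a simplification of this sketch, not something the text specifies.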
By defining wrongly written character recognition patterns, the method of the invention combines grammatical restrictions with conditional-function collocation and applies them to wrongly written character recognition. It can effectively target errors that violate local or long-distance grammatical constraints, especially common confusions such as "that" and "where" or "and" and "ground", which machine learning methods find difficult to detect and correct automatically. In practical experiments, recognition patterns were manually compiled for 1,000 commonly mistaken words, and the test corpus comprised 10,000 lines of sentences into which 300 homophone errors were manually inserted; the recall reached 95% and the accuracy reached 90%. The method, applied to wrongly written character recognition, therefore has good accuracy and a certain practicability.
Example 2
The invention also provides a system for identifying wrongly-written words based on pattern matching, which is used for implementing the method of the invention, and the system comprises a processor and a memory, wherein the memory stores a program, and when the program is run by the processor, the following steps are executed:
D1. defining a wrongly written character recognition pattern according to the structural characteristics of the language;
D2. establishing an index of the wrongly written character recognition patterns based on a graph storage structure;
D3. automatically checking and correcting errors in the text to be checked through the established index structure.
The system of the invention is particularly suitable for recognizing wrongly written characters in Chinese text. It defines wrongly written character recognition patterns using features of Chinese such as grammatical structure and semantic restrictions, and uses them to find and correct errors.
In one embodiment, when the program of the invention is run, step D1 establishes the wrongly written character recognition pattern according to the grammatical structure and semantic restriction features of Chinese, including: setting recognition matching conditions and associating semantic operations with them as recognition rules, thereby forming the wrongly written character recognition pattern.
In this embodiment, a wrongly written character recognition pattern serves as a recognition rule. Its structure comprises a recognition matching condition and a semantic operation associated with that condition, so that the semantic operation is applied to sentences that satisfy the condition; the recognition matching condition is defined by conditional functions (restriction functions), wildcards, and the like. For example, the patterns have the following structure:
Rule 1: NOTCONTAIN (S, <target word>) & NOTENDWITH (S, <!punctuation>) & MATCHED (S, <!certain type 1> <!certain type 2>) → OK (<target word>);
Rule 2: NOTCONTAIN (S, <word class 1>) & NOTENDWITH (S, <target word> <!word class 2>) → MARK (<target word>);
Rule 3: NOTCONTAIN (S, <word class 1>) & MATCHED (S, <target word> <!word class 2>) → REWRITE (<target word>, <correct word>).
The wildcards used in the above patterns have their conventional meanings: for example, "+" indicates that any number of characters may intervene, and "&" means "and". The symbol "→" indicates that the matching condition before it is associated with the semantic operation after it.
The restriction functions in the above patterns are defined as follows:
NOTCONTAIN (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked contains the target word "W" or the word class "WORDCLASS1"; returns TRUE if it does not, otherwise FALSE;
NOTENDWITH (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked ends with the target word "W" or the word class "WORDCLASS1"; returns TRUE if it does not, otherwise FALSE;
MATCHED (<S>, <W | WORDCLASS1>): judges whether the sentence "S" to be checked matches the target word "W" or the word class "WORDCLASS1"; returns TRUE if the match succeeds, otherwise FALSE.
Chinese parts of speech include nouns, verbs, adjectives, numerals, measure words, pronouns, distinguishing words, adverbs, prepositions, conjunctions, auxiliary words, interjections, modal particles, and onomatopoeia. In this embodiment, a word class is defined as <!WORDCLASS1> = <W1|W2|...|Wn>, where each Wi is a specific word or phrase.
In this embodiment, the semantic operation of the wrongly written word recognition mode includes three types, which are respectively defined as follows:
OK (<target word>): indicates that the target word is correct if the sentence satisfies the pattern;
MARK (<target word>): indicates that the target word may be wrong, and marks it, if the sentence satisfies the pattern;
REWRITE (<target word>, <correct word>): indicates that, if the sentence satisfies the pattern, the target word is wrong (it contains a wrongly written character); <correct word> is the corresponding correct word, which automatically replaces it to complete the proofreading.
In the system of the invention, when the program is run, performing step D2 comprises:
D21. defining the graph structure, including the number of edges, the vertices, the in-degrees, and the nodes of the graph. One specific implementation given by the invention is as follows:
static int nEdge; // number of edges
static vector<gtype> G[W];
static int nRu[W]; // in-degree
static int nType[W]; // 1: word, 2: part of speech
static int nBelong[W]; // rule class the node belongs to
// index (global)
static int nSum; // number of elements in FindID, i.e. the total number of graph nodes
static map<string, int> FindID; // node number in the graph for each name
static map<int, string> FindName; // mapping from node number to word
D22. defining the structure of the rules (a rule being a wrongly written character recognition pattern):
static int nRuleClass;
static vector<RuleClassType> RuleClass;
After the index is established, the text to be checked (a Chinese sentence) is pattern-matched through the index structure, and the semantic operation of each matched wrongly written character recognition pattern is carried out, thereby achieving automatic error checking and correction.
When the program of the invention is run, performing step D3 comprises:
D31. segmenting the sentence to be checked into words and marking the word at each position;
in this step, the segmented sentence to be checked is W1 W2 … Wn, and a tag array Status[n] is used to mark the word Wi at each position:
initial state: Status[i] = 0 (1 <= i <= n);
D32. scanning the words Wi of the sentence S to be checked in sequence; if the end of the sentence S is reached, exiting the check and going to D37; otherwise going to D33;
D33. matching the word Wi of the sentence S to be checked against the wrongly written character recognition pattern index, and, if the match succeeds, putting the matching result into the temporary array (vecTempResult);
D34. intersecting the results in the temporary array (vecTempResult), judging whether the number of matches equals the length of the rule, and putting the index numbers of such rules into the final array vecResult;
D35. traversing each rule in the array vecResult in turn and checking whether the order of the matched elements is consistent with the rule; if so, the matching rule is valid.
For example, consider the sentence to be checked: "are these children this is to do that?";
among the wrongly-written character recognition patterns there is the rule: NOTCONTAIN(S, <!all interrogative words>) & MATCHED(S, <that> <!non-interrogative particle>) → MARK(that);
The matching process is as follows:
NOTCONTAIN(S, <!all interrogative words>) = TRUE
MATCHED(S, <that> <!non-interrogative particle>) = TRUE
The match succeeds, the consequent of the rule is executed, and "that" in the sentence is marked as possibly wrong;
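The evaluation in the example above can be sketched in code. This is an illustrative sketch under stated assumptions, not the patented implementation: the types Elem, Sentence and ClassDict and the helper fits are hypothetical, word classes are modelled as a simple lookup table from class name to member words, and MATCHED is assumed to accept any number of intervening words between consecutive pattern elements.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// A pattern element is either a literal word or a named word class.
struct Elem {
    bool isClass;
    std::string name;
};

using Sentence = std::vector<std::string>;                       // segmented words
using ClassDict = std::map<std::string, std::set<std::string>>;  // assumed lexicon

// True if word w satisfies element e.
bool fits(const std::string& w, const Elem& e, const ClassDict& dict) {
    if (!e.isClass) return w == e.name;
    auto it = dict.find(e.name);
    return it != dict.end() && it->second.count(w) > 0;
}

// NOTCONTAIN: TRUE when no word of S satisfies e.
bool NOTCONTAIN(const Sentence& s, const Elem& e, const ClassDict& dict) {
    for (const auto& w : s)
        if (fits(w, e, dict)) return false;
    return true;
}

// NOTENDWITH: TRUE when the last word of S does not satisfy e.
bool NOTENDWITH(const Sentence& s, const Elem& e, const ClassDict& dict) {
    return s.empty() || !fits(s.back(), e, dict);
}

// MATCHED: TRUE when the elements occur in S in order (gaps allowed, an assumption).
bool MATCHED(const Sentence& s, const std::vector<Elem>& pat, const ClassDict& dict) {
    std::size_t k = 0;
    for (const auto& w : s)
        if (k < pat.size() && fits(w, pat[k], dict)) ++k;
    return k == pat.size();
}
```

A rule such as the one in the example fires when the conjunction NOTCONTAIN(...) && MATCHED(...) evaluates to true, after which its consequent (here MARK) is applied to the target word.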
D36. If the match succeeds and the consequent of the rule (the "back-piece", i.e. the action part stored in the rule's data structure) is MARK, Status[i] of the current target word is set to 1 to indicate that the word has an error; if the consequent is REWRITE, Status[i] of the current target word is set to 2 to indicate that the word has an error, and the word is replaced by the correct word given in the consequent of the wrongly-written character recognition pattern.
D37. Outputting the error-checking result; error checking of the current sentence is finished.
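Steps D31 to D37 can be sketched end to end as follows. This is a simplified illustration, not the implementation of the invention: the names Rule, Action and checkSentence are hypothetical, an ordinary map from element to (rule, position) pairs stands in for the graph index, pattern elements are literal words only (word classes and the restriction functions are omitted), and the order check assumes that arbitrary gaps between elements are allowed. The arrays vecTempResult and vecResult follow the names used in the description.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

enum class Action { MARK, REWRITE };

struct Rule {
    std::vector<std::string> elems;  // pattern elements in required order
    std::string target;              // target word named in the consequent
    Action action;                   // consequent of the rule
    std::string correct;             // replacement word, used only by REWRITE
};

// Returns Status[] for sentence s; applies REWRITE replacements to s in place.
std::vector<int> checkSentence(std::vector<std::string>& s, const std::vector<Rule>& rules) {
    const std::vector<std::string> words = s;  // match on a copy so rewrites do not interfere
    std::vector<int> status(words.size(), 0);  // D31: initial state 0

    // Stand-in index: element -> (rule number, position inside the rule).
    std::map<std::string, std::vector<std::pair<int, int>>> index;
    for (std::size_t r = 0; r < rules.size(); ++r)
        for (std::size_t k = 0; k < rules[r].elems.size(); ++k)
            index[rules[r].elems[k]].push_back({static_cast<int>(r), static_cast<int>(k)});

    // D32-D33: scan the words and record which elements of which rules were hit.
    std::map<int, std::set<int>> vecTempResult;
    for (const auto& w : words) {
        auto it = index.find(w);
        if (it == index.end()) continue;
        for (const auto& hit : it->second) vecTempResult[hit.first].insert(hit.second);
    }

    // D34: keep rules whose number of matched elements equals the rule length.
    std::vector<int> vecResult;
    for (const auto& kv : vecTempResult)
        if (kv.second.size() == rules[kv.first].elems.size()) vecResult.push_back(kv.first);

    // D35-D36: verify the order of the elements, then apply the consequent.
    for (int r : vecResult) {
        const Rule& rule = rules[r];
        std::size_t k = 0;
        int targetPos = -1;
        for (std::size_t i = 0; i < words.size() && k < rule.elems.size(); ++i) {
            if (words[i] != rule.elems[k]) continue;
            if (rule.elems[k] == rule.target) targetPos = static_cast<int>(i);
            ++k;
        }
        if (k != rule.elems.size() || targetPos < 0) continue;  // order mismatch
        if (rule.action == Action::MARK) {
            status[targetPos] = 1;
        } else {
            status[targetPos] = 2;
            s[targetPos] = rule.correct;
        }
    }
    return status;  // D37: the caller outputs the result
}
```

The length test in D34 is only a cheap filter; the order check in D35 is what rejects a sentence that contains all elements of a rule in the wrong sequence.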
The system of the invention implements the method: it realizes the definition of the wrongly-written character recognition patterns and the establishment of the index through a program, and automatically checks and corrects errors in the text to be checked by using the pattern index. The index structure is built on a graph storage structure, whose flexibility supports both breadth-first and depth-first search of the data, so that a complete database (matching library) can be constructed and the accuracy of wrongly-written character recognition is improved.
In particular, according to the embodiments of the present disclosure, the structure described with reference to the drawings (logic block diagram) may be implemented as a computer software program. For example, embodiment 2 disclosed above includes a computer program product, namely a computer program carried on a computer-readable medium, the computer program containing code for carrying out the procedure shown in the structure of Fig. 1.
The wrongly-written character recognition system based on pattern matching is constructed through a program. The programming languages used to construct the system include object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar languages. The system is constructed as program code that can be executed entirely on a user computer or smart mobile terminal (e.g., mobile phone, pad, etc.), partly on the user computer or smart mobile terminal, as a stand-alone software package, partly on the user computer or smart mobile terminal and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user computer or smart mobile terminal through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (4)

1. A wrongly-written character recognition method based on pattern matching, characterized in that the method comprises the following steps:
S1, defining wrongly-written character recognition patterns according to the structural characteristics of a language;
S2, establishing an index of the wrongly-written character recognition patterns based on a graph storage structure;
S3, automatically checking and correcting errors in the text to be checked through the index of the wrongly-written character recognition patterns;
in S1, the wrongly-written character recognition patterns are established according to the grammatical structure and semantic restriction characteristics of Chinese, which comprises setting recognition matching conditions and associating semantic operations with them as recognition rules, thereby forming the wrongly-written character recognition patterns;
the recognition matching condition is formed by combining restriction functions;
the structure of the recognition pattern includes:
rule 1: NOTCONTAIN(S, <target word>) & NOTENDWITH(S, <!punctuation>) & MATCHED(S, <!certain class 1> <!certain class 2>) → OK(<target word>);
rule 2: NOTCONTAIN(S, <word class 1>) & NOTENDWITH(S, <target word> <!word class 2>) → MARK(<target word>);
rule 3: NOTCONTAIN(S, <word class 1>) & MATCHED(S, <target word> <!word class 2>) → REWRITE(<target word>, <correct word>);
wherein the gap symbol denotes that any number of characters may be interposed; "&" denotes "and"; and "→" denotes that the preceding matching condition is associated with the subsequent semantic operation;
the restriction functions include:
NOTCONTAIN(<S>, <W | WORDCLASS1>), used for judging whether the sentence S to be checked contains the target word W or the word class WORDCLASS1; if not, returning TRUE, otherwise returning FALSE;
NOTENDWITH(<S>, <W | WORDCLASS1>), used for judging whether the sentence S to be checked ends with the target word W or the word class WORDCLASS1; if not, returning TRUE, otherwise returning FALSE;
MATCHED(<S>, <W | WORDCLASS1>), used for judging whether the sentence S to be checked matches the target word W or the word class WORDCLASS1; if the match succeeds, returning TRUE, otherwise returning FALSE;
the restriction functions are combined through a connective;
in S1, in setting the recognition matching conditions and associating semantic operations, the semantic operations comprise:
OK(<target word>): indicating that, if the sentence to be checked satisfies the recognition matching condition, the target word is correct;
MARK(<target word>): indicating that, if the sentence to be checked satisfies the recognition matching condition, the target word may be wrong and is marked;
REWRITE(<target word>, <correct word>): indicating that, if the sentence to be checked satisfies the recognition matching condition, the target word is wrong and contains a wrongly-written character, and it is automatically replaced by the corresponding correct word.
2. The wrongly-written character recognition method based on pattern matching as claimed in claim 1, wherein S2, establishing an index of the wrongly-written character recognition patterns based on a graph storage structure, comprises:
S21, defining a graph structure through code;
S22, defining the parameters of the graph structure through code.
3. The wrongly-written character recognition method based on pattern matching as claimed in claim 1, wherein S3, automatically checking and correcting errors in the text to be checked through the index of the wrongly-written character recognition patterns, comprises:
S31, segmenting the sentence to be checked into words and marking the word at each position;
S32, sequentially scanning the words in the sentence to be checked; if the end of the sentence is reached, exiting error checking; otherwise, going to S33;
S33, matching the words in the sentence to be checked against the index of wrongly-written character recognition patterns, and if the match succeeds, putting the matching result into a temporary array;
S34, taking the intersection of the results in the temporary array, judging whether the number of successfully matched elements is equal to the length of the matching rule, and putting the index numbers of the rules whose length is equal into a final array; the length of a rule is determined by using "&" in the rule as the separator;
S35, sequentially traversing each rule in the final array and judging whether the order of the successfully matched elements is consistent with the rule; if so, the match succeeds;
S36, after the match succeeds, executing the semantic operation according to the consequent of the wrongly-written character recognition pattern;
S37, outputting the error-checking result and finishing error checking of the current sentence.
4. A wrongly-written character recognition system based on pattern matching, characterized in that the system comprises a processor and a memory, wherein the memory stores a program, and when the program is executed by the processor the following steps are carried out:
S1, defining wrongly-written character recognition patterns according to the structural characteristics of the language;
S2, establishing an index of the wrongly-written character recognition patterns based on a graph storage structure;
S3, automatically checking and correcting errors in the text to be checked through the established index structure;
in S1, the wrongly-written character recognition patterns are established according to the grammatical structure and semantic restriction characteristics of Chinese, which comprises setting recognition matching conditions and associating semantic operations with them as recognition rules, thereby forming the wrongly-written character recognition patterns;
the recognition matching condition is formed by combining restriction functions; the restriction functions include:
NOTCONTAIN(<S>, <W | WORDCLASS1>), used for judging whether the sentence S to be checked contains the target word W or the word class WORDCLASS1; if not, returning TRUE, otherwise returning FALSE;
NOTENDWITH(<S>, <W | WORDCLASS1>), used for judging whether the sentence S to be checked ends with the target word W or the word class WORDCLASS1; if not, returning TRUE, otherwise returning FALSE;
MATCHED(<S>, <W | WORDCLASS1>), used for judging whether the sentence S to be checked matches the target word W or the word class WORDCLASS1; if the match succeeds, returning TRUE, otherwise returning FALSE;
the restriction functions are combined through a connective;
in setting the recognition matching conditions and associating semantic operations, the semantic operations comprise:
OK(<target word>): indicating that, if the sentence to be checked satisfies the recognition matching condition, the target word is correct;
MARK(<target word>): indicating that, if the sentence to be checked satisfies the recognition matching condition, the target word may be wrong and is marked;
REWRITE(<target word>, <correct word>): indicating that, if the sentence to be checked satisfies the recognition matching condition, the target word is wrong and contains a wrongly-written character, and it is automatically replaced by the corresponding correct word.
CN201911219533.8A 2019-12-03 2019-12-03 Chinese wrongly-written character recognition method and system based on pattern matching Active CN110991166B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911219533.8A CN110991166B (en) 2019-12-03 2019-12-03 Chinese wrongly-written character recognition method and system based on pattern matching

Publications (2)

Publication Number Publication Date
CN110991166A CN110991166A (en) 2020-04-10
CN110991166B true CN110991166B (en) 2021-07-30

Family

ID=70089697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911219533.8A Active CN110991166B (en) 2019-12-03 2019-12-03 Chinese wrongly-written character recognition method and system based on pattern matching

Country Status (1)

Country Link
CN (1) CN110991166B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7853874B2 (en) * 1998-05-26 2010-12-14 SAS Institute Spelling and grammar checking system
CN105045778A (en) * 2015-06-24 2015-11-11 江苏科技大学 Chinese homonym error auto-proofreading method
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method
CN106547741A (en) * 2016-11-21 2017-03-29 江苏科技大学 A kind of Chinese language text auto-collation based on collocation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
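Research on automatic proofreading of Chinese real-word errors based on combinations of local context features; Liu Liangliang et al.; Computer Science; 2016-12-31; Vol. 43, No. 12; pp. 31-35 *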

Also Published As

Publication number Publication date
CN110991166A (en) 2020-04-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant