CN112232055B - Text detection and correction method based on pinyin similarity and language model - Google Patents


Info

Publication number
CN112232055B
CN112232055B (application CN202011169315.0A)
Authority
CN
China
Prior art keywords
pinyin
word
corrected
sentence
dictionary
Prior art date
Legal status
Active
Application number
CN202011169315.0A
Other languages
Chinese (zh)
Other versions
CN112232055A (en)
Inventor
韩竞
李晓冬
梁木
吴蔚
王鑫鹏
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202011169315.0A priority Critical patent/CN112232055B/en
Publication of CN112232055A publication Critical patent/CN112232055A/en
Application granted granted Critical
Publication of CN112232055B publication Critical patent/CN112232055B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/194 - Calculation of difference between files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a text detection and correction method based on pinyin similarity and a language model, which comprises the steps of collecting a large number of correct instruction text sentences as training sentences; selecting words in the professional field from the training sentences, and constructing a custom dictionary; utilizing a HanLP language processing tool package and a custom dictionary to segment training sentences; counting the occurrence times of each word and each word combination in all training sentences in word segmentation results, and constructing a Bi-Gram language model; converting the sentence to be corrected into corresponding pinyin to be corrected, and converting the word of the custom dictionary into corresponding dictionary pinyin; and correcting the sentence to be corrected according to the pinyin similarity of the pinyin to be corrected and the dictionary pinyin and combining the sentence rationality of the sentence to be corrected, so as to obtain the corrected sentence. The invention considers the semantic information and the context of the sentence through word pinyin similarity calculation and sentence rationality analysis, is beneficial to detecting the wrong words in the sentence and improves the correction accuracy.

Description

Text detection and correction method based on pinyin similarity and language model
Technical Field
The invention relates to the technical field of text detection, in particular to a text detection and correction method based on pinyin similarity and a language model.
Background
When open-domain speech recognition is applied directly to a professional field, interference such as noise and user accents, together with missing professional vocabulary, produces errors in the recognized text and reduces its analyzability. Chinese correction technology is an important technology for automatic checking and automatic correction of Chinese sentences; it aims to improve language correctness and reduce the cost of manual checking. Most existing research on text correction targets standard open-domain texts such as newspapers, books, and periodicals. In a specific field, a general-domain speech recognition engine has a low sentence recognition rate, and named entities and professional terms in some fields cannot be accurately recognized. Research on correcting highly specialized, field-specific text remains scarce and presents a great challenge.
Some scholars are currently researching text correction after speech conversion. Wang Xingjian proposed a correction method based on an N-Gram language model that attempts correction with an N-Gram of pinyin, but it does not take the semantic information and context of sentences into account. Long Lixia et al. proposed a correction method based on instance context: core words of the field serve as a knowledge base, sentences containing the core words are found among the training sentences as instances, and the context correlation and semantic similarity between words and the instance set are calculated to detect errors; the candidate with the highest context harmony, drawn from a confusion set generated by pinyin confusion rules, is taken as the correction result. However, this method depends excessively on the instance library and cannot perform correction if no similar instance is found for the input text.
Disclosure of Invention
The invention provides a text detection and correction method based on pinyin similarity and a language model, aiming to solve the problems that existing text correction methods cannot fully consider the semantic information and context of sentences and depend excessively on an instance library, resulting in low text correction accuracy.
The invention provides a text detection and correction method based on pinyin similarity and a language model, which comprises the following steps:
step 1, collecting a large number of correct instruction text sentences as training sentences;
step 2, selecting words in the professional field from the training sentences for constructing a custom dictionary;
step 3, utilizing a HanLP language processing tool package and the custom dictionary to segment the training sentences to obtain a word segmentation result;
step 4, counting the occurrence times of each word and each word combination in all training sentences in the word segmentation result, and constructing a Bi-Gram language model for evaluating the sentence rationality of the sentence to be corrected;
step 5, converting the sentence to be corrected into corresponding pinyin to be corrected, and converting the word of the custom dictionary into corresponding dictionary pinyin;
and 6, correcting the sentence to be corrected according to the pinyin similarity of the pinyin to be corrected and the pinyin of the dictionary and combining the sentence rationality of the sentence to be corrected to obtain the corrected sentence.
Further, in one implementation, the step 1 includes: in a given professional field, collecting more than 1000 correct instruction text sentences as training sentences according to documents or data of that field; wherein a correct instruction text sentence is a sentence conforming to the term rules of the professional field.
Further, in one implementation, the step 2 includes:
calculating the occurrence frequency of each professional word in each instruction text sentence;
if the occurrence frequency of any professional word in the instruction text sentence is higher than a preset occurrence threshold, selecting the professional word for constructing a custom dictionary;
the custom dictionary takes the form of text files; and each professional word is respectively arranged in one row of the custom dictionary.
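The dictionary-construction step above can be sketched as follows. The exact frequency definition is not given in the patent; the fraction of training sentences containing the word is assumed here, and all names are illustrative.

```python
from collections import Counter

def build_custom_dictionary(sentences, candidate_words, threshold=0.10):
    """Select professional words whose sentence frequency exceeds the
    preset occurrence threshold (10% in the embodiment).  Frequency is
    taken here as the fraction of training sentences containing the word."""
    n = len(sentences)
    freq = Counter({w: sum(1 for s in sentences if w in s) / n
                    for w in candidate_words})
    return [w for w in candidate_words if freq[w] > threshold]

def save_dictionary(words, path):
    """The custom dictionary is a plain text file, one word per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")
```

The threshold argument mirrors the preset occurrence threshold, which the embodiment notes may be adjusted per scene.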
Further, in one implementation, the step 3 includes: segmenting each training sentence, through the standard tokenizer in the HanLP natural language processing toolkit, according to the words contained in the toolkit's universal dictionary and in the custom dictionary, to obtain a word segmentation result in which each training sentence is divided into a sequence of words.
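HanLP's standard tokenizer is not reproduced here; as a self-contained stand-in for dictionary-driven segmentation (not HanLP's actual algorithm), a forward maximum-matching segmenter over a merged general-plus-custom dictionary might look like:

```python
def fmm_segment(sentence, dictionary, max_word_len=6):
    """Forward maximum matching: at each position greedily take the longest
    dictionary word; fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            if length == 1 or sentence[i:i + length] in dictionary:
                words.append(sentence[i:i + length])
                i += length
                break
    return words
```

For example, with the custom-dictionary words "雷达" and "开机", the sentence "雷达开机" splits into those two words rather than four single characters.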
Further, in one implementation, the step 4 includes:
if the word segmentation result shows that a training sentence contains n words;
the n words are denoted w_1 w_2 … w_n, i.e. the training sentence is S = (w_1 w_2 … w_n). The N-Gram language model of the training sentence represents all word combinations obtained by combining adjacent words, i.e. (w_1 w_2 … w_N), (w_2 w_3 … w_{N+1}), …, (w_{n+1-N} … w_{n-1} w_n), where N is the length of a word combination, that is, each word combination contains N adjacent words, and adjacent words are words next to each other in the word segmentation result of the training sentence; the occurrence times of each word combination and each single word in all training sentences are counted;
the probability of occurrence of the sentence to be corrected S' in the professional field is obtained, and the sentence rationality of S' is evaluated according to the following formula:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1} … w_2 w_1)
where, according to the occurrence times of single words and word combinations in all training sentences in the word segmentation result, the probability of the first word w_1 occurring in the professional field is P(w_1); the occurrence of the second word w_2 depends on the first word w_1, i.e. the probability of w_2 occurring in the professional field is P(w_2 | w_1); similarly, the occurrence of the m-th word w_m depends on the preceding words w_{m-1} … w_2 w_1, and the probability of w_m occurring in the professional field is P(w_m | w_{m-1} … w_2 w_1).
Further, in one implementation, the step 4 includes:
if the occurrence of a word in a sentence depends only on the single word immediately before it, i.e. when N takes the value 2, the word combination is called a binary word combination Bi-Gram; all the training sentences are segmented, the occurrence times of each word and each binary word combination Bi-Gram are counted, and the Bi-Gram language model is formed;
the probability of occurrence of the statement to be corrected S' in the technical field to which the statement to be corrected belongs is:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1}).
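The Bi-Gram counting and sentence-probability computation of step 4 can be sketched as below; the maximum-likelihood estimates are unsmoothed, so a sentence containing any unseen Bi-Gram scores zero.

```python
from collections import Counter

def train_bigram(segmented_sentences):
    """Count single words and binary word combinations (Bi-Grams)
    over all segmented training sentences."""
    uni, bi = Counter(), Counter()
    for words in segmented_sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def sentence_probability(words, uni, bi, total):
    """P(S) = P(w_1) * prod_i P(w_i | w_{i-1}), maximum-likelihood
    estimates without smoothing (zero for any unseen Bi-Gram)."""
    if not words:
        return 0.0
    p = uni[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if uni[prev] == 0:
            return 0.0
        p *= bi[(prev, cur)] / uni[prev]
    return p
```

Here `total` is the total unigram count over all training sentences, an assumed estimator for P(w_1).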
further, in one implementation, the step 4 includes:
with additive smoothing applied to the Bi-Gram language model, the probability that the m-th word w_m occurs in the professional field is:
P(w_m | w_{m-1}) = (C(w_{m-1} w_m) + k) / (C(w_{m-1}) + k|V|)
where C(w_{m-1}) represents the number of occurrences of the single word w_{m-1} in all training sentences, C(w_{m-1} w_m) represents the number of occurrences of the binary word combination (w_{m-1} w_m) in all training sentences, k is a constant with 0 < k ≤ 1, and |V| represents the number of distinct words in the word segmentation results of all training sentences;
therefore, the probability of occurrence of the sentence to be corrected S' in the professional field to which it belongs is:
P(S') = P(w_1) × ∏_{i=2…m} (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + k|V|)
the data smoothing is realized through the addition smoothing, and even if the sentence contains word combinations which do not appear in the language model, the probability of the sentence can be ensured to be not 0.
Further, in one implementation, the step 5 includes:
step 5-1, converting each Chinese character in the sentence to be corrected into pinyin by utilizing the HanLP natural language processing tool package, namely pinyin to be corrected; converting each Chinese character in the custom dictionary into pinyin, namely dictionary pinyin, by utilizing the HanLP natural language processing tool kit;
step 5-2, setting a pinyin similarity threshold, and judging whether the pinyin to be corrected needs to be corrected according to a comparison result of the similarity between the pinyin to be corrected and the dictionary pinyin and the similarity threshold;
if the similarity between the pinyin to be corrected and the dictionary pinyin is smaller than a similarity threshold, determining that the pinyin to be corrected does not need to be corrected;
and if the similarity between the pinyin to be corrected and the dictionary pinyin is greater than or equal to a similarity threshold, determining that the pinyin to be corrected needs to be corrected.
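The patent does not spell out how pinyin similarity is computed; as one plausible stand-in, a normalized matching ratio over the two pinyin strings (Python's difflib), compared against a threshold such as the 0.75 used in the embodiment:

```python
from difflib import SequenceMatcher

def pinyin_similarity(pinyin_a, pinyin_b):
    """Similarity of two pinyin strings in [0, 1].  The patent does not
    define the measure; difflib's matching ratio is assumed here."""
    return SequenceMatcher(None, pinyin_a, pinyin_b).ratio()

def needs_correction(window_pinyin, dict_pinyin, threshold=0.75):
    """At or above the threshold the window is a correction candidate;
    below it, the pinyin to be corrected is left alone."""
    return pinyin_similarity(window_pinyin, dict_pinyin) >= threshold
```

Any edit-distance-based ratio would serve the same role; the key is that identical pinyin scores 1.0 and unrelated pinyin scores near 0.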
Further, in one implementation, the step 6 includes:
calculating the pinyin similarity according to the pinyin to be corrected and the dictionary pinyin;
starting at the initial position of the sentence to be corrected, taking the actual length of each word in the custom dictionary as the length of the sliding window, calculating the similarity between that word's dictionary pinyin and the to-be-corrected pinyin string inside the sliding window, and traversing the dictionary pinyin corresponding to all words in the custom dictionary to calculate the similarity;
if the pinyin string similarity between the dictionary pinyin corresponding to a word in the custom dictionary and the to-be-corrected pinyin corresponding to the words in the sliding window is greater than or equal to the preset similarity threshold, temporarily replacing the words in the sliding window with the word from the custom dictionary to obtain a temporary correction sentence;
according to the constructed Bi-Gram language model, analyzing the rationality of the temporary correction statement, namely calculating the probability of the temporary correction statement in the professional field;
if the rationality of the temporary correction sentence is greater than that of the sentence to be corrected, the words in the sliding window are replaced with the word from the custom dictionary, and the position of the sliding window is shifted to the right by the length of that word;
if the rationality of the temporary correction sentence is smaller than or equal to that of the sentence to be corrected, the words in the sliding window are not replaced, i.e. the sentence to be corrected is kept as it is.
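The sliding-window correction loop of step 6 can be sketched as below. The pinyin conversion and the rationality score are passed in as callables (in the patent these roles are filled by HanLP and the Bi-Gram language model); the difflib similarity measure and the one-position fallback shift are assumptions.

```python
from difflib import SequenceMatcher

def correct_sentence(sentence, dictionary, to_pinyin, rationality,
                     threshold=0.75):
    """Slide a window over the sentence; for each custom-dictionary word
    whose pinyin is similar enough to the window's pinyin, keep the
    replacement only if it raises the sentence's rationality score."""
    i = 0
    while i < len(sentence):
        replaced = False
        for word in dictionary:
            window = sentence[i:i + len(word)]
            if len(window) < len(word):
                continue  # window would run past the end of the sentence
            sim = SequenceMatcher(None, to_pinyin(window),
                                  to_pinyin(word)).ratio()
            if sim >= threshold:
                # temporary correction sentence: window replaced by the word
                candidate = sentence[:i] + word + sentence[i + len(word):]
                if rationality(candidate) > rationality(sentence):
                    sentence = candidate
                    i += len(word)  # shift the window right by the word length
                    replaced = True
                    break
        if not replaced:
            i += 1  # no dictionary word improved rationality: shift right
    return sentence
```

For instance, with a toy pinyin table where 大 and 达 share the pinyin "da" and a score that favors sentences containing "雷达", the misrecognized "雷大开机" is corrected to "雷达开机" while an already-correct sentence is left unchanged.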
Further, in one implementation, the step 6 includes:
if all words in the custom dictionary have been traversed and the pinyin of every word in the custom dictionary has a similarity to the pinyin of the words of the sentence to be corrected that is smaller than the preset similarity threshold, the position of the sliding window is shifted one position to the right; at the new position, all words in the custom dictionary are traversed again to calculate the pinyin string similarity and analyze the sentence rationality, until the sliding window reaches the end of the sentence to be corrected and correction is finished, and finally the correction sentence is output; the correction sentence is the sentence obtained after the sentence to be corrected has been corrected from beginning to end through the sliding window.
According to the technical scheme, the embodiment of the invention provides a text detection and correction method based on pinyin similarity and a language model, which comprises two processes of pinyin similarity calculation and sentence rationality analysis.
In the prior art, text correction methods fail to consider the semantic information and context of sentences and depend excessively on an instance library, so text correction accuracy is low. The present method performs instruction text detection and correction based on pinyin similarity and an N-Gram model, which effectively avoids the semantic context errors caused by pinyin-only correction. Specifically, the method analyzes the pinyin similarity between words and then evaluates sentence rationality with the trained Bi-Gram language model, greatly improving the accuracy of text detection and correction compared with the prior art.
In particular, compared with the prior art, the invention has the remarkable advantages that:
(1) The text correction method provided by the invention is used as a text processing means, and can be used for correcting text of sentence results of voice recognition in the military field. The pinyin similarity calculation is used as a text similarity assessment means, and text similarity comparison can be carried out on words after voice recognition in the military field and dictionary words. The N-Gram language model obtains the probability of word combination by training a large number of sentences, and reflects the context and semantic information of the sentences.
(2) The N-Gram language model of the present invention is based on the Markov assumption that the probability of occurrence of the nth word is related to only the preceding N-1 words, irrespective of the other words. The larger the value of N, the more accurate the probability given by the language model, but the more the parameter quantity is contained, the larger the calculated quantity is. Through a large number of sentence training, a very accurate language model is constructed. By adopting the data smoothing method, the data sparseness problem of the N-Gram language model is effectively solved, so that the semantic information of sentences is more accurately analyzed, and the rationality of the sentences in the correction process is ensured.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a text detection and correction method based on Pinyin similarity and language model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a custom dictionary in a text detection and correction method based on Pinyin similarity and language model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a Bi-Gram language model in a text detection and correction method based on Pinyin similarity and language models according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of a specific implementation of a text detection and correction method based on pinyin similarity and language model according to an embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
The embodiment of the invention discloses a text detection and correction method based on pinyin similarity and a language model, which is applied to correction of instruction texts in the professional field.
As shown in fig. 1, the present embodiment provides a text detection and correction method based on pinyin similarity and language model, which includes the following steps:
step 1, collecting a large number of correct instruction text sentences as training sentences;
step 2, selecting words in the professional field from the training sentences for constructing a custom dictionary;
step 3, utilizing a HanLP language processing tool package and the custom dictionary to segment the training sentences to obtain a word segmentation result;
step 4, counting the occurrence times of each word and each word combination in all training sentences in the word segmentation result, and constructing a Bi-Gram language model for evaluating the sentence rationality of the sentence to be corrected;
step 5, converting the sentence to be corrected into corresponding pinyin to be corrected, and converting the word of the custom dictionary into corresponding dictionary pinyin;
and 6, correcting the sentence to be corrected according to the pinyin similarity of the pinyin to be corrected and the pinyin of the dictionary and combining the sentence rationality of the sentence to be corrected to obtain the corrected sentence.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 1 includes: in a given professional field, collecting more than 1000 correct instruction text sentences as training sentences according to documents or data of that field; wherein a correct instruction text sentence is a sentence conforming to the term rules of the professional field.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 2 includes:
calculating the occurrence frequency of each professional word in each training sentence;
if the occurrence frequency of any professional word in the training sentence is higher than a preset occurrence threshold, selecting the professional word for constructing a custom dictionary; specifically, in this embodiment, the preset occurrence threshold is 10%, and the setting of the preset threshold may be adjusted according to an actual scene.
The custom dictionary takes the form of text files; each professional word is respectively placed in a row of the custom dictionary, as shown in fig. 2.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 3 includes: segmenting each training sentence, through the standard tokenizer in the HanLP natural language processing toolkit, according to the words contained in the toolkit's universal dictionary and in the custom dictionary, to obtain a word segmentation result in which each training sentence is divided into a sequence of words.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 4 includes:
if the word segmentation result shows that a training sentence contains n words;
the n words are denoted w_1 w_2 … w_n, i.e. the training sentence is S = (w_1 w_2 … w_n). The N-Gram language model of the training sentence represents all word combinations obtained by combining adjacent words, i.e. (w_1 w_2 … w_N), (w_2 w_3 … w_{N+1}), …, (w_{n+1-N} … w_{n-1} w_n), where N is the length of a word combination, that is, each word combination contains N adjacent words, and adjacent words are words next to each other in the word segmentation result of the training sentence; the occurrence times of each word combination and each single word in all training sentences are counted;
the probability of occurrence of the sentence to be corrected S' in the professional field is obtained, and the sentence rationality of S' is evaluated according to the following formula:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1} … w_2 w_1)
where, according to the occurrence times of single words and word combinations in all training sentences in the word segmentation result, the probability of the first word w_1 occurring in the professional field is P(w_1); the occurrence of the second word w_2 depends on the first word w_1, i.e. the probability of w_2 occurring in the professional field is P(w_2 | w_1); similarly, the occurrence of the m-th word w_m depends on the preceding words w_{m-1} … w_2 w_1, and the probability of w_m occurring in the professional field is P(w_m | w_{m-1} … w_2 w_1).
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 4 includes: to reduce algorithm complexity, the occurrence of a word in a sentence is assumed to depend only on the single word immediately before it, i.e. N takes the value 2; the word combination is then called a binary word combination Bi-Gram. All training sentences are segmented, the occurrence times of each word and each binary word combination Bi-Gram are counted, and the Bi-Gram language model is formed; specifically, as shown in fig. 3, the Bi-Gram language model is a statistical result of word combinations.
The probability of occurrence of the statement to be corrected S' in the technical field to which the statement to be corrected belongs is:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1}).
in the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 4 includes: because the Bi-Gram language model has a data sparsity problem, that is, if many binary word combinations of the sentence to be corrected do not appear in the training sentences, the probability P(S') of the sentence to be corrected S' occurring in the professional field to which it belongs will be 0, which severely limits the application range of the Bi-Gram language model. Therefore, additive smoothing is applied to the Bi-Gram language model to enlarge its application range.
With additive smoothing applied to the Bi-Gram language model, the probability that the m-th word w_m occurs in the professional field is:
P(w_m | w_{m-1}) = (C(w_{m-1} w_m) + k) / (C(w_{m-1}) + k|V|)
where C(w_{m-1}) represents the number of occurrences of the single word w_{m-1} in all training sentences, C(w_{m-1} w_m) represents the number of occurrences of the binary word combination (w_{m-1} w_m) in all training sentences, k is a constant with 0 < k ≤ 1, and |V| represents the number of distinct words in the word segmentation results of all training sentences;
therefore, the probability of occurrence of the sentence to be corrected S' in the professional field to which it belongs is:
P(S') = P(w_1) × ∏_{i=2…m} (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + k|V|)
additive smoothing thus realizes data smoothing: even if a sentence contains word combinations that never appear in the language model, its probability is guaranteed to be non-zero.
In the text detection and correction method based on pinyin similarity and language model of the present embodiment, the step 5 includes:
step 5-1, converting each Chinese character in the sentence to be corrected into pinyin by utilizing the HanLP natural language processing tool package, namely pinyin to be corrected; converting each Chinese character in the custom dictionary into pinyin, namely dictionary pinyin, by utilizing the HanLP natural language processing tool kit;
step 5-2, setting a pinyin similarity threshold, and judging whether the pinyin to be corrected needs to be corrected according to a comparison result of the similarity between the pinyin to be corrected and the dictionary pinyin and the similarity threshold;
if the similarity between the pinyin to be corrected and the dictionary pinyin is smaller than a similarity threshold, determining that the pinyin to be corrected does not need to be corrected;
and if the similarity between the pinyin to be corrected and the dictionary pinyin is greater than or equal to the similarity threshold, determining that the pinyin to be corrected needs to be corrected. Specifically, in this embodiment, the similarity threshold is 0.75, and it may be adjusted appropriately according to the actual scene.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 6 includes:
calculating the pinyin similarity according to the pinyin to be corrected and the dictionary pinyin;
starting at the initial position of the sentence to be corrected, taking the actual length of each word in the custom dictionary as the length of the sliding window, calculating the similarity between that word's dictionary pinyin and the to-be-corrected pinyin string inside the sliding window, and traversing the dictionary pinyin corresponding to all words in the custom dictionary to calculate the similarity;
if the pinyin string similarity between the dictionary pinyin corresponding to a word in the custom dictionary and the to-be-corrected pinyin corresponding to the words in the sliding window is greater than or equal to the preset similarity threshold, temporarily replacing the words in the sliding window with the word from the custom dictionary to obtain a temporary correction sentence;
according to the constructed Bi-Gram language model, analyzing the rationality of the temporary correction statement, namely calculating the probability of the temporary correction statement in the professional field;
if the rationality of the temporary correction sentence is greater than that of the sentence to be corrected, the words in the sliding window are replaced with the word from the custom dictionary, and the position of the sliding window is shifted to the right by the length of that word;
if the rationality of the temporary correction sentence is smaller than or equal to that of the sentence to be corrected, the words in the sliding window are not replaced, i.e. the sentence to be corrected is kept as it is.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 6 includes:
if all words in the custom dictionary have been traversed and the pinyin of every word in the custom dictionary has a similarity to the pinyin of the words of the sentence to be corrected that is smaller than the preset similarity threshold, the position of the sliding window is shifted one position to the right; at the new position, all words in the custom dictionary are traversed again to calculate the pinyin string similarity and analyze the sentence rationality, until the sliding window reaches the end of the sentence to be corrected and correction is finished, and finally the correction sentence is output; the correction sentence is the sentence obtained after the sentence to be corrected has been corrected from beginning to end through the sliding window.
Specifically, the text detection and correction method based on pinyin similarity and language model provides the following embodiments:
According to the step 1, in a certain professional field, a large number of correct instruction text sentences are collected from documents or data in that field. More than 1000 correct instruction text sentences are collected. Common term collocation rules in the military field are a) subject + predicate, b) predicate + object, c) subject + predicate + object, and d) predicate + object + object, as shown in Table 1.
TABLE 1 statement component Table
[Table 1 is rendered as an image (GDA0004118891830000111) in the original publication.]
According to the step 2, professional words in the military field are generally unique to that field. With reference to fig. 2, words in the professional field are selected from the training sentences to construct a custom dictionary; fig. 2 shows the custom dictionary, in which more than 90 words are selected;
according to the step 3, the training sentences are segmented using the HanLP language processing toolkit and the custom dictionary, and a word segmentation result is obtained;
According to the step 4, a Bi-Gram language model is constructed from the word segmentation result, in combination with FIG. 3; FIG. 3 shows a Bi-Gram language model trained on more than 12000 training sentences. Let the training sentence S contain n words, i.e. S = (w_1 w_2 … w_n). The binary word combinations (Bi-Grams) of the sentence are the word combinations obtained by dividing the word segmentation result of the original sentence into spans of word length 2, namely all combinations of two adjacent words in the sentence.
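The Bi-Gram counting described in step 4 can be sketched as follows; the toy corpus below is a hypothetical stand-in for the segmented training sentences:

```python
from collections import Counter

def count_ngrams(segmented_sentences):
    """Count unigram and adjacent-bigram occurrences over pre-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        unigrams.update(words)                 # single-word counts C(w)
        bigrams.update(zip(words, words[1:]))  # adjacent-pair counts C(w_prev, w)
    return unigrams, bigrams

# hypothetical segmented training corpus
corpus = [["situation", "map", "left"], ["situation", "map", "zoom"]]
uni, bi = count_ngrams(corpus)
```

These two counter tables, together with the vocabulary size |V|, are all the statistics the smoothed Bi-Gram model needs.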
The probability of occurrence of the statement to be corrected S' in the technical field to which it belongs is:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1}).
The method applies additive smoothing to the Bi-Gram language model: a constant k (0 < k ≤ 1) is added to the occurrence count of each binary word combination, as follows:
P(w_m | w_{m-1}) = (C(w_{m-1} w_m) + k) / (C(w_{m-1}) + k·|V|)
wherein C(w_{m-1}) denotes the number of occurrences of the single word w_{m-1} in all training sentences, C(w_{m-1} w_m) denotes the number of occurrences of the binary word combination (w_{m-1} w_m) in all training sentences, k is a constant with 0 < k ≤ 1, and |V| denotes the number of distinct words in the word segmentation result of all training sentences;
therefore, the probability of occurrence of the statement to be corrected S' in the technical field to which it belongs is:
P(S') = P(w_1) · ∏_{i=2}^{m} (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + k·|V|)
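A minimal sketch of computing the smoothed sentence probability in Python. The treatment of the first-word probability P(w_1) here (smoothed unigram frequency over the total word count) is an assumption, since the text does not spell it out:

```python
def smoothed_bigram_prob(w_prev, w, unigrams, bigrams, vocab_size, k=1.0):
    # P(w | w_prev) = (C(w_prev w) + k) / (C(w_prev) + k * |V|)
    return (bigrams.get((w_prev, w), 0) + k) / (unigrams.get(w_prev, 0) + k * vocab_size)

def sentence_prob(words, unigrams, bigrams, total_words, vocab_size, k=1.0):
    # P(w_1) estimated from smoothed unigram frequency -- an assumption
    p = (unigrams.get(words[0], 0) + k) / (total_words + k * vocab_size)
    for prev, cur in zip(words, words[1:]):
        p *= smoothed_bigram_prob(prev, cur, unigrams, bigrams, vocab_size, k)
    return p
```

Because every factor is strictly positive for k > 0, a sentence containing an unseen word pair still receives a small nonzero probability and remains comparable with its correction candidates.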
According to the step 5, with reference to fig. 4: first, each Chinese character in the sentence to be corrected and in the custom dictionary is converted into pinyin using HanLP, i.e. the pinyin of "tai drawing moves to the left", "situation map", "positioning" and "moving" are "tai shi tu xiang zuo yi dong", "tai shitu", "ding wei" and "yi dong" respectively. Second, assume that exactly one letter in a simple two-character pinyin string is wrong, i.e. only one letter in the pinyin "ab cd" is wrong; the similarity between the wrong pinyin and the correct pinyin is then 0.75. In this embodiment the pinyin similarity threshold is therefore initially set to 0.75; the threshold may be adjusted appropriately for the actual scenario.
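One plausible implementation of the pinyin string similarity, consistent with the 0.75 figure above (one wrong letter in a four-letter pinyin yields similarity 0.75), is one minus the normalized edit distance. The exact metric used by the patent is not stated, so this is an assumption:

```python
def levenshtein(a, b):
    # classic edit distance via dynamic programming, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def pinyin_similarity(p1, p2):
    # 1 - normalized edit distance; "abcd" vs "abed" (one wrong letter) -> 0.75
    longest = max(len(p1), len(p2))
    return (1 - levenshtein(p1, p2) / longest) if longest else 1.0
```

With this definition, identical pinyin strings score 1.0 and a single wrong letter in a four-letter string scores exactly the 0.75 threshold used in the embodiment.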
According to the step 6: first, at the initial position of the sentence to be corrected, the lengths of the dictionary words "situation map", "positioning" and "moving" (3, 2 and 2 respectively) are used in turn as the sliding-window length, and the pinyin string similarity between each dictionary word and the word in the window is calculated, i.e. the similarities between "tai shitu" and "tai shi", between "ding wei" and "tai shi", and between "yi dong" and "tai shi" are calculated respectively;
If the pinyin similarity between a dictionary word and the word in the window of the sentence to be corrected is greater than or equal to the preset similarity threshold, the word in the window is replaced by the dictionary word, continuing until all words in the dictionary have been traversed. As can be seen from fig. 4, the word satisfying the similarity threshold condition is "situation map", and the sentence to be corrected is temporarily replaced by "situation map is moved leftwards"; the probability of the sentence "situation map is moved leftwards" is then calculated according to the constructed Bi-Gram language model. Since this probability is larger than that of the original sentence, the word in the window is replaced: the sentence to be corrected becomes "the situation map is moved leftwards", and the sliding window is moved rightwards by the length of the word "situation map", i.e. by 3 characters in this embodiment;
Next, for the sentence to be corrected "situation map moves to the left", the similarities between "tai shitu" and "xiang zuo yi", between "ding wei" and "xiang zuo", and between "yi dong" and "xiang zuo" are calculated respectively. After all dictionary words have been traversed, the window position is shifted right by one step, i.e. by 1 character; the dictionary words are traversed again at the new window position to calculate pinyin string similarity. The pinyin similarity threshold is still not met, so the window position continues to shift right by 1 character; at this point the sentence to be corrected is still "situation map is moved to the left".
At this point, the similarities between "ding wei" and "yi dong" and between "yi dong" and "yi dong" are calculated respectively; the word satisfying the similarity threshold condition is "moving", and the sentence to be corrected is temporarily replaced by "situation map moving leftwards". The probability of the sentence "situation map moving leftwards" is then calculated according to the Bi-Gram language model. Since this probability is larger than that of the original sentence, the word in the window is replaced, and the final corrected sentence is "situation map moving leftwards".
Table 2 statistics of correction accuracy using the present method
[Table 2 is rendered as an image (GDA0004118891830000131) in the original publication.]
According to the technical scheme, the embodiment of the invention provides a text detection and correction method based on pinyin similarity and a language model, which comprises two processes: pinyin similarity calculation and sentence rationality analysis.
In the prior art, text correction methods cannot take into account the semantic information and context of sentences and rely excessively on an instance library, so text correction accuracy is low. With the present method, instruction text detection and correction are carried out based on pinyin similarity and an N-Gram model, which effectively avoids the semantic context errors caused by pinyin-only correction. Specifically, the method analyzes the pinyin similarity between words and evaluates sentence rationality with the trained Bi-Gram language model, greatly improving the accuracy of text detection and correction compared with the prior art.
In particular, compared with the prior art, the invention has the remarkable advantages that:
(1) The text correction method provided by the invention, as a text processing means, can correct the sentence results of speech recognition in the military field. Pinyin similarity calculation, as a text similarity assessment means, allows words produced by speech recognition in the military field to be compared with dictionary words. The N-Gram language model obtains the probability of word combinations by training on a large number of sentences and thus reflects the context and semantic information of sentences.
(2) The N-Gram language model of the present invention is based on the Markov assumption that the probability of occurrence of the N-th word depends only on the preceding N-1 words and is independent of the other words. The larger the value of N, the more accurate the probability given by the language model, but the more parameters it contains and the larger the computational cost. Training on a large number of sentences yields an accurate language model. The data smoothing method effectively alleviates the data sparsity problem of the N-Gram language model, so that the semantic information of sentences is analyzed more accurately and the rationality of sentences during correction is ensured.
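The data-sparsity point can be illustrated numerically: a maximum-likelihood bigram estimate assigns probability zero to any unseen word pair and therefore zeroes the whole sentence probability, while the add-k estimate stays positive (the counts below are illustrative):

```python
def mle_prob(c_bigram, c_prev):
    # maximum-likelihood estimate: zero for any unseen bigram
    return c_bigram / c_prev if c_prev else 0.0

def add_k_prob(c_bigram, c_prev, k, vocab_size):
    # add-k smoothed estimate: strictly positive for k > 0
    return (c_bigram + k) / (c_prev + k * vocab_size)

# an unseen bigram (count 0) after a word seen 10 times, with |V| = 100 and k = 1
unsmoothed = mle_prob(0, 10)          # 0.0 -- kills the sentence probability
smoothed = add_k_prob(0, 10, 1.0, 100)  # small but nonzero
```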
In a specific implementation, the present invention further provides a computer storage medium, where the computer storage medium may store a program, and the program, when executed, may perform some or all of the steps of each embodiment of the text detection and correction method based on pinyin similarity and language model provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the embodiments, or in some parts of the embodiments, of the present invention.
The same or similar parts between the various embodiments in this specification are referred to each other. The embodiments of the present invention described above do not limit the scope of the present invention.

Claims (3)

1. A text detection and correction method based on pinyin similarity and language model is characterized by comprising the following steps:
step 1, collecting a large number of correct instruction text sentences as training sentences;
step 2, selecting words in the professional field from the training sentences for constructing a custom dictionary;
step 3, utilizing a HanLP language processing tool package and the custom dictionary to segment the training sentences to obtain a word segmentation result;
step 4, counting the occurrence times of each word and each word combination in all training sentences in the word segmentation result, and constructing a Bi-Gram language model for evaluating the sentence rationality of the sentence to be corrected;
step 5, converting the sentence to be corrected into corresponding pinyin to be corrected, and converting the word of the custom dictionary into corresponding dictionary pinyin;
step 6, correcting the sentence to be corrected according to the pinyin similarity of the pinyin to be corrected and the pinyin of the dictionary and combining the sentence rationality of the sentence to be corrected to obtain a corrected sentence;
the step 2 includes:
calculating the occurrence frequency of each professional word in each training sentence;
if the occurrence frequency of any professional word in the training sentence is higher than a preset occurrence threshold, selecting the professional word for constructing a custom dictionary;
the custom dictionary takes the form of a text file; each professional word occupies one line of the custom dictionary;
the step 3 includes: through a standard word segmentation device in the HanLP language processing tool kit, according to words contained in a universal dictionary and a custom dictionary in the HanLP language processing tool kit, segmenting each training sentence to obtain a word segmentation result, wherein the word segmentation result is that each training sentence is divided into a plurality of word combinations;
the step 4 includes:
assuming the word segmentation result is that the training sentence after word segmentation contains n words;
the n words are w_1, w_2, …, w_n, i.e. the training sentence S = (w_1 w_2 … w_n); the N-Gram language model of the training sentence represents all word combinations obtained by combining adjacent words, i.e. (w_1 w_2 … w_N), (w_2 w_3 … w_{N+1}), …, (w_{n+1-N} … w_{n-1} w_n), wherein N denotes the length of a word combination, that is, each word combination contains N adjacent words, adjacent words being words next to each other in the word segmentation result of the training sentence; the number of occurrences of each word combination and each word in all training sentences is counted;
obtaining the probability of occurrence of the sentence S 'to be corrected in the technical field, and evaluating the sentence rationality of the sentence S' to be corrected according to the following formula:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1} … w_2 w_1)
wherein, from the occurrence counts of single words and word combinations over all training sentences in the word segmentation result, the probability of the first word w_1 occurring in the professional field is P(w_1); the occurrence of the second word w_2 depends on the first word w_1, i.e. the probability of the second word w_2 occurring in the professional field is P(w_2 | w_1); similarly, the occurrence of the m-th word w_m depends on the preceding m-1 words w_{m-1} … w_2 w_1, and the probability of the m-th word w_m occurring in the professional field is P(w_m | w_{m-1} … w_2 w_1);
The step 4 includes:
if the occurrence of a word in a sentence depends only on the single word immediately preceding it, namely when the value of N is 2, the word combination is called a binary word combination Bi-Gram; all training sentences are segmented, the occurrence counts of each word and each binary word combination Bi-Gram are collected, and these counts form the Bi-Gram language model;
the probability of occurrence of the statement to be corrected S' in the technical field to which the statement to be corrected belongs is:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1});
the step 5 includes:
step 5-1, converting each Chinese character in the sentence to be corrected into pinyin by utilizing the HanLP language processing tool package, namely pinyin to be corrected; converting each Chinese character in the custom dictionary into pinyin, namely dictionary pinyin, by utilizing the HanLP language processing tool package;
step 5-2, setting a pinyin similarity threshold, and judging whether the pinyin to be corrected needs to be corrected according to a comparison result of the similarity between the pinyin to be corrected and the pinyin of the dictionary and the pinyin similarity threshold;
if the similarity between the pinyin to be corrected and the dictionary pinyin is smaller than a pinyin similarity threshold, determining that the pinyin to be corrected does not need to be corrected;
if the similarity between the pinyin to be corrected and the dictionary pinyin is greater than or equal to a pinyin similarity threshold, determining that the pinyin to be corrected needs to be corrected;
the step 6 includes:
calculating the pinyin similarity from the pinyin to be corrected and the dictionary pinyin;
taking the actual length of each word in the custom dictionary as the length of a sliding window, calculating, at the initial position of the sentence to be corrected, the similarity between the dictionary pinyin and the pinyin-to-be-corrected string in the sliding window, and traversing the dictionary pinyin corresponding to all words in the custom dictionary to calculate the similarity;
if the pinyin string similarity between the dictionary pinyin of a word in the custom dictionary and the pinyin to be corrected of the word in the sliding window is greater than or equal to the set pinyin similarity threshold, temporarily replacing the word in the sliding window with the word in the custom dictionary to obtain a temporary correction sentence;
analyzing the rationality of the temporary correction sentence according to the constructed Bi-Gram language model, namely calculating the probability of the temporary correction sentence occurring in the professional field;
if the rationality of the temporary correction sentence is greater than that of the sentence to be corrected, replacing the word in the sliding window with the word in the custom dictionary, and shifting the sliding window to the right by the length of that word;
if the rationality of the temporary correction sentence is less than or equal to that of the sentence to be corrected, not replacing the word in the sliding window, namely keeping the sentence to be corrected as it is;
the step 6 includes:
if, after all words in the custom dictionary have been traversed, the similarity between the dictionary pinyin of every word in the custom dictionary and the pinyin to be corrected is smaller than the set pinyin similarity threshold, shifting the sliding window to the right by one word; traversing all words in the custom dictionary again at the new window position to calculate pinyin string similarity and analyze sentence rationality, until the sliding window reaches the end of the sentence to be corrected and correction is finished, and finally outputting the corrected sentence; the corrected sentence is the sentence obtained after the sentence to be corrected has been corrected from beginning to end by the sliding window.
2. The method for text detection and correction based on pinyin similarity and language model of claim 1, wherein step 1 comprises: in a certain professional field, collecting more than 1000 correct instruction text sentences as training sentences according to documents or data in the certain professional field; wherein the correct instruction text sentence is a sentence conforming to a term rule in a certain professional field.
3. The method for text detection and correction based on pinyin similarity and language model of claim 1, wherein the step 4 comprises:
applying additive smoothing to the Bi-Gram language model according to the following formula, the probability of the m-th word w_m occurring in the professional field is:
P(w_m | w_{m-1}) = (C(w_{m-1} w_m) + k) / (C(w_{m-1}) + k·|V|)
wherein C(w_{m-1}) denotes the number of occurrences of the single word w_{m-1} in all training sentences, C(w_{m-1} w_m) denotes the number of occurrences of the binary word combination (w_{m-1} w_m) in all training sentences, k is a constant with 0 < k ≤ 1, and |V| denotes the number of distinct words in the word segmentation result of all training sentences;
therefore, the probability of occurrence of the statement to be corrected S' in the technical field to which it belongs is:
P(S') = P(w_1) · ∏_{i=2}^{m} (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + k·|V|).
CN202011169315.0A 2020-10-28 2020-10-28 Text detection and correction method based on pinyin similarity and language model Active CN112232055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011169315.0A CN112232055B (en) 2020-10-28 2020-10-28 Text detection and correction method based on pinyin similarity and language model


Publications (2)

Publication Number Publication Date
CN112232055A CN112232055A (en) 2021-01-15
CN112232055B true CN112232055B (en) 2023-05-02

Family

ID=74109140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011169315.0A Active CN112232055B (en) 2020-10-28 2020-10-28 Text detection and correction method based on pinyin similarity and language model

Country Status (1)

Country Link
CN (1) CN112232055B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926306B (en) * 2021-03-08 2024-01-23 北京百度网讯科技有限公司 Text error correction method, device, equipment and storage medium
CN113361238B (en) * 2021-05-21 2022-02-11 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks
CN113435186B (en) * 2021-06-18 2022-05-20 上海熙瑾信息技术有限公司 Chinese text error correction system, method, device and computer readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6167367A (en) * 1997-08-09 2000-12-26 National Tsing Hua University Method and device for automatic error detection and correction for computerized text files
JP2004264464A (en) * 2003-02-28 2004-09-24 Techno Network Shikoku Co Ltd Voice recognition error correction system using specific field dictionary
US20060293889A1 (en) * 2005-06-27 2006-12-28 Nokia Corporation Error correction for speech recognition systems
US10049099B2 (en) * 2015-04-10 2018-08-14 Facebook, Inc. Spell correction with hidden markov models on online social networks
CN109948144B (en) * 2019-01-29 2022-12-06 汕头大学 Teacher utterance intelligent processing method based on classroom teaching situation
CN111369996B (en) * 2020-02-24 2023-08-18 网经科技(苏州)有限公司 Speech recognition text error correction method in specific field
CN111326160A (en) * 2020-03-11 2020-06-23 南京奥拓电子科技有限公司 Speech recognition method, system and storage medium for correcting noise text

Also Published As

Publication number Publication date
CN112232055A (en) 2021-01-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant