CN112232055B - Text detection and correction method based on pinyin similarity and language model - Google Patents


Info

Publication number
CN112232055B
CN112232055B (application CN202011169315.0A)
Authority
CN
China
Prior art keywords
pinyin
word
corrected
sentence
dictionary
Prior art date
Legal status
Active
Application number
CN202011169315.0A
Other languages
Chinese (zh)
Other versions
CN112232055A (en)
Inventor
韩竞
李晓冬
梁木
吴蔚
王鑫鹏
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute filed Critical CETC 28 Research Institute
Priority to CN202011169315.0A priority Critical patent/CN112232055B/en
Publication of CN112232055A publication Critical patent/CN112232055A/en
Application granted granted Critical
Publication of CN112232055B publication Critical patent/CN112232055B/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/194 - Calculation of difference between files
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a text detection and correction method based on pinyin similarity and a language model, which comprises the steps of collecting a large number of correct instruction text sentences as training sentences; selecting words in the professional field from the training sentences, and constructing a custom dictionary; utilizing a HanLP language processing tool package and a custom dictionary to segment training sentences; counting the occurrence times of each word and each word combination in all training sentences in word segmentation results, and constructing a Bi-Gram language model; converting the sentence to be corrected into corresponding pinyin to be corrected, and converting the word of the custom dictionary into corresponding dictionary pinyin; and correcting the sentence to be corrected according to the pinyin similarity of the pinyin to be corrected and the dictionary pinyin and combining the sentence rationality of the sentence to be corrected, so as to obtain the corrected sentence. The invention considers the semantic information and the context of the sentence through word pinyin similarity calculation and sentence rationality analysis, is beneficial to detecting the wrong words in the sentence and improves the correction accuracy.

Description

Text detection and correction method based on pinyin similarity and language model
Technical Field
The invention relates to the technical field of text detection, in particular to a text detection and correction method based on pinyin similarity and a language model.
Background
When open-domain speech recognition is applied directly to a professional field, interference such as noise and user accents, together with missing professional vocabulary, produces errors in the recognized text and reduces its analyzability. Chinese correction technology is an important technology for automatic checking and automatic correction of Chinese sentences; it aims to improve language correctness and reduce the cost of manual checking. Most existing research on text correction targets standard open-domain texts such as newspapers, books, and periodicals. In a specific field, a general-domain speech recognition engine has a low sentence recognition rate, and named entities and professional terms in some fields cannot be accurately recognized. Research on correcting highly specialized, field-specific text remains scarce and presents a great challenge.
Some scholars are currently researching text correction after speech conversion. Wang Xingjian proposed a correction method based on an N-Gram language model that attempts correction with an N-Gram of pinyin, but it does not take the semantic information and context of sentences into account. Long Lixia et al. proposed a correction method based on instance context: core words of the field serve as a knowledge base, sentences containing the core words are found among the training sentences as instances, and the context correlation and semantic similarity between words and the instance set are calculated to detect errors; the candidate with the highest context harmony, drawn from a confusion set generated by pinyin confusion rules, is taken as the correction result. However, this method depends excessively on the instance library and cannot perform correction if no similar instance is found for the input text.
Disclosure of Invention
The invention provides a text detection and correction method based on pinyin similarity and a language model, aiming to solve the problems that existing text correction methods cannot fully consider the semantic information and context of sentences and depend excessively on an instance library, resulting in low text correction accuracy.
The invention provides a text detection and correction method based on pinyin similarity and a language model, which comprises the following steps:
step 1, collecting a large number of correct instruction text sentences as training sentences;
step 2, selecting words in the professional field from the training sentences for constructing a custom dictionary;
step 3, utilizing a HanLP language processing tool package and the custom dictionary to segment the training sentences to obtain a word segmentation result;
step 4, counting the occurrence times of each word and each word combination in all training sentences in the word segmentation result, and constructing a Bi-Gram language model for evaluating the sentence rationality of the sentence to be corrected;
step 5, converting the sentence to be corrected into corresponding pinyin to be corrected, and converting the word of the custom dictionary into corresponding dictionary pinyin;
and 6, correcting the sentence to be corrected according to the pinyin similarity of the pinyin to be corrected and the pinyin of the dictionary and combining the sentence rationality of the sentence to be corrected to obtain the corrected sentence.
Further, in one implementation, the step 1 includes: in a given professional field, collecting more than 1000 correct instruction text sentences as training sentences according to documents or data of that field; wherein a correct instruction text sentence is a sentence conforming to the term rules of the professional field.
Further, in one implementation, the step 2 includes:
calculating the occurrence frequency of each professional word in each instruction text sentence;
if the occurrence frequency of any professional word in the instruction text sentence is higher than a preset occurrence threshold, selecting the professional word for constructing a custom dictionary;
the custom dictionary takes the form of text files; and each professional word is respectively arranged in one row of the custom dictionary.
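The dictionary-construction step above can be sketched as follows. The exact frequency definition is not given in the patent; the fraction of training sentences containing the word is assumed here, and all names are illustrative.

```python
from collections import Counter

def build_custom_dictionary(sentences, candidate_words, threshold=0.10):
    """Select professional words whose sentence frequency exceeds the
    preset occurrence threshold (10% in the embodiment).  Frequency is
    taken here as the fraction of training sentences containing the word."""
    n = len(sentences)
    freq = Counter({w: sum(1 for s in sentences if w in s) / n
                    for w in candidate_words})
    return [w for w in candidate_words if freq[w] > threshold]

def save_dictionary(words, path):
    """The custom dictionary is a plain text file, one word per line."""
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")
```

The threshold argument mirrors the preset occurrence threshold, which the embodiment notes may be adjusted per scene.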
Further, in one implementation, the step 3 includes: segmenting each training sentence, through the standard tokenizer in the HanLP natural language processing toolkit, according to the words contained in the toolkit's universal dictionary and in the custom dictionary, to obtain a word segmentation result in which each training sentence is divided into a sequence of words.
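HanLP's standard tokenizer is not reproduced here; as a self-contained stand-in for dictionary-driven segmentation (not HanLP's actual algorithm), a forward maximum-matching segmenter over a merged general-plus-custom dictionary might look like:

```python
def fmm_segment(sentence, dictionary, max_word_len=6):
    """Forward maximum matching: at each position greedily take the longest
    dictionary word; fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_word_len, len(sentence) - i), 0, -1):
            if length == 1 or sentence[i:i + length] in dictionary:
                words.append(sentence[i:i + length])
                i += length
                break
    return words
```

For example, with the custom-dictionary words "雷达" and "开机", the sentence "雷达开机" splits into those two words rather than four single characters.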
Further, in one implementation, the step 4 includes:
if the word segmentation result shows that a training sentence contains n words;
the n words are denoted w_1 w_2 … w_n, i.e. the training sentence is S = (w_1 w_2 … w_n). The N-Gram language model of the training sentence represents all word combinations obtained by combining adjacent words, i.e. (w_1 w_2 … w_N), (w_2 w_3 … w_{N+1}), …, (w_{n+1-N} … w_{n-1} w_n), where N is the length of a word combination, that is, each word combination contains N adjacent words, and adjacent words are words next to each other in the word segmentation result of the training sentence; the occurrence times of each word combination and each single word in all training sentences are counted;
the probability of occurrence of the sentence to be corrected S' in the professional field is obtained, and the sentence rationality of S' is evaluated according to the following formula:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1} … w_2 w_1)
where, according to the occurrence times of single words and word combinations in all training sentences in the word segmentation result, the probability of the first word w_1 occurring in the professional field is P(w_1); the occurrence of the second word w_2 depends on the first word w_1, i.e. the probability of w_2 occurring in the professional field is P(w_2 | w_1); similarly, the occurrence of the m-th word w_m depends on the preceding words w_{m-1} … w_2 w_1, and the probability of w_m occurring in the professional field is P(w_m | w_{m-1} … w_2 w_1).
Further, in one implementation, the step 4 includes:
if the occurrence of a word in a sentence depends only on the single word immediately before it, i.e. when N takes the value 2, the word combination is called a binary word combination Bi-Gram; all the training sentences are segmented, the occurrence times of each word and each binary word combination Bi-Gram are counted, and the Bi-Gram language model is formed;
the probability of occurrence of the statement to be corrected S' in the technical field to which the statement to be corrected belongs is:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1}).
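The Bi-Gram counting and sentence-probability computation of step 4 can be sketched as below; the maximum-likelihood estimates are unsmoothed, so a sentence containing any unseen Bi-Gram scores zero.

```python
from collections import Counter

def train_bigram(segmented_sentences):
    """Count single words and binary word combinations (Bi-Grams)
    over all segmented training sentences."""
    uni, bi = Counter(), Counter()
    for words in segmented_sentences:
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi

def sentence_probability(words, uni, bi, total):
    """P(S) = P(w_1) * prod_i P(w_i | w_{i-1}), maximum-likelihood
    estimates without smoothing (zero for any unseen Bi-Gram)."""
    if not words:
        return 0.0
    p = uni[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if uni[prev] == 0:
            return 0.0
        p *= bi[(prev, cur)] / uni[prev]
    return p
```

Here `total` is the total unigram count over all training sentences, an assumed estimator for P(w_1).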
further, in one implementation, the step 4 includes:
with additive smoothing applied to the Bi-Gram language model, the probability that the m-th word w_m occurs in the professional field is:
P(w_m | w_{m-1}) = (C(w_{m-1} w_m) + k) / (C(w_{m-1}) + k|V|)
where C(w_{m-1}) represents the number of occurrences of the single word w_{m-1} in all training sentences, C(w_{m-1} w_m) represents the number of occurrences of the binary word combination (w_{m-1} w_m) in all training sentences, k is a constant with 0 < k ≤ 1, and |V| represents the number of distinct words in the word segmentation results of all training sentences;
therefore, the probability of occurrence of the sentence to be corrected S' in the professional field to which it belongs is:
P(S') = P(w_1) × ∏_{i=2…m} (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + k|V|)
the data smoothing is realized through the addition smoothing, and even if the sentence contains word combinations which do not appear in the language model, the probability of the sentence can be ensured to be not 0.
Further, in one implementation, the step 5 includes:
step 5-1, converting each Chinese character in the sentence to be corrected into pinyin by utilizing the HanLP natural language processing tool package, namely pinyin to be corrected; converting each Chinese character in the custom dictionary into pinyin, namely dictionary pinyin, by utilizing the HanLP natural language processing tool kit;
step 5-2, setting a pinyin similarity threshold, and judging whether the pinyin to be corrected needs to be corrected according to a comparison result of the similarity between the pinyin to be corrected and the dictionary pinyin and the similarity threshold;
if the similarity between the pinyin to be corrected and the dictionary pinyin is smaller than a similarity threshold, determining that the pinyin to be corrected does not need to be corrected;
and if the similarity between the pinyin to be corrected and the dictionary pinyin is greater than or equal to a similarity threshold, determining that the pinyin to be corrected needs to be corrected.
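The patent does not spell out how pinyin similarity is computed; as one plausible stand-in, a normalized matching ratio over the two pinyin strings (Python's difflib), compared against a threshold such as the 0.75 used in the embodiment:

```python
from difflib import SequenceMatcher

def pinyin_similarity(pinyin_a, pinyin_b):
    """Similarity of two pinyin strings in [0, 1].  The patent does not
    define the measure; difflib's matching ratio is assumed here."""
    return SequenceMatcher(None, pinyin_a, pinyin_b).ratio()

def needs_correction(window_pinyin, dict_pinyin, threshold=0.75):
    """At or above the threshold the window is a correction candidate;
    below it, the pinyin to be corrected is left alone."""
    return pinyin_similarity(window_pinyin, dict_pinyin) >= threshold
```

Any edit-distance-based ratio would serve the same role; the key is that identical pinyin scores 1.0 and unrelated pinyin scores near 0.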
Further, in one implementation, the step 6 includes:
calculating the pinyin similarity according to the pinyin to be corrected and the dictionary pinyin;
starting at the initial position of the sentence to be corrected, taking the actual length of each word in the custom dictionary as the length of the sliding window, calculating the similarity between that word's dictionary pinyin and the to-be-corrected pinyin string inside the sliding window, and traversing the dictionary pinyin corresponding to all words in the custom dictionary to calculate the similarity;
if the pinyin string similarity between the dictionary pinyin corresponding to a word in the custom dictionary and the to-be-corrected pinyin corresponding to the words in the sliding window is greater than or equal to the preset similarity threshold, temporarily replacing the words in the sliding window with the word from the custom dictionary to obtain a temporary correction sentence;
according to the constructed Bi-Gram language model, analyzing the rationality of the temporary correction statement, namely calculating the probability of the temporary correction statement in the professional field;
if the rationality of the temporary correction sentence is greater than that of the sentence to be corrected, the words in the sliding window are replaced with the word from the custom dictionary, and the position of the sliding window is shifted to the right by the length of that word;
if the rationality of the temporary correction sentence is smaller than or equal to that of the sentence to be corrected, the words in the sliding window are not replaced, i.e. the sentence to be corrected is kept as it is.
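The sliding-window correction loop of step 6 can be sketched as below. The pinyin conversion and the rationality score are passed in as callables (in the patent these roles are filled by HanLP and the Bi-Gram language model); the difflib similarity measure and the one-position fallback shift are assumptions.

```python
from difflib import SequenceMatcher

def correct_sentence(sentence, dictionary, to_pinyin, rationality,
                     threshold=0.75):
    """Slide a window over the sentence; for each custom-dictionary word
    whose pinyin is similar enough to the window's pinyin, keep the
    replacement only if it raises the sentence's rationality score."""
    i = 0
    while i < len(sentence):
        replaced = False
        for word in dictionary:
            window = sentence[i:i + len(word)]
            if len(window) < len(word):
                continue  # window would run past the end of the sentence
            sim = SequenceMatcher(None, to_pinyin(window),
                                  to_pinyin(word)).ratio()
            if sim >= threshold:
                # temporary correction sentence: window replaced by the word
                candidate = sentence[:i] + word + sentence[i + len(word):]
                if rationality(candidate) > rationality(sentence):
                    sentence = candidate
                    i += len(word)  # shift the window right by the word length
                    replaced = True
                    break
        if not replaced:
            i += 1  # no dictionary word improved rationality: shift right
    return sentence
```

For instance, with a toy pinyin table where 大 and 达 share the pinyin "da" and a score that favors sentences containing "雷达", the misrecognized "雷大开机" is corrected to "雷达开机" while an already-correct sentence is left unchanged.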
Further, in one implementation, the step 6 includes:
if all words in the custom dictionary have been traversed and the pinyin of every word in the custom dictionary has a similarity to the pinyin of the words of the sentence to be corrected that is smaller than the preset similarity threshold, the position of the sliding window is shifted one position to the right; at the new position, all words in the custom dictionary are traversed again to calculate the pinyin string similarity and analyze the sentence rationality, until the sliding window reaches the end of the sentence to be corrected and correction is finished, and finally the correction sentence is output; the correction sentence is the sentence obtained after the sentence to be corrected has been corrected from beginning to end through the sliding window.
According to the technical scheme, the embodiment of the invention provides a text detection and correction method based on pinyin similarity and a language model, which comprises two processes of pinyin similarity calculation and sentence rationality analysis.
In the prior art, text correction methods fail to consider the semantic information and context of sentences and depend excessively on an instance library, so text correction accuracy is low. The present method performs instruction text detection and correction based on pinyin similarity and an N-Gram model, which effectively avoids the semantic context errors caused by pinyin-only correction. Specifically, the method analyzes the pinyin similarity between words and then evaluates sentence rationality with the trained Bi-Gram language model, greatly improving the accuracy of text detection and correction compared with the prior art.
In particular, compared with the prior art, the invention has the remarkable advantages that:
(1) The text correction method provided by the invention is used as a text processing means, and can be used for correcting text of sentence results of voice recognition in the military field. The pinyin similarity calculation is used as a text similarity assessment means, and text similarity comparison can be carried out on words after voice recognition in the military field and dictionary words. The N-Gram language model obtains the probability of word combination by training a large number of sentences, and reflects the context and semantic information of the sentences.
(2) The N-Gram language model of the present invention is based on the Markov assumption that the probability of occurrence of the nth word is related to only the preceding N-1 words, irrespective of the other words. The larger the value of N, the more accurate the probability given by the language model, but the more the parameter quantity is contained, the larger the calculated quantity is. Through a large number of sentence training, a very accurate language model is constructed. By adopting the data smoothing method, the data sparseness problem of the N-Gram language model is effectively solved, so that the semantic information of sentences is more accurately analyzed, and the rationality of the sentences in the correction process is ensured.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a text detection and correction method based on Pinyin similarity and language model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a custom dictionary in a text detection and correction method based on Pinyin similarity and language model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a Bi-Gram language model in a text detection and correction method based on Pinyin similarity and language models according to an embodiment of the present invention;
fig. 4 is a schematic flow chart of a specific implementation of a text detection and correction method based on pinyin similarity and language model according to an embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
The embodiment of the invention discloses a text detection and correction method based on pinyin similarity and a language model, which is applied to correction of instruction texts in the professional field.
As shown in fig. 1, the present embodiment provides a text detection and correction method based on pinyin similarity and language model, which includes the following steps:
step 1, collecting a large number of correct instruction text sentences as training sentences;
step 2, selecting words in the professional field from the training sentences for constructing a custom dictionary;
step 3, utilizing a HanLP language processing tool package and the custom dictionary to segment the training sentences to obtain a word segmentation result;
step 4, counting the occurrence times of each word and each word combination in all training sentences in the word segmentation result, and constructing a Bi-Gram language model for evaluating the sentence rationality of the sentence to be corrected;
step 5, converting the sentence to be corrected into corresponding pinyin to be corrected, and converting the word of the custom dictionary into corresponding dictionary pinyin;
and 6, correcting the sentence to be corrected according to the pinyin similarity of the pinyin to be corrected and the pinyin of the dictionary and combining the sentence rationality of the sentence to be corrected to obtain the corrected sentence.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 1 includes: in a given professional field, collecting more than 1000 correct instruction text sentences as training sentences according to documents or data of that field; wherein a correct instruction text sentence is a sentence conforming to the term rules of the professional field.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 2 includes:
calculating the occurrence frequency of each professional word in each training sentence;
if the occurrence frequency of any professional word in the training sentence is higher than a preset occurrence threshold, selecting the professional word for constructing a custom dictionary; specifically, in this embodiment, the preset occurrence threshold is 10%, and the setting of the preset threshold may be adjusted according to an actual scene.
The custom dictionary takes the form of text files; each professional word is respectively placed in a row of the custom dictionary, as shown in fig. 2.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 3 includes: segmenting each training sentence, through the standard tokenizer in the HanLP natural language processing toolkit, according to the words contained in the toolkit's universal dictionary and in the custom dictionary, to obtain a word segmentation result in which each training sentence is divided into a sequence of words.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 4 includes:
if the word segmentation result shows that a training sentence contains n words;
the n words are denoted w_1 w_2 … w_n, i.e. the training sentence is S = (w_1 w_2 … w_n). The N-Gram language model of the training sentence represents all word combinations obtained by combining adjacent words, i.e. (w_1 w_2 … w_N), (w_2 w_3 … w_{N+1}), …, (w_{n+1-N} … w_{n-1} w_n), where N is the length of a word combination, that is, each word combination contains N adjacent words, and adjacent words are words next to each other in the word segmentation result of the training sentence; the occurrence times of each word combination and each single word in all training sentences are counted;
the probability of occurrence of the sentence to be corrected S' in the professional field is obtained, and the sentence rationality of S' is evaluated according to the following formula:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1} … w_2 w_1)
where, according to the occurrence times of single words and word combinations in all training sentences in the word segmentation result, the probability of the first word w_1 occurring in the professional field is P(w_1); the occurrence of the second word w_2 depends on the first word w_1, i.e. the probability of w_2 occurring in the professional field is P(w_2 | w_1); similarly, the occurrence of the m-th word w_m depends on the preceding words w_{m-1} … w_2 w_1, and the probability of w_m occurring in the professional field is P(w_m | w_{m-1} … w_2 w_1).
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 4 includes: to reduce algorithm complexity, the occurrence of a word in a sentence is assumed to depend only on the single word immediately before it, i.e. N takes the value 2; the word combination is then called a binary word combination Bi-Gram. All training sentences are segmented, the occurrence times of each word and each binary word combination Bi-Gram are counted, and the Bi-Gram language model is formed; specifically, as shown in fig. 3, the Bi-Gram language model is a statistical result of word combinations.
The probability of occurrence of the statement to be corrected S' in the technical field to which the statement to be corrected belongs is:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1}).
in the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 4 includes: because the Bi-Gram language model has a data sparsity problem, that is, if many binary word combinations of the sentence to be corrected do not appear in the training sentences, the probability P(S') of the sentence to be corrected S' occurring in the professional field to which it belongs will be 0, which severely limits the application range of the Bi-Gram language model. Therefore, additive smoothing is applied to the Bi-Gram language model to enlarge its application range.
With additive smoothing applied to the Bi-Gram language model, the probability that the m-th word w_m occurs in the professional field is:
P(w_m | w_{m-1}) = (C(w_{m-1} w_m) + k) / (C(w_{m-1}) + k|V|)
where C(w_{m-1}) represents the number of occurrences of the single word w_{m-1} in all training sentences, C(w_{m-1} w_m) represents the number of occurrences of the binary word combination (w_{m-1} w_m) in all training sentences, k is a constant with 0 < k ≤ 1, and |V| represents the number of distinct words in the word segmentation results of all training sentences;
therefore, the probability of occurrence of the sentence to be corrected S' in the professional field to which it belongs is:
P(S') = P(w_1) × ∏_{i=2…m} (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + k|V|)
additive smoothing thus realizes data smoothing: even if a sentence contains word combinations that never appear in the language model, its probability is guaranteed to be non-zero.
In the text detection and correction method based on pinyin similarity and language model of the present embodiment, the step 5 includes:
step 5-1, converting each Chinese character in the sentence to be corrected into pinyin by utilizing the HanLP natural language processing tool package, namely pinyin to be corrected; converting each Chinese character in the custom dictionary into pinyin, namely dictionary pinyin, by utilizing the HanLP natural language processing tool kit;
step 5-2, setting a pinyin similarity threshold, and judging whether the pinyin to be corrected needs to be corrected according to a comparison result of the similarity between the pinyin to be corrected and the dictionary pinyin and the similarity threshold;
if the similarity between the pinyin to be corrected and the dictionary pinyin is smaller than a similarity threshold, determining that the pinyin to be corrected does not need to be corrected;
and if the similarity between the pinyin to be corrected and the dictionary pinyin is greater than or equal to the similarity threshold, determining that the pinyin to be corrected needs to be corrected. Specifically, in this embodiment, the similarity threshold is 0.75, and it may be adjusted appropriately according to the actual scene.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 6 includes:
calculating the pinyin similarity according to the pinyin to be corrected and the dictionary pinyin;
starting at the initial position of the sentence to be corrected, taking the actual length of each word in the custom dictionary as the length of the sliding window, calculating the similarity between that word's dictionary pinyin and the to-be-corrected pinyin string inside the sliding window, and traversing the dictionary pinyin corresponding to all words in the custom dictionary to calculate the similarity;
if the pinyin string similarity between the dictionary pinyin corresponding to a word in the custom dictionary and the to-be-corrected pinyin corresponding to the words in the sliding window is greater than or equal to the preset similarity threshold, temporarily replacing the words in the sliding window with the word from the custom dictionary to obtain a temporary correction sentence;
according to the constructed Bi-Gram language model, analyzing the rationality of the temporary correction statement, namely calculating the probability of the temporary correction statement in the professional field;
if the rationality of the temporary correction sentence is greater than that of the sentence to be corrected, the words in the sliding window are replaced with the word from the custom dictionary, and the position of the sliding window is shifted to the right by the length of that word;
if the rationality of the temporary correction sentence is smaller than or equal to that of the sentence to be corrected, the words in the sliding window are not replaced, i.e. the sentence to be corrected is kept as it is.
In the text detection and correction method based on pinyin similarity and language model of the embodiment, the step 6 includes:
if all words in the custom dictionary have been traversed and the pinyin of every word in the custom dictionary has a similarity to the pinyin of the words of the sentence to be corrected that is smaller than the preset similarity threshold, the position of the sliding window is shifted one position to the right; at the new position, all words in the custom dictionary are traversed again to calculate the pinyin string similarity and analyze the sentence rationality, until the sliding window reaches the end of the sentence to be corrected and correction is finished, and finally the correction sentence is output; the correction sentence is the sentence obtained after the sentence to be corrected has been corrected from beginning to end through the sliding window.
Specifically, the text detection and correction method based on pinyin similarity and language model provides the following embodiments:
According to the step 1, in a certain professional field, a large number of correct instruction text sentences are collected from documents or data in that field. More than 1000 correct instruction text sentences are collected. Common term collocation rules in the military field are a) subject + predicate, b) predicate + object, c) subject + predicate + object, and d) predicate + object + object, as shown in Table 1.
TABLE 1 statement component Table
[Table 1 is rendered as an image (GDA0004118891830000111) in the original publication.]
According to the step 2, professional words in the military field are generally unique to that field. With reference to fig. 2, words in the professional field are selected from the training sentences to construct a custom dictionary; fig. 2 shows the custom dictionary, in which more than 90 words are selected;
according to the step 3, the training sentences are segmented using the HanLP language processing toolkit and the custom dictionary, and a word segmentation result is obtained;
According to the step 4, a Bi-Gram language model is constructed from the word segmentation result, in combination with FIG. 3; FIG. 3 shows a Bi-Gram language model trained on more than 12000 training sentences. Let the training sentence S contain n words, i.e. S = (w_1 w_2 … w_n). The binary word combinations (Bi-Grams) of the sentence are the word combinations obtained by dividing the word segmentation result of the original sentence into spans of word length 2, namely all combinations of two adjacent words in the sentence.
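The Bi-Gram counting described in step 4 can be sketched as follows; the toy corpus below is a hypothetical stand-in for the segmented training sentences:

```python
from collections import Counter

def count_ngrams(segmented_sentences):
    """Count unigram and adjacent-bigram occurrences over pre-segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in segmented_sentences:
        unigrams.update(words)                 # single-word counts C(w)
        bigrams.update(zip(words, words[1:]))  # adjacent-pair counts C(w_prev, w)
    return unigrams, bigrams

# hypothetical segmented training corpus
corpus = [["situation", "map", "left"], ["situation", "map", "zoom"]]
uni, bi = count_ngrams(corpus)
```

These two counter tables, together with the vocabulary size |V|, are all the statistics the smoothed Bi-Gram model needs.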
The probability of occurrence of the statement to be corrected S' in the technical field to which it belongs is:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1}).
The method applies additive smoothing to the Bi-Gram language model: a constant k (0 < k ≤ 1) is added to the occurrence count of each binary word combination, as follows:
P(w_m | w_{m-1}) = (C(w_{m-1} w_m) + k) / (C(w_{m-1}) + k·|V|)
wherein C(w_{m-1}) denotes the number of occurrences of the single word w_{m-1} in all training sentences, C(w_{m-1} w_m) denotes the number of occurrences of the binary word combination (w_{m-1} w_m) in all training sentences, k is a constant with 0 < k ≤ 1, and |V| denotes the number of distinct words in the word segmentation result of all training sentences;
therefore, the probability of occurrence of the statement to be corrected S' in the technical field to which it belongs is:
P(S') = P(w_1) · ∏_{i=2}^{m} (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + k·|V|)
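A minimal sketch of computing the smoothed sentence probability in Python. The treatment of the first-word probability P(w_1) here (smoothed unigram frequency over the total word count) is an assumption, since the text does not spell it out:

```python
def smoothed_bigram_prob(w_prev, w, unigrams, bigrams, vocab_size, k=1.0):
    # P(w | w_prev) = (C(w_prev w) + k) / (C(w_prev) + k * |V|)
    return (bigrams.get((w_prev, w), 0) + k) / (unigrams.get(w_prev, 0) + k * vocab_size)

def sentence_prob(words, unigrams, bigrams, total_words, vocab_size, k=1.0):
    # P(w_1) estimated from smoothed unigram frequency -- an assumption
    p = (unigrams.get(words[0], 0) + k) / (total_words + k * vocab_size)
    for prev, cur in zip(words, words[1:]):
        p *= smoothed_bigram_prob(prev, cur, unigrams, bigrams, vocab_size, k)
    return p
```

Because every factor is strictly positive for k > 0, a sentence containing an unseen word pair still receives a small nonzero probability and remains comparable with its correction candidates.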
According to the step 5, with reference to fig. 4: first, each Chinese character in the sentence to be corrected and in the custom dictionary is converted into pinyin using HanLP, i.e. the pinyin of "tai drawing moves to the left", "situation map", "positioning" and "moving" are "tai shi tu xiang zuo yi dong", "tai shitu", "ding wei" and "yi dong" respectively. Second, assume that exactly one letter in a simple two-character pinyin string is wrong, i.e. only one letter in the pinyin "ab cd" is wrong; the similarity between the wrong pinyin and the correct pinyin is then 0.75. In this embodiment the pinyin similarity threshold is therefore initially set to 0.75; the threshold may be adjusted appropriately for the actual scenario.
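One plausible implementation of the pinyin string similarity, consistent with the 0.75 figure above (one wrong letter in a four-letter pinyin yields similarity 0.75), is one minus the normalized edit distance. The exact metric used by the patent is not stated, so this is an assumption:

```python
def levenshtein(a, b):
    # classic edit distance via dynamic programming, one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def pinyin_similarity(p1, p2):
    # 1 - normalized edit distance; "abcd" vs "abed" (one wrong letter) -> 0.75
    longest = max(len(p1), len(p2))
    return (1 - levenshtein(p1, p2) / longest) if longest else 1.0
```

With this definition, identical pinyin strings score 1.0 and a single wrong letter in a four-letter string scores exactly the 0.75 threshold used in the embodiment.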
According to the step 6: first, at the initial position of the sentence to be corrected, the lengths of the dictionary words "situation map", "positioning" and "moving" (3, 2 and 2 respectively) are used in turn as the sliding-window length, and the pinyin string similarity between each dictionary word and the word in the window is calculated, i.e. the similarities between "tai shitu" and "tai shi", between "ding wei" and "tai shi", and between "yi dong" and "tai shi" are calculated respectively;
If the pinyin similarity between a dictionary word and the word in the window of the sentence to be corrected is greater than or equal to the preset similarity threshold, the word in the window is replaced by the dictionary word, continuing until all words in the dictionary have been traversed. As can be seen from fig. 4, the word satisfying the similarity threshold condition is "situation map", and the sentence to be corrected is temporarily replaced by "situation map is moved leftwards"; the probability of the sentence "situation map is moved leftwards" is then calculated according to the constructed Bi-Gram language model. Since this probability is larger than that of the original sentence, the word in the window is replaced: the sentence to be corrected becomes "the situation map is moved leftwards", and the sliding window is moved rightwards by the length of the word "situation map", i.e. by 3 characters in this embodiment;
Next, for the sentence to be corrected "situation map moves to the left", the similarities between "tai shitu" and "xiang zuo yi", between "ding wei" and "xiang zuo", and between "yi dong" and "xiang zuo" are calculated respectively. After all dictionary words have been traversed, the window position is shifted right by one step, i.e. by 1 character; the dictionary words are traversed again at the new window position to calculate pinyin string similarity. The pinyin similarity threshold is still not met, so the window position continues to shift right by 1 character; at this point the sentence to be corrected is still "situation map is moved to the left".
At this point, the similarities between "ding wei" and "yi dong" and between "yi dong" and "yi dong" are calculated respectively; the word satisfying the similarity threshold condition is "moving", and the sentence to be corrected is temporarily replaced by "situation map moving leftwards". The probability of the sentence "situation map moving leftwards" is then calculated according to the Bi-Gram language model. Since this probability is larger than that of the original sentence, the word in the window is replaced, and the final corrected sentence is "situation map moving leftwards".
Table 2 statistics of correction accuracy using the present method
[Table 2 is rendered as an image (GDA0004118891830000131) in the original publication.]
According to the technical scheme, the embodiment of the invention provides a text detection and correction method based on pinyin similarity and a language model, which comprises two processes: pinyin similarity calculation and sentence rationality analysis.
In the prior art, text correction methods cannot take into account the semantic information and context of sentences and rely excessively on an instance library, so text correction accuracy is low. With the present method, instruction text detection and correction are carried out based on pinyin similarity and an N-Gram model, which effectively avoids the semantic context errors caused by pinyin-only correction. Specifically, the method analyzes the pinyin similarity between words and evaluates sentence rationality with the trained Bi-Gram language model, greatly improving the accuracy of text detection and correction compared with the prior art.
In particular, compared with the prior art, the invention has the remarkable advantages that:
(1) The text correction method provided by the invention, as a text processing means, can correct the sentence results of speech recognition in the military field. Pinyin similarity calculation, as a text similarity assessment means, allows words produced by speech recognition in the military field to be compared with dictionary words. The N-Gram language model obtains the probability of word combinations by training on a large number of sentences and thus reflects the context and semantic information of sentences.
(2) The N-Gram language model of the present invention is based on the Markov assumption that the probability of occurrence of the N-th word depends only on the preceding N-1 words and is independent of the other words. The larger the value of N, the more accurate the probability given by the language model, but the more parameters it contains and the larger the computational cost. Training on a large number of sentences yields an accurate language model. The data smoothing method effectively alleviates the data sparsity problem of the N-Gram language model, so that the semantic information of sentences is analyzed more accurately and the rationality of sentences during correction is ensured.
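The data-sparsity point can be illustrated numerically: a maximum-likelihood bigram estimate assigns probability zero to any unseen word pair and therefore zeroes the whole sentence probability, while the add-k estimate stays positive (the counts below are illustrative):

```python
def mle_prob(c_bigram, c_prev):
    # maximum-likelihood estimate: zero for any unseen bigram
    return c_bigram / c_prev if c_prev else 0.0

def add_k_prob(c_bigram, c_prev, k, vocab_size):
    # add-k smoothed estimate: strictly positive for k > 0
    return (c_bigram + k) / (c_prev + k * vocab_size)

# an unseen bigram (count 0) after a word seen 10 times, with |V| = 100 and k = 1
unsmoothed = mle_prob(0, 10)          # 0.0 -- kills the sentence probability
smoothed = add_k_prob(0, 10, 1.0, 100)  # small but nonzero
```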
In a specific implementation, the present invention further provides a computer storage medium, where the computer storage medium may store a program, and the program, when executed, may perform some or all of the steps of each embodiment of the text detection and correction method based on pinyin similarity and language model provided by the present invention. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in the embodiments, or in some parts of the embodiments, of the present invention.
The same or similar parts between the various embodiments in this specification are referred to each other. The embodiments of the present invention described above do not limit the scope of the present invention.

Claims (3)

1. A text detection and correction method based on pinyin similarity and language model is characterized by comprising the following steps:
step 1, collecting a large number of correct instruction text sentences as training sentences;
step 2, selecting words in the professional field from the training sentences for constructing a custom dictionary;
step 3, utilizing a HanLP language processing tool package and the custom dictionary to segment the training sentences to obtain a word segmentation result;
step 4, counting the occurrence times of each word and each word combination in all training sentences in the word segmentation result, and constructing a Bi-Gram language model for evaluating the sentence rationality of the sentence to be corrected;
step 5, converting the sentence to be corrected into corresponding pinyin to be corrected, and converting the word of the custom dictionary into corresponding dictionary pinyin;
step 6, correcting the sentence to be corrected according to the pinyin similarity of the pinyin to be corrected and the pinyin of the dictionary and combining the sentence rationality of the sentence to be corrected to obtain a corrected sentence;
the step 2 includes:
calculating the occurrence frequency of each professional word in each training sentence;
if the occurrence frequency of any professional word in the training sentence is higher than a preset occurrence threshold, selecting the professional word for constructing a custom dictionary;
the custom dictionary takes the form of a text file; each professional word occupies one line of the custom dictionary;
the step 3 includes: through a standard word segmentation device in the HanLP language processing tool kit, according to words contained in a universal dictionary and a custom dictionary in the HanLP language processing tool kit, segmenting each training sentence to obtain a word segmentation result, wherein the word segmentation result is that each training sentence is divided into a plurality of word combinations;
the step 4 includes:
assuming the word segmentation result is that the training sentence after word segmentation contains n words;
the n words are w_1, w_2, …, w_n, i.e. the training sentence S = (w_1 w_2 … w_n); the N-Gram language model of the training sentence represents all word combinations obtained by combining adjacent words, i.e. (w_1 w_2 … w_N), (w_2 w_3 … w_{N+1}), …, (w_{n+1-N} … w_{n-1} w_n), wherein N denotes the length of a word combination, that is, each word combination contains N adjacent words, adjacent words being words next to each other in the word segmentation result of the training sentence; the number of occurrences of each word combination and each word in all training sentences is counted;
obtaining the probability of occurrence of the sentence S 'to be corrected in the technical field, and evaluating the sentence rationality of the sentence S' to be corrected according to the following formula:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1} … w_2 w_1)
wherein, from the occurrence counts of single words and word combinations over all training sentences in the word segmentation result, the probability of the first word w_1 occurring in the professional field is P(w_1); the occurrence of the second word w_2 depends on the first word w_1, i.e. the probability of the second word w_2 occurring in the professional field is P(w_2 | w_1); similarly, the occurrence of the m-th word w_m depends on the preceding m-1 words w_{m-1} … w_2 w_1, and the probability of the m-th word w_m occurring in the professional field is P(w_m | w_{m-1} … w_2 w_1);
The step 4 includes:
if the occurrence of a word in a sentence depends only on the single word immediately preceding it, namely when the value of N is 2, the word combination is called a binary word combination Bi-Gram; all training sentences are segmented, the occurrence counts of each word and each binary word combination Bi-Gram are collected, and these counts form the Bi-Gram language model;
the probability of occurrence of the statement to be corrected S' in the technical field to which the statement to be corrected belongs is:
P(S') = P(w_1 w_2 … w_m) = P(w_1) P(w_2 | w_1) … P(w_m | w_{m-1});
the step 5 includes:
step 5-1, converting each Chinese character in the sentence to be corrected into pinyin by utilizing the HanLP language processing tool package, namely pinyin to be corrected; converting each Chinese character in the custom dictionary into pinyin, namely dictionary pinyin, by utilizing the HanLP language processing tool package;
step 5-2, setting a pinyin similarity threshold, and judging whether the pinyin to be corrected needs to be corrected according to a comparison result of the similarity between the pinyin to be corrected and the pinyin of the dictionary and the pinyin similarity threshold;
if the similarity between the pinyin to be corrected and the dictionary pinyin is smaller than a pinyin similarity threshold, determining that the pinyin to be corrected does not need to be corrected;
if the similarity between the pinyin to be corrected and the dictionary pinyin is greater than or equal to a pinyin similarity threshold, determining that the pinyin to be corrected needs to be corrected;
the step 6 includes:
calculating the pinyin similarity from the pinyin to be corrected and the dictionary pinyin;
taking the actual length of each word in the custom dictionary as the length of a sliding window, calculating, at the initial position of the sentence to be corrected, the similarity between the dictionary pinyin and the pinyin-to-be-corrected string in the sliding window, and traversing the dictionary pinyin corresponding to all words in the custom dictionary to calculate the similarity;
if the pinyin string similarity between the dictionary pinyin of a word in the custom dictionary and the pinyin to be corrected of the word in the sliding window is greater than or equal to the set pinyin similarity threshold, temporarily replacing the word in the sliding window with the word in the custom dictionary to obtain a temporary correction sentence;
analyzing the rationality of the temporary correction sentence according to the constructed Bi-Gram language model, namely calculating the probability of the temporary correction sentence occurring in the professional field;
if the rationality of the temporary correction sentence is greater than that of the sentence to be corrected, replacing the word in the sliding window with the word in the custom dictionary, and shifting the sliding window to the right by the length of that word;
if the rationality of the temporary correction sentence is less than or equal to that of the sentence to be corrected, not replacing the word in the sliding window, namely keeping the sentence to be corrected as it is;
the step 6 includes:
if, after all words in the custom dictionary have been traversed, the similarity between the dictionary pinyin of every word in the custom dictionary and the pinyin to be corrected is smaller than the set pinyin similarity threshold, shifting the sliding window to the right by one word; traversing all words in the custom dictionary again at the new window position to calculate pinyin string similarity and analyze sentence rationality, until the sliding window reaches the end of the sentence to be corrected and correction is finished, and finally outputting the corrected sentence; the corrected sentence is the sentence obtained after the sentence to be corrected has been corrected from beginning to end by the sliding window.
2. The method for text detection and correction based on pinyin similarity and language model of claim 1, wherein step 1 comprises: in a certain professional field, collecting more than 1000 correct instruction text sentences as training sentences according to documents or data in the certain professional field; wherein the correct instruction text sentence is a sentence conforming to a term rule in a certain professional field.
3. The method for text detection and correction based on pinyin similarity and language model of claim 1, wherein the step 4 comprises:
applying additive smoothing to the Bi-Gram language model according to the following formula, the probability of the m-th word w_m occurring in the professional field is:
P(w_m | w_{m-1}) = (C(w_{m-1} w_m) + k) / (C(w_{m-1}) + k·|V|)
wherein C(w_{m-1}) denotes the number of occurrences of the single word w_{m-1} in all training sentences, C(w_{m-1} w_m) denotes the number of occurrences of the binary word combination (w_{m-1} w_m) in all training sentences, k is a constant with 0 < k ≤ 1, and |V| denotes the number of distinct words in the word segmentation result of all training sentences;
therefore, the probability of occurrence of the statement to be corrected S' in the technical field to which it belongs is:
P(S') = P(w_1) · ∏_{i=2}^{m} (C(w_{i-1} w_i) + k) / (C(w_{i-1}) + k·|V|).
CN202011169315.0A 2020-10-28 2020-10-28 Text detection and correction method based on pinyin similarity and language model Active CN112232055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011169315.0A CN112232055B (en) 2020-10-28 2020-10-28 Text detection and correction method based on pinyin similarity and language model


Publications (2)

Publication Number Publication Date
CN112232055A CN112232055A (en) 2021-01-15
CN112232055B true CN112232055B (en) 2023-05-02

Family

ID=74109140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011169315.0A Active CN112232055B (en) 2020-10-28 2020-10-28 Text detection and correction method based on pinyin similarity and language model

Country Status (1)

Country Link
CN (1) CN112232055B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112926306B (en) * 2021-03-08 2024-01-23 北京百度网讯科技有限公司 Text error correction method, device, equipment and storage medium
CN113361238B (en) * 2021-05-21 2022-02-11 北京语言大学 Method and device for automatically proposing question by recombining question types with language blocks
CN113435186B (en) * 2021-06-18 2022-05-20 上海熙瑾信息技术有限公司 Chinese text error correction system, method, device and computer readable storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6167367A (en) * 1997-08-09 2000-12-26 National Tsing Hua University Method and device for automatic error detection and correction for computerized text files
JP2004264464A (en) * 2003-02-28 2004-09-24 Techno Network Shikoku Co Ltd Voice recognition error correction system using specific field dictionary
US20060293889A1 (en) * 2005-06-27 2006-12-28 Nokia Corporation Error correction for speech recognition systems
US10049099B2 (en) * 2015-04-10 2018-08-14 Facebook, Inc. Spell correction with hidden markov models on online social networks
CN109948144B (en) * 2019-01-29 2022-12-06 汕头大学 Teacher utterance intelligent processing method based on classroom teaching situation
CN111369996B (en) * 2020-02-24 2023-08-18 网经科技(苏州)有限公司 Speech recognition text error correction method in specific field
CN111326160A (en) * 2020-03-11 2020-06-23 南京奥拓电子科技有限公司 Speech recognition method, system and storage medium for correcting noise text

Also Published As

Publication number Publication date
CN112232055A (en) 2021-01-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant