CN109101482B - Positioning method for text form near word error - Google Patents

Positioning method for text form near word error

Info

Publication number
CN109101482B
CN109101482B (application CN201810709763.1A)
Authority
CN
China
Prior art keywords
word
character
words
characters
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810709763.1A
Other languages
Chinese (zh)
Other versions
CN109101482A (en)
Inventor
邵玉斌
王林坪
龙华
杜庆治
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201810709763.1A priority Critical patent/CN109101482B/en
Publication of CN109101482A publication Critical patent/CN109101482A/en
Application granted granted Critical
Publication of CN109101482B publication Critical patent/CN109101482B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for locating shape-similar character errors in text, and belongs to the technical field of natural language processing. First, a long sentence is divided into several short sentences. For each character of each short sentence, the corresponding shape-similar characters are found in a Chinese shape-similar character library and, together with the original character, form a candidate character vector; uncommon characters are removed from the vector using a common-character library, and the candidate vectors of all characters form the candidate character matrix of the short sentence. Next, adjacent vectors in each candidate matrix are combined into words; correctly combined words are added to a word set, and the stop words among the characters that form no word are added to a stop-word set. The characters at the junction of two adjacent short sentences are also extracted and combined, and any resulting word is added to the word set. Finally, the words in the word set and the stop-word set are compared with the original text and removed from it; the remaining characters mark the positions where erroneous characters exist.

Description

Positioning method for text form near word error
Technical Field
The invention relates to a method for locating shape-similar character errors in text, and belongs to the technical field of natural language processing.
Background
At present, owing to the application of OCR text recognition, characters are often misrecognized as other characters when paper text is converted into computer text, and most such misrecognized characters are visually similar to the originals. In large-scale text proofreading, quickly finding the positions of misrecognized characters in the text is a precondition for correction.
Locating error positions in text through contextual connection strength with an N-gram model is a common approach to text error detection and correction. Word segmentation is a prerequisite for using an N-gram, and segmentation accuracy plays a decisive role in detection accuracy; moreover, segmentation and probability calculation consume considerable time, so both the accuracy and the speed of error detection are low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for locating shape-similar character errors in text, which avoids the speed and accuracy problems that word segmentation introduces into text error detection, saves the time consumed by segmentation and probability calculation, quickly locates error positions in a text, and lays the groundwork for subsequent proofreading.
The technical scheme of the invention is as follows: a method for locating shape-similar character errors in text. First, a long sentence is divided into several short sentences. For each character of each short sentence, the corresponding shape-similar characters are found in a Chinese shape-similar character library and, together with the original character, form a candidate character vector; uncommon characters are removed from the vector using a common-character library, and the candidate vectors of all characters form the candidate character matrix of the short sentence. Next, adjacent vectors in each candidate matrix are combined into words; correctly combined words are added to a word set, and the stop words among the characters that form no word are added to a stop-word set. The characters at the junction of two adjacent short sentences are also extracted and combined, and any resulting word is added to the word set. Finally, the words in the word set and the stop-word set are compared with the original text and removed from it; the remaining characters mark the positions where erroneous characters exist.
The method comprises the following specific steps:
the first step is as follows: establishing a database comprising a font near word stock X, a language database Y, a common word stock Q and a stop word stock T;
the second step is that: selecting a to-be-processed sample sentence A;
the third step: preprocess the sentence A by removing non-character characters to obtain a new character string B = c1c2…cn, where n is the length of B;
the fourth step: with segment length m, divide the character string B into g = ⌈n/m⌉ short character strings, where ⌈n/m⌉ denotes the smallest integer not less than n/m, and combine the segments into a short-string matrix L = [L1 L2 … Lg], with length(L1) = … = length(L(g-1)) = m; when m divides n, length(Lg) = m as well, and otherwise the last string Lg has length equal to the remainder of n divided by m; thus L1 = c1c2…cm, L2 = c(m+1)c(m+2)…c(2m), and so on;
The fifth step: converting the short string matrix L ═ L1L2…Li]In each short sentence LiFinding out the corresponding shape near character in the shape near character library X, and eliminating the unusual characters in the shape near characters by using the common character library Q to obtain LiThe candidate word vector matrix of (2);
and a sixth step: arrange and combine adjacent vectors in the Li matrix to form a series of character combinations, judge whether each combination belongs to the corpus Y, and remove those that do not, obtaining the set of all words w = {w1, w2, … wn}; if a vector cannot be combined with an adjacent vector into a word, or the vector has length 1, compare its characters with the stop-word library T, remove those that are not stop words, and obtain the stop-word set d = {d1, d2, … dn};
The seventh step: extract the last character of Li and the first character of L(i+1), find the shape-similar characters of both, and if they can be combined into a word, add the word to the set w;
eighth step: compare each word in the word set w = {w1, w2, … wn} and each stop word in the stop-word set d = {d1, d2, … dn} with the original sentence B; if a word exists in B, remove it from B; the characters remaining in B are the positions of the erroneous characters in the sentence.
In the third step, the processed text B is the character string with all punctuation removed.
In the fourth step, the segmentation length m is any number smaller than the sentence length n.
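The segmentation of the fourth step can be illustrated with a short Python sketch (an illustration only; the function name and sample string are ours, not the patent's):

```python
from math import ceil

def split_into_segments(text, m):
    """Split a string B of length n into g = ceil(n/m) short strings.

    Every segment except possibly the last has length m; when m does not
    divide n, the last segment holds the remainder.
    """
    return [text[i:i + m] for i in range(0, len(text), m)]

b = "abcdefghijk"                      # stands in for the cleaned string B, n = 11
segments = split_into_segments(b, 5)   # ["abcde", "fghij", "k"]
assert len(segments) == ceil(len(b) / 5)   # g = 3
```

With m = 5 and n = 11 this gives g = 3 segments whose last segment has length 1, the same shape as the worked example later in the description.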
In the fifth step, the length of each candidate character vector of Li is determined by the number of shape-similar characters remaining after the uncommon ones are removed.
The invention has the beneficial effects that: the method avoids the speed problem and the segmentation-accuracy problem that word segmentation causes in text error detection, saves the time consumed by N-gram segmentation and probability calculation, and locates errors in text more quickly.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a flow chart of a fourth step of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings and the following detailed description.
Example 1: as shown in figs. 1-2, a method for locating shape-similar character errors in text. First, a long sentence is divided into g short sentences of length m. For each character of each short sentence, the corresponding shape-similar characters are found in a Chinese shape-similar character library and, together with the original character, form a candidate character vector; uncommon characters are removed from the vector using a common-character library, and the candidate vectors of all characters form the candidate character matrix of the short sentence. Next, adjacent vectors in each candidate matrix are combined into words; correctly combined words are added to the word set w, the vectors that cannot be combined into words are extracted, and the stop words among their characters are added to the stop-word set d. The characters at the junction of two adjacent short sentences are also extracted and combined, and any resulting word is added to w. Finally, the words in w and d are compared with the original text and removed from it; the remaining characters mark the positions where erroneous characters exist.
The steps are as follows:
the first step is as follows: and establishing a database which comprises a font near word stock X, a corpus Y, a frequently used word stock Q and a disabled word stock T.
The second step is that: and selecting a to-be-processed sample sentence A.
The third step: preprocess the sentence A by removing non-character characters to obtain a new character string B = c1c2…cn, where n is the length of B.
The fourth step: with segment length m, divide the character string B into g = ⌈n/m⌉ short character strings, where ⌈n/m⌉ denotes the smallest integer not less than n/m, and combine the segments into a short-string matrix L = [L1 L2 … Lg], with length(L1) = … = length(L(g-1)) = m; when m divides n, length(Lg) = m as well, and otherwise the last string Lg has length equal to the remainder of n divided by m; thus L1 = c1c2…cm, L2 = c(m+1)c(m+2)…c(2m), and so on.
The fifth step: for each short sentence Li in the short-string matrix L = [L1 L2 … Lg], find the corresponding shape-similar characters in the shape-similar character library X, and remove the uncommon ones using the common-character library Q, obtaining the candidate character vector matrix of Li. For example, if the shape-similar character vector of c1 is [c1 c11 c12 … c1j], then the candidate matrix is L1 = [c1 c11 c12 … c1j] … [cm cm1 cm2 … cmk].
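A minimal Python sketch of this candidate-matrix construction, using toy stand-ins for the shape-similar library X and common-character library Q (the dictionary contents here are our assumptions, not the patent's libraries):

```python
# Toy stand-ins for the shape-similar character library X and the
# common-character library Q; both are invented for illustration.
X = {"日": ["曰", "目", "白"]}
Q = {"日", "曰", "目"}          # "白" plays the role of an uncommon character

def candidate_vector(ch, near=X, common=Q):
    # The original character followed by its shape-similar characters,
    # with characters outside the common-character set removed.
    return [ch] + [c for c in near.get(ch, []) if c in common]

def candidate_matrix(segment):
    # One candidate vector per character of the short string Li.
    return [candidate_vector(ch) for ch in segment]

candidate_vector("日")   # ["日", "曰", "目"]  ("白" is filtered out)
```

A character with no entry in X yields a vector of length 1, which is exactly the case the sixth step treats as a stop-word candidate.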
And a sixth step: arrange and combine adjacent vectors in the Li matrix to form a series of character combinations, judge whether each combination belongs to the corpus Y, and remove those that do not, obtaining the set of all words w = {w1, w2, … wn}; if a vector cannot be combined with an adjacent vector into a word, or the vector has length 1, compare its characters with the stop-word library T, remove those that are not stop words, and obtain the stop-word set d = {d1, d2, … dn}.
The seventh step: extract the last character of Li and the first character of L(i+1), find the shape-similar characters of both, judge whether they can be combined into a word, and if so add the word to the set w.
Eighth step: compare each word in the word set w = {w1, w2, … wn} and each stop word in the stop-word set d = {d1, d2, … dn} with the original sentence B; if a word exists in B, remove it from B; the characters remaining in B are the positions of the erroneous characters in the sentence.
In the first step, the shape-similar character library X contains the shape-similar characters of all Chinese characters; the corpus Y is a corpus that has been word-segmented and statistically processed; the common-character library Q consists of the first-level and second-level common-character lists; stop words are function words without substantive meaning, and the stop-word library T contains such words.
In the second step, the input sentence a may be a long sentence or a short sentence.
In the third step, the processed text B is the character string with all punctuation removed.
In the fourth step, the segmentation length m may be any number smaller than the sentence length n.
In the fifth step, the length of each candidate character vector of Li depends on the number of shape-similar characters remaining after the uncommon ones are removed, so the vector lengths are not necessarily equal.
In the sixth step, adjacent vectors in the Li matrix are combined as follows: for two adjacent vectors a = [a a1] and b = [b b1] in Li, the combined result is {ab, ab1, a1b, a1b1}.
In the sixth step, "cannot be combined into a word" means that the combined result {ab, ab1, a1b, a1b1} contains no correct word; in that case, judge whether a and b contain stop words, and if so, add them to the stop-word set d.
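The enumeration {ab, ab1, a1b, a1b1} is a Cartesian product of the two adjacent vectors filtered against the corpus; a Python sketch (the corpus contents are invented for illustration):

```python
from itertools import product

def combine_adjacent(v1, v2, corpus):
    # Enumerate every pairwise combination of two adjacent candidate
    # vectors and keep only those that appear in the corpus Y.
    return [a + b for a, b in product(v1, v2) if a + b in corpus]

corpus_y = {"简直", "不敢"}                        # hypothetical corpus excerpt
combine_adjacent(["简", "筒"], ["直"], corpus_y)   # ["简直"]
```

An empty result corresponds to the "cannot be combined into a word" case above, in which the individual characters are then checked against the stop-word library.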
In the seventh step, the last character of Li and the first character of L(i+1) form the junction of the divided short strings. For example, with L1 = c1c2…cm and L2 = c(m+1)c(m+2)…c(2m), the two characters at the junction are {cm, c(m+1)}; the shape-similar characters of the two are combined, and if a word exists, it is added to the word set w.
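The seventh-step junction handling might be sketched as follows (the dictionary and corpus contents are again assumptions, chosen to mirror the worked example below):

```python
def junction_words(seg_a, seg_b, near, corpus):
    # Cross-combine the candidate characters of the last character of Li
    # and the first character of L(i+1); keep combinations that are words.
    left = [seg_a[-1]] + near.get(seg_a[-1], [])
    right = [seg_b[0]] + near.get(seg_b[0], [])
    return [a + b for a in left for b in right if a + b in corpus]

near = {"晴": ["睛"]}                              # hypothetical shape-similar entry
corpus = {"眼睛"}                                  # hypothetical corpus excerpt
junction_words("信自己的眼", "晴", near, corpus)   # ["眼睛"]
```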
Example 2: a method for positioning text form near word errors comprises the following specific steps:
step1, establishing a database which comprises a shape near word stock X, a corpus Y, a common word stock Q and a stop word stock T.
Step 2: select the sentence A to be processed; in this example A = 「简直不敢相信自己的眼晴。」 ("I simply cannot believe my own eyes"), in which the erroneous shape-similar character is 晴, written in place of 睛.
Step 3: preprocess the sentence A by removing its punctuation, obtaining the new character string B = 「简直不敢相信自己的眼晴」, whose length is n = 11.
Step 4: divide B into segments of length m = 5, so g = ⌈n/m⌉ = 3, where ⌈n/m⌉ denotes the smallest integer not less than n/m; L = [L1 L2 L3] = [「简直不敢相」, 「信自己的眼」, 「晴」], where L1 and L2 have length 5 and L3 has length 1.
Step 5: find the candidate character vector matrices of L1, L2 and L3 respectively; for example, the candidate matrix of L1 is [简 筒][直][不][敢][相 柜], and that of L3 is [晴 睛].
Step 6: combine adjacent vectors in each matrix; for example, the adjacent combinations in L1 include {简直, 筒直, 不敢, 敢相, 敢柜, …}. Comparing these combinations with the words in the corpus, the correct words are extracted and added to w; over the three matrices the correct words are w = {简直, 不敢, 自己}, and the stop-word set is d = {的}.
Step 7: extract the junction characters of the adjacent short sentences, namely {相, 信} and {眼, 晴}; combine the shape-similar characters of each pair and add the correctly formed words to w, giving w = {简直, 不敢, 自己, 相信, 眼睛} and d = {的}.
Step 8: compare w and d with the original sentence B = 「简直不敢相信自己的眼晴」 and remove the matched words from B; the remaining characters, 眼晴, locate the position of the erroneous character in the original sentence.
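Step 8 amounts to marking every character position covered by a word from w or d and reporting the uncovered positions; a minimal sketch under that reading (the function name is ours):

```python
def locate_errors(sentence, found_words):
    # Mark every character position covered by a validated word or stop
    # word; the uncovered positions are the suspected error locations.
    covered = [False] * len(sentence)
    for w in found_words:
        start = sentence.find(w)
        while start != -1:
            for k in range(start, start + len(w)):
                covered[k] = True
            start = sentence.find(w, start + 1)
    return [i for i, hit in enumerate(covered) if not hit]

b = "简直不敢相信自己的眼晴"
found = ["简直", "不敢", "相信", "自己", "的"]   # the words of w and d that match B
locate_errors(b, found)   # [9, 10]: the characters 眼 and 晴 remain uncovered
```

Note that 眼睛 from w does not occur in B (B contains 眼晴), so both junction characters stay uncovered, matching the example's conclusion.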
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the invention is not limited to these embodiments, and various changes may be made without departing from its spirit and scope.

Claims (4)

1. A method for locating shape-similar character errors in text, characterized in that: first, a long sentence is divided into several short sentences; for each character of each short sentence, the corresponding shape-similar characters are found in a Chinese shape-similar character library and, together with the original character, form a candidate character vector; uncommon characters are removed from the vector using a common-character library, and the candidate vectors of all characters form the candidate character matrix of the short sentence; next, adjacent vectors in each candidate matrix are combined into words, correctly combined words are added to a word set, and the stop words among the characters that form no word are added to a stop-word set; the characters at the junction of two adjacent short sentences are also extracted and combined, and any resulting word is added to the word set; finally, the words in the word set and the stop-word set are compared with the original text and removed from it, and the remaining characters are the positions where erroneous characters exist.
2. The method for positioning the errors of the text font near word according to claim 1, characterized by comprising the following steps:
the first step is as follows: establishing a database comprising a font near word stock X, a language database Y, a common word stock Q and a stop word stock T;
the second step is that: selecting a to-be-processed sample sentence A;
the third step: preprocess the sentence A by removing non-character characters to obtain a new character string B = c1c2…cn, where n is the length of B; the processed text B is the character string with all punctuation removed;
the fourth step: with segment length m, divide the character string B into g = ⌈n/m⌉ short character strings, where ⌈n/m⌉ denotes the smallest integer not less than n/m, and combine the segments into a short-string matrix L = [L1 L2 … Lg], with length(L1) = … = length(L(g-1)) = m; when m divides n, length(Lg) = m as well, and otherwise the last string Lg has length equal to the remainder of n divided by m; thus L1 = c1c2…cm, L2 = c(m+1)c(m+2)…c(2m), and so on;
The fifth step: converting the short string matrix L ═ L1L2…Li]In each short sentence LiFinding out the corresponding shape near character in the shape near character library X, and eliminating the unusual characters in the shape near characters by using the common character library Q to obtain LiThe candidate word vector matrix of (2);
and a sixth step: arrange and combine adjacent vectors in the candidate character vector matrix of Li to form a series of character combinations, judge whether each combination belongs to the corpus Y, and remove those that do not, obtaining the set of all words corresponding to L, w = {w1, w2, … wn}; if a vector cannot be combined with an adjacent vector into a word, or the vector has length 1, compare its characters with the stop-word library T, remove those that are not stop words, and obtain the stop-word set d = {d1, d2, … dn};
The seventh step: extract the last character of Li and the first character of L(i+1), find the shape-similar characters of both, and if they can be combined into a word, add the word to the set w;
eighth step: compare each word in the word set w = {w1, w2, … wn} and each stop word in the stop-word set d = {d1, d2, … dn} with the original sentence B; if a word exists in B, remove it from B; the characters remaining in B are the positions of the erroneous characters in the sentence.
3. The method for locating a text-shaped near-word error according to claim 1, wherein: in the fourth step, the segmentation length m is any number smaller than the sentence length n.
4. The method for locating a text shape-similar character error according to claim 1, wherein: in the fifth step, the length of each candidate character vector of Li depends on the number of shape-similar characters remaining after the uncommon ones are removed.
CN201810709763.1A 2018-07-02 2018-07-02 Positioning method for text form near word error Active CN109101482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810709763.1A CN109101482B (en) 2018-07-02 2018-07-02 Positioning method for text form near word error

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810709763.1A CN109101482B (en) 2018-07-02 2018-07-02 Positioning method for text form near word error

Publications (2)

Publication Number Publication Date
CN109101482A CN109101482A (en) 2018-12-28
CN109101482B true CN109101482B (en) 2021-08-20

Family

ID=64845410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810709763.1A Active CN109101482B (en) 2018-07-02 2018-07-02 Positioning method for text form near word error

Country Status (1)

Country Link
CN (1) CN109101482B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818108B (en) * 2021-02-24 2023-10-13 中国人民大学 Text semantic misinterpretation chat robot based on shape and near words and data processing method thereof

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101067809A (en) * 2007-06-22 2007-11-07 蒋贤春 Independent word segmentation
CN107679032A (en) * 2017-09-04 2018-02-09 百度在线网络技术(北京)有限公司 Voice changes error correction method and device
CN108038098A (en) * 2017-11-28 2018-05-15 苏州市东皓计算机系统工程有限公司 A kind of computword correcting method
CN108132917A (en) * 2017-12-04 2018-06-08 昆明理工大学 A kind of document error correction flag method


Non-Patent Citations (2)

Title
"Review of Real-word Error Detection and Correction Methods in Text Documents";Shashank Singh 等;《ICECA2018》;20180331;第1076-1081页 *
"繁体中文拼写检错研究";王勇;《中国优秀硕士学位论文全文数据库 信息科技辑》;20170215(第2期);第I138-4598页 *

Also Published As

Publication number Publication date
CN109101482A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
Munteanu et al. Improving machine translation performance by exploiting non-parallel corpora
CN109086266B (en) Error detection and correction method for text-shaped near characters
Oh et al. An English-Korean transliteration model using pronunciation and contextual rules
CN105279149A (en) Chinese text automatic correction method
CN110046348B (en) Method for recognizing main body in subway design specification based on rules and dictionaries
CN105068997B (en) The construction method and device of parallel corpora
Honnet et al. Machine translation of low-resource spoken dialects: Strategies for normalizing Swiss German
CN104375988A (en) Word and expression alignment method and device
CN103678287A (en) Method for unifying keyword translation
Saluja et al. Error detection and corrections in Indic OCR using LSTMs
CN110929510A (en) Chinese unknown word recognition method based on dictionary tree
Tran et al. A character level based and word level based approach for Chinese-Vietnamese machine translation
Hangya et al. Unsupervised parallel sentence extraction from comparable corpora
CN109101482B (en) Positioning method for text form near word error
Cabot et al. SIBM at CLEF eHealth Evaluation Lab 2017: Multilingual Information Extraction with CIM-IND.
CN113420766B (en) Low-resource language OCR method fusing language information
CN109543023B (en) Document classification method and system based on trie and LCS algorithm
CN104572632A (en) Method for determining translation direction of word with proper noun translation
CN103714053A (en) Japanese verb identification method for machine translation
US10515148B2 (en) Arabic spell checking error model
Lee et al. Alignment of bilingual named entities in parallel corpora using statistical models and multiple knowledge sources
CN103177125A (en) Method for realizing fast-speed short text bi-cluster
Sloto et al. Findings of the WMT 2023 shared task on parallel data curation
Chiu et al. Chinese spell checking based on noisy channel model
CN116306594A (en) Medical OCR recognition error correction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant