CN109101482B - Positioning method for text form near word error - Google Patents

Positioning method for text form near word error

Info

Publication number
CN109101482B
CN109101482B (application CN201810709763.1A)
Authority
CN
China
Prior art keywords
word
character
words
characters
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810709763.1A
Other languages
Chinese (zh)
Other versions
CN109101482A (en)
Inventor
邵玉斌
王林坪
龙华
杜庆治
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201810709763.1A priority Critical patent/CN109101482B/en
Publication of CN109101482A publication Critical patent/CN109101482A/en
Application granted granted Critical
Publication of CN109101482B publication Critical patent/CN109101482B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for locating shape-similar character errors in text, and belongs to the technical field of natural language processing. First, a long sentence is divided into several short sentences. For each character of each short sentence, the corresponding shape-similar characters are found in a Chinese shape-similar character library and, together with the original character, form a candidate character vector; uncommon characters are removed from the vector using a common-character library, and the candidate vectors of all characters form the candidate character matrix of the short sentence. Next, adjacent vectors in each candidate matrix are combined into words; correctly combined words are added to a word set, and the stop words among the characters that form no word are added to a stop-word set. The characters at the junction of two adjacent short sentences are also extracted and combined, and any resulting word is added to the word set. Finally, the words in the word set and the stop-word set are compared with the original text and removed from it; the remaining characters mark the positions where erroneous characters exist.

Description

Positioning method for text form near word error
Technical Field
The invention relates to a method for locating shape-similar character errors in text, and belongs to the technical field of natural language processing.
Background
At present, owing to the application of OCR text recognition, characters are often misrecognized as other characters when paper text is converted into computer text, and most such misrecognized characters are visually similar to the originals. In large-scale text proofreading, quickly finding the positions of misrecognized characters in the text is a precondition for correction.
Locating error positions in text through contextual connection strength with an N-gram model is a common approach to text error detection and correction. Word segmentation is a prerequisite for using an N-gram, and segmentation accuracy plays a decisive role in detection accuracy; moreover, segmentation and probability calculation consume considerable time, so both the accuracy and the speed of error detection are low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for locating shape-similar character errors in text, which avoids the speed and accuracy problems that word segmentation introduces into text error detection, saves the time consumed by segmentation and probability calculation, quickly locates error positions in a text, and lays the groundwork for subsequent proofreading.
The technical scheme of the invention is as follows: a method for locating shape-similar character errors in text. First, a long sentence is divided into several short sentences. For each character of each short sentence, the corresponding shape-similar characters are found in a Chinese shape-similar character library and, together with the original character, form a candidate character vector; uncommon characters are removed from the vector using a common-character library, and the candidate vectors of all characters form the candidate character matrix of the short sentence. Next, adjacent vectors in each candidate matrix are combined into words; correctly combined words are added to a word set, and the stop words among the characters that form no word are added to a stop-word set. The characters at the junction of two adjacent short sentences are also extracted and combined, and any resulting word is added to the word set. Finally, the words in the word set and the stop-word set are compared with the original text and removed from it; the remaining characters mark the positions where erroneous characters exist.
The method comprises the following specific steps:
the first step is as follows: establishing a database comprising a font near word stock X, a language database Y, a common word stock Q and a stop word stock T;
the second step is that: selecting a to-be-processed sample sentence A;
the third step: preprocess the sentence A by removing non-character characters to obtain a new character string B = c1c2…cn, where n is the length of B;
the fourth step: with segment length m, divide the character string B into g = ⌈n/m⌉ short character strings, where ⌈n/m⌉ denotes the smallest integer not less than n/m, and combine the segments into a short-string matrix L = [L1 L2 … Lg], with length(L1) = … = length(L(g-1)) = m; when m divides n, length(Lg) = m as well, and otherwise the last string Lg has length equal to the remainder of n divided by m; thus L1 = c1c2…cm, L2 = c(m+1)c(m+2)…c(2m), and so on;
The fifth step: converting the short string matrix L ═ L1L2…Li]In each short sentence LiFinding out the corresponding shape near character in the shape near character library X, and eliminating the unusual characters in the shape near characters by using the common character library Q to obtain LiThe candidate word vector matrix of (2);
and a sixth step: arrange and combine adjacent vectors in the Li matrix to form a series of character combinations, judge whether each combination belongs to the corpus Y, and remove those that do not, obtaining the set of all words w = {w1, w2, … wn}; if a vector cannot be combined with an adjacent vector into a word, or the vector has length 1, compare its characters with the stop-word library T, remove those that are not stop words, and obtain the stop-word set d = {d1, d2, … dn};
The seventh step: extract the last character of Li and the first character of L(i+1), find the shape-similar characters of both, and if they can be combined into a word, add the word to the set w;
eighth step: compare each word in the word set w = {w1, w2, … wn} and each stop word in the stop-word set d = {d1, d2, … dn} with the original sentence B; if a word exists in B, remove it from B; the characters remaining in B are the positions of the erroneous characters in the sentence.
In the third step, the processed text B is the character string with all punctuation removed.
In the fourth step, the segmentation length m is any number smaller than the sentence length n.
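The segmentation of the fourth step can be illustrated with a short Python sketch (an illustration only; the function name and sample string are ours, not the patent's):

```python
from math import ceil

def split_into_segments(text, m):
    """Split a string B of length n into g = ceil(n/m) short strings.

    Every segment except possibly the last has length m; when m does not
    divide n, the last segment holds the remainder.
    """
    return [text[i:i + m] for i in range(0, len(text), m)]

b = "abcdefghijk"                      # stands in for the cleaned string B, n = 11
segments = split_into_segments(b, 5)   # ["abcde", "fghij", "k"]
assert len(segments) == ceil(len(b) / 5)   # g = 3
```

With m = 5 and n = 11 this gives g = 3 segments whose last segment has length 1, the same shape as the worked example later in the description.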
In the fifth step, the length of each candidate character vector of Li is determined by the number of shape-similar characters remaining after the uncommon ones are removed.
The invention has the beneficial effects that: the method avoids the speed problem and the segmentation-accuracy problem that word segmentation causes in text error detection, saves the time consumed by N-gram segmentation and probability calculation, and locates errors in text more quickly.
Drawings
FIG. 1 is a flow chart of the steps of the present invention;
FIG. 2 is a flow chart of a fourth step of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings and the following detailed description.
Example 1: as shown in figs. 1-2, a method for locating shape-similar character errors in text. First, a long sentence is divided into g short sentences of length m. For each character of each short sentence, the corresponding shape-similar characters are found in a Chinese shape-similar character library and, together with the original character, form a candidate character vector; uncommon characters are removed from the vector using a common-character library, and the candidate vectors of all characters form the candidate character matrix of the short sentence. Next, adjacent vectors in each candidate matrix are combined into words; correctly combined words are added to the word set w, the vectors that cannot be combined into words are extracted, and the stop words among their characters are added to the stop-word set d. The characters at the junction of two adjacent short sentences are also extracted and combined, and any resulting word is added to w. Finally, the words in w and d are compared with the original text and removed from it; the remaining characters mark the positions where erroneous characters exist.
The steps are as follows:
the first step is as follows: and establishing a database which comprises a font near word stock X, a corpus Y, a frequently used word stock Q and a disabled word stock T.
The second step is that: and selecting a to-be-processed sample sentence A.
The third step: preprocess the sentence A by removing non-character characters to obtain a new character string B = c1c2…cn, where n is the length of B.
The fourth step: with segment length m, divide the character string B into g = ⌈n/m⌉ short character strings, where ⌈n/m⌉ denotes the smallest integer not less than n/m, and combine the segments into a short-string matrix L = [L1 L2 … Lg], with length(L1) = … = length(L(g-1)) = m; when m divides n, length(Lg) = m as well, and otherwise the last string Lg has length equal to the remainder of n divided by m; thus L1 = c1c2…cm, L2 = c(m+1)c(m+2)…c(2m), and so on.
The fifth step: for each short sentence Li in the short-string matrix L = [L1 L2 … Lg], find the corresponding shape-similar characters in the shape-similar character library X, and remove the uncommon ones using the common-character library Q, obtaining the candidate character vector matrix of Li. For example, if the shape-similar character vector of c1 is [c1 c11 c12 … c1j], then the candidate matrix is L1 = [c1 c11 c12 … c1j] … [cm cm1 cm2 … cmk].
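A minimal Python sketch of this candidate-matrix construction, using toy stand-ins for the shape-similar library X and common-character library Q (the dictionary contents here are our assumptions, not the patent's libraries):

```python
# Toy stand-ins for the shape-similar character library X and the
# common-character library Q; both are invented for illustration.
X = {"日": ["曰", "目", "白"]}
Q = {"日", "曰", "目"}          # "白" plays the role of an uncommon character

def candidate_vector(ch, near=X, common=Q):
    # The original character followed by its shape-similar characters,
    # with characters outside the common-character set removed.
    return [ch] + [c for c in near.get(ch, []) if c in common]

def candidate_matrix(segment):
    # One candidate vector per character of the short string Li.
    return [candidate_vector(ch) for ch in segment]

candidate_vector("日")   # ["日", "曰", "目"]  ("白" is filtered out)
```

A character with no entry in X yields a vector of length 1, which is exactly the case the sixth step treats as a stop-word candidate.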
And a sixth step: arrange and combine adjacent vectors in the Li matrix to form a series of character combinations, judge whether each combination belongs to the corpus Y, and remove those that do not, obtaining the set of all words w = {w1, w2, … wn}; if a vector cannot be combined with an adjacent vector into a word, or the vector has length 1, compare its characters with the stop-word library T, remove those that are not stop words, and obtain the stop-word set d = {d1, d2, … dn}.
The seventh step: extract the last character of Li and the first character of L(i+1), find the shape-similar characters of both, judge whether they can be combined into a word, and if so add the word to the set w.
Eighth step: compare each word in the word set w = {w1, w2, … wn} and each stop word in the stop-word set d = {d1, d2, … dn} with the original sentence B; if a word exists in B, remove it from B; the characters remaining in B are the positions of the erroneous characters in the sentence.
In the first step, the shape-similar character library X contains the shape-similar characters of all Chinese characters; the corpus Y is a corpus that has been word-segmented and statistically processed; the common-character library Q consists of the first-level and second-level common-character lists; stop words are function words without substantive meaning, and the stop-word library T contains such words.
In the second step, the input sentence a may be a long sentence or a short sentence.
In the third step, the processed text B is the character string with all punctuation removed.
In the fourth step, the segmentation length m may be any number smaller than the sentence length n.
In the fifth step, the length of each candidate character vector of Li depends on the number of shape-similar characters remaining after the uncommon ones are removed, so the vector lengths are not necessarily equal.
In the sixth step, adjacent vectors in the Li matrix are combined as follows: for two adjacent vectors a = [a a1] and b = [b b1] in Li, the combined result is {ab, ab1, a1b, a1b1}.
In the sixth step, "cannot be combined into a word" means that the combined result {ab, ab1, a1b, a1b1} contains no correct word; in that case, judge whether a and b contain stop words, and if so, add them to the stop-word set d.
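The enumeration {ab, ab1, a1b, a1b1} is a Cartesian product of the two adjacent vectors filtered against the corpus; a Python sketch (the corpus contents are invented for illustration):

```python
from itertools import product

def combine_adjacent(v1, v2, corpus):
    # Enumerate every pairwise combination of two adjacent candidate
    # vectors and keep only those that appear in the corpus Y.
    return [a + b for a, b in product(v1, v2) if a + b in corpus]

corpus_y = {"简直", "不敢"}                        # hypothetical corpus excerpt
combine_adjacent(["简", "筒"], ["直"], corpus_y)   # ["简直"]
```

An empty result corresponds to the "cannot be combined into a word" case above, in which the individual characters are then checked against the stop-word library.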
In the seventh step, the last character of Li and the first character of L(i+1) form the junction of the divided short strings. For example, with L1 = c1c2…cm and L2 = c(m+1)c(m+2)…c(2m), the two characters at the junction are {cm, c(m+1)}; the shape-similar characters of the two are combined, and if a word exists, it is added to the word set w.
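The seventh-step junction handling might be sketched as follows (the dictionary and corpus contents are again assumptions, chosen to mirror the worked example below):

```python
def junction_words(seg_a, seg_b, near, corpus):
    # Cross-combine the candidate characters of the last character of Li
    # and the first character of L(i+1); keep combinations that are words.
    left = [seg_a[-1]] + near.get(seg_a[-1], [])
    right = [seg_b[0]] + near.get(seg_b[0], [])
    return [a + b for a in left for b in right if a + b in corpus]

near = {"晴": ["睛"]}                              # hypothetical shape-similar entry
corpus = {"眼睛"}                                  # hypothetical corpus excerpt
junction_words("信自己的眼", "晴", near, corpus)   # ["眼睛"]
```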
Example 2: a method for positioning text form near word errors comprises the following specific steps:
step1, establishing a database which comprises a shape near word stock X, a corpus Y, a common word stock Q and a stop word stock T.
Step 2: select the sentence A to be processed; in this example A = 「简直不敢相信自己的眼晴。」 ("I simply cannot believe my own eyes"), in which the erroneous shape-similar character is 晴, written in place of 睛.
Step 3: preprocess the sentence A by removing its punctuation, obtaining the new character string B = 「简直不敢相信自己的眼晴」, whose length is n = 11.
Step 4: divide B into segments of length m = 5, so g = ⌈n/m⌉ = 3, where ⌈n/m⌉ denotes the smallest integer not less than n/m; L = [L1 L2 L3] = [「简直不敢相」, 「信自己的眼」, 「晴」], where L1 and L2 have length 5 and L3 has length 1.
Step 5: find the candidate character vector matrices of L1, L2 and L3 respectively; for example, the candidate matrix of L1 is [简 筒][直][不][敢][相 柜], and that of L3 is [晴 睛].
Step 6: combine adjacent vectors in each matrix; for example, the adjacent combinations in L1 include {简直, 筒直, 不敢, 敢相, 敢柜, …}. Comparing these combinations with the words in the corpus, the correct words are extracted and added to w; over the three matrices the correct words are w = {简直, 不敢, 自己}, and the stop-word set is d = {的}.
Step 7: extract the junction characters of the adjacent short sentences, namely {相, 信} and {眼, 晴}; combine the shape-similar characters of each pair and add the correctly formed words to w, giving w = {简直, 不敢, 自己, 相信, 眼睛} and d = {的}.
Step 8: compare w and d with the original sentence B = 「简直不敢相信自己的眼晴」 and remove the matched words from B; the remaining characters, 眼晴, locate the position of the erroneous character in the original sentence.
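Step 8 amounts to marking every character position covered by a word from w or d and reporting the uncovered positions; a minimal sketch under that reading (the function name is ours):

```python
def locate_errors(sentence, found_words):
    # Mark every character position covered by a validated word or stop
    # word; the uncovered positions are the suspected error locations.
    covered = [False] * len(sentence)
    for w in found_words:
        start = sentence.find(w)
        while start != -1:
            for k in range(start, start + len(w)):
                covered[k] = True
            start = sentence.find(w, start + 1)
    return [i for i, hit in enumerate(covered) if not hit]

b = "简直不敢相信自己的眼晴"
found = ["简直", "不敢", "相信", "自己", "的"]   # the words of w and d that match B
locate_errors(b, found)   # [9, 10]: the characters 眼 and 晴 remain uncovered
```

Note that 眼睛 from w does not occur in B (B contains 眼晴), so both junction characters stay uncovered, matching the example's conclusion.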
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the invention is not limited to these embodiments, and various changes may be made without departing from its spirit and scope.

Claims (4)

1. A method for locating shape-similar character errors in text, characterized in that: first, a long sentence is divided into several short sentences; for each character of each short sentence, the corresponding shape-similar characters are found in a Chinese shape-similar character library and, together with the original character, form a candidate character vector; uncommon characters are removed from the vector using a common-character library, and the candidate vectors of all characters form the candidate character matrix of the short sentence; next, adjacent vectors in each candidate matrix are combined into words, correctly combined words are added to a word set, and the stop words among the characters that form no word are added to a stop-word set; the characters at the junction of two adjacent short sentences are also extracted and combined, and any resulting word is added to the word set; finally, the words in the word set and the stop-word set are compared with the original text and removed from it, and the remaining characters are the positions where erroneous characters exist.
2. The method for positioning the errors of the text font near word according to claim 1, characterized by comprising the following steps:
the first step is as follows: establishing a database comprising a font near word stock X, a language database Y, a common word stock Q and a stop word stock T;
the second step is that: selecting a to-be-processed sample sentence A;
the third step: preprocess the sentence A by removing non-character characters to obtain a new character string B = c1c2…cn, where n is the length of B; the processed text B is the character string with all punctuation removed;
the fourth step: with segment length m, divide the character string B into g = ⌈n/m⌉ short character strings, where ⌈n/m⌉ denotes the smallest integer not less than n/m, and combine the segments into a short-string matrix L = [L1 L2 … Lg], with length(L1) = … = length(L(g-1)) = m; when m divides n, length(Lg) = m as well, and otherwise the last string Lg has length equal to the remainder of n divided by m; thus L1 = c1c2…cm, L2 = c(m+1)c(m+2)…c(2m), and so on;
The fifth step: converting the short string matrix L ═ L1L2…Li]In each short sentence LiFinding out the corresponding shape near character in the shape near character library X, and eliminating the unusual characters in the shape near characters by using the common character library Q to obtain LiThe candidate word vector matrix of (2);
and a sixth step: arrange and combine adjacent vectors in the candidate character vector matrix of Li to form a series of character combinations, judge whether each combination belongs to the corpus Y, and remove those that do not, obtaining the set of all words corresponding to L, w = {w1, w2, … wn}; if a vector cannot be combined with an adjacent vector into a word, or the vector has length 1, compare its characters with the stop-word library T, remove those that are not stop words, and obtain the stop-word set d = {d1, d2, … dn};
The seventh step: extract the last character of Li and the first character of L(i+1), find the shape-similar characters of both, and if they can be combined into a word, add the word to the set w;
eighth step: compare each word in the word set w = {w1, w2, … wn} and each stop word in the stop-word set d = {d1, d2, … dn} with the original sentence B; if a word exists in B, remove it from B; the characters remaining in B are the positions of the erroneous characters in the sentence.
3. The method for locating a text-shaped near-word error according to claim 1, wherein: in the fourth step, the segmentation length m is any number smaller than the sentence length n.
4. The method for locating a text shape-similar character error according to claim 1, wherein: in the fifth step, the length of each candidate character vector of Li depends on the number of shape-similar characters remaining after the uncommon ones are removed.
CN201810709763.1A 2018-07-02 2018-07-02 Positioning method for text form near word error Active CN109101482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810709763.1A CN109101482B (en) 2018-07-02 2018-07-02 Positioning method for text form near word error

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810709763.1A CN109101482B (en) 2018-07-02 2018-07-02 Positioning method for text form near word error

Publications (2)

Publication Number Publication Date
CN109101482A CN109101482A (en) 2018-12-28
CN109101482B true CN109101482B (en) 2021-08-20

Family

ID=64845410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810709763.1A Active CN109101482B (en) 2018-07-02 2018-07-02 Positioning method for text form near word error

Country Status (1)

Country Link
CN (1) CN109101482B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818108B (en) * 2021-02-24 2023-10-13 中国人民大学 Text semantic misinterpretation chat robot based on shape and near words and data processing method thereof

Citations (4)

Publication number Priority date Publication date Assignee Title
CN101067809A (en) * 2007-06-22 2007-11-07 蒋贤春 Independent word segmentation
CN107679032A (en) * 2017-09-04 2018-02-09 百度在线网络技术(北京)有限公司 Voice changes error correction method and device
CN108038098A (en) * 2017-11-28 2018-05-15 苏州市东皓计算机系统工程有限公司 A kind of computword correcting method
CN108132917A (en) * 2017-12-04 2018-06-08 昆明理工大学 A kind of document error correction flag method


Non-Patent Citations (2)

Title
"Review of Real-word Error Detection and Correction Methods in Text Documents";Shashank Singh 等;《ICECA2018》;20180331;第1076-1081页 *
"繁体中文拼写检错研究";王勇;《中国优秀硕士学位论文全文数据库 信息科技辑》;20170215(第2期);第I138-4598页 *

Also Published As

Publication number Publication date
CN109101482A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
Munteanu et al. Improving machine translation performance by exploiting non-parallel corpora
CN109086266B (en) Error detection and correction method for text-shaped near characters
Oh et al. An English-Korean transliteration model using pronunciation and contextual rules
CN105279149A (en) Chinese text automatic correction method
CN110046348B (en) Method for recognizing main body in subway design specification based on rules and dictionaries
CN105068997B (en) The construction method and device of parallel corpora
Honnet et al. Machine translation of low-resource spoken dialects: Strategies for normalizing Swiss German
CN104375988A (en) Word and expression alignment method and device
CN103678287A (en) Method for unifying keyword translation
Saluja et al. Error detection and corrections in Indic OCR using LSTMs
CN110929510A (en) Chinese unknown word recognition method based on dictionary tree
Tran et al. A character level based and word level based approach for Chinese-Vietnamese machine translation
Hangya et al. Unsupervised parallel sentence extraction from comparable corpora
CN109101482B (en) Positioning method for text form near word error
Cabot et al. SIBM at CLEF eHealth Evaluation Lab 2017: Multilingual Information Extraction with CIM-IND.
CN113420766B (en) Low-resource language OCR method fusing language information
CN109543023B (en) Document classification method and system based on trie and LCS algorithm
CN104572632A (en) Method for determining translation direction of word with proper noun translation
CN103714053A (en) Japanese verb identification method for machine translation
US10515148B2 (en) Arabic spell checking error model
Lee et al. Alignment of bilingual named entities in parallel corpora using statistical models and multiple knowledge sources
CN103177125A (en) Method for realizing fast-speed short text bi-cluster
Sloto et al. Findings of the WMT 2023 shared task on parallel data curation
Chiu et al. Chinese spell checking based on noisy channel model
CN116306594A (en) Medical OCR recognition error correction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant