CN108132917B - Document error correction marking method - Google Patents

Info

Publication number
CN108132917B
Authority
CN
China
Prior art keywords
document
word
error
corrected
len
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711257231.0A
Other languages
Chinese (zh)
Other versions
CN108132917A (en)
Inventor
龙华
祁俊辉
毕丹红
唐菁敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201711257231.0A
Publication of CN108132917A
Application granted
Publication of CN108132917B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a document error correction marking method and belongs to the technical field of information processing. The method splits the document to be corrected at separators into a set of small units, segments every set element into words with a word segmentation algorithm, looks up each English word in an English word database and records any spelling errors, and then computes the length of every word in each word set. If several consecutive isolated words occur, that is, several consecutive words each of length 1, they are recorded in an error word set. Finally, the document to be corrected is marked using the data in the error word set, and the error-corrected document is generated and exported. The method addresses the prior art's poor support for languages other than English, in particular incomplete error correction marking of Chinese documents, and aims to improve computer support for error correction marking of Chinese documents.

Description

Document error correction marking method
Technical Field
The invention relates to a document error correction marking method, and belongs to the technical field of information processing.
Background
Document error correction marking is an important and widely used technique in information processing. Word processors such as Microsoft Word and WPS provide error correction marking functions, which mainly alert the document editor that a word spelling error or a word logic error may be present somewhere in the document.
At present, traditional document error correction marking methods mainly mark misspelled English words and support non-English languages such as Chinese poorly. English needs no word segmentation: its words are separated by spaces and are largely independent of one another, so whether a word is correct can be judged by looking it up in an English word database. Chinese, by contrast, is not delimited by spaces and must be segmented, and, whether viewed statistically or grammatically, no fixed rule governs how words follow one another, so error correction marking of Chinese documents has long been difficult to implement.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a document error correction marking method that addresses the problems described above.
The technical scheme of the invention is as follows: the document to be corrected is split into a set of small units; each set element is segmented into words; English words are checked for spelling errors, which are recorded; the length of every word in each word set is computed, and if several consecutive isolated words occur they are recorded in an error word set; finally, the document to be corrected is marked using the data in the error word set.
The method specifically comprises the following steps:
Step 1: Acquire the document X to be corrected.
Step 2: Split the document X to be corrected at separators into a set, that is, X = {X_1, X_2, …, X_n}.
Step 3: For a set element X_i, i ∈ [1, n], of the document X to be corrected, segment the element with a word segmentation algorithm to obtain the corresponding word set X_i = {x_{i1}, x_{i2}, …, x_{im}}.
Step 4: Traverse the elements x_{ij}, j ∈ [1, m], of the word set X_i = {x_{i1}, x_{i2}, …, x_{im}}. If x_{ij} is an English word, look it up in the English word database; if the word is found, ignore it; if it is not found, record its subscript ij in the error word set ERROR.
Step 5: Traverse the elements x_{ij}, j ∈ [1, m], of the word set X_i = {x_{i1}, x_{i2}, …, x_{im}}, compute the length len_{ij} of each element x_{ij}, and generate the length set len_i = {len_{i1}, len_{i2}, …, len_{im}} corresponding to the word set.
Step 6: Define an error correction threshold P and traverse the length set len_i = {len_{i1}, len_{i2}, …, len_{im}}. If several consecutive isolated words occur, that is, if P or more consecutive len_{ij}, j ∈ [1, m], equal 1, record the subscripts ij corresponding to those len_{ij} in the error word set ERROR.
Step 7: Traverse all elements X_i, i ∈ [1, n], of the set form X = {X_1, X_2, …, X_n} of the document to be corrected, performing Steps 3, 4, 5 and 6 on each.
Step 8: According to the subscript data ij in the error word set ERROR, add red wavy lines beneath the text at the indicated positions of the document X to be corrected, then generate and export the error-corrected document X'.
In Steps 1 to 8, n, m ∈ N+.
Further, in Step 1, the document X to be corrected may be an all-Chinese document, a mixed Chinese-English document, or, as a special case, an all-English document.
Further, in Step 2, the separators are periods, exclamation marks, question marks, ellipses, line breaks and page breaks, in either their Chinese or English forms.
Further, in Step 3, the word segmentation algorithm should satisfy the following: a) Chinese sentences are segmented correctly; b) English words are segmented correctly; c) a string of digits is kept as a separate token; d) punctuation and spaces are removed.
Further, in Step 6, the error correction threshold P may be set according to the specific use environment, subject to equation (3); P is generally set to 3.
P ≥ 3 (3)
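The rule in Step 6 can be written compactly as below. This is one possible formalization, not a formula from the patent: it assumes that every subscript in a qualifying run is recorded (not only the first P), which is what the worked example in the Detailed Description suggests when it records four consecutive subscripts for a single element.

```latex
\[
\mathrm{ERROR} \;\leftarrow\; \mathrm{ERROR} \cup \{\, (i,j),\ (i,j+1),\ \dots,\ (i,j+k-1) \,\}
\quad \text{whenever} \quad
len_{ij} = len_{i,j+1} = \dots = len_{i,j+k-1} = 1,
\qquad k \ge P, \quad P \ge 3 .
\]
```

Here k is the length of a maximal run of length-1 words inside the word set X_i; runs shorter than P are left unmarked.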
The invention has the following beneficial effects: it addresses the prior art's poor support for languages other than English, in particular incomplete error correction marking of Chinese documents, and improves computer support for error correction marking of Chinese documents.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: The invention holds that, once a Chinese sentence has been segmented normally, a run of several consecutive isolated words does not fit the logical structure of a normal sentence, so an error must be present at that position. On this basis, the document to be corrected is split at separators into a set of small units, every set element is segmented into words with a word segmentation algorithm, English words are looked up in an English word database and any spelling errors are recorded, and the length of every word in each word set is computed. If several consecutive isolated words occur, that is, several consecutive words each of length 1, they are recorded in the error word set. Finally, the document to be corrected is marked using the data in the error word set, and the error-corrected document is generated and exported.
The document error correction marking method specifically comprises the following steps:
Step 1: Acquire the document X to be corrected. Specifically:
Assume that the content of the document X to be corrected is: "Kunming University of Science and Technology (Kunming Univelsity of Science and Technology), abbreviated as Kun Gong, is located in Kunming City, Yunnan Province; it was founded in 1954 under the name Kunming Institute of Technology and was renamed Kunming University of Science and Technology in 1995. In 1999, the former Kunming University of Science and Technology and the former Yunnan Industrial University merged to establish the new Kunming University of Science and Technology. It has developed into a key provincial university centered on engineering, combining science and engineering, with coordinated development of multiple disciplines."
Step 2: Split the document X to be corrected at separators into a set, that is, X = {X_1, X_2, …, X_n}. Specifically:
After splitting, the set has length n = 3. X_1 is "Kunming University of Science and Technology (Kunming Univelsity of Science and Technology), abbreviated as Kun Gong, is located in Kunming City, Yunnan Province; it was founded in 1954 under the name Kunming Institute of Technology and was renamed Kunming University of Science and Technology in 1995.", X_2 is "In 1999, the former Kunming University of Science and Technology and the former Yunnan Industrial University merged to establish the new Kunming University of Science and Technology.", and X_3 is "It has developed into a key provincial university centered on engineering, combining science and engineering, with coordinated development of multiple disciplines.".
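The split in Step 2 can be sketched in Python as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_document` is hypothetical, a page break is assumed to appear as a form feed character, and the separators are discarded instead of being kept with each element. Treating every English period as a separator also splits decimals and abbreviations, which a practical splitter would need to handle.

```python
import re

# Separators from Step 2: periods, exclamation marks, question marks and
# ellipses in Chinese or English form, plus line breaks and page breaks.
# Assumption: a page break is represented by the form feed character "\f".
SEPARATORS = re.compile(r"[。.!！?？…\n\f]+")


def split_document(text):
    """Split document X into its elements X_1..X_n, dropping empty pieces."""
    pieces = (piece.strip() for piece in SEPARATORS.split(text))
    return [piece for piece in pieces if piece]


# Illustrative input only; it is not the patent's example text.
elements = split_document("今天天气很好。你好吗？Fine!\nOK")
print(len(elements), elements)
```

Runs of adjacent separators (for example an exclamation mark followed by a line break) collapse into a single split point, so no empty elements reach the later steps.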
Step 3: For a set element X_i, i ∈ [1, n], of the document X to be corrected, segment the element with a word segmentation algorithm to obtain the corresponding word set X_i = {x_{i1}, x_{i2}, …, x_{im}}.
Segmenting the set element X_1 yields the word set X_1 = "Kunming, University of Science and Technology, Kunming, Univelsity, of, Science, and, Technology, abbreviation, Kun, Gong, located, Yunnan Province, Ming, Kunming, City, founded, 1954, renamed, Kunming, University of Science and Technology".
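The patent does not name a particular word segmentation algorithm; it only lists the four requirements a segmenter must meet. As a stand-in, the sketch below uses forward maximum matching against a tiny lexicon. The lexicon contents, the function names and the sample sentence are all hypothetical, and the CJK test assumes that the basic CJK Unified Ideographs range is sufficient.

```python
import re

# Hypothetical toy lexicon; a real system would load a full Chinese dictionary.
LEXICON = {"昆明", "理工大学", "简称", "位于", "云南省", "昆明市"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)

# Runs of Latin letters, runs of digits, or runs of CJK characters.
# Everything else (punctuation, spaces) is dropped, per requirement d).
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[\u4e00-\u9fff]+")


def segment_chinese(run):
    """Forward maximum matching: take the longest lexicon word at each position."""
    words = []
    i = 0
    while i < len(run):
        for size in range(min(MAX_WORD_LEN, len(run) - i), 0, -1):
            candidate = run[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in LEXICON:
                words.append(candidate)
                i += size
                break
    return words


def segment(sentence):
    """Return the word set for one element X_i as an ordered list."""
    words = []
    for token in TOKEN_RE.findall(sentence):
        if "\u4e00" <= token[0] <= "\u9fff":
            words.extend(segment_chinese(token))
        else:
            words.append(token)  # English word or digit string kept whole
    return words


print(segment("昆明理工大学(Kunming Univelsity)位于云南省昆明市，1954年"))
```

Characters that match no lexicon entry fall out as single-character words, which is exactly the signal that the length check in Step 6 relies on.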
Step 4: Traverse the elements x_{ij}, j ∈ [1, m], of the word set X_i = {x_{i1}, x_{i2}, …, x_{im}}. If x_{ij} is an English word, look it up in the English word database; if the word is found, ignore it; if it is not found, record its subscript ij in the error word set ERROR. Specifically:
Traversing the word set X_1 shows that the words x_{1,3}, x_{1,4}, x_{1,5}, x_{1,6}, x_{1,7} and x_{1,8} are all English words. Looking them up in the English word database shows that the word x_{1,4} is not present, so its subscript ij = (1, 4) is recorded in the error word set ERROR.
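A minimal sketch of the Step 4 lookup is shown below. The function name `find_misspelled` and the small `ENGLISH_WORDS` set are hypothetical; the set stands in for the English word database, and the case-insensitive comparison is an assumption. The two Chinese tokens at the front of the sample list are illustrative placeholders for the first two words of X_1.

```python
# Hypothetical stand-in for the English word database.
ENGLISH_WORDS = {"kunming", "university", "of", "science", "and", "technology"}


def find_misspelled(words, english_words, sentence_index):
    """Return the subscripts (i, j) of English words missing from the database."""
    errors = set()
    for j, word in enumerate(words, start=1):  # subscripts are 1-based, as in the patent
        is_english = word.isascii() and word.isalpha()
        if is_english and word.lower() not in english_words:
            errors.add((sentence_index, j))
    return errors


x1_head = ["昆明", "理工大学", "Kunming", "Univelsity", "of", "Science", "and", "Technology"]
print(find_misspelled(x1_head, ENGLISH_WORDS, sentence_index=1))
```

Chinese words and digit strings fail the `is_english` test and are skipped, so only the six English words are looked up.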
Step 5: Traverse the elements x_{ij}, j ∈ [1, m], of the word set X_i = {x_{i1}, x_{i2}, …, x_{im}}, compute the length len_{ij} of each element x_{ij}, and generate the length set len_i = {len_{i1}, len_{i2}, …, len_{im}} corresponding to the word set.
Computing the length len_{1j} of each element x_{1j} of the word set X_1 yields the length set len_1 = "2, 4, 7, 10, 2, 7, 3, 10, 2, 1, 1, 3, 3, 2, 1, 1, 1, 2, 1, 5, 2, 1, 2, 4".
Step 6: Define an error correction threshold P and traverse the length set len_i = {len_{i1}, len_{i2}, …, len_{im}}. If several consecutive isolated words occur, that is, if P or more consecutive len_{ij}, j ∈ [1, m], equal 1, record the subscripts ij corresponding to those len_{ij} in the error word set ERROR.
With the error correction threshold defined as P = 3, traversing the length set shows that len_{1,15}, len_{1,16} and len_{1,17} correspond to 3 consecutive isolated words, so the subscripts ij = (1, 15), ij = (1, 16) and ij = (1, 17) are recorded in the error word set ERROR.
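Steps 5 and 6 can be sketched together as follows. The function names are hypothetical, word length is taken as the number of Unicode code points returned by `len`, and recording the whole run rather than only its first P members is an assumption. The sample data is the length set len_1 given above.

```python
def word_lengths(words):
    """Step 5: the length set len_i for one word set X_i."""
    return [len(word) for word in words]


def find_isolated_runs(lengths, sentence_index, threshold=3):
    """Step 6: subscripts (i, j) of every run of >= threshold length-1 words."""
    errors = []
    run_start = None
    # A trailing 0 acts as a sentinel so a run at the very end is also closed.
    for pos, length in enumerate(list(lengths) + [0], start=1):
        if length == 1:
            if run_start is None:
                run_start = pos
        else:
            if run_start is not None and pos - run_start >= threshold:
                errors.extend((sentence_index, j) for j in range(run_start, pos))
            run_start = None
    return errors


len_1 = [2, 4, 7, 10, 2, 7, 3, 10, 2, 1, 1, 3, 3, 2,
         1, 1, 1, 2, 1, 5, 2, 1, 2, 4]
print(find_isolated_runs(len_1, sentence_index=1, threshold=3))
```

With P = 3 the two isolated words at positions 10 and 11 are left alone, as are the single isolated words at positions 19 and 22; only the run at positions 15 to 17 is recorded.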
Step 7: Traverse all elements X_i, i ∈ [1, n], of the set form X = {X_1, X_2, …, X_n} of the document to be corrected, performing Steps 3, 4, 5 and 6 on each.
Steps 3, 4, 5 and 6 are likewise performed on the set elements X_2 and X_3, yielding the subscripts of consecutive isolated words ij = (2, 9), ij = (2, 10), ij = (2, 11), ij = (2, 12), ij = (3, 1), ij = (3, 2) and ij = (3, 3), which are recorded in the error word set ERROR.
Step 8: According to the subscript data ij in the error word set ERROR, add red wavy lines beneath the text at the indicated positions of the document X to be corrected, then generate and export the error-corrected document X'.
The error-corrected document X' is: "Kunming University of Science and Technology (Kunming Univelsity of Science and Technology), abbreviated as Kun Gong, is located in Kunming City, Yunnan Province; it was founded in 1954 under the name Kunming Institute of Technology and was renamed Kunming University of Science and Technology in 1995. In 1999, the former Kunming University of Science and Technology and the former Yunnan Industrial University merged to establish the new Kunming University of Science and Technology. It has developed into a key provincial university centered on engineering, combining science and engineering, with coordinated development of multiple disciplines.", with red wavy lines beneath the text at the positions recorded in ERROR.
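The marking in Step 8 depends on the output format, which the patent leaves open. The sketch below assumes HTML output and uses the CSS declaration `text-decoration: underline wavy red` to draw the red wavy line; the function name `mark_errors` is hypothetical. Rejoining the words with spaces is a simplification that loses the original punctuation and spacing, so a real implementation would map subscripts back to character offsets in X.

```python
import html

# Assumption: the marked document X' is emitted as HTML.
WAVY = '<span style="text-decoration: underline wavy red">{}</span>'


def mark_errors(sentences, errors):
    """Wrap every word whose subscript (i, j) is in ERROR in a wavy-underline span."""
    marked = []
    for i, words in enumerate(sentences, start=1):
        parts = []
        for j, word in enumerate(words, start=1):
            text = html.escape(word)  # keep "<", ">" and "&" in words from breaking the markup
            parts.append(WAVY.format(text) if (i, j) in errors else text)
        marked.append(" ".join(parts))
    return marked


print(mark_errors([["Kunming", "Univelsity", "of", "Science"]], {(1, 2)}))
```

Because ERROR holds (i, j) pairs from both the dictionary check and the isolated-word check, a single pass over the word sets is enough to mark both kinds of error.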
The result of this embodiment shows that the method of the invention performs error correction marking on Chinese documents well and, combined with the conventional method of marking English errors, can also mark documents that mix Chinese and English.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (5)

1. A document error correction marking method, characterized by comprising the following steps:
Step 1: acquiring a document X to be corrected;
Step 2: splitting the document X to be corrected at separators into a set, that is, X = {X_1, X_2, …, X_n};
Step 3: for a set element X_i, i ∈ [1, n], of the document X to be corrected, segmenting the element with a word segmentation algorithm to obtain the corresponding word set X_i = {x_{i1}, x_{i2}, …, x_{im}};
Step 4: traversing the elements x_{ij}, j ∈ [1, m], of the word set X_i = {x_{i1}, x_{i2}, …, x_{im}}; if x_{ij} is an English word, looking it up in an English word database; if the word is found, ignoring it; if it is not found, recording its subscript ij in an error word set ERROR;
Step 5: traversing the elements x_{ij}, j ∈ [1, m], of the word set X_i = {x_{i1}, x_{i2}, …, x_{im}}, computing the length len_{ij} of each element x_{ij}, and generating the length set len_i = {len_{i1}, len_{i2}, …, len_{im}} corresponding to the word set;
Step 6: defining an error correction threshold P and traversing the length set len_i = {len_{i1}, len_{i2}, …, len_{im}}; if several consecutive isolated words occur, that is, if P or more consecutive len_{ij}, j ∈ [1, m], equal 1, recording the subscripts ij corresponding to those len_{ij} in the error word set ERROR;
Step 7: traversing all elements X_i, i ∈ [1, n], of the set form X = {X_1, X_2, …, X_n} of the document to be corrected, performing Steps 3, 4, 5 and 6 on each;
Step 8: according to the subscript data ij in the error word set ERROR, adding red wavy lines beneath the text at the indicated positions of the document X to be corrected, then generating and exporting an error-corrected document X';
in Steps 1 to 8, n, m ∈ N+.
2. The document error correction marking method according to claim 1, characterized in that: in Step1, the document X to be corrected is a full chinese document, a chinese-english combination document, or a full english document.
3. The document error correction marking method according to claim 1, characterized in that: in Step 2, the separators are periods, exclamation marks, question marks, ellipses, line breaks and page breaks, in either their Chinese or English forms.
4. The document error correction marking method according to claim 1, characterized in that: in Step3, the word segmentation algorithm should satisfy:
a. the Chinese sentences can be normally segmented;
b. the English words can be normally segmented;
c. a string of numbers may be listed separately;
d. punctuation and space can be removed.
5. The document error correction marking method according to claim 1, characterized in that: in Step6, the value range of the error correction threshold P is P ≥ 3.
CN201711257231.0A 2017-12-04 2017-12-04 Document error correction marking method Active CN108132917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711257231.0A CN108132917B (en) 2017-12-04 2017-12-04 Document error correction marking method

Publications (2)

Publication Number Publication Date
CN108132917A (en) 2018-06-08
CN108132917B (en) 2021-12-17

Family

ID=62389957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711257231.0A Active CN108132917B (en) 2017-12-04 2017-12-04 Document error correction marking method

Country Status (1)

Country Link
CN (1) CN108132917B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101482B (en) * 2018-07-02 2021-08-20 昆明理工大学 Positioning method for text form near word error
CN109460455B (en) * 2018-10-25 2020-04-28 第四范式(北京)技术有限公司 Text detection method and device
CN112036136A (en) * 2020-09-01 2020-12-04 文思海辉智科科技有限公司 Quality inspection data processing method and device, electronic equipment and readable storage medium
CN113705203A (en) * 2021-09-02 2021-11-26 上海极链网络科技有限公司 Text error correction method and device, electronic equipment and computer readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867040A (en) * 2012-08-31 2013-01-09 中国科学院计算技术研究所 Chinese search engine mixed language-oriented query error correction method and system
CN104750672A (en) * 2013-12-27 2015-07-01 重庆新媒农信科技有限公司 Chinese word error correction method used in search and device thereof
CN105279149A (en) * 2015-10-21 2016-01-27 上海应用技术学院 Chinese text automatic correction method
CN107193921A (en) * 2017-05-15 2017-09-22 中山大学 Search engine-oriented method and system for mixed Chinese-English query error correction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7092567B2 (en) * 2002-11-04 2006-08-15 Matsushita Electric Industrial Co., Ltd. Post-processing system and method for correcting machine recognized text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
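A parallel query error correction method supporting mixed languages; Zhuan Yue et al.; Journal of Chinese Information Processing; 2016-03-31; Vol. 30, No. 2; pp. 1-8 *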

Also Published As

Publication number Publication date
CN108132917A (en) 2018-06-08

Similar Documents

Publication Publication Date Title
Liu et al. A survey of CRF algorithm based knowledge extraction of elementary mathematics in Chinese
CN108132917B (en) Document error correction marking method
CN108614898B (en) Document analysis method and device
CN105224640B (en) Method and equipment for extracting viewpoint
US8843815B2 (en) System and method for automatically extracting metadata from unstructured electronic documents
EP2653982A1 (en) Method and system for statistical misspelling correction
Wang et al. A beam-search decoder for normalization of social media text with application to machine translation
CN105975625A (en) Chinglish inquiring correcting method and system oriented to English search engine
CN105279149A (en) Chinese text automatic correction method
CN101876965B (en) Method and system used for processing text
CN102955773B (en) For identifying the method and system of chemical name in Chinese document
CN104778256A (en) Rapid incremental clustering method for domain question-answering system consultations
CN107797995A (en) A kind of Chinese and English fragment language material generation method
Chea et al. Khmer word segmentation using conditional random fields
CN106383814A (en) Word segmentation method of English social media short text
US20120290602A1 (en) Method and system for identifying traditional arabic poems
Li et al. Improving named entity recognition in tweets via detecting non-standard words
CN116468009A (en) Article generation method, apparatus, electronic device and storage medium
Sitaula A hybrid algorithm for stemming of Nepali text
CN113255329A (en) English text spelling error correction method and device, storage medium and electronic equipment
CN106650803B (en) The method and device of similarity between a kind of calculating character string
CN104933030A (en) Uygur language spelling examination method and device
KR20190090636A (en) Method for automatically editing pattern of document
Hu et al. CSCD-IME: correcting spelling errors generated by pinyin IME
Sajjad et al. Comparing two techniques for learning transliteration models using a parallel corpus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant