CN108132917B - Document error correction marking method - Google Patents
- Publication number
- CN108132917B (application CN201711257231.0A)
- Authority
- CN
- China
- Prior art keywords
- document
- word
- error
- corrected
- len
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention relates to a document error correction marking method and belongs to the technical field of information processing. The method divides a document to be corrected into a set of small segments using separators, segments every set element into words with a word segmentation algorithm, looks up each English word in an English word database and records any spelling errors, and then calculates the length of every word in each word set. If several consecutive independent words occur, that is, several consecutive words all have length 1, they are recorded in an error word set. Finally, the errors are marked in the document to be corrected using the data in the error word set, and the error-corrected document is generated and exported. Compared with the prior art, the method mainly addresses the poor support for languages other than English, in particular the incomplete error correction marking of Chinese documents, and aims to improve computer support for error correction marking of Chinese documents.
Description
Technical Field
The invention relates to a document error correction marking method, and belongs to the technical field of information processing.
Background
Document error correction marking is an important and widely used information processing technique. Word processors such as Word and WPS generally provide an error correction marking function, which mainly alerts the document editor that a spelling error or a logical error in word usage may have occurred somewhere in the document.
At present, traditional document error correction marking methods mainly mark misspelled English words and support non-English languages such as Chinese poorly. English requires no word segmentation: words are separated by spaces and are largely independent of one another, so whether a word is correct can be judged by looking it up in an English word database. Chinese, by contrast, is not delimited by spaces and must be segmented into words, and from neither a statistical nor a grammatical viewpoint is there a fixed rule governing how words combine, so error correction marking of Chinese documents has been difficult to implement.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a document error correction marking method that addresses the problems described above.
The technical scheme of the invention is as follows: a document error correction marking method divides the document to be corrected into a set of small segments, segments each set element into words, looks up English words to find and record spelling errors, calculates the length of every word in each word set, records any run of several consecutive independent words in an error word set, and marks the errors in the document to be corrected using the data in the error word set.
The method specifically comprises the following steps:
Step 1: Acquire the document X to be corrected.
Step 2: Divide the document X to be corrected into a set using separators, that is, split X into $X = \{X_1, X_2, \dots, X_n\}$.
Step 3: Segment a set element $X_i$, $i \in [1, n]$, of the document X with a word segmentation algorithm to obtain the word set corresponding to that element, $X_i = \{x_{i1}, x_{i2}, \dots, x_{im}\}$.
Step 4: Traverse the elements $x_{ij}$, $j \in [1, m]$, of the word set $X_i$. If $x_{ij}$ is an English word, look it up in an English word database; if it is found, ignore it, and if it is not found, record its subscript $ij$ in the error word set ERROR.
Step 5: Traverse the elements $x_{ij}$, $j \in [1, m]$, of the word set $X_i$, calculate the length $\mathrm{len}_{ij}$ of each element, and generate the length set corresponding to the word set, $\mathrm{len}_i = \{\mathrm{len}_{i1}, \mathrm{len}_{i2}, \dots, \mathrm{len}_{im}\}$.
Step 6: Define an error correction threshold P and traverse the length set $\mathrm{len}_i$. If several consecutive independent words occur, that is, P or more consecutive $\mathrm{len}_{ij}$, $j \in [1, m]$, are equal to 1, record the subscripts $ij$ corresponding to those $\mathrm{len}_{ij}$ in the error word set ERROR.
Step 7: Traverse all elements $X_i$, $i \in [1, n]$, of the set form $X_1, X_2, \dots, X_n$ of the document to be corrected and perform Steps 3, 4, 5 and 6 on each.
Step 8: According to the subscript data $ij$ in the error word set ERROR, add red wavy lines below the text at the corresponding positions of the document X, then generate and export the error-corrected document X'.
In Steps 1 to 8, $n, m \in \mathbb{N}^+$.
Further, in Step 1, the document X to be corrected may be an all-Chinese document, a document combining Chinese and English, or an all-English document.
Further, in Step 2, the separators are periods, exclamation marks, question marks, ellipses, line breaks and page breaks, in either their Chinese or English forms.
Further, in Step 3, the word segmentation algorithm should satisfy the following: a) Chinese sentences can be segmented normally; b) English words can be segmented normally; c) a string of digits can be listed as a separate element; d) punctuation and spaces can be removed.
Further, in Step 6, the error correction threshold P may be set according to the specific use environment, subject to equation (3); P is generally set to 3.
$P \ge 3 \quad (3)$
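Steps 5 and 6 can be restated compactly as follows. This is only a sketch of the rule, assuming that "P or more consecutive" refers to any run of length-1 words whose size is at least P; the run bounds $a$ and $b$ are notation introduced here for illustration and do not appear in the original text.

```latex
\mathrm{len}_{ij} = |x_{ij}|, \qquad
\mathrm{ERROR} \leftarrow \mathrm{ERROR} \cup \{\, (i, j) : a \le j \le b \,\}
\quad \text{whenever} \quad
\mathrm{len}_{ia} = \mathrm{len}_{i,a+1} = \dots = \mathrm{len}_{ib} = 1
\ \text{and}\ b - a + 1 \ge P, \quad P \ge 3 .
```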
The invention has the following beneficial effects: compared with the prior art, the method mainly addresses the poor support for languages other than English, in particular the incomplete error correction marking of Chinese documents, and thereby improves computer support for error correction marking of Chinese documents.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
Detailed Description
The invention is further described with reference to the following drawings and detailed description.
Example 1: The invention holds that, after a Chinese sentence has been segmented normally, a run of several consecutive independent words does not fit the logical structure of a normal sentence, so an error must be present at that position. On this basis, the document to be corrected is divided into a set of small segments using separators, and every set element is segmented into words with a word segmentation algorithm. English words are then looked up in an English word database and any spelling errors are recorded. Next, the length of every word in each word set is calculated; if several consecutive independent words occur, that is, several consecutive words all have length 1, they are recorded in an error word set. Finally, the errors are marked in the document to be corrected using the data in the error word set, and the error-corrected document is generated and exported.
A document error correction marking method specifically comprises the following steps:
Step 1: Acquire the document X to be corrected. Specifically:
Assume that the content of the document X to be corrected is "Kunming university of Science and Technology (Kunming univelsity of Science and Technology), abbreviated as Kunming, located in mingmen city of province of Yunnan province, created in 1954, named Kunming institute of Technology, and renamed as university of Kunming university of Technology in 1995. In 1999, the former Kunming Miller university and the former Yunnan Industrial university are combined to establish a new Kunming Miller university. Has been developed to become a major and combined science and engineering, which is a provincial major university with coordinated development of multiple subjects.".
Step 2: Divide the document X to be corrected into a set using separators, that is, split X into $X = \{X_1, X_2, \dots, X_n\}$. Specifically:
After division into set form, the length of the set is $n = 3$. $X_1$ is "The Kunming university of Science and Technology, named Kunming institute of Technology and Kunming university of Kunming, is called Kunming worker for short, is located in Kunming province Ming Kunming City, and is created in 1954, named Kunming institute of Technology, and is renamed to Kunming university of Technology in 1995.", $X_2$ is "the combination of original Kunming university and original Yunnan industry university in 1999 to establish new Kunming university.", and $X_3$ is "The combination of the science and technology is a provincial major university with coordinated development of multiple subjects, so that the development becomes a major worker.".
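The splitting in Step 2 might be sketched in Python as below. This is a minimal illustration rather than the patented implementation: the function name `split_document` and the sample sentence are hypothetical, and the form feed character is assumed to stand in for a page break.

```python
import re
from typing import List

# Separators named for Step 2: periods, exclamation marks, question marks,
# ellipses and line breaks in Chinese or English form. The form feed "\f"
# is an assumed stand-in for a page break.
_SEPARATORS = re.compile(r"[。！？.!?…\n\f]+")


def split_document(document: str) -> List[str]:
    """Split a document into its non-empty pieces X_1 .. X_n."""
    pieces = _SEPARATORS.split(document)
    return [piece.strip() for piece in pieces if piece.strip()]


# Hypothetical sample text, not taken from the patent.
print(split_document("今天天气很好。我们去公园吧！Is it far? No..."))
```

Treating every "." as a separator also splits decimal numbers and abbreviations, so a production splitter would likely need extra rules for those cases.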
Step 3: Segment a set element $X_i$, $i \in [1, n]$, of the document X with a word segmentation algorithm to obtain the word set corresponding to that element, $X_i = \{x_{i1}, x_{i2}, \dots, x_{im}\}$.
Segmenting the set element $X_1$ yields the word set $X_1$: "Kunming, university of science, kunming, univelsity, of science, and, technology, acronym, Kunming, worker, location, Yunnan province, Ming, Kunming, City, creation, 1954, renamed, Kunming, university of science".
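The text does not name a particular segmentation algorithm, so the sketch below swaps in forward maximum matching over a tiny dictionary simply to show the four requirements of Step 3 in action. The dictionary `TOY_DICT`, the function name `segment`, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical toy lexicon; a real system would load a large dictionary.
TOY_DICT = {"昆明", "理工", "大学", "位于", "云南省"}
MAX_WORD_LEN = max(len(word) for word in TOY_DICT)

# Runs of Latin letters, runs of digits, or runs of CJK characters.
# Anything else (punctuation, spaces) is never matched, hence dropped.
_TOKEN = re.compile(r"[A-Za-z]+|[0-9]+|[\u4e00-\u9fff]+")


def segment(sentence):
    """Return the word list for one sentence."""
    words = []
    for chunk in _TOKEN.findall(sentence):       # d) punctuation/space removed
        if chunk[0].isascii():                   # b) English word, c) digit string
            words.append(chunk)
            continue
        pos = 0
        while pos < len(chunk):                  # a) forward maximum matching
            longest = min(MAX_WORD_LEN, len(chunk) - pos)
            for size in range(longest, 0, -1):
                candidate = chunk[pos:pos + size]
                # Fall back to a single character when no dictionary word fits.
                if size == 1 or candidate in TOY_DICT:
                    words.append(candidate)
                    pos += size
                    break
    return words


print(segment("昆明理工大学 (Kunming univelsity), 1954年"))
```

Characters that match no dictionary entry come out as single-character words, which is exactly the signal that the later length check relies on.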
Step 4: Traverse the elements $x_{ij}$, $j \in [1, m]$, of the word set $X_i$. If $x_{ij}$ is an English word, look it up in an English word database; if it is found, ignore it, and if it is not found, record its subscript $ij$ in the error word set ERROR. Specifically:
Traversing the word set $X_1$ shows that the words $x_{1,3}$, $x_{1,4}$, $x_{1,5}$, $x_{1,6}$, $x_{1,7}$ and $x_{1,8}$ are all English words. Looking them up in the English word database shows that $x_{1,4}$ is not present, so its subscript $ij = 1,4$ is recorded in the error word set ERROR.
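A minimal sketch of the lookup in Step 4 follows. The set `ENGLISH_WORDS` is a hypothetical stand-in for the English word database, the function name is invented for illustration, and the Chinese tokens in the sample list are assumed placeholders for the first two words of the example.

```python
# Hypothetical stand-in for the English word database.
ENGLISH_WORDS = {"kunming", "university", "of", "science", "and", "technology"}


def find_misspellings(words, sentence_index, error):
    """Record (i, j) for every English word that is missing from the database."""
    for j, word in enumerate(words, start=1):    # subscripts are 1-based
        is_english = word.isascii() and word.isalpha()
        if is_english and word.lower() not in ENGLISH_WORDS:
            error.add((sentence_index, j))
    return error


words = ["昆明", "理工大学", "Kunming", "univelsity",
         "of", "Science", "and", "Technology"]
print(find_misspellings(words, 1, set()))  # flags only the fourth word
```

The lookup is case-insensitive here, which is a design choice of this sketch rather than something the text specifies.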
Step 5: Traverse the elements $x_{ij}$, $j \in [1, m]$, of the word set $X_i$, calculate the length $\mathrm{len}_{ij}$ of each element, and generate the length set corresponding to the word set, $\mathrm{len}_i = \{\mathrm{len}_{i1}, \mathrm{len}_{i2}, \dots, \mathrm{len}_{im}\}$.
Calculating the length $\mathrm{len}_{1j}$ of each element $x_{1j}$ of the word set $X_1$ yields the length set $\mathrm{len}_1$: "2, 4, 7, 10, 2, 7, 3, 10, 2, 1, 1, 3, 3, 2, 1, 1, 1, 2, 1, 5, 2, 1, 2, 4".
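The length calculation of Step 5 is a one-liner in Python, shown here as a small sketch. The function name `word_lengths` is hypothetical, and the Chinese tokens are assumed placeholders chosen so that the result reproduces the first five entries of the length set above.

```python
def word_lengths(words):
    """Return len_i: the length in characters of every word in the word set."""
    # Python's len() counts Unicode code points, so each Chinese character
    # contributes 1 and each Latin letter contributes 1.
    return [len(word) for word in words]


print(word_lengths(["昆明", "理工大学", "Kunming", "univelsity", "of"]))
```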
Step 6: Define an error correction threshold P and traverse the length set $\mathrm{len}_i$. If several consecutive independent words occur, that is, P or more consecutive $\mathrm{len}_{ij}$, $j \in [1, m]$, are equal to 1, record the subscripts $ij$ corresponding to those $\mathrm{len}_{ij}$ in the error word set ERROR.
With the error correction threshold defined as $P = 3$, traversing the length set shows that $\mathrm{len}_{1,15}$, $\mathrm{len}_{1,16}$ and $\mathrm{len}_{1,17}$ correspond to 3 consecutive independent words, so the subscripts $ij = 1,15$, $ij = 1,16$ and $ij = 1,17$ are recorded in the error word set ERROR.
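The run detection of Step 6 can be sketched as a single pass over the length set. The function name `find_single_char_runs` is hypothetical, and the sketch assumes that every member of a run of P or more is flagged, not just the first P.

```python
def find_single_char_runs(lengths, sentence_index, threshold=3):
    """Return (i, j) for each word inside a run of >= threshold length-1 words."""
    flagged = []
    run_start = None
    # A trailing 0 acts as a sentinel so a run at the end is also closed.
    for j, length in enumerate(list(lengths) + [0], start=1):
        if length == 1:
            if run_start is None:
                run_start = j
        else:
            if run_start is not None and j - run_start >= threshold:
                flagged.extend((sentence_index, k) for k in range(run_start, j))
            run_start = None
    return flagged


len_1 = [2, 4, 7, 10, 2, 7, 3, 10, 2, 1, 1, 3,
         3, 2, 1, 1, 1, 2, 1, 5, 2, 1, 2, 4]
print(find_single_char_runs(len_1, 1))  # [(1, 15), (1, 16), (1, 17)]
```

On the length set of the example, the run of two at positions 10 and 11 stays below the threshold, while positions 15 to 17 are flagged, matching the subscripts recorded in the text.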
Step 7: Traverse all elements $X_i$, $i \in [1, n]$, of the set form $X_1, X_2, \dots, X_n$ of the document to be corrected and perform Steps 3, 4, 5 and 6 on each.
Steps 3, 4, 5 and 6 are likewise performed on the set elements $X_2$ and $X_3$, yielding the subscripts of consecutive independent words $ij = 2,9$, $ij = 2,10$, $ij = 2,11$, $ij = 2,12$, $ij = 3,1$, $ij = 3,2$ and $ij = 3,3$, which are recorded in the error word set ERROR.
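The loop of Step 7 might be sketched as one function that applies Steps 3 to 6 to every sentence. Everything named here is hypothetical: `collect_errors`, its parameters, and the sample input, which is pre-segmented with spaces so that `str.split` can stand in for a real segmenter.

```python
def collect_errors(sentences, segment, english_words, threshold=3):
    """Apply Steps 3-6 to every sentence and return the ERROR set of (i, j)."""
    error = set()
    for i, sentence in enumerate(sentences, start=1):
        words = segment(sentence)                        # Step 3
        for j, word in enumerate(words, start=1):        # Step 4
            is_english = word.isascii() and word.isalpha()
            if is_english and word.lower() not in english_words:
                error.add((i, j))
        lengths = [len(word) for word in words]          # Step 5
        run = []                                         # Step 6
        for j, length in enumerate(lengths + [0], start=1):
            if length == 1:
                run.append(j)
            else:
                if len(run) >= threshold:
                    error.update((i, k) for k in run)
                run = []
    return error


# Hypothetical, space-delimited toy input; str.split stands in for the segmenter.
sentences = ["昆明 univelsity of", "合 并 组 建 大学"]
print(sorted(collect_errors(sentences, str.split, {"of"})))
```

Passing the segmenter and word list in as arguments keeps the loop independent of any particular segmentation library or dictionary.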
Step 8: According to the subscript data $ij$ in the error word set ERROR, add red wavy lines below the text at the corresponding positions of the document X, then generate and export the error-corrected document X'.
The error-corrected document X' is "Kunming university of Kunming (Kunming univelsity of Science and Technology), abbreviated as Kunming, was located in province Ming Kunming City of Yunnan province, was created in 1954, named Kunming institute of Technology, and more named Kunming university of Science in 1995. In 1999, the former Kunming Miller university and the former Yunnan Industrial university are combined to establish a new Kunming Miller university. Has been developed to become a major and combined science and engineering, which is a provincial major university with coordinated development of multiple subjects.".
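The text does not say how the red wavy lines are produced in the exported file. As one possible rendering, the sketch below emits HTML and relies on the CSS `text-decoration: underline wavy red` declaration; the function name `mark_words` and the HTML output format are assumptions, not part of the described method.

```python
import html


def mark_words(words, sentence_index, error):
    """Render one word list as HTML, underlining every word listed in ERROR."""
    parts = []
    for j, word in enumerate(words, start=1):
        text = html.escape(word)
        if (sentence_index, j) in error:
            text = ('<span style="text-decoration: underline wavy red">'
                    + text + "</span>")
        parts.append(text)
    # Joining on spaces is a simplification; it does not restore the
    # original punctuation or spacing of the sentence.
    return " ".join(parts)


print(mark_words(["Kunming", "univelsity"], 1, {(1, 2)}))
```

A fuller implementation would probably keep the character offsets of each word during segmentation so that marks can be applied to the untouched original text.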
The result of this embodiment shows that the method can mark errors in Chinese documents well and, when combined with the English error marking of traditional algorithms, can also mark errors in documents that mix Chinese and English.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.
Claims (5)
1. A document error correction marking method is characterized by comprising the following steps:
Step 1: acquiring a document X to be corrected;
Step 2: dividing the document X to be corrected into a set using separators, namely splitting X into $X = \{X_1, X_2, \dots, X_n\}$;
Step 3: segmenting a set element $X_i$, $i \in [1, n]$, of the document X with a word segmentation algorithm to obtain the word set corresponding to that element, $X_i = \{x_{i1}, x_{i2}, \dots, x_{im}\}$;
Step 4: traversing the elements $x_{ij}$, $j \in [1, m]$, of the word set $X_i$; if $x_{ij}$ is an English word, searching an English word database; if the word exists, ignoring it, and if it does not exist, recording its subscript $ij$ in the error word set ERROR;
Step 5: traversing the elements $x_{ij}$, $j \in [1, m]$, of the word set $X_i$, calculating the length $\mathrm{len}_{ij}$ of each element, and generating the length set corresponding to the word set, $\mathrm{len}_i = \{\mathrm{len}_{i1}, \mathrm{len}_{i2}, \dots, \mathrm{len}_{im}\}$;
Step 6: defining an error correction threshold P and traversing the length set $\mathrm{len}_i$; if several consecutive independent words occur, namely P or more consecutive $\mathrm{len}_{ij}$, $j \in [1, m]$, are equal to 1, recording the subscripts $ij$ corresponding to those $\mathrm{len}_{ij}$ in the error word set ERROR;
Step 7: traversing all elements $X_i$, $i \in [1, n]$, of the set form $X_1, X_2, \dots, X_n$ of the document to be corrected and performing the operations of Step 3, Step 4, Step 5 and Step 6;
Step 8: adding red wavy lines below the text at the corresponding positions of the document X according to the subscript data $ij$ in the error word set ERROR, then generating and exporting an error-corrected document X';
in Steps 1 to 8, $n, m \in \mathbb{N}^+$.
2. The document error correction marking method according to claim 1, characterized in that: in Step 1, the document X to be corrected is an all-Chinese document, a document combining Chinese and English, or an all-English document.
3. The document error correction marking method according to claim 1, characterized in that: in Step 2, the separators are periods, exclamation marks, question marks, ellipses, line breaks and page breaks, in either their Chinese or English forms.
4. The document error correction marking method according to claim 1, characterized in that: in Step3, the word segmentation algorithm should satisfy:
a. the Chinese sentences can be normally segmented;
b. the English words can be normally segmented;
c. a string of numbers may be listed separately;
d. punctuation and space can be removed.
5. The document error correction marking method according to claim 1, characterized in that: in Step6, the value range of the error correction threshold P is P ≥ 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711257231.0A CN108132917B (en) | 2017-12-04 | 2017-12-04 | Document error correction marking method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711257231.0A CN108132917B (en) | 2017-12-04 | 2017-12-04 | Document error correction marking method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108132917A (en) | 2018-06-08 |
CN108132917B (en) | 2021-12-17 |
Family
ID=62389957
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711257231.0A Active CN108132917B (en) | 2017-12-04 | 2017-12-04 | Document error correction marking method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108132917B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109101482B (en) * | 2018-07-02 | 2021-08-20 | 昆明理工大学 | Positioning method for text form near word error |
CN109460455B (en) * | 2018-10-25 | 2020-04-28 | 第四范式(北京)技术有限公司 | Text detection method and device |
CN112036136A (en) * | 2020-09-01 | 2020-12-04 | 文思海辉智科科技有限公司 | Quality inspection data processing method and device, electronic equipment and readable storage medium |
CN113705203A (en) * | 2021-09-02 | 2021-11-26 | 上海极链网络科技有限公司 | Text error correction method and device, electronic equipment and computer readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102867040A (en) * | 2012-08-31 | 2013-01-09 | 中国科学院计算技术研究所 | Chinese search engine mixed speech-oriented query error correction method and system |
CN104750672A (en) * | 2013-12-27 | 2015-07-01 | 重庆新媒农信科技有限公司 | Chinese word error correction method used in search and device thereof |
CN105279149A (en) * | 2015-10-21 | 2016-01-27 | 上海应用技术学院 | Chinese text automatic correction method |
CN107193921A (en) * | 2017-05-15 | 2017-09-22 | 中山大学 | The method and system of the Sino-British mixing inquiry error correction of Search Engine-Oriented |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7092567B2 (en) * | 2002-11-04 | 2006-08-15 | Matsushita Electric Industrial Co., Ltd. | Post-processing system and method for correcting machine recognized text |
Non-Patent Citations (1)
Title |
---|
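A parallel query error correction method supporting mixed languages (一种支持混合语言的并行查询纠错方法); Zhuan Yue (颛悦) et al.; Journal of Chinese Information Processing (《中文信息学报》); 2016-03-31; Vol. 30, No. 2; pp. 1-8 * |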
Also Published As
Publication number | Publication date |
---|---|
CN108132917A (en) | 2018-06-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||