CN108132917B - Document error correction marking method

Info

Publication number: CN108132917B
Application number: CN201711257231.0A
Authority: CN
Inventors: 龙华; 祁俊辉; 毕丹红; 唐菁敏
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2021-12-17
Anticipated expiration: 2037-12-04
Also published as: CN108132917A

Abstract

The invention relates to a document error correction marking method and belongs to the technical field of information processing. The method divides the document to be corrected into a set of small units using separators, segments every set element into words with a word segmentation algorithm, looks up each English word in an English word database and records any spelling errors, and then computes the length of every word in each word set. If several consecutive independent words occur, that is, several consecutive words all have length 1, they are recorded in an error word set. Finally, the document to be corrected is marked according to the data in the error word set, and the error-corrected document is generated and exported. Compared with the prior art, the method mainly addresses the poor support of existing techniques for languages other than English, in particular the incomplete error correction marking of Chinese documents, and aims to improve computer support for error correction marking of Chinese documents.

Description

Document error correction marking method

Technical Field

The invention relates to a document error correction marking method, and belongs to the technical field of information processing.

Background

Document error correction marking is an important and commonly used technique in information processing. Word processors such as WORD and WPS generally provide an error correction marking function, which reminds the document editor that a spelling error or a logic error may occur somewhere in the document.

At present, traditional document error correction marking methods mainly mark misspelled English words and support non-English languages such as Chinese poorly. English words are separated by spaces, so no word segmentation is required, the words are strongly independent of one another, and whether a word is correct can be judged by looking it up in an English word database. Chinese, by contrast, requires word segmentation, and whether viewed statistically or grammatically there is no fixed rule between words, so error correction marking of Chinese documents has long been difficult to implement.

Disclosure of Invention

The technical problem to be solved by the present invention is to provide a document error correction marking method that solves the above problems.

The technical scheme of the invention is as follows: a document error correction marking method divides the document to be corrected into a set of small units, segments the set elements into words, looks up whether English words are misspelled and records the misspellings, computes the word lengths of all word sets, records several consecutive independent words, if present, in an error word set, and marks the errors in the document to be corrected according to the data in the error word set.

The method specifically comprises the following steps:

Step 1: Acquire the document X to be corrected.

Step 2: Divide the document X to be corrected into a set using separators, namely split X into X: {X₁, X₂, …, X_n}.

Step 3: Segment the set element X_i, i ∈ [1, n], of the document X with a word segmentation algorithm to obtain the word set X_i: {x_i1, x_i2, …, x_im} corresponding to X_i.

Step 4: Traverse the elements x_ij, j ∈ [1, m], of the word set X_i: {x_i1, x_i2, …, x_im}. If x_ij is an English word, look it up in an English word database; if it exists, ignore it, and if it does not exist, record its subscript ij in the error word set ERROR.

Step 5: Traverse the elements x_ij, j ∈ [1, m], of the word set X_i: {x_i1, x_i2, …, x_im}, compute the length len_ij of each element x_ij, and generate the length set len_i: {len_i1, len_i2, …, len_im} corresponding to the word set.

Step 6: Define an error correction threshold P and traverse the length set len_i: {len_i1, len_i2, …, len_im}. If there are several consecutive independent words, that is, P or more consecutive len_ij, j ∈ [1, m], equal to 1, record the subscripts ij corresponding to these len_ij in the error word set ERROR.

Step 7: Traverse all elements X_i, i ∈ [1, n], of the set form X₁, X₂, …, X_n of the document X to be corrected and perform Steps 3, 4, 5 and 6 on each.

Step 8: According to the subscripts ij in the error word set ERROR, add red wavy lines under the text at the corresponding positions of the document X to be corrected, and generate and export the corrected document X'.

In Steps 1 to 8, n, m ∈ N⁺.
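The result of Steps 4 to 6 can be restated compactly as a set definition. This is only a summary of the steps above under one reading of Step 6 (the whole run of length-1 words is recorded, as in the embodiment); the symbol D, standing for the English word database, is introduced here for illustration and does not appear in the original text.

```latex
\begin{aligned}
\mathrm{ERROR} ={}& \{\, ij \;:\; x_{ij} \text{ is an English word and } x_{ij} \notin D \,\} \\
&\cup\; \{\, ij \;:\; \exists\, a \le j \le b \text{ with } b - a + 1 \ge P
      \text{ and } len_{ik} = 1 \text{ for all } k \in [a, b] \,\},
\qquad i \in [1, n],\; j \in [1, m].
\end{aligned}
```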

Further, in Step 1, the document X to be corrected may be a fully Chinese document, a combined Chinese and English document, or, as a special case, a fully English document.

Further, in Step 2, the separators are periods, exclamation marks, question marks, ellipses, line breaks and page breaks, in either their Chinese or English forms.
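A minimal sketch of the splitting in Step 2, assuming the separators are matched with a regular expression. The function name `split_document` and the choice to treat the form feed character as the page break are assumptions made for illustration, and splitting on the English period will also break decimals and abbreviations, which a practical implementation would need to handle.

```python
import re

# Separators named in the text, in Chinese and English forms: periods,
# exclamation marks, question marks, ellipses, line breaks, and page breaks
# (assumed here to be the form feed character). Consecutive separators are
# treated as one boundary.
SEPARATORS = re.compile(r"[。.!！?？…\r\n\f]+")


def split_document(text):
    """Split document X into its set elements X_1..X_n, dropping empty pieces."""
    return [piece.strip() for piece in SEPARATORS.split(text) if piece.strip()]


if __name__ == "__main__":
    for element in split_document("今天天气好。Is it ok? 好的！\n结束"):
        print(element)
```

The length n of the resulting list corresponds to the number of set elements in X.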

Further, in Step 3, the word segmentation algorithm should satisfy the following: a) Chinese sentences are segmented normally; b) English words are segmented normally; c) a string of digits is kept as a separate item; d) punctuation and spaces are removed.
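The text does not name a particular segmentation algorithm. As one possible sketch that meets requirements a) to d), the following uses forward maximum matching against a dictionary for Chinese runs and regular expressions for English words and digit strings. `TOY_DICT` is a hypothetical five-entry lexicon used only for illustration; a real system would use a full dictionary or an existing segmenter.

```python
import re

# Hypothetical toy lexicon; not part of the original method.
TOY_DICT = {"昆明", "理工大学", "简称", "位于", "云南省"}

# English letter runs, digit runs, or runs of CJK ideographs. Anything else
# (punctuation, whitespace) is skipped, which covers requirement d).
CHUNK = re.compile(r"[A-Za-z]+|[0-9]+|[\u4e00-\u9fff]+")


def segment_chinese(run, dictionary=TOY_DICT):
    """Forward maximum matching: take the longest dictionary word at each
    position, falling back to a single character when nothing matches."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, pos = [], 0
    while pos < len(run):
        for size in range(min(max_len, len(run) - pos), 0, -1):
            candidate = run[pos:pos + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words


def segment(sentence, dictionary=TOY_DICT):
    """Segment one set element X_i into its word set."""
    words = []
    for chunk in CHUNK.findall(sentence):
        if "\u4e00" <= chunk[0] <= "\u9fff":
            words.extend(segment_chinese(chunk, dictionary))
        else:
            words.append(chunk)  # English word or digit string, kept whole
    return words


if __name__ == "__main__":
    print(segment("昆明理工大学（Kunming Univelsity），创建于1954年"))
```

With this kind of segmenter, text that matches no dictionary entry falls back to single characters, which is the signal the length check in Step 6 relies on.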

Further, in Step 6, the error correction threshold P may be set according to the specific use environment, subject to equation (3); P is generally set to 3.

P≥3 (3)

The beneficial effects of the invention are as follows: compared with the prior art, the method mainly addresses the poor support of existing techniques for languages other than English, in particular the incomplete error correction marking of Chinese documents, and aims to improve computer support for error correction marking of Chinese documents.

Drawings

FIG. 1 is a schematic flow diagram of the present invention.

Detailed Description

The invention is further described with reference to the following drawings and detailed description.

Example 1: the invention considers that, after a Chinese sentence has been segmented normally, several consecutive independent words do not fit the logical structure of a normal sentence, so an error is taken to be present at that position. On this basis, the document to be corrected is divided into a set of small units using separators, and every set element is segmented into words with a word segmentation algorithm. Each English word is then looked up in an English word database and any spelling errors are recorded. Next, the word lengths of all word sets are computed; if several consecutive independent words occur, that is, several consecutive words all have length 1, they are recorded in an error word set. Finally, the document to be corrected is marked according to the data in the error word set, and the error-corrected document is generated and exported.

A document error correction marking method specifically comprises the following steps:

Step 1: Acquire the document X to be corrected; specifically:

Assume that the content of the document X to be corrected is "Kunming University of Science and Technology (Kunming Univelsity of Science and Technology), abbreviated as Kungong, is located in Ming Kunming City, Yunnan Province, was founded in 1954 under the name Kunming Institute of Technology, and was renamed Kunming University of Science and Technology in 1995. In 1999, the former Kunming University of Science and Technology and the former Yunnan Industrial University were combined to establish the new Kunming University of Science and Technology. It has developed into a provincial key university that is mainly engineering-oriented, combines science and engineering, and develops multiple disciplines in a coordinated way.".

Step 2: Divide the document X to be corrected into a set using separators, namely split X into X: {X₁, X₂, …, X_n}; specifically:

After division into set form, the set has length n = 3. X₁ is "Kunming University of Science and Technology (Kunming Univelsity of Science and Technology), abbreviated as Kungong, is located in Ming Kunming City, Yunnan Province, was founded in 1954 under the name Kunming Institute of Technology, and was renamed Kunming University of Science and Technology in 1995.", X₂ is "In 1999, the former Kunming University of Science and Technology and the former Yunnan Industrial University were combined to establish the new Kunming University of Science and Technology.", and X₃ is "It has developed into a provincial key university that is mainly engineering-oriented, combines science and engineering, and develops multiple disciplines in a coordinated way.".

Step 3: Segmenting the set element X₁ yields the word set X₁: "Kunming, university of science, Kunming, Univelsity, of, Science, and, Technology, acronym, Kunming, worker, location, Yunnan province, Ming, Kunming, City, creation, 1954, renamed, Kunming, university of science".

Step 4: Traverse the elements x_ij, j ∈ [1, m], of the word set X_i: {x_i1, x_i2, …, x_im}. If x_ij is an English word, look it up in an English word database; if it exists, ignore it, and if it does not exist, record its subscript ij in the error word set ERROR; specifically:

Traversing the word set X₁ finds that the words x_1,3, x_1,4, x_1,5, x_1,6, x_1,7 and x_1,8 are all English words. Looking them up in the English word database shows that x_1,4 ("Univelsity") does not exist, so the word subscript ij = 1,4 is recorded in the error word set ERROR.
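The lookup in Step 4 can be sketched as follows for the first eight words of X₁. `ENGLISH_WORDS` is a hypothetical stand-in for the English word database, the case-insensitive comparison is an assumption, and the two Chinese words at positions 1 and 2 are an assumed rendering consistent with the lengths 2 and 4 given in the embodiment.

```python
import re

# Hypothetical stand-in for the English word database (lowercase entries).
ENGLISH_WORDS = {"kunming", "university", "of", "science", "and", "technology"}


def is_english(word):
    """Treat a word as English when it consists only of ASCII letters."""
    return re.fullmatch(r"[A-Za-z]+", word) is not None


def misspelled_subscripts(i, words, database=ENGLISH_WORDS):
    """Return the 1-based subscripts (i, j) of English words in the word set
    X_i that are absent from the database."""
    return [
        (i, j)
        for j, word in enumerate(words, start=1)
        if is_english(word) and word.lower() not in database
    ]


if __name__ == "__main__":
    x1 = ["昆明", "理工大学", "Kunming", "Univelsity", "of", "Science", "and", "Technology"]
    print(misspelled_subscripts(1, x1))  # [(1, 4)]
```

Chinese words and digit strings are skipped by this check; only the misspelled English word at position 4 is recorded.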

Step 5: Computing the length len_1j of each element x_1j in the word set X₁ generates the length set len₁: "2, 4, 7, 10, 2, 7, 3, 10, 2, 1, 1, 3, 3, 2, 1, 1, 1, 2, 1, 5, 2, 1, 2, 4".

Step 6: With the error correction threshold defined as P = 3, traversing the length set finds that len_1,15, len_1,16 and len_1,17 all equal 1, that is, they correspond to 3 consecutive independent words, so the subscripts ij = 1,15, ij = 1,16 and ij = 1,17 are recorded in the error word set ERROR.
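The threshold check of Step 6 can be sketched as a single pass over the length set. The function name `independent_word_runs` is an assumption, and the sketch records every subscript in a qualifying run, which matches the embodiment. The input is the length set len₁ given above.

```python
def independent_word_runs(i, lengths, p=3):
    """Return the 1-based subscripts (i, j) that belong to runs of at least
    p consecutive words of length 1 in the length set len_i."""
    flagged, run = [], []
    for j, length in enumerate(lengths, start=1):
        if length == 1:
            run.append((i, j))
            continue
        if len(run) >= p:
            flagged.extend(run)
        run = []
    if len(run) >= p:  # a qualifying run may end at the last word
        flagged.extend(run)
    return flagged


if __name__ == "__main__":
    len_1 = [2, 4, 7, 10, 2, 7, 3, 10, 2, 1, 1, 3, 3, 2,
             1, 1, 1, 2, 1, 5, 2, 1, 2, 4]
    print(independent_word_runs(1, len_1))  # [(1, 15), (1, 16), (1, 17)]
```

The two length-1 words at positions 10 and 11 are not recorded because the run is shorter than P = 3.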

Step 7: Steps 3, 4, 5 and 6 are likewise performed on the set elements X₂ and X₃, which yields the subscripts ij = 2,9, ij = 2,10, ij = 2,11, ij = 2,12, ij = 3,1, ij = 3,2 and ij = 3,3 of consecutive independent words; these subscripts are recorded in the error word set ERROR.

Step 8: Red wavy lines are added under the text at the positions given by the subscripts in the error word set ERROR, and the corrected document X' is generated and exported. The corrected document X' is "Kunming University of Science and Technology (Kunming Univelsity of Science and Technology), abbreviated as Kungong, is located in Ming Kunming City, Yunnan Province, was founded in 1954 under the name Kunming Institute of Technology, and was renamed Kunming University of Science and Technology in 1995. In 1999, the former Kunming University of Science and Technology and the former Yunnan Industrial University were combined to establish the new Kunming University of Science and Technology. It has developed into a provincial key university that is mainly engineering-oriented, combines science and engineering, and develops multiple disciplines in a coordinated way.".
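Red wavy underlines depend on the output document format, which the text does not specify. As a plain-text stand-in, the following sketch wraps each flagged word in bracket markers. It assumes the segmenter also reports the character offsets of each word; the function name `mark_errors`, the `spans` parameter and the `[[ ]]` markers are all hypothetical.

```python
def mark_errors(sentence, spans, flagged, open_mark="[[", close_mark="]]"):
    """Wrap the flagged words of one set element in markers.

    spans:   list of (start, end) character offsets, one per word, in order.
    flagged: set of 1-based word positions j recorded in ERROR for this element.
    """
    out, cursor = [], 0
    for j, (start, end) in enumerate(spans, start=1):
        out.append(sentence[cursor:start])  # untouched text between words
        word = sentence[start:end]
        out.append(f"{open_mark}{word}{close_mark}" if j in flagged else word)
        cursor = end
    out.append(sentence[cursor:])  # trailing text after the last word
    return "".join(out)


if __name__ == "__main__":
    sentence = "Kunming Univelsity of Science"
    spans = [(0, 7), (8, 18), (19, 21), (22, 29)]
    print(mark_errors(sentence, spans, {2}))
    # Kunming [[Univelsity]] of Science
```

Working from character offsets keeps punctuation and spacing of the original document intact, so only the marking is added.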

The result of this embodiment shows that the method can better perform error correction marking on Chinese documents and, combined with the English error correction marking of traditional algorithms, can also mark documents that mix Chinese and English.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims

1. A document error correction marking method is characterized by comprising the following steps:

Step 1: acquiring a document X to be corrected;

Step 2: dividing the document X to be corrected into a set using separators, namely splitting X into X: {X₁, X₂, …, X_n};

Step 3: segmenting the set element X_i, i ∈ [1, n], of the document X to be corrected with a word segmentation algorithm to obtain the word set X_i: {x_i1, x_i2, …, x_im} corresponding to X_i;

Step 4: traversing the elements x_ij, j ∈ [1, m], of the word set X_i: {x_i1, x_i2, …, x_im}; if x_ij is an English word, looking it up in an English word database; if it exists, ignoring it, and if it does not exist, recording its subscript ij in an error word set ERROR;

Step 5: traversing the elements x_ij, j ∈ [1, m], of the word set X_i: {x_i1, x_i2, …, x_im}, computing the length len_ij of each element x_ij, and generating the length set len_i: {len_i1, len_i2, …, len_im} corresponding to the word set;

Step 6: defining an error correction threshold P and traversing the length set len_i: {len_i1, len_i2, …, len_im}; if there are several consecutive independent words, that is, P or more consecutive len_ij, j ∈ [1, m], equal to 1, recording the subscripts ij corresponding to these len_ij in the error word set ERROR;

Step 7: traversing all elements X_i, i ∈ [1, n], of the set form X₁, X₂, …, X_n of the document X to be corrected and performing Step 3, Step 4, Step 5 and Step 6 on each;

Step 8: according to the subscripts ij in the error word set ERROR, adding red wavy lines under the text at the corresponding positions of the document X to be corrected, and generating and exporting the error-corrected document X';

in Steps 1 to 8, n, m ∈ N⁺.

2. The document error correction marking method according to claim 1, characterized in that: in Step 1, the document X to be corrected is a fully Chinese document, a combined Chinese and English document, or a fully English document.

3. The document error correction marking method according to claim 1, characterized in that: in Step 2, the separators are periods, exclamation marks, question marks, ellipses, line breaks and page breaks, in either their Chinese or English forms.

4. The document error correction marking method according to claim 1, characterized in that: in Step 3, the word segmentation algorithm should satisfy:

a. Chinese sentences are segmented normally;

b. English words are segmented normally;

c. a string of digits is kept as a separate item;

d. punctuation and spaces are removed.

5. The document error correction marking method according to claim 1, characterized in that: in Step 6, the value range of the error correction threshold P is P ≥ 3.