CN102479174A - Automatic checking and error-correction system for Chinese characters in GBK (Chinese Internal Code Specification) encoding, and method thereof

Publication number
CN102479174A
Authority
CN
China
Prior art keywords
error correction
chinese character
charnum
chinese
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105555696A
Other languages
Chinese (zh)
Other versions
CN102479174B (en)
Inventor
陈运文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhangmen Science and Technology Co Ltd
Original Assignee
Shengle Information Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengle Information Technology Shanghai Co Ltd
Priority to CN201010555569.6A
Publication of CN102479174A
Application granted
Publication of CN102479174B
Legal status: Active

Landscapes

  • Character Discrimination (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an automatic checking and error-correction system for Chinese characters in GBK (Chinese Internal Code Specification) encoding, comprising an encoding anomaly detection module, an error-correction attempt module and an error-correction judgment module. The invention also discloses an automatic checking and error-correction method implemented on the basis of this system. The system and method can identify and correct abnormal GBK encoding and recover the correct text characters. In use, the encoding anomaly detection module first identifies text whose encoding is abnormal; the error-correction attempt module then tries various correction schemes on that text; finally, the error-correction judgment module selects the best correction scheme and restores the text to normally encoded text, so that the characters display correctly.

Description

Automatic checking and error-correction system for GBK-encoded Chinese characters, and method thereof
Technical field
The present invention relates to an automatic checking and error-correction system for Chinese characters that is applicable to GBK encoding. The invention also relates to a method of applying this system.
Background art
GBK, whose full name is the "Chinese Internal Code Specification", is a Chinese character encoding standard for computers drawn up by the National Information Technology Standardization Technical Committee. It is an internal-code extension built on the GB2312 standard and is therefore fully compatible with GB2312.
GBK contains 21003 Chinese characters in total, each represented by 2 bytes: the first byte lies in the range 0x81-0xFE and the second byte in the range 0x40-0xFE. For example, the GBK encoding of the phrase 万钢充分肯定科技 ("Wan Gang fully affirms science and technology") is, in hexadecimal, "CD F2 B8 D6 B3 E4 B7 D6 BF CF B6 A8 BF C6 BC BC", where "CD F2" corresponds to the character 万, "B8 D6" to the character 钢, and so on.
However, GBK encoding has poor fault tolerance: once a byte is dropped from the encoded stream, every subsequent Chinese character is decoded incorrectly. Take the phrase above: if, for whatever reason, the first byte "CD" of its GBK byte string is lost, the system still displays the remaining bytes two at a time as Chinese characters, and the phrase is shown as an unrelated string of characters (rendered in this translation as "buffalo gnat grows duckweed Ji, the sincere crack of pouring"). Every character in the phrase is scrambled. This phenomenon is called the "avalanche effect" of Chinese character encoding. It occurs frequently when Chinese text is transmitted over the Internet: a tiny transmission error is greatly amplified, and the whole text becomes unreadable.
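The avalanche effect described above can be reproduced with a few lines of Python. This is only an illustrative sketch: it relies on Python's built-in "gbk" codec, and the variable names are not taken from the patent.

```python
# The 16 bytes quoted in the background section, written as hex.
data = bytes.fromhex("CDF2 B8D6 B3E4 B7D6 BFCF B6A8 BFC6 BCBC")

# Decoded from the correct starting byte, every pair is one character.
original = data.decode("gbk")

# Simulate losing the first byte in transit. errors="replace" stops the
# decoder from raising on the single byte left over at the end.
garbled = data[1:].decode("gbk", errors="replace")

print(original)   # the intended eight-character phrase
print(garbled)    # every pair is now misaligned
```

Because each pair after the lost byte is formed from the low byte of one character and the high byte of the next, a single missing byte changes every character that follows.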
Summary of the invention
The technical problem to be solved by the present invention is to provide an automatic checking and error-correction system for GBK-encoded Chinese characters that can identify and correct abnormal GBK encoding and recover the correct text characters.
To solve this technical problem, the automatic checking and error-correction system for GBK-encoded Chinese characters of the present invention comprises:
An encoding anomaly detection module, used to detect whether a GBK-encoded Chinese character string is abnormally encoded;
An error-correction attempt module, used to attempt GBK encoding corrections on the abnormally encoded Chinese character string identified by the encoding anomaly detection module, and to pass the attempt results to the error-correction judgment module;
An error-correction judgment module, used to judge whether the attempt results passed on by the error-correction attempt module are reasonable and, on the basis of a reasonable attempt result, to correct the Chinese character string and output the corrected text.
The encoding is judged abnormal on the basis of the garbling degree of the Chinese character string. The garbling degree is obtained by calculating the difference between the number of uncommon Chinese characters and the number of commonly used Chinese characters in the string. Whether an attempt result of the error-correction attempt module is reasonable can be judged from the garbling degree of the string before and after the correction attempt.
The GBK encoding correction attempt consists of recombining the high and low bytes of the GBK encoding of the Chinese character string.
Another technical problem to be solved by the present invention is to provide an automatic checking and error-correction method for Chinese characters based on the above system.
To solve this technical problem, the automatic checking and error-correction method for GBK-encoded Chinese characters of the present invention comprises the following steps:
1) Starting from the head of the Chinese text to be checked, traverse the text in order and judge whether two consecutive bytes of its GBK encoding satisfy the condition that the first byte lies in 0x81-0xFE and the second byte lies in 0x40-0xFE. If the condition is satisfied, record the two bytes in the check string; if not, set the second byte as the starting point for the subsequent traversal;
2) Repeat step 1), traversing the subsequent text in order; when the length of the check string reaches a preset number of bytes, go to step 3);
3) Set the initial values of two counters, count_1 and count_2, to 0, and judge whether each Chinese character in the check string is a high-frequency Chinese character. If it is, add 1 to count_1; if it is not, further judge whether the character lies within the character range B0A1 to F7FE of the GB2312 standard, and if it does not, add 1 to count_2;
4) Calculate the garbling degree of the check string: charnum = count_2 - count_1;
5) Examine the value of charnum obtained in step 4). If charnum < 3, the check string is considered normally encoded; go to step 8). If charnum ≥ 3, the check string is considered wrongly encoded; go to step 6);
6) Remove the first and last bytes of the check string, count count_1 and count_2 as in step 3), and calculate the garbling degree after the correction attempt, charnum_new;
7) Compare charnum with charnum_new. If charnum - charnum_new > 8, the correction has succeeded; output the corrected text. If 4 < charnum - charnum_new ≤ 8, take the first byte following this check string as the traversal starting point, repeat steps 1) to 7), and judge whether the next check string also satisfies 4 < charnum - charnum_new ≤ 8; if it does, the correction has succeeded, and the corrected text is output;
8) Traverse the subsequent character string according to steps 1) to 7) until all characters of the Chinese text have been traversed.
The automatic checking and error-correction system and method for GBK-encoded Chinese characters of the present invention use the statistical regularities of Chinese text to detect abnormal GBK encoding automatically, then try various correction methods, and then use a statistical classification method to judge whether a correction scheme is suitable. The wrongly encoded text is finally restored to normally encoded text, which overcomes the tendency of GBK-encoded text to suffer the "avalanche effect".
Description of drawings
The present invention is explained in further detail below with reference to the accompanying drawing and an embodiment:
The accompanying drawing is a structural diagram of the system of the present invention.
Embodiment
To give a more specific understanding of the technical content, characteristics and effects of the present invention, it is described in detail below with reference to the illustrated embodiment:
As shown in the accompanying drawing, the automatic checking and error-correction system for Chinese characters of this embodiment comprises:
An encoding anomaly detection module, used to identify whether a GBK-encoded Chinese character string input to the system is abnormally encoded;
An error-correction attempt module, used to recombine the high and low bytes of the GBK encoding of a Chinese character string that the encoding anomaly detection module has found to be abnormally encoded, and to pass the attempt result obtained after recombination to the error-correction judgment module;
An error-correction judgment module, used to receive the attempt results passed on by the error-correction attempt module, to judge from the statistical features of Chinese whether each correction scheme of the error-correction attempt module is reasonable, to select the best correction scheme, to correct the Chinese character string, and to output the corrected text.
When the automatic checking and error-correction system of the above embodiment is used to check and correct GBK-encoded Chinese text, the following steps are carried out:
Step 1: Starting from the head of the text, traverse the text in order and judge whether two consecutive bytes of its GBK encoding (denoted B1 and B2 respectively) satisfy the condition:
0x81 ≤ B1 ≤ 0xFE
0x40 ≤ B2 ≤ 0xFE
If the condition is satisfied, record the byte pair in the check string (verify_str) and continue traversing the subsequent characters. If it is not satisfied, advance by a step of 1 byte, that is, resume the traversal of the text string from B2.
When the length of the check string reaches 40 bytes (i.e. 20 Chinese characters), go to step 2.
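Step 1 can be sketched roughly as follows. Only verify_str and the two byte ranges come from the text; the function name, its parameters and the returned index are assumptions made for this illustration.

```python
def collect_verify_str(data: bytes, start: int = 0, target_len: int = 40):
    """Collect byte pairs that look like GBK double-byte characters.

    Returns (verify_str, next_index), where next_index is the position
    at which a later scan could resume. Hypothetical helper, not the
    patent's own code.
    """
    verify_str = bytearray()
    i = start
    while i + 1 < len(data) and len(verify_str) < target_len:
        b1, b2 = data[i], data[i + 1]
        # Ranges as stated in the text (the GBK standard itself also
        # excludes 0x7F as a second byte).
        if 0x81 <= b1 <= 0xFE and 0x40 <= b2 <= 0xFE:
            verify_str += data[i:i + 2]
            i += 2      # pair accepted: continue after it
        else:
            i += 1      # pair rejected: restart from B2
    return bytes(verify_str), i


sample = b"ab" + bytes.fromhex("CDF2B8D6")
print(collect_verify_str(sample))
```

ASCII bytes such as "a" and "b" fail the first-byte test and are skipped one byte at a time, so only the two double-byte pairs end up in the check string.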
Step 2: Classify in turn each Chinese character identified in step 1, that is, each Chinese character in the check string. Judge whether each character is a high-frequency Chinese character (the high-frequency character table of this embodiment is taken from the book "Which Chinese Characters Are the Most Commonly Used", compiled by the Committee for Reforming the Chinese Written Language and the State Bureau of Standardization and published in February 1986 by the Writing Reform Publishing House; it contains 700 commonly used characters in total). If it is, add 1 to counter 1 (count_1). If it is not, judge whether the character lies within the simplified Chinese character encoding range of GB2312 (B0A1 to F7FE); if it falls outside that range, add 1 to counter 2 (count_2). The initial values of count_1 and count_2 are both 0.
Then, from the values of count_1 and count_2, calculate the garbling degree of the check string with the formula:
charnum = count_2 - count_1
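A minimal sketch of the garbling-degree calculation is given below. The ten-character HIGH_FREQ set is only a placeholder standing in for the 700-character table cited in the text, and the function names are assumptions; the B0A1 to F7FE range is read here as first byte 0xB0-0xF7 with second byte 0xA1-0xFE.

```python
# Placeholder only: the embodiment uses a 700-character table.
HIGH_FREQ = frozenset("的一是在不了有和人这")


def in_gb2312_hanzi_range(b1: int, b2: int) -> bool:
    """True if the byte pair lies in the GB2312 hanzi area B0A1-F7FE."""
    return 0xB0 <= b1 <= 0xF7 and 0xA1 <= b2 <= 0xFE


def garble_degree(verify_str: bytes, high_freq=HIGH_FREQ) -> int:
    """Return charnum = count_2 - count_1 for a check string."""
    count_1 = 0  # high-frequency characters
    count_2 = 0  # characters outside the GB2312 hanzi range
    for i in range(0, len(verify_str) - 1, 2):
        b1, b2 = verify_str[i], verify_str[i + 1]
        ch = verify_str[i:i + 2].decode("gbk", errors="replace")
        if ch in high_freq:
            count_1 += 1
        elif not in_gb2312_hanzi_range(b1, b2):
            count_2 += 1
    return count_2 - count_1


print(garble_degree("的的".encode("gbk")))      # two common characters
print(garble_degree(bytes([0x81, 0x40]) * 3))  # three out-of-range pairs
```

Normal text drives the score down through count_1, while misaligned text tends to produce pairs outside the GB2312 hanzi area and drives it up through count_2.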
Step 3: The encoding anomaly detection module examines the value of charnum calculated in step 2. If charnum < 3, the check string is considered normally encoded and its encoding needs no adjustment. If charnum ≥ 3, the check string is considered wrongly encoded; proceed to step 4, where the error-correction attempt module tries to correct the encoding error in this section of text.
Step 4: The error-correction attempt module removes the first byte and the last byte of the check string, then counts count_1 and count_2 by the method of step 2, calculates from them the new garbling degree of the text (charnum_new), and passes it to the error-correction judgment module.
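The byte removal in step 4 amounts to shifting every pair boundary by one position. The sketch below illustrates this on the first eight bytes of the background example after its leading byte has been lost; try_realign is a hypothetical name, not one used in the patent.

```python
def try_realign(verify_str: bytes) -> bytes:
    """Drop the first and last byte so that pairs shift by one byte."""
    return verify_str[1:-1]


# First eight bytes of the background example once 0xCD is missing.
misaligned = bytes.fromhex("F2B8 D6B3 E4B7 D6BF")
realigned = try_realign(misaligned)

print(realigned.hex(" ").upper())   # B8 D6 B3 E4 B7 D6
print(realigned.decode("gbk"))      # three correctly paired characters
```

The realigned bytes begin with "B8 D6", which the background section identifies as the second character of the original phrase, so the pairing has been restored from that point on. The garbling degree of the realigned string would then be recomputed as in step 2.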
Step 5: The error-correction judgment module compares charnum with charnum_new. If charnum - charnum_new > 8, the GBK encoding of this section of text has been effectively corrected; the module judges the GBK encoding correction successful and outputs the corrected text. If 4 < charnum - charnum_new ≤ 8, the module continues, following steps 1 to 4, to judge whether the check string formed by the next 20 Chinese characters of the text also satisfies 4 < charnum - charnum_new ≤ 8; if it does, the correction is considered successful and the corrected text is output, and if it does not, the correction is considered to have failed. If charnum - charnum_new ≤ 4, the system likewise considers the correction to have failed. When correction fails, the system applies no correction to this section of text.
Repeat steps 1 to 5 above, continuing to traverse the subsequent character strings of the text, until all characters of the Chinese text have been traversed.
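The thresholds used in steps 3 and 5 can be summarised in a small decision sketch. The constant and function names are assumptions; next_gain stands for the value of charnum - charnum_new measured on the following 20-character check string, and is None when no such measurement is available.

```python
DETECT_THRESHOLD = 3   # charnum >= 3 marks a check string as abnormal
STRONG_GAIN = 8        # an improvement above this is accepted at once
WEAK_GAIN = 4          # an improvement above this needs confirmation


def is_abnormal(charnum: int) -> bool:
    """Step 3: decide whether a correction attempt is needed."""
    return charnum >= DETECT_THRESHOLD


def correction_succeeded(charnum: int, charnum_new: int, next_gain=None) -> bool:
    """Step 5: decide whether the realigned text should be output."""
    gain = charnum - charnum_new
    if gain > STRONG_GAIN:
        return True
    if WEAK_GAIN < gain <= STRONG_GAIN:
        # Borderline case: the next check string must show a gain in
        # the same band, read literally from the text.
        return next_gain is not None and WEAK_GAIN < next_gain <= STRONG_GAIN
    return False


print(correction_succeeded(10, 0))               # True
print(correction_succeeded(9, 1, next_gain=6))   # True
print(correction_succeeded(5, 3))                # False
```

Requiring a second window in the borderline band trades some recall for fewer false corrections, since a moderate improvement in a single 20-character window could arise by chance.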

Claims (6)

1. An automatic checking and error-correction system for GBK-encoded Chinese characters, characterized in that it comprises:
An encoding anomaly detection module, used to detect whether a GBK-encoded Chinese character string is abnormally encoded;
An error-correction attempt module, used to attempt GBK encoding corrections on the abnormally encoded Chinese character string identified by the encoding anomaly detection module, and to pass the attempt results to the error-correction judgment module;
An error-correction judgment module, used to judge whether the attempt results passed on by the error-correction attempt module are reasonable and, on the basis of a reasonable attempt result, to correct the Chinese character string and output the corrected text.
2. The automatic checking and error-correction system for Chinese characters as claimed in claim 1, characterized in that: the encoding is judged abnormal on the basis of the garbling degree of the Chinese character string.
3. The automatic checking and error-correction system for Chinese characters as claimed in claim 2, characterized in that: the garbling degree is the difference between the number of uncommon Chinese characters and the number of commonly used Chinese characters in the Chinese character string.
4. The automatic checking and error-correction system for Chinese characters as claimed in claim 1, characterized in that: the GBK encoding correction attempt consists of recombining the high and low bytes of the GBK encoding of the Chinese character string.
5. The automatic checking and error-correction system for Chinese characters as claimed in claim 2 or 3, characterized in that: whether an attempt result is reasonable is judged on the basis of the garbling degree of the Chinese character string before and after the correction attempt.
6. An automatic checking and error-correction method for GBK-encoded Chinese characters, characterized in that it comprises the following steps:
1) Starting from the head of the Chinese text to be checked, traverse the text in order and judge whether two consecutive bytes of its GBK encoding satisfy the condition that the first byte lies in 0x81-0xFE and the second byte lies in 0x40-0xFE. If the condition is satisfied, record the two bytes in the check string; if not, set the second byte as the starting point for the subsequent traversal;
2) Repeat step 1), traversing the subsequent text in order; when the length of the check string reaches a preset number of bytes, go to step 3);
3) Set the initial values of two counters, count_1 and count_2, to 0, and judge whether each Chinese character in the check string is a high-frequency Chinese character. If it is, add 1 to count_1; if it is not, further judge whether the character lies within the character range B0A1 to F7FE of the GB2312 standard, and if it does not, add 1 to count_2;
4) Calculate the garbling degree of the check string: charnum = count_2 - count_1;
5) Examine the value of charnum obtained in step 4). If charnum < 3, the check string is considered normally encoded; go to step 8). If charnum ≥ 3, the check string is considered wrongly encoded; go to step 6);
6) Remove the first and last bytes of the check string, count count_1 and count_2 as in step 3), and calculate the garbling degree after the correction attempt, charnum_new;
7) Compare charnum with charnum_new. If charnum - charnum_new > 8, the correction has succeeded; output the corrected text. If 4 < charnum - charnum_new ≤ 8, take the first byte following this check string as the traversal starting point, repeat steps 1) to 7), and judge whether the next check string also satisfies 4 < charnum - charnum_new ≤ 8; if it does, the correction has succeeded, and the corrected text is output;
8) Traverse the subsequent character string according to steps 1) to 7) until all characters of the Chinese text have been traversed.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010555569.6A CN102479174B (en) 2010-11-23 2010-11-23 For Chinese character automatic Verification and error correction system and the method thereof of GBK coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010555569.6A CN102479174B (en) 2010-11-23 2010-11-23 For Chinese character automatic Verification and error correction system and the method thereof of GBK coding

Publications (2)

Publication Number Publication Date
CN102479174A true CN102479174A (en) 2012-05-30
CN102479174B CN102479174B (en) 2016-03-16

Family

ID=46091824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010555569.6A Active CN102479174B (en) 2010-11-23 2010-11-23 For Chinese character automatic Verification and error correction system and the method thereof of GBK coding

Country Status (1)

Country Link
CN (1) CN102479174B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101256556A (en) * 2008-03-17 2008-09-03 无敌科技(西安)有限公司 Method for detecting Thai data
CN101276245A (en) * 2008-04-16 2008-10-01 北京搜狗科技发展有限公司 Reminding method and system for coding to correct error in input process

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
张淑平: "《程序员教程》", 31 August 2009, article "计算机中数据的表示及运算", pages: 10-11 *
张淑平: "《程序员教程》", 31 August 2009, 清华大学出版社 *
赵薇: "《计算机组成原理》", 31 July 2005, article "非数值型数据的编码方法", pages: 25-27 *
赵薇: "《计算机组成原理》", 31 July 2005, 机械工业出版社 *
陈翔: "面向文本数字化的自动纠错方法", 《计算机应用研究》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104317663A (en) * 2014-09-29 2015-01-28 广东欧珀移动通信有限公司 Automatic character testing method and device
CN104317663B (en) * 2014-09-29 2017-11-21 广东欧珀移动通信有限公司 Automate test alphabetic method and apparatus
CN104360988A (en) * 2014-10-17 2015-02-18 北京锐安科技有限公司 Method and device for identifying coding mode of Chinese characters
CN104360988B (en) * 2014-10-17 2017-10-20 北京锐安科技有限公司 The recognition methods of the coded system of Chinese character and device
CN105808370A (en) * 2014-12-31 2016-07-27 航天信息股份有限公司 Method for discovering half Chinese character in character string
US10025812B2 (en) 2016-03-03 2018-07-17 International Business Machines Corporation Identifying corrupted text segments
US10169398B2 (en) 2016-03-03 2019-01-01 International Business Machines Corporation Identifying corrupted text segments
US10318650B2 (en) 2016-03-03 2019-06-11 International Business Machines Corporation Identifying corrupted text segments
US10402392B2 (en) 2016-03-03 2019-09-03 International Business Machines Corporation Identifying corrupted text segments
CN108271041A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 Mess code treating method and apparatus
CN108271041B (en) * 2016-12-30 2021-01-22 北京国双科技有限公司 Method and device for processing messy codes
CN107918608A (en) * 2017-12-12 2018-04-17 广东欧珀移动通信有限公司 Entry processing method, mobile terminal and computer-readable recording medium

Also Published As

Publication number Publication date
CN102479174B (en) 2016-03-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20190220

Address after: 201203 7, 1 Lane 666 lane, Zhang Heng Road, Pudong New Area, Shanghai.

Patentee after: SHANGHAI ZHANGMEN TECHNOLOGY CO., LTD.

Address before: 201203 No. 356 GuoShoujing Road, Zhangjiang High-tech Park, Pudong New Area, Shanghai

Patentee before: Shengle Information Technology (Shanghai) Co., Ltd.
