CN101887519B - Character recognition and modification method - Google Patents

Publication number
CN101887519B
Authority
CN
China
Prior art keywords
identification
literal
adapting
image
software
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2010102535633A
Other languages
Chinese (zh)
Other versions
CN101887519A (en)
Inventor
瞿洋
袁仁慧
梁洵
张振海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Academic Journal (CD Edition) Electronic Publishing House Co., Ltd.
Original Assignee
TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2010-08-16
Filing date
2010-08-16
Publication date
2012-04-18
Application filed by TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Priority to CN2010102535633A
Publication of CN101887519A
Application granted
Publication of CN101887519B
Legal status: Active
Anticipated expiration

Landscapes

  • Character Discrimination (AREA)

Abstract

The invention discloses a character recognition and correction method comprising the following steps: selecting different recognition software and using it in plug-in mode to recognize the characters in a document; comparing the character recognition results; performing longitudinal correction, transverse correction, proofreading, and quality inspection on the characters that were recognized differently; and synthesizing the characters that pass quality inspection into a document and outputting it. For documents consisting mainly of ordinary Chinese characters, the method raises correction efficiency more than sevenfold, to 700,000 characters per 8 hours, and at the same time reduces the correction error rate by 60%, to below 4/10000.

Description

Character recognition and modification method
Technical field
The present invention relates to methods for recognizing and correcting Chinese text during document digitization, and in particular to a method for recognizing and correcting printed Chinese characters.
Background art
In digitizing paper documents, correcting the text produced by OCR consumes a great deal of manpower; it is labor-intensive work with high labor intensity. Current practice is to run image recognition with ordinary OCR software and then make a single pass of manual correction and proofreading. At a normal correction speed of 80,000 characters per person per 8 hours, the error rate after correction usually still exceeds 1/1000.
Summary of the invention
To address the low efficiency and high error rate of existing manual correction, the present invention provides a character recognition and correction method. The method greatly improves the efficiency of manual correction and reduces cost. Its technical scheme is as follows:
The character recognition and correction method comprises:
selecting different recognition software and using it in plug-in mode to recognize the characters in a document;
comparing the recognition results;
manually correcting and proofreading the characters on which the recognition results differ, then performing quality inspection;
synthesizing the characters that pass quality inspection into a document and outputting it.
The beneficial effects of the technical scheme provided by the invention are:
For documents consisting mainly of ordinary Chinese characters, the invention raises correction efficiency more than sevenfold, to 700,000 characters per 8 hours; at the same time it reduces the correction error rate by 60%, to below 4/10000.
Description of drawings
Fig. 1 is a flow chart of the method of the present invention.
Embodiment
To make the objectives, technical scheme, and advantages of the present invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawing.
This embodiment provides a character recognition and correction method, which comprises the following flow (see Fig. 1):
Document scanning and processing
To improve the recognition accuracy of the OCR software, all documents are scanned at a uniform resolution of 300 DPI. The images then undergo the necessary processing, such as skew correction, deblurring, and denoising.
Cropping by paragraph
To ensure that the two OCR engines work from the same layout analysis result, the document image must be cropped into paragraphs. Cropping follows the natural order of the paragraphs in the article, and the crops are named automatically so that they can be used when the results are output.
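The source does not give the naming scheme for the paragraph crops. As a minimal sketch, assuming a hypothetical scheme of document identifier plus zero-padded paragraph number, names that preserve the natural paragraph order could be generated like this:

```python
def paragraph_crop_names(doc_id, paragraph_count, ext="png"):
    """Return one file name per paragraph crop, in reading order.

    The "<doc_id>_p<NNNN>.<ext>" pattern is a hypothetical convention,
    not taken from the patent. Zero padding keeps alphabetical order
    identical to paragraph order, which simplifies the final merge.
    """
    return [f"{doc_id}_p{k:04d}.{ext}" for k in range(1, paragraph_count + 1)]


names = paragraph_crop_names("doc7", 3)
```

Because the numbers are zero-padded, sorting the names alphabetically recovers the paragraph order without any extra index file.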
Layout analysis and inspection
The cropped images undergo automatic layout analysis with the Hanwang (汉王) OCR software. The automatic layout analysis results are then checked manually and any errors are corrected. During inspection, image defects are repaired as necessary to ensure that the paragraph and line analysis is correct; if required, layout analysis is performed manually. The layout analysis result of the Hanwang OCR software serves as the basis for the final reassembly of paragraphs.
Plug-in recognition with the two OCR engines, Hanwang (汉王) and Wintone (文通)
Each paragraph crop is cut into individual line images ("line cropping"), and the line images are fed to both the Hanwang and the Wintone recognition software for plug-in recognition.
Plug-in recognition means leaving the original OCR software unmodified and writing a new program that simulates a human operating the OCR software in order to complete the image recognition work. The plug-in program and the OCR program are separate pieces of software that run independently. The plug-in program does not need the OCR program's recognition interface to recognize images; it uses the OCR program itself to perform the recognition.
Plug-in recognition effectively saves the cost of purchasing OCR SDK software for both channels and lowers the system construction cost. It also avoids the problem of SDK software lagging technically behind the corresponding retail product.
The reason for line cropping before sending the images line by line to the two recognition engines is that, even for a very clear paragraph image, the layout analysis algorithms of the two engines differ, so their layout analysis results may also differ. Line cropping guarantees that both engines work from a correct line analysis.
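The source does not say how the line images are produced. One common approach, assumed here rather than taken from the patent, is a horizontal projection profile: count the ink pixels in every pixel row and cut wherever the count drops to zero. The function name and the list-of-lists image representation are illustrative only.

```python
def cut_rows(image, min_ink=1):
    """Split a binarized paragraph image into text-line bands.

    image: list of pixel rows, each a list of 0/1 values (1 = ink).
    Returns half-open (top, bottom) row ranges, one per text line.
    This is a projection-profile sketch, not the patent's procedure.
    """
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # blank row closes the line
            start = None
    if start is not None:                  # line touching the bottom edge
        bands.append((start, len(image)))
    return bands


page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
    [1, 0, 0],
]
bands = cut_rows(page)
```

Each band can then be saved as its own image and handed to both engines, so that neither engine has to decide where the lines are.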
Comparison of the two recognition results
Hanwang (汉王) and Wintone (文通) are both domestic OCR systems with high recognition rates for Chinese and English; each achieves a recognition rate above 98% on clear images of printed Chinese characters. More valuable still, our comparative tests show that the two engines are strongly complementary. Their recognition results are compared line by line and character by character: characters with identical results are filtered out and are not passed to manual correction, while characters on which the results differ are passed to manual correction and proofreading.
Statistics from practical use show that, for documents consisting mainly of ordinary printed Chinese characters, the share of text that is excluded from manual correction reaches 95%, and the error rate of this text is below 3/10000.
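The filtering step can be illustrated with a short sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner, which is not the alignment algorithm the patent describes; the function name and the tuple layout are assumptions made for the example.

```python
from difflib import SequenceMatcher


def find_disputes(line_a, line_b):
    """Compare two engines' outputs for the same line image.

    Returns a list of (position_in_a, text_from_a, text_from_b) for every
    stretch where the two results differ. Stretches where they agree are
    accepted automatically and do not appear in the list.
    """
    matcher = SequenceMatcher(None, line_a, line_b, autojunk=False)
    disputes = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            disputes.append((a0, line_a[a0:a1], line_b[b0:b1]))
    return disputes


disputes = find_disputes("今天人很多", "今天入很多")
```

Only the entries in `disputes` would be shown to a human corrector; everything else is taken as recognized correctly because both engines agree on it.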
Before the two results are compared, some characters are normalized from full-width to half-width form, as the application requires. These characters include A-Z, a-z, 0-9, "!", "[", "]", etc., 80 characters in total.
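A minimal sketch of this normalization follows. It relies on the fact that the Unicode full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII counterparts. The source lists only some of the 80 characters, so the table below covers just the letters, digits, and the three punctuation marks that are named; the remaining characters are not guessed.

```python
import string

# Characters named in the text; the full list of 80 is not given in the source.
HALF_WIDTH_TARGETS = string.ascii_letters + string.digits + "![]"

# Map each full-width code point to its half-width (ASCII) character.
FULL_TO_HALF = {ord(ch) + 0xFEE0: ch for ch in HALF_WIDTH_TARGETS}


def to_half_width(text):
    """Replace the listed full-width characters; leave everything else alone."""
    return text.translate(FULL_TO_HALF)


sample = to_half_width("\uff21\uff22\uff11\uff01")  # full-width "AB1!"
```

Characters outside the table, such as Chinese characters or a full-width comma, pass through unchanged, so the two engines' outputs differ only where the recognition itself differs.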
The line comparison algorithm for the two results is based on the A* state-space search algorithm and uses Horizon Search to find the optimal match. Let the two text strings to be compared be S1 and S2, with lengths m and n respectively, where m≤n; S1 consists of the characters (Cs1, Cs2, ..., Csm) and S2 of the characters (Cl1, Cl2, ..., Cln). The alignment algorithm is as follows:
(1) For each character Csi of the shorter string S1, 1≤i≤m, look for matching characters in the longer string S2, and put the indices of the characters in S2 that match Csi into the candidate set SMi. Then add an index of -1 to SMi, representing "no match". The process is as follows:
FOR i = 1 TO m
begin
    FOR j = 1 TO n
    begin
        if Csi = Clj then SMi ← j
    end
    SMi ← -1
end
This yields the search space (SM1, SM2, ..., SMm).
(2) To reduce the size of the search space, for each candidate match compute the maximum number of matches still possible from that candidate onward, the candidate itself included (MaxMatchAfter, MMA for short), for use in the heuristic search of the next step. For the -1 candidate in SMi, i.e. Csi matches no character of S2, MMA=m-i. For the other candidates in SMi, MMA is computed recursively, using the order constraint and the length constraint to exclude obviously unreasonable matches.
(3) Carry out the horizontal heuristic recursive search to quickly find a solution with a large number of matches.
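The three steps above can be sketched as a small depth-first search with pruning. This is one possible reading, not the patent's exact procedure: the source does not spell out the MMA recursion or the length constraint, so the sketch substitutes a simple optimistic bound, min(characters left in S1, characters left in S2 after the last match). All function names are illustrative.

```python
def build_candidates(s1, s2):
    """Step (1): for each character of s1, the matching indices in s2, then -1."""
    return [[j for j, c2 in enumerate(s2) if c2 == c1] + [-1] for c1 in s1]


def align(s1, s2):
    """Return (match_count, path); path[i] is the s2 index matched to s1[i], or -1."""
    if len(s1) > len(s2):
        raise ValueError("s1 must be the shorter string (m <= n)")
    m, n = len(s1), len(s2)
    cands = build_candidates(s1, s2)
    best_count, best_path = -1, []
    path = []

    def search(i, last_j, matched):
        nonlocal best_count, best_path
        # Assumed stand-in for MMA: an upper bound on matches still possible.
        remaining = min(m - i, n - (last_j + 1))
        if matched + remaining <= best_count:
            return                          # cannot beat the best solution so far
        if i == m:
            best_count, best_path = matched, path.copy()
            return
        for j in cands[i]:
            if j == -1:                     # leave s1[i] unmatched
                path.append(-1)
                search(i + 1, last_j, matched)
                path.pop()
            elif j > last_j:                # order constraint: indices must increase
                path.append(j)
                search(i + 1, j, matched + 1)
                path.pop()

    search(0, -1, 0)
    return best_count, best_path


result = align("abcd", "abxcd")
```

Trying real matches before the -1 candidate means the first complete path is usually already good, so the bound prunes most of the remaining space. On lines with many repeated characters the search can still grow quickly; a dynamic-programming longest-common-subsequence routine would be a safer choice there.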
Longitudinal correction
Characters on which the two recognition results conflict and which occur two or more times are first passed to manual longitudinal correction and proofreading. All characters that need longitudinal correction are marked in red in the paragraph, characters that have already been corrected are marked in blue, and image and text are shown side by side for comparison. Tasks are formed in batches of 700,000 characters, which basically guarantees that a batch is completed within one day.
Under normal conditions, the correction volume of this step accounts for only 5% of the total correction workload. Longitudinal correction effectively improves correction efficiency and reduces labor intensity.
To improve the accuracy of the whole system, we also proactively add some easily confused and easily misrecognized characters, all of which go through longitudinal correction; for example, some 20 characters such as "人" (person), "入" (enter), "一" (one), "二" (two), "卜" (divination), "白" (white), ".", and "儿" (child).
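How disputed characters might be routed can be sketched as follows. The sketch assumes that longitudinal correction groups identical disputes so a corrector can handle them together, and that the remaining one-off disputes are left for the later in-context pass; the source does not state this grouping explicitly. The repeat threshold of 2 is a parameter because the translated wording is ambiguous, and the proactively added watch list is left out for brevity.

```python
from collections import Counter, defaultdict


def split_queues(disputes, min_repeats=2):
    """Route disputes to a grouped queue or a one-off queue.

    disputes: list of (position, engine_a_text, engine_b_text).
    Returns (grouped, one_off): grouped maps each repeated (a, b) pair
    to its positions; one_off keeps the remaining disputes as-is.
    """
    counts = Counter((a, b) for _, a, b in disputes)
    grouped = defaultdict(list)
    one_off = []
    for pos, a, b in disputes:
        if counts[(a, b)] >= min_repeats:
            grouped[(a, b)].append(pos)
        else:
            one_off.append((pos, a, b))
    return dict(grouped), one_off


grouped, one_off = split_queues([(0, "人", "入"), (5, "人", "入"), (9, "己", "已")])
```

Grouping lets a corrector decide many instances of the same confusion in one sitting, which is consistent with the efficiency gain the text attributes to this step.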
Transverse correction
After longitudinal correction, the system carries out transverse correction. All characters that need transverse correction are marked in red in the paragraph, characters handled in longitudinal correction are marked in green, characters that have already been corrected are marked in blue, and image and text are shown side by side for comparison.
Under normal operation, the correction volume of this step is less than 1% of the total correction workload. During correction, the corrector is also required to check that the paragraph is correct.
Quality inspection
To supervise the correctors and ensure that routine correction quality is met, a sampling inspection post is set up to spot-check each batch of manually corrected data. Typically 1/10 of the data is sampled, ensuring that correction errors stay below 1/1000.
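A minimal sketch of the spot check, assuming simple random sampling of corrected items and a pass criterion of fewer than one error per 1,000 inspected characters. The function names, the seed parameter, and the unit of sampling are illustrative; the source only gives the 1/10 sampling ratio and the 1/1000 limit.

```python
import random


def sample_for_inspection(items, one_in=10, seed=None):
    """Pick roughly one item in `one_in` from a batch, at least one if non-empty."""
    if not items:
        return []
    rng = random.Random(seed)
    k = max(1, len(items) // one_in)
    return rng.sample(items, k)


def batch_passes(errors_found, chars_inspected, limit_per_thousand=1):
    """True if the observed error rate is strictly below the limit.

    Integer arithmetic avoids floating-point edge cases at the threshold.
    """
    return chars_inspected > 0 and errors_found * 1000 < limit_per_thousand * chars_inspected


picked = sample_for_inspection(list(range(1000)), seed=1)
```

A batch that fails the check would presumably be returned for another correction pass, although the source does not describe what happens in that case.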
Merged output
Using the paragraph cropping information, the corrected text is assembled into a normal article. The system error rate is: 3/10000 × 95% + 1/1000 × 5% = 3.35/10000.
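The system error rate is a weighted average of two streams: the 95% of text accepted without manual correction (error rate 3/10000) and the 5% that is corrected manually (error rate 1/1000). The arithmetic can be checked with exact fractions; the function and parameter names are illustrative.

```python
from fractions import Fraction


def system_error_rate(auto_share, auto_error, manual_error):
    """Weighted error rate of automatically accepted and manually corrected text."""
    return auto_share * auto_error + (1 - auto_share) * manual_error


rate = system_error_rate(
    auto_share=Fraction(95, 100),
    auto_error=Fraction(3, 10000),
    manual_error=Fraction(1, 1000),
)
per_ten_thousand = rate * 10000  # error count per 10,000 characters
```

The two terms contribute 2.85/10000 and 0.5/10000 respectively, so the automatically accepted text dominates the final figure even though its own error rate is the lower of the two.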
The above is merely a preferred embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims (3)

1. A character recognition and correction method, characterized in that the method comprises:
document scanning and processing, namely scanning the document with a high-precision instrument and applying skew correction, deblurring, and denoising to the image;
cropping by paragraph, to ensure that the two OCR engines have the same layout analysis result;
performing automatic layout analysis on the cropped images;
line-cropping the paragraph crops, and using the Hanwang (汉王) and Wintone (文通) recognition software in plug-in mode to recognize the characters in the document, wherein said plug-in recognition means leaving the original OCR software unmodified and writing a new program that simulates a human operating the OCR software in order to complete the image recognition work;
comparing the recognition results;
correcting and proofreading the characters on which the recognition results differ and performing quality inspection, the correction comprising longitudinal correction and transverse correction, wherein longitudinal correction means passing characters on which the two recognition results conflict and which occur two or more times to manual longitudinal correction and proofreading, and transverse correction means performing transverse correction after longitudinal correction while also checking that the paragraph is correct;
synthesizing the characters that pass quality inspection into a document and outputting it.
2. The character recognition and correction method according to claim 1, characterized in that the Hanwang OCR software and the Wintone OCR software are two kinds of recognition software whose recognition results are complementary.
3. The character recognition and correction method according to claim 1 or 2, characterized in that the recognition further comprises the recognition of English and other characters.
CN2010102535633A 2010-08-16 2010-08-16 Character recognition and modification method Active CN101887519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102535633A CN101887519B (en) 2010-08-16 2010-08-16 Character recognition and modification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102535633A CN101887519B (en) 2010-08-16 2010-08-16 Character recognition and modification method

Publications (2)

Publication Number Publication Date
CN101887519A CN101887519A (en) 2010-11-17
CN101887519B CN101887519B (en) 2012-04-18

Family

ID=43073434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102535633A Active CN101887519B (en) 2010-08-16 2010-08-16 Character recognition and modification method

Country Status (1)

Country Link
CN (1) CN101887519B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: CHINESE ACADEMIC JOURNAL (CD) ELECTRONIC PUBLISHIN

Free format text: FORMER OWNER: TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY CO., LTD.

Effective date: 20120615

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20120615

Address after: 100084 Beijing city Haidian District Tsinghua University Tsinghua Yuan 36 zone B1410, Huaye building 1412, room 1414

Patentee after: "Chinese Academic Journals (CD)" Electronic Magazine

Address before: 100084 Beijing city Haidian District Tsinghua University Tsinghua Yuan 36 zone B1410, Huaye building 1412, room 1414

Patentee before: Tongfang Knowledge Network (Beijing) Technology Co., Ltd.

C56 Change in the name or address of the patentee

Owner name: CHINA ACADEMIC JOURNAL (CD) ELECTRONIC PUBLISHING

Free format text: FORMER NAME: CHINA ACADEMIC JOURNAL (CD) ELECTRONIC PUBLISHING HOUSE

CP03 Change of name, title or address

Address after: 100084 Haidian District Tsinghua Yuan Tsinghua University Beijing District 1407, 1408, 36, 1409

Patentee after: China Academic Journal (CD Edition) Electronic Publishing House Co., Ltd.

Address before: 100084 Beijing city Haidian District Tsinghua University Tsinghua Yuan 36 zone B1410, Huaye building 1412, room 1414

Patentee before: "Chinese Academic Journals (CD)" Electronic Magazine