CN101887519B - Character recognition and modification method

Info

Publication number: CN101887519B
Application number: CN2010102535633A
Authority: CN
Inventors: 瞿洋; 袁仁慧; 梁洵; 张振海
Original assignee: TONGFANG KNOWLEDGE NETWORK (BEIJING) TECHNOLOGY Co Ltd
Current assignee: "Academic Journal (CD-ROM)" Electronic Magazine Society Co., Ltd.
Priority date: 2010-08-16
Filing date: 2010-08-16
Publication date: 2012-04-18
Anticipated expiration: 2030-08-16
Also published as: CN101887519A

Abstract

The invention discloses a character recognition and modification method comprising the following steps: selecting different recognition software packages and using them in plug-in mode to recognize the characters in a document; comparing the recognition results; performing longitudinal modification, transverse modification, proofreading and quality inspection on the characters that were recognized differently; and synthesizing the characters that pass quality inspection into a document and outputting it. For documents consisting mainly of standard Chinese characters, the method improves modification efficiency more than sevenfold, to 700,000 characters per 8 hours, and reduces the modification error rate by 60%, to below 4/10000.

Description

Character recognition and modification method

Technical field

The present invention relates to methods for recognizing and modifying Chinese text during document digitization, and in particular to a method for recognizing and modifying printed Chinese characters.

Background art

In digitizing paper documents, modifying the text produced by OCR consumes a great deal of manpower; it is labor-intensive work, and the labor intensity is high. The current state of practice is as follows: an image is recognized with ordinary OCR software and the result is then modified and proofread once. At a normal modification speed of 80,000 characters per person per 8 hours, the modification error rate usually still exceeds 1/1000.

Summary of the invention

To address the low efficiency and high error rate of existing manual modification, the present invention provides a character recognition and modification method. The method greatly improves the efficiency of manual modification and reduces cost. Its technical solution is as follows:

The character recognition and modification method comprises:

selecting different recognition software packages and recognizing the characters in the document in plug-in mode;

comparing the recognition results;

modifying and proofreading the characters that were recognized differently, and performing quality inspection;

synthesizing the characters that pass quality inspection into a document and outputting it.

The beneficial effects of the technical solution provided by the invention are:

For documents consisting mainly of standard Chinese characters, the invention improves modification efficiency more than sevenfold, to 700,000 characters per 8 hours; at the same time, the modification error rate is reduced by 60%, to below 4/10000.

Description of drawings

Fig. 1 is a flowchart of the method of the present invention.

Embodiment

To make the objects, technical solution and advantages of the invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawing:

This embodiment provides a character recognition and modification method comprising the following flow (see Fig. 1):

Document scanning and processing

To improve the recognition accuracy of the OCR software, all documents are scanned at a uniform resolution of 300 DPI. The images then undergo the necessary processing, such as skew correction, deblurring and denoising.

Cropping by paragraph

To ensure that the two OCR engines work from identical layout analysis results, the document image must be cropped into paragraph images. Cropping follows the natural order of the paragraphs in the article, and the crops are named automatically so that they can be used when the results are output.

Layout analysis and inspection

Automatic layout analysis is performed on the cropped images with the "Hanwang" OCR software. The automatic layout analysis results are then inspected manually and any errors are corrected. During inspection, image defects are repaired as necessary to ensure that the paragraph and line analysis is correct. If necessary, layout analysis is performed manually. The layout analysis results of the "Hanwang" OCR software serve as the basis for the final reassembly of paragraphs.

Plug-in recognition with the two OCR engines, "Hanwang" and "Wentong"

Each paragraph image is cropped line by line ("line cropping") into a series of line images, which are fed separately into the two recognition engines, "Hanwang" and "Wentong", for plug-in recognition.

Plug-in recognition means leaving the original OCR software unchanged and writing a new program that simulates manual operation of the OCR software in order to complete the image recognition work. The plug-in program and the OCR program are separate, independently running pieces of software. The plug-in program does not need a recognition interface from the OCR program; it drives the OCR program to perform the image recognition.

Plug-in recognition effectively saves the cost of purchasing SDKs for the two OCR engines and so lowers the cost of building the system. It also avoids the problem of an SDK lagging technically behind the corresponding retail software.

Line cropping is performed before the images are fed line by line to the two engines for the following reason: even for a very clear paragraph image, the two engines use different layout analysis algorithms and may therefore produce different layout analysis results. Line cropping guarantees that the line analysis is correct for both engines.
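Splitting a paragraph image into line images can be sketched as follows. The patent does not say how the lines are located, so this example assumes a simple horizontal projection profile over a binarized image (1 = ink, 0 = background); the function names `split_rows` and `crop_rows` and the `min_ink` threshold are hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # binarized image: 1 = ink, 0 = background


def split_rows(image: Image, min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (top, bottom) pixel-row ranges, bottom exclusive, one per text line.

    Assumption: text lines are separated by at least one pixel row whose ink
    count is below `min_ink` (a projection-profile heuristic, not a method
    stated in the patent).
    """
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text line begins here
        elif not has_ink and start is not None:
            bands.append((start, y))       # the text line ended on the previous row
            start = None
    if start is not None:                  # line runs to the bottom edge
        bands.append((start, len(image)))
    return bands


def crop_rows(image: Image) -> List[Image]:
    """Cut the paragraph image into one sub-image per detected text line."""
    return [image[top:bottom] for top, bottom in split_rows(image)]
```

Each cropped line image would then be passed to both engines, so that neither engine's own line segmentation can cause the two results to diverge.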

Comparison of the two recognition results

"Hanwang" and "Wentong" are both domestic OCR systems with high recognition rates for Chinese and English; for clear printed Chinese character images, both achieve recognition rates above 98%. More valuable still, our comparative tests show that the two engines are strongly complementary. Their recognition results are compared line by line and character by character: characters with identical recognition results are filtered out and are not sent for manual modification, while characters that were recognized differently are sent for manual modification and proofreading.

Statistics from practical application show that, for documents consisting mainly of standard printed Chinese characters, the share of characters filtered out as not needing modification reaches 95%, and the error rate of this portion is below 3/10000.
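The filtering idea can be illustrated with a short sketch: characters on which both engines agree are set aside, and only the disagreements go to a human. This example uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner, which is not the alignment algorithm of the patent; the function name `compare_lines` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def compare_lines(line_a: str, line_b: str) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Split two OCR results for the same line into agreed and disputed spans.

    Returns (agreed, disputed): `agreed` holds text both engines produced,
    `disputed` holds (engine A text, engine B text) pairs for manual review.
    """
    agreed, disputed = [], []
    # autojunk=False keeps difflib from discarding frequent characters
    matcher = SequenceMatcher(None, line_a, line_b, autojunk=False)
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag == "equal":
            agreed.append(line_a[a0:a1])
        else:  # "replace", "delete" or "insert": the engines disagree here
            disputed.append((line_a[a0:a1], line_b[b0:b1]))
    return agreed, disputed
```

For example, `compare_lines("人工智能", "入工智能")` leaves "工智能" untouched and flags only the first character for review.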

Before the comparison, and depending on the application requirements, some characters also undergo the necessary normalization from full-width to half-width form. These characters include A-Z, a-z, 0-9, "!", "[", "]" and others, 80 characters in total.
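A minimal sketch of this normalization is shown below. It relies on the fact that the Unicode full-width forms of printable ASCII characters sit at a fixed offset of 0xFEE0 from their half-width counterparts. Only the characters the text names explicitly are included; the remaining punctuation in the 80-character set is not listed in the source, so `PUNCTUATION` is an assumption to be extended, and `normalize_width` is a hypothetical name.

```python
import string

# Assumption: only "!", "[" and "]" are named explicitly in the text;
# extend this string to cover the rest of the 80-character set.
PUNCTUATION = "![]"

HALF_WIDTH = string.ascii_uppercase + string.ascii_lowercase + string.digits + PUNCTUATION

# Full-width forms (U+FF01..U+FF5E) are the ASCII code points plus 0xFEE0.
FULL_WIDTH_OFFSET = 0xFEE0
TO_HALF_WIDTH = {ord(ch) + FULL_WIDTH_OFFSET: ch for ch in HALF_WIDTH}


def normalize_width(text: str) -> str:
    """Replace the selected full-width characters with half-width ones.

    Characters outside the table (Chinese text, full-width commas, etc.)
    are left unchanged.
    """
    return text.translate(TO_HALF_WIDTH)
```

Applying the same normalization to both engines' output before comparison prevents a full-width "Ａ" and a half-width "A" from being reported as a disagreement.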

The line comparison algorithm is based on the state-space search A* algorithm and uses a horizontal search to find the optimal match. Let the two text strings to be compared be S1 and S2, with lengths m and n respectively, where m ≤ n; S1 contains the characters (Cs1, Cs2, ..., Csm) and S2 contains the characters (Cl1, Cl2, ..., Cln). The alignment algorithm is as follows:

(1) For each character Csi of the shorter string S1, 1 ≤ i ≤ m, search the longer string S2 for matching characters and put the indices of the characters in S2 that match Csi into the candidate match set SMi; then add an index of -1 to SMi, representing no match. The procedure is as follows:

FOR i = 1 TO m
begin
    FOR j = 1 TO n
    begin
        if Csi = Clj then SMi ← j
    end
    SMi ← -1
end

This yields the search space (SM1, SM2, ..., SMm).

(2) To reduce the size of the search space, for each candidate match the maximum number of matches possible from that match onward, itself included, is computed. This value, MaxMatchAfter (MMA for short), is used in the heuristic search of the next step. For the candidate -1 in SMi, i.e. Csi matches no character of S2, MMA = m - i. For the other candidates in SMi, MMA is computed recursively, using the order constraint and the length constraint to exclude obviously unreasonable matches.

(3) Perform a horizontal heuristic recursive search to quickly find a solution with a large number of matches.
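The three steps can be sketched in Python as below. This is a simplified illustration, not the patent's implementation: the candidate sets follow step (1) directly, but the A*-based heuristic search and the MMA values are replaced by a memoized exhaustive recursion that uses only the trivial bound "at most m - i further matches". Indices are 0-based here, and the function names are hypothetical.

```python
from functools import lru_cache
from typing import List


def candidate_sets(s1: str, s2: str) -> List[List[int]]:
    """Step (1): for each character of s1, the indices of equal characters
    in s2, followed by -1 meaning 'no match'."""
    return [[j for j, c2 in enumerate(s2) if c1 == c2] + [-1] for c1 in s1]


def align(s1: str, s2: str) -> List[int]:
    """Return, for each character of s1, its matched index in s2 or -1.

    Matches must keep their order (the order constraint) and the number of
    matches is maximized. Recursion depth equals len(s1), which is fine for
    single text lines.
    """
    if len(s1) > len(s2):
        raise ValueError("s1 must be the shorter string (m <= n)")
    sets = candidate_sets(s1, s2)
    m = len(s1)

    @lru_cache(maxsize=None)
    def best(i: int, min_j: int):
        """Best (match count, assignment) for s1[i:] using s2 indices >= min_j."""
        if i == m:
            return 0, ()
        # Option 1: leave character i unmatched (the -1 candidate).
        best_count, tail = best(i + 1, min_j)
        best_assign = (-1,) + tail
        # Option 2: try every candidate index that respects the order constraint.
        for j in sets[i]:
            if j < min_j:              # also skips the -1 entry
                continue
            if best_count == m - i:    # upper bound reached, cannot improve
                break
            count, tail = best(i + 1, j + 1)
            if count + 1 > best_count:
                best_count, best_assign = count + 1, (j,) + tail
        return best_count, best_assign

    return list(best(0, 0)[1])
```

Positions assigned -1, and characters of the longer string that no position points to, are the places where the two engines disagree.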

Longitudinal modification

Characters on which the two recognition results conflict and that recur twice or more are first sent for manual longitudinal modification and proofreading. All characters requiring longitudinal modification are marked in red in the paragraph, characters that have already been modified are marked in blue, and the image and the text are shown side by side for comparison. Tasks are formed in batches of 700,000 characters, which basically ensures that each batch is completed within one day.

Under normal conditions, the modification volume of this step accounts for only 5% of the total workload that would otherwise need modification. Longitudinal modification effectively improves modification efficiency and reduces labor intensity.

To improve the accuracy of the overall system, we have also proactively added some easily confused and easily misrecognized characters, all of which undergo longitudinal modification: some 20 characters such as "people", "going into", "one", "two", "fore-telling", "in vain", "." and "youngsters".
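One way to picture the selection rule is the sketch below: disputed recognition results that recur at least twice in a batch, or that involve a character on a forced-review list, are grouped for the first manual pass, and the rest are left for the later pass. The data representation (pairs of engine outputs), the function name `split_disputes` and the sample characters in the usage note are all assumptions made for illustration.

```python
from collections import Counter
from typing import FrozenSet, List, Set, Tuple

Pair = Tuple[str, str]  # (engine A result, engine B result) for one disputed character


def split_disputes(
    disputed: List[Pair],
    always_review: FrozenSet[str] = frozenset(),
    min_repeats: int = 2,
) -> Tuple[Set[Pair], List[Pair]]:
    """Separate disputed pairs into a recurring group and a remainder.

    A pair goes to the first (grouped) pass when it occurs at least
    `min_repeats` times in the batch or contains a character from
    `always_review`. Everything else is returned in original order.
    """
    counts = Counter(disputed)
    grouped = {
        pair
        for pair, n in counts.items()
        if n >= min_repeats or set(pair) & always_review
    }
    remainder = [pair for pair in disputed if pair not in grouped]
    return grouped, remainder
```

Handling a recurring dispute once for the whole batch, instead of once per occurrence, is what makes the grouped pass cheaper than reading every paragraph in order.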

Transverse modification

After longitudinal modification, the system proceeds to transverse modification. All characters requiring transverse modification are marked in red in the paragraph, characters that underwent longitudinal modification are marked in green, characters that have already been modified are marked in blue, and the image and the text are shown side by side for comparison.

Under normal operation, the modification volume of this step is less than 1% of the total modification workload. During modification, the modifier is also required to check that the paragraph is correct.

Quality inspection

To supervise the modifiers and ensure routine modification quality, a sampling inspection post is set up to spot-check each batch of manually modified data. Generally 1/10 of the data is sampled, to ensure that modification errors stay below 1/1000.
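The spot check can be sketched as a random draw followed by a threshold test. The 1/10 sampling fraction and the 1/1000 limit come from the text; the function names, the optional `seed` parameter and the choice of uniform random sampling are assumptions.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_for_inspection(
    items: Sequence[T], fraction: float = 0.1, seed: Optional[int] = None
) -> List[T]:
    """Draw a uniform random sample (default 1/10) of a batch for inspection."""
    if not items:
        return []
    rng = random.Random(seed)              # seed only for reproducible demos
    k = max(1, round(len(items) * fraction))
    return rng.sample(list(items), k)


def batch_passes(errors_found: int, sample_size: int, max_error_rate: float = 1 / 1000) -> bool:
    """True if the error rate observed in the sample is below the limit."""
    return sample_size > 0 and errors_found / sample_size < max_error_rate
```

A batch of 700,000 characters would yield a sample of 70,000, so the batch passes only if fewer than 70 errors are found in the sample.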

Merge output

Based on the paragraph cropping information, the modified text is synthesized into a normal article. The system error rate is: 3/10000 × 95% + 1/1000 × 5% = 3.35/10000.
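The error-rate figure is a weighted average of two portions: the 95% of characters accepted automatically (error rate 3/10000) and the 5% modified by hand (error rate bounded by 1/1000). A short check with exact fractions, using the hypothetical helper `system_error_rate`:

```python
from fractions import Fraction


def system_error_rate(auto_share: Fraction, auto_error: Fraction, manual_error: Fraction) -> Fraction:
    """Weighted error rate of the automatically accepted and manually modified portions."""
    return auto_share * auto_error + (1 - auto_share) * manual_error


rate = system_error_rate(
    auto_share=Fraction(95, 100),      # characters on which both engines agree
    auto_error=Fraction(3, 10000),     # error rate of that portion
    manual_error=Fraction(1, 1000),    # quality-inspection limit for manual work
)
print(float(rate * 10000))  # errors per 10,000 characters: 3.35
```

The result, 3.35 per 10,000, is consistent with the stated target of fewer than 4 errors per 10,000 characters.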

The above is merely a preferred embodiment of the invention, but the scope of protection of the invention is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. Accordingly, the scope of protection of the invention shall be determined by the scope of protection of the claims.

Claims

1. A character recognition and modification method, characterized in that the method comprises:

document scanning and processing, i.e. scanning the document with a high-precision instrument and performing skew correction, deblurring and denoising on the image;

cropping by paragraph to ensure that the two OCR engines have identical layout analysis results;

performing automatic layout analysis on the cropped images;

performing "line cropping" on the paragraph images, and selecting the Hanwang and Wentong recognition software to recognize the characters in the document in plug-in mode; said plug-in recognition means leaving the original OCR software unchanged and writing a new program that simulates manual operation of the OCR software in order to complete the image recognition work;

comparing the recognition results;

modifying and proofreading the characters that were recognized differently and performing quality inspection, wherein said modification comprises longitudinal modification and transverse modification; longitudinal modification means that characters on which the two recognition results conflict and that recur twice or more are sent for manual longitudinal modification and proofreading; transverse modification is performed after longitudinal modification, while also checking that the paragraph is correct;

2. The character recognition and modification method according to claim 1, characterized in that said Hanwang OCR software and Wentong OCR software are two recognition software packages with complementary recognition results.

3. The character recognition and modification method according to claim 1 or 2, characterized in that said recognition further comprises recognition of English and other characters.