CN1131301A

CN1131301A - Word cutting method

Info

Publication number: CN1131301A
Application number: CN 95105634
Authority: CN
Inventors: 江政钦; 戴光良
Original assignee: Industrial Technology Research Institute ITRI
Current assignee: Industrial Technology Research Institute ITRI
Priority date: 1995-03-13
Filing date: 1995-05-30
Publication date: 1996-09-18
Also published as: JP2781150B2; JPH08263589A

Abstract

The character segmentation method comprises: searching for all pixel connected components; processing them with an iterative connected-component merging algorithm to form a number of independent characters; and ordering the independent characters. The iterative connected-component merging algorithm uses the geometric relationships among the pixel connected components to merge the appropriate components into an independent character, under conditions set from automatically estimated reference values of character width, character spacing, line width and line spacing. The invention also facilitates document reconstruction.

Description

A character segmentation method

The present invention relates to character recognition systems, and in particular to a character segmentation method that produces independent characters for recognition.

In a general optical character recognition (OCR) system, character segmentation plays a major role, as can be seen from the OCR flow shown in Fig. 1. In the recognition process, the graphics in a document must first be separated from the text so that the text portion can be recognized on its own. Next comes the most important segmentation step: each character in the document is cut out, so that every character in the text portion is available as an independent pattern for matching. The result of character segmentation therefore strongly affects both the feasibility and the accuracy of recognition. Image preprocessing then further processes each segmented character, for example by smoothing, to make recognition easier. Character recognition is then carried out with any of the recognition methods already developed, followed by linguistic post-processing once recognition is complete.

Earlier character segmentation methods, such as the projection method, the small-region segmentation method and run length coding, all take regularly arranged text as their primary precondition. Although each method has its own advantages, none of them can segment documents of the following three kinds: (1) documents containing skewed or distorted lines, as shown in Fig. 2; (2) documents containing characters that partly overlap but are not connected, as shown in Fig. 3; and (3) documents whose characters vary in size, as shown in Fig. 4. In other words, apart from documents as neat as printed text, the irregularities of ordinary handwritten documents make character segmentation difficult, let alone recognition.

Therefore, the main object of the present invention is to provide a character segmentation method that uses pixel connected components and an iterative connected-component merging algorithm, so that documents containing overlapping but unconnected characters, or characters of varying size, can all be segmented into independent characters for recognition.

Another object of the present invention is to provide a character segmentation method in which special handling of lines and character ordering allows characters in skewed or distorted lines to be segmented and then reassembled for recognition.

A further object of the present invention is to provide a character segmentation method with which documents made up of handwriting can also be recognized and processed by a character recognition system.

To achieve these objects, the character segmentation method of the present invention comprises: searching for all pixel connected components in the document; processing all pixel connected components into a number of independent characters with an iterative connected-component merging algorithm; and ordering the independent characters. The iterative connected-component merging algorithm uses the geometric relationships among the pixel connected components and, under conditions set from automatically estimated reference values of character width, character spacing, line width and line spacing, repeatedly compares the components and merges the appropriate ones into independent characters. The present invention also provides a character ordering method for skewed lines, to facilitate reconstruction of the document.

To make the above objects, features and advantages of the present invention more apparent, a preferred embodiment is described in detail below with reference to the accompanying drawings:

Fig. 1 is a flowchart of optical character recognition.

Fig. 2 shows a document with skewed lines.

Fig. 3 shows a document containing characters that partly overlap but are not connected.

Fig. 4 shows a document whose characters vary in size.

Fig. 5 is a flowchart of the iterative merging algorithm according to the present invention.

Fig. 6 shows the parameters defined between two pixel connected components.

Figs. 7A to 7C show the possible overlap-area situations between two pixel connected components.

Figs. 8A to 8D show an example of character segmentation according to the present invention.

Fig. 9 shows the result of segmenting the document of Fig. 2 with the method of the present invention.

Fig. 10 shows the result of segmenting the document of Fig. 3 with the method of the present invention.

Fig. 11 shows the result of segmenting the document of Fig. 4 with the method of the present invention.

Figs. 12 and 13 show the results of segmenting two more complex documents with the method of the present invention.

The character segmentation method of the present invention is based on searching for and merging independent pixel connected components, which differs from earlier approaches. In the pictographic scripts based on Chinese characters, including Japanese and Korean, most characters can be decomposed into several independent pixel connected components. For example, the character "明" (bright) consists of two separate connected components, "日" (sun) and "月" (moon), while "日" is itself both a character and a single connected component. Finding these independent components in the document therefore greatly helps character segmentation.

First, an arbitrary black pixel is located in the document image. Starting from it, all black pixels directly or indirectly connected to it are found, and together they form one independent pixel connected component. The remaining black pixels are then searched in the same way, one component at a time, until all pixel connected components have been found. This search method is called the pixel backtracking method. Any person of ordinary skill in the art may use any other method to find all pixel connected components in the document; the way connected components are searched for is therefore not the object of the present invention.
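The search described above can be sketched as a stack-based flood fill. This is a minimal illustration, not the patent's own implementation: the function name `find_components`, the use of 8-connectivity for "directly connected" pixels, the 0/1 image encoding and the inclusive bounding-box output are all assumptions made for the example.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive pixel coordinates


def find_components(image: List[List[int]]) -> List[Box]:
    """Return the bounding box of every 8-connected group of black (1) pixels."""
    rows = len(image)
    cols = len(image[0]) if image else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes: List[Box] = []
    for y in range(rows):
        for x in range(cols):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Start a new component from this unvisited black pixel.
            seen[y][x] = True
            stack = [(y, x)]
            left, top, right, bottom = x, y, x, y
            while stack:
                cy, cx = stack.pop()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                # Visit the 8 neighbours (assumed connectivity).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return boxes
```

Only the bounding boxes are kept here because the later merging rules are stated in terms of the left, right, top and bottom coordinates of each component.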

Next, the present invention uses an iterative, intelligent connected-component merging algorithm to merge the independent pixel connected components that belong to the same character. In each loop, the algorithm gathers statistics to estimate important parameters of the document, such as character width, character spacing, line width and line spacing, and applies a set of rules to merge connected components. A property of this iterative method is that the parameter estimates become more accurate in later loops, which allows more accurate merging and hence a more accurate segmentation result.

Fig. 5 shows the flow of this iterative merging algorithm. It comprises simple merging (step 201), estimation of character width, line width, character spacing and line spacing (step 203), detailed merging (step 205), and a check of whether any merge took place (step 207).

In simple merging (step 201), the pixel connected components isolated from the document are preliminarily merged according to the amount by which they overlap. The overlap is divided into a horizontal overlap and a vertical overlap. As shown in Fig. 6, the horizontal overlap between connected component i and connected component j is Oh, and the vertical overlap is Ov. In preliminary merging, only components that satisfy the following rule are merged:

Oh = min(Wi, Wj) and Ov = min(Hi, Hj),

where Wi and Wj are the widths of connected components i and j, and Hi and Hj are their heights. This condition means that two components are merged only when the region of one covers the region of the other. For example, the character "的" contains three independent pixel connected components: "白", "勹" and "丶". Of these, "勹" and "丶" satisfy the condition and are merged into one component ("勺") in step 201, whereas "白" and "勺" cannot be merged at this stage.
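The simple-merge rule can be written out as a short check on two bounding boxes. This is only a sketch: the `(L, T, R, B)` tuple layout and the helper names `overlap_1d`, `satisfies_simple_merge` and `merge_boxes` are assumptions for illustration, and the overlap is computed as the sum of the two extents minus their joint span, which matches the form of the overlap expressions used later in the description.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (L, T, R, B), origin at the upper left corner


def overlap_1d(a0: int, a1: int, b0: int, b1: int) -> int:
    """Overlap of intervals [a0, a1] and [b0, b1]; negative when they are apart."""
    return (a1 - a0) + (b1 - b0) - (max(a1, b1) - min(a0, b0))


def satisfies_simple_merge(bi: Box, bj: Box) -> bool:
    """Check Oh = min(Wi, Wj) and Ov = min(Hi, Hj) for two boxes."""
    Li, Ti, Ri, Bi = bi
    Lj, Tj, Rj, Bj = bj
    oh = overlap_1d(Li, Ri, Lj, Rj)  # horizontal overlap Oh
    ov = overlap_1d(Ti, Bi, Tj, Bj)  # vertical overlap Ov
    return oh == min(Ri - Li, Rj - Lj) and ov == min(Bi - Ti, Bj - Tj)


def merge_boxes(bi: Box, bj: Box) -> Box:
    """Bounding box of the union of two components."""
    return (min(bi[0], bj[0]), min(bi[1], bj[1]),
            max(bi[2], bj[2]), max(bi[3], bj[3]))
```

For instance, a small dot lying inside the box of a larger stroke passes the check, while two side-by-side components do not.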

Thus, to merge all the components that belong to the same character, criteria are needed to regulate the relationships between components. The present invention uses estimates of character spacing, character width, line width and line spacing as the basis for detailed merging. Character width and line width are estimated by collecting statistics on the widths and heights of all connected components and deriving a value from their distribution. If W is the width value shared by the most components in the distribution, then Cw = 1.2W is set (1.2 is an empirical value); the line width Lw is determined in a similar way. Estimating character spacing and line spacing is more involved, and character spacing is described first. To estimate character spacing, the left and right neighbouring components of each connected component are determined first. Suppose the left coordinate of connected component i is Li, its right coordinate Ri, its top coordinate Ti and its bottom coordinate Bi (with the origin at the upper left corner of the document). The left neighbour j of component i is then found as follows:

(a) Find all connected components k that satisfy the following condition:

$$\frac{(B_i - T_i) + (B_k - T_k) - \max(B_i, B_k) + \min(T_i, T_k)}{B_k - T_k} \geq \frac{1}{3}$$

(b) Let the connected components found in step (a) form a set N. The left neighbour j of component i is then the component that satisfies:

$$L_i - R_j = \min\{\, L_i - R_n \mid L_i - R_n > 0 \wedge n \in N \,\}$$

The numerator in the condition of step (a) is simply the vertical overlap of the two components, so the condition means that a component is considered only when its vertical overlap is at least 1/3 of its height. Step (b) then finds the component whose right boundary is closest to, but does not overlap, component i. To find the right neighbour of component i instead, the condition of step (b) only needs to be changed to:

$$L_j - R_i = \min\{\, L_n - R_i \mid L_n - R_i > 0 \wedge n \in N \,\}$$

Line spacing is estimated in much the same way as character spacing: the upper and lower neighbours of each component are first found according to the horizontal overlap, under conditions analogous to those of steps (a) and (b), which are therefore not repeated here. Once the neighbours (above, below, left and right) have been found, the distribution of the distances between each connected component and its neighbours can be obtained. In the present embodiment, the estimate is taken at the point that covers 4/5 of the distribution.
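The estimation step can be sketched as follows. Several points are assumptions rather than statements of the patent: W is read as the most frequent width (the histogram peak), the "4/5 point" is read as the smallest gap that is at least as large as 4/5 of all observed gaps, only left-neighbour gaps are collected (each adjacent pair is then counted once), and the names `reference_size`, `left_neighbour` and `estimate_char_spacing` are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (L, T, R, B), origin at the upper left corner


def reference_size(sizes: List[int], factor: float = 1.2) -> float:
    """Return factor times the most frequent size, e.g. Cw = 1.2 W."""
    counts = Counter(sizes)
    # Ties are broken in favour of the larger size (an arbitrary choice).
    peak = max(counts, key=lambda s: (counts[s], s))
    return factor * peak


def vertical_overlap(a: Box, b: Box) -> int:
    """Numerator of the step (a) condition."""
    return (a[3] - a[1]) + (b[3] - b[1]) - max(a[3], b[3]) + min(a[1], b[1])


def left_neighbour(i: int, boxes: List[Box]) -> Optional[int]:
    """Index of the left neighbour of boxes[i] per steps (a) and (b), or None."""
    best, best_gap = None, None
    for k, bk in enumerate(boxes):
        height_k = bk[3] - bk[1]
        if k == i or height_k <= 0:
            continue
        # Step (a): vertical overlap must be at least 1/3 of the height of k.
        if 3 * vertical_overlap(boxes[i], bk) < height_k:
            continue
        # Step (b): smallest positive distance Li - Rk.
        gap = boxes[i][0] - bk[2]
        if gap > 0 and (best_gap is None or gap < best_gap):
            best, best_gap = k, gap
    return best


def estimate_char_spacing(boxes: List[Box], num: int = 4, den: int = 5) -> Optional[int]:
    """Gap value at the num/den point of the sorted left-neighbour gaps."""
    gaps = []
    for i in range(len(boxes)):
        j = left_neighbour(i, boxes)
        if j is not None:
            gaps.append(boxes[i][0] - boxes[j][2])
    if not gaps:
        return None
    gaps.sort()
    idx = (len(gaps) * num + den - 1) // den - 1  # ceil(n * num / den) - 1
    return gaps[idx]
```

Line spacing could be estimated with the same pattern by swapping the roles of the horizontal and vertical coordinates.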

These estimates serve as the basis for further merging of the components. In detailed merging (step 205), the following three conditions are used to decide whether a merge is needed:

Condition one:

(1) the total width (height) after merging is not greater than 1.5 times the estimated character width (height);

(2) the horizontal (or vertical) overlap must be greater than half the width (or height) of the smaller connected component; and

(3) the horizontal and vertical gaps between the two are less than 3/4 of the estimated character spacing and line spacing. Expressed symbolically:

max(Bi，Bj)－min(Ti，Tj)≤1.5Lw∧max(Ri，Rj)－min(Li，Lj)≤1.5Cw

Condition two:

(1) the total width (height) after merging is greater than 1.5 times, but less than 2 times, the estimated character width (height);

(2) the horizontal (or vertical) overlap must be greater than half the width (or height) of the smaller connected component;

(3) the horizontal and vertical gaps between the two are less than 3/4 of the estimated character spacing and line spacing; and

(4) the width-to-height ratio after merging is between 0.6 and 2.5.

Expressed symbolically: [2.0Lw > max(Bi, Bj) - min(Ti, Tj) > 1.5Lw ∨ 2.0Cw > max(Ri, Rj) - min(Li, Lj) > 1.5Cw]

Condition three:

(1) the width (height) of a single connected component is less than 0.25 times the estimated character width (height);

(3) the width-to-height ratio after merging is between 0.6 and 2.5.

Expressed symbolically: [Bi - Ti ≤ 0.25Lw ∨ Bj - Tj ≤ 0.25Lw ∨ Ri - Li ≤ 0.25Cw ∨ Rj - Lj ≤ 0.25Cw]

Each pixel connected component is tested against these three conditions to find all the components that qualify. Condition one selects components whose merged height and width do not exceed 1.5 times Lw and Cw, and whose horizontal and vertical gaps do not exceed 3/4 of the estimated character spacing and line spacing; in addition, the horizontal (or vertical) overlap must be greater than half the width (or height) of the narrower (or shorter) component. Condition two allows for documents that contain some larger characters, whose merged height and width may exceed 1.5 times Lw or 1.5 times Cw. In that case the horizontal and vertical gaps must likewise not exceed 3/4 of the estimated character spacing and line spacing, and, besides the limit on the horizontal (vertical) overlap, the merged result must be approximately square (the lower limit is set to 0.6 because handwritten characters are often tall and thin). Condition three handles characters such as "二", "三" and "川", which contain elongated connected components.
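As an illustration, condition one can be sketched as a predicate on two bounding boxes. The sketch rests on assumptions the description does not spell out: the gap between two components is taken as the magnitude of a negative overlap (zero when they overlap), the overlap requirement is read as satisfied when either the horizontal or the vertical overlap passes, and the names `satisfies_condition_one`, `char_gap` and `line_gap` are hypothetical.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (L, T, R, B), origin at the upper left corner


def overlap_1d(a0: int, a1: int, b0: int, b1: int) -> int:
    """Overlap of two intervals; negative when they are apart."""
    return (a1 - a0) + (b1 - b0) - (max(a1, b1) - min(a0, b0))


def satisfies_condition_one(bi: Box, bj: Box, Cw: float, Lw: float,
                            char_gap: float, line_gap: float) -> bool:
    """Sketch of condition one for merging components i and j."""
    Li, Ti, Ri, Bi = bi
    Lj, Tj, Rj, Bj = bj
    # (1) merged size not greater than 1.5 times the estimated size.
    merged_w = max(Ri, Rj) - min(Li, Lj)
    merged_h = max(Bi, Bj) - min(Ti, Tj)
    size_ok = merged_h <= 1.5 * Lw and merged_w <= 1.5 * Cw
    # (2) overlap greater than half the extent of the smaller component
    #     (either direction is accepted here, which is an assumption).
    oh = overlap_1d(Li, Ri, Lj, Rj)
    ov = overlap_1d(Ti, Bi, Tj, Bj)
    overlap_ok = (oh > 0.5 * min(Ri - Li, Rj - Lj)
                  or ov > 0.5 * min(Bi - Ti, Bj - Tj))
    # (3) gaps below 3/4 of the estimated spacings (gap = -overlap if apart).
    gap_ok = (max(0, -oh) < 0.75 * char_gap
              and max(0, -ov) < 0.75 * line_gap)
    return size_ok and overlap_ok and gap_ok
```

With Cw = Lw = 20, the two halves of a character such as "明", separated by a gap of 2 pixels, pass the check, whereas two full-size characters side by side fail on the size limit.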

The qualifying components form a set C. The component k in C whose overlap area with component i is largest is then found, where the overlap area A is computed as follows:

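$$A = s \cdot \left| (B_i - T_i) + (B_k - T_k) - \max(B_i, B_k) + \min(T_i, T_k) \right| \cdot \left| (R_i - L_i) + (R_k - L_k) - \max(R_i, R_k) + \min(L_i, L_k) \right|$$

where s is defined as:

$$s = \begin{cases} 1, & \text{if } (B_i - T_i) + (B_k - T_k) - \max(B_i, B_k) + \min(T_i, T_k) > 0 \;\wedge\; (R_i - L_i) + (R_k - L_k) - \max(R_i, R_k) + \min(L_i, L_k) > 0, \\ -1, & \text{otherwise.} \end{cases}$$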

Component i and component k are then merged. The overlap area may be positive, as shown in Fig. 7A, or negative, as shown in Figs. 7B and 7C; either can serve as a reference value for character merging.
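The signed overlap area and the choice of the merge partner can be sketched as below. The `(L, T, R, B)` tuple layout and the function names `signed_overlap_area` and `best_partner` are assumptions for illustration; the arithmetic follows the expressions for A and s given above.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (L, T, R, B), origin at the upper left corner


def signed_overlap_area(bi: Box, bk: Box) -> int:
    """Overlap area A: positive only when both overlaps are positive."""
    Li, Ti, Ri, Bi = bi
    Lk, Tk, Rk, Bk = bk
    ov = (Bi - Ti) + (Bk - Tk) - max(Bi, Bk) + min(Ti, Tk)  # vertical term
    oh = (Ri - Li) + (Rk - Lk) - max(Ri, Rk) + min(Li, Lk)  # horizontal term
    s = 1 if (ov > 0 and oh > 0) else -1
    return s * abs(ov) * abs(oh)


def best_partner(bi: Box, candidates: List[Box]) -> Box:
    """Component of set C with the largest overlap area with component i."""
    return max(candidates, key=lambda bk: signed_overlap_area(bi, bk))
```

Because the sign is negative whenever the boxes are apart in either direction, truly overlapping candidates always rank above separated ones, and among separated candidates those with the smallest product of the two terms rank highest.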

Step 207 checks whether merging is finished. If suitable components can still be found to form set C, steps 201 to 205 are repeated until C is the empty set.
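The control flow of Fig. 5 can be summarized by a small driver loop. This is a skeleton only: the three stages are passed in as functions, and their names and signatures (`simple_merge`, `estimate`, `detailed_merge` returning the new components together with a flag telling whether any merge occurred) are assumptions chosen for the example.

```python
from typing import Callable, List, Tuple, TypeVar

C = TypeVar("C")  # a connected component, e.g. a bounding box
R = TypeVar("R")  # the estimated reference values


def iterative_merge(
    components: List[C],
    simple_merge: Callable[[List[C]], List[C]],
    estimate: Callable[[List[C]], R],
    detailed_merge: Callable[[List[C], R], Tuple[List[C], bool]],
) -> List[C]:
    """Repeat steps 201 to 205 until step 207 finds that nothing was merged."""
    while True:
        components = simple_merge(components)                    # step 201
        refs = estimate(components)                              # step 203
        components, merged = detailed_merge(components, refs)    # step 205
        if not merged:                                           # step 207
            return components
```

Re-estimating the reference values inside the loop reflects the point made earlier that the estimates improve as more components are merged.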

Figs. 8A to 8D illustrate the iterative merging algorithm at work. In Fig. 8A, the independent pixel connected components in the document have been found by the pixel backtracking method. After the first round of simple merging and detailed merging, the state shown in Fig. 8B results. Because merges took place, a second round of simple merging and detailed merging follows, producing the result of Fig. 8C. After the last round of merging, shown in Fig. 8D, no connected components satisfy the merging conditions any longer, and iterative merging stops.

After the independent characters have been cut out, there is as yet no ordering among them. The next step is therefore to establish the order of the characters (their row and column relationships), so that they can be passed to the recognition unit in sequence.

For a horizontally written document read from left to right and top to bottom, the ordering procedure adopted by the present invention is as follows. All characters are first projected upward, and those whose overlap with other characters does not exceed a set number of pixels are taken out to form a set A. Taking Fig. 2 as an example, A = {"among", "China", "people", "state", "shadow", "picture", "place", "reason", "know", "not", "association", "meeting", "reach"}. The highest character in A is then found, here "shadow". Next, the characters in A whose vertical overlap with the highest character exceeds another set value (for example 5 pixels) form a character set B0. From A - B0, the characters whose vertical overlap with B0 exceeds the set value are then added to B0 to form a new character set B1, and this is repeated until no new character is added. With this method, the character set B = {"among", "China", "people", "state", "shadow", "picture", "place", "reason", "know", "not", "association", "meeting"} is found in Fig. 2, so B is the first text line; within a line, the characters are ordered by the value of their left boundary. Once a line has been found, its characters are removed and the next line is found by the same method, and so on until all characters have been ordered.
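The ordering procedure can be sketched on character bounding boxes as follows. The sketch makes several assumptions: the upward projection test is applied pairwise against every character lying higher on the page, "vertical overlap with the set" is read as overlap with any character already in the set, and the names `order_characters`, `occlusion_tol` and `overlap_min` are hypothetical. The default of 5 pixels follows the example value in the description.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (L, T, R, B), origin at the upper left corner


def h_overlap(a: Box, b: Box) -> int:
    return (a[2] - a[0]) + (b[2] - b[0]) - max(a[2], b[2]) + min(a[0], b[0])


def v_overlap(a: Box, b: Box) -> int:
    return (a[3] - a[1]) + (b[3] - b[1]) - max(a[3], b[3]) + min(a[1], b[1])


def order_characters(boxes: List[Box], occlusion_tol: int = 0,
                     overlap_min: int = 5) -> List[List[Box]]:
    """Group character boxes into text lines, each sorted by left boundary."""
    remaining = list(boxes)
    lines: List[List[Box]] = []
    while remaining:
        # Set A: characters not covered from above by any other character.
        A = [c for c in remaining
             if all(h_overlap(c, o) <= occlusion_tol
                    for o in remaining if o is not c and o[1] < c[1])]
        # Seed the line with the highest character (smallest top coordinate).
        line = [min(A, key=lambda c: c[1])]
        grew = True
        while grew:  # grow B0 -> B1 -> ... until nothing new is added
            grew = False
            for c in A:
                if c not in line and any(v_overlap(c, m) > overlap_min for m in line):
                    line.append(c)
                    grew = True
        lines.append(sorted(line, key=lambda c: c[0]))
        # Remove the finished line and look for the next one.
        remaining = [c for c in remaining if c not in line]
    return lines
```

Growing the line through chained vertical overlaps, rather than comparing every character with the seed alone, is what lets a skewed line be followed from one character to the next.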

With the segmentation and ordering methods described above, the documents of Figs. 2 to 4, despite their irregular layouts, can all be segmented, producing the results shown in Figs. 9 to 11. In Figs. 9 to 11 each handwritten character is enclosed in a box; that is, each character has been properly and independently segmented and is ready for the subsequent recognition steps. Even the more complex documents of Figs. 12 and 13 can be processed by the character segmentation method of the present invention to produce independent characters for recognition.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Any person of ordinary skill in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the present invention is therefore defined by the appended claims.

Claims

1. A character segmentation method, applicable to a document formed of a plurality of characters, characterized in that the method comprises the following steps:

(a) searching for all pixel connected components in the document;

(b) merging the pixel connected components using an iterative connected-component merging algorithm to form a number of independent characters; and

(c) ordering the independent characters; wherein the iterative connected-component merging algorithm further comprises the following steps:

I. merging into one a second pixel connected component that is entirely contained within a first pixel connected component;

II. setting reference values for character width, character spacing, line width and line spacing;

III. merging into one the pixel connected components that, according to the reference values, are determined to belong to the same character; and

IV. deciding, according to whether the pixel connected components were merged, whether to repeat steps I to III.

2. the method for claim 1 is characterized in that, should (c) step comprise:

(I) get in this document that the horizontal projection lap is a set less than the first effective range person in the superiors' character;

(II) remove in the interior character of this set, the vertical projection lap that is adjacent character is less than the second effective range person;

(III) will gather interior character and arrange in regular turn, and word for word in this document, reject; And

(IV) repeat this (I), (II), (III) step, all in this document, reject up to all characters.

3. the method for claim 1 wherein is characterised in that, should (a) step be the pixel back tracking method.

4. the method for claim 1 is characterized in that, the first algorithm that merges of this iterative connection more comprises a statistics step, sets described reference value so that this II step to be provided.