CN1741035A - Blocks letter Arabic character set text dividing method - Google Patents


Publication number
CN1741035A
Authority
CN
China
Prior art keywords
character
baseline
point
contact
character block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200510086478
Other languages
Chinese (zh)
Other versions
CN1332348C (en)
Inventor
丁晓春
靳简明
王�华
彭良瑞
刘长松
方驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CNB2005100864781A
Publication of CN1741035A
Application granted
Publication of CN1332348C
Expired - Fee Related
Anticipated expiration

Landscapes

  • Character Input (AREA)

Abstract

The invention classifies character blocks, selects a subset of them for horizontal projection, and divides the text image into subregions. It then detects multi-line subregions and merges character blocks to segment each multi-line subregion into text lines. Next it estimates the baseline and top line positions of each text line and segments the text line into connected character segments. Finally, it uses the features of character cut points to find cut points on, above, and below the baseline, and applies structural rules to delete over-segmentation points. The character segmentation accuracy of the invention can exceed 99%.

Description

Blocks letter Arabic character set text dividing method
Technical field
The printed Arabic character set text segmentation method belongs to the field of character segmentation in optical character recognition (OCR).
Background art
Arabic is one of the working languages of the United Nations and is widely used in countries such as Egypt, Algeria, Morocco, and Saudi Arabia. Uyghur, Kazakh, and Kyrgyz are important minority languages of China. Although they do not belong to the same language family as Arabic, they are all written with Arabic characters, with only slightly different character sets. In the present invention, text written with the Arabic character set, such as Arabic, Uyghur, and Kazakh text, is collectively called Arabic character set text, or Arabic-script text for short. Both in China and internationally, there is urgent practical demand and broad application potential for recognizing Arabic-script text.
In Arabic-script text, Arabic-script characters are written from right to left, while text in other scripts (for example Chinese characters, English, and numerals) is written from left to right. Standard Arabic has 28 basic characters and Uyghur has 32. Each Arabic-script character has one to four written forms: the initial form, whose tail connects to the next character; the medial form, which connects to the adjacent characters at both ends; the final form, whose head connects to the previous character; and the isolated form, which connects to neither neighbor. Each character consists of one main part and several additional parts that are not connected to the main part. An additional part may lie above, below, or inside the main part. The additional parts of standard Arabic characters are one to three dots and
Figure A20051008647800081
The additional parts of Uyghur also include
Figure A20051008647800082
and others. Many characters share the same main part and differ only in the number or shape of the additional parts. In both handwritten and printed Arabic-script text, characters that can be connected are always written or printed connected, and the part where characters connect is called the baseline. A word consists of one or more connected character segments that are not joined to one another. Fig. 1 is the basic character table of standard Arabic, and Fig. 2 illustrates some features of Arabic-script text.
A typical printed Arabic-script text recognition system is shown in Fig. 3. The preprocessing stage enhances the input text image. The text line segmentation stage divides a multi-line text region into text lines. The connected character segment segmentation stage divides a text line into connected character segments that are not joined to one another. The character segmentation stage decomposes each connected character segment into basic components such as characters or strokes. The recognition stage extracts effective features and uses a classifier to recognize the segmented components. The post-processing stage uses a dictionary, a language model, and similar means to correct recognition errors and improve the text recognition rate.
According to the relationship between segmentation and recognition, systems fall into three classes: segmentation followed by recognition, recognition-driven segmentation, and holistic recognition without segmentation.
First, segmentation followed by recognition (paths (1) and (2) in Fig. 3 are both inactive). Such a system first segments each connected character segment into characters or strokes, then recognizes the resulting components and combines them into the character recognition result. These methods are not limited by dictionary size, can recognize arbitrary character strings, and can handle large (unlimited) vocabularies.
Second, recognition-driven segmentation (path (1) in Fig. 3 may be active or inactive; path (2) is active). Such a system need not segment characters before recognition (when path (1) is inactive, a pre-segmentation is performed). Instead, it derives the character segmentation from the recognition result, so character boundaries are a by-product of recognition. These methods depend on the font, are computationally expensive, and are sensitive to deformation. If one character in the middle is misrecognized, or part of a character is taken for another character, the recognition of the following characters is also affected.
Third, holistic recognition without segmentation (path (1) in Fig. 3 is active; path (2) is inactive). Such a system does not identify individual characters but directly recognizes the whole connected character segment, thereby avoiding segmentation. These methods derive from speech recognition and can handle only small (limited) vocabularies.
Because general printed Arabic-script text has a large vocabulary and many fonts but relatively regular typesetting, segmentation followed by recognition is the most suitable approach. Designing a segmentation algorithm for printed Arabic-script text is therefore the key to building a printed Arabic-script text recognition system. Research on printed Arabic-script text recognition currently lags far behind that on other widely used scripts (such as the Latin alphabet, Chinese characters, and Japanese), mainly because the segmentation problem has not been solved well. For example: horizontal projection cannot segment text lines when the text is skewed or when characters in adjacent lines touch; the large number of additional character parts blurs the boundaries between text lines or produces spurious text lines; vertical projection cannot segment connected character segments when adjacent segments overlap horizontally; and typical character segmentation methods look for cut points only on the baseline, although some characters actually touch above or below the baseline.
Object of the invention
The object of the invention is to provide a reliable method for segmenting printed Arabic-script text in multiple fonts and multiple font sizes. As shown in Fig. 4, on the basis of character block classification, the invention segments the text image into text lines, then estimates the baseline and top line positions of each text line, segments the text line into connected character segments, and finally finds all candidate cut points according to the positional characteristics of character cut points. The segmentation method implemented according to the invention has been applied in a printed Arabic-script text recognition system.
Technical solution
The invention is characterized in that the input printed Arabic-script text image is correctly segmented into single-character images by the following steps.
1 Text line segmentation
The purpose of text line segmentation is to divide a text region into consecutive text lines. Horizontal projection is the most direct line segmentation method, but it cannot handle skewed text. In addition, because Arabic-script text contains many additional character parts that are not connected to the main parts, a blank gap found by horizontal projection may be the space between text lines or the space between an additional part and a main part within one text line. Combining partial projection with partial contour tracing can handle text with a constant skew direction, but not characters that touch across lines.
To solve these problems, the invention proposes the following text line segmentation method (Fig. 5). On the basis of character block classification, a subset of character blocks is selected for horizontal projection, and the input text image is divided into subregions. Multi-line subregions are then detected, and character blocks are merged to segment each multi-line subregion into text lines. Finally, character blocks that touch across lines are cut, and small character blocks are assigned to the text line they belong to according to distance. This method handles not only unskewed text but also skewed text and text in which characters touch across lines.
1.1 Character block classification
Let $I$ denote the input text image, $H$ the height of $I$, and $W$ the width of $I$. A character block $C$ on $I$ is expressed as
$$C = \begin{bmatrix} C(0,0) & C(0,1) & \cdots & C(0,w-1) \\ C(1,0) & C(1,1) & \cdots & C(1,w-1) \\ \vdots & \vdots & \ddots & \vdots \\ C(h-1,0) & C(h-1,1) & \cdots & C(h-1,w-1) \end{bmatrix}_{[l,t,r,b]},$$
where $C(y,x)=1$ denotes a black pixel, $C(y,x)=0$ denotes a white pixel, and $l$, $t$, $r$, $b$, $w$, and $h$ denote the left boundary, top boundary, right boundary, bottom boundary, width, and height of $C$, respectively (Fig. 6). The set of all character blocks on $I$ is $BLOCK=\{C_{nB} \mid nB=1,2,\dots,nBlock\}$. In the invention, $C$ with various subscripts and superscripts denotes a specific character block (for example $C_{nB}$, $C^{(line)}$), and $l$, $t$, $r$, $b$, $w$, and $h$ with the same subscripts and superscripts denote the left boundary, top boundary, right boundary, bottom boundary, width, and height of that block (for example $h_{nB}$, $r^{(line)}$).
According to their height and width, all character blocks can be divided into three classes. The MIDDLE class contains ordinary connected character segments, whose aspect ratio is relatively stable. The BIG class contains characters that touch across lines and has the greatest height. The SMALL class contains small isolated characters, additional character parts, punctuation marks, and the like, with small height and small width.
The invention determines the class of $C_{nB}$ by the following formula,
Figure A20051008647800102
where
$$\bar h = \frac{1}{nBlock}\sum_{nB=1}^{nBlock} h_{nB}$$
is the average height of all character blocks.
1.2 Subregion segmentation
To remove the influence of additional character parts and of characters that touch across lines, only MIDDLE-class character blocks are kept for horizontal projection. The text region is then divided into subregions at the positions where the horizontal projection value is 0. Each subregion contains one or more text lines.
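The subregion split can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each MIDDLE-class block is given as a hypothetical `(l, t, r, b)` tuple with inclusive pixel bounds, and each block is the tight bounding box of one connected component, so every row from `t` to `b` holds at least one black pixel. Under those assumptions the rows with a non-zero horizontal projection are exactly the rows covered by some box. The name `split_subregions` is illustrative and does not come from the patent.

```python
def split_subregions(middle_blocks, height):
    """Return (top, bottom) row ranges separated by rows with zero projection.

    middle_blocks: iterable of (l, t, r, b) boxes, inclusive bounds (assumed format).
    height: image height H in pixels.
    """
    covered = [False] * height
    for _, t, _, b in middle_blocks:
        for y in range(t, b + 1):
            covered[y] = True  # this row has a non-zero projection value

    regions, start = [], None
    for y, on in enumerate(covered):
        if on and start is None:
            start = y                      # a subregion begins
        elif not on and start is not None:
            regions.append((start, y - 1))  # a zero-projection row ends it
            start = None
    if start is not None:
        regions.append((start, height - 1))
    return regions
```

For example, `split_subregions([(0, 2, 5, 4), (6, 3, 9, 6), (0, 10, 4, 12)], 15)` yields two subregions because rows 7 to 9 are empty.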
1.3 Multi-line subregion detection
As shown in Fig. 7, a multi-line subregion necessarily contains many pairs of connected character segments that overlap horizontally but not vertically, whereas a single-line subregion contains no such pairs.
Therefore, in each subregion, all pairs of MIDDLE-class character blocks that overlap horizontally but not vertically are found:
$$CC=\{(C_I,C_J) \mid XOL(C_I,C_J)>0,\ YOL(C_I,C_J)=0,\ C_I\in MIDDLE,\ C_J\in MIDDLE\},$$
where
$$XOL(C_I,C_J)=\max(\min(r_I,r_J)-\max(l_I,l_J),\,0)$$
is the overlap length of $C_I$ and $C_J$ in the horizontal direction, and
$$YOL(C_I,C_J)=\max(\min(b_I,b_J)-\max(t_I,t_J),\,0)$$
is the overlap length of $C_I$ and $C_J$ in the vertical direction. If $\|CC\|>10$, the subregion is judged to be a multi-line subregion.
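The overlap functions and the pair count translate directly into code. The sketch below assumes blocks are hypothetical `(l, t, r, b)` tuples and counts unordered pairs; the text does not say whether $(C_I, C_J)$ and $(C_J, C_I)$ are counted separately, so that choice is an assumption, as is the parameter name `min_pairs`.

```python
from itertools import combinations


def xol(a, b):
    """Horizontal overlap length of two (l, t, r, b) boxes."""
    return max(min(a[2], b[2]) - max(a[0], b[0]), 0)


def yol(a, b):
    """Vertical overlap length of two (l, t, r, b) boxes."""
    return max(min(a[3], b[3]) - max(a[1], b[1]), 0)


def is_multiline(middle_blocks, min_pairs=10):
    """True if more than min_pairs pairs overlap horizontally but not vertically."""
    pairs = [
        (a, b)
        for a, b in combinations(middle_blocks, 2)  # unordered pairs (assumption)
        if xol(a, b) > 0 and yol(a, b) == 0
    ]
    return len(pairs) > min_pairs
```

Two stacked rows of eleven aligned blocks produce eleven such pairs and are flagged as multi-line, while a single row produces none.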
1.4 Multi-line subregion segmentation
Because Arabic-script text is written from right to left, all MIDDLE-class character blocks inside a multi-line subregion are sorted by right boundary in descending order. The blocks are then taken out one at a time in that order; the block taken out at step $n$ is denoted $C_n$. $C_n$ is compared with all blocks already taken out, $C_1, C_2, \dots, C_{n-1}$, to find the block $C_Y$ with the largest vertical overlap, that is,
$$C_Y = \arg\max_{C_i} YOL(C_n, C_i),\quad i=1,2,\dots,n-1.$$
If the vertical overlap of $C_n$ and $C_Y$ is large enough, that is,
$$YOL(C_n,C_Y) > \bar h/2,$$
then $C_n$ and $C_Y$ belong to the same text line; otherwise $C_n$ starts a new text line. Once all character blocks have been taken out, the text line segmentation of the multi-line subregion is complete.
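The greedy grouping described above can be sketched as follows. Blocks are again hypothetical `(l, t, r, b)` tuples, `mean_h` stands for $\bar h$, and the function name `group_into_lines` is illustrative. Lines are returned in order of creation, not sorted top to bottom.

```python
def yol(a, b):
    """Vertical overlap length of two (l, t, r, b) boxes."""
    return max(min(a[3], b[3]) - max(a[1], b[1]), 0)


def group_into_lines(middle_blocks, mean_h):
    """Assign each block to the line of its best vertically overlapping predecessor."""
    ordered = sorted(middle_blocks, key=lambda blk: blk[2], reverse=True)
    lines = []    # each entry is the list of blocks in one text line
    taken = []    # blocks processed so far
    line_of = []  # line index of each processed block, parallel to `taken`
    for blk in ordered:
        best_i, best_ov = None, 0
        for i, prev in enumerate(taken):
            ov = yol(blk, prev)
            if ov > best_ov:
                best_i, best_ov = i, ov
        if best_i is not None and best_ov > mean_h / 2:
            idx = line_of[best_i]   # same line as C_Y
            lines[idx].append(blk)
        else:
            idx = len(lines)        # start a new text line
            lines.append([blk])
        taken.append(blk)
        line_of.append(idx)
    return lines
```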
1.5 Cutting BIG-class character blocks
Because a BIG-class character block consists of characters touching across lines, it must be cut where it crosses from one text line to the next. If a character block $C_B \in BIG$ spans several text lines, then in every text line it spans there must be a character block near $C_B$ that overlaps $C_B$ vertically.
The invention therefore cuts BIG-class character blocks as follows.
For each character block $C_B\in BIG$, a character block $C^{(nL)}$ is sought in each text line $L^{(nL)}$ ($nL=1,2,\dots,nLine$):
$$C^{(nL)} = \arg\min_{C_{NB}^{(nL)}} \left| \frac{l_B+r_B}{2} - \frac{l_{NB}^{(nL)}+r_{NB}^{(nL)}}{2} \right|,\quad C_{NB}^{(nL)}\in NB^{(nL)},$$
where
$$NB^{(nL)} = \{ C_{NB}^{(nL)} \mid C_{NB}^{(nL)}\in L^{(nL)},\ C_{NB}^{(nL)}\in MIDDLE,\ HDIS(C_{NB}^{(nL)},C_B) < \bar h\times 5,\ YOL(C_{NB}^{(nL)},C_B) > \bar h/3 \}$$
is the set of all MIDDLE-class character blocks in the $nL$-th text line that are near $C_B$ and overlap it vertically, and
$$HDIS(C_{NB}^{(nL)},C_B) = \max(\max(l_{NB}^{(nL)},l_B) - \min(r_{NB}^{(nL)},r_B),\,0)$$
is the horizontal distance between $C_{NB}^{(nL)}$ and $C_B$. $C^{(nL)}$ is thus the character block in $NB^{(nL)}$ that is horizontally closest to $C_B$.
If $C^{(m)}$ and $C^{(m+1)}$ both exist, $C_B$ spans the $m$-th and $(m+1)$-th text lines and must be cut at
$$y = \frac{b^{(m)}+t^{(m+1)}}{2},$$
which is where it crosses between the two text lines.
$C_B$ is cut at every place where it crosses between text lines; each resulting part is labeled MIDDLE and assigned to the corresponding text line.
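As a small illustration of the last step only, the sketch below computes the cut rows for one BIG-class block once the nearest block $C^{(nL)}$ in each text line is known. The input format is an assumption: a list ordered top to bottom whose entry for line $nL$ is either a hypothetical `(l, t, r, b)` tuple or `None` when $NB^{(nL)}$ is empty. The name `cut_rows` is illustrative.

```python
def cut_rows(nearest_per_line):
    """Return the y positions at which a BIG-class block should be cut.

    nearest_per_line[m] is the (l, t, r, b) box of C^(m), or None if no
    qualifying block exists in text line m (assumed input format).
    """
    rows = []
    for upper, lower in zip(nearest_per_line, nearest_per_line[1:]):
        if upper is not None and lower is not None:
            # y = (b^(m) + t^(m+1)) / 2: midway between the two lines
            rows.append((upper[3] + lower[1]) / 2)
    return rows
```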
1.6 Inserting SMALL-class character blocks
For each character block $C_S\in SMALL$, the distance between $C_S$ and every text line is computed, and $C_S$ is assigned to the nearest text line. The distance between $C_S$ and text line $L^{(nL)}$ is computed as follows.
First, on the left side of $C_S$ within $L^{(nL)}$, find the MIDDLE-class character block $C_l$ that is horizontally nearest to $C_S$, that is,
$$C_l = \arg\min_{C_{LB}^{(nL)}} \left( l_S - l_{LB}^{(nL)} \right),\quad C_{LB}^{(nL)}\in LB^{(nL)},$$
where
$$LB^{(nL)} = \{ C_{LB}^{(nL)} \mid C_{LB}^{(nL)}\in L^{(nL)},\ C_{LB}^{(nL)}\in MIDDLE,\ l_S > l_{LB}^{(nL)} \}.$$
Then, on the right side of $C_S$ within $L^{(nL)}$, find the MIDDLE-class character block $C_r$ that is horizontally nearest to $C_S$, that is,
$$C_r = \arg\min_{C_{RB}^{(nL)}} \left( r_{RB}^{(nL)} - r_S \right),\quad C_{RB}^{(nL)}\in RB^{(nL)},$$
where
$$RB^{(nL)} = \{ C_{RB}^{(nL)} \mid C_{RB}^{(nL)}\in L^{(nL)},\ C_{RB}^{(nL)}\in MIDDLE,\ r_S > r_{RB}^{(nL)} \}.$$
The distance between $C_S$ and text line $L^{(nL)}$ is then determined from $C_l$ and $C_r$.
2 Baseline and top line estimation
Because Arabic-script characters connect along the baseline, locating the baseline of a text line is critical for character segmentation. Let $L$ denote the text line image currently being processed, $H_L$ its height, and $W_L$ its width.
2.1 Baseline height estimation
As Fig. 2 shows, the baselines that connect characters in a text line $L$ all have the same height. The most frequent length of vertical black-pixel runs in $L$ is therefore the height of the baseline.
The invention represents a vertical black-pixel run by the triple $(y^{(s)}, y^{(e)}, x^{(se)})$, where $y^{(s)}$ is the vertical start position of the run, $y^{(e)}$ is its vertical end position, and $x^{(se)}$ is its horizontal position. The set of all vertical black-pixel runs in $L$ is
$$VRUN = \{ (y_{nR}^{(s)}, y_{nR}^{(e)}, x_{nR}^{(se)}) \mid nR = 1,2,\dots,nRun \}.$$
The number of vertical black-pixel runs of height $runH$ in $L$ is then
$$VH(runH) = \left\| \{ (y_{nR}^{(s)}, y_{nR}^{(e)}, x_{nR}^{(se)}) \mid (y_{nR}^{(s)}, y_{nR}^{(e)}, x_{nR}^{(se)}) \in VRUN,\ y_{nR}^{(e)} - y_{nR}^{(s)} = runH \} \right\|,\quad runH = 1,2,\dots,H_L,$$
and the height of the baseline is
$$H_0 = \arg\max_{runH} VH(runH),\quad runH = 1,2,\dots,H_L.$$
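A minimal sketch of this run-length histogram follows. It assumes the text line image is a list of rows of 0/1 values with 1 for black, and it measures a run by its pixel count; the text writes the height as $y^{(e)} - y^{(s)}$, so whether the end position is inclusive is an assumption here. The function name `baseline_height` is illustrative.

```python
from collections import Counter


def baseline_height(img):
    """Most frequent vertical black-run length in a binary text line image.

    img: list of rows, each a list of 0/1 with 1 = black (assumed format).
    Returns 0 if the image contains no black pixel.
    """
    height, width = len(img), len(img[0])
    counts = Counter()
    for x in range(width):
        run = 0
        for y in range(height):
            if img[y][x]:
                run += 1              # extend the current vertical run
            elif run:
                counts[run] += 1      # run ended: record its length
                run = 0
        if run:
            counts[run] += 1          # run reaching the bottom edge
    return counts.most_common(1)[0][0] if counts else 0
```

When two lengths are equally frequent, `most_common` returns the one that was encountered first, which is an arbitrary tie-break not specified by the text.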
2.2 Baseline position estimation
Ideally, the baselines of all connected character segments in a text line lie on one horizontal line. In real scanned text images, however, text lines are usually slightly skewed or curved, so different connected character segments may have different baseline positions. To address this, the invention divides a long text line into several short parts and estimates the baseline position of each part separately, so that the baseline position can be considered constant within each part.
The text line $L$ is divided into
Figure A20051008647800134
parts, each of length $\alpha\times H_0$. Let $HP^{(nP)}(y)$ ($y=0,1,\dots,H_L-1$) be the horizontal projection of $L$ when only the image of the $nP$-th part is kept. The baseline position of the $nP$-th part is then
$$B_{Top}^{(nP)} = \arg\max_{y} \sum_{k=0}^{H_0-1} HP^{(nP)}(y+k),\quad y=0,1,\dots,H_L-H_0,$$
$$B_{Btm}^{(nP)} = B_{Top}^{(nP)} + H_0 - 1,$$
where $B_{Top}^{(nP)}$ and $B_{Btm}^{(nP)}$ denote the upper and lower boundaries of the baseline, respectively. In a real system, $\alpha$ can take a value between 10 and 15. The baseline position at any point of $L$ is then
$$B_{Top}(x) = B_{Top}^{(nP)},\quad B_{Btm}(x) = B_{Btm}^{(nP)} \qquad \text{for } l^{(nP)} \le x \le r^{(nP)},$$
where $l^{(nP)}$ and $r^{(nP)}$ are the left and right boundaries of the $nP$-th part, respectively.
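The per-part estimate is a sliding-window maximum over the horizontal projection. The sketch below assumes the same 0/1 row-list image format, an inclusive column range `x_left..x_right` for the part, and `h0` no larger than the image height; the function name `baseline_position` is illustrative.

```python
def baseline_position(img, h0, x_left, x_right):
    """Return (B_top, B_btm) for the columns x_left..x_right (inclusive).

    img: list of rows of 0/1 values, 1 = black (assumed format).
    h0: baseline height H_0, assumed to satisfy h0 <= len(img).
    """
    # Horizontal projection restricted to this part of the text line.
    rows = [sum(row[x_left:x_right + 1]) for row in img]
    # Window of h0 rows with the largest total; ties go to the topmost window.
    b_top = max(range(len(rows) - h0 + 1), key=lambda y: sum(rows[y:y + h0]))
    return b_top, b_top + h0 - 1
```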
2.3 Top line position estimation
As Fig. 2 shows, once additional parts are removed, the distances from the tops of Arabic-script characters to the upper boundary of the baseline fall clearly into two classes: the distance from the top of a tall character to the baseline upper boundary, and the distance from the top of a short character to the baseline upper boundary. The top line position is the position of the tops of the short characters.
Let $U_L(x)$ ($x=0,1,\dots,W_L-1$) be the upper contour of text line $L$ after additional parts are removed. Then
$$E_L=\{x \mid U_L(x)<U_L(x-1),\ U_L(x)<U_L(x+1),\ 1\le x<W_L-1\}$$
is the set of local minimum points of $U_L(x)$.
The invention averages the distances twice to separate the distances of tall character tops and short character tops from the baseline upper boundary.
Define
$$BU(x) = \begin{cases} B_{Top}(x)-U_L(x), & x\in E_L \\ 0, & x\notin E_L, \end{cases}$$
so that
$$h' = \frac{1}{\|E_L\|}\sum_{x\in E_L} BU(x).$$
Letting $E'=\{x \mid BU(x)<h',\ x\in E_L\}$ gives
$$H_1 = \frac{1}{\|E'\|}\sum_{x\in E'} BU(x).$$
$H_1$ is the distance between the top line and the upper boundary of the baseline.
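The two-pass averaging can be sketched as below. The inputs are assumed to be two equal-length lists: `upper[x]` for $U_L(x)$ and `b_top[x]` for $B_{Top}(x)$, in image coordinates where y grows downward. The two fallback branches (no local minimum, or no distance below the first mean) are not covered by the text and are assumptions of this sketch.

```python
def top_line_distance(upper, b_top):
    """Estimate H_1, the distance from the top line to the baseline top.

    upper: upper contour U_L(x) after removing additional parts (assumed list).
    b_top: baseline upper boundary B_Top(x) per column (assumed list).
    """
    minima = [
        x for x in range(1, len(upper) - 1)
        if upper[x] < upper[x - 1] and upper[x] < upper[x + 1]
    ]
    if not minima:
        return None                       # assumption: nothing to estimate
    bu = [b_top[x] - upper[x] for x in minima]
    h_mean = sum(bu) / len(bu)            # first average: h'
    low = [d for d in bu if d < h_mean]   # keep the short-character tops
    if not low:
        return h_mean                     # assumption: all tops equally high
    return sum(low) / len(low)            # second average: H_1
```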
3 Connected character segment segmentation
After the text lines are obtained, each text line must be segmented into connected character segments. Each connected character segment consists of one main part and several additional parts. Vertical projection cannot handle adjacent connected character segments that overlap horizontally, so the invention first marks the main parts of connected character segments according to the baseline position and the character block class, and then merges main parts and additional parts according to distance.
The invention treats the following two types of character block as main parts of connected character segments: MIDDLE-class character blocks crossed by the baseline; and SMALL-class character blocks that are crossed by the baseline and do not overlap any other character block horizontally. Some SMALL-class character blocks are treated as main parts because they may be punctuation marks or isolated characters. The remaining SMALL-class character blocks are additional parts of connected character segments.
For each SMALL-class character block $C_S$ that is an additional part, the nearest character block $C_M$ that is a main part is found; $C_S$ and $C_M$ are then the additional part and the main part of the same connected character segment. The distance between $C_S$ and $C_M$ is defined as
$$DIS(C_S,C_M)=|l_S+r_S-l_M-r_M|.$$
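The merge step amounts to a nearest-neighbor assignment under the distance $DIS$, which compares the horizontal centers of two blocks (scaled by two). A brief sketch, assuming hypothetical `(l, t, r, b)` tuples and a non-empty list of main parts; the function names are illustrative:

```python
def dis(a, b):
    """DIS of two (l, t, r, b) boxes: |l_a + r_a - l_b - r_b|."""
    return abs(a[0] + a[2] - b[0] - b[2])


def attach_additional_parts(main_parts, additional_parts):
    """Map each main-part index to the additional parts nearest to it."""
    segments = {i: [] for i in range(len(main_parts))}
    for part in additional_parts:
        nearest = min(range(len(main_parts)),
                      key=lambda i: dis(part, main_parts[i]))
        segments[nearest].append(part)
    return segments
```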
4 Character segmentation
Character segmentation is the process of cutting a connected character segment into single characters. Let $P$ denote the connected character segment image currently being processed, $H_P$ its height, and $W_P$ its width. The invention divides character cut points into three classes: cut points on the baseline, cut points above the baseline, and cut points below the baseline. As shown in Fig. 8, the invention searches for all possible cut point positions of each class in turn, then checks the validity of the cut points, and finally assigns the additional parts to the corresponding character main parts.
4.1 Candidate cut points on the baseline
As shown in Fig. 9, a cut point on the baseline satisfies one of the following situations:
(A) The number of runs changes. At the left boundary of a character, the number of vertical runs increases from 1 to 2 or more; at the right boundary, it decreases from 2 or more to 1.
(B) The number of runs is unchanged, but the run length changes considerably. At the left boundary of a character, the upper and/or lower contour moves away from the baseline region; at the right boundary, the upper and/or lower contour returns to the baseline region.
(C) The number of runs is unchanged and the upper contour position changes gradually, so that the accumulated change is large. At the left boundary of a character, the upper contour gradually moves away from the baseline region; at the right boundary, it gradually returns to the baseline region.
Define the function $D(x)$ to describe the distance between the upper and lower contours of $P$ and the baseline:
$$D(x)=\max(B_{Top}(x)-U_P(x),\,0)+\max(V_P(x)-B_{Btm}(x),\,0),\quad x=0,1,\dots,W_P-1,$$
where $U_P(x)$ and $V_P(x)$ are the upper contour and the lower contour of $P$ after additional parts are removed.
A candidate cut point $x_O$ on the baseline then satisfies conditions (1) and (2), or conditions (1) and (3), below. In conditions (1) to (3), the "+" and "-" operations correspond to the left cut point and the right cut point of a character, respectively:
(1) The cut point $x_O$ is on the baseline:
$$D(x_O)\le 2;$$
(2) Cut point situation (A) or (B):
$$D(x_O\pm 1)-D(x_O)>1.5\times H_0;$$
(3) Cut point situation (C):
$$\begin{cases} D(x_O \pm i) - D(x_O) > 0.75\times H_0, & i = e \\ D(x_O \pm i) \ge D(x_O \pm i \mp 1), & i = 1,2,\dots,e, \end{cases}$$
where $e$ satisfies $x_O\pm e\in E_P$ and $E_P$ is the set of local minimum points of $R_P(x)$.
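The function $D(x)$ and the simpler candidate test, conditions (1) and (2), can be sketched as follows. Condition (3) is deliberately left out because it depends on $E_P$, which the text defines through $R_P(x)$ without further detail. The inputs are assumed to be per-column lists, and the function names are illustrative.

```python
def contour_distance(upper, lower, b_top, b_btm):
    """D(x): how far the upper and lower contours extend beyond the baseline band."""
    return [
        max(bt - u, 0) + max(v - bb, 0)
        for u, v, bt, bb in zip(upper, lower, b_top, b_btm)
    ]


def baseline_cut_candidates(d, h0):
    """Columns satisfying condition (1) and condition (2) on either side.

    d: list of D(x) values; h0: baseline height H_0.
    Border columns are skipped because they lack one neighbor.
    """
    candidates = []
    for x in range(1, len(d) - 1):
        if d[x] > 2:
            continue                                   # condition (1) fails
        jump_right = d[x + 1] - d[x] > 1.5 * h0        # "+" case
        jump_left = d[x - 1] - d[x] > 1.5 * h0         # "-" case
        if jump_right or jump_left:
            candidates.append(x)
    return candidates
```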
4.2 Candidate cut points above the baseline
As shown in Fig. 10, if position $x_T$ is a candidate cut point above the baseline, $x_T$ must satisfy the following conditions:
(1) The cut point itself is a local lowest point of the upper contour:
$$U_P(x_T)>U_P(x_T-1),\quad U_P(x_T)>U_P(x_T+1);$$
(2) The upper contour at the cut point is above the top line:
$$U_P(x_T)<B_{Top}(x_T)-H_1;$$
(3) The lowest point of the lower contour to the left of the cut point is below the baseline:
$$\max_{x<x_T} V_P(x) > B_{Btm}(x_T);$$
(4) The lowest point of the lower contour to the right of the cut point is below the baseline:
$$\max_{x_T<x<W_P-1} V_P(x) > B_{Btm}(x_T).$$
4.3 Candidate cut points below the baseline
As shown in Fig. 11, if position $x_B$ is a candidate cut point below the baseline, $x_B$ must satisfy the following conditions:
(1) The highest contour point $TL$ in the 3 columns to the left of the cut point is above the baseline:
$$TL = \min_{x_B-3<x<x_B} V_P(x) < B_{Top}(x_B);$$
(2) The highest contour point $TR$ in the 3 columns made up of the cut point itself and the columns to its right is below the baseline:
$$TR = \min_{x_B\le x<x_B+3} V_P(x) < B_{Btm}(x_B);$$
(3) The height difference between $TL$ and $TR$ is greater than twice the baseline height:
$$TR-TL>2\times H_0.$$
4.4 Checking the validity of cut points
The validity of each cut point is checked in turn from right to left, and invalid cut points are deleted. If two adjacent cut points are very close together relative to the baseline height, and the character height between them is no more than twice the baseline height, then the left cut point is invalid. Specifically, let $x_r$ and $x_l$ be two adjacent cut points with $x_r>x_l$. If
$$\begin{cases} x_r - x_l \le \dfrac{H_0}{2} \\ \max\limits_{x_l<x<x_r} V_P(x) - \min\limits_{x_l<x<x_r} U_P(x) \le H_0\times 2, \end{cases}$$
then cut point $x_l$ is deleted.
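A hedged sketch of this right-to-left pruning follows. Several details are assumptions rather than statements of the patent: the spacing threshold defaults to half the baseline height (one reading of the flattened formula) and is exposed as the hypothetical parameter `max_gap`; each cut point is compared with the nearest surviving cut point to its right; and when no column lies strictly between two cut points, the character height is taken as 0.

```python
def prune_cut_points(cuts, upper, lower, h0, max_gap=None):
    """Delete the left cut point of any pair that is too close and too low.

    cuts: candidate cut columns; upper/lower: contours U_P(x), V_P(x);
    h0: baseline height H_0; max_gap: spacing threshold (assumed h0 / 2).
    """
    if max_gap is None:
        max_gap = h0 / 2
    kept = []  # surviving cut points, from right to left
    for x_l in sorted(cuts, reverse=True):
        if kept:
            x_r = kept[-1]
            cols = range(x_l + 1, x_r)  # columns strictly between the pair
            if cols:
                height = max(lower[x] for x in cols) - min(upper[x] for x in cols)
            else:
                height = 0              # assumption for adjacent columns
            if x_r - x_l <= max_gap and height <= 2 * h0:
                continue                # x_l is invalid: drop it
        kept.append(x_l)
    return kept[::-1]
```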
4.5 Assigning additional parts
When two consecutive characters both have additional parts, the additional parts may also touch (Fig. 12). If an additional part overlaps the main parts of several characters horizontally, it is a touching additional part. If this additional part spans two adjacent cut points $x_R$ and $x_L$, it is cut between $x_R$ and $x_L$ at the position where its vertical projection value is smallest.
After the touching additional parts have been cut, each additional part is assigned to the nearest character main part. The distance between additional part $C_A$ and character main part $C_M$ is
$$DIS(C_A,C_M)=|l_A+r_A-l_M-r_M|.$$
The Arabic-script text segmentation algorithm implemented according to the invention has been applied in a printed Arabic-script text recognition system. Arabic and Uyghur newspapers and periodicals, scanned as black-and-white binary images at 300 DPI, were used to test the system; the recognition results are shown in Table 1. Only about 30% of the recognition errors are caused by segmentation errors, which means the character segmentation accuracy is above 99%. The test results demonstrate the effectiveness of the proposed printed Arabic-script text segmentation method.
              Number of characters    Recognition rate
Test set 1    11966                   95.38%
Test set 2    11790                   97.79%
Test set 3    9912                    96.75%
Test set 4    10788                   95.80%
Total         44456                   96.43%
Table 1 Test results
The proposed segmentation method for printed Arabic-script text in multiple fonts and font sizes achieved excellent segmentation performance in experiments and has broad application prospects. In summary, the invention has the following advantages:
First, the proposed text line segmentation method, which on the basis of character block classification first divides the text region into subregions by horizontal projection and then merges character blocks inside multi-line subregions, handles not only horizontal, unskewed text but also text with a large skew angle and text with characters touching across lines.
Second, the proposed baseline and top line position estimation methods are highly accurate.
Third, the proposed connected character segment segmentation method handles adjacent connected character segments that overlap horizontally.
Fourth, the invention finds cut points on the baseline, above the baseline, and below the baseline, can cut touching additional parts, and uses structural rules to delete the over-segmentation points that frequently occur. The recall of character cut points is high and the number of spurious cut points is small.
Description of drawings
Fig. 1 Basic character set of standard Arabic
Fig. 2 Features of Arabic-script text
Fig. 3 Structure of a printed Arabic-script text recognition system
Fig. 4 Overall flow of the invention
Fig. 5 Text line segmentation flow
Fig. 6 Coordinate system and variable definitions
Fig. 7 Multi-line subregion detection
Fig. 8 Character segmentation flow
Fig. 9 Characteristics of cut points on the baseline
Fig. 10 Characteristics of cut points above the baseline
Fig. 11 Characteristics of cut points below the baseline
Fig. 12 Cutting of additional parts
Embodiment
In hardware, a printed Arabic-script text recognition system consists of two parts: an image acquisition device and a computer. The image acquisition device is generally a scanner, used to obtain a digital image of the text to be recognized. The computer processes the digital image and performs the final recognition of the text.
A sample page of Arabic-script text is first scanned into the computer to produce a digital image. Preprocessing such as binarization and noise removal is applied to obtain a binary image. The input image is then segmented into text lines, the baseline and top line positions of each text line are estimated, and each text line is segmented into connected character segments. On this basis, each connected character segment is segmented into single characters, which are then recognized. Errors at each stage can be corrected manually.
Therefore, to build a practical printed Arabic-script text recognition system, the text segmentation part must address four aspects: text line segmentation; baseline and top line position estimation; connected character segment segmentation; and character segmentation. These four aspects are described in detail below.
First, text line segmentation
1.1 Character block classification
According to height and width, all character blocks $\{C_{nB} \mid nB=1,2,\dots,nBlock\}$ on the text image $I$ are divided into three classes:
where
$$\bar h = \frac{1}{nBlock}\sum_{nB=1}^{nBlock} h_{nB}.$$
1.2 Subregion segmentation
Only MIDDLE-class character blocks are kept for horizontal projection, and the text region is divided into subregions at the positions where the horizontal projection value is 0. Each subregion contains one or more text lines.
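1.3 Multi-line subregion detection
For each subregion, find all pairs of MIDDLE-class character blocks that overlap horizontally but do not overlap vertically:
$$CC = \{(C_I, C_J) \mid XOL(C_I, C_J) > 0,\ YOL(C_I, C_J) = 0,\ C_I \in \mathrm{MIDDLE},\ C_J \in \mathrm{MIDDLE}\},$$
where
$$XOL(C_I, C_J) = \max(\min(r_I, r_J) - \max(l_I, l_J),\ 0),$$
$$YOL(C_I, C_J) = \max(\min(b_I, b_J) - \max(t_I, t_J),\ 0)$$
are the overlap lengths of $C_I$ and $C_J$ in the horizontal and vertical directions, respectively.
If $\|CC\| > 10$, the subregion is judged to be a multi-line subregion.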
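A minimal sketch of this test is given below. It assumes boxes are tuples `(l, t, r, b)` and counts each unordered pair once; the text does not say whether $(C_I, C_J)$ and $(C_J, C_I)$ are counted separately, so that choice is an assumption. The names `xol`, `yol` and `is_multiline` are illustrative only.

```python
from itertools import combinations


def xol(a, b):
    """Horizontal overlap length of two boxes (l, t, r, b)."""
    return max(min(a[2], b[2]) - max(a[0], b[0]), 0)


def yol(a, b):
    """Vertical overlap length of two boxes (l, t, r, b)."""
    return max(min(a[3], b[3]) - max(a[1], b[1]), 0)


def is_multiline(middle_blocks, threshold=10):
    """True if more than `threshold` pairs are stacked above one another."""
    pairs = [
        (a, b)
        for a, b in combinations(middle_blocks, 2)  # unordered pairs (assumption)
        if xol(a, b) > 0 and yol(a, b) == 0
    ]
    return len(pairs) > threshold
```

Blocks on the same text line rarely overlap horizontally, so a large number of such pairs indicates that blocks are stacked in several lines.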
1.4 Multi-line subregion segmentation
Sort all MIDDLE-class character blocks inside the multi-line subregion in descending order of right boundary. Take out one character block $C_n$ at a time in this order and compare it with all character blocks already taken out, $C_1, C_2, \ldots, C_{n-1}$, to find the character block $C_Y$ with the largest overlap in the vertical direction, i.e.
$$C_Y = \arg\max_{C_i} YOL(C_n, C_i), \quad i = 1, 2, \ldots, n-1.$$
If the vertical overlap between $C_n$ and $C_Y$ is large enough, i.e.
$$YOL(C_n, C_Y) > \bar{h}/2,$$
then $C_n$ and $C_Y$ belong to the same text line; otherwise $C_n$ belongs to a new text line. Once all character blocks have been taken out, the text line segmentation of the multi-line subregion is complete. Finally, all text lines of all subregions are arranged in order from top to bottom.
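The greedy grouping of step 1.4 might look like the sketch below. It assumes boxes `(l, t, r, b)` and takes the average block height as a parameter `mean_height`; the function names are hypothetical, and ordering the resulting lines by their topmost block is one possible reading of "from top to bottom".

```python
def yol(a, b):
    """Vertical overlap length of two boxes (l, t, r, b)."""
    return max(min(a[3], b[3]) - max(a[1], b[1]), 0)


def group_into_lines(middle_blocks, mean_height):
    """Assign MIDDLE-class blocks of one multi-line subregion to text lines."""
    # Arabic runs right to left, so blocks are visited by descending right edge.
    ordered = sorted(middle_blocks, key=lambda box: box[2], reverse=True)
    lines = []      # each text line is a list of boxes
    line_of = []    # line index of ordered[i]
    for n, box in enumerate(ordered):
        best, best_overlap = None, 0
        for i in range(n):                 # blocks already taken out
            overlap = yol(box, ordered[i])
            if overlap > best_overlap:
                best, best_overlap = i, overlap
        if best is not None and best_overlap > mean_height / 2:
            line_of.append(line_of[best])  # join the line of C_Y
            lines[line_of[best]].append(box)
        else:
            line_of.append(len(lines))     # start a new text line
            lines.append([box])
    lines.sort(key=lambda line: min(box[1] for box in line))
    return lines
```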
1.5 Segmenting BIG-class character blocks
For each character block $C_B \in \mathrm{BIG}$, search each text line $L^{(nL)}$ ($nL = 1, 2, \ldots, nLine$) for the character block $C^{(nL)}$, i.e.
$$C^{(nL)} = \arg\min_{C_{NB}^{(nL)}} \left| \frac{l_B + r_B}{2} - \frac{l_{NB}^{(nL)} + r_{NB}^{(nL)}}{2} \right|, \quad C_{NB}^{(nL)} \in NB^{(nL)},$$
where
$$NB^{(nL)} = \{ C_{NB}^{(nL)} \mid C_{NB}^{(nL)} \in L^{(nL)},\ C_{NB}^{(nL)} \in \mathrm{MIDDLE},\ HDIS(C_{NB}^{(nL)}, C_B) < \bar{h} \times 5,\ YOL(C_{NB}^{(nL)}, C_B) > \bar{h}/3 \},$$
$$HDIS(C_{NB}^{(nL)}, C_B) = \max(\max(l_{NB}^{(nL)}, l_B) - \min(r_{NB}^{(nL)}, r_B),\ 0).$$
If $C^{(m)}$ and $C^{(m+1)}$ both exist, $C_B$ spans text lines $m$ and $m+1$ and must be cut at
$$y = \frac{b^{(m)} + t^{(m+1)}}{2}.$$
$C_B$ is cut in this way at every position where it spans two text lines, which splits it into several parts. Each part is marked as MIDDLE class and assigned to the corresponding text line.
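The cutting rule can be illustrated with the short sketch below. It assumes the neighbour $C^{(nL)}$ of each text line has already been found and is passed in as a list ordered from top to bottom, with `None` where no such neighbour exists. The function name and the use of integer division for the cut row are assumptions of this sketch.

```python
def cut_big_block(big, nearest_per_line):
    """Split a BIG-class box at the midpoints between consecutive text lines.

    big: (l, t, r, b) box of the BIG-class block C_B.
    nearest_per_line: per text line (top to bottom), the box C^(nL) or None.
    Returns the list of boxes that C_B is cut into.
    """
    l, t, r, b = big
    cuts = []
    for upper, lower in zip(nearest_per_line, nearest_per_line[1:]):
        if upper is not None and lower is not None:
            # y = (b^(m) + t^(m+1)) / 2, rounded down to a pixel row.
            cuts.append((upper[3] + lower[1]) // 2)
    edges = [t] + cuts + [b]
    return [(l, top, r, bottom) for top, bottom in zip(edges, edges[1:])]
```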
1.6 Inserting SMALL-class character blocks
For each character block $C_S \in \mathrm{SMALL}$, compute the distance between $C_S$ and every text line, and assign $C_S$ to the nearest text line.
The distance between $C_S$ and text line $L^{(nL)}$ is computed as follows. Find the character blocks $C_l$ and $C_r$ that satisfy
$$C_l = \arg\min_{C_{LB}^{(nL)}} \left( l_S - l_{LB}^{(nL)} \right), \quad C_{LB}^{(nL)} \in LB^{(nL)},$$
$$C_r = \arg\min_{C_{RB}^{(nL)}} \left( r_{RB}^{(nL)} - r_S \right), \quad C_{RB}^{(nL)} \in RB^{(nL)},$$
where
$$LB^{(nL)} = \{ C_{LB}^{(nL)} \mid C_{LB}^{(nL)} \in L^{(nL)},\ C_{LB}^{(nL)} \in \mathrm{MIDDLE},\ l_S > l_{LB}^{(nL)} \},$$
$$RB^{(nL)} = \{ C_{RB}^{(nL)} \mid C_{RB}^{(nL)} \in L^{(nL)},\ C_{RB}^{(nL)} \in \mathrm{MIDDLE},\ r_S < r_{RB}^{(nL)} \}.$$
The distance between $C_S$ and text line $L^{(nL)}$ is then
Figure A20051008647800205
2. Baseline and top-line estimation
Let $L$ denote the text line image currently being processed, $H_L$ the height of $L$, and $W_L$ the width of $L$.
2.1 Baseline height estimation
Let $VRUN = \{(y_{nR}^{(s)}, y_{nR}^{(e)}, x_{nR}^{(se)}) \mid nR = 1, 2, \ldots, nRun\}$ be the set of all vertical runs of black pixels in $L$, and let
$$VH(runH) = \left\| \{ (y_{nR}^{(s)}, y_{nR}^{(e)}, x_{nR}^{(se)}) \mid (y_{nR}^{(s)}, y_{nR}^{(e)}, x_{nR}^{(se)}) \in VRUN,\ y_{nR}^{(e)} - y_{nR}^{(s)} = runH \} \right\|, \quad runH = 1, 2, \ldots, H_L,$$
be the number of runs of height $runH$. The height of the baseline is then
$$H_0 = \arg\max_{runH} VH(runH), \quad runH = 1, 2, \ldots, H_L.$$
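The run-length histogram of step 2.1 can be sketched as follows. The sketch assumes the text line is a list of pixel rows with 1 for black and 0 for white, and measures a run by its pixel count, which matches $y^{(e)} - y^{(s)}$ when the end row is taken as exclusive. Breaking ties in favour of the shorter run is also an assumption; the text does not address ties.

```python
from collections import Counter


def baseline_height(image):
    """Return H0, the most frequent length of vertical black runs.

    image: non-empty list of rows, each a list of 0/1 values (1 = black).
    Raises ValueError if the image contains no black pixel.
    """
    height, width = len(image), len(image[0])
    counts = Counter()
    for x in range(width):
        run = 0
        for y in range(height):
            if image[y][x]:
                run += 1
            elif run:
                counts[run] += 1   # a run just ended above this white pixel
                run = 0
        if run:
            counts[run] += 1       # run touching the bottom edge
    # Highest count wins; on ties the shorter run is chosen (assumption).
    return max(counts, key=lambda length: (counts[length], -length))
```

The idea behind the estimate is that most columns of a printed Arabic line cross only the baseline stroke, so the stroke thickness dominates the histogram.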
2.2 Baseline position estimation
The text line $L$ is divided into
Figure A20051008647800209
parts, each of length $15 \times H_0$. Let $HP^{(nP)}(y)$ ($y = 0, 1, \ldots, H_L - 1$) be the horizontal projection of $L$ when only the image of part $nP$ is kept. The baseline position of part $nP$ is then
$$B_{Top}^{(nP)} = \arg\max_{y} \sum_{k=0}^{H_0 - 1} HP^{(nP)}(y + k), \quad y = 0, 1, \ldots, H_L - H_0,$$
$$B_{Btm}^{(nP)} = B_{Top}^{(nP)} + H_0 - 1,$$
where $B_{Top}^{(nP)}$ and $B_{Btm}^{(nP)}$ denote the upper and lower boundaries of the baseline, respectively. The baseline position at any point of $L$ follows as
$$\begin{cases} B_{Top}(x) = B_{Top}^{(nP)}, & l^{(nP)} \le x < r^{(nP)}, \\ B_{Btm}(x) = B_{Btm}^{(nP)}, & l^{(nP)} \le x < r^{(nP)}, \end{cases}$$
where $l^{(nP)}$ and $r^{(nP)}$ are the left and right boundaries of part $nP$, respectively.
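For a single part, the search for the band of height $H_0$ with the largest projection sum can be sketched with a sliding window. The input is assumed to be the horizontal projection of that part as a plain list; the function name is hypothetical, and the first maximum is returned when several rows tie.

```python
def baseline_position(projection, h0):
    """Return (B_Top, B_Btm) for one part of a text line.

    projection: HP(y) for y = 0 .. H_L - 1, as a list of numbers.
    h0: baseline height H0, with 1 <= h0 <= len(projection).
    """
    window = sum(projection[:h0])          # band starting at y = 0
    best_y, best_sum = 0, window
    for y in range(1, len(projection) - h0 + 1):
        # Slide the band down by one row: add the new row, drop the old one.
        window += projection[y + h0 - 1] - projection[y - 1]
        if window > best_sum:
            best_y, best_sum = y, window
    return best_y, best_y + h0 - 1
```

Estimating the baseline per part rather than once per line lets the baseline follow a slightly skewed or curved text line.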
2.3 Top-line position estimation
Let $U_L(x)$ ($x = 0, 1, \ldots, W_L - 1$) be the upper contour of text line $L$ after the additional parts have been removed. Then
$$E_L = \{ x \mid U_L(x) < U_L(x-1),\ U_L(x) < U_L(x+1),\ 1 \le x < W_L - 1 \}$$
is the set of local minima of $U_L(x)$.
Define
$$BU(x) = \begin{cases} B_{Top}(x) - U_L(x), & x \in E_L, \\ 0, & x \notin E_L. \end{cases}$$
Then $h' = \frac{1}{\|E_L\|} \sum_{x \in E_L} BU(x)$. Letting $E' = \{ x \mid BU(x) < h',\ x \in E_L \}$ gives $H_1 = \frac{1}{\|E'\|} \sum_{x \in E'} BU(x)$.
$H_1$ is the distance between the top line and the upper boundary of the baseline.
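The two-pass averaging of step 2.3 is sketched below. The upper contour and the baseline upper boundary are assumed to be lists indexed by column, with row numbers growing downward. Returning `None` when there is no local minimum and falling back to $h'$ when $E'$ is empty are assumptions of this sketch; the text does not cover those cases.

```python
def top_line_distance(upper, b_top):
    """Return H1, the distance between the top line and the baseline top.

    upper: U_L(x), row of the topmost black pixel in each column.
    b_top: B_Top(x), upper boundary of the baseline in each column.
    """
    minima = [
        x for x in range(1, len(upper) - 1)
        if upper[x] < upper[x - 1] and upper[x] < upper[x + 1]
    ]
    if not minima:
        return None                        # no peak to measure (assumption)
    bu = {x: b_top[x] - upper[x] for x in minima}
    h_prime = sum(bu.values()) / len(minima)
    # Second pass: keep only the peaks lower than the first average.
    kept = [bu[x] for x in minima if bu[x] < h_prime]
    if not kept:
        return h_prime                     # all peaks equal (assumption)
    return sum(kept) / len(kept)
```

Discarding the peaks at or above the first average removes tall ascenders, so that $H_1$ reflects the height of ordinary character bodies.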
3. Connected character segment segmentation
Each connected character segment consists of one main part and several additional parts. A MIDDLE-class character block that is crossed by the baseline, or a SMALL-class character block that is crossed by the baseline and does not overlap any other character block in the horizontal direction, is the main part of a connected character segment. The remaining SMALL-class character blocks belong to the additional parts of connected character segments.
For each SMALL-class character block $C_S$ that belongs to an additional part, search for the nearest character block $C_M$ that belongs to a main part; $C_S$ and $C_M$ are then an additional part and the main part of the same connected character segment. The distance between $C_S$ and $C_M$ is defined as
$$DIS(C_S, C_M) = |l_S + r_S - l_M - r_M|.$$
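The distance above equals twice the horizontal distance between the centres of the two boxes, so a nearest-neighbour search over it is straightforward. The sketch below assumes boxes `(l, t, r, b)`; the names `dis` and `nearest_main` are illustrative.

```python
def dis(part, main):
    """DIS = |l_S + r_S - l_M - r_M| for two boxes (l, t, r, b)."""
    return abs(part[0] + part[2] - main[0] - main[2])


def nearest_main(part, mains):
    """Return the main-part box closest to an additional part."""
    return min(mains, key=lambda main: dis(part, main))
```

For example, a dot with box `(12, 0, 16, 3)` placed above three main parts is attached to the one whose horizontal centre is closest to column 14.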
4. Character segmentation
Let $P$ denote the connected character segment image currently being processed, $H_P$ the height of $P$, and $W_P$ the width of $P$.
4.1 Candidate cut points on the baseline
Define
$$D(x) = \max(B_{Top}(x) - U_P(x),\ 0) + \max(V_P(x) - B_{Btm}(x),\ 0), \quad x = 0, 1, \ldots, W_P - 1,$$
where $U_P(x)$ and $V_P(x)$ are the upper contour and the lower contour of $P$ after the additional parts have been removed.
A candidate cut point $x_O$ on the baseline satisfies (1) and (2), or (1) and (3). In the formulas below, the "+" and "-" operations correspond to the left cut point and the right cut point of a character, respectively:
(1) The cut point $x_O$ lies on the baseline:
$$D(x_O) \le 2;$$
(2) Cut-point case (A) or (B):
$$D(x_O \pm 1) - D(x_O) > 1.5 \times H_0;$$
(3) Cut-point case (C):
$$\begin{cases} D(x_O \pm i) - D(x_O) > 0.75 \times H_0, & i = e, \\ D(x_O \pm i) \ge D(x_O \pm i \mp 1), & i = 1, 2, \ldots, e, \end{cases}$$
where $e$ satisfies $x_O \pm e \in E_P$, and $E_P$ is the set of local minima of $U_P(x)$.
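The function $D(x)$ and the abrupt-change test, conditions (1) and (2), can be sketched as below; the gradual case (3) is left out for brevity. Contours and baseline boundaries are assumed to be lists indexed by column, and the function names and the `"left"`/`"right"` labels are illustrative only.

```python
def baseline_distance(upper, lower, b_top, b_btm):
    """D(x): how far the contours leave the baseline band in each column."""
    return [
        max(b_top[x] - upper[x], 0) + max(lower[x] - b_btm[x], 0)
        for x in range(len(upper))
    ]


def abrupt_cut_points(d, h0):
    """Candidate cut points satisfying conditions (1) and (2).

    Returns (x, side) pairs; "left" uses the "+" form of condition (2)
    and "right" uses the "-" form.
    """
    found = []
    for x in range(len(d)):
        if d[x] > 2:
            continue                       # condition (1): stay on the baseline
        if x + 1 < len(d) and d[x + 1] - d[x] > 1.5 * h0:
            found.append((x, "left"))      # stroke rises to the right of x
        if x - 1 >= 0 and d[x - 1] - d[x] > 1.5 * h0:
            found.append((x, "right"))     # stroke rose to the left of x
    return found
```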
4.2 Candidate cut points above the baseline
A cut point $x_T$ above the baseline must satisfy the following conditions:
(1) The cut point itself is a local lowest point of the upper contour:
$$U_P(x_T) > U_P(x_T - 1), \quad U_P(x_T) > U_P(x_T + 1);$$
(2) The upper contour at the cut point lies above the top line:
$$U_P(x_T) < B_{Top}(x_T) - H_1;$$
(3) The lowest point of the lower contour to the left of the cut point lies below the baseline:
$$\max_{x < x_T} V_P(x) > B_{Btm}(x_T);$$
(4) The lowest point of the lower contour to the right of the cut point lies below the baseline:
$$\max_{x_T < x < W_P - 1} V_P(x) > B_{Btm}(x_T).$$
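The four conditions translate directly into a column scan, sketched below. Row numbers are assumed to grow downward, so a "lowest" contour point has the largest row value. The function name is hypothetical, and columns with no contour values on one side are simply rejected.

```python
def cut_points_above(upper, lower, b_top, b_btm, h1):
    """Columns that satisfy conditions (1) to (4) of step 4.2.

    upper, lower: U_P(x) and V_P(x) per column.
    b_top, b_btm: baseline boundaries per column.
    h1: distance between the top line and the baseline upper boundary.
    """
    found = []
    width = len(upper)
    for x in range(1, width - 1):
        is_valley = upper[x] > upper[x - 1] and upper[x] > upper[x + 1]
        above_top_line = upper[x] < b_top[x] - h1
        left = lower[:x]                   # columns with x' < x_T
        right = lower[x + 1:width - 1]     # columns with x_T < x' < W_P - 1
        if (is_valley and above_top_line
                and left and max(left) > b_btm[x]
                and right and max(right) > b_btm[x]):
            found.append(x)
    return found
```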
4.3 Candidate cut points below the baseline
A cut point $x_B$ below the baseline must satisfy the following conditions:
(1) The highest point $TL$ of the lower contour in the 3 columns to the left of the cut point lies above the baseline:
$$\left( TL = \min_{x_B - 3 < x < x_B} V_P(x) \right) < B_{Top}(x_B);$$
(2) The highest point $TR$ of the lower contour in the 3 columns consisting of the cut point itself and the columns to its right lies below the baseline:
$$\left( TR = \min_{x_B \le x < x_B + 3} V_P(x) \right) > B_{Btm}(x_B);$$
(3) The height difference between $TL$ and $TR$ is greater than twice the baseline height:
$$TR - TL > 2 \times H_0.$$
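A sketch of this test follows. It takes the index ranges literally from the formulas: $x_B - 3 < x < x_B$ on the left and $x_B \le x < x_B + 3$ on the right. Columns too close to either edge for these windows are skipped, which is an assumption, as is the function name.

```python
def cut_points_below(lower, b_top, b_btm, h0):
    """Columns that satisfy conditions (1) to (3) of step 4.3.

    lower: V_P(x) per column (row numbers grow downward).
    b_top, b_btm: baseline boundaries per column.
    """
    found = []
    for x in range(2, len(lower) - 2):
        tl = min(lower[x - 2:x])       # columns k with x - 3 < k < x
        tr = min(lower[x:x + 3])       # columns k with x <= k < x + 3
        if tl < b_top[x] and tr > b_btm[x] and tr - tl > 2 * h0:
            found.append(x)
    return found
```

The test looks for a step in the lower contour: to the left of the cut point the contour is still above the baseline, and from the cut point onward it drops well below it.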
4.4 Checking the validity of cut points
The validity of each cut point is checked in turn from right to left, and invalid cut points are deleted. The method is as follows: let $x_r$ and $x_l$ be two adjacent cut points with $x_r > x_l$. If
$$\begin{cases} x_r - x_l \le \dfrac{H_0}{2}, \\ \max\limits_{x_l < x < x_r} V_P(x) - \min\limits_{x_l < x < x_r} U_P(x) \le H_0 \times 2, \end{cases}$$
then the cut point $x_l$ is deleted.
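The pruning pass can be sketched as below, under three assumptions that the text leaves open: both conditions of the brace must hold for a deletion, the right neighbour $x_r$ is the most recently kept cut point, and the height condition counts as satisfied when no column lies strictly between the two cut points. The function name is hypothetical.

```python
def prune_cut_points(cuts, upper, lower, h0):
    """Delete cut points that would leave a tiny fragment, right to left.

    cuts: candidate cut columns; upper, lower: U_P(x) and V_P(x) per column.
    Returns the surviving cut points in ascending order.
    """
    kept = []
    for x_l in sorted(cuts, reverse=True):
        if kept:
            x_r = kept[-1]                     # nearest surviving right neighbour
            inside = range(x_l + 1, x_r)
            narrow = x_r - x_l <= h0 / 2
            if inside:
                flat = (max(lower[x] for x in inside)
                        - min(upper[x] for x in inside)) <= 2 * h0
            else:
                flat = True                    # nothing between them (assumption)
            if narrow and flat:
                continue                       # delete x_l
        kept.append(x_l)
    return kept[::-1]
```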
4.5 Assigning additional parts
If an additional part overlaps the main parts of several characters in the horizontal direction, the additional part is a touching one. If such an additional part spans two adjacent cut points $x_R$ and $x_L$, it is cut between $x_R$ and $x_L$ at the column where its vertical projection value is smallest.
Finally, each additional part is assigned to the nearest character main part. The distance between an additional part $C_A$ and a character main part $C_M$ is computed as
$$DIS(C_A, C_M) = |l_A + r_A - l_M - r_M|.$$
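Choosing the column at which a touching additional part is cut reduces to an arg-min over its vertical projection. The sketch below assumes the projection is a list indexed by column and searches strictly between the two cut points; the function name and the choice of the leftmost column on ties are assumptions.

```python
def cut_column(vertical_projection, x_l, x_r):
    """Column between x_l and x_r where the additional part is thinnest.

    vertical_projection: black pixel count of the additional part per column.
    Requires at least one column strictly between x_l and x_r.
    """
    columns = range(x_l + 1, x_r)
    return min(columns, key=lambda x: vertical_projection[x])
```

After the cut, each resulting piece can be assigned to its nearest main part with the distance $DIS$ given above.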

Claims (1)

1. A method for segmenting printed Arabic character set text, characterized in that: first, on the basis of character block classification, a subset of the character blocks is selected and projected horizontally to divide the input text image into subregions, and multi-line subregions are then detected and their character blocks merged to achieve text line segmentation of the multi-line subregions; the baseline and top-line positions of each text line are then estimated; each text line is subsequently cut into connected character segments; finally, according to the characteristics of character cut points, the cut points on the baseline, above the baseline and below the baseline are sought; in a system composed of an image capture device and a computer, the method comprises the following steps in order:
Step 1, text line segmentation
Let $I$ denote the input text image, $H$ the height of $I$ and $W$ the width of $I$; the left boundary, top boundary, right boundary, bottom boundary, width and height of a character block $C$ are denoted $l$, $t$, $r$, $b$, $w$ and $h$ respectively; $C$ with various superscripts and subscripts denotes a specific character block (for example $C_{NB}$, $C^{(line)}$), and $l$, $t$, $r$, $b$, $w$ and $h$ with the same superscripts and subscripts denote the left boundary, top boundary, right boundary, bottom boundary, width and height of that character block (for example $h_{NB}$, $r^{(line)}$);
Step 1.1, character block classification
According to their height and width, all character blocks $\{C_{nB} \mid nB = 1, 2, \ldots, nBlock\}$ on the input text image $I$ are divided into three classes:
Figure A2005100864780002C1
where
$$\bar{h} = \frac{1}{nBlock} \sum_{nB=1}^{nBlock} h_{nB}$$
is the average height of all character blocks;
Step 1.2, subregion segmentation
The MIDDLE-class character blocks are projected horizontally, and the text region is split into subregions at the rows where the horizontal projection value is 0; each subregion contains one or more text lines;
Step 1.3, multi-line subregion detection
For each subregion, find all pairs of MIDDLE-class character blocks that overlap horizontally but do not overlap vertically:
$$CC = \{(C_I, C_J) \mid XOL(C_I, C_J) > 0,\ YOL(C_I, C_J) = 0,\ C_I \in \mathrm{MIDDLE},\ C_J \in \mathrm{MIDDLE}\},$$
where
$$XOL(C_I, C_J) = \max(\min(r_I, r_J) - \max(l_I, l_J),\ 0),$$
$XOL(C_I, C_J)$ denotes the overlap length of character blocks $C_I$ and $C_J$ in the horizontal direction,
$$YOL(C_I, C_J) = \max(\min(b_I, b_J) - \max(t_I, t_J),\ 0),$$
$YOL(C_I, C_J)$ denotes the overlap length of character blocks $C_I$ and $C_J$ in the vertical direction;
If $\|CC\| > 10$, the subregion is judged to be a multi-line subregion;
Step 1.4, multi-line subregion segmentation
Sort all MIDDLE-class character blocks inside the multi-line subregion in descending order of right boundary, take out one character block $C_n$ at a time in this order and compare it with all character blocks already taken out, $C_1, C_2, \ldots, C_{n-1}$, to find the character block $C_Y$ with the largest overlap in the vertical direction, i.e.
$$C_Y = \arg\max_{C_i} YOL(C_n, C_i), \quad i = 1, 2, \ldots, n-1,$$
If the vertical overlap between $C_n$ and $C_Y$ is large enough, i.e.
$$YOL(C_n, C_Y) > \bar{h}/2,$$
then $C_n$ and $C_Y$ belong to the same text line, otherwise $C_n$ belongs to a new text line; once all character blocks have been taken out, the text line segmentation of the multi-line subregion is complete;
Step 1.5, segmenting BIG-class character blocks
For each character block $C_B \in \mathrm{BIG}$, search each text line $L^{(nL)}$ ($nL = 1, 2, \ldots, nLine$) for the character block $C^{(nL)}$, i.e.
$$C^{(nL)} = \arg\min_{C_{NB}^{(nL)}} \left| \frac{l_B + r_B}{2} - \frac{l_{NB}^{(nL)} + r_{NB}^{(nL)}}{2} \right|, \quad C_{NB}^{(nL)} \in NB^{(nL)},$$
where
$$NB^{(nL)} = \{ C_{NB}^{(nL)} \mid C_{NB}^{(nL)} \in L^{(nL)},\ C_{NB}^{(nL)} \in \mathrm{MIDDLE},\ HDIS(C_{NB}^{(nL)}, C_B) < \bar{h} \times 5,\ YOL(C_{NB}^{(nL)}, C_B) > \bar{h}/3 \},$$
$NB^{(nL)}$ denotes all MIDDLE-class character blocks in text line $nL$ that are near $C_B$ and overlap $C_B$ vertically,
$$HDIS(C_{NB}^{(nL)}, C_B) = \max(\max(l_{NB}^{(nL)}, l_B) - \min(r_{NB}^{(nL)}, r_B),\ 0),$$
$HDIS(C_{NB}^{(nL)}, C_B)$ denotes the horizontal distance between $C_{NB}^{(nL)}$ and $C_B$, so that $C^{(nL)}$ is the character block in $NB^{(nL)}$ that is horizontally nearest to $C_B$;
If $C^{(m)}$ and $C^{(m+1)}$ both exist, then text lines $m$ and $m+1$ both contain character blocks near $C_B$ that overlap $C_B$ vertically, that is, $C_B$ spans text lines $m$ and $m+1$ and must be cut at
$$y = \frac{b^{(m)} + t^{(m+1)}}{2},$$
$C_B$ is cut in this way at every position where it spans two text lines, which splits it into several parts; each part is marked as MIDDLE class and assigned to the corresponding text line;
Step 1.6, inserting SMALL-class character blocks
For each character block $C_S \in \mathrm{SMALL}$, compute the distance between $C_S$ and every text line, and assign $C_S$ to the nearest text line:
The distance between $C_S$ and text line $L^{(nL)}$ is computed as follows: find the character blocks $C_l$ and $C_r$ that satisfy
$$C_l = \arg\min_{C_{LB}^{(nL)}} \left( l_S - l_{LB}^{(nL)} \right), \quad C_{LB}^{(nL)} \in LB^{(nL)},$$
$$C_r = \arg\min_{C_{RB}^{(nL)}} \left( r_{RB}^{(nL)} - r_S \right), \quad C_{RB}^{(nL)} \in RB^{(nL)},$$
where
$$LB^{(nL)} = \{ C_{LB}^{(nL)} \mid C_{LB}^{(nL)} \in L^{(nL)},\ C_{LB}^{(nL)} \in \mathrm{MIDDLE},\ l_S > l_{LB}^{(nL)} \},$$
$$RB^{(nL)} = \{ C_{RB}^{(nL)} \mid C_{RB}^{(nL)} \in L^{(nL)},\ C_{RB}^{(nL)} \in \mathrm{MIDDLE},\ r_S < r_{RB}^{(nL)} \},$$
The distance between $C_S$ and text line $L^{(nL)}$ is then
Figure A2005100864780004C5
Step 2, baseline and top-line estimation
Let $L$ denote the text line image currently being processed, $H_L$ the height of $L$, and $W_L$ the width of $L$;
Step 2.1, baseline height estimation
Let $VRUN = \{(y_{nR}^{(s)}, y_{nR}^{(e)}, x_{nR}^{(se)}) \mid nR = 1, 2, \ldots, nRun\}$ be the set of all vertical runs of black pixels in $L$, then
$$VH(runH) = \left\| \{ (y_{nR}^{(s)}, y_{nR}^{(e)}, x_{nR}^{(se)}) \mid (y_{nR}^{(s)}, y_{nR}^{(e)}, x_{nR}^{(se)}) \in VRUN,\ y_{nR}^{(e)} - y_{nR}^{(s)} = runH \} \right\|, \quad runH = 1, 2, \ldots, H_L,$$
The height of the baseline is thus
$$H_0 = \arg\max_{runH} VH(runH), \quad runH = 1, 2, \ldots, H_L;$$
Step 2.2, baseline position estimation
The text line $L$ is divided into
Figure A2005100864780004C9
parts, each of length $\alpha \times H_0$, where $\alpha$ takes a value between 10 and 15; let $HP^{(nP)}(y)$ ($y = 0, 1, \ldots, H_L - 1$) be the horizontal projection of $L$ when only the image of part $nP$ is kept; the baseline position of part $nP$ is then
$$B_{Top}^{(nP)} = \arg\max_{y} \sum_{k=0}^{H_0 - 1} HP^{(nP)}(y + k), \quad y = 0, 1, \ldots, H_L - H_0,$$
$$B_{Btm}^{(nP)} = B_{Top}^{(nP)} + H_0 - 1,$$
where $B_{Top}^{(nP)}$ and $B_{Btm}^{(nP)}$ denote the upper and lower boundaries of the baseline, respectively; the baseline position at any point of $L$ follows as
$$\begin{cases} B_{Top}(x) = B_{Top}^{(nP)}, & l^{(nP)} \le x < r^{(nP)}, \\ B_{Btm}(x) = B_{Btm}^{(nP)}, & l^{(nP)} \le x < r^{(nP)}, \end{cases}$$
where $l^{(nP)}$ and $r^{(nP)}$ are the left and right boundaries of part $nP$, respectively;
Step 2.3, top-line position estimation
Let $U_L(x)$ ($x = 0, 1, \ldots, W_L - 1$) be the upper contour of text line $L$ after the additional parts have been removed, then
$$E_L = \{ x \mid U_L(x) < U_L(x-1),\ U_L(x) < U_L(x+1),\ 1 \le x < W_L - 1 \}$$
is the set of local minima of $U_L(x)$;
Define
$$BU(x) = \begin{cases} B_{Top}(x) - U_L(x), & x \in E_L, \\ 0, & x \notin E_L, \end{cases}$$
Then $h' = \frac{1}{\|E_L\|} \sum_{x \in E_L} BU(x)$; letting $E' = \{ x \mid BU(x) < h',\ x \in E_L \}$ gives $H_1 = \frac{1}{\|E'\|} \sum_{x \in E'} BU(x)$; $H_1$ is the distance between the top line and the upper boundary of the baseline;
Step 3, connected character segment segmentation
Each connected character segment consists of one main part and several additional parts; a MIDDLE-class character block that is crossed by the baseline, or a SMALL-class character block that is crossed by the baseline and does not overlap any other character block in the horizontal direction, is the main part of a connected character segment; the remaining SMALL-class character blocks belong to the additional parts of connected character segments;
For each SMALL-class character block $C_S$ that belongs to an additional part, search for the nearest character block $C_M$ that belongs to a main part; $C_S$ and $C_M$ are then an additional part and the main part of the same connected character segment; the distance between $C_S$ and $C_M$ is defined as
$$DIS(C_S, C_M) = |l_S + r_S - l_M - r_M|;$$
Step 4, character segmentation
Let $P$ denote the connected character segment image currently being processed, $H_P$ the height of $P$, and $W_P$ the width of $P$;
Step 4.1, candidate cut points on the baseline
A cut point on the baseline satisfies one of the following cases:
(A) The number of runs changes: at the left boundary of a character, the number of vertical runs increases from 1 to 2 or more; at the right boundary of a character, the number of vertical runs decreases from 2 or more to 1;
(B) The number of runs is unchanged, but the run length changes considerably: at the left boundary of a character, the upper contour and/or the lower contour moves away from the baseline region; at the right boundary of a character, the upper contour and/or the lower contour returns to the baseline region;
(C) The number of runs is unchanged and the upper contour position changes gradually, so that the accumulated change in contour position is large: at the left boundary of a character, the upper contour gradually moves away from the baseline region; at the right boundary of a character, the upper contour gradually returns to the baseline region;
Define the function $D(x)$ to describe the distance between the upper and lower contours of $P$ and the baseline,
$$D(x) = \max(B_{Top}(x) - U_P(x),\ 0) + \max(V_P(x) - B_{Btm}(x),\ 0), \quad x = 0, 1, \ldots, W_P - 1,$$
where $U_P(x)$ and $V_P(x)$ are the upper contour and the lower contour of $P$ after the additional parts have been removed;
A candidate cut point $x_O$ on the baseline satisfies conditions (1) and (2), or conditions (1) and (3), below; in the formulas of conditions (1), (2) and (3), the "+" and "-" operations correspond to the left cut point and the right cut point of a character, respectively:
(1) The cut point $x_O$ lies on the baseline:
$$D(x_O) \le 2;$$
(2) Cut-point case (A) or (B):
$$D(x_O \pm 1) - D(x_O) > 1.5 \times H_0;$$
(3) Cut-point case (C):
$$\begin{cases} D(x_O \pm i) - D(x_O) > 0.75 \times H_0, & i = e, \\ D(x_O \pm i) \ge D(x_O \pm i \mp 1), & i = 1, 2, \ldots, e, \end{cases}$$
where $e$ satisfies $x_O \pm e \in E_P$, and $E_P$ is the set of local minima of $U_P(x)$;
Step 4.2, candidate cut points above the baseline
A cut point $x_T$ above the baseline must satisfy the following conditions:
(1) The cut point itself is a local lowest point of the upper contour:
$$U_P(x_T) > U_P(x_T - 1), \quad U_P(x_T) > U_P(x_T + 1);$$
(2) The upper contour at the cut point lies above the top line:
$$U_P(x_T) < B_{Top}(x_T) - H_1;$$
(3) The lowest point of the lower contour to the left of the cut point lies below the baseline:
$$\max_{x < x_T} V_P(x) > B_{Btm}(x_T);$$
(4) The lowest point of the lower contour to the right of the cut point lies below the baseline:
$$\max_{x_T < x < W_P - 1} V_P(x) > B_{Btm}(x_T);$$
Step 4.3, candidate cut points below the baseline
A cut point $x_B$ below the baseline must satisfy the following conditions:
(1) The highest point $TL$ of the lower contour in the 3 columns to the left of the cut point lies above the baseline:
$$\left( TL = \min_{x_B - 3 < x < x_B} V_P(x) \right) < B_{Top}(x_B);$$
(2) The highest point $TR$ of the lower contour in the 3 columns consisting of the cut point itself and the columns to its right lies below the baseline:
$$\left( TR = \min_{x_B \le x < x_B + 3} V_P(x) \right) > B_{Btm}(x_B);$$
(3) The height difference between $TL$ and $TR$ is greater than twice the baseline height:
$$TR - TL > 2 \times H_0;$$
Step 4.4, checking the validity of cut points
The validity of each cut point is checked in turn from right to left, and invalid cut points are deleted, on the following basis: let $x_r$ and $x_l$ be two adjacent cut points with $x_r > x_l$; if
$$\begin{cases} x_r - x_l \le \dfrac{H_0}{2}, \\ \max\limits_{x_l < x < x_r} V_P(x) - \min\limits_{x_l < x < x_r} U_P(x) \le H_0 \times 2, \end{cases}$$
then the cut point $x_l$ is deleted;
Step 4.5, assigning additional parts
If an additional part overlaps the main parts of several characters in the horizontal direction, the additional part is a touching one; if such an additional part spans two adjacent cut points $x_R$ and $x_L$, it is cut between $x_R$ and $x_L$ at the column where its vertical projection value is smallest;
Finally, each additional part is assigned to the nearest character main part; the distance between an additional part $C_A$ and a character main part $C_M$ is computed as
$$DIS(C_A, C_M) = |l_A + r_A - l_M - r_M|.$$
CNB2005100864781A 2005-09-23 2005-09-23 Blocks letter Arabic character set text dividing method Expired - Fee Related CN1332348C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005100864781A CN1332348C (en) 2005-09-23 2005-09-23 Blocks letter Arabic character set text dividing method

Publications (2)

Publication Number Publication Date
CN1741035A true CN1741035A (en) 2006-03-01
CN1332348C CN1332348C (en) 2007-08-15

Country Status (1)

Country Link
CN (1) CN1332348C (en)


Also Published As

Publication number Publication date
CN1332348C (en) 2007-08-15


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070815

Termination date: 20180923