CN101901333A

CN101901333A - Method for segmenting word in text image and identification device using same

Info

Publication number: CN101901333A
Application number: CN 200910085536
Authority: CN
Inventors: 王琛; 刘正珍
Original assignee: Hanwang Technology Co Ltd
Current assignee: Hanwang Technology Co Ltd
Priority date: 2009-05-25
Filing date: 2009-05-25
Publication date: 2010-12-01
Anticipated expiration: 2029-05-25
Also published as: CN101901333B

Abstract

The invention provides a method for segmenting words in a text image and a recognition device using the method, and belongs to the field of image processing. The method comprises the following steps: a parameter analysis unit performs parameter analysis on the input character information; the character information is preprocessed according to the analysis; a character spacing array is computed; the character spacing array is smoothed by template convolution to obtain a smoothed array; a computing unit computes the difference between the spacing array and the smoothed array at each corresponding position, and blanks are judged by comparing the difference with a preset threshold; finally, the judged blanks are post-processed. The corresponding recognition device consists of a parameter analysis unit, a character recognition unit, a data transmission unit, a preprocessing unit, a computing unit, a comparison and judgment unit and a post-processing unit. The method uses local peaks of the spacing as the basis for selecting blanks, improves the accuracy of segmenting italic (slanted) characters by preprocessing the character regions, and makes it easier to choose a universal threshold when the character typesetting is complex.

Description

Method for segmenting words in a text image and recognition device using the method

Technical field

The invention belongs to the field of image processing and relates to a method for segmenting words in a text image.

Background technology

The general OCR workflow is layout analysis, line segmentation, character segmentation, single-character recognition and post-processing. If the text being recognized is in a language written in units of words, word segmentation is also needed after single-character recognition. Word segmentation is mainly judged from the spacing between characters: if a spacing is large, that position may be a blank, and the character after the blank is the first character of a word.

Generally, a threshold can be estimated from the statistics of the character spacing and used to judge whether a blank precedes a given character: if the spacing before a character is greater than the threshold, a blank is considered to precede it and the character is the starting character of a word. In practice, however, several problems can arise.

If a line of text contains an italic font, the regions occupied by the italic characters often overlap, so the gap between the regions of two characters is not the true gap, and it is hard to judge which gaps are blanks. For example, in "of flight" in the first line of Fig. 1, the blank between the two letters f is covered by the top of the first f and the bottom of the second f.

If the typesetting density of the characters is not uniform, a universal threshold is hard to choose and blanks are hard to judge. If one line contains both words in a larger font and words in a smaller font, and the inter-word blanks around the larger words differ greatly from those around the smaller words, confusion arises easily. In the text shown in Fig. 2, the smallest inter-word blank in the larger font on the left is 10 pixels, while the inter-word blanks in the smaller font on the right average 5 pixels, and many of the spacings between characters of the larger font on the left exceed 5 pixels. A single unified threshold would therefore give a wrong segmentation result.

Summary of the invention

The invention provides a method for segmenting words in a text image and a recognition device using the method. Local peaks of the character spacing sequence are taken as candidate blank positions, and the character regions are preprocessed, which improves the accuracy of segmenting italic fonts and makes it easier to choose a universal threshold when the character typesetting is complex.

The method for segmenting words in a text image according to the invention comprises the following steps:

Step (1): a parameter analysis unit performs parameter analysis on the character information of an input line of characters.

Step (2): the character information obtained by the analysis is transferred to a preprocessing unit, which performs preprocessing according to the analyzed character information.

Step (3): a computing unit calculates the character spacings from the preprocessed character information to form a character spacing array.

Step (4): the computing unit applies template-convolution-based smoothing filtering to the character spacing array to obtain a smoothed array.

Step (5): the computing unit calculates the difference between the character spacing array and the smoothed array at each corresponding position, and a comparison and judgment unit judges blanks by comparing the difference with a preset threshold.

Step (6): a post-processing unit post-processes the blanks obtained by the judgment.

Further, the input line of characters in step (1) is input after recognition by a character recognition system.

Further, the parameters comprise the mean character spacing, the mean character aspect ratio (width to height) and the mean character width.

Further, the preprocessing comprises adjusting the left and right boundaries of the character regions according to the character information.

Further, the character region is the smallest rectangular box that completely contains the character.

Further, the preprocessing comprises: for a character in an italic font, tightening its rectangular box by taking the region the character occupies in the middle space of the four-line, three-space grid as the new character region.

Further, the method of taking the region occupied in the middle space of the four-line, three-space grid comprises: first obtaining the top and bottom contours of the lowercase characters a, c, e, m, n, o, r, s, t, u, v, w, x and z, and then performing least-squares fitting on the top contour points and the bottom contour points of these characters to obtain the second and third of the four lines, which bound the middle space.

Further, the preprocessing comprises: for a narrow character, stretching its rectangular box by subtracting 1/3 of the mean character spacing from the left boundary of the box and adding 1/3 of the mean character spacing to the right boundary.

Further, a narrow character is a character whose aspect ratio is less than 1/3 of the mean aspect ratio.

Further, the character spacing in step (3) equals the distance between the left boundary of the current character region and the right boundary of the previous character region.

Further, the template is obtained empirically; a preferred template is (0.25, 0.5, 0.25).

Further, if the difference in step (5) is greater than the threshold, the position corresponding to the difference is judged to be a blank.

Further, the post-processing comprises the following steps:

Step 61: the post-processing unit calculates the mean of the differences in the difference array at all blank positions; if the difference at a blank position is less than 2/3 of the mean, that position is considered not to be a blank.

Step 62: the post-processing unit calculates the mean of the spacings in the spacing array at all blank positions; if the spacing at a blank position is less than 2/3 of the mean, that position is considered not to be a blank and the process returns to step 61; if the spacings at all blank positions are greater than or equal to 2/3 of the mean, the segmentation result is returned.

The device for text image recognition comprises a character recognition unit, which recognizes the characters in the text image and outputs the recognized characters, and further comprises:

a parameter analysis unit, which performs parameter analysis on the character information of a line of characters input from the character recognition unit;

a data transmission unit, which transfers the analyzed character information to a preprocessing unit;

a preprocessing unit, which performs preprocessing according to the analyzed character information;

a computing unit, which calculates the character spacings from the preprocessed character information to form a character spacing array, applies template-convolution-based smoothing filtering to the character spacing array to obtain a smoothed array, and calculates the difference between the character spacing array and the smoothed array at each corresponding position;

a comparison and judgment unit, which judges blanks by comparing the difference with a preset threshold;

a post-processing unit, which post-processes the blanks obtained by the judgment.

Compared with the prior art, the word segmentation method of the invention has the following advantages:

1. In the preprocessing of the invention, the region an italic character occupies in the middle space of the four-line, three-space grid is taken as its new character region. This prevents the top or bottom of the italic character from covering the character spacing and effectively solves the problem caused by italic fonts.

2. The invention determines blanks from local peaks of the character spacing, which handles well both overly dense typesetting within a line and multiple fonts within a line.

Description of drawings

Fig. 1 is a text image in which a line contains an italic font;

Fig. 2 is a text image in which a line contains text of different font sizes;

Fig. 3 is the binary image of the text line used in the embodiment of the invention;

Fig. 4 is a flowchart of the method for segmenting words in an image according to the invention;

Fig. 5 is a flowchart of the preprocessing of the character information in the embodiment of the invention;

Fig. 6 is a flowchart of the post-processing of the segmentation result in the embodiment of the invention;

Fig. 7a is a diagram of the four-line, three-space grid of the text region in the embodiment of the invention;

Fig. 7b shows the case in which the top or bottom of an italic character covers a blank in the embodiment of the invention.

Embodiment

For a clearer understanding of the technical content of the invention, the following embodiment is described in detail.

The method is called segmenting words in a text image in order to distinguish it from word segmentation (tokenization) in the field of natural language understanding. The method is applicable to languages written in units of words, such as English and German.

The embodiment of the invention performs word segmentation on the image shown in Fig. 3. The workflow of the embodiment is shown in Fig. 4, and its processing comprises the following steps:

Step 1: the parameter analysis unit performs parameter analysis on the character information of a line of characters input after recognition by the character recognition system. The parameters include the mean character spacing, the mean character aspect ratio and the mean character width. The character information includes the rectangular region in which each character lies and whether the character is in an italic font. For the text region shown in Fig. 3, for example, the analysis by the parameter analysis unit gives a mean character spacing of 3 pixels, a mean character aspect ratio of 0.73 and a mean character width of 16 pixels.
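The parameter analysis of step 1 can be sketched as follows. This is only an illustration under stated assumptions: the `CharBox` type, the `analyze_line` function and the pixel coordinate convention are hypothetical names introduced here, and the aspect ratio is assumed to be width divided by height.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class CharBox:
    """Hypothetical record for one recognized character (pixel coordinates)."""
    left: int
    top: int
    right: int
    bottom: int
    italic: bool = False

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top


def analyze_line(boxes):
    """Return the three line-level parameters named in step 1."""
    # Spacing between neighbours: left edge of current box minus right edge of previous box.
    pitches = [cur.left - prev.right for prev, cur in zip(boxes, boxes[1:])]
    return {
        "mean_pitch": mean(pitches) if pitches else 0.0,
        # Assumption: aspect ratio is width / height.
        "mean_aspect": mean(b.width / b.height for b in boxes),
        "mean_width": mean(b.width for b in boxes),
    }


if __name__ == "__main__":
    line = [CharBox(0, 0, 10, 20), CharBox(12, 0, 22, 20), CharBox(26, 0, 36, 20)]
    print(analyze_line(line))
```

The returned values feed the later steps: the mean spacing and mean aspect ratio drive the preprocessing, and the mean width is used to derive the threshold.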

Step 2: the character information obtained by the analysis is transferred to the preprocessing unit, which performs preprocessing according to it. The preprocessing comprises adjusting the left and right boundaries of the rectangular region occupied by each character according to the analyzed character information.

Step 21: for a character in an italic font, tighten its character region. As shown in Fig. 7a, the region the character occupies in the middle space of the four-line, three-space grid is taken as the new character region. This prevents the top or bottom of a character from covering a blank, as with 'f' and 't' in Fig. 7b. The position of the middle space of the grid can be obtained as follows: first obtain the top and bottom contours of the lowercase letters a, c, e, m, n, o, r, s, t, u, v, w, x and z, then perform least-squares fitting on the top contour points and the bottom contour points of these characters to obtain the second and third of the four lines, which bound the middle space.
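A minimal sketch of step 21, assuming the contour points are available as `(x, y)` pixel pairs with y growing downward. The function names `fit_line` and `tighten_box` and the representation of a character as a list of ink pixels are assumptions made for illustration; the source does not specify them.

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("need at least two distinct x positions")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


def tighten_box(ink_pixels, top_line, bottom_line):
    """Horizontal extent of the ink that lies between the two fitted lines.

    top_line is the second grid line and bottom_line the third grid line,
    each given as (slope, intercept). Returns (min_x, max_x) or None.
    """
    (ta, tb), (ba, bb) = top_line, bottom_line
    xs = [x for x, y in ink_pixels if ta * x + tb <= y <= ba * x + bb]
    if not xs:
        return None
    return min(xs), max(xs)


if __name__ == "__main__":
    # Top and bottom contour points of letters such as a, c, e (made-up values).
    top = fit_line([(0, 10), (10, 10), (20, 10)])
    bottom = fit_line([(0, 20), (10, 21), (20, 22)])
    # Made-up ink pixels of an italic 'f': ascender to the right, descender to the left.
    italic_f = [(14, 2), (13, 5), (9, 12), (10, 15), (8, 18), (3, 26)]
    print(tighten_box(italic_f, top, bottom))
```

Only the part of the italic character inside the middle band contributes to the new left and right boundaries, so the overhanging ascender and descender no longer reach into the neighboring gaps.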

Step 22: for a narrow character, enlarge its character region. The gaps before and after a narrow character are relatively large and are easily mistaken for blanks between characters, which affects the word segmentation result. The character region of a narrow character is therefore enlarged moderately: its left boundary is extended to the left by 1/3 of the mean character spacing, and its right boundary is extended to the right by 1/3 of the mean character spacing. A narrow character is one whose aspect ratio is less than 1/3 of the mean aspect ratio, such as "l" and "i". This step addresses the problem that the spacing between a narrow character and its neighbors is larger than the spacing between ordinary characters because the narrow character itself is so narrow, and thus reduces the influence of narrow characters on the subsequent blank judgment.
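The widening rule of step 22 can be sketched as below. The function name `widen_narrow_boxes` and the tuple layout `(left, top, right, bottom)` are assumptions for illustration, and the aspect ratio is again assumed to be width divided by height; the example values 3 and 0.73 are the line parameters quoted for Fig. 3.

```python
def widen_narrow_boxes(boxes, mean_pitch, mean_aspect):
    """Extend narrow boxes by one third of the mean spacing on each side.

    boxes: list of (left, top, right, bottom) tuples in pixels.
    A box is narrow when its aspect ratio is below one third of the mean.
    """
    margin = mean_pitch / 3
    result = []
    for left, top, right, bottom in boxes:
        aspect = (right - left) / (bottom - top)
        if aspect < mean_aspect / 3:
            left, right = left - margin, right + margin
        result.append((left, top, right, bottom))
    return result


if __name__ == "__main__":
    # A normal character followed by a narrow one such as "l" (made-up boxes).
    boxes = [(0, 0, 16, 22), (19, 0, 22, 22)]
    print(widen_narrow_boxes(boxes, mean_pitch=3, mean_aspect=0.73))
```

In this made-up case the second box has an aspect ratio of about 0.14, below 0.73 / 3, so it grows by one pixel on each side while the first box is left unchanged.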

Steps 21 and 22 effectively handle both the covering of blanks by the tops or bottoms of italic characters and the larger gaps that narrow characters leave to their neighbors. As shown in Fig. 7b, the rectangular region occupied by the letter "f" is the smallest rectangular box that completely contains the letter. In what follows, the rectangular region occupied by a character is also called its rectangular box. The flowchart in Fig. 5 shows the preprocessing, which can comprise these two steps.

Step 3: the computing unit calculates the character spacing array from the preprocessed character information. A section of memory is set aside in the hardware device that runs the computation, and four storage units are defined in it: a first, a second, a third and a fourth storage unit. The character spacing equals the left boundary of the rectangular region of the current character minus the right boundary of the rectangular region of the previous character, and the character spacing array is stored in the first storage unit. In this embodiment, the character spacing array is the first row of Table 1.
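As a small sketch of step 3, the spacing array can be built from the left and right boundaries of the preprocessed boxes. The name `pitch_array` and the convention of storing 0 for the first character, which has no predecessor, are assumptions made here.

```python
def pitch_array(boxes):
    """Spacing before each character: current left minus previous right.

    boxes: list of (left, right) boundaries in reading order.
    The first character has no predecessor; 0 is stored for it (assumption).
    """
    if not boxes:
        return []
    pitches = [0]
    for (_, prev_right), (cur_left, _) in zip(boxes, boxes[1:]):
        pitches.append(cur_left - prev_right)
    return pitches


if __name__ == "__main__":
    print(pitch_array([(0, 10), (12, 20), (29, 40)]))  # [0, 2, 9]
```

With this layout, entry i of the array describes the gap in front of character i, so a flagged entry later marks the first character of a word.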

Step 4: the computing unit smooths the character spacing array by template convolution to obtain the smoothed array. The template is obtained empirically; the preferred template is (0.25, 0.5, 0.25). The smoothed array is stored in the second storage unit. In this embodiment, the smoothed array is the second row of Table 1.
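The smoothing of step 4 can be sketched as a one-dimensional convolution with the preferred template. The source does not state how the ends of the array are handled or how results are rounded, so the edge replication used here and the unrounded floating-point output are assumptions; the values produced by this sketch therefore need not coincide with the second row of Table 1.

```python
def smooth(values, template=(0.25, 0.5, 0.25)):
    """Convolve values with a symmetric template, replicating the end values."""
    half = len(template) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, weight in enumerate(template):
            # Clamp the neighbour index so the ends reuse the first/last value.
            j = min(max(i + k - half, 0), n - 1)
            acc += weight * values[j]
        out.append(acc)
    return out


if __name__ == "__main__":
    print(smooth([0, 2, 3, 4]))  # [0.5, 1.75, 3.0, 3.75]
```

Because the template is symmetric, convolution and correlation give the same result, so the kernel does not need to be flipped.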

Step 5: the computing unit reads the spacing array and the smoothed array, calculates their difference at each corresponding position, and stores the difference array in the third storage unit. The difference between the spacing array and the smoothed array reflects how much the spacing array varies locally, and a position where the spacing array has a local maximum may be a blank.

The comparison and judgment unit compares the difference with a preset threshold to judge blanks: if the difference is greater than the threshold, the corresponding position is judged to be a blank. The resulting blank judgment array is stored in the fourth storage unit.

In this embodiment, the difference array is the third row of Table 1. The blank judgment array is the first row of Table 2, where 1 denotes a blank and 0 denotes a non-blank. In this embodiment, the threshold is the mean character width divided by 15, which is 1. Positions where the value in the difference array is greater than the threshold 1 are local peaks.

Table 1. Pixel statistics held in each storage unit during word segmentation

First storage unit:   0 2 3 4 2 9 3 4 2 8 5 4 7 3 0 2 3 8 4 7 3 2 4 2 12 3 0 5 2 3 10 0 1 2 3 8 4 4 3 1 1 3 4 3 9 3 8 2
Second storage unit:  1 2 3 3 2 6 3 3 2 5 4 3 5 2 1 2 3 5 3 5 3 2 3 2 7 2 1 3 2 3 6 1 1 2 3 5 3 3 2 1 1 2 3 3 6 3 5 2
Third storage unit:   1 0 0 1 0 3 0 1 0 3 1 1 2 1 -1 0 0 3 1 2 0 0 1 0 5 1 -1 2 0 0 4 -1 0 0 0 3 1 1 1 0 0 1 1 0 3 0 3 0

Table 2. Blank flags during word segmentation

Fourth storage unit:       000001000100100001010000100100100001000000001010
Post-processing 61:        000001000100100001010000100100100001000000001010
Post-processing 62:        000001000100100001010000100000100001000000001010
Word segmentation result:  000001000100100001010000100000100001000000001010
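The blank judgment of step 5 can be checked against the tabulated data with a short sketch. The arrays below are transcribed from the first and second storage unit rows of Table 1; splitting the run of digits into 48 entries, with the two-digit values 12 and 10, is an assumption of this sketch, as are the names `PITCH`, `SMOOTH` and `mark_blanks`.

```python
# Spacing array (first storage unit) and smoothed array (second storage unit).
PITCH = [0, 2, 3, 4, 2, 9, 3, 4, 2, 8, 5, 4, 7, 3, 0, 2, 3, 8, 4, 7, 3, 2, 4, 2,
         12, 3, 0, 5, 2, 3, 10, 0, 1, 2, 3, 8, 4, 4, 3, 1, 1, 3, 4, 3, 9, 3, 8, 2]
SMOOTH = [int(c) for c in "123326332543521235353232721323611235332112336352"]


def mark_blanks(pitch, smooth, threshold):
    """Flag a position as a blank when spacing minus smoothed value exceeds the threshold."""
    diff = [p - s for p, s in zip(pitch, smooth)]
    flags = [1 if d > threshold else 0 for d in diff]
    return diff, flags


# Threshold of the embodiment: mean character width (16 px) divided by 15, taken as 1.
threshold = 16 // 15
diff, flags = mark_blanks(PITCH, SMOOTH, threshold)

if __name__ == "__main__":
    print("".join(map(str, flags)))
    # 000001000100100001010000100100100001000000001010
```

The printed flags agree with the fourth storage unit row of Table 2: eleven positions exceed the threshold and are marked as candidate blanks.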

The judgment in step 5 is based on local peaks, and a local peak does not mean that the spacing at that position is a peak within the whole line of character spacings. If a gap is itself not large but the gaps on both sides of it are very small, the position becomes a local peak, yet such a local peak does not represent a real blank, so the judgment must be processed further. In this embodiment, the local peaks are all positions equal to 1 in the blank judgment row of Table 2, because the differences at these positions are all greater than the threshold 1. Some of these local peaks, however, may not represent real blanks. Take the spacing value 5 at the 27th position of the spacing array (counting from zero): this position is not a real blank, but the spacing before it is 0 and the spacing after it is 2, so the position comes out as a local peak. The blanks obtained by the judgment therefore need further processing to remove the suspicious ones.

Step 6: the post-processing unit post-processes the blanks obtained by the judgment, using information from the whole line to remove suspicious blanks. As shown in the flowchart in Fig. 6, the post-processing comprises the following steps:

Step 61: the post-processing unit calculates the mean of the differences in the difference array at all blank positions; if the difference at a blank position is less than 2/3 of the mean, that position is considered not to be a blank. The reasoning is that information from the whole line is comparatively reliable, and the variation at the local peak of a suspicious blank is smaller than 2/3 of the overall variation at local peaks. In this embodiment, the mean of the difference array over all blank positions is 3 and the differences at all blank positions are greater than or equal to 2/3 of that mean, so no suspicious blank is detected in this step. The result is shown in the post-processing 61 row of Table 2.

Step 62: the post-processing unit calculates the mean of the spacings in the spacing array at all blank positions; if the spacing at a blank position is less than 2/3 of the mean, that position is considered not to be a blank and the process returns to step 61; if the spacings at all blank positions are greater than or equal to 2/3 of the mean, the segmentation result is returned. Like step 61, step 62 uses the reliable whole-line information to judge whether a blank is suspicious. In this embodiment, the mean of the spacing array over all blank positions is 9, and the spacing at the position of the 7th blank is 5, which is less than 2/3 of that mean, so this position is considered not to be a blank. The result is shown in the post-processing 62 row of Table 2.

After returning to step 61, step 62 is run again; when neither step finds a suspicious blank, the final word segmentation result is output. The final result in this embodiment is: scale does not start at zero. (Chart first published by R.
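The loop formed by steps 61 and 62 can be sketched as follows, using the spacing and difference rows of Table 1 as transcribed into 48 entries (an assumption of this sketch). The source does not say how the means are rounded; here each mean is rounded to the nearest integer, which is an assumption chosen because it reproduces the rows of Table 2. With an unrounded mean, the second pass of step 61 would also reject the two blanks whose difference is 2. The names `rounded_mean`, `post_process` and `split_words` are hypothetical.

```python
PITCH = [0, 2, 3, 4, 2, 9, 3, 4, 2, 8, 5, 4, 7, 3, 0, 2, 3, 8, 4, 7, 3, 2, 4, 2,
         12, 3, 0, 5, 2, 3, 10, 0, 1, 2, 3, 8, 4, 4, 3, 1, 1, 3, 4, 3, 9, 3, 8, 2]
DIFF = [1, 0, 0, 1, 0, 3, 0, 1, 0, 3, 1, 1, 2, 1, -1, 0, 0, 3, 1, 2, 0, 0, 1, 0,
        5, 1, -1, 2, 0, 0, 4, -1, 0, 0, 0, 3, 1, 1, 1, 0, 0, 1, 1, 0, 3, 0, 3, 0]
FLAGS = [1 if d > 1 else 0 for d in DIFF]          # step 5 with threshold 1
CHARS = "scaledoesnotstartatzero.(ChartfirstpublishedbyR."


def rounded_mean(values):
    """Mean rounded half up to an integer (the rounding rule is an assumption)."""
    n = len(values)
    return (2 * sum(values) + n) // (2 * n)


def post_process(diff, pitch, flags):
    """Repeat steps 61 and 62 until step 62 rejects nothing."""
    flags = list(flags)
    while True:
        # Step 61: drop blanks whose difference is below 2/3 of the mean difference.
        blanks = [i for i, f in enumerate(flags) if f]
        if not blanks:
            return flags
        m = rounded_mean([diff[i] for i in blanks])
        for i in blanks:
            if 3 * diff[i] < 2 * m:
                flags[i] = 0
        # Step 62: drop blanks whose spacing is below 2/3 of the mean spacing.
        blanks = [i for i, f in enumerate(flags) if f]
        if not blanks:
            return flags
        m = rounded_mean([pitch[i] for i in blanks])
        suspicious = [i for i in blanks if 3 * pitch[i] < 2 * m]
        if not suspicious:
            return flags
        for i in suspicious:
            flags[i] = 0                              # then go back to step 61


def split_words(chars, flags):
    """Insert a space in front of every character flagged as following a blank."""
    return "".join((" " if f else "") + c for c, f in zip(chars, flags))


final_flags = post_process(DIFF, PITCH, FLAGS)

if __name__ == "__main__":
    print("".join(map(str, final_flags)))
    print(split_words(CHARS, final_flags))
```

The comparisons are written as `3 * value < 2 * mean` so that the 2/3 test is done in exact integer arithmetic. The loop always terminates because every pass that does not return removes at least one blank.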

The invention also relates to a device for text image recognition, which comprises a character recognition unit, a parameter analysis unit, a data transmission unit, a preprocessing unit, a computing unit, a comparison and judgment unit and a post-processing unit.

The character recognition unit recognizes the characters in the text image and outputs the recognized characters line by line.

The parameter analysis unit performs parameter analysis on the character information of a line of characters input from the character recognition unit. The line of characters analyzed by the parameter analysis unit is input after recognition by the character recognition system, and the parameters include the mean character spacing, the mean character aspect ratio and the mean character width.

The data transmission unit transfers the analyzed character information to the preprocessing unit.

The preprocessing unit performs preprocessing according to the analyzed character information. The preprocessing comprises adjusting the left and right boundaries of the rectangular region occupied by each character according to the character information, the rectangular region being the smallest rectangular box that completely contains the character.

The computing unit calculates the character spacing array from the preprocessed character information, smooths the character spacing array by template convolution to obtain the smoothed array, and calculates the difference between the spacing array and the smoothed array at each corresponding position.

The comparison and judgment unit compares the difference with a preset threshold to judge blanks.

The preprocessing can comprise: for a character in an italic font, tightening its rectangular box by taking the region the character occupies in the middle space of the four-line, three-space grid as the new character region, so that the top or bottom of the character does not cover a blank. The position of the middle space of the grid can be obtained as follows: first obtain the top and bottom contours of the lowercase letters a, c, e, m, n, o, r, s, t, u, v, w, x and z, then perform least-squares fitting on the top contour points and the bottom contour points of these characters to obtain the second and third of the four lines, which bound the middle space. The preprocessing can also comprise: for a narrow character, stretching its rectangular box by subtracting 1/3 of the mean character spacing from the left boundary of the box and adding 1/3 of the mean character spacing to the right boundary, which reduces the influence of narrow characters on the subsequent blank judgment. A narrow character is one whose aspect ratio is less than 1/3 of the mean aspect ratio.

The character spacing equals the left boundary of the rectangular region of the current character minus the right boundary of the rectangular region of the previous character, and the resulting character spacing array is stored in the first storage unit. In this embodiment, the character spacing array is the first row of Table 1.

The template is obtained empirically; the preferred template is (0.25, 0.5, 0.25). The resulting smoothed array is stored in the second storage unit. In this embodiment, the smoothed array is the second row of Table 1.

If the comparison and judgment unit finds that the difference is greater than the threshold, the position corresponding to the difference is judged to be a blank, and the resulting blank judgment array is stored in the fourth storage unit.

The post-processing of the post-processing unit can comprise the following steps:

1. The post-processing unit calculates the mean of the differences in the difference array at all blank positions; if the difference at a blank position is less than 2/3 of the mean, that position is considered not to be a blank;

2. The post-processing unit calculates the mean of the spacings in the spacing array at all blank positions; if the spacing at a blank position is less than 2/3 of the mean, that position is considered not to be a blank and the process returns to the previous step; if the spacings at all blank positions are greater than or equal to 2/3 of the mean, the segmentation result is returned.

Although an embodiment of the invention has been shown and described, those skilled in the art will appreciate that the embodiment can be changed without departing from the spirit and principle of the invention; the scope of the invention is defined by the claims and their equivalents.

Claims

1. A method for segmenting words in a text image, characterized by comprising:

Step (1): a parameter analysis unit performs parameter analysis on the character information of an input line of characters;

Step (2): the character information obtained by the analysis is transferred to a preprocessing unit, which performs preprocessing according to the analyzed character information;

Step (3): a computing unit calculates the character spacings from the preprocessed character information to form a character spacing array;

Step (4): the computing unit applies template-convolution-based smoothing filtering to the character spacing array to obtain a smoothed array;

Step (5): the computing unit calculates the difference between the character spacing array and the smoothed array at each corresponding position, and a comparison and judgment unit judges blanks by comparing the difference with a preset threshold;

2. The method for segmenting words according to claim 1, characterized in that the input line of characters in step (1) is input after recognition by a character recognition system.

3. The method for segmenting words according to claim 1, characterized in that the parameters comprise the mean character spacing, the mean character aspect ratio and the mean character width.

4. The method for segmenting words according to claim 1, characterized in that the preprocessing comprises adjusting the left and right boundaries of the character regions according to the character information.

5. The method for segmenting words according to claim 4, characterized in that the character region is the smallest rectangular box that completely contains the character.

6. The method for segmenting words according to claim 5, characterized in that the preprocessing comprises: for a character in an italic font, tightening its rectangular box by taking the region the character occupies in the middle space of the four-line, three-space grid as the new character region.

7. The method for segmenting words according to claim 6, characterized in that the method of taking the region occupied in the middle space of the four-line, three-space grid comprises: first obtaining the top and bottom contours of the lowercase characters a, c, e, m, n, o, r, s, t, u, v, w, x and z, and then performing least-squares fitting on the top contour points and the bottom contour points of these characters to obtain the second and third of the four lines, which bound the middle space.

8. The method for segmenting words according to claim 5 or 6, characterized in that the preprocessing comprises: for a narrow character, stretching its rectangular box by subtracting 1/3 of the mean character spacing from the left boundary of the box and adding 1/3 of the mean character spacing to the right boundary.

9. The method for segmenting words according to claim 8, characterized in that a narrow character is a character whose aspect ratio is less than 1/3 of the mean aspect ratio.

10. The method for segmenting words according to claim 1, characterized in that the character spacing in step (3) equals the distance between the left boundary of the current character region and the right boundary of the previous character region.

11. The method for segmenting words according to claim 1, characterized in that the template is obtained empirically, and a preferred template is (0.25, 0.5, 0.25).

12. The method for segmenting words according to claim 1, characterized in that, if the difference in step (5) is greater than the threshold, the position corresponding to the difference is judged to be a blank.

13. The method for segmenting words according to claim 1, characterized in that

the post-processing comprises the following steps:

Step 61: the post-processing unit calculates the mean of the differences in the difference array at all blank positions; if the difference at a blank position is less than 2/3 of the mean, that position is considered not to be a blank;

14. A device for text image recognition, comprising:

a character recognition unit, which recognizes the characters in the text image and outputs the recognized characters; characterized in that the device further comprises: