Background technology
As the single-character recognition accuracy of OCR (Optical Character Recognition) has improved, character segmentation has become the key issue in the OCR field, and most of the recent progress in text recognition can be credited to better segmentation. The practicality of text recognition is nevertheless still limited by character segmentation technology: segmentation accuracy directly determines recognition accuracy, and a segmentation error directly causes a recognition error.
The purpose of character segmentation is to cut a multi-character image into a series of sub-images, each containing exactly one complete character. Commonly used character segmentation methods are: standard segmentation, recognition-based segmentation, holistic segmentation, and combinations of these three.
Holistic segmentation is mainly used for English text. It recognizes a word as a whole, which avoids the problem of segmenting inside the word, but it depends on a predefined dictionary, which greatly limits its range of application.
Standard segmentation is mainly used for Chinese text. It analyzes the image to find reasonable cut points between characters, using static projection analysis to perform row segmentation and column segmentation on the text image. The method is implemented as follows:
Grayscale image data of the document is obtained with a digital imaging device such as a scanner. For documents that have been stored for a long time, soiled documents, or copies made with increased density, the scanned grayscale data contains a lot of extra noise, which tends to reduce segmentation accuracy, as shown in Figure 1. A global or local thresholding method, such as Otsu's method, the iterative method, or the bimodal method, can be used to binarize the grayscale data. Figure 2 shows the result of applying Otsu's method to the image in Figure 1. As can be seen, the binarized image still contains a lot of noise, such as the long line segment marked 201 and the small connected region marked 202, so noise filtering can optionally be applied at this point.
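As a rough illustration of the global thresholding step, the following sketch implements Otsu's method in plain Python. The function names `otsu_threshold` and `binarize`, and the convention that pixels at or below the threshold are foreground (dark text on a light page), are assumptions made for this example and are not specified in the text.

```python
def otsu_threshold(gray):
    """Return the gray level t that maximizes the between-class variance.

    gray: 2-D list of integer gray levels in 0..255 (assumed input format).
    """
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with level <= t
    sum0 = 0  # sum of the levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(gray, t):
    """Map dark pixels (level <= t) to 1 (foreground) and the rest to 0."""
    return [[1 if v <= t else 0 for v in row] for row in gray]
```

The resulting 0/1 image is the kind of input that the noise filtering and projection steps described in this document operate on.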
An image segmentation algorithm based on region growing can be used to filter noise. This method gathers pixels with similar properties into the same region to form a connected region; the similar properties include average gray value, texture, color, and so on. Starting from an initial region (such as a small neighborhood or even a single pixel), adjacent pixels or other regions that share the property are merged into the current region, so that the region grows step by step until no more points or small regions can be merged, at which point a connected region has been formed. All connected regions in the image are traversed, and the number of black pixels in each connected region is counted.
After the number of black pixels in each connected region has been counted, an empirical threshold ThresholdPixel is set. It can be chosen according to the noise intensity of the text image, or according to the font, font size, and page layout of the document. Every connected region with fewer than ThresholdPixel black pixels is regarded as noise and filtered out. ThresholdPixel must not be too large, otherwise components of many Chinese characters, such as the dots in the character for "filter", will be filtered out; nor may it be too small, otherwise some noise regions will remain.
For example, suppose the document layout is: A4 page size; FangSong (imitation Song) typeface; font size Small 3; 22 lines in the document; 28 characters per line (including punctuation marks). ThresholdPixel can then be set to 50, that is, every connected region with fewer than 50 black pixels is regarded as noise and filtered out, and every pixel in that region is set to 0. Figure 3 shows Figure 2 after noise removal. As can be seen, most of the small connected regions like the one marked 202 have been filtered out, but connected regions like the one marked 201 contain too many black pixels to be removed as noise.
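The size-based filter described above can be sketched as follows. This is a minimal example, not the exact procedure of the text: it grows regions on the already binarized image with a breadth-first flood fill, it assumes 8-connectivity, and the function name `remove_small_regions` is hypothetical. Only the parameter `threshold_pixel` corresponds to ThresholdPixel.

```python
from collections import deque


def remove_small_regions(binary, threshold_pixel=50):
    """Return a copy of `binary` with small connected regions set to 0.

    binary: 2-D list of 0/1 values, 1 = black (foreground) pixel.
    Regions with fewer than `threshold_pixel` black pixels are treated as noise.
    """
    h = len(binary)
    w = len(binary[0]) if h else 0
    out = [row[:] for row in binary]
    seen = [[False] * w for _ in range(h)]

    for i in range(h):
        for j in range(w):
            if binary[i][j] != 1 or seen[i][j]:
                continue
            # Grow one connected region from the seed pixel (i, j).
            region = [(i, j)]
            seen[i][j] = True
            queue = deque(region)
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            region.append((ny, nx))
                            queue.append((ny, nx))
            # Filter the region if it has too few black pixels.
            if len(region) < threshold_pixel:
                for y, x in region:
                    out[y][x] = 0
    return out
```

A long line such as the one marked 201 survives this filter because its pixel count exceeds the threshold, which matches the limitation noted above.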
Row segmentation and column segmentation are then performed on the binarized image. Figure 4 shows the result of applying the standard segmentation method to the text region of Figure 3. As can be seen, because of the strong noise, text segmented with the standard method may suffer from touching characters, that is, strokes of neighboring characters in a multi-character image being in contact with one another.
The recognition-based method adds feedback to standard and holistic segmentation. It generates several segmentation hypotheses and then selects among the segmentation results to obtain the optimal one. This method can judge whether a segmentation result is correct, but it cannot correct segmentation errors, so it does not effectively solve problems such as touching characters and broken strokes. It is also complex and time-consuming, and is seldom used in practice.
As can be seen, existing character segmentation techniques have the following shortcomings:
(1) The images of two or more Chinese characters easily stick together, because of image preprocessing or because the spacing between characters is too small, which leads to inaccurate segmentation and a low recognition rate.
In printed text images, poor print quality and the noise and errors introduced by binarization often produce touching or overlapping characters.
In addition, long storage and photocopying introduce extra noise, for example when the document is soiled, when readers add annotations, or when the copy density has been increased. Common noise removal algorithms can only handle small specks; they cannot handle long-line noise, which causes characters to touch and affects the recognition result.
(2) A Chinese character composed of radicals is easily split into several regions, and each radical is treated as a separate Chinese character, which leads to inaccurate merging and a low recognition rate. There are two reasons for this:
First, in a printed Chinese character composed of radicals, the radicals are separated by only a small gap or joined by only a few pixels. Each radical therefore tends to be handled as a self-contained sub-image, and after the scanned grayscale image is binarized, a radical is easily treated as a Chinese character in its own right.
Second, binarizing a grayscale document image often loses useful information and easily breaks strokes, so that a Chinese character composed of radicals is split into several regions. For example, after a printed document has been photocopied many times, the gray values of the character images become very light, and thin strokes often break in the middle.
(3) The correctness of the segmentation result depends too heavily on the character recognition feedback mechanism.
Summary of the invention
Embodiments of the present invention provide a character segmentation method and device for improving the accuracy of character segmentation.
An embodiment of the present invention provides a character segmentation method, comprising:
performing row segmentation and column segmentation on a text image to obtain a number of character unit image blocks;
identifying the character unit image blocks that contain touching characters, and further segmenting those character unit image blocks;
identifying the Chinese character unit image block regions and the English character unit image block regions, and identifying, within the Chinese character unit image block regions, the character unit image blocks occupied by radicals of Chinese characters;
merging the character unit image blocks occupied by adjacent radicals of a Chinese character into one character unit image block.
The touching characters include touching Chinese characters, and identifying a character unit image block that contains touching Chinese characters comprises:
when the width of a character unit image block is greater than the average width of the Chinese character unit image blocks, and the difference between the height of that block and the average height of the character unit image blocks is less than a preset threshold, determining that the block contains touching Chinese characters.
The touching characters include touching English characters, and identifying a character unit image block that contains touching English characters comprises:
when the width of a character unit image block is greater than the average width of the Chinese character unit image blocks, and the difference between the height of that block and the average height of the character unit image blocks is greater than the preset threshold, determining that the block contains touching English characters.
Identifying the character unit image blocks occupied by radicals of Chinese characters comprises:
when the height of a character unit image block is greater than the average height of the character unit image blocks and its width is greater than 4/5 of the average width of the Chinese character unit image blocks, determining that the block contains a Chinese character;
when the distance between that Chinese character unit image block and the preceding character unit image block lies outside the distance range between adjacent Chinese and English character unit image blocks, taking the preceding character as the current character;
when the distance between the centers of the current character unit image block and its preceding character unit image block lies outside the distance range between the centers of adjacent Chinese character unit image blocks, determining that the current character and the preceding character are radical characters.
Further, the character segmentation method also comprises identifying the character unit image blocks of punctuation marks.
Identifying the character unit image block of a punctuation mark comprises:
when the width of a character unit image block is less than or equal to its height, and the block lies entirely above or entirely below the center line of the text line, determining that the block contains a punctuation mark; or
when the height of a character unit image block is less than the height of the text line, its width is less than 1/4 of the average width of the Chinese character unit image blocks, and at least one of the distances between the block and its adjacent preceding or following character unit image block is greater than the upper limit of the distance range between adjacent Chinese and English character unit image blocks, determining that the block contains a punctuation mark.
An embodiment of the present invention provides a character segmentation device, comprising:
a preliminary segmentation unit, configured to perform row segmentation and column segmentation on a text image to obtain a number of character unit image blocks;
a touching character segmentation unit, configured to identify the character unit image blocks that contain touching characters and to further segment those character unit image blocks;
a radical identification unit, configured to identify the Chinese character unit image block regions and the English character unit image block regions, and to identify, within the Chinese character unit image block regions, the character unit image blocks occupied by radicals of Chinese characters;
a character merging unit, configured to merge the character unit image blocks occupied by adjacent radicals of a Chinese character into one character unit image block.
The touching character segmentation unit is specifically configured to determine that a character unit image block contains touching Chinese characters when the width of the block is greater than the average width of the Chinese character unit image blocks and the difference between the height of the block and the average height of the character unit image blocks is less than a preset threshold; or
to determine that a character unit image block contains touching English characters when the width of the block is greater than the average width of the Chinese character unit image blocks and the difference between the height of the block and the average height of the character unit image blocks is greater than the preset threshold.
The radical identification unit is specifically configured to determine that a character unit image block contains a Chinese character when the height of the block is greater than the average height of the character unit image blocks and its width is greater than 4/5 of the average width of the Chinese character unit image blocks;
to take the preceding character as the current character when the distance between that Chinese character unit image block and the preceding character unit image block lies outside the distance range between adjacent Chinese and English character unit image blocks;
and to determine that the current character and the preceding character are radical characters when the distance between the centers of the current character unit image block and its preceding character unit image block lies outside the distance range between the centers of adjacent Chinese character unit image blocks.
Further, the character segmentation device also comprises a punctuation identification unit, configured to determine that a character unit image block contains a punctuation mark when the width of the block is less than or equal to its height and the block lies entirely above or entirely below the center line of the text line; or
when the height of the block is less than the height of the text line, its width is less than 1/4 of the average width of the Chinese character unit image blocks, and at least one of the distances between the block and its adjacent preceding or following character unit image block is greater than the upper limit of the distance range between adjacent Chinese and English character unit image blocks.
With the above technical solution, the embodiment of the present invention performs row segmentation and column segmentation on a text image to obtain a number of character unit image blocks; identifies the character unit image blocks that contain touching characters and further segments them; identifies the Chinese character unit image block regions and the English character unit image block regions, and identifies, within the Chinese character unit image block regions, the character unit image blocks occupied by radicals of Chinese characters; and merges the character unit image blocks occupied by adjacent radicals of a Chinese character into one character unit image block. The method can identify both the character unit image blocks that contain touching characters and those that contain radicals, so that the segmentation result no longer depends heavily on the character recognition feedback mechanism, which further improves the character recognition rate.
Embodiment
Embodiments of the present invention provide a character segmentation method and a corresponding device. To address the low character recognition rate caused by segmentation errors in the prior-art methods, the following technical solution is proposed and is described in detail below with reference to the accompanying drawings and specific embodiments:
A first embodiment of the present invention provides a character segmentation method. As shown in Figure 5, it is implemented as follows:
S100: Perform row segmentation and column segmentation on the text image to obtain a number of character unit image blocks. This process is described in detail with reference to Figure 6:
S101: Perform row segmentation on the binarized text image.
Obtain the binary text bitmap image to be segmented; the text region is nWidth pixels wide and nHeight pixels high. Define a function f(i, j) that gives the value of the pixel at row i, column j of the image: f(i, j) is 1 when the pixel is a foreground point and 0 when it is a background point.
To segment the row regions of the text and remove noise distributed in row-like patterns, scan the text image from top to bottom and compute, for each horizontal scan line, the sum $S_n$ of the pixel values of its foreground points, where $S_n = S_1 + S_2 + \dots + S_i + \dots$ ($i = 0, 1, 2, \dots, \mathrm{nWidth}$). Set a threshold $N_1$. If $S_n \ge N_1$, the scan line is part of the text; if $S_n < N_1$, the scan line is noise or blank. This removes row-shaped noise and gives a preliminary segmentation of the text rows. For the text region shown in Figure 7, the result after row segmentation is shown in Figure 8. At the same time, record the boundary position of each row, namely the coordinates of its upper-left and lower-right points and the position of the center line MiddleLine between its two horizontal boundary lines, and compute the height of each text line.
The following points should be noted when setting $N_1$:
(1) If the text image contains little noise, $N_1$ can be set fairly small without noticeably affecting row segmentation. For example, $N_1$ can be set to 10.
(2) If the text image contains a lot of noise, as shown in Figure 9, $N_1$ can be set larger. If $N_1$ is too small, strong noise is not eliminated and the segmented text line regions are inaccurate, as shown in Figure 10. $N_1$ must therefore be set larger to solve this problem; for example, it can be set to 60, which gives the result shown in Figure 11.
(3) A large $N_1$ can affect text lines that contain few characters. If a text line has few characters, some of its horizontal scan lines contain few foreground points, so the computed $S_n$ is small; if $N_1$ is large, $S_n < N_1$ may hold and part of the foreground of that text line is treated as noise or blank. As shown in Figure 12, the last line contains only the single Chinese character "war", and this line is wrongly split into two or more lines. There are two ways to solve this problem. The first requires manual intervention, for example adjusting $N_1$ by hand according to how soiled the text image is. The second is to segment with a relatively large threshold, analyze the resulting line spacings and line heights to find abnormal data, and try merging text line boundaries according to the abnormal data. If a merge does not produce new abnormal data, the text line boundaries corresponding to that abnormal data are merged; otherwise the merge is abandoned. This removes most of the noise and also effectively eliminates the abnormal data in the line height sequence and the line spacing sequence.
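The horizontal projection step of S101 can be sketched as follows. This is a minimal illustration that assumes the image is a 2-D list of 0/1 values as defined by f(i, j); the function name `segment_rows` is hypothetical, and the boundary-merging heuristic described in point (3) is not implemented here.

```python
def segment_rows(binary, n1):
    """Split a binary text image into text lines by horizontal projection.

    binary: 2-D list of 0/1 values (1 = foreground).
    n1: threshold N1; a scan line with sum >= n1 counts as text.
    Returns a list of (top, bottom) row indices, both inclusive.
    """
    rows = []
    start = None
    for i, line in enumerate(binary):
        s_n = sum(line)  # foreground pixels on horizontal scan line i
        if s_n >= n1:
            if start is None:
                start = i  # a new text line begins here
        elif start is not None:
            rows.append((start, i - 1))  # noise or blank line ends the text line
            start = None
    if start is not None:
        rows.append((start, len(binary) - 1))
    return rows


def middle_line(top, bottom):
    """Center line position between the two horizontal boundary lines."""
    return (top + bottom) / 2
```

With a small n1 a sparse scan line still belongs to the text line, while a larger n1 splits the line at that scan line, which is the failure mode described in point (3).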
S102: On the basis of the row segmentation of the binarized text image, perform column segmentation.
To segment the column regions of the text and remove noise distributed in column-like patterns, scan the text image from left to right and compute, for each vertical scan line, the sum $R_n$ of the pixel values of its foreground points, where $R_n = R_1 + R_2 + \dots + R_j + \dots$ and $j$ ranges from the upper boundary to the lower boundary of the text line region. Set a threshold $N_2$. If $R_n \ge N_2$, the scan line is part of a character; if $R_n < N_2$, the scan line is noise or blank. This removes column-shaped noise. Because noise removal has already been applied to the binarized text image, ordinary small noise does not affect the segmentation of text columns, so $N_2$ can be set to 0.
In this way every character has a bounding rectangle: its upper and lower boundaries are those of the text line, and its left and right boundaries are the column cut points of the character.
S103: Obtain, for each character, the minimum bounding rectangle that contains all of its black pixels.
Because the heights of the characters differ, especially between Chinese and English characters, the bounding rectangle of each character is shrunk inward or expanded outward until it is the minimum bounding rectangle containing all black pixels of the character. This yields a sequence Ω of character unit image blocks, as shown in Figure 13.
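Steps S102 and S103 can be sketched together as follows. The function names `segment_columns` and `tighten` are hypothetical. One deliberate deviation is labeled in the code: the comparison is written as a strict `r_n > n2` so that the suggested value n2 = 0 still treats empty columns as gaps, whereas the text states the comparison as "greater than or equal".

```python
def tighten(binary, top, bottom, left, right):
    """Shrink a box vertically to the rows that contain black pixels (S103).

    Returns (top, bottom, left, right), all inclusive.
    """
    rows = [i for i in range(top, bottom + 1)
            if any(binary[i][left:right + 1])]
    return (rows[0], rows[-1], left, right)


def segment_columns(binary, top, bottom, n2=0):
    """Split one text line (rows top..bottom) into character unit boxes (S102).

    Assumption of this sketch: a column is part of a character when its
    foreground count is strictly greater than n2.
    """
    width = len(binary[0])
    counts = [sum(binary[i][j] for i in range(top, bottom + 1))
              for j in range(width)]
    boxes = []
    start = None
    # The trailing 0 is a sentinel that closes a run ending at the right edge.
    for j, r_n in enumerate(counts + [0]):
        if j < width and r_n > n2:
            if start is None:
                start = j
        elif start is not None:
            boxes.append(tighten(binary, top, bottom, start, j - 1))
            start = None
    return boxes
```

Because the columns at both ends of each run contain foreground pixels by construction, only the top and bottom edges need to be adjusted to reach the minimum bounding rectangle.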
S200: Statistically analyze the characteristics of all rectangular image blocks in the text image.
Based on the sequence Ω of character rectangular image blocks obtained in S103, the following characteristics of all rectangular image blocks in the text image are analyzed:
(1) The height and center line position of each text line, and the average line height
As shown in Figure 14, the height HL of a text line is the distance between the two horizontal lines that enclose the text; the center line MiddleLine of a text line is the position of the line midway between these two horizontal lines. The heights HL of all text lines are collected to compute the average line height HLAVE of the text region. All text lines segmented in S100 are traversed to compute the corresponding line height HL, center line position MiddleLine, and average line height HLAVE.
(2) The average height of the character unit image blocks
As shown in Figure 15, the height H of a character unit image block is the height of the minimum bounding rectangle of the character unit. The heights of the minimum bounding rectangles of all character units are collected to compute the average height HeightAve of the character unit image blocks.
(3) The average width of the character unit image blocks
As shown in Figure 16, the width Width (abbreviated W) of a character unit image block is the width of the minimum bounding rectangle of the character unit. Such a unit is not necessarily a valid character: it may be a radical of a Chinese character, or several characters stuck together. In Figure 16, for example, the block "newspaper" consists of two Chinese characters that touch and therefore form one unit, while the character "to" has been split into two radical units.
The width distribution of all character unit image blocks is collected as a histogram, similar to a gray-level histogram: the x axis is the block width and the y axis is the number of character unit image blocks with that width. Because Chinese characters are essentially square, the width of a Chinese character unit image block is not much greater than its height, so the upper limit of the x axis can be 1.5 times the average line height HLAVE of the text region.
In ordinary documents, Chinese characters do not stick to English letters or digits. The width distribution obtained above shows two adjacent clusters of width values: the cluster with larger widths is the width range of normal Chinese character unit image blocks, and the cluster with smaller widths is the width range of normal English or digit character unit image blocks. The distribution also contains some regions with still larger or still smaller widths. The larger widths are caused by character unit image blocks that contain touching characters; the block "newspaper", for example, contains two Chinese characters, so its width is large. The smaller widths may come from character unit image blocks that contain radicals of Chinese characters; the character "river", for example, is split into three units, so the width of each unit is small.
Within the width range of Chinese characters, the local peak is taken as the average width ChnWidth of the Chinese character unit image blocks; likewise, within the width range of English/digit characters, the local peak is taken as the average width EnWidth of the English/digit character unit image blocks.
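Picking the local peak of the width histogram inside a given interval can be sketched as follows. The text does not say how the Chinese and English/digit width intervals are delimited, so the bounds `low` and `high` are left to the caller; they and the function name `peak_width` are assumptions of this example.

```python
from collections import Counter


def peak_width(widths, low, high):
    """Return the most frequent width w with low <= w <= high, or None.

    widths: iterable of block widths in pixels.
    Ties are broken in favor of the smaller width (an arbitrary choice).
    """
    hist = Counter(w for w in widths if low <= w <= high)
    if not hist:
        return None
    return min(hist, key=lambda w: (-hist[w], w))
```

For example, calling `peak_width` once on the interval of the larger cluster and once on the interval of the smaller cluster would give candidate values for ChnWidth and EnWidth respectively.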
(4) The distance between the centers of adjacent character unit image blocks
As shown in Figure 17, the distance Wave between the centers of adjacent character unit image blocks is the distance between the centers of the minimum bounding rectangles of adjacent characters.
Let the x axis be the distance between the centers of adjacent character unit image blocks and the y axis the number of character unit image blocks with that distance. The resulting distribution shows two adjacent clusters of distance values: the cluster with larger distances corresponds to Chinese character unit image blocks, and the cluster with smaller distances corresponds to English/digit character unit image blocks.
The local peaks WaveCN and WaveEN are found in the Chinese cluster and the English/digit cluster respectively. From WaveCN and WaveEN, the distance range between the centers of adjacent Chinese character unit image blocks is set to [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], and the distance range between the centers of adjacent English/digit character unit image blocks is set to [(4*WaveEN-WaveCN)/3, (WaveCN+2*WaveEN)/3].
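The two ranges can be computed directly from the peaks. In the sketch below the function name `center_distance_ranges` is hypothetical; the formulas are the ones given above. Algebraically, each range is its peak plus or minus (WaveCN - WaveEN)/3, so the two ranges do not overlap as long as WaveCN is greater than WaveEN.

```python
def center_distance_ranges(wave_cn, wave_en):
    """Return ((cn_low, cn_high), (en_low, en_high)) for center distances."""
    cn_range = ((2 * wave_cn + wave_en) / 3, (4 * wave_cn - wave_en) / 3)
    en_range = ((4 * wave_en - wave_cn) / 3, (wave_cn + 2 * wave_en) / 3)
    return cn_range, en_range


def in_range(value, bounds):
    """True when value lies inside the closed interval bounds = (low, high)."""
    low, high = bounds
    return low <= value <= high
```

With WaveCN = 36 and WaveEN = 18, for instance, the Chinese range is [30, 42] and the English/digit range is [12, 24].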
(5) The distance between adjacent character unit image blocks
As shown in Figure 18, the distance Dis between character unit image blocks is defined for two adjacent blocks in the same text line as the distance from the right boundary of the preceding block to the left boundary of the following block. The distribution of the distances between all character unit image blocks shows one obvious cluster, which may contain both distances between adjacent Chinese character unit image blocks and distances between adjacent English/digit character unit image blocks; both kinds of distance are very small, and there is no clear dividing line between them. The distribution also shows a second cluster, which corresponds to the distances between adjacent Chinese and English/digit character unit image blocks; its local peak is taken as DisChnAndEn. The size of this cluster is not fixed and depends on how much Chinese and English are mixed in the document region. The distance range between adjacent Chinese and English character unit image blocks is set to [DisChnAndEn-Threshold, DisChnAndEn+Threshold], where Threshold is a given threshold that can be set according to actual conditions.
S300: Identify the character unit image blocks that contain touching characters, and further segment them.
If the width of a character unit image block is greater than the average width of the Chinese character unit image blocks, the block is determined to contain touching characters. By comparing the height of such a block with the average height HeightAve of the character unit image blocks, touching blocks are divided into touching Chinese character image blocks and touching English character image blocks. The identification and segmentation of each kind is described below.
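The decision rule of S300 can be written as a small classifier. This sketch makes two assumptions that the text leaves open: the "difference" in height is taken as an absolute difference, and a difference exactly equal to the threshold is treated as the English case. The function name, the parameter `height_threshold`, and the returned labels are hypothetical.

```python
def classify_block(width, height, chn_width, height_ave, height_threshold):
    """Classify one character unit image block.

    chn_width: average width ChnWidth of Chinese character unit image blocks.
    height_ave: average height HeightAve of character unit image blocks.
    """
    if width <= chn_width:
        return "normal"  # not wider than a normal Chinese character
    if abs(height - height_ave) < height_threshold:
        return "touching_chinese"  # height close to the average height
    return "touching_english"  # height deviates from the average height
```

Blocks labeled "touching_chinese" would then go through steps S301 to S307, and blocks labeled "touching_english" through the separate handling of touching English characters.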
If the difference between the height of a touching character unit image block and the average height of the character unit image blocks is less than a preset threshold, the touching characters are determined to be Chinese. In general, the vertical scan lines at the place where two Chinese characters touch contain the fewest foreground points, that is, they lie in a valley of the projection profile. The touching Chinese characters can therefore be segmented according to the number of foreground points on each vertical scan line of the block. This process is described in detail below with reference to Figure 19:
S301: If the difference between the height of the touching character unit image block and the average height of the character unit image blocks is less than the preset threshold, determine that the touching characters are Chinese.
S302: Denote the top, bottom, left, and right boundaries of the touching character unit image block by T, B, L, and R. Taking L to R as the horizontal axis and T to B as the vertical axis, count the black pixels on each vertical scan line of the block, then sort the horizontal coordinates in ascending order of their foreground point counts to obtain a sequence of positions.
The array sequence of S303, a sky of establishment
1, the horizontal ordinate of left margin L and right margin R is joined
1In, select first element among the , be inserted into according to the size order of position
1In.
S304, calculating
1In the distance between the position adjacent in twos, if distance is then carried out S306 all less than the mean breadth of character cell image block; Otherwise carry out S305.
Next element among S305, the selection is inserted into according to the position size order
1In, the process of repetition S304 is until
1In the distance between the position adjacent all till the mean breadth ChnWidth less than the character cell image block in twos.
S306, with
1In the position be cut-point, adhesion character cell image block is cut apart, thereby obtain the overlapping sub-character cell image block of a plurality of head and the tail, the boundary rectangle frame of each character is carried out toe-in or outwards expansion, make that rectangle frame is the minimum boundary rectangle that comprises all black picture elements of character.
S307, adhesion character cell image block is deleted from original sequence Ω, and all character cell image blocks that obtain among the S306 are inserted on the position identical among the Ω, thereby obtain a new character cell image block sequence Ω
1
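Steps S302 to S305 can be sketched as below. This is only an illustration under stated assumptions: the block is represented as a list of rows of 0/1 values (1 being a black pixel), the function name is hypothetical, and ties between scan lines with equal counts are broken by position, which the text does not specify. The tightening of the bounding rectangles in S306 is omitted.

```python
from bisect import insort


def find_cut_points(block, chn_width):
    """Return sorted cut positions (column indices), borders included.

    block     -- list of rows, each a list of 0/1 (1 = black pixel)
    chn_width -- average character width ChnWidth, in pixels
    """
    n_cols = len(block[0])
    left, right = 0, n_cols - 1

    # S302: number of black pixels on every vertical scan line.
    counts = [sum(row[x] for row in block) for x in range(n_cols)]
    # Columns in ascending order of count (tie-break by position: assumption).
    candidates = sorted(range(n_cols), key=lambda x: (counts[x], x))

    # S303: the cut-point sequence starts with the two borders.
    cuts = [left, right]
    for x in candidates:
        if x not in cuts:
            insort(cuts, x)  # keep the sequence ordered by position
        # S304: stop once every adjacent gap is below the average width.
        if all(b - a < chn_width for a, b in zip(cuts, cuts[1:])):
            break
    return cuts


# Two glyphs joined by a thin bridge at column 6.
demo = [
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1],
]
print(find_cut_points(demo, chn_width=8))  # [0, 6, 11]
```

In the demo, column 6 has the fewest black pixels, so it is inserted first; both resulting gaps (6 and 5) are already below ChnWidth = 8, and the loop stops.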
Figure 20 is an enlarged view of part of Figure 13, and Figure 21 shows Figure 20 after cutting according to S300.
If the difference between the height of the adhesion character cell image block and the average height of the character cell image blocks is greater than the preset threshold, the adhesion character cell image block is determined to be an English/numeric character cell image block. For adhesion between English characters, two cases need to be considered:
First case: the adjacent character images do not actually touch, but no white vertical line can separate them, so they appear as an adhesion character. In this case a boundary-following algorithm can be used to find the individual connected regions, which cuts the adhesion character.
Second case: the adjacent character images actually adhere. The character contours can be used to search for all possible cut points and generate a series of cutting paths; the optimal cutting path is then selected according to an English cutting evaluation, and the adhesion character is cut along it.
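The first case can be illustrated with a short sketch. Note that it does not implement boundary following: it substitutes a breadth-first flood fill, which finds the same connected regions and is simpler to show. The block representation (rows of 0/1 values), the 8-connectivity and the function name are assumptions of this example.

```python
from collections import deque


def connected_regions(block):
    """Bounding boxes (left, top, right, bottom) of 8-connected black regions,
    ordered from left to right."""
    h, w = len(block), len(block[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if block[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Flood-fill one region, growing its bounding box as we go.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            l, t, r, b = x0, y0, x0, y0
            while queue:
                x, y = queue.popleft()
                l, t, r, b = min(l, x), min(t, y), max(r, x), max(b, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and block[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((l, t, r, b))
    return sorted(boxes)


# Two strokes share column 2, so no white vertical line separates them,
# yet they are not connected.
demo = [
    [1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
print(connected_regions(demo))  # [(0, 0, 2, 0), (2, 2, 4, 2)]
```

A character made of several disconnected strokes (such as "i") would yield more than one region, so a practical implementation would still need to group regions belonging to the same character.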
S400: Identify the character cell image blocks of punctuation marks.
To determine whether the character in a character cell image block is a punctuation mark, two cases need to be considered. If either of the following cases is satisfied, the character is determined to be a punctuation mark:
First case: if the height of the character cell image block is less than 1/2 of the height of the text line, its width is less than or equal to its height, and the block lies entirely above or entirely below MiddleLine, the character in the block is determined to be a punctuation mark, for example "," and "." and similar punctuation marks;
Second case: if the height of the character cell image block is less than the height of the text line, its width is less than ChnWidth/4, and at least one of the distances between this block and the character cell image blocks before and after it is greater than 1.2*(DisChnAndEn+Threshold), that is, the distance exceeds the upper limit of the distance range between Chinese and English character cell image blocks, the character in the block is determined to be a punctuation mark, for example ";", "!" and ":".
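The two punctuation tests can be written as a single predicate. The following is a sketch only: the function and parameter names are hypothetical, the y axis is assumed to grow downward (so a block lies entirely above MiddleLine when its bottom edge is above it), and the gaps to the neighbouring blocks are assumed to be measured beforehand.

```python
def is_punctuation(width, height, top, bottom, line_height, middle_line,
                   chn_width, gap_prev, gap_next, dis_chn_and_en, threshold):
    """Apply the two punctuation rules to one character cell image block."""
    # First case: small, no wider than tall, entirely on one side of MiddleLine.
    small = height < line_height / 2 and width <= height
    one_side = bottom < middle_line or top > middle_line
    if small and one_side:
        return True

    # Second case: narrow block with an unusually large gap on either side.
    limit = 1.2 * (dis_chn_and_en + threshold)
    narrow = height < line_height and width < chn_width / 4
    return narrow and (gap_prev > limit or gap_next > limit)
```

With a line height of 30, MiddleLine at y = 15, ChnWidth = 28, DisChnAndEn = 6 and Threshold = 2, a 4 by 6 block sitting between y = 24 and y = 30 satisfies the first case, while a 4 by 20 block with a following gap of 12 (above the limit of 9.6) satisfies the second.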
S500: Identify the Chinese character cell image block regions and the English character cell image block regions, and, within the Chinese character cell image block regions, identify the character cell image blocks occupied by Chinese character radicals; merge the character cell image blocks occupied by the radicals of adjacent Chinese characters into one character cell image block.
This step can be carried out separately for each text line region: first find all punctuation marks in the text line, then process the character cell image blocks between every two punctuation marks in turn. The process is described in detail with reference to Figure 22:
S501: Record the index values, within the whole line, of the first and last character cell image blocks between two punctuation marks as IndexBegin and IndexEnd.
S502: Traverse all character cell image blocks whose index values lie between IndexBegin and IndexEnd. Using the criterion that a Chinese character cell image block has a height greater than HeightAve and a width greater than ChnWidth*0.8, find the first Chinese character cell image block O in front-to-back order and record its index value Index within the whole line.
S503: Taking the Chinese character cell image block O as the reference, search forward block by block, denoting the character cell image block under consideration as the current Chinese character cell image block C, until the character cell image block with index value IndexBegin is reached. The specific process is:
If the index value of the current Chinese character cell image block C is IndexBegin, perform S507;
Otherwise, take the character cell image block C1 immediately in front of C and calculate the distance Dis between C and C1. If Dis falls within the interval [DisChnAndEn-Threshold, DisChnAndEn+Threshold], C1 is an English character cell image block: add it directly to the cutting result sequence, regard C1 as the new current English character cell image block C, and perform S506. Otherwise perform S504.
S504: Examine whether the character C2 in front of the Chinese character cell image block C1 is a radical of C1. Specifically:
Calculate the distance Dis1 between the centers of C1 and C2. If Dis1 falls within the interval [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], C2 is not a radical of C1 but an independent Chinese character cell image block; perform S505. Otherwise continue with the following process:
If Dis1 does not fall within the interval [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], merge C1 and C2 into a new character cell image block O1.
Examine the character cell image block C3 in front of C2 and calculate the distance Dis2 between the centers of O1 and C3. If Dis2 falls within the interval [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], C3 is an independent Chinese character cell image block; add O1 to the cutting result sequence and regard O1 as the new current Chinese character cell image block C. The specific process is the same as described in S503 and is not repeated here;
If Dis2 does not fall within the interval [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], C3 is certainly not an independent Chinese character cell image block: it may be a radical of the character in O1, or a radical of the character in the character cell image block C4 in front of C3;
Calculate the width Width1 of the character cell image block obtained by merging C3 with C4, and the width Width2 of the character cell image block obtained by merging C3 with O1;
If Width1 is less than Width2, C3 is in turn merged into O1; O1 is then added to the cutting result sequence and regarded as the new current Chinese character region C. The specific process is the same as described in S503 and is not repeated here;
If Width1 is greater than Width2, add O1 directly to the cutting result sequence and regard O1 as the new current Chinese character region C. The specific process is the same as described in S503 and is not repeated here.
S505: Add C1 directly to the cutting result sequence and regard C1 as the new current Chinese character cell image block C. The specific process is the same as described in S503 and is not repeated here.
S506: If the index value of the current English character cell image block is IndexBegin, go directly to S507. Otherwise, take the character cell image block C1 immediately in front of the current English character cell image block C and calculate the distance Dis between the centers of C and C1. If Dis falls within the interval [(4*WaveEN-WaveCN)/3, (WaveCN+2*WaveEN)/3], C1 is an English character cell image block: add it directly to the cutting result sequence, regard C1 as the new current English character cell image block, and repeat this process. Otherwise go to S504;
S507: Taking the Chinese character cell image block O as the reference, search backward block by block, denoting the character cell image block under consideration as the current Chinese character cell image block C, until the character cell image block with index value IndexEnd is reached. The specific process is the same as described in S503 and is not repeated here.
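The center-distance tests used in S504 and S506 can be sketched with a few helpers. This is an illustration, not the full S503 to S507 traversal: it covers only the two intervals and the first decision of S504 (whether C2 is merged into C1). WaveCN and WaveEN are taken as given inputs; the function names and the (left, top, right, bottom) box representation are assumptions of this example.

```python
def chinese_center_interval(wave_cn, wave_en):
    """Interval of center distances between two independent Chinese blocks."""
    return ((2 * wave_cn + wave_en) / 3, (4 * wave_cn - wave_en) / 3)


def english_center_interval(wave_cn, wave_en):
    """Interval of center distances between two adjacent English blocks."""
    return ((4 * wave_en - wave_cn) / 3, (wave_cn + 2 * wave_en) / 3)


def center_x(box):
    left, _, right, _ = box
    return (left + right) / 2


def merge_boxes(a, b):
    """Smallest box (left, top, right, bottom) covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_if_radical(c1, c2, wave_cn, wave_en):
    """First test of S504: return the merged block O1 when C2 is treated as
    a radical of C1, or None when C2 is an independent Chinese block."""
    low, high = chinese_center_interval(wave_cn, wave_en)
    dis1 = abs(center_x(c1) - center_x(c2))
    if low <= dis1 <= high:
        return None  # C2 stands on its own (S505 follows)
    return merge_boxes(c1, c2)
```

With WaveCN = 30 and WaveEN = 15 the Chinese interval is [25, 35]. A block C2 whose center lies 30 pixels in front of C1 is therefore kept as an independent character, whereas one whose center lies only 20 pixels away is merged with C1 into O1.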
All text line regions in the whole text image are processed in turn according to the above steps to obtain the final character region cutting result. Figure 23 shows Figure 22 after cutting according to S500.
Figure 24 shows the final character region cutting result obtained by applying the method provided by the embodiment of the invention to Figure 13. As can be seen, the character cutting method provided by the embodiment of the invention ensures the correctness of the character cutting result and solves the problems of adhesion between characters and of radicals being treated as independent characters.
A second embodiment of the invention provides a character cutting device. Referring to Figure 25, the device comprises a preliminary cutting unit 2501, an adhesion character cutting unit 2502, a radical identification unit 2503 and a character merging unit 2504.
The preliminary cutting unit 2501 is configured to perform row cutting and column cutting on the text image to obtain several character cell image blocks;
The adhesion character cutting unit 2502 is configured to identify the character cell image blocks that contain adhesion characters and to further cut those character cell image blocks;
The radical identification unit 2503 is configured to identify the Chinese character cell image block regions and the English character cell image block regions and, within the Chinese character cell image block regions, to identify the character cell image blocks occupied by Chinese character radicals;
The character merging unit 2504 is configured to merge the character cell image blocks occupied by the radicals of adjacent Chinese characters into one character cell image block.
Specifically, the adhesion character cutting unit 2502 is configured to determine that a character cell image block contains an adhesion Chinese character when the width of the block is greater than the average width of the Chinese character cell image blocks and the difference between the height of the block and the average height of the character cell image blocks is less than a preset threshold, or
to determine that a character cell image block contains an adhesion English character when the width of the block is greater than the average width of the Chinese character cell image blocks and the difference between the height of the block and the average height of the character cell image blocks is greater than the preset threshold.
Specifically, the radical identification unit 2503 is configured to determine that a character cell image block contains a Chinese character when the height of the block is greater than the average height of the character cell image blocks and its width is greater than 4/5 of the average width of the Chinese character cell image blocks;
when the distance between the Chinese character cell image block and the preceding character cell image block lies outside the distance range between adjacent Chinese and English character cell image blocks, to take the preceding character as the current character;
when the distance between the centers of the current character cell image block and the preceding character cell image block lies outside the distance range between the centers of adjacent Chinese character cell image blocks, to determine that the current character and the preceding character are radical characters.
Further, the character cutting device also comprises a punctuation identification unit 2505, configured to determine that a character cell image block contains a punctuation mark when the width of the block is less than or equal to its height and the block lies entirely above or entirely below the center line of the text line, or
when the height of the block is less than the height of the text line, its width is less than 1/4 of the average width of the Chinese character cell image blocks, and at least one of the distances between the block and its adjacent preceding or following character cell image block is greater than the upper limit of the distance range between adjacent Chinese and English character cell image blocks.
The embodiments of the invention ensure the correctness of the character cutting result, so that character cutting no longer depends heavily on a character recognition feedback mechanism, which further improves the character recognition rate.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is intended to include them as well.