CN101251892A - Method and apparatus for cutting character - Google Patents


Info

Publication number
CN101251892A
Authority
CN
China
Prior art keywords
character
image block
character cell
cell image
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2008101015916A
Other languages
Chinese (zh)
Other versions
CN101251892B (en)
Inventor
亓文法
程道放
李晓龙
卢书一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN2008101015916A
Publication of CN101251892A
Application granted
Publication of CN101251892B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Character Input (AREA)

Abstract

The invention discloses a character segmentation method and a character segmentation device, which can recognize character unit image blocks containing touching characters and character unit image blocks containing components and radicals, and assure the correctness of the character segmentation result. In the technical proposal of the invention, a plurality of character unit image blocks are obtained by making line segmentation and column segmentation to a text image, character unit image blocks containing touching characters are recognized and continue to be segmented, Chinese character unit image block areas and English character unit image block areas are recognized, character unit image blocks occupied by components and radicals of Chinese characters are recognized in the Chinese character unit image block areas, and character unit image blocks occupied by components and radicals of adjacent Chinese characters are merged into a character unit image block. The invention ensures that the character segmentation result does not depend too much on a character recognition feedback mechanism and further improves the recognition rate of the characters.

Description

A character segmentation method and device
Technical field
The invention belongs to the field of pattern recognition and relates specifically to the segmentation of optical characters.
Background art
As the single-character recognition accuracy of OCR (Optical Character Recognition) has improved, character segmentation has become the key problem in the OCR field, and most of the recent progress in text recognition can be attributed to better segmentation. The practical use of character recognition is nevertheless limited by current segmentation techniques: segmentation accuracy directly determines recognition accuracy, and a segmentation error leads directly to a recognition error.
The purpose of character segmentation is to cut a multi-character image into a series of sub-images, each containing one single, complete character. The segmentation methods in common use are the standard segmentation method, the recognition-based segmentation method, the holistic segmentation method, and combinations of these three.
Holistic segmentation is used mainly for English text. It recognizes each word as a whole. This avoids having to segment inside a word, but the method depends on a predefined dictionary, which greatly limits where it can be applied.
Standard segmentation is used mainly for Chinese text. It analyses the image to find reasonable cut points between characters, using static projection analysis to segment the text image into lines and then into columns. The method is carried out as follows:
The grayscale image data of the document are acquired with a digital imaging device such as a scanner. For documents that have been stored for a long time, documents that have been soiled, or copies made with a darkened copier setting, the scanned grayscale data contain a great deal of extra noise, which tends to reduce segmentation accuracy, as shown in Fig. 1. A global or local thresholding method, for example Otsu's method, the iterative method, or the bimodal method, can be used to binarize the grayscale data. Fig. 2 shows the image of Fig. 1 after processing with Otsu's method. Considerable noise remains after binarization, such as the long line segment marked 201 and the small connected region marked 202, and the noise can then be selectively filtered.
An image segmentation algorithm based on region growing can be used to filter the noise. The method groups pixels with similar properties into connected regions, where the properties include average gray value, texture, and color. Starting from an initial region (a small neighbourhood, or even a single pixel), adjacent pixels or other regions that share the property are merged into the current region, so that the region grows step by step until no further pixel or small region can be merged. The result is a connected region. All connected regions in the image are traversed and the number of black pixels in each is counted.
After the black-pixel count of each connected region has been computed, an empirical threshold ThresholdPixel is set. It can be chosen according to the noise level of the text image, or according to the font, font size, and page layout of the document. Every connected region with fewer than ThresholdPixel black pixels is treated as noise and filtered out. ThresholdPixel must not be too large, or the components of many Chinese characters will be filtered out, such as the dots in the character for "filter"; nor must it be too small, or a certain amount of noise will remain.
For example, suppose the document layout is: A4 page; FangSong ("imitation Song") typeface; font size Small No. 3; 22 lines of 28 characters each (including punctuation marks). ThresholdPixel can then be set to 50: every connected region with fewer than 50 black pixels is treated as noise, and every pixel in such a region is set to 0. Fig. 3 shows Fig. 2 after noise removal. Most of the small connected regions like 202 have been filtered out, but regions like 201 contain too many black pixels to be removed as noise.
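The filtering step above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes a binary image stored as a list of rows of 0/1 values (1 = black), uses 8-connectivity (the text does not say which connectivity is meant), and the function name `remove_small_components` is hypothetical.

```python
from collections import deque


def remove_small_components(img, threshold_pixel=50):
    """Return a copy of img with small connected regions set to 0.

    img is a list of rows of 0/1 values (1 = black pixel). A region is
    removed when it has fewer than threshold_pixel black pixels.
    8-connectivity is an assumption of this sketch.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            # Grow the region outward from this seed pixel (breadth-first).
            region = []
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Regions below the threshold are treated as noise.
            if len(region) < threshold_pixel:
                for ry, rx in region:
                    out[ry][rx] = 0
    return out
```

A long stroke of noise such as region 201 survives this filter for the same reason a real character does: its pixel count is above the threshold.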
The binarized image is then segmented into lines and columns. Fig. 4 shows the text region of Fig. 3 after standard segmentation. Because of the strong noise, touching characters may remain in the text after standard segmentation. Characters are said to touch when strokes of neighbouring characters in a multi-character image are in contact.
The recognition-based method adds feedback to the standard and holistic methods. It generates several segmentation hypotheses and then selects among them to obtain the best segmentation result. It can judge whether a segmentation result is correct but cannot repair a segmentation error, so it does not effectively solve problems such as touching characters and broken strokes. It is also complex and time-consuming, and is rarely used in practice.
Existing character segmentation techniques therefore have the following shortcomings:
(1) The images of two or more Chinese characters easily stick together, either as a side effect of image preprocessing or because the spacing between characters is too small, so that segmentation is inaccurate and the recognition rate is low.
In printed text images, touching and overlapping characters often arise because the print quality is poor and because binarization of the text image introduces noise and errors.
Long storage and photocopying add further noise: the document may be soiled, a reader may have added annotations, or the copier density setting may have been increased. Common noise removal algorithms can handle only small specks and cannot handle long lines. Such lines make characters touch and affect the recognition result.
(2) A Chinese character made up of several components is easily split into several regions, and each component is then treated as a Chinese character, so that merging is inaccurate and the recognition rate is low. There are two reasons for this:
First, in a printed Chinese character made up of components, the gap between the components is small, or only a few pixels connect them, so the components are normally handled together as one sub-image. After the scanned grayscale image has been binarized, however, a component is easily treated as a separate Chinese character.
Second, binarizing a grayscale document image often loses useful information and easily breaks strokes, so that a character made up of components is split into several regions. For example, after a printed document has been photocopied several times, the gray values of the character images become very faint, and thin strokes often break in the middle.
(3) The correctness of the segmentation result depends too heavily on the character recognition feedback mechanism.
Summary of the invention
Embodiments of the invention provide a character segmentation method and device for improving the correctness of character segmentation.
An embodiment of the invention provides a character segmentation method, comprising:
performing line segmentation and column segmentation on a text image to obtain a plurality of character unit image blocks;
identifying the character unit image blocks that contain touching characters, and further segmenting those blocks;
identifying Chinese character unit image block regions and English character unit image block regions, and identifying, within the Chinese character unit image block regions, the character unit image blocks occupied by components and radicals of Chinese characters;
merging the character unit image blocks occupied by components and radicals of adjacent Chinese characters into one character unit image block.
The touching characters include touching Chinese characters, and a character unit image block containing touching Chinese characters is identified as follows:
when the width of a character unit image block is greater than the average width of Chinese character unit image blocks, and the difference between the height of the block and the average height of character unit image blocks is less than a preset threshold, the block is determined to contain touching Chinese characters.
The touching characters also include touching English characters, and a character unit image block containing touching English characters is identified as follows:
when the width of a character unit image block is greater than the average width of Chinese character unit image blocks, and the difference between the height of the block and the average height of character unit image blocks is greater than the preset threshold, the block is determined to contain touching English characters.
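The two width-and-height rules above can be sketched as a single classifier. This is an illustrative reading only: the text speaks of "the difference" between a block's height and the average height, and the sketch assumes the absolute difference is meant. The function and parameter names are hypothetical.

```python
def classify_wide_block(width, height, chn_width, height_ave, height_threshold):
    """Classify a block as touching Chinese, touching English, or neither.

    chn_width is the average width of Chinese character unit image blocks,
    height_ave the average height of all character unit image blocks.
    Using abs() for "the difference" is an assumption of this sketch.
    """
    if width <= chn_width:
        return None  # not wider than a normal Chinese character
    diff = abs(height - height_ave)
    if diff < height_threshold:
        return "touching_chinese"
    if diff > height_threshold:
        return "touching_english"
    return None  # diff equals the threshold: the text leaves this case open


# A block twice as wide as a Chinese character and of typical height.
print(classify_wide_block(60, 30, 32, 30, 4))
```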
The character unit image blocks occupied by components and radicals of Chinese characters are identified as follows:
when the height of a character unit image block is greater than the average height of character unit image blocks and its width is greater than 4/5 of the average width of Chinese character unit image blocks, the block is determined to contain a Chinese character;
when the distance between that Chinese character unit image block and the preceding character unit image block lies outside the distance range between adjacent Chinese and English character unit image blocks, the preceding character is taken as the current character;
when the distance between the centers of the current character unit image block and the preceding character unit image block lies outside the distance range between the centers of adjacent Chinese character unit image blocks, the current character and the preceding character are determined to be component characters.
Further, the character segmentation method also comprises identifying the character unit image blocks of punctuation marks.
The character unit image blocks of punctuation marks are identified as follows:
when the width of a character unit image block is less than or equal to its height, and the block lies entirely above or entirely below the center line of the text line, the block is determined to contain a punctuation mark; or
when the height of a character unit image block is less than the height of the text line, its width is less than 1/4 of the average width of Chinese character unit image blocks, and at least one of the distances from the block to its preceding and following character unit image blocks is greater than the upper limit of the distance range between adjacent Chinese and English character unit image blocks, the block is determined to contain a punctuation mark.
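The two punctuation rules can be sketched as one predicate. The sketch assumes image coordinates with y increasing downward, so that a block with `top < bottom` lies above the center line when `bottom <= middle_line`. All names are hypothetical, and `gap_upper` stands for the upper limit of the distance range between adjacent Chinese and English blocks.

```python
def is_punctuation(top, bottom, width, middle_line, line_height,
                   chn_width, gap_prev, gap_next, gap_upper):
    """Apply the two punctuation rules to one character unit image block.

    top/bottom are the block's vertical extent (y grows downward, an
    assumption of this sketch); gap_prev/gap_next are the distances to
    the neighbouring blocks on either side.
    """
    height = bottom - top
    # Rule 1: no wider than tall, and entirely on one side of the center line.
    rule_1 = width <= height and (bottom <= middle_line or top >= middle_line)
    # Rule 2: short, narrow, and set apart from at least one neighbour.
    rule_2 = (height < line_height
              and width < chn_width / 4
              and (gap_prev > gap_upper or gap_next > gap_upper))
    return rule_1 or rule_2


# A small mark sitting in the lower half of a 32-pixel text line.
print(is_punctuation(24, 30, 5, 16, 32, 32, 2, 2, 6))
```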
An embodiment of the invention provides a character segmentation device, comprising:
a preliminary segmentation unit, configured to perform line segmentation and column segmentation on a text image to obtain a plurality of character unit image blocks;
a touching character segmentation unit, configured to identify the character unit image blocks that contain touching characters and to further segment those blocks;
a component identification unit, configured to identify Chinese character unit image block regions and English character unit image block regions, and to identify, within the Chinese character unit image block regions, the character unit image blocks occupied by components and radicals of Chinese characters;
a character merging unit, configured to merge the character unit image blocks occupied by components and radicals of adjacent Chinese characters into one character unit image block.
The touching character segmentation unit is specifically configured to determine that a character unit image block contains touching Chinese characters when the width of the block is greater than the average width of Chinese character unit image blocks and the difference between the height of the block and the average height of character unit image blocks is less than a preset threshold; or
to determine that a character unit image block contains touching English characters when the width of the block is greater than the average width of Chinese character unit image blocks and the difference between the height of the block and the average height of character unit image blocks is greater than the preset threshold.
The component identification unit is specifically configured to determine that a character unit image block contains a Chinese character when the height of the block is greater than the average height of character unit image blocks and its width is greater than 4/5 of the average width of Chinese character unit image blocks;
to take the preceding character as the current character when the distance between that Chinese character unit image block and the preceding character unit image block lies outside the distance range between adjacent Chinese and English character unit image blocks;
and to determine that the current character and the preceding character are component characters when the distance between the centers of the current character unit image block and the preceding character unit image block lies outside the distance range between the centers of adjacent Chinese character unit image blocks.
Further, the character segmentation device also comprises a punctuation identification unit, configured to determine that a character unit image block contains a punctuation mark when the width of the block is less than or equal to its height and the block lies entirely above or entirely below the center line of the text line; or
when the height of the block is less than the height of the text line, its width is less than 1/4 of the average width of Chinese character unit image blocks, and at least one of the distances from the block to its preceding and following character unit image blocks is greater than the upper limit of the distance range between adjacent Chinese and English character unit image blocks.
With the above technical scheme, embodiments of the invention perform line segmentation and column segmentation on a text image to obtain a plurality of character unit image blocks; identify the blocks that contain touching characters and further segment them; identify Chinese and English character unit image block regions and, within the Chinese regions, the blocks occupied by components and radicals of Chinese characters; and merge the blocks occupied by components and radicals of adjacent Chinese characters into one character unit image block. The method can identify both the blocks that contain touching characters and the blocks that contain components and radicals, so that the segmentation result does not depend too heavily on the character recognition feedback mechanism, which further improves the character recognition rate.
Brief description of the drawings
Fig. 1 is a schematic of the grayscale image of a scanned document;
Fig. 2 shows Fig. 1 after binarization with Otsu's method;
Fig. 3 shows Fig. 2 after noise removal;
Fig. 4 shows the text region of Fig. 3 after standard segmentation;
Fig. 5 is a flowchart of a character segmentation method provided by an embodiment of the invention;
Fig. 6 is a flowchart of the line segmentation and column segmentation of a binarized text image;
Fig. 7 is a binarized text bitmap image;
Fig. 8 shows the text region of Fig. 7 after blank-space segmentation;
Fig. 9 shows the text region of Fig. 7 with extra noise added;
Fig. 10 shows the line segmentation of the text region of Fig. 9 with a small threshold;
Fig. 11 shows the line segmentation of the text region of Fig. 9 with a large threshold;
Fig. 12 shows the line segmentation of a low-noise text image with a large threshold;
Fig. 13 shows Fig. 7 after preliminary line segmentation and column segmentation;
Fig. 14 illustrates the height and center line position of a text line, as provided by an embodiment of the invention;
Fig. 15 illustrates the height of a character unit image block, as provided by an embodiment of the invention;
Fig. 16 illustrates the width of a character unit image block, as provided by an embodiment of the invention;
Fig. 17 illustrates the distance between the centers of adjacent character unit image blocks, as provided by an embodiment of the invention;
Fig. 18 illustrates the distance between adjacent character unit image blocks, as provided by an embodiment of the invention;
Fig. 19 is a flowchart of the method for segmenting character units provided by an embodiment of the invention;
Fig. 20 is an enlarged view of part of Fig. 13;
Fig. 21 shows Fig. 20 after the touching character unit blocks have been segmented;
Fig. 22 is a flowchart of the method for merging component characters;
Fig. 23 shows the result of correctly segmenting Fig. 20 with the method provided by an embodiment of the invention;
Fig. 24 shows the result of correctly segmenting Fig. 13 with the method provided by an embodiment of the invention;
Fig. 25 is a structural schematic of a character segmentation device provided by an embodiment of the invention.
Detailed description of the embodiments
Embodiments of the invention provide a character segmentation method and device that address the low character recognition rate caused by segmentation errors in prior-art methods. The technical scheme is described in detail below with reference to the drawings and specific embodiments.
The first embodiment of the invention provides a character segmentation method, shown in Fig. 5, which is carried out as follows:
S100: perform line segmentation and column segmentation on the text image to obtain a plurality of character unit image blocks. This process is described in detail with reference to Fig. 6:
S101: perform line segmentation on the binarized text image.
Obtain the binary text bitmap image to be segmented. The text region is nWidth pixels wide and nHeight pixels high. Define a function f(i, j) that gives the pixel value at row i, column j of the image: f(i, j) is 1 when the pixel is a foreground point and 0 when it is a background point.
To segment the text into line regions and to remove noise distributed in rows, scan the text image from top to bottom and compute, for each horizontal scan line n, the sum S_n of the foreground pixel values, S_n = f(n, 0) + f(n, 1) + ... + f(n, i) + ... (i = 0, 1, 2, ..., nWidth). Set a threshold N_1. If S_n ≥ N_1, the scan line belongs to text; if S_n < N_1, the scan line is noise or blank. Row-shaped noise is thereby removed and the line regions of the text are preliminarily segmented. Fig. 8 shows the text region of Fig. 7 after blank-space segmentation. At the same time, record the boundary position of each line, namely the coordinates of its upper-left and lower-right points and the position of the center line MiddleLine between its two horizontal boundary lines, and compute the height of each connected text line.
The following points should be noted when setting N_1:
(1) If the text image has little noise, N_1 can be set small with essentially no effect on line segmentation. For example, N_1 can be set to 10.
(2) If the text image is noisy, as shown in Fig. 9, N_1 should be set larger. If N_1 is small, strong noise cannot be eliminated and the segmented text line regions will be inaccurate, as shown in Fig. 10. N_1 must therefore be set larger to solve the problem; for example, N_1 can be set to 60, with the result shown in Fig. 11.
(3) A large N_1 affects text lines that contain few characters. In such a line, some horizontal scan lines contain few foreground points and their S_n is small. With a large N_1, S_n < N_1 can occur, and some foreground points of the line are wrongly treated as noise or blank. In Fig. 12 the last line contains only one Chinese character, "war", and is wrongly segmented into two or more lines. There are two ways to address this. The first requires manual intervention: the threshold N_1 is set by hand according to how contaminated the text image is. The second is to segment with a larger threshold, analyse the resulting line spacings and line heights to find abnormal values, and try merging text line boundaries according to the abnormal data. If the merge does not produce new abnormal data, the text line boundaries corresponding to the abnormal data are merged; otherwise the merge is abandoned. This removes most of the noise and also effectively eliminates abnormal data in the line height sequence and the line spacing sequence.
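The horizontal projection described above can be sketched as follows. The sketch assumes the binary image is a list of rows of 0/1 values, so that S_n is simply the sum of row n. The function names and the half-open `(top, bottom)` convention are choices of this illustration, not of the patent.

```python
def segment_lines(img, n1):
    """Split a binary image into text lines by horizontal projection.

    A scan line belongs to text when its foreground sum S_n >= n1.
    Returns a list of (top, bottom) row index pairs, bottom exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(img):
        s_n = sum(row)  # foreground pixels on scan line y
        if s_n >= n1:
            if start is None:
                start = y  # first text scan line of a new text line
        elif start is not None:
            lines.append((start, y))
            start = None
    if start is not None:
        lines.append((start, len(img)))  # text runs to the last row
    return lines


def line_metrics(top, bottom):
    """Return the height HL and center line MiddleLine of one text line."""
    return bottom - top, (top + bottom) / 2


img = [[0, 0, 0, 0],
       [1, 1, 1, 0],
       [1, 1, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 0, 0],
       [1, 1, 1, 1]]
print(segment_lines(img, 2))  # row 3 has only one foreground pixel
print(segment_lines(img, 1))
```

Running the example with the two thresholds shows the trade-off from point (3): the larger threshold drops a sparse scan line that the smaller one keeps.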
S102: on the basis of the line segmentation of the binarized text image, perform column segmentation.
To segment the text into column regions and to remove noise distributed in columns, scan the text image from left to right and compute, for each vertical scan line n, the sum R_n of the foreground pixel values, R_n = ... + f(j, n) + ..., where j ranges from the upper boundary to the lower boundary of the text line region. Set a threshold N_2. If R_n ≥ N_2, the scan line belongs to a character; if R_n < N_2, the scan line is noise or blank, and column-shaped noise is removed. Because noise removal has already been applied to the binarized text image, ordinary small noise does not affect the column segmentation of the text, so N_2 can be set to 0.
Each character thus has a bounding rectangle whose upper and lower boundaries are those of the text line and whose left and right boundaries are the column cut points of the character.
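The column step can be sketched in the same style. One interpretive choice is flagged here: with N_2 = 0 the literal test R_n ≥ N_2 would accept empty columns, so this sketch treats a column as part of a character when R_n > N_2. That reading, the function name, and the `(left, top, right, bottom)` box layout are assumptions of the illustration.

```python
def segment_columns(img, top, bottom, n2=0):
    """Split one text line into character boxes by vertical projection.

    img is a list of rows of 0/1 values; top/bottom delimit the text line
    (bottom exclusive). A column counts as character when its foreground
    sum R_n > n2, which is this sketch's reading of the threshold test.
    Returns boxes as (left, top, right, bottom), right exclusive.
    """
    width = len(img[0]) if img else 0
    boxes = []
    start = None
    for x in range(width):
        r_n = sum(img[y][x] for y in range(top, bottom))
        if r_n > n2:
            if start is None:
                start = x  # left cut point of a new character
        elif start is not None:
            boxes.append((start, top, x, bottom))
            start = None
    if start is not None:
        boxes.append((start, top, width, bottom))
    return boxes


img = [[1, 1, 0, 1, 0],
       [1, 0, 0, 1, 0]]
print(segment_columns(img, 0, 2))
```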
S103: obtain the minimum bounding rectangle that contains all the black pixels of each character.
Because characters differ in height, especially Chinese and English characters, the bounding rectangle of each character is shrunk inward or expanded outward until it is the minimum bounding rectangle containing all the black pixels of the character. This yields a set sequence Ω of character unit image blocks, as shown in Fig. 13.
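Tightening a box to its minimum bounding rectangle can be sketched as below. The sketch only shrinks the box, which is sufficient when the starting box already spans the full text line. The outward expansion mentioned in the text is not covered, and the function name is hypothetical.

```python
def tighten_box(img, box):
    """Shrink box to the minimum rectangle containing its black pixels.

    box is (left, top, right, bottom) with right/bottom exclusive, and img
    is a list of rows of 0/1 values. Returns None for an empty box.
    """
    left, top, right, bottom = box
    # Rows and columns inside the box that contain at least one black pixel.
    rows = [y for y in range(top, bottom)
            if any(img[y][x] for x in range(left, right))]
    cols = [x for x in range(left, right)
            if any(img[y][x] for y in range(top, bottom))]
    if not rows:
        return None
    return (cols[0], rows[0], cols[-1] + 1, rows[-1] + 1)


img = [[0, 0, 0, 0],
       [0, 1, 1, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 0]]
print(tighten_box(img, (0, 0, 4, 4)))
```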
The characteristic of all rectangular image pieces in S200, the statistical study text image.
According to the sequence of sets Ω that comprises character rectangular image piece that obtains among the S103, the characteristic of all rectangular image pieces in the statistical study text image comprises following characteristic:
(1), the average row height of the height of line of text, position of center line and line of text
As shown in figure 14, the height H L of line of text is meant the distance between two horizontal lines that comprise literal; The center line MiddleLine of line of text is meant two residing positions of the center line between the horizontal line; Add up the height H L of all line of text, calculate the high HLAVE of text filed average row.All that are syncopated as among the traversal S100 are text filed, calculate corresponding row high HL, position of center line MiddleLine and the text filed high HLAVE of average row.
(2), the average height of character cell image block
As shown in figure 15, the height H of character cell image block is meant the height of the minimum boundary rectangle frame of each character cell, adds up the height of the minimum boundary rectangle frame of all character cells, calculates the average height HeightAve of character cell image block.
(3), the mean breadth of character cell image block
As shown in Figure 16, the width Width (abbreviated W) of a character cell image block is the width of the minimum bounding rectangle of that character cell. A character cell is not necessarily a valid character: it may be a radical split off from a Chinese character, or several characters that have stuck together. In Figure 16, for example, "newspaper" consists of two Chinese characters that adhere to each other and were cut out as one character, while "arriving" is a Chinese character that was split into two radical characters.
Compute the width distribution of all character cell image blocks: let the x-axis be the width of a character cell image block and the y-axis the number of character cell image blocks with that width, similar to the histogram statistics of a grayscale image. Because the characters are mostly Chinese characters, and the width of a Chinese character cell image block is not much larger than its height, the upper limit of the x-axis can be set to 1.5 times the average line height HLAVE of the text region.
In ordinary documents, Chinese characters and English letters or digits do not stick together. The width distribution obtained above shows two adjacent regions where the counts cluster: the cluster with the larger widths is the width range of normal Chinese character cell image blocks, and the cluster with the smaller widths is the width range of normal English or digit character cell image blocks. The distribution also contains regions of still larger and still smaller widths. The larger widths come from character cell image blocks that contain adhesion characters, such as "newspaper", whose block contains two Chinese characters and is therefore wide. The smaller widths may come from character cell image blocks that contain Chinese character radicals; for example, the character "river" is split into three characters, so each resulting block is narrow.
Within the width range of Chinese characters, the local peak value is taken as the average width ChnWidth of Chinese character cell image blocks; likewise, within the width range of English/digit characters, the local peak value is taken as the average width EnWidth of English/digit character cell image blocks.
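The width statistics described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `width_histogram` and `two_main_peaks` are hypothetical, and picking the two most frequent local maxima is an assumed, simplified way of locating the Chinese and English/digit clusters (the text does not specify how the two ranges are delimited).

```python
from collections import Counter


def width_histogram(widths, avg_line_height):
    """Count blocks per integer width, keeping widths up to 1.5 * line height."""
    upper = int(1.5 * avg_line_height)  # x-axis upper limit from the text
    return Counter(w for w in widths if 0 < w <= upper)


def two_main_peaks(hist):
    """Return (en_width, chn_width): the two most frequent local maxima.

    Assumption: a bin is a local maximum if its count is not smaller than
    the counts of both neighbouring bins. Plateaus are not handled specially.
    """
    peaks = [w for w, n in hist.items()
             if n >= hist.get(w - 1, 0) and n >= hist.get(w + 1, 0)]
    top_two = sorted(peaks, key=lambda w: hist[w], reverse=True)[:2]
    if len(top_two) < 2:
        raise ValueError("expected two width clusters")
    en_width, chn_width = sorted(top_two)  # narrower peak = English/digits
    return en_width, chn_width


# Usage with made-up widths: a Chinese cluster near 30 and an English one near 15.
sample = [30] * 10 + [29] * 4 + [31] * 3 + [15] * 6 + [14] * 2 + [16] * 2 + [60, 8]
hist = width_histogram(sample, avg_line_height=32)
en_width, chn_width = two_main_peaks(hist)
```

In this made-up sample the width 60 (an adhesion block) falls outside the x-axis limit and the width 8 (a radical fragment) forms only a minor peak, so neither affects the two main peaks.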
(4) Distance between the centers of adjacent character cell image blocks
As shown in Figure 17, the distance Wave between the centers of adjacent character cell image blocks is the distance between the centers of the minimum bounding rectangles of adjacent characters.
Let the x-axis be the distance between the centers of adjacent character cell image blocks and the y-axis the number of character cell image blocks with that distance. The resulting distribution shows two adjacent regions where the distance values cluster: the region with the larger distances corresponds to Chinese character cell image blocks, and the region with the smaller distances corresponds to English/digit character cell image blocks.
In the Chinese cluster and the English/digit cluster, find the local peak values WaveCN and WaveEN, respectively. From WaveCN and WaveEN, the range of the distance between the centers of adjacent Chinese character cell image blocks can be delimited as [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], and the range of the distance between the centers of adjacent English/digit character cell image blocks as [(4*WaveEN-WaveCN)/3, (WaveCN+2*WaveEN)/3].
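The two intervals can be checked with a short worked example. Algebraically, each interval is centered on its own peak (WaveCN or WaveEN) and has half-width (WaveCN-WaveEN)/3, so the two ranges never overlap when WaveCN is greater than WaveEN. The sketch below only evaluates the formulas; the function name `center_distance_ranges` is a hypothetical label.

```python
def center_distance_ranges(wave_cn, wave_en):
    """Return ((cn_low, cn_high), (en_low, en_high)) from the two peak values."""
    chinese = ((2 * wave_cn + wave_en) / 3, (4 * wave_cn - wave_en) / 3)
    english = ((4 * wave_en - wave_cn) / 3, (wave_cn + 2 * wave_en) / 3)
    return chinese, english


# Made-up peaks: WaveCN = 30 pixels, WaveEN = 15 pixels.
chinese_range, english_range = center_distance_ranges(30, 15)
# chinese_range is (25.0, 35.0) and english_range is (10.0, 20.0)
```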
(5) Distance between adjacent character cell image blocks
As shown in Figure 18, the distance Dis between character cell image blocks is defined for two adjacent character cell image blocks in the same text line as the distance from the right boundary of the preceding block to the left boundary of the following block. Computing the distribution of the distances between all character cell image blocks reveals one obvious cluster. This cluster may contain both the distances between adjacent Chinese character cell image blocks and the distances between adjacent English/digit character cell image blocks: both kinds of distance are very small, so there is no absolute dividing line between them. The distribution also shows a second cluster, which corresponds to the distances between adjacent Chinese and English/digit character cell image blocks; its local peak value is taken as DisChnAndEn. The size of this cluster is not fixed and depends on how heavily Chinese and English are mixed in the document region. The range of the distance between adjacent Chinese and English character cell image blocks can then be delimited as [DisChnAndEn-Threshold, DisChnAndEn+Threshold], where Threshold is a given threshold that can be set according to actual conditions.
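A minimal sketch of the gap measure and the range test follows. It assumes each block is represented by its (left, right) x-extent and that blocks are listed in reading order within one text line; the function names are hypothetical, and the numeric values are made up for illustration.

```python
def gaps_between_blocks(blocks):
    """blocks: list of (left, right) x-extents in reading order within one line.

    Returns the distance from each block's right boundary to the next
    block's left boundary.
    """
    return [nxt[0] - cur[1] for cur, nxt in zip(blocks, blocks[1:])]


def is_chinese_english_gap(gap, dis_chn_and_en, threshold):
    """True if gap lies in [DisChnAndEn - Threshold, DisChnAndEn + Threshold]."""
    return dis_chn_and_en - threshold <= gap <= dis_chn_and_en + threshold


# Made-up line: two Chinese-sized blocks followed by a narrower block.
line = [(0, 28), (30, 58), (66, 80)]
gaps = gaps_between_blocks(line)
flags = [is_chinese_english_gap(g, dis_chn_and_en=8, threshold=2) for g in gaps]
```

Here the first gap (2) is an ordinary intra-script gap, while the second (8) falls inside the assumed Chinese/English range.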
S300: Identify the character cell image blocks that contain adhesion characters, and further cut those character cell image blocks.
If the width of a character cell image block is greater than the average width of Chinese character cell image blocks, the character is determined to be an adhesion character. By comparing the height of the adhesion character cell image block with the average height HeightAve of character cell image blocks, adhesion character cell image blocks are divided into adhesion Chinese character image blocks and adhesion English character image blocks. The identification and cutting of each kind is described below.
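The classification rule can be sketched as follows. This is an illustration under stated assumptions: the "difference" in height is taken as an absolute difference, the case where the difference equals the threshold (which the text does not cover) is treated as English, and the function name and string labels are hypothetical.

```python
def classify_block(width, height, chn_width, height_ave, height_threshold):
    """Classify a block as 'normal', 'adhesion_chinese' or 'adhesion_english'."""
    if width <= chn_width:
        return "normal"  # not wider than an average Chinese character
    # Assumption: compare the absolute height difference with the threshold.
    if abs(height - height_ave) < height_threshold:
        return "adhesion_chinese"
    return "adhesion_english"


# Made-up values: ChnWidth = 30, HeightAve = 30, threshold = 4.
kind = classify_block(width=58, height=30, chn_width=30,
                      height_ave=30, height_threshold=4)
```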
If the difference between the height of the adhesion character cell image block and the average height of character cell image blocks is less than a preset threshold, the adhesion character is determined to be an adhesion Chinese character. In general, the vertical scan line at the point where two Chinese characters adhere contains the fewest foreground points and lies in a trough of the projection profile. The adhesion Chinese character can therefore be cut according to the number of foreground points on each vertical scan line. The detailed process is described below with reference to Figure 19:
S301: If the difference between the height of the adhesion character cell image block and the average height of character cell image blocks is less than the preset threshold, determine that the adhesion character is an adhesion Chinese character.
S302: Denote the top, bottom, left, and right boundaries of the adhesion character cell image block as T, B, L, and R, respectively. With L to R as the horizontal axis and T to B as the vertical axis, count the black pixels on each vertical scan line of the block, then sort the horizontal coordinates in ascending order of their foreground point counts to obtain an array sequence  of positions.
S303: Create an empty array sequence 1 and add the horizontal coordinates of the left boundary L and the right boundary R to 1. Select the first element of  and insert it into 1 in order of position.
S304: Compute the distances between every two adjacent positions in 1. If all distances are less than the average width of character cell image blocks, go to S306; otherwise go to S305.
S305: Select the next element of  and insert it into 1 in order of position, then repeat S304 until the distances between every two adjacent positions in 1 are all less than the average width ChnWidth of character cell image blocks.
S306: Using the positions in 1 as cut points, split the adhesion character cell image block into several sub character cell image blocks whose heads and tails overlap. Then shrink or expand the bounding rectangle of each character so that it becomes the minimum bounding rectangle containing all black pixels of that character.
S307: Delete the adhesion character cell image block from the original sequence Ω and insert all character cell image blocks obtained in S306 at the same position in Ω, obtaining a new character cell image block sequence Ω1.
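Steps S302 to S305 can be sketched in Python as below. This is a simplified illustration rather than the patented implementation: the function name `cut_positions`, the input format (a list of per-column foreground counts), and the tie-breaking rule (leftmost column first) are assumptions. The bounding-box tightening of S306 and the sequence update of S307 are omitted.

```python
import bisect


def cut_positions(column_counts, left, right, max_width):
    """Choose cut columns for an adhesion block spanning columns left..right.

    column_counts[i] is the number of foreground pixels in column left + i.
    max_width plays the role of ChnWidth in S304/S305.
    """
    # S302: columns sorted by ascending foreground count (ties: leftmost first).
    candidates = sorted(range(left, right + 1),
                        key=lambda x: (column_counts[x - left], x))
    # S303: start from the two block boundaries.
    cuts = [left, right]
    for x in candidates:
        if x not in cuts:
            bisect.insort(cuts, x)  # insert in order of position
        # S304: stop once every adjacent pair is closer than max_width.
        if all(b - a < max_width for a, b in zip(cuts, cuts[1:])):
            break
    return cuts


# Made-up projection: a 60-column block with a clear trough at column 30.
counts = [10] * 60
counts[30] = 0
counts[29] = 1
cuts = cut_positions(counts, left=0, right=59, max_width=40)
```

In this example the first candidate (column 30) already brings both segment widths below `max_width`, so the block is split once at the trough.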
Figure 20 is an enlarged view of part of Figure 13, and Figure 21 shows Figure 20 after cutting according to S300.
If the difference between the height of the adhesion character cell image block and the average height of character cell image blocks is greater than the preset threshold, the adhesion character cell image block is determined to be an English/digit character cell image block. For adhesion between English characters, two cases need to be considered:
First case: the adjacent character images cannot be separated by a white vertical line, which causes the apparent adhesion. In this case a border following algorithm can be used to find each connected region, and the adhesion character can then be cut.
Second case: the adjacent character images actually adhere to each other. The character contour can be used to search for all possible cut points and generate a series of cutting paths, and the optimal cutting path, selected according to an evaluation of the English cutting, is used to cut the adhesion character.
S400: Identify the character cell image blocks of punctuation marks.
To determine whether the character in a character cell image block is a punctuation mark, two cases need to be considered; the character is determined to be a punctuation mark if either of them is satisfied:
First case: if the height of the character cell image block is less than 1/2 of the text line height, its width is less than or equal to its height, and the block lies entirely above or entirely below MiddleLine, the character in the block is determined to be a punctuation mark, for example "," or ".";
Second case: if the height of the character cell image block is less than the text line height, its width is less than ChnWidth/4, and at least one of the distances between this block and the preceding and following character cell image blocks is greater than 1.2*(DisChnAndEn+Threshold), that is, the distance exceeds the upper limit of the distance range between adjacent Chinese and English character cell image blocks, the character in the block is determined to be a punctuation mark, for example ";", "!", or ":".
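The two punctuation tests can be sketched as a single predicate. The sketch assumes image coordinates with y increasing downward, a block described by its bounding box, and gaps measured as in item (5); the `Block` class and the parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top


def is_punctuation(block, line_height, middle_line, chn_width,
                   gap_before, gap_after, dis_chn_and_en, threshold):
    """Apply the two cases of S400; either one is sufficient."""
    # Case 1: small, no wider than tall, entirely on one side of MiddleLine.
    small = block.height < line_height / 2 and block.width <= block.height
    one_side = block.bottom <= middle_line or block.top >= middle_line
    if small and one_side:
        return True
    # Case 2: narrow and separated from a neighbour by an unusually large gap.
    limit = 1.2 * (dis_chn_and_en + threshold)
    narrow = block.height < line_height and block.width < chn_width / 4
    return narrow and (gap_before > limit or gap_after > limit)


# Made-up comma-like block sitting below the middle line of a 32-pixel line.
comma = Block(left=10, top=24, right=16, bottom=32)
result = is_punctuation(comma, line_height=32, middle_line=16, chn_width=30,
                        gap_before=3, gap_after=3, dis_chn_and_en=8, threshold=2)
```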
S500: Identify the Chinese character cell image block regions and the English character cell image block regions, and identify, within the Chinese character cell image block regions, the character cell image blocks occupied by Chinese character radicals; merge the adjacent character cell image blocks occupied by Chinese character radicals into one character cell image block.
This step can be applied to each text line region separately: first find all punctuation marks in the text line, then process the character cell image blocks between each pair of punctuation marks in turn. The detailed process is described below with reference to Figure 22:
S501: Record the index values, within the full line, of the first and last character cell image blocks between two punctuation marks as IndexBegin and IndexEnd.
S502: Traverse all character cell image blocks whose index values lie between IndexBegin and IndexEnd. Using the criterion that a Chinese character cell image block has a height greater than HeightAve and a width greater than ChnWidth*0.8, find, in order from front to back, the first Chinese character cell image block O, and record its index value Index in the full line.
S503: Taking Chinese character cell image block O as the reference, search forward block by block, denoting the current Chinese character cell image block as C, until the character cell image block with index value IndexBegin is reached. The processing is as follows:
If the index value of the current Chinese character cell image block C is IndexBegin, go to S507;
Otherwise, take the character cell image block C1 in front of the current Chinese character cell image block C and compute the distance Dis between C and C1. If Dis falls within the interval [DisChnAndEn-Threshold, DisChnAndEn+Threshold], C1 is an English character cell image block: add it directly to the cutting result sequence, treat C1 as the new current English character cell image block C, and go to S506. Otherwise go to S504.
S504: Check whether the character C2 in front of Chinese character cell image block C1 is its radical, specifically:
Compute the distance Dis1 between the centers of C1 and C2. If Dis1 falls within the interval [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], C2 is not a radical of C1 but an independent Chinese character cell image block; go to S505. Otherwise continue with the following process:
If Dis1 does not fall within the interval [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], merge C1 and C2 into a new character cell image block O1;
Examine the character cell image block C3 in front of C2 and compute the distance Dis2 between the centers of O1 and C3. If Dis2 falls within the interval [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], C3 is an independent Chinese character cell image block: add O1 to the cutting result sequence and treat O1 as the new current Chinese character cell image block C. The subsequent processing is the same as described in S503 and is not repeated here;
If Dis2 does not fall within the interval [(2*WaveCN+WaveEN)/3, (4*WaveCN-WaveEN)/3], C3 is certainly not an independent Chinese character cell image block. It may be a radical of the character in O1, or a radical of the character in the character cell image block C4 in front of C3;
Compute the width Width1 of the character cell image block obtained by merging C3 with C4, and the width Width2 of the character cell image block obtained by merging C3 with O1;
If Width1 is less than Width2, C3 is merged into O1 again; O1 is then added to the cutting result sequence and treated as the new current Chinese character region C. The subsequent processing is the same as described in S503 and is not repeated here;
If Width1 is greater than Width2, O1 is added directly to the cutting result sequence and treated as the new current Chinese character region C. The subsequent processing is the same as described in S503 and is not repeated here.
S505: Add C1 directly to the cutting result sequence and treat C1 as the new current Chinese character cell image block C. The subsequent processing is the same as described in S503 and is not repeated here.
S506: If the index value of the current English character cell image block is IndexBegin, go directly to S507. Otherwise, take the character cell image block C1 in front of the current English character cell image block C and compute the distance Dis between the centers of C and C1. If Dis falls within the interval [(4*WaveEN-WaveCN)/3, (WaveCN+2*WaveEN)/3], C1 is an English character cell image block: add it directly to the cutting result sequence, treat C1 as the new current English character cell image block, and repeat this process. Otherwise go to S504;
S507: Taking Chinese character cell image block O as the reference, search backward block by block, denoting the current Chinese character cell image block as C, until the character cell image block with index value IndexEnd is reached. The processing is the same as described in S503 and is not repeated here.
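Two building blocks of the procedure above, the seed search of S502 and the first center-distance test of S504, can be sketched as follows. This is only a partial illustration, not the full S503 to S507 state machine: blocks are assumed to be (left, top, right, bottom) tuples in reading order, and all function names are hypothetical.

```python
def find_first_chinese(blocks, index_begin, index_end, height_ave, chn_width):
    """S502: index of the first block that looks like a Chinese character."""
    for i in range(index_begin, index_end + 1):
        left, top, right, bottom = blocks[i]
        if (bottom - top) > height_ave and (right - left) > 0.8 * chn_width:
            return i
    return None  # no Chinese-sized block between the two punctuation marks


def center_x(block):
    return (block[0] + block[2]) / 2


def merge(a, b):
    """Bounding box that covers both blocks."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_if_radical(c1, c2, wave_cn, wave_en):
    """S504, first test: merge c2 into c1 unless they are a normal
    Chinese-to-Chinese center distance apart. Returns None when c2 is
    judged to be an independent Chinese character block."""
    low = (2 * wave_cn + wave_en) / 3
    high = (4 * wave_cn - wave_en) / 3
    dis1 = abs(center_x(c1) - center_x(c2))
    if low <= dis1 <= high:
        return None
    return merge(c1, c2)


# Made-up blocks: a narrow left fragment followed by the rest of the character.
fragment, body = (0, 2, 12, 31), (14, 0, 30, 31)
merged = merge_if_radical(body, fragment, wave_cn=32, wave_en=16)
```

With these made-up values the center distance (16) lies outside the Chinese range, so the two fragments are merged into one bounding box.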
Following the above steps, all text line regions in the whole text image are processed in turn to obtain the final character region cutting result. Figure 23 shows Figure 22 after cutting according to S500.
Figure 24 shows the final character region cutting result obtained by applying the method of this embodiment to Figure 13. As can be seen, the character cutting method provided by the embodiment of the invention ensures the correctness of the character cutting result and solves the problems of adhesion between characters and of radicals being treated as independent characters.
A second embodiment of the invention provides a character cutting device. Referring to Figure 25, the device comprises a preliminary cutting unit 2501, an adhesion character cutting unit 2502, a radical identification unit 2503, and a character merging unit 2504.
The preliminary cutting unit 2501 is used to perform line cutting and column cutting on a text image to obtain a number of character cell image blocks;
The adhesion character cutting unit 2502 is used to identify the character cell image blocks that contain adhesion characters, and to further cut those character cell image blocks;
The radical identification unit 2503 is used to identify the Chinese character cell image block regions and the English character cell image block regions, and to identify, within the Chinese character cell image block regions, the character cell image blocks occupied by Chinese character radicals;
The character merging unit 2504 is used to merge the adjacent character cell image blocks occupied by Chinese character radicals into one character cell image block.
The adhesion character cutting unit 2502 is specifically used to determine that a character cell image block contains an adhesion Chinese character when the width of the block is greater than the average width of Chinese character cell image blocks and the difference between the height of the block and the average height of character cell image blocks is less than a preset threshold, or
to determine that a character cell image block contains an adhesion English character when the width of the block is greater than the average width of Chinese character cell image blocks and the difference between the height of the block and the average height of character cell image blocks is greater than the preset threshold.
The radical identification unit 2503 is specifically used to determine that a character cell image block contains a Chinese character when the height of the block is greater than the average height of character cell image blocks and its width is greater than 4/5 of the average width of Chinese character cell image blocks;
to take the preceding character as the current character when the distance between the Chinese character cell image block and the preceding character cell image block lies outside the distance range between adjacent Chinese and English character cell image blocks;
and to determine that the current character and the preceding character are radical characters when the distance between the centers of the current character cell image block and the preceding character cell image block lies outside the range of the distance between the centers of adjacent Chinese character cell image blocks.
Further, the character cutting device also comprises a punctuation identification unit 2505, used to determine that a character cell image block contains a punctuation mark when the width of the block is less than or equal to its height and the block lies entirely above or entirely below the center line of the text line, or
when the height of the block is less than the height of the text line, its width is less than 1/4 of the average width of Chinese character cell image blocks, and at least one of the distances between this block and the adjacent preceding or following character cell image block is greater than the upper limit of the distance range between adjacent Chinese and English character cell image blocks.
The embodiments of the invention ensure the correctness of the character cutting result, so that character cutting need not rely heavily on a character recognition feedback mechanism, which further improves the character recognition rate.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to include them.

Claims (10)

1. A character cutting method, characterized by comprising:
performing line cutting and column cutting on a text image to obtain a number of character cell image blocks;
identifying the character cell image blocks that contain adhesion characters, and further cutting said character cell image blocks that contain adhesion characters;
identifying Chinese character cell image block regions and English character cell image block regions, and identifying, within said Chinese character cell image block regions, the character cell image blocks occupied by Chinese character radicals;
merging the adjacent character cell image blocks occupied by Chinese character radicals into one character cell image block.
2, the method for claim 1 is characterized in that, also comprises: the character cell image block of identification punctuation mark.
3, method as claimed in claim 2 is characterized in that, the method for the character cell image block of described identification punctuation mark comprises:
When the width of character cell image block height smaller or equal to this character cell image block, and this character cell image block fully on the line of text position of center line or below the time, determine that the character cell image block comprises punctuation mark, perhaps
When the height of character cell image block height less than line of text, width is less than 1/4 of Chinese character cell picture piece mean breadth, and in the distance value between last character cell picture piece that this character cell image block is adjacent or the back one character cell image block, have at least a distance value to prescribe a time limit, determine that the character cell image block comprises punctuation mark greater than going up of distance range between the adjacent Chinese and English character cell picture piece.
4, the method for claim 1 is characterized in that, described adhesion character comprises the adhesion Chinese character, and the method that described identification comprises the character cell image block of adhesion Chinese character comprises:
When the width of character cell image block mean breadth greater than Chinese character cell picture piece, and the difference of the average height of the height of this character cell image block and character cell image block determines that the character cell image block comprises the adhesion Chinese character during less than preset threshold.
As claim 1 or 4 described methods, it is characterized in that 5, described adhesion character comprises the adhesion English character, the method that described identification comprises the character cell image block of adhesion English character comprises:
When the width of character cell image block mean breadth greater than Chinese character cell picture piece, and the difference of the average height of the height of this character cell image block and character cell image block determines that the character cell image block comprises the adhesion English character during greater than preset threshold.
6, the method for claim 1 is characterized in that, the method for the character cell image block that described identification Chinese character radical takies comprises:
When the height of the character cell image block average height greater than the character cell image block, width determines that greater than 4/5 o'clock of the mean breadth of Chinese character cell picture piece the character cell image block comprises Chinese character;
When the distance between described Chinese character cell picture piece and the last character cell picture piece is in outside the distance range between the adjacent Chinese and English character cell picture piece, with last character as current character;
When the distance between current character cell picture piece and the last character cell picture piece center is in outside the distance range between the adjacent Chinese characters character cell image block center, determine that described current character and last character are the radical character.
7. A character cutting device, characterized by comprising:
a preliminary cutting unit, used to perform line cutting and column cutting on a text image to obtain a number of character cell image blocks;
an adhesion character cutting unit, used to identify the character cell image blocks that contain adhesion characters, and to further cut said character cell image blocks that contain adhesion characters;
a radical identification unit, used to identify Chinese character cell image block regions and English character cell image block regions, and to identify, within said Chinese character cell image block regions, the character cell image blocks occupied by Chinese character radicals;
a character merging unit, used to merge the adjacent character cell image blocks occupied by Chinese character radicals into one character cell image block.
8. The character cutting device of claim 7, characterized in that said device further comprises a punctuation identification unit, used to determine that a character cell image block contains a punctuation mark when the width of the block is less than or equal to its height and the block lies entirely above or entirely below the center line of the text line, or
when the height of the block is less than the height of the text line, its width is less than 1/4 of the average width of Chinese character cell image blocks, and at least one of the distances between this block and the adjacent preceding or following character cell image block is greater than the upper limit of the distance range between adjacent Chinese and English character cell image blocks.
9. The character cutting device of claim 7 or 8, characterized in that said adhesion character cutting unit is specifically used to determine that a character cell image block contains an adhesion Chinese character when the width of the block is greater than the average width of Chinese character cell image blocks and the difference between the height of the block and the average height of character cell image blocks is less than a preset threshold, or
to determine that a character cell image block contains an adhesion English character when the width of the block is greater than the average width of Chinese character cell image blocks and the difference between the height of the block and the average height of character cell image blocks is greater than the preset threshold.
10. The character cutting device of claim 7 or 8, characterized in that said radical identification unit is specifically used to determine that a character cell image block contains a Chinese character when the height of the block is greater than the average height of character cell image blocks and its width is greater than 4/5 of the average width of Chinese character cell image blocks;
to take the preceding character as the current character when the distance between said Chinese character cell image block and the preceding character cell image block lies outside the distance range between adjacent Chinese and English character cell image blocks;
and to determine that said current character and the preceding character are radical characters when the distance between the centers of the current character cell image block and the preceding character cell image block lies outside the range of the distance between the centers of adjacent Chinese character cell image blocks.
CN2008101015916A 2008-03-07 2008-03-07 Method and apparatus for cutting character Expired - Fee Related CN101251892B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101015916A CN101251892B (en) 2008-03-07 2008-03-07 Method and apparatus for cutting character

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008101015916A CN101251892B (en) 2008-03-07 2008-03-07 Method and apparatus for cutting character

Publications (2)

Publication Number Publication Date
CN101251892A true CN101251892A (en) 2008-08-27
CN101251892B CN101251892B (en) 2010-06-09

Family

ID=39955276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101015916A Expired - Fee Related CN101251892B (en) 2008-03-07 2008-03-07 Method and apparatus for cutting character

Country Status (1)

Country Link
CN (1) CN101251892B (en)

Cited By (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101777124A (en) * 2010-01-29 2010-07-14 北京新岸线网络技术有限公司 Method for extracting video text message and device thereof
CN101984426A (en) * 2010-10-21 2011-03-09 优视科技有限公司 Method used for character splitting on webpage picture and device thereof
CN102169542A (en) * 2010-02-25 2011-08-31 汉王科技股份有限公司 Method and device for touching character segmentation in character recognition
CN102243621A (en) * 2010-05-11 2011-11-16 项洁 Typesetting method for image text file
CN102254157A (en) * 2011-07-07 2011-11-23 北京文通图像识别技术研究中心有限公司 Evaluating method for searching character segmenting position between two adjacent characters
CN102456136A (en) * 2010-10-29 2012-05-16 方正国际软件(北京)有限公司 Image-text splitting method and system
CN102496013A (en) * 2011-11-11 2012-06-13 苏州大学 Chinese character segmentation method for off-line handwritten Chinese character recognition
CN102511048A (en) * 2009-12-31 2012-06-20 塔塔咨询服务有限公司 Method and system for preprocessing the region of video containing text
CN102542269A (en) * 2010-12-24 2012-07-04 北大方正集团有限公司 Western language word segmenting method and device
CN102567938A (en) * 2010-12-23 2012-07-11 北大方正集团有限公司 Watermark image blocking method and device for western language watermark processing
CN102867178A (en) * 2011-07-05 2013-01-09 富士通株式会社 Method and device for Chinese character recognition
CN102870399A (en) * 2010-05-10 2013-01-09 微软公司 Segmentation of a word bitmap into individual characters or glyphs during an OCR process
CN102982328A (en) * 2011-08-03 2013-03-20 夏普株式会社 Character recognition apparatus and character recognition method
CN103020621A (en) * 2012-12-25 2013-04-03 深圳深讯和科技有限公司 Method and device for segmenting Chinese and English mixed typeset character images
CN103093224A (en) * 2011-11-08 2013-05-08 佳能株式会社 Method and device for determining average character width and method and equipment of character segmentation
CN103106405A (en) * 2011-11-09 2013-05-15 佳能株式会社 Line segmentation method and line segmentation system for document images
CN103106406A (en) * 2011-11-09 2013-05-15 佳能株式会社 Method and system for segmenting characters in text line with different character widths
CN103559172A (en) * 2013-11-06 2014-02-05 北京百度网讯科技有限公司 Phrasing method and device for multi-language mixed text
CN103854024A (en) * 2012-12-04 2014-06-11 百度国际科技(深圳)有限公司 Method and device for extracting characters in image
CN104112287A (en) * 2013-04-17 2014-10-22 北大方正集团有限公司 Method and device for segmenting characters in picture
CN104134064A (en) * 2013-05-02 2014-11-05 百度国际科技(深圳)有限公司 Character recognition method and device
CN104361312A (en) * 2014-10-16 2015-02-18 北京捷通华声语音技术有限公司 Device and method for optical character recognition of images
CN104915332A (en) * 2015-06-15 2015-09-16 广东欧珀移动通信有限公司 Method and device for generating composing template
CN105046254A (en) * 2015-07-17 2015-11-11 腾讯科技(深圳)有限公司 Character recognition method and apparatus
CN105354834A (en) * 2015-10-15 2016-02-24 广东欧珀移动通信有限公司 Method and apparatus for making statistics on number of paper text fonts
CN105373526A (en) * 2015-10-23 2016-03-02 北大方正集团有限公司 Blank region processing method and system for electronic document
CN105631450A (en) * 2015-12-28 2016-06-01 小米科技有限责任公司 Character identifying method and device
CN103093224B (en) * 2011-11-08 2016-12-14 佳能株式会社 Method and device for determining average character width and method and equipment of character segmentation
CN106339704A (en) * 2015-07-14 2017-01-18 富士通株式会社 Character recognition method and character recognition equipment
CN103839060B (en) * 2012-11-26 2017-03-01 阿里巴巴集团控股有限公司 Method and device for merging single character areas
CN106611175A (en) * 2016-12-29 2017-05-03 成都数联铭品科技有限公司 Automatic character and picture segmentation system for recognizing image characters
CN106682667A (en) * 2016-12-29 2017-05-17 成都数联铭品科技有限公司 Image-text OCR (optical character recognition) system for uncommon fonts
CN106778758A (en) * 2016-12-29 2017-05-31 成都数联铭品科技有限公司 Character segmentation method for image text recognition
WO2017118356A1 (en) * 2016-01-05 2017-07-13 腾讯科技(深圳)有限公司 Text image processing method and apparatus
CN107067005A (en) * 2017-04-10 2017-08-18 深圳爱拼信息科技有限公司 Method and device for Chinese-English mixed OCR character segmentation
CN107330430A (en) * 2017-06-27 2017-11-07 司马大大(北京)智能系统有限公司 Tibetan character recognition apparatus and method
CN103810486B (en) * 2014-02-13 2017-11-21 广东小天才科技有限公司 Method and device for processing characters
CN108229454A (en) * 2016-12-15 2018-06-29 北京新唐思创教育科技有限公司 Image cutting and labeling method and device
CN108446702A (en) * 2018-03-14 2018-08-24 深圳怡化电脑股份有限公司 Image character segmentation method, device, equipment and storage medium
CN108491845A (en) * 2018-03-02 2018-09-04 深圳怡化电脑股份有限公司 Character segmentation position determination method, character segmentation method, device and equipment
CN109871843A (en) * 2017-12-01 2019-06-11 北京搜狗科技发展有限公司 Character recognition method and device, and device for character recognition
CN110135425A (en) * 2018-02-09 2019-08-16 北京世纪好未来教育科技有限公司 Sample labeling method and computer storage medium
CN110163203A (en) * 2019-04-09 2019-08-23 浙江口碑网络技术有限公司 Character recognition method, device, storage medium and computer equipment
CN110210477A (en) * 2019-05-24 2019-09-06 四川阿泰因机器人智能装备有限公司 Digital instrument reading identification method
CN110378347A (en) * 2019-07-04 2019-10-25 北京爱医生智慧医疗科技有限公司 Method and device for extracting key information of medical examination sheet
CN111291794A (en) * 2020-01-21 2020-06-16 上海眼控科技股份有限公司 Character recognition method, character recognition device, computer equipment and computer-readable storage medium
CN112329548A (en) * 2020-10-16 2021-02-05 北京临近空间飞行器系统工程研究所 Document chapter segmentation method and device and storage medium
CN112016566B (en) * 2020-10-27 2021-03-16 恒银金融科技股份有限公司 Segmentation method for handwritten Chinese characters at financial bill upper-case money amount
CN112990178A (en) * 2021-04-13 2021-06-18 中国科学院大学 Text digital information embedding and extracting method and system based on character segmentation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100492403C (en) * 2001-09-27 2009-05-27 佳能株式会社 Character image line selecting method and device and character image identifying method and device

Cited By (75)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102511048B (en) * 2009-12-31 2015-08-26 塔塔咨询服务有限公司 Method and system for preprocessing the region of video containing text
CN102511048A (en) * 2009-12-31 2012-06-20 塔塔咨询服务有限公司 Method and system for preprocessing the region of video containing text
CN101777124A (en) * 2010-01-29 2010-07-14 北京新岸线网络技术有限公司 Method for extracting video text message and device thereof
CN102169542B (en) * 2010-02-25 2012-11-28 汉王科技股份有限公司 Method and device for touching character segmentation in character recognition
CN102169542A (en) * 2010-02-25 2011-08-31 汉王科技股份有限公司 Method and device for touching character segmentation in character recognition
CN102870399B (en) * 2010-05-10 2015-09-02 微软技术许可有限责任公司 Segmentation of a word bitmap into individual characters or glyphs during an OCR process
CN102870399A (en) * 2010-05-10 2013-01-09 微软公司 Segmentation of a word bitmap into individual characters or glyphs during an OCR process
CN102243621A (en) * 2010-05-11 2011-11-16 项洁 Typesetting method for image text file
WO2012051943A1 (en) * 2010-10-21 2012-04-26 优视科技有限公司 Method and device for segmenting characters in webpage images
CN101984426A (en) * 2010-10-21 2011-03-09 优视科技有限公司 Method used for character splitting on webpage picture and device thereof
CN101984426B (en) * 2010-10-21 2013-04-10 优视科技有限公司 Method used for character splitting on webpage picture and device thereof
CN102456136A (en) * 2010-10-29 2012-05-16 方正国际软件(北京)有限公司 Image-text splitting method and system
CN102456136B (en) * 2010-10-29 2013-06-05 方正国际软件(北京)有限公司 Image-text splitting method and system
CN102567938A (en) * 2010-12-23 2012-07-11 北大方正集团有限公司 Watermark image blocking method and device for western language watermark processing
CN102567938B (en) * 2010-12-23 2014-05-14 北大方正集团有限公司 Watermark image blocking method and device for western language watermark processing
CN102542269A (en) * 2010-12-24 2012-07-04 北大方正集团有限公司 Western language word segmenting method and device
CN102867178B (en) * 2011-07-05 2015-06-10 富士通株式会社 Method and device for Chinese character recognition
CN102867178A (en) * 2011-07-05 2013-01-09 富士通株式会社 Method and device for Chinese character recognition
CN102254157A (en) * 2011-07-07 2011-11-23 北京文通图像识别技术研究中心有限公司 Evaluating method for searching character segmenting position between two adjacent characters
CN102982328A (en) * 2011-08-03 2013-03-20 夏普株式会社 Character recognition apparatus and character recognition method
CN103093224A (en) * 2011-11-08 2013-05-08 佳能株式会社 Method and device for determining average character width and method and equipment of character segmentation
CN103093224B (en) * 2011-11-08 2016-12-14 佳能株式会社 Method and device for determining average character width and method and equipment of character segmentation
CN103106405A (en) * 2011-11-09 2013-05-15 佳能株式会社 Line segmentation method and line segmentation system for document images
CN103106406A (en) * 2011-11-09 2013-05-15 佳能株式会社 Method and system for segmenting characters in text line with different character widths
CN103106406B (en) * 2011-11-09 2016-10-05 佳能株式会社 Method and system for segmenting characters in text line with different character widths
CN103106405B (en) * 2011-11-09 2017-05-03 佳能株式会社 Line segmentation method and line segmentation system for document images
CN102496013A (en) * 2011-11-11 2012-06-13 苏州大学 Chinese character segmentation method for off-line handwritten Chinese character recognition
CN107122778B (en) * 2012-11-26 2020-06-23 阿里巴巴集团控股有限公司 Method and device for merging single character areas
CN107122778A (en) * 2012-11-26 2017-09-01 阿里巴巴集团控股有限公司 Method and device for merging single character areas
CN103839060B (en) * 2012-11-26 2017-03-01 阿里巴巴集团控股有限公司 Method and device for merging single character areas
CN103854024A (en) * 2012-12-04 2014-06-11 百度国际科技(深圳)有限公司 Method and device for extracting characters in image
CN103020621B (en) * 2012-12-25 2016-02-24 深圳深讯和科技有限公司 Method and device for segmenting Chinese and English mixed typeset character images
CN103020621A (en) * 2012-12-25 2013-04-03 深圳深讯和科技有限公司 Method and device for segmenting Chinese and English mixed typeset character images
CN104112287B (en) * 2013-04-17 2017-05-24 北大方正集团有限公司 Method and device for segmenting characters in picture
CN104112287A (en) * 2013-04-17 2014-10-22 北大方正集团有限公司 Method and device for segmenting characters in picture
CN104134064A (en) * 2013-05-02 2014-11-05 百度国际科技(深圳)有限公司 Character recognition method and device
CN103559172B (en) * 2013-11-06 2016-08-31 北京百度网讯科技有限公司 Phrasing method and device for multi-language mixed text
CN103559172A (en) * 2013-11-06 2014-02-05 北京百度网讯科技有限公司 Phrasing method and device for multi-language mixed text
CN103810486B (en) * 2014-02-13 2017-11-21 广东小天才科技有限公司 Method and device for processing characters
CN104361312B (en) * 2014-10-16 2017-11-14 北京捷通华声语音技术有限公司 Device and method for optical character recognition of images
CN104361312A (en) * 2014-10-16 2015-02-18 北京捷通华声语音技术有限公司 Device and method for optical character recognition of images
CN104915332B (en) * 2015-06-15 2017-09-15 广东欧珀移动通信有限公司 Method and device for generating composing template
CN104915332A (en) * 2015-06-15 2015-09-16 广东欧珀移动通信有限公司 Method and device for generating composing template
CN106339704A (en) * 2015-07-14 2017-01-18 富士通株式会社 Character recognition method and character recognition equipment
CN105046254A (en) * 2015-07-17 2015-11-11 腾讯科技(深圳)有限公司 Character recognition method and apparatus
CN105354834A (en) * 2015-10-15 2016-02-24 广东欧珀移动通信有限公司 Method and apparatus for making statistics on number of paper text fonts
CN105354834B (en) * 2015-10-15 2018-04-17 广东欧珀移动通信有限公司 Method and apparatus for making statistics on number of paper text fonts
CN105373526A (en) * 2015-10-23 2016-03-02 北大方正集团有限公司 Blank region processing method and system for electronic document
CN105373526B (en) * 2015-10-23 2019-02-15 北大方正集团有限公司 Blank region processing method and system for electronic document
CN105631450A (en) * 2015-12-28 2016-06-01 小米科技有限责任公司 Character identifying method and device
WO2017118356A1 (en) * 2016-01-05 2017-07-13 腾讯科技(深圳)有限公司 Text image processing method and apparatus
US10572728B2 (en) 2016-01-05 2020-02-25 Tencent Technology (Shenzhen) Company Limited Text image processing method and apparatus
CN108229454A (en) * 2016-12-15 2018-06-29 北京新唐思创教育科技有限公司 Image cutting and labeling method and device
CN106778758A (en) * 2016-12-29 2017-05-31 成都数联铭品科技有限公司 Character segmentation method for image text recognition
CN106682667A (en) * 2016-12-29 2017-05-17 成都数联铭品科技有限公司 Image-text OCR (optical character recognition) system for uncommon fonts
CN106611175A (en) * 2016-12-29 2017-05-03 成都数联铭品科技有限公司 Automatic character and picture segmentation system for recognizing image characters
CN107067005A (en) * 2017-04-10 2017-08-18 深圳爱拼信息科技有限公司 Method and device for Chinese-English mixed OCR character segmentation
CN107330430A (en) * 2017-06-27 2017-11-07 司马大大(北京)智能系统有限公司 Tibetan character recognition apparatus and method
CN109871843A (en) * 2017-12-01 2019-06-11 北京搜狗科技发展有限公司 Character recognition method and device, and device for character recognition
CN110135425A (en) * 2018-02-09 2019-08-16 北京世纪好未来教育科技有限公司 Sample labeling method and computer storage medium
CN108491845A (en) * 2018-03-02 2018-09-04 深圳怡化电脑股份有限公司 Character segmentation position determination method, character segmentation method, device and equipment
CN108491845B (en) * 2018-03-02 2022-05-31 深圳怡化电脑股份有限公司 Character segmentation position determination method, character segmentation method, device and equipment
CN108446702A (en) * 2018-03-14 2018-08-24 深圳怡化电脑股份有限公司 Image character segmentation method, device, equipment and storage medium
CN108446702B (en) * 2018-03-14 2022-05-31 深圳怡化电脑股份有限公司 Image character segmentation method, device, equipment and storage medium
CN110163203B (en) * 2019-04-09 2021-08-24 浙江口碑网络技术有限公司 Character recognition method, device, storage medium and computer equipment
CN110163203A (en) * 2019-04-09 2019-08-23 浙江口碑网络技术有限公司 Character recognition method, device, storage medium and computer equipment
CN110210477A (en) * 2019-05-24 2019-09-06 四川阿泰因机器人智能装备有限公司 Digital instrument reading identification method
CN110210477B (en) * 2019-05-24 2023-03-24 四川阿泰因机器人智能装备有限公司 Digital instrument reading identification method
CN110378347A (en) * 2019-07-04 2019-10-25 北京爱医生智慧医疗科技有限公司 Method and device for extracting key information of medical examination sheet
CN110378347B (en) * 2019-07-04 2021-10-08 北京爱医生智慧医疗科技有限公司 Method and device for extracting key information of medical examination sheet
CN111291794A (en) * 2020-01-21 2020-06-16 上海眼控科技股份有限公司 Character recognition method, character recognition device, computer equipment and computer-readable storage medium
CN112329548A (en) * 2020-10-16 2021-02-05 北京临近空间飞行器系统工程研究所 Document chapter segmentation method and device and storage medium
CN112016566B (en) * 2020-10-27 2021-03-16 恒银金融科技股份有限公司 Segmentation method for handwritten Chinese characters at financial bill upper-case money amount
CN112990178A (en) * 2021-04-13 2021-06-18 中国科学院大学 Text digital information embedding and extracting method and system based on character segmentation
CN112990178B (en) * 2021-04-13 2022-06-24 中国科学院大学 Text digital information embedding and extracting method and system based on character segmentation

Also Published As

Publication number Publication date
CN101251892B (en) 2010-06-09

Similar Documents

Publication Publication Date Title
CN101251892B (en) Method and apparatus for cutting character
CN111814722B (en) Method and device for identifying table in image, electronic equipment and storage medium
JP2951814B2 (en) Image extraction method
US6754385B2 (en) Ruled line extracting apparatus for extracting ruled line from normal document image and method thereof
EP1403813B1 (en) Image processing method, image processing apparatus and image processing program for dealing with inverted characters
Fan et al. Marginal noise removal of document images
CN102567300A (en) Picture document processing method and device
CN113537227B (en) Structured text recognition method and system
CN1312625C (en) Character extracting method from complicated background color image based on run-length adjacent map
US20120219220A1 (en) Method and system for preprocessing an image for optical character recognition
CN104966051A (en) Method of recognizing layout of document image
CN108717544B (en) Newspaper sample manuscript text automatic detection method based on intelligent image analysis
CN111626302A (en) Method and system for cutting adhered text lines of ancient book document images of Ujin Tibetan
JP3411472B2 (en) Pattern extraction device
CN110516674B (en) Handwritten Chinese character segmentation method and system for text image
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
KR20010015025A (en) Character extracting method
JPH1031716A (en) Method and device for extracting character line
JPH0950527A (en) Frame extracting device and rectangle extracting device
Jindal et al. Segmentation problems and solutions in printed Degraded Gurmukhi Script
CN108062548B (en) Braille square self-adaptive positioning method and system
Roy et al. Multi-oriented English text line extraction using background and foreground information
CN113408532A (en) Medicine label number identification method based on multi-feature extraction
Razak et al. A real-time line segmentation algorithm for an offline overlapped handwritten Jawi character recognition chip
JP4244692B2 (en) Character recognition device and character recognition program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 2022-06-21

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Peking University

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871, Haidian District Fangzheng Road, Beijing, Zhongguancun Fangzheng building, 298, 513

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 2010-06-09
