CN106682671A

CN106682671A - Image character recognition system

Info

Publication number: CN106682671A
Application number: CN201611254376.0A
Authority: CN
Inventors: 景亮; 康青杨; 唐涔轩; 刘世林
Original assignee: Chengdu Business Big Data Technology Co Ltd
Current assignee: Chengdu Business Big Data Technology Co Ltd
Priority date: 2016-12-29
Filing date: 2016-12-29
Publication date: 2017-05-17

Abstract

The invention relates to the field of image recognition processing, and in particular to an image character recognition system. The system comprises an image character segmentation module, a feature picture generation module, a storage module, a normalization module and an image character recognition module. The image character segmentation module segments the image to be processed into sub-pictures, each containing a single character, and stores them in the storage module. The feature picture generation module generates character feature pictures according to the typeface of the characters to be recognized and stores them in the storage module. The normalization module retrieves the feature pictures and the sub-pictures to be recognized from the storage module, normalizes them according to their types, and stores the processed picture information back in the storage module. The image character recognition module retrieves the sub-pictures from the storage module and uses an exclusive-OR (XOR) algorithm to calculate the degree of coincidence between each sub-picture and the feature pictures, thereby recognizing the character content of the sub-picture and outputting the recognition result.

Description

Image character recognition system

Technical field

The present invention relates to the field of image recognition, and more particularly to an image character recognition system.

Background technology

With the development of society and the progress of science and technology, the knowledge created by mankind is growing exponentially. Before electronic books appeared, most knowledge was passed on in books, and over five thousand years of Chinese history a large number of outstanding books have been produced. Over the long course of history these books have all suffered damage to varying degrees, so storing them digitally is an urgent task. In the field of library management, fast searching of book contents helps to locate books quickly. Because the number of books is so large, and because early printed books have no electronic manuscript from the author, converting paper books into electronic form is necessary.

Optical character recognition (OCR) is a powerful tool for converting paper books into electronic documents. It mainly uses a large number of character samples to train a complex network and generate a corresponding model file, thereby achieving the purpose of recognizing the characters in a picture.

The main function of optical character recognition is to recognize the characters in photographed or scanned pictures. In the prior art, to recognize the text in an image, the character string in the image must first be cut apart into small pictures each containing a single character, and the characters obtained are then recognized by some method. The most common character segmentation method is the projection method: after the text image is binarized, the dividing lines between adjacent characters are found by vertical projection, and the characters are cut apart along these lines. However, when characters in the image touch each other (adhesion), or when the image contains Chinese characters with a left-right structure, a simple projection method can hardly achieve a good segmentation. For this reason segmentation has always been a difficulty of OCR, and the quality of the segmentation directly affects the recognition result.
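
The projection method described above can be sketched as follows. This is only an illustrative sketch, assuming the text line has already been binarized into rows of 0/1 values with 1 for ink; the function names `column_projection` and `split_by_projection` are hypothetical and do not come from the patent.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels, 1 = ink (assumption)


def column_projection(binary: Bitmap) -> List[int]:
    """Count the ink pixels in every column of a binarized text line."""
    return [sum(column) for column in zip(*binary)]


def split_by_projection(binary: Bitmap) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, of runs that contain ink.

    Columns whose projection is 0 are treated as dividing lines between characters.
    """
    spans = []
    start = None
    for x, count in enumerate(column_projection(binary)):
        if count > 0 and start is None:
            start = x                      # a character run begins
        elif count == 0 and start is not None:
            spans.append((start, x))       # the run ends at an empty column
            start = None
    if start is not None:                  # run touching the right border
        spans.append((start, len(binary[0])))
    return spans


line = [
    [1, 1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 1, 1, 1],
]
print(split_by_projection(line))  # [(0, 2), (4, 7)]
```

Touching characters produce no empty column between them, which is exactly the case in which this simple approach fails.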

In addition, some scanned copies and photographs, such as early printed books and certificates or official seals made by government units, use special fonts for historical reasons or for confidentiality and security needs. Existing optical character recognition relies mainly on machine learning methods whose models are computationally expensive. Because the training font samples do not cover such special fonts, the recognition accuracy for them is low, which seriously hinders the digitization of paper documents.

Most of the prior art recognizes characters with neural-network machine learning algorithms, which requires producing a large number of samples and spending a great deal of time on training, and the generated model files are very large. Moreover, the recognition rate differs between fonts; for some special fonts the recognition rate is rather low, making it difficult to meet the character recognition needs of certain special scenarios.

Summary of the invention

The object of the present invention is to overcome the above deficiencies of the prior art by providing an image character recognition system. The system generates feature pictures according to the font selected by the user and, on the basis of effective segmentation of the image text to be recognized, automatically recognizes that text with the help of the targeted character feature pictures, providing a fast tool for image character recognition.

To achieve the above object, the invention provides the following technical solution: an image character recognition system whose image character recognition comprises the following steps:

(1) The image text to be recognized is segmented into sub-pictures each containing only a single character; the sub-pictures of digits, letters and punctuation marks, and the sub-pictures of text characters, are marked separately;

(2) One sub-picture is selected from the sub-pictures corresponding to each digit, letter and punctuation mark; the character in the sub-picture is moved up, down, left, right, upper-left, lower-left, upper-right and lower-right by a set distance l to produce the corresponding feature pictures, and the feature pictures produced are labeled accordingly;

The corresponding font is selected according to the image to be recognized, and sample pictures are generated; the character in each sample picture is moved up, down, left, right, upper-left, lower-left, upper-right and lower-right by the set distance l to produce the corresponding feature pictures, and the feature pictures produced are labeled accordingly;

(3) The feature pictures and the pictures to be recognized are normalized:

The feature pictures and the sub-pictures to be recognized are resized to the same dimensions, and the gray value (0-255) of each pixel in each picture is converted to 0 or 1 according to a set threshold; the converted pixel values are stored in the storage module by position;

(4) Each sub-picture to be recognized is compared with the feature pictures of the corresponding type: an XOR operation is performed on the values at the same pixel positions, the number of resulting 1s is counted and recorded as the error count, and the label of the feature picture with the smallest error count is output as the recognition result.

Specifically, in step (4) the system compares the digit, letter and punctuation sub-pictures to be recognized with the digit, letter and punctuation feature pictures: an XOR operation is performed on the values at the same pixel positions, the number of resulting 1s is counted and recorded as the error count, and the label of the feature picture with the smallest error count is output as the recognition result;

The text-character sub-pictures to be recognized are compared with the corresponding text-character feature pictures in the same way: an XOR operation is performed on the values at the same pixel positions, the number of resulting 1s is counted and recorded as the error count, and the label of the feature picture with the smallest error count is output as the recognition result.
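
The XOR comparison of step (4) can be illustrated with a short sketch. It assumes that the pictures have already been normalized to equally sized grids of 0/1 values; the names `error_count` and `recognize` and the tiny 3x3 templates are hypothetical and serve only as an example.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # normalized picture: rows of 0/1 values


def error_count(a: Bitmap, b: Bitmap) -> int:
    """Number of pixel positions whose XOR is 1, i.e. where the pictures differ."""
    return sum(pa ^ pb for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b))


def recognize(sub_picture: Bitmap, templates: List[Tuple[str, Bitmap]]) -> str:
    """Return the label of the feature picture with the smallest error count."""
    label, _ = min(templates, key=lambda item: error_count(sub_picture, item[1]))
    return label


templates = [
    ("1", [[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    ("7", [[1, 1, 1], [0, 0, 1], [0, 0, 1]]),
]
unknown = [[0, 1, 0], [0, 1, 0], [1, 1, 0]]  # a "1" with one stray pixel

print(error_count(unknown, templates[0][1]))  # 1
print(error_count(unknown, templates[1][1]))  # 7
print(recognize(unknown, templates))          # 1
```

If two feature pictures tie on the error count, this sketch returns the first one in the list; the source does not say how ties are resolved.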

Further, n*h ＜ l ＜ N*h.

Further, n≤1/4.

Further, the segmentation of text-character pictures comprises the following process:

The initial cut positions of the text-character picture are found using the projection method, and the picture to be recognized is cut into an initial sub-picture sequence according to these positions;

The initial sub-pictures in the sequence are processed using the following rules:

A. The image text to be recognized is segmented using the projection method into a sub-picture sequence, and the digits, letters and punctuation marks in it are marked;

B. Each unmarked sub-picture is checked against the condition L ≤ M*h, where L is the width of the character projection of the sub-picture, M is a coefficient and h is the row height;

Sub-pictures that do not satisfy the condition are cut, the cut position being determined by the following formula:

F(x) = g(x) t(x)

Step B is repeated until all unmarked sub-pictures in the sequence satisfy the condition L ≤ M*h;

C. The total width of each two adjacent sub-pictures in the sequence, other than digit, letter and punctuation sub-pictures, is checked against the condition L_Close ≤ M*h;

If the condition is satisfied, the adjacent sub-pictures satisfying it are merged in order;

Step C is repeated until no pair of adjacent sub-pictures, other than digits, letters and punctuation, has a total width satisfying L_Close ≤ M*h;

D. The unmarked sub-pictures in the sequence are checked: if there are three adjacent sub-pictures in the sequence such that the widths of the first and third sub-pictures satisfy L ≤ 0.5h and the width of the middle sub-picture satisfies L ≥ h, the middle sub-picture is cut at the cut point determined by the formula:

F(x) = g(x) t(x)

According to the cut point so determined, the middle sub-picture is cut into a first middle sub-picture and a second middle sub-picture;

The first sub-picture and the first middle sub-picture are merged;

The second middle sub-picture and the third sub-picture are merged.
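
The merging rule C can be sketched as below. This is an illustration under stated assumptions rather than the patented procedure itself: each sub-picture is represented by a hypothetical tuple (start column, end column, marked flag), and the total width L_Close of two neighbours is assumed to run from the start of the first to the end of the second, including any gap between them.

```python
from typing import List, Tuple

Span = Tuple[int, int, bool]  # (start column, end column, marked as digit/letter/punctuation)


def merge_narrow_neighbours(spans: List[Span], h: float, M: float = 1.2) -> List[Span]:
    """Merge adjacent unmarked sub-pictures while their total width is <= M*h."""
    result = list(spans)
    i = 0
    while i < len(result) - 1:
        (s1, _e1, m1), (_s2, e2, m2) = result[i], result[i + 1]
        if not m1 and not m2 and (e2 - s1) <= M * h:
            # Merge the pair and stay at i: the merged span may absorb the next one too.
            result[i:i + 2] = [(s1, e2, False)]
        else:
            i += 1
    return result


spans = [(0, 10, False), (12, 22, False), (26, 46, False), (50, 54, True)]
print(merge_narrow_neighbours(spans, h=20))
# [(0, 22, False), (26, 46, False), (50, 54, True)]
```

With h = 20 and M = 1.2 the threshold is 24 columns, so only the first two halves are merged; the marked punctuation sub-picture is never merged.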

Further, 0.9≤M≤1.3.

Preferably, M = 1.2.

Further, the system is a computer or server loaded with a program implementing the above image character recognition function.

Compared with the prior art, the beneficial effects of the invention are as follows. The invention provides an image character recognition system that constructs original feature pictures according to the font selected by the user and, starting from these, moves the character in each picture by a set distance in different directions to produce the corresponding feature templates. Feature templates made in this way adapt better to imperfect character segmentation and therefore offer better fault tolerance. On the basis of the feature pictures, the similarity between a sub-picture to be recognized and the feature templates is determined with the XOR algorithm; the calculation is simple, and recognition efficiency and reliability are high.

In addition, the invention judges the segmentation quality of the sub-pictures step by step after cutting and processes them accordingly; this layer-by-layer screening and processing ensures the segmentation quality of the sub-pictures and prepares the ground for the final recognition rate. Furthermore, compared with traditional segmentation methods, the system introduces a correction value into the amplitude, so that the distance between the cut position and the character edge is taken into account when determining the cut point, giving higher accuracy. When a character with a special structure produces several small values or extreme points, the formula quickly finds the best cut point, which increases the accuracy and efficiency of segmentation and gives better results on touching characters.

On the basis of the feature pictures and the segmented image characters, the XOR algorithm is used to determine the similarity between the sub-pictures to be recognized and the feature templates; the calculation is simple, and recognition efficiency and reliability are high.

Description of the drawings:

Fig. 1 is a structural diagram of the image character recognition system.

Fig. 2 is a schematic diagram of the implementation steps (signal flow) of the image character recognition of the system.

Fig. 3 is a schematic diagram of the making of a digit template.

Fig. 4 is a schematic diagram of the making of a text-character template.

Specific embodiment

The invention is described in further detail below with reference to test examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized on the basis of the content of the invention fall within its scope.

The invention provides an image character recognition system which, as shown in Fig. 1, comprises an image character segmentation module, a feature picture generation module, a storage module, a normalization module and an image character recognition module;

The image character segmentation module segments the characters in the image to be processed into sub-pictures each containing only a single character, and stores the resulting sub-picture sequence in the storage module;

The feature picture generation module produces the corresponding character feature pictures according to the font of the image text to be recognized, as selected by the user, and stores the feature pictures produced in the storage module;

The normalization module retrieves the feature pictures and the sub-pictures to be recognized from the storage module, normalizes them according to their types, and stores the processed picture information in the storage module;

The image character recognition module retrieves the sub-pictures from the storage module, uses the XOR algorithm to calculate the degree of matching between each sub-picture and the feature pictures, thereby recognizes the character content of the sub-picture, and outputs the recognition result.
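
The work of the normalization module (resizing to a common size, then converting gray values 0-255 to 0 or 1 with a threshold) might look like the following sketch. The source does not specify the resizing method, the threshold value, or which side of the threshold maps to 1, so nearest-neighbour resizing, a threshold of 128 and "dark pixel = 1" are all assumptions made here for illustration; the function names are hypothetical.

```python
from typing import List

Gray = List[List[int]]    # gray values 0-255
Bitmap = List[List[int]]  # values 0 or 1


def resize_nearest(gray: Gray, width: int, height: int) -> Gray:
    """Resize by nearest-neighbour sampling (assumed method, not from the source)."""
    src_h, src_w = len(gray), len(gray[0])
    return [
        [gray[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def binarize(gray: Gray, threshold: int = 128) -> Bitmap:
    """Map each gray value to 1 (darker than the threshold) or 0 (otherwise)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def normalize(gray: Gray, width: int = 4, height: int = 4, threshold: int = 128) -> Bitmap:
    """Bring a picture to the common size and to 0/1 pixel values."""
    return binarize(resize_nearest(gray, width, height), threshold)


picture = [[0, 255], [255, 0]]
print(normalize(picture))
# [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
```

Applying the same `normalize` call to feature pictures and sub-pictures guarantees that the later pixel-by-pixel comparison operates on grids of identical size.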

The character recognition of the system comprises the following steps, as shown in Fig. 2:

(1) The image text to be recognized is segmented into sub-pictures each containing only a single character; the sub-pictures of digits, letters and punctuation marks, and the sub-pictures of text characters, are marked separately (the marking in this step only records the type of the sub-picture; no specific recognition is performed). In practice, the image text to be recognized is first cut by the projection method into a sub-picture sequence, and the digits, letters and punctuation marks in it are marked. Such characters have distinctive features: the projection is narrow (for example, set to < 0.4h), the projected area is small (0.5h*0.8h), and the distance between the adjacent sub-pictures formed after cutting is markedly larger than that between ordinary text-character pictures. Using these features, the sub-pictures belonging to digits, letters and punctuation can be cut out first. Once these sub-pictures have been cut out and marked, the unmarked sub-pictures (text-character pictures) are cut into sub-pictures containing only a single character. Cutting the sub-pictures step by step in this way achieves a better segmentation result.

(2) One sub-picture is selected from the sub-pictures corresponding to each digit, letter and punctuation mark; the character in the sub-picture is moved up, down, left, right, upper-left, lower-left, upper-right and lower-right by a set distance l to produce the corresponding feature pictures, as shown in Fig. 3, and the feature pictures produced are labeled accordingly (labeling here means marking the character content corresponding to the feature picture; for example, the 9 feature pictures in Fig. 3 are labeled "8");

The corresponding font is selected according to the image to be recognized, and sample pictures are generated; the character in each sample picture is moved up, down, left, right, upper-left, lower-left, upper-right and lower-right by the set distance l to produce the corresponding feature pictures, and the feature pictures produced are labeled accordingly (labeling here means marking the character content corresponding to the feature picture; for example, the 9 feature pictures in Fig. 4 are labeled "word"). When the character in the template is moved by the set distance, the part of the character that falls outside the sub-picture frame is removed. The pictures formed by moving the set distance in the above directions, together with the original picture, constitute 9 reference sample pictures of the same character under different cutting conditions, as shown in Fig. 3. These correspond to the irregular and imperfect cutting of character pictures that may occur in practice, so character recognition based on feature templates formed in this way has better fault tolerance.

(3) The feature pictures and the pictures to be recognized are normalized;

(4) Each sub-picture to be recognized is compared with the feature pictures of the corresponding type, and an XOR operation is performed on the values at the same pixel positions (if the corresponding pixels of the feature picture and the picture to be recognized have the same value, the XOR result is 0; if they differ, the XOR result is 1). The number of 1s is counted and recorded as the error count, and the label of the feature picture with the smallest error count is output as the recognition result.

Specifically, in step (4) the digit, letter and punctuation sub-pictures to be recognized are compared with the digit, letter and punctuation feature pictures: an XOR operation is performed on the values at the same pixel positions, the number of resulting 1s is counted and recorded as the error count, and the label of the feature picture with the smallest error count is output as the recognition result;

The system uses the XOR algorithm to determine the similarity between the sub-pictures to be recognized and the feature templates; the calculation is simple, and recognition efficiency and reliability are high.
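
The making of the 9 feature pictures in step (2) can be sketched as follows: the original picture plus eight copies shifted by the set distance l, with any part of the character that leaves the frame discarded. The sketch assumes pictures stored as rows of 0/1 values; `shift` and `make_templates` are hypothetical names.

```python
from typing import List, Tuple

Bitmap = List[List[int]]  # rows of 0/1 pixels, 1 = ink (assumption)


def shift(bitmap: Bitmap, dx: int, dy: int) -> Bitmap:
    """Move the ink by (dx, dy) pixels; ink leaving the frame is dropped."""
    height, width = len(bitmap), len(bitmap[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            if bitmap[y][x] and 0 <= ny < height and 0 <= nx < width:
                out[ny][nx] = 1
    return out


def make_templates(bitmap: Bitmap, label: str, l: int) -> List[Tuple[str, Bitmap]]:
    """Return 9 labeled feature pictures: the original and its 8 shifted copies."""
    offsets = [(dx, dy) for dy in (-l, 0, l) for dx in (-l, 0, l)]
    return [(label, shift(bitmap, dx, dy)) for dx, dy in offsets]


glyph = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
templates = make_templates(glyph, "8", 1)
print(len(templates))      # 9
print(shift(glyph, 1, 0))  # [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
```

A negative `dy` moves the character up and a negative `dx` moves it left, following the usual image convention of row 0 at the top.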

The initial sub-pictures in the sequence are processed using the following rules:

F(x) = g(x) t(x)

Step B is repeated until all unmarked sub-pictures in the sequence satisfy the condition L ≤ M*h.

In the formula, F(x) is the amplitude, x is the coordinate of the projection point along the row direction, h is the row height of the current character, g(x) is the correction value and t(x) is the projection value at x. The two factors together determine the amplitude of the projection point, and the point where the amplitude is smallest is taken as the cut point between two characters. Compared with simply taking the minimum projection value, the cut point found by this method, corrected by g(x), takes the distance between the cut position and the character edge into account and is therefore more accurate. When a character with a special structure produces several small values or extreme points, the formula quickly finds the best cut point, which increases the accuracy and efficiency of segmentation.
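
The effect of the correction value can be seen in a small sketch that picks the cut point as the minimum of F(x) = g(x) t(x). The form of g(x) used here, 1 / (1 + e^(-0.01 |x - h|)), is the one stated in claim 5 for rule B; claim 5 gives the same form with 0.5h in place of h for rule D, which the hypothetical parameter `centre_factor` models. The function names and the sample projection are assumptions for illustration only.

```python
import math
from typing import Sequence


def correction(x: int, h: float, centre_factor: float = 1.0) -> float:
    """g(x) = 1 / (1 + exp(-0.01 * |x - centre_factor * h|)); smallest at x = centre_factor * h."""
    return 1.0 / (1.0 + math.exp(-0.01 * abs(x - centre_factor * h)))


def find_cut_point(projection: Sequence[int], h: float, centre_factor: float = 1.0) -> int:
    """Return the column x that minimizes F(x) = g(x) * t(x)."""
    return min(
        range(len(projection)),
        key=lambda x: correction(x, h, centre_factor) * projection[x],
    )


# Two columns (x = 2 and x = 5) share the same minimal projection value 2.
t = [5, 6, 2, 7, 8, 2, 6, 5]
print(find_cut_point(t, h=5))                     # 5
print(find_cut_point(t, h=5, centre_factor=0.5))  # 2
```

A plain minimum of the projection cannot choose between the two candidate columns, whereas the corrected amplitude prefers the one closer to the expected character boundary.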

F(x) = g(x) t(x)

The first sub-picture and the first middle sub-picture are merged;

The second middle sub-picture and the third sub-picture are merged.

In some cases two consecutive characters with a left-right structure touch in the middle. When such a picture is cut by the projection method, the outer radicals of the preceding and following characters may be cut off, while the touching radicals between the two characters are not separated and are treated as a single cut-out character. The system handles this case well: the best cut point in the touching middle part is found by the above formula, and the radicals of the preceding and following characters obtained after cutting are recombined, achieving a good segmentation result.
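
The regrouping of three adjacent sub-pictures (rule D) can be sketched as follows. The sketch assumes each sub-picture is a hypothetical (start column, end column) pair and that the cut position inside the middle sub-picture has already been found by minimizing F(x); it is passed in as `cut_offset`, a column offset measured from the left edge of the middle sub-picture.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start column, end column), end exclusive


def width(span: Span) -> int:
    return span[1] - span[0]


def regroup_triple(first: Span, middle: Span, third: Span,
                   h: float, cut_offset: int) -> List[Span]:
    """Apply rule D to three adjacent unmarked sub-pictures.

    If the outer two are narrow (<= 0.5h) and the middle one is wide (>= h),
    the middle one is cut and its halves are merged with their neighbours.
    Otherwise the three sub-pictures are returned unchanged.
    """
    if width(first) <= 0.5 * h and width(third) <= 0.5 * h and width(middle) >= h:
        cut = middle[0] + cut_offset
        return [(first[0], cut), (cut, third[1])]
    return [first, middle, third]


# Left radical | two touching radicals | right radical, with row height h = 20.
print(regroup_triple((0, 8), (9, 31), (32, 40), h=20, cut_offset=11))
# [(0, 20), (20, 40)]
```

Three fragments thus become two sub-pictures, each intended to hold one complete left-right character.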

The above rules are applied cyclically in order; through continued iteration, sub-pictures each containing only a single character are finally formed, and this good segmentation prepares the ground for image character recognition.

Further, 0.9 ≤ M ≤ 1.3. Setting the sub-picture width threshold within this range yields better segmentation and recognition results.

Preferably, M = 1.2. Repeated experimental verification shows that setting M to 1.2 yields a better segmentation result.

Claims

1. An image character recognition system, characterized in that the system performs image character recognition through the following steps:

(1) The image text to be recognized is segmented into sub-pictures each containing only a single character; the sub-pictures of digits, letters and punctuation marks, and the sub-pictures of text characters, are marked separately;

(2) One sub-picture is selected from the sub-pictures corresponding to each digit, letter and punctuation mark; the character in the sub-picture is moved up, down, left, right, upper-left, lower-left, upper-right and lower-right by a set distance l to produce the corresponding feature pictures, and the feature pictures produced are labeled accordingly;

The corresponding font is selected according to the image to be recognized, and sample pictures are generated; the character in each sample picture is moved up, down, left, right, upper-left, lower-left, upper-right and lower-right by the set distance l to produce the corresponding feature pictures, and the feature pictures produced are labeled accordingly;

(3) The feature pictures and the pictures to be recognized are normalized, and the pixel values of each picture are stored in the storage module by position;

(4) Each sub-picture to be recognized is compared with the feature pictures of the corresponding type: an XOR operation is performed on the values at the same pixel positions, and the number of resulting 1s is counted and recorded as the error count; the label of the feature picture with the smallest error count is output as the recognition result.

2. The system as claimed in claim 1, characterized in that n*h < l < N*h.

3. The system as claimed in claim 2, characterized in that n ≤ 1/4.

4. The system as claimed in claim 1, characterized in that the normalization performed by the system comprises: resizing the feature pictures and the sub-pictures to be recognized to the same dimensions;

converting the gray value of each pixel in each picture to 0 or 1 according to a set threshold, and storing the converted pixel values in the storage module by position.

5. The system as claimed in any one of claims 1 to 4, characterized in that the segmentation of text-character pictures comprises the following process:

A. The digits, letters and punctuation marks in the picture sequence are marked;

B. Each unmarked sub-picture is checked against the condition L ≤ M*h, where L is the width of the character projection of the sub-picture, M is a coefficient and h is the row height;

F(x) = g(x) t(x)

g(x) = \frac{1}{1 + e^{-0.01 |x - h|}}

C. The total width of each two adjacent sub-pictures in the sequence, other than digit, letter and punctuation sub-pictures, is checked against the condition L_Close ≤ M*h;

Step C is repeated until no pair of adjacent sub-pictures, other than digits, letters and punctuation, has a total width satisfying L_Close ≤ M*h;

D. The unmarked sub-pictures in the sequence are checked: if there are three adjacent sub-pictures in the sequence such that the widths of the first and third sub-pictures satisfy L ≤ 0.5h and the width of the middle sub-picture satisfies L ≥ h, the middle sub-picture is cut at the cut point determined by the formula:

F(x) = g(x) t(x)

g(x) = \frac{1}{1 + e^{-0.01 |x - 0.5h|}}

According to the cut point so determined, the middle sub-picture is cut into a first middle sub-picture and a second middle sub-picture;

The first sub-picture and the first middle sub-picture are merged;

The second middle sub-picture and the third sub-picture are merged.

6. The system as claimed in claim 5, characterized in that 0.9 ≤ M ≤ 1.3.

7. The system as claimed in claim 6, characterized in that M = 1.2.

8. The system as claimed in claim 7, characterized in that the system is a computer or server loaded with a program implementing the above image character recognition function.