CN105447522A

CN105447522A - Complex image character identification system

Info

Publication number: CN105447522A
Application number: CN201510831651.XA
Authority: CN
Inventors: 陈炳章; 何宏靖; 刘世林; 吴雨浓
Original assignee: Chengdu Business Big Data Technology Co Ltd
Current assignee: Chengdu Business Big Data Technology Co Ltd
Priority date: 2015-11-25
Filing date: 2015-11-25
Publication date: 2016-03-30

Abstract

The invention relates to the field of image recognition, and in particular to a complex image character identification system. Targeting the structural characteristics of complex text images such as table images, the invention decomposes the complex structure layer by layer: it first detects and segments the table cells, then segments and recognizes the content of each cell, which simplifies the recognition task and improves recognition accuracy. The invention uses a Bayes classifier to optimize the character block cut positions found by the projection method, which prevents a Chinese character with a left-right structure from being split in two at the gap between its left and right components, so that character blocks are segmented completely and accurately. The invention also uses a neural network classifier to recognize the segmented characters automatically, giving high recognition efficiency and accurate results. In short, the invention recognizes characters in complex images quickly, has a complete system structure and high recognition efficiency, and has broad application prospects in information mining and information analysis.

Description

A complex image character identification system

Technical field

The present invention relates to the field of image recognition, and in particular to a complex image character identification system.

Background technology

Image recognition is an important direction in intelligent recognition technology. Its development has gone through three stages: text recognition, digital image processing and recognition, and object recognition. Among the many image recognition technologies, recognition of text in images is particularly important, because text in an image often carries more usable information than a plain image. Text recognition generally covers letters, digits and symbols, ranges from printed text to handwriting, and is widely applied. Mathematically, image recognition is a mapping from a pattern space to a class space. Three main families of methods are currently used: statistical pattern recognition, structural pattern recognition and fuzzy pattern recognition.

As image text recognition has developed, more and more kinds of image text data have come within its scope. Among them, the table is the most condensed way of recording textual data, the most common format for presenting data statistics and analysis results, and a basic tool in data analysis software. Tables are used throughout all industries, so the importance of recognizing text in table images is clear. A company's annual financial report, for example, may contain the most important statistics and analysis results for the year, and no other material matches that information in importance or generality. Tables are everywhere in online information, but many are provided only as pictures, for example in scanned files and PDF documents. Automatically recognizing these table images and restoring their content to digital data is the basis for processing and analyzing the data quickly.

Tables have complex structure and rich content, and table images demand higher recognition accuracy than ordinary document images. Because of the complex structure of the table itself, however, recognizing text in a table image is harder than recognizing text in an ordinary image. In the prior art, recognizing text in an image first requires splitting the character string into small pictures that each contain a single character, and then recognizing each segmented character by some method. The most common segmentation method is the projection method: after the text image is binarized, the boundary between two characters is found by vertical projection, and the characters are split at that boundary. Because a table contains frame lines, applying the traditional projection method directly for segmentation and recognition is impractical. In addition, the structure of a table varies widely, and the resulting variety of frame lines makes it difficult for the projection method to segment and recognize the text in a table.

Summary of the invention

The object of the invention is to overcome the above deficiencies of the prior art by providing a complex image character identification system that detects the cells of a structurally complex table image quickly and accurately and, on that basis, quickly segments and recognizes the text in the table image. The system first detects the cell contours in the table image to be recognized and cuts out the text of each cell using the four corner vertex coordinates of its contour. The character segmentation module then uses the projection method to find the cut positions of the character blocks in the cell content, uses a Bayes classifier to optimize those positions, and cuts out the character blocks in turn. The character blocks are input into the text recognition module, which recognizes their content. Targeting the structural characteristics of complex text images such as table images, the system decomposes the complex structure layer by layer, from detecting and cutting out cells to segmenting and recognizing cell content. This simplifies the recognition task, so complex image text is recognized quickly; the system structure is complete and recognition efficiency is high.

To achieve the above object, the invention provides the following technical solution:

A complex image character identification system comprises an image input module, a cell detection module, a cell cutting module, a character segmentation module and a text recognition module.

The image input module inputs the table image to be recognized into the cell detection module; the cell detection module detects the cell contours in the table image and passes them to the cell cutting module.

The cell cutting module cuts out the content of each cell in the table image to form a corresponding sub-picture.

The character segmentation module splits each input sub-picture into character blocks to be recognized.

The character blocks are input into the text recognition module, which recognizes their content.

Specifically, in this system the cell detection module uses the findContours function of the OpenCV image processing library to detect the cell contours of the table image.

By calling the minAreaRect function, the cell cutting module constructs the minimum-area rotated rectangle enclosing the point set of a cell contour and extracts the coordinates of its four corner vertices. Using these corner coordinates, it cuts out the text in the corresponding cell as a whole to form the corresponding sub-picture.

Further, the character segmentation module comprises a candidate cut point detection module and a Bayes classifier module. The candidate cut point detection module is connected to the Bayes classifier module, and the Bayes classifier's categories are: character component, digit, letter and punctuation mark.

The candidate cut point detection module uses the projection method to detect the candidate cut positions on the left and right sides of each character block in the sub-picture and inputs the result into the Bayes classifier module. The Bayes classifier module determines the category of the content between the left and right candidate cut points of each character block, and cuts out two adjacent left and right blocks that are both character components as a single character block.

The candidate cut point detection module projects the sub-picture horizontally, using the formula $\mathrm{projection\_y}[i] = \sum_{j=0}^{n} \mathbf{1}[\mathrm{pix}(i,j) \neq 0]$ to count the non-zero pixels in each row and storing the value in projection_y, where i is the row index, j is the column index, pix(i, j) is the corresponding pixel value, and n is the coordinate of the last column. The elements of projection_y are then traversed. If projection_y[k] = 0, projection_y[k+1] > 0 and projection_y[k+2] > 0, then k is judged to be the starting cut point of a text line and is stored in the container vector<int> top. If projection_y[k] = 0, projection_y[k-1] > 0 and projection_y[k-2] > 0, then k is judged to be the ending cut point of a text line and is stored in the container vector<int> bottom. The text lines in the cell are cut out according to the values in top and bottom.
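The row projection and the two cut point rules above can be sketched in a few lines. This is a minimal illustration in Python rather than the C++ suggested by the vector<int> containers; it assumes the binarized sub-picture is a list of pixel rows, and the helper names row_projection and find_cut_points are hypothetical.

```python
def row_projection(img):
    """Count the non-zero pixels in each row of a binarized image."""
    return [sum(1 for p in row if p != 0) for row in img]


def find_cut_points(proj):
    """Apply the start/end rules from the text to a projection profile.

    k is a start cut when proj[k] == 0 and the next two entries are > 0;
    k is an end cut when proj[k] == 0 and the previous two entries are > 0.
    """
    starts, ends = [], []
    for k in range(len(proj)):
        if proj[k] != 0:
            continue
        if k + 2 < len(proj) and proj[k + 1] > 0 and proj[k + 2] > 0:
            starts.append(k)
        if k - 2 >= 0 and proj[k - 1] > 0 and proj[k - 2] > 0:
            ends.append(k)
    return starts, ends


# Toy 7x4 cell with two text lines (rows 1-2 and rows 4-5).
cell = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
projection_y = row_projection(cell)
top, bottom = find_cut_points(projection_y)
print(projection_y, top, bottom)
```

Note that a blank row between two text lines satisfies both rules, so it appears in both top and bottom. Because each rule requires two consecutive non-empty rows, a text line only one pixel high would not be detected by this sketch.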

The candidate cut point detection module then projects the image of each cut-out text line vertically, using the formula $\mathrm{projection\_x}[j] = \sum_{i=0}^{m} \mathbf{1}[\mathrm{pix}(i,j) \neq 0]$ to count the non-zero pixels in each column and storing the value in projection_x, where j is the column index, i is the row index, pix(i, j) is the corresponding pixel value, and m is the coordinate of the last row. The elements of projection_x are then traversed. If projection_x[k] = 0, projection_x[k+1] > 0 and projection_x[k+2] > 0, then k is judged to be the starting cut point of a character block and is stored in the container vector<int> left. If projection_x[k] = 0, projection_x[k-1] > 0 and projection_x[k-2] > 0, then k is judged to be the ending cut point of a character block and is stored in the container vector<int> right. The values in left and right are the candidate cut positions of single characters.

Given the coordinate of the first left candidate cut point, the Bayes classifier determines whether the content between the first left candidate cut point and the first right candidate cut point is a character component. If it is not, the content is taken to be a digit, letter or punctuation mark and is cut out directly.

If it is a character component, the classifier goes on to determine whether the content between the second left candidate cut point and the second right candidate cut point is also a character component; if so, the content between the first left candidate cut point and the second right candidate cut point is cut out as one block.
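The decision procedure above amounts to a single left-to-right pass over the cut point pairs. The sketch below is a minimal illustration, not the patented implementation: merge_components and the classify callback are hypothetical names, and the classifier is replaced by a lookup table. The text does not say what happens when a component is followed by a non-component, so the sketch assumes the block is then cut out on its own.

```python
def merge_components(intervals, classify):
    """Merge two adjacent character components into one character block.

    intervals: (left, right) cut point pairs, ordered left to right.
    classify:  callable returning "component", "digit", "letter"
               or "punctuation" for an interval.
    """
    blocks = []
    i = 0
    while i < len(intervals):
        left, right = intervals[i]
        is_pair = (
            classify(intervals[i]) == "component"
            and i + 1 < len(intervals)
            and classify(intervals[i + 1]) == "component"
        )
        if is_pair:
            # Cut from the first left point to the second right point.
            blocks.append((left, intervals[i + 1][1]))
            i += 2
        else:
            # Assumption: anything else is cut out as a single block.
            blocks.append((left, right))
            i += 1
    return blocks


# Stand-in for the Bayes classifier: a fixed label per interval.
labels = {
    (0, 5): "component",
    (6, 11): "component",
    (13, 17): "digit",
    (19, 21): "punctuation",
}
blocks = merge_components(list(labels), lambda iv: labels[iv])
print(blocks)
```

Passing the classifier in as a callback keeps the merging rule separate from the way the categories are computed.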

Further, the Bayes classifier performs classification as follows:

Feature values are extracted from the content to be recognized;

The feature values are input into the Bayes classifier, which calculates the probability that each feature value belongs to each category;

For each category, the probabilities that the individual feature values of the content belong to that category are multiplied together to give the probability that the content belongs to that category;

The category with the highest probability is selected as the category of the content.

The probability that a feature belongs to a given category is calculated as p = (w × ni + 1)/(w × nj + q), where w is the total number of feature vector samples across the three categories of the Bayes classifier used; ni is the number of times this feature of the object occurs in a given category; nj is the total number of feature vectors in that category; and q is an empirical value.
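The classification steps and the probability formula above can be sketched as follows. The formula p = (w × ni + 1)/(w × nj + q) is taken from the text; everything else is an assumption made for illustration, including the discretized feature labels ("narrow", "sparse"), the count tables, and the values chosen for w and q.

```python
def feature_probability(n_i, n_j, w, q):
    """p = (w * ni + 1) / (w * nj + q), as given in the text."""
    return (w * n_i + 1) / (w * n_j + q)


def classify(features, counts, totals, w, q):
    """Multiply the per-feature probabilities and pick the best category.

    counts[c][f] is how often feature value f occurs in category c;
    totals[c] is the number of feature vectors in category c.
    """
    scores = {}
    for category in totals:
        p = 1.0
        for f in features:
            n_i = counts[category].get(f, 0)
            p *= feature_probability(n_i, totals[category], w, q)
        scores[category] = p
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical training statistics for two of the categories.
counts = {
    "component": {"narrow": 4, "sparse": 3},
    "digit": {"narrow": 3, "dense": 5, "sparse": 1},
}
totals = {"component": 4, "digit": 6}

best, scores = classify(["narrow", "sparse"], counts, totals, w=10, q=2)
print(best, scores)
```

The +1 in the numerator keeps a feature value that was never seen in a category from forcing the whole product to zero.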

Further, the text recognition module is a neural network classifier or an OCR-based recognition system.

Compared with the prior art, the invention has the following beneficial effects. The invention provides a complex image character identification system that, for complex text such as table images, first detects and cuts out the cells and then segments the text within each cell, thereby decomposing the complex text structure layer by layer. The system detects the cells of the table image through the cell detection module and cuts out the content of each cell as a whole to form a corresponding sub-picture. The character segmentation module uses the projection method to find the cut points of the character blocks and cuts the blocks out at those points. To improve segmentation accuracy, the system uses a Bayes classifier to optimize the candidate cut points found by the projection method. This avoids splitting a Chinese character with a left-right structure in two at the gap between its left and right components, and it cuts out digits, letters and punctuation marks separately. Such accurate, category-aware segmentation lays the basis for accurate recognition of the character block content. On that basis a neural network classifier recognizes the content of the character blocks accurately and quickly. The system offers an efficient and reliable platform for large-scale batch processing of complex image text and has broad application prospects in image text recognition, information mining and information analysis.

Brief description of the drawings:

Fig. 1 is a schematic diagram of the system architecture of the complex image character identification system.

Fig. 2 is a schematic diagram of a binarized table image to be recognized.

Fig. 3 is a schematic diagram of the cell contours extracted by the findContours function in OpenCV.

Fig. 4 is a schematic diagram of the cutting range of a cell picture in Fig. 3.

Fig. 5 is a schematic diagram of the cell picture cut out according to the cutting range determined in Fig. 4.

Fig. 6 is a schematic diagram of the candidate cut points found by the projection method.

Fig. 7 is a schematic diagram of the decision process by which the Bayes classifier optimizes the candidate cut points of Fig. 6.

Fig. 8 is a schematic diagram of the specific structure of the system.

It should be noted that all drawings of the invention are schematic and do not represent actual sizes or proportions.

Embodiment

The invention is described in further detail below with reference to test examples and specific embodiments. This should not be understood as limiting the scope of the above subject matter of the invention to the following embodiments; all techniques realized on the basis of the content of the invention fall within its scope.

The invention provides a complex image character identification system that detects the cells of a structurally complex table image quickly and accurately and, on that basis, quickly segments and recognizes the text in the table image. The system first detects the cell contours in the table image to be recognized and cuts out the text of each cell using the four corner vertex coordinates of its contour. The character segmentation module then uses the projection method to find the cut positions of the character blocks in the cell content and cuts out the character blocks in turn. The character blocks are input into the text recognition module, which recognizes their content. Targeting the structural characteristics of complex text images such as table images, the system decomposes the complex structure layer by layer, from detecting and cutting out cells to segmenting and recognizing cell content. This simplifies the recognition task, so complex image text is recognized quickly; the system structure is complete and recognition efficiency is high.

A complex image character identification system, as shown in Fig. 1, comprises an image input module, a cell detection module, a cell cutting module, a character segmentation module and a text recognition module.

Specifically, in this system the image input module may be an image acquisition or storage device such as a scanner or digital camera. The image input module inputs the large volume of collected image text into the system for recognition.

After binarizing the table image to be recognized, the cell detection module uses the findContours function of the OpenCV image processing library to detect the cell contours of the table image. The point set of each detected cell contour is extracted and stored in a corresponding point container (denoted, for example, Vector1, Vector2, Vector3, ...). OpenCV is an efficient image processing library that contains many simple and efficient image processing functions. Its findContours function detects the cell contours from the frame lines of the table and extracts the contour point sets with high efficiency. Fig. 2 shows an example of a table image to be processed. The image input module inputs the picture into the cell detection module, which binarizes the image and then calls findContours; the resulting contour lines of the table cells are shown in Fig. 3.

The cell cutting module calls the minAreaRect function (also an OpenCV function) to construct the minimum-area rotated rectangle enclosing the point set of a cell contour and extracts the coordinates of its four corner vertices. Using these corner coordinates, it cuts out the text in the corresponding cell as a whole to form the corresponding sub-picture; the cell picture to be cut out is shown in Fig. 4. Before the cell content is cut out, the frame lines of the cell should be excluded. To do this, the cell cutting module moves each corner vertex a certain distance toward the inside of the rectangle, which defines a new region. Cutting the cell content to the extent of this new region removes the cell frame and leaves a sub-picture containing only the text inside the cell. Fig. 5 shows the sub-picture obtained by cutting the cell shown in Fig. 4; the details of the procedure are not repeated here.

Further, the character segmentation module comprises a candidate cut point detection module and a Bayes classifier module. The candidate cut point detection module is connected to the Bayes classifier module, and the Bayes classifier's categories are: character component, digit, letter and punctuation mark.
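The frame-removal step performed by the cell cutting module, in which the corner vertices are moved inward before the cell is cut out, can be sketched as follows. This is a simplified illustration that assumes an axis-aligned cell, whereas minAreaRect can return a rotated rectangle; the helper names inset_box and crop and the margin value are hypothetical.

```python
def inset_box(corners, margin):
    """Shrink the bounding box of four corner vertices by `margin` pixels.

    corners: four (x, y) vertices of the cell rectangle.
    Returns inclusive bounds (x0, y0, x1, y1) of the new region.
    """
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    x0, x1 = min(xs) + margin, max(xs) - margin
    y0, y1 = min(ys) + margin, max(ys) - margin
    if x0 > x1 or y0 > y1:
        raise ValueError("margin too large for this cell")
    return x0, y0, x1, y1


def crop(img, box):
    """Cut the inclusive region `box` out of an image stored as rows."""
    x0, y0, x1, y1 = box
    return [row[x0:x1 + 1] for row in img[y0:y1 + 1]]


# Toy 6x6 cell: a one-pixel frame of 1s around some text pixels.
cell = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]
corners = [(0, 0), (5, 0), (5, 5), (0, 5)]
box = inset_box(corners, margin=1)
sub_picture = crop(cell, box)
print(box, sub_picture)
```

In practice the margin would need to be at least the frame line thickness, which the text leaves unspecified.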

The candidate cut point detection module uses the projection method to detect the candidate cut positions on the left and right sides of each character block in the sub-picture and inputs the result into the Bayes classifier module. It first projects the sub-picture horizontally, using the formula $\mathrm{projection\_y}[i] = \sum_{j=0}^{n} \mathbf{1}[\mathrm{pix}(i,j) \neq 0]$ to count the non-zero pixels in each row and storing the value in projection_y, where i is the row index, j is the column index, pix(i, j) is the corresponding pixel value, and n is the coordinate of the last column. The elements of projection_y are then traversed. If projection_y[k] = 0, projection_y[k+1] > 0 and projection_y[k+2] > 0, then k is judged to be the starting cut point of a text line and is stored in the container vector<int> top. If projection_y[k] = 0, projection_y[k-1] > 0 and projection_y[k-2] > 0, then k is judged to be the ending cut point of a text line and is stored in the container vector<int> bottom. The text lines in the cell are cut out according to the values in top and bottom.

The candidate cut point detection module then projects the image of each cut-out text line vertically, using the formula $\mathrm{projection\_x}[j] = \sum_{i=0}^{m} \mathbf{1}[\mathrm{pix}(i,j) \neq 0]$ to count the non-zero pixels in each column and storing the value in projection_x, where j is the column index, i is the row index, pix(i, j) is the corresponding pixel value, and m is the coordinate of the last row. The elements of projection_x are then traversed. If projection_x[k] = 0, projection_x[k+1] > 0 and projection_x[k+2] > 0, then k is judged to be the starting cut point of a character block and is stored in the container vector<int> left. If projection_x[k] = 0, projection_x[k-1] > 0 and projection_x[k-2] > 0, then k is judged to be the ending cut point of a character block and is stored in the container vector<int> right. The values in left and right are the candidate cut positions of single characters.
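For the column direction, the left and right cut points still have to be paired into intervals before any block can be cut out. The sketch below shows one way to do this; the pairing rule (each left point is matched with the nearest right point after it) is an assumption, since the text only describes how the two lists are filled, and the function names are hypothetical.

```python
def column_projection(strip):
    """Count the non-zero pixels in each column of a text line image."""
    return [sum(1 for p in col if p != 0) for col in zip(*strip)]


def candidate_intervals(proj):
    """Pair each left cut point with the nearest right cut point after it."""
    n = len(proj)
    left = [k for k in range(n - 2)
            if proj[k] == 0 and proj[k + 1] > 0 and proj[k + 2] > 0]
    right = [k for k in range(2, n)
             if proj[k] == 0 and proj[k - 1] > 0 and proj[k - 2] > 0]
    intervals = []
    for l in left:
        r = next((x for x in right if x > l), None)
        if r is not None:
            intervals.append((l, r))
    return intervals


# Toy 3x10 text line with three ink regions separated by blank columns.
strip = [
    [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 1, 0, 1, 1, 0],
]
projection_x = column_projection(strip)
intervals = candidate_intervals(projection_x)
print(projection_x, intervals)
```

Each resulting interval is one candidate block; whether two neighbouring intervals belong to the same character is decided later by the classifier.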

The coordinate position of each character block can be determined from the horizontal and vertical projection coordinates, and a single character block can be cut out according to that position. The projection method has a weakness, however: when a character has a left-right structure, the gap between its left and right components may cause each component to be cut out separately by mistake, which reduces recognition accuracy. To improve the accuracy of character block segmentation, the candidate cut point information is therefore input into the Bayes classifier for further processing. The Bayes classifier module determines the category of the content between the left and right candidate cut points of each character block, and cuts out two adjacent left and right blocks that are both character components as a single character block.

Feature values are extracted from the content to be recognized;

The probability that a feature belongs to a given category is calculated as p = (w × ni + 1)/(w × nj + q), where w is the total number of feature vector samples across the three categories of the Bayes classifier used; ni is the number of times this feature of the object occurs in a given category; nj is the total number of feature vectors in that category; and q is an empirical value. In the classifier samples used by the system, the selected feature values are: the aspect ratio (height to width); the coverage rate; the number of strokes crossed by the vertical center line; the ratio of the maximum distance between the top and bottom ends of the strokes crossed by the vertical center line to the height; the number of strokes crossed by the horizontal center line; and the ratio of the maximum distance between the left and right ends of the strokes crossed by the horizontal center line to the width.
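Four of the six features listed above can be computed directly from a binarized character block, as the following sketch suggests. The definitions are assumptions, since the text names the features without defining them: coverage is taken as the fraction of non-zero pixels, and a "stroke crossed" is counted as one run of consecutive non-zero pixels along the center line. The two extent-ratio features are omitted, and the function names are hypothetical.

```python
def count_runs(line):
    """Count runs of consecutive non-zero pixels along a line."""
    runs, inside = 0, False
    for p in line:
        if p != 0 and not inside:
            runs += 1
        inside = p != 0
    return runs


def extract_features(block):
    """Compute simple shape features of a binarized character block."""
    h, w = len(block), len(block[0])
    ink = sum(1 for row in block for p in row if p != 0)
    mid_col = [row[w // 2] for row in block]  # vertical center line
    mid_row = block[h // 2]                   # horizontal center line
    return {
        "aspect": h / w,
        "coverage": ink / (h * w),
        "v_crossings": count_runs(mid_col),
        "h_crossings": count_runs(mid_row),
    }


# Toy 5x4 block with two horizontal strokes and two small marks.
block = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
features = extract_features(block)
print(features)
```

To use such values with a count-based classifier, continuous features like the aspect ratio would first have to be binned into discrete values; the text does not describe how this is done.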

Specifically, after finding the left and right candidate cut points of a character block, the system compares the width of the block with a set threshold; if the width exceeds the threshold, the character block between these cut points can be cut out directly. This is because printed Chinese characters normally have a fairly uniform width that is greater than the width of a character component, digit, letter or punctuation mark. During segmentation, the width between cut points is therefore first compared with the threshold, and the content between adjacent left and right cut points with a larger width is cut out as a single character block.
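The width pre-filter described above can be sketched as a simple split of the cut point intervals. The function name split_by_width and the threshold value are hypothetical; the text does not state how the threshold is chosen.

```python
def split_by_width(intervals, min_char_width):
    """Separate wide blocks (cut out directly) from narrow ones.

    Narrow blocks may be components, digits, letters or punctuation
    and are left for the classifier to decide.
    """
    wide, narrow = [], []
    for left, right in intervals:
        if right - left > min_char_width:
            wide.append((left, right))
        else:
            narrow.append((left, right))
    return wide, narrow


intervals = [(0, 20), (22, 30), (31, 40), (43, 47)]
wide, narrow = split_by_width(intervals, min_char_width=15)
print(wide, narrow)
```

Filtering by width first means the classifier only has to run on the ambiguous narrow blocks.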

A character block with a small width between its left and right cut points may be any of several things: a character component, a digit, a letter or a punctuation mark. Width alone cannot determine which type the block belongs to. Without optimization, the left and right components of a left-right structured character may be cut out separately because of the gap between them, which degrades recognition. The system therefore uses the Bayes classifier to judge the content between the left and right candidate cut points and, starting from the candidate cut points, finds optimized cut positions that improve segmentation accuracy. The optimization is illustrated in Fig. 6 and Fig. 7. Fig. 6 is a schematic diagram of the candidate cut points found by the projection method, in which A, B, C and D are the left candidate cut points of the corresponding character blocks and A', B', C' and D' are the right candidate cut points; the figure is schematic only and does not represent the true cut positions. Fig. 7 is a schematic diagram of the process by which the Bayes classifier optimizes the candidate cut points and cuts the content by category. After the left cut coordinate is fixed (the first left candidate cut point A), the Bayes classifier classifies the content between it and the immediately following right candidate cut point (the first right candidate cut point A'). If the content is a digit, letter or punctuation mark, it is cut out directly according to the type the classifier identified. If it is a character component, the decision position moves to the next left candidate cut point (the second left candidate cut point B), and the classifier determines whether the content between it and the immediately following right candidate cut point (the second right candidate cut point B') is a character component. If so, the character block between the first left candidate cut point A and the second right candidate cut point B' is cut out as a whole. This avoids splitting a left-right structured Chinese character into two parts for recognition and keeps each segmented character complete.

Further, as shown in Fig. 8, the text recognition module is a neural network classifier or an OCR-based recognition system. Neural networks are now widely used in speech and image recognition, and the technology for recognizing pictures of already segmented characters is relatively mature. A neural network resembles the training and learning process of the human brain's nervous system and can learn the features, patterns and rules of samples. After a neural network is built for the recognition task, a number of training samples, chosen according to the complexity of the task, are input into the network for training. These training samples must be labeled manually before they are input, and their selection affects the network's recognition results. The object recognized in the invention is the table image, which contains a rich variety of Chinese characters, digits, letters and symbols, so the character set is large in both kind and number. The neural network of the invention can be trained on a sample character set consistent with the character set of the table images to be recognized. For example, if the table images contain a Chinese character set of about 2000 characters, the digits 0-9, and a symbol set of punctuation marks and mathematical symbols such as the semicolon, the percent sign and various measurement unit symbols, then the training sample set should contain those same Chinese characters, digits and symbols. Only then can the output of character block recognition be correct. A neural network can adjust its errors adaptively: through mechanisms such as error backpropagation it continually reduces the difference between its learned output and the labels and gradually converges toward stable, correct recognition. After training, a number of test samples (a development set) are input into the network to test the accuracy of its output. When the accuracy reaches the set threshold, training is considered complete. The segmented character block pictures are then input into the trained network to complete recognition of the picture.

Claims

1. A complex image character identification system, characterized in that it comprises an image input module, a cell detection module, a cell cutting module, a character segmentation module and a text recognition module;

2. The system as claimed in claim 1, characterized in that the cell detection module uses the findContours function of the OpenCV image processing library to detect the cell contours of the table image.

3. The system as claimed in claim 2, characterized in that the cell cutting module, by calling the minAreaRect function, constructs the minimum-area rotated rectangle enclosing the point set of a cell contour and extracts the coordinates of its four corner vertices; and, using these corner coordinates, cuts out the text in the corresponding cell as a whole to form the corresponding sub-picture.

4. The system as claimed in claim 3, characterized in that the character segmentation module comprises a candidate cut point detection module and a Bayes classifier module; the candidate cut point detection module is connected to the Bayes classifier module, and the Bayes classifier's categories are: character component, digit, letter and punctuation mark.

5. The system as claimed in claim 4, characterized in that the candidate cut point detection module uses the projection method to detect the candidate cut positions on the left and right sides of each character block in the sub-picture and inputs the result into the Bayes classifier module; the Bayes classifier module determines the category of the content between the left and right candidate cut points of each character block, and cuts out two adjacent left and right blocks that are both character components as a single character block.

6. The system as claimed in claim 5, characterized in that the candidate cut point detection module projects the sub-picture horizontally, using the formula $\mathrm{projection\_y}[i] = \sum_{j=0}^{n} \mathbf{1}[\mathrm{pix}(i,j) \neq 0]$ to count the non-zero pixels in each row and storing the value in projection_y, where i is the row index, j is the column index, pix(i, j) is the corresponding pixel value, and n is the coordinate of the last column; the elements of projection_y are traversed; if projection_y[k] = 0, projection_y[k+1] > 0 and projection_y[k+2] > 0, then k is judged to be the starting cut point of a text line and is stored in the container vector<int> top;

if projection_y[k] = 0, projection_y[k-1] > 0 and projection_y[k-2] > 0, then k is judged to be the ending cut point of a text line and is stored in the container vector<int> bottom;

the text lines in the cell are cut out according to the values in top and bottom.

7. The system as claimed in claim 6, characterized in that the candidate cut point detection module projects the image of each cut-out text line vertically, using the formula $\mathrm{projection\_x}[j] = \sum_{i=0}^{m} \mathbf{1}[\mathrm{pix}(i,j) \neq 0]$ to count the non-zero pixels in each column and storing the value in projection_x, where j is the column index, i is the row index, pix(i, j) is the corresponding pixel value, and m is the coordinate of the last row; the elements of projection_x are traversed; if projection_x[k] = 0, projection_x[k+1] > 0 and projection_x[k+2] > 0, then k is judged to be the starting cut point of a character block and is stored in the container vector<int> left;

if projection_x[k] = 0, projection_x[k-1] > 0 and projection_x[k-2] > 0, then k is judged to be the ending cut point of a character block and is stored in the container vector<int> right;

the values in left and right are the candidate cut positions of single characters.

8. The system as claimed in claim 7, characterized in that, given the coordinate of the first left candidate cut point, the Bayes classifier determines whether the content between the first left candidate cut point and the first right candidate cut point is a character component; if it is not, the content is taken to be a digit, letter or punctuation mark and is cut out directly;

9. The system as claimed in claim 8, characterized in that the Bayes classifier performs classification as follows:

feature values are extracted from the content to be recognized;

10. The system as claimed in any one of claims 1 to 9, characterized in that the text recognition module is a neural network classifier.