CN104636117A

CN104636117A - Automatic segmentation method of form image

Info

Publication number: CN104636117A
Application number: CN201310557566.XA
Authority: CN
Inventors: 殷绪成
Original assignee: JIANGSU ABEYOND OUTSOURCING CO Ltd
Current assignee: JIANGSU ABEYOND OUTSOURCING CO Ltd
Priority date: 2013-11-12
Filing date: 2013-11-12
Publication date: 2015-05-20

Abstract

The invention discloses an automatic segmentation method for form images. The method records a form and its region information, automatically analyzes the form image to detect and locate handwritten text regions, and finally locates the segmentation regions. It comprises the following steps: a. the regions of a known form that need to be segmented, recognized, or entered manually are calibrated in advance; through template customization, the form and its region information are stored in a form template library, from which knowledge-driven information is obtained; b. the scanned or photographed form image is automatically analyzed to detect and locate text regions, yielding data-driven information; c. the knowledge-driven information and the data-driven information are combined, their degree of coincidence is compared, and the final segmentation regions are located. The method combines precise image-region localization based on knowledge-driven and data-driven information with an automatic intelligent form-data processing system built on accurate automatic segmentation of form images.

Description

An automatic segmentation method for form images

Technical field

The present invention relates to the technical field of form image processing, and in particular to an automatic segmentation method for form images.

Background art

Traditionally, handwritten manuscripts have been entered entirely by hand. Because handwriting is varied and complex, the labor intensity for staff is high while entry efficiency remains very low, which causes considerable trouble in practice. Researchers have therefore developed many software applications in the hope of fundamentally solving the problem of rapidly entering handwritten manuscripts.

Chinese patent CN103020619A, "A method for automatically segmenting handwritten entries in an electronic notebook", proceeds as shown in Fig. 2: (1) an image of the paper page of the notebook to be digitized is captured; (2) the four edge lines of the paper page image are determined by line detection, and the page area bounded by the four edge lines is rectified into a square region; (3) the type of the paper page is determined from the page image, and the previously stored blank segmentation template for pages of that notebook type is retrieved, the blank segmentation template being composed of a number of character blocks; (4) the character blocks containing the user's handwriting in the square region are determined, and the handwriting in each character block is automatically segmented and extracted block by block. That invention only makes a simple determination of the overlap between the template and the handwritten text, so it cannot achieve accurate localization; nor can it effectively handle handwritten text regions mixed with tables.

Summary of the invention

The object of the present invention is, in view of the above technical problems in the prior art, to provide an automatic segmentation method for form images that effectively improves entry efficiency by combining precise image-region localization based on knowledge-driven and data-driven information with an automatic intelligent form-data processing system built on accurate automatic segmentation of form images.

The present invention is achieved by the following technical solutions:

An automatic segmentation method for form images comprises the following steps: (1) a physical form is obtained from a form document; (2) the physical form is scanned or photographed to obtain a form image; (3) form image data analysis and learning: the form image is analyzed to obtain data-driven information used for segmenting handwritten text regions; (4) form template customization: the form and its region information are stored in a form template library; (5) knowledge-driven information used for region segmentation is obtained from the form template library; (6) region analysis: the data-driven information and the knowledge-driven information are combined to analyze and locate regions in the form image, yielding region information such as the positions of the regions to be segmented; (7) region segmentation: the form image is segmented using the region information to obtain the final output region images.

Further, the form image data analysis and learning obtains the data-driven information used for segmenting handwritten text regions, which includes the position and type of each region. The analysis and learning proceeds as follows:

(A) First, the form image is binarized. The system uses an adaptive binarization method that combines Otsu's method and the Niblack method: the resulting image is the logical AND of the images produced by the two methods. Let p(x, y) be the value of the final binary image at point (x, y), p_{Otsu}(x, y) the value obtained by Otsu's method, and p_{Niblack}(x, y) the value obtained by the Niblack method; then

p(x, y) = p_{Otsu}(x, y) \land p_{Niblack}(x, y)

where p(x, y) = 1 denotes a black point (foreground character) and p(x, y) = 0 denotes a white point (background);
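The combined binarization in step (A) can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the image is assumed to be a nested list of integer gray values in 0..255 with dark ink on light paper, the Niblack threshold is taken in its usual form (local mean plus k times local standard deviation), and the window radius and k = -0.2 are assumed values that the source does not specify. All function names are hypothetical.

```python
from statistics import mean, pstdev


def otsu_threshold(gray):
    """Global threshold that maximizes the between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize_otsu(gray):
    """1 = dark foreground pixel, 0 = background, using one global threshold."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]


def binarize_niblack(gray, radius=1, k=-0.2):
    """Local thresholding: T = local mean + k * local std (assumed parameters)."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [gray[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            t = mean(window) + k * pstdev(window)
            out[y][x] = 1 if gray[y][x] < t else 0
    return out


def binarize_combined(gray, radius=1, k=-0.2):
    """p(x, y) = p_Otsu(x, y) AND p_Niblack(x, y)."""
    a = binarize_otsu(gray)
    b = binarize_niblack(gray, radius, k)
    return [[pa & pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]
```

Taking the AND means a pixel counts as foreground only when both methods agree, so a faint local blemish that the Niblack step picks up is dropped if it lies on the background side of the global Otsu threshold.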

(B) The regions of the form image are obtained by connected-component analysis and must then be classified. Handwriting is discriminated with a hybrid hierarchy, that is, the unit processed is a merged block of several connected components. Because the characteristics of handwriting are uncertain, a Fisher linear discriminant (FLD) classifier based on incremental learning is adopted. The projection matrix (vector) of the classical FLD algorithm is

W = S_{w}^{- 1} (m_{1} - m_{2})

where S_{w} = C_{1} + C_{2} is the within-class scatter matrix and m_{i} is the mean vector of the samples of class i;

C_{i} is updated incrementally with the sequential Karhunen-Loeve (SKL) algorithm, which estimates C_{i} from D_{i}, formed from the K largest eigenvalues, and U_{i}, formed from the corresponding eigenvectors:

C_{i} \approx U_{i} D_{i} U_{i}^{T}

where D_{i} is a K × K diagonal matrix and U_{i} is a matrix with K columns;

In handwriting discrimination the feature vectors used have relatively few dimensions, so as new samples keep arriving, D_{i} and U_{i} are updated directly by singular value decomposition (SVD);

In this incremental classifier, m_{i} is updated with an adaptive filter:

m_{i}^{new} = (1 - \alpha) m_{i} + \alpha x_{i}

where α is a constant averaging factor, typically set to 0.05, and x_{i} is a new sample of class i arriving during incremental learning.
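The classical (non-incremental) FLD projection W = S_{w}^{-1}(m_{1} - m_{2}) can be illustrated with a short plain-Python sketch. It is a batch computation only and does not reproduce the SKL/SVD incremental update described above; C_{i} is assumed here to be the unnormalized scatter matrix of class i, the midpoint decision threshold is an assumption, and the feature vectors and function names are made up for illustration.

```python
def mean_vector(samples):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(s[d] for s in samples) / n for d in range(len(samples[0]))]


def scatter_matrix(samples, m):
    """C_i = sum over samples of (x - m)(x - m)^T (assumed unnormalized)."""
    dim = len(m)
    c = [[0.0] * dim for _ in range(dim)]
    for s in samples:
        diff = [s[d] - m[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                c[a][b] += diff[a] * diff[b]
    return c


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fld_direction(class1, class2):
    """Return (W, m1, m2) with W = S_w^{-1} (m1 - m2) and S_w = C1 + C2."""
    m1, m2 = mean_vector(class1), mean_vector(class2)
    c1, c2 = scatter_matrix(class1, m1), scatter_matrix(class2, m2)
    dim = len(m1)
    s_w = [[c1[a][b] + c2[a][b] for b in range(dim)] for a in range(dim)]
    w = solve(s_w, [u - v for u, v in zip(m1, m2)])
    return w, m1, m2


def classify(x, w, m1, m2):
    """Project x onto W and compare with the midpoint of the projected means."""
    threshold = sum(wi * (a + b) / 2 for wi, a, b in zip(w, m1, m2))
    score = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if score > threshold else 2


# Two made-up 2-D feature clouds, e.g. handwritten (class 1) vs. printed (class 2).
handwritten = [(4, 2), (5, 3), (6, 2), (5, 1)]
printed = [(1, 6), (2, 7), (3, 6), (2, 5)]
w, m1, m2 = fld_direction(handwritten, printed)
label = classify((5, 2), w, m1, m2)
```

Solving the linear system instead of forming the inverse explicitly is a common numerical choice; with very few features, as the text notes for handwriting discrimination, either approach is cheap.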

Further, the region analysis combines the data-driven information and the knowledge-driven information: if the overlap between the position of a handwritten text region given by the data-driven information and the position of the handwritten text region given by the knowledge-driven information is higher than 50%, the handwritten text region obtained from the data-driven information is used as the final segmentation region; text regions of other types are located using the knowledge-driven information from the form template library.

In summary, owing to the above technical solution, the present invention has the following beneficial effects:

(1) It combines precise image-region localization based on knowledge-driven information (template customization according to business needs, together with the corresponding form template library) and data-driven information (from automatic analysis and learning of the form image);

(2) It provides an automatic intelligent form-data processing (recognition or entry) system based on accurate automatic segmentation of form images;

(3) Diverse and complex handwritten manuscripts can be effectively discriminated and entered, further improving entry efficiency.

Brief description of the drawings

Examples of the present invention will be described with reference to the accompanying drawings, in which:

Fig. 1 is a flow diagram of the automatic form image segmentation method of the present invention;

Fig. 2 is a schematic flow diagram of the method of patent CN103020619A for automatically segmenting handwritten entries in an electronic notebook.

Detailed description of the embodiments

All features disclosed in this specification, and all steps of any method or process disclosed, may be combined in any manner, except for mutually exclusive features and/or steps.

Any feature disclosed in this specification (including any accompanying claims, abstract, and drawings) may, unless specifically stated otherwise, be replaced by other equivalent features or alternative features serving a similar purpose. That is, unless specifically stated otherwise, each feature is only one example of a series of equivalent or similar features.

As shown in Fig. 1, the present invention proposes an automatic segmentation method for form images, comprising the following steps: (1) a physical form is obtained from a form document; (2) the physical form is scanned or photographed to obtain a form image; (3) form image data analysis and learning: the form image is analyzed to obtain data-driven information used for segmenting handwritten text regions; (4) form template customization: the form and its region information are stored in a form template library; (5) knowledge-driven information used for region segmentation is obtained from the form template library; (6) region analysis: the data-driven information and the knowledge-driven information are combined to analyze and locate regions in the form image, yielding region information such as the positions of the regions to be segmented; (7) region segmentation: the form image is segmented using the region information to obtain the final output region images.

(A) First, the form image is binarized. Binarization methods divide into global and local methods. Among global methods, the one that performs well and stably is Otsu's method; among local methods, it is the Niblack method. The system uses an adaptive binarization method that combines the two: the resulting image is the logical AND of the images produced by the two methods. Let p(x, y) be the value of the final binary image at point (x, y), p_{Otsu}(x, y) the value obtained by Otsu's method, and p_{Niblack}(x, y) the value obtained by the Niblack method; then

p(x, y) = p_{Otsu}(x, y) \land p_{Niblack}(x, y)

(B) The regions of the form image are obtained by connected-component analysis and must then be classified. Handwriting is discriminated with a hybrid hierarchy, that is, the unit processed is a merged block of several connected components. Such an image block may be a character line, several characters, one or two characters, or several character lines. Given this, together with the uncertainty of handwriting characteristics, a Fisher linear discriminant (FLD) classifier based on incremental learning is adopted. The projection matrix (vector) of the classical FLD algorithm is

W = S_{w}^{- 1} (m_{1} - m_{2})

C_{i} is updated incrementally with the sequential Karhunen-Loeve (SKL) algorithm, which estimates C_{i} from D_{i}, formed from the K largest eigenvalues, and U_{i}, formed from the corresponding eigenvectors:

C_{i} \approx U_{i} D_{i} U_{i}^{T}

The class mean m_{i} is updated with an adaptive filter:

m_{i}^{new} = (1 - \alpha) m_{i} + \alpha x_{i}
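The mean update above is an exponential moving average, which a few lines of Python make concrete. This is only a sketch: the function name is hypothetical, and the default α = 0.05 is the value the specification suggests elsewhere.

```python
def update_mean(mean, sample, alpha=0.05):
    """m_new = (1 - alpha) * m + alpha * x, applied component-wise."""
    return [(1 - alpha) * m + alpha * x for m, x in zip(mean, sample)]


# Feeding the same sample repeatedly pulls the mean toward it geometrically:
# after n updates the old mean keeps a weight of (1 - alpha) ** n.
m = [0.0, 0.0]
for _ in range(100):
    m = update_mean(m, [1.0, 2.0])
```

With α = 0.05, each new sample shifts the class mean by 5% of its distance to that sample, so the classifier adapts gradually and a single atypical block has limited influence.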

Further, the region analysis combines the data-driven information and the knowledge-driven information: if the overlap between the position of a handwritten text region given by the data-driven information and the position of the handwritten text region given by the knowledge-driven information is high (the threshold can generally be set to 50%), the handwritten text region obtained from the data-driven information is used as the final segmentation region; text regions of other types are located using the knowledge-driven information from the form template library.

In operation, according to the needs of the business or the user, the regions of the form (which may contain Chinese, Japanese, or other written languages) that need to be segmented, recognized, or entered manually are calibrated in advance. Through template customization, the form and its region information are stored in the form template library. The knowledge-driven region information provided by the library mainly comprises the position of each region and its type (handwritten region, printed region, picture region, etc.).
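One possible shape for the form template library described above is sketched below. The source only states that each region has a position and a type; the class names, field names, rectangle convention, and the "invoice-v1" form identifier are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class RegionType(Enum):
    HANDWRITTEN = "handwritten"
    PRINTED = "printed"
    PICTURE = "picture"


@dataclass(frozen=True)
class TemplateRegion:
    """A pre-calibrated region: its type and pixel rectangle (assumed layout)."""
    name: str
    region_type: RegionType
    left: int
    top: int
    right: int
    bottom: int


@dataclass
class FormTemplate:
    form_id: str
    regions: List[TemplateRegion] = field(default_factory=list)


class FormTemplateLibrary:
    """In-memory store that answers 'which regions does this form define?'."""

    def __init__(self) -> None:
        self._templates: Dict[str, FormTemplate] = {}

    def register(self, template: FormTemplate) -> None:
        self._templates[template.form_id] = template

    def regions(self, form_id: str,
                region_type: Optional[RegionType] = None) -> List[TemplateRegion]:
        template = self._templates[form_id]
        return [r for r in template.regions
                if region_type is None or r.region_type is region_type]


library = FormTemplateLibrary()
library.register(FormTemplate("invoice-v1", [
    TemplateRegion("title", RegionType.PRINTED, 0, 0, 400, 40),
    TemplateRegion("amount", RegionType.HANDWRITTEN, 50, 120, 300, 160),
]))
handwritten_regions = library.regions("invoice-v1", RegionType.HANDWRITTEN)
```

Querying by region type gives the knowledge-driven positions that the later region analysis compares against the handwritten regions detected in the image.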

Automatic image data analysis and learning mainly consists of automatically analyzing the form image to detect and locate handwritten text regions. In general, the handwritten text in a form is the most important information and must be located and segmented for subsequent recognition or manual entry. However, much handwritten text is not written entirely within the region the form design provides for it and often extends beyond that region, so handwritten text regions must be located through automatic analysis and learning of the image data. In this way, data-driven information about the handwritten text regions is obtained, mainly the position and type (handwritten text region) of each region. Before this, the image must undergo binarization, line filtering, denoising, and relocation of the identified regions.
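The region-finding step that precedes handwriting discrimination can be sketched as connected-component labeling followed by merging of nearby components into blocks. This is an illustrative simplification in plain Python: the 8-connectivity, the bounding-box representation, and the pixel `gap` used for merging are assumptions, and line filtering and denoising are not shown.

```python
from collections import deque


def connected_regions(binary):
    """Bounding boxes (left, top, right, bottom; right/bottom exclusive)
    of the 8-connected foreground components of a 0/1 image."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:  # breadth-first flood fill of one component
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = cx + dx, cy + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right + 1, bottom + 1))
    return boxes


def merge_nearby(boxes, gap=1):
    """Repeatedly merge boxes separated by at most `gap` pixels on both axes."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if (b[0] <= a[2] + gap and a[0] - gap <= b[2]
                        and b[1] <= a[3] + gap and a[1] - gap <= b[3]):
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes
```

The merged boxes play the role of the "merged blocks of several connected components" that the classifier then labels as handwritten or not.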

The region analysis compares how well the knowledge-driven region information agrees with the position of the handwritten text region given by the data-driven information. If the degree of agreement is greater than 50%, the handwritten text region obtained from the data-driven information is used as the final segmentation region; if it is less than 50%, the region to be segmented is located primarily from the knowledge-driven information in the form template library.
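The decision rule above can be sketched as follows. The specification does not define how the degree of agreement is measured; this sketch assumes intersection-over-union of axis-aligned rectangles, which is only one plausible choice. The `Box` type and function names are hypothetical.

```python
from typing import NamedTuple, Optional


class Box(NamedTuple):
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive


def area(box: Box) -> int:
    return max(0, box.right - box.left) * max(0, box.bottom - box.top)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection-over-union of two rectangles (assumed agreement measure)."""
    inter = Box(max(a.left, b.left), max(a.top, b.top),
                min(a.right, b.right), min(a.bottom, b.bottom))
    inter_area = area(inter)
    union = area(a) + area(b) - inter_area
    return inter_area / union if union else 0.0


def choose_region(data_box: Optional[Box], template_box: Box,
                  threshold: float = 0.5) -> Box:
    """Use the detected handwriting box when it agrees well enough with the
    template region; otherwise fall back to the template region."""
    if data_box is not None and overlap_ratio(data_box, template_box) > threshold:
        return data_box
    return template_box


template = Box(0, 0, 100, 40)
detected = Box(10, 0, 120, 50)   # handwriting spilling past the template field
final = choose_region(detected, template)
```

Preferring the detected box when agreement is high lets the cut follow handwriting that runs past the designed field, while the template fallback guards against spurious detections.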

The present invention is not limited to the foregoing embodiments. It extends to any new feature or any new combination of features disclosed in this specification, and to any new method or process step, or any new combination of steps, disclosed.

Claims

1. An automatic segmentation method for form images, characterized in that it comprises the following steps:

(1) a physical form is obtained from a form document;

(2) the physical form is scanned or photographed to obtain a form image;

(3) form image data analysis and learning: the form image is analyzed to obtain data-driven information used for segmenting handwritten text regions;

(4) form template customization: the form and its region information are stored in a form template library;

(5) knowledge-driven information used for region segmentation is obtained from the form template library;

(6) region analysis: the data-driven information and the knowledge-driven information are combined to analyze and locate regions in the form image, yielding region information;

(7) region segmentation: the form image is segmented using the region information to obtain the final output region images.

2. The automatic segmentation method for form images according to claim 1, characterized in that: the form image data analysis and learning obtains the data-driven information used for segmenting handwritten text regions, which includes the position and type of each region; the form image data analysis and learning is carried out as follows:

(A) the form image is binarized by combining Otsu's method and the Niblack method, the final binary value at point (x, y) being

p(x, y) = p_{Otsu}(x, y) \land p_{Niblack}(x, y)

(B) in addition, the regions of the form image are obtained by connected-component analysis and must then be classified; handwriting is discriminated with a hybrid hierarchy, that is, the unit processed is a merged block of several connected components; because the characteristics of handwriting are uncertain, a Fisher linear discriminant (FLD) classifier based on incremental learning is adopted, and the projection matrix (vector) of the classical FLD algorithm is

W = S_{w}^{- 1} (m_{1} - m_{2})

C_{i} is updated incrementally with the sequential Karhunen-Loeve (SKL) algorithm, which estimates C_{i} from D_{i}, formed from the K largest eigenvalues, and U_{i}, formed from the corresponding eigenvectors:

C_{i} \approx U_{i} D_{i} U_{i}^{T}

In this incremental classifier, m_{i} is updated with an adaptive filter:

m_{i}^{new} = (1 - \alpha) m_{i} + \alpha x_{i}

3. The automatic segmentation method for form images according to claim 1, characterized in that: the region analysis combines the data-driven information and the knowledge-driven information; if the overlap between the position of a handwritten text region given by the data-driven information and the position of the handwritten text region given by the knowledge-driven information is higher than 50%, the handwritten text region obtained from the data-driven information is used as the final segmentation region; text regions of other types are located using the knowledge-driven information from the form template library.