CN107133621A

CN107133621A - OCR-based classification and information extraction method for formatted faxes

Info

Publication number: CN107133621A
Application number: CN201710334784.5A
Authority: CN
Inventors: 于志文; 车少帅; 胡笳; 吴洲洋; 周玲
Original assignee: JIANGSU HONGXIN SYSTEM INTEGRATION CO Ltd
Current assignee: JIANGSU HONGXIN SYSTEM INTEGRATION CO Ltd
Priority date: 2017-05-12
Filing date: 2017-05-12
Publication date: 2017-09-05
Anticipated expiration: 2037-05-12
Also published as: CN107133621B

Abstract

The invention discloses an OCR-based classification and information extraction method for formatted faxes, comprising: applying adaptive-threshold binarization to the fax image; correcting the skew of the image; finding the contour of the largest bounding box of the table in the corrected image and cropping the table header region from the area of the image above that bounding box; screening the character contours in the header region and merging them; counting the fields in the header region after merging and classifying the image; taking the successfully classified image and locating the regions to be recognized in it; recognizing the fields in those regions with OCR; and optimizing the recognized fields. The invention improves office efficiency, frees up staff time, and converts unstructured data into structured data. It is suited to formatted faxes, that is, faxes of images containing a table, such as standardized contracts, self-made vouchers, and bills.

Description

OCR-based classification and information extraction method for formatted faxes

Technical field

The present invention relates to the field of image processing, and in particular to an OCR-based classification and information extraction method for formatted faxes.

Background art

With the development of science and technology, cross-border and cross-regional business exchanges are becoming more and more frequent. Because a fax carries a special legal effect compared with other ways of transmitting documents, faxes are widely used in office systems. Formatted fax documents contain a large amount of useful information, but at present they must be classified manually and the important information in them must be extracted by hand, which is inefficient. A fast and efficient document classification and information extraction method is urgently needed to improve staff efficiency, reduce labor costs, and free up productivity.

Chinese Patent Publication No. CN101876999 discloses a method for generating a fax index, a message analysis device, and a fax retrieval system. The system performs layout analysis on a fax message, extracts characteristic information from it, creates a label for the fax message from the extracted characteristic information, and uses the label as the index of the fax message, so that a user can look up the corresponding fax message by its label. However, the system can only classify and index documents; it is difficult for it to extract the key information in a document.

Chinese Patent Publication No. CN102222289 discloses an OCR-based mobile phone financial management method and system. The system analyzes and recognizes financial bills using OCR, but it does not handle scanned formatted faxes and cannot classify fax images or extract information from them.

Summary of the invention

The technical problem to be solved by the invention is to provide, in view of the above deficiencies of the prior art, an OCR-based classification and information extraction method for formatted faxes. The method improves office efficiency, frees up staff time, and converts unstructured data into structured data. The invention is suited to formatted faxes, that is, faxes of images containing a table, such as standardized contracts, self-made vouchers, and bills.

To achieve the above technical purpose, the invention adopts the following technical scheme:

An OCR-based classification and information extraction method for formatted faxes, comprising the following steps:

Step 1: Obtain the image file of the fax and apply adaptive-threshold binarization to the image to reduce noise interference;

Step 2: Determine the skew angle of the image and correct the image;

Step 3: Find the contour of the largest bounding box of the table in the corrected image, and crop the table header region from the area of the image above that bounding box;

Step 4: Screen the character contours in the header region and merge them, so that the character contours are merged into complete fields;

Step 5: Count the fields in the header region after merging, and classify the image according to the number of fields in the header region and the content of the fields;

Step 6: Take the successfully classified image and locate the regions to be recognized in the image;

Step 7: Recognize the fields in the regions to be recognized using OCR, according to the positions of those regions in the table;

Step 8: Optimize the recognized fields.
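Step 3 can be illustrated with a minimal Python sketch. It assumes that contour detection has already produced axis-aligned bounding boxes as `(x, y, width, height)` tuples with the origin at the top-left corner, that "largest" means largest by area, and that the header region spans the full image width above the table. The names `largest_box` and `header_region` are hypothetical and do not come from the patent.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), origin at top-left


def largest_box(boxes: List[Box]) -> Box:
    """Return the bounding box with the largest area (assumed to be the table)."""
    return max(boxes, key=lambda b: b[2] * b[3])


def header_region(image_width: int, table_box: Box) -> Box:
    """Return the strip of the image that lies above the table's bounding box."""
    _, top, _, _ = table_box
    return (0, 0, image_width, top)


# Example: three detected boxes on a 600-pixel-wide page.
boxes = [(40, 120, 500, 600), (60, 140, 100, 30), (50, 20, 200, 40)]
table = largest_box(boxes)
header = header_region(600, table)
```

Here `table` is the 500 by 600 box and `header` is the strip from the top of the page down to row 120, which is where the title text of the form is expected.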

As a further improvement of the technical scheme of the invention, step 1 comprises the following steps:

(1) Obtain the image file of the fax, convert the image to the HSV color space, and remove the pixels that fall in the red range;

(2) Determine the binarization threshold at each pixel position from the distribution of pixel values in the neighborhood block of that pixel, and apply adaptive-threshold binarization to the image to reduce noise interference.
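The two sub-steps of step 1 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the red hue range, the saturation and value limits, the use of the local mean as the neighborhood statistic, and the `block` and `offset` parameters are all choices made for the example, since the patent does not give concrete values. A production system would more likely use an image processing library.

```python
import colorsys
from typing import List, Tuple

RGB = Tuple[int, int, int]


def is_red(pixel: RGB, min_sat: float = 0.4, min_val: float = 0.3) -> bool:
    """Return True if the pixel falls in an assumed red range of HSV space."""
    r, g, b = (c / 255.0 for c in pixel)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    hue_deg = h * 360.0
    return (hue_deg <= 20.0 or hue_deg >= 340.0) and s >= min_sat and v >= min_val


def remove_red_to_gray(image: List[List[RGB]]) -> List[List[int]]:
    """Replace red pixels with white and convert the rest to grayscale."""
    gray = []
    for row in image:
        out = []
        for r, g, b in row:
            if is_red((r, g, b)):
                out.append(255)  # erase the red stamp pixel
            else:
                out.append(round(0.299 * r + 0.587 * g + 0.114 * b))
        gray.append(out)
    return gray


def adaptive_threshold(gray: List[List[int]], block: int = 3, offset: int = 5) -> List[List[int]]:
    """Binarize each pixel against the mean of its block x block neighborhood."""
    height, width = len(gray), len(gray[0])
    radius = block // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            neighborhood = [
                gray[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(neighborhood) / len(neighborhood)
            row.append(255 if gray[y][x] > local_mean - offset else 0)
        result.append(row)
    return result


# Example: white page, one black ink pixel, one red stamp pixel.
W, K, R = (255, 255, 255), (0, 0, 0), (200, 20, 20)
page = [[W, W, W], [W, K, R], [W, W, W]]
binary = adaptive_threshold(remove_red_to_gray(page))
```

In the example the red pixel is turned white before thresholding, so only the black ink pixel survives as foreground in `binary`.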

As a further improvement of the technical scheme of the invention, step 2 comprises finding the longest straight line in the image and rotating the image to correct it according to the angle between that line and the horizontal direction.
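The angle computation in step 2 can be sketched as below. The sketch assumes that line detection has already produced segments as `(x1, y1, x2, y2)` tuples, and that the longest segment is a table ruling line that should be horizontal. The function names are hypothetical, and `rotate_point` only shows the effect of the correction on a single coordinate rather than resampling a whole image.

```python
import math
from typing import List, Tuple

Segment = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def segment_length(seg: Segment) -> float:
    """Euclidean length of a line segment."""
    return math.hypot(seg[2] - seg[0], seg[3] - seg[1])


def skew_angle_degrees(segments: List[Segment]) -> float:
    """Angle between the longest segment and the horizontal axis, in degrees."""
    x1, y1, x2, y2 = max(segments, key=segment_length)
    if x2 < x1:  # orient the segment left to right so the angle stays small
        x1, y1, x2, y2 = x2, y2, x1, y1
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def rotate_point(x: float, y: float, cx: float, cy: float, angle_deg: float) -> Tuple[float, float]:
    """Rotate (x, y) about (cx, cy) by angle_deg."""
    t = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(t) - dy * math.sin(t),
            cy + dx * math.sin(t) + dy * math.cos(t))


# Example: a long ruling line tilted by 2 degrees and a short stroke.
tilted_end_y = 100.0 * math.tan(math.radians(2.0))
segments = [(0.0, 0.0, 100.0, tilted_end_y), (0.0, 0.0, 5.0, 5.0)]
angle = skew_angle_degrees(segments)
corrected = rotate_point(100.0, tilted_end_y, 0.0, 0.0, -angle)
```

Rotating by the negative of the measured angle brings the end point of the ruling line back onto the horizontal axis, which is the correction the step describes.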

As a further improvement of the technical scheme of the invention, step 4 comprises the following steps:

(1) Set the length threshold range and the width threshold range of a character contour;

(2) Perform contour retrieval on the header region and select the contours whose length lies within the character length threshold range and whose width lies within the character width threshold range; the selected contours are the character contours;

(3) Merge the character contours: extract the color of each character contour, and merge into a complete field those character contours that are similar in color and whose spacing is less than half the width of a character contour.
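The screening and merging in step 4 can be sketched as follows. Several details are assumptions made for the example: "length" is taken to mean the height of the bounding box and "width" its horizontal extent, characters are merged left to right along a single text line, the gap is compared with half the width of the preceding character, and two colors count as similar when every RGB channel differs by at most `color_tol`. None of these values or names come from the patent.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (x, y, width, height)
Color = Tuple[int, int, int]      # mean RGB color of the contour
Char = Tuple[Box, Color]


def filter_chars(contours: List[Char], length_range: Tuple[int, int],
                 width_range: Tuple[int, int]) -> List[Char]:
    """Keep contours whose height and width both fall inside the character ranges."""
    return [
        (box, color) for box, color in contours
        if length_range[0] <= box[3] <= length_range[1]
        and width_range[0] <= box[2] <= width_range[1]
    ]


def merge_fields(chars: List[Char], color_tol: int = 30) -> List[List[Char]]:
    """Group neighboring characters of similar color into fields."""
    fields: List[List[Char]] = []
    for box, color in sorted(chars, key=lambda c: c[0][0]):
        if fields:
            last_box, last_color = fields[-1][-1]
            gap = box[0] - (last_box[0] + last_box[2])
            close = gap < last_box[2] / 2
            similar = max(abs(a - b) for a, b in zip(color, last_color)) <= color_tol
            if close and similar:
                fields[-1].append((box, color))
                continue
        fields.append([(box, color)])
    return fields


def field_box(field: List[Char]) -> Box:
    """Bounding box that encloses every character of a field."""
    left = min(b[0] for b, _ in field)
    top = min(b[1] for b, _ in field)
    right = max(b[0] + b[2] for b, _ in field)
    bottom = max(b[1] + b[3] for b, _ in field)
    return (left, top, right - left, bottom - top)


contours = [
    ((0, 0, 20, 20), (0, 0, 0)),
    ((25, 0, 20, 20), (0, 0, 0)),
    ((50, 0, 20, 20), (10, 10, 10)),
    ((120, 0, 20, 20), (0, 0, 0)),     # far from the previous character
    ((145, 0, 20, 20), (200, 0, 0)),   # close, but a different color
    ((300, 0, 2, 2), (0, 0, 0)),       # speckle noise, too small
    ((0, 40, 400, 3), (0, 0, 0)),      # ruling line, too wide
]
fields = merge_fields(filter_chars(contours, (10, 40), (10, 40)))
```

With these inputs the noise and the ruling line are dropped, the first three characters form one field, and the remaining two characters each form their own field because of distance and color respectively.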

As a further improvement of the technical scheme of the invention, step 5 comprises the following steps:

(1) Count the fields in the header region;

(2) If the number of fields is 0, the image is not classified;

(3) If the number of fields is 1, classify the image with a machine learning SVM classifier;

(4) If the number of fields is greater than 1, recognize the characters in the header region by OCR and match them against the type names in the image recognition library to perform the classification. Divide the total number of matched characters by the total number of characters in the correctly matched field, and compare the result with a preset threshold; if the result is greater than the preset threshold, the classification succeeds; otherwise, the classification fails.
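The branching in step 5 can be sketched as below. The sketch assumes the header fields have already been recognized as strings, counts matched characters with `difflib.SequenceMatcher`, and divides by the length of the candidate type name, which is one reading of the ratio described above. The `svm_classify` callable is a hypothetical stand-in for a trained SVM classifier and is not implemented here; the type names and the threshold of 0.8 are invented for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional


def match_score(recognized: str, type_name: str) -> float:
    """Matched character count divided by the length of the type name."""
    if not type_name:
        return 0.0
    matcher = SequenceMatcher(None, recognized, type_name, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(type_name)


def classify(fields: List[str], type_names: List[str], threshold: float,
             svm_classify: Callable[[], Optional[str]]) -> Optional[str]:
    """Return the document type, or None when the image cannot be classified."""
    if not fields:
        return None                      # no header field: do not classify
    if len(fields) == 1:
        return svm_classify()            # single field: defer to the SVM stand-in
    text = " ".join(fields)
    best = max(type_names, key=lambda name: match_score(text, name), default=None)
    if best is not None and match_score(text, best) > threshold:
        return best                      # classification succeeds
    return None                          # classification fails


TYPES = ["VALUE ADDED TAX INVOICE", "PURCHASE ORDER"]
# "INV0ICE" contains a typical OCR confusion between the letter O and the digit 0.
result = classify(["VALUE ADDED", "TAX INV0ICE"], TYPES, 0.8, lambda: None)
```

A single misread character still leaves most of the type name matched, so the header is accepted, while unrelated header text falls below the threshold and is rejected.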

As a further improvement of the technical scheme of the invention, step 6 comprises the following steps:

(1) Load the template information prepared in advance;

(2) Take the image successfully classified in step 5 and find the contours of all bounding boxes contained within the contour of the largest bounding box in the image;

(3) Set the length threshold range and the width threshold range of a bounding box, and select the bounding boxes whose length and width lie within those ranges;

(4) Using the position information of the selected bounding boxes, scan and sort all bounding boxes from top to bottom and from left to right to locate the table, and find the regions to be recognized inside the table according to the template information;

(5) Determine from the template information whether content outside the table needs to be recognized. If so, extract field contours outside the table: screen the character contours outside the table using the method of step 4 and merge them into complete fields. Determine the regions to be recognized outside the table from the fields recorded in the template information and their positions relative to the largest bounding box in the image, and locate the fields outside the largest bounding box that need to be recognized according to the fields recorded in the template information.
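The scan order in sub-step (4) can be sketched as follows. The sketch assumes cell bounding boxes as `(x, y, width, height)` tuples, groups boxes into one row when their top edges differ from the first box of that row by at most `row_tol` pixels, and represents the template as a hypothetical mapping from field name to cell index in scan order. The patent does not specify the template format, so this mapping is purely illustrative.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sort_cells(boxes: List[Box], row_tol: int = 10) -> List[Box]:
    """Order cells top to bottom, then left to right within each row."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tol:
            rows[-1].append(box)      # same row as the previous box
        else:
            rows.append([box])        # start a new row
    return [box for row in rows for box in sorted(row, key=lambda b: b[0])]


def locate_regions(cells: List[Box], template: Dict[str, int]) -> Dict[str, Box]:
    """Look up the cell of each template field by its index in scan order."""
    return {name: cells[index] for name, index in template.items()}


# Example: five cells whose top edges are slightly uneven after scanning.
detected = [(210, 52, 100, 40), (10, 50, 100, 40), (110, 48, 100, 40),
            (10, 100, 100, 40), (110, 101, 100, 40)]
cells = sort_cells(detected)
regions = locate_regions(cells, {"invoice_no": 1, "amount": 4})
```

The row tolerance is what keeps small vertical jitter from splitting one table row into several, so the index of each cell stays stable from one scan to the next.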

As a further improvement of the technical scheme of the invention, step 7 comprises the following steps:

(1) Crop the field images according to the position information of the regions to be recognized from step 6;

(2) Recognize the located fields by OCR.

As a further improvement of the technical scheme of the invention, step 8 comprises the following steps:

(1) Extract the fields recognized by OCR;

(2) Optimize according to field type: for lowercase (numeric amount) fields, remove the non-numeric parts; for date fields, screen out spaces and any characters that are neither digits nor date characters;

(3) Dictionary optimization: build a dictionary library and match each field recognized by OCR against the fields in the dictionary library. If the matching score is greater than a preset threshold, the field in the dictionary library is replaced with the OCR-recognized field, thereby optimizing and updating the fields in the dictionary library; meanwhile, manually confirmed correct fields are added to the dictionary library. The matching score equals the number of characters recognized correctly by OCR divided by the total number of characters of the dictionary entry currently being matched.
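The type-specific cleanup and the matching score of step 8 can be sketched as below. The sketch makes several assumptions: the decimal point is kept in amount fields, the characters `-`, `/` and `.` are treated as date characters, and the number of correctly recognized characters is approximated with `difflib.SequenceMatcher`. The function only returns the best dictionary entry above the threshold; what is then done with that entry is left to the caller. All names and sample strings are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Optional


def clean_amount(text: str) -> str:
    """Keep only digits and the decimal point of a numeric amount field."""
    return re.sub(r"[^0-9.]", "", text)


def clean_date(text: str) -> str:
    """Keep only digits and the assumed date separators of a date field."""
    return re.sub(r"[^0-9/.\-]", "", text)


def matching_score(ocr_text: str, entry: str) -> float:
    """Matched character count divided by the length of the dictionary entry."""
    if not entry:
        return 0.0
    matcher = SequenceMatcher(None, ocr_text, entry, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks()) / len(entry)


def dictionary_match(ocr_text: str, dictionary: List[str], score_thr: float) -> Optional[str]:
    """Return the best-matching dictionary entry if its score exceeds score_thr."""
    best = max(dictionary, key=lambda e: matching_score(ocr_text, e), default=None)
    if best is not None and matching_score(ocr_text, best) > score_thr:
        return best
    return None


DICTIONARY = ["Example System Integration Co", "Sample Trading Co"]
amount = clean_amount("Total: 1,234.50 yuan")
date = clean_date("Date: 2017 - 05 - 12")
match = dictionary_match("Example Systen Integration Co", DICTIONARY, 0.8)
```

In the example the OCR output differs from the first dictionary entry by one character, so the score is well above 0.8 and that entry is returned, while text with no overlap yields no match.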

The invention can quickly classify formatted fax documents and extract information from them, with fast and accurate classification and high extraction accuracy. In the prior art, some methods search and classify fax messages but cannot extract field information; others recognize images but cannot recognize formatted fax messages. There has therefore been no effective method for extracting information from formatted fax documents. The method proposed here fills this technical gap, improves office efficiency, frees up productivity, and saves labor costs.

Brief description of the drawings

Fig. 1 is flow chart of the invention.

Embodiment

The embodiment of the invention is further described below with reference to Fig. 1:

Referring to Fig. 1, this embodiment applies to any formatted fax, where a formatted fax is a fax of an image containing a table. This embodiment takes the fax of a bill as an example, as follows:

Step 1: Obtain the image file of the fax of the bill and apply adaptive-threshold binarization to the image to reduce noise interference;

Step 2: Determine the skew angle of the image and correct the image;

Step 3: Find the contour of the largest bounding box of the table in the corrected image, and crop the bill header region from the area of the image above that bounding box;

Step 6: Take the successfully classified image and locate the regions to be recognized in the image (both inside and outside the table);

Step 8: Optimize the recognized fields.

In this embodiment, step 1 comprises the following steps:

(1) Obtain the image file of the fax, convert the image to the HSV color space, and remove the pixels that fall in the red range (that is, remove the red stamp);

Preferably, step 2 consists of finding the longest straight line in the image and rotating the image to correct it according to the angle between that line and the horizontal direction.

In this embodiment, step 4 comprises the following steps:

(2) Perform contour retrieval on the header region and select the contours whose length lies within the character length threshold range and whose width lies within the character width threshold range; the selected contours are the character contours;

(3) Merge the character contours: extract the color of each character contour, and merge into a complete field those character contours that are identical in color and whose spacing is less than half the width of a character contour.

In this embodiment, step 5 comprises the following steps:

(1) Count the fields in the header region;

(2) If the number of fields is 0, the image is not classified and the process exits;

(3) If the number of fields is 1, classify the image with a machine learning SVM classifier. The SVM classifier must be trained in advance on a large number of table headers, and for a bill that the SVM classifier cannot distinguish the process exits directly. This embodiment uses a machine learning SVM classifier from the prior art;

(4) If the number of fields is greater than 1, recognize the characters in the header region by OCR and match them against the type names in the image recognition library to perform the classification. Divide the total number of matched characters by the total number of characters in the correctly matched field, and compare the result with the preset threshold Thr; if the result is greater than the preset threshold, the classification succeeds; otherwise, the classification fails and the process exits.

Preferably, step 6 comprises the following steps:

(1) Prepare the template information, and load the template information prepared in advance;

(2) Take the successfully classified image and find the contours of all bounding boxes contained within the contour of the largest bounding box in the image;

(3) Set the length threshold range and the width threshold range of a bounding box, and select the bounding boxes whose length and width lie within those ranges;

(4) Using the position information of the selected bounding boxes, scan and sort all bounding boxes from top to bottom and from left to right to locate the table, and find the regions to be recognized inside the table according to the template information (use the template information to determine the position of each region to be recognized relative to the table, that is, whether the region lies outside the table; if the region is inside the table, it only needs to be located and extracted within the table; if the region is outside the table, perform the following step);

(5) Determine from the template information whether content outside the table needs to be recognized. If so, extract field contours outside the table: screen the character contours outside the table using the method of step 4 and merge them into complete fields. Determine the regions to be recognized outside the table from the fields recorded in the template information and their positions relative to the largest bounding box in the image, and locate the fields outside the largest bounding box that need to be recognized according to the fields recorded in the template information.

In this embodiment, step 7 comprises the following steps:

(2) Recognize the located fields by OCR.

In this embodiment, step 8 comprises the following steps:

(1) Extract the fields recognized by OCR;

(2) Optimize according to field type: for lowercase (numeric amount) fields, remove the non-numeric parts; for date fields, screen out the spaces and any characters that are neither digits nor "date" characters;

(3) Dictionary optimization: build a dictionary library and match each field recognized by OCR against the fields in the dictionary library. If the matching score is greater than the preset threshold scoreThr, the field in the dictionary library is replaced with the OCR-recognized field, thereby optimizing and updating the fields in the dictionary library; meanwhile, manually confirmed correct fields are continually added to the dictionary library. The matching score equals the number of characters recognized correctly by OCR divided by the total number of characters of the dictionary entry currently being matched.

The protection scope of the invention includes but is not limited to the above embodiment. The protection scope is defined by the claims, and any replacement, variation, or improvement of this technology that is readily apparent to those skilled in the art falls within the protection scope of the invention.

Claims

1. An OCR-based classification and information extraction method for formatted faxes, characterized in that:

it comprises the following steps:

Step 2: Determine the skew angle of the image and correct the image;

Step 8: Optimize the recognized fields.

2. The OCR-based classification and information extraction method for formatted faxes according to claim 1, characterized in that:

step 1 comprises the following steps:

3. The OCR-based classification and information extraction method for formatted faxes according to claim 2, characterized in that:

step 2 comprises finding the longest straight line in the image and rotating the image to correct it according to the angle between that line and the horizontal direction.

4. The OCR-based classification and information extraction method for formatted faxes according to claim 3, characterized in that: step 4 comprises the following steps:

5. The OCR-based classification and information extraction method for formatted faxes according to claim 4, characterized in that:

step 5 comprises the following steps:

(1) Count the fields in the header region;

(2) If the number of fields is 0, the image is not classified;

6. The OCR-based classification and information extraction method for formatted faxes according to claim 5, characterized in that:

step 6 comprises the following steps:

(1) Load the template information prepared in advance;

7. The OCR-based classification and information extraction method for formatted faxes according to claim 6, characterized in that:

step 7 comprises the following steps:

(2) Recognize the located fields by OCR.

8. The OCR-based classification and information extraction method for formatted faxes according to claim 1, characterized in that:

step 8 comprises the following steps:

(1) Extract the fields recognized by OCR;