CN107133621B - Method for classifying and extracting information of formatted fax based on OCR

Info

Publication number: CN107133621B
Application number: CN201710334784.5A
Authority: CN
Inventors: 于志文; 车少帅; 胡笳; 吴洲洋; 周玲
Original assignee: Clp Hongxin Information Technology Co ltd
Current assignee: Clp Hongxin Information Technology Co ltd
Priority date: 2017-05-12
Filing date: 2017-05-12
Publication date: 2020-09-29
Anticipated expiration: 2037-05-12
Also published as: CN107133621A

Abstract

The invention discloses an OCR (optical character recognition)-based method for classifying formatted faxes and extracting information from them, which comprises the following steps: binarizing a fax image with an adaptive threshold; correcting the image; finding the outline of the maximum surrounding frame of the table in the corrected image, and cropping the header area of the image from the upper area of the maximum surrounding frame of the table; screening font outlines in the header area and fusing them; detecting the number of fields in the header area after merging, and classifying the image; extracting successfully classified images, and locating the regions to be identified in them; recognizing the fields in the regions to be identified in the table by OCR; and optimizing the recognized fields. The invention improves office work efficiency, frees up staff productivity, and converts unstructured data into structured data. It is suitable for formatted faxes, namely faxes of form images, such as standardized contracts, homemade certificates, bills and the like.

Description

Method for classifying and extracting information of formatted fax based on OCR

Technical Field

The invention relates to the field of image processing, in particular to a method for classifying and extracting information of a formatted fax based on OCR.

Background

With the development of science and technology, business communication across countries and regions is increasingly frequent, and faxes are widely used in office systems because they carry a special legal effect compared with other file transmission modes. Formatted fax documents contain a large amount of useful information, but at present they must be classified manually and their important information extracted manually, which is inefficient. An efficient and fast method for file classification and information extraction is urgently needed to improve staff working efficiency, reduce labor cost, and free up productivity.

Chinese patent publication No. CN101876999 discloses a method for generating a fax index, a message analyzing device and a fax retrieval system, which perform layout analysis on a fax message, extract feature information in the fax message, establish a tag for the fax message according to the extracted feature information, and use the tag as an index of the fax message, so that a user can search for a corresponding fax message according to the tag. However, the system can only realize the classification and the indexing of the files, and the extraction of the key information in the files is difficult to realize.

Chinese patent publication No. CN102222289 discloses a mobile phone financial management method and system based on OCR, which analyzes and recognizes financial bills by means of OCR technology, but cannot classify fax images or extract information from formatted scanned fax documents.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art by providing an OCR (optical character recognition)-based method for classifying formatted faxes and extracting information from them. The method improves office work efficiency, frees up staff productivity, and converts unstructured data into structured data.

In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:

A method for classifying and extracting information of formatted faxes based on OCR, specifically comprising the following steps:

step 1: acquiring a fax image file, and binarizing the image with an adaptive threshold to reduce noise interference;

step 2: determining the inclination angle of the image, and correcting the image;

step 3: finding the outline of the maximum surrounding frame of the table in the corrected image, and cropping the header area of the image from the upper area of the maximum surrounding frame of the table in the image;

step 4: screening font outlines in the header area and fusing them so that they are merged into complete fields;

step 5: detecting the number of fields in the header area after merging, and classifying the image according to the number and the content of those fields;

step 6: extracting the successfully classified images, and locating the regions to be identified in them;

step 7: recognizing the fields in the regions to be identified by OCR (optical character recognition), according to the positions of those regions in the table;

step 8: optimizing the recognized fields.

As a further improved technical solution of the present invention, the step 1 specifically includes the following steps:

(1) acquiring a fax image file, converting the image into the HSV color space, and removing pixels that fall within the red interval;

(2) determining the binarization threshold at each pixel position according to the distribution of pixel values in the neighborhood block of that pixel, and binarizing the image with this adaptive threshold to reduce noise interference.
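By way of illustration only, the two sub-steps of step 1 may be sketched in plain Python as follows. The hue, saturation and value limits of the "red interval", the 3x3 neighborhood block, the mean-minus-offset rule and all function names are assumptions made for this sketch; the text above does not specify them.

```python
import colorsys


def remove_red(pixels, hue_width=20 / 360, sat_min=0.4, val_min=0.3):
    """Replace red pixels with white.

    pixels: list of rows, each row a list of (r, g, b) tuples in 0..255.
    The limits defining "red" are illustrative assumptions.
    """
    cleaned = []
    for row in pixels:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            # Red sits at both ends of the hue circle (near 0 and near 1).
            is_red = (h <= hue_width or h >= 1 - hue_width) and s >= sat_min and v >= val_min
            new_row.append((255, 255, 255) if is_red else (r, g, b))
        cleaned.append(new_row)
    return cleaned


def to_gray(pixels):
    """Convert RGB rows to integer gray levels (ITU-R BT.601 weights)."""
    return [[(r * 299 + g * 587 + b * 114) // 1000 for r, g, b in row] for row in pixels]


def adaptive_binarize(gray, block=3, offset=5):
    """Mark a pixel as foreground (1) when it is darker than the mean of its
    neighborhood block minus an offset; otherwise background (0)."""
    height, width = len(gray), len(gray[0])
    half = block // 2
    result = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            values = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(values) / len(values)
            result[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return result
```

Because the threshold is computed per pixel from its own neighborhood, uneven background brightness across the fax page affects the result less than a single global threshold would.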

As a further improved technical solution of the present invention, the step 2 includes finding the longest straight line in the image, and performing rotation correction on the image according to an included angle between the longest straight line and the horizontal direction.
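By way of illustration only, the angle computation of step 2 may be sketched as follows. The sketch assumes that straight line segments have already been detected by some line detector and are given as endpoint tuples; the function names and the segment format are hypothetical.

```python
import math


def skew_angle_degrees(segments):
    """Return the angle between the longest segment and the horizontal.

    segments: list of (x1, y1, x2, y2) tuples (assumed detector output).
    The result is folded into (-90, 90] so that the direction in which the
    segment was recorded does not matter.
    """
    x1, y1, x2, y2 = max(segments, key=lambda s: math.hypot(s[2] - s[0], s[3] - s[1]))
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    if angle > 90:
        angle -= 180
    elif angle <= -90:
        angle += 180
    return angle


def rotate_point(x, y, angle_deg, cx=0.0, cy=0.0):
    """Rotate (x, y) about (cx, cy) by -angle_deg, undoing a skew of angle_deg."""
    t = math.radians(-angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(t) - dy * math.sin(t),
            cy + dx * math.sin(t) + dy * math.cos(t))
```

Applying `rotate_point` with the measured angle to every pixel coordinate (or, in practice, an equivalent image rotation) brings the longest line back to the horizontal.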

As a further improved technical scheme of the invention, the step 4 comprises the following steps:

(1) setting a range of a length threshold and a range of a width threshold of the font outline;

(2) carrying out outline retrieval on the header area, screening out outlines with the length within the range of the length threshold value of the font outline and the width within the range of the width threshold value of the font outline, wherein the screened outlines are the font outlines;

(3) fusing the font outlines: extracting the colors of the font outlines, and merging font outlines that have similar colors and are spaced less than half of the font outline width apart into a complete field.
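By way of illustration only, the screening and fusing of step 4 may be sketched as follows. The sketch assumes a single line of header text, represents each outline by its bounding rectangle plus one mean gray value standing in for its color, and uses arbitrary threshold ranges; these choices and all names are assumptions, not part of the method as described.

```python
def filter_glyphs(boxes, w_range=(8, 40), h_range=(8, 40)):
    """Keep outlines whose width and height fall inside the threshold ranges.

    boxes: list of (x, y, width, height, mean_gray) tuples.
    """
    return [b for b in boxes
            if w_range[0] <= b[2] <= w_range[1] and h_range[0] <= b[3] <= h_range[1]]


def merge_into_fields(glyphs, color_tol=30):
    """Merge neighboring glyph boxes into field boxes, scanning left to right.

    A glyph joins the current field when the horizontal gap is less than half
    of the glyph width and the colors differ by at most color_tol.
    """
    fields = []
    for x, y, w, h, color in sorted(glyphs, key=lambda b: b[0]):
        if fields:
            fx, fy, fw, fh, fcolor = fields[-1]
            gap = x - (fx + fw)
            if gap < w / 2 and abs(color - fcolor) <= color_tol:
                right = max(fx + fw, x + w)
                top = min(fy, y)
                bottom = max(fy + fh, y + h)
                fields[-1] = (fx, top, right - fx, bottom - top, fcolor)
                continue
        fields.append((x, y, w, h, color))
    return fields
```

The size filter discards speckle noise and long table lines before merging, so only character-sized outlines contribute to fields.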

As a further improved technical scheme of the invention, the step 5 comprises the following steps:

(1) detecting the number of fields of the header area;

(2) if the number of fields is 0, the image is not classified;

(3) if the number of the fields is 1, classifying the image by adopting a method of a machine learning SVM classifier;

(4) if the number of fields is greater than 1, recognizing the text of the header area by OCR and matching it against the type names in the image recognition library to classify the image; the number of correctly matched characters is divided by the total number of characters being matched, and the result is compared with a preset threshold; if the result is greater than the preset threshold, classification succeeds; otherwise, classification fails.
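By way of illustration only, the branching of step 5 may be sketched as follows. The sketch assumes the header fields have already been converted to text, reads the matching ratio as correctly matched characters divided by the length of the candidate type name, and treats the SVM classifier as a hypothetical callback; the threshold value and all names are assumptions.

```python
from collections import Counter


def match_ratio(recognized, type_name):
    """Share of the characters of type_name that also occur in recognized."""
    if not type_name:
        return 0.0
    overlap = Counter(recognized) & Counter(type_name)
    return sum(overlap.values()) / len(type_name)


def classify_header(fields, type_names, threshold=0.8, svm_classify=None):
    """Return the matched type name, or None when classification fails.

    fields: list of header field texts (assumed OCR output).
    svm_classify: hypothetical stand-in for a pre-trained SVM classifier,
    called only when exactly one field is present.
    """
    if len(fields) == 0:
        return None  # nothing to classify
    if len(fields) == 1:
        return svm_classify(fields[0]) if svm_classify else None
    text = "".join(fields)
    best = max(type_names, key=lambda name: match_ratio(text, name), default=None)
    if best is not None and match_ratio(text, best) > threshold:
        return best
    return None
```

A ratio below 1 tolerates a few OCR misreads in the header, while the threshold rejects headers that resemble no known type.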

As a further improved technical scheme of the invention, the step 6 comprises the following steps:

(1) loading the template information which is manufactured in advance;

(2) extracting the images successfully classified in the step 5, and finding all outlines having surrounding frames within the outline of the maximum surrounding frame in the image;

(3) setting a length threshold range and a width threshold range of the surrounding frame, and screening out the surrounding frames whose length is within the length threshold range and whose width is within the width threshold range;

(4) sorting all the surrounding frames from top to bottom and from left to right according to the position information of the screened surrounding frames, thereby locating the form, and searching for the area to be identified in the form according to the template information;

(5) judging, according to the template information, whether information outside the form needs to be identified; if so, extracting the field outlines outside the form, screening and fusing the font outlines outside the form by the method of step 4 so that they are merged into complete fields, determining the area to be identified outside the form according to the relative position, recorded in the template information, between the field and the maximum surrounding frame in the image, and locating the fields to be identified outside the maximum surrounding frame according to the fields recorded in the template information.
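By way of illustration only, the sorting and template lookup inside the form may be sketched as follows. The row tolerance, the template format (field name mapped to a row and column index) and all names are assumptions made for this sketch; the text above does not define how template information is stored.

```python
def sort_cells(boxes, row_tol=10):
    """Order cell boxes top to bottom, then left to right.

    boxes: list of (x, y, width, height) tuples.
    Boxes whose top edge lies within row_tol of the first box of the current
    row are treated as belonging to that row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):
        if rows and abs(box[1] - rows[-1][0][1]) <= row_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [sorted(row, key=lambda b: b[0]) for row in rows]


def locate_regions(rows, template):
    """Look up the cell box for each field named in the template.

    template: dict mapping field name -> (row_index, col_index); this format
    is hypothetical. Entries that fall outside the detected grid are skipped.
    """
    return {name: rows[r][c]
            for name, (r, c) in template.items()
            if r < len(rows) and c < len(rows[r])}
```

Grouping by rows with a tolerance keeps cells of the same table row together even when residual skew shifts their top edges by a few pixels.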

As a further improved technical scheme of the invention, the step 7 comprises the following steps:

(1) intercepting a field picture according to the position information of the area to be identified in the step 6;

(2) recognizing the located field by OCR.

As a further improved technical scheme of the invention, the step 8 comprises the following steps:

(1) extracting a field identified by the OCR;

(2) optimizing by field type: for amount-in-figures (lowercase amount) fields, removing the non-digit parts; for date fields, filtering out blank spaces, non-digit characters, and the characters for year, month and day;

(3) dictionary optimization: a dictionary library is established, and the fields recognized by OCR are matched against the fields in the dictionary library; if the matching score is greater than a preset threshold, the field recognized by OCR is replaced with the matching field in the dictionary library; manually confirmed correct fields are added to the dictionary library so that it is continuously optimized and updated; the matching score is equal to the number of characters matched in the current dictionary library divided by the total number of characters recognized by OCR.
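By way of illustration only, the field optimization of step 8 may be sketched as follows. The sketch assumes that the decimal point is kept in amount fields, that a dictionary hit means the dictionary entry replaces the OCR output, and that the matching score is the number of matched characters divided by the number of characters recognized by OCR; these readings, the threshold value and all names are assumptions.

```python
import re
from collections import Counter


def clean_amount(text):
    """Keep only digits and the decimal point of an amount field."""
    return re.sub(r"[^0-9.]", "", text)


def clean_date(text):
    """Drop spaces, year/month/day characters and any other non-digit."""
    return re.sub(r"\D", "", text)


def match_score(ocr_text, entry):
    """Matched characters divided by the number of OCR characters."""
    if not ocr_text:
        return 0.0
    overlap = Counter(ocr_text) & Counter(entry)
    return sum(overlap.values()) / len(ocr_text)


def correct_with_dictionary(ocr_text, dictionary, score_threshold=0.8):
    """Return the best dictionary entry when its score exceeds the threshold,
    otherwise return the OCR text unchanged."""
    best = max(dictionary, key=lambda e: match_score(ocr_text, e), default=None)
    if best is not None and match_score(ocr_text, best) > score_threshold:
        return best
    return ocr_text
```

Entries confirmed manually can simply be appended to `dictionary`, so later recognitions of the same field benefit from earlier corrections.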

The method rapidly classifies formatted fax files and extracts their information, with high classification speed, accurate classification, and high information extraction accuracy. Some prior art retrieves and classifies fax images but cannot extract field information; other prior art recognizes images but cannot recognize formatted fax images. An effective method for extracting information from formatted fax files therefore has not existed so far; the method provided herein fills this technical gap, improves office work efficiency, frees up productivity, and saves labor cost.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The following further illustrates an embodiment of the invention according to fig. 1:

Referring to fig. 1, the present embodiment is applicable to any formatted fax, that is, an image fax containing a form. The present embodiment takes a fax of a bill as an example, specifically as follows:

step 1: acquiring an image file of a fax of a bill, and binarizing the image with an adaptive threshold to reduce noise interference;

step 3: finding the outline of the maximum surrounding frame of the table in the corrected image, and cropping the bill header area of the image from the upper area of the maximum surrounding frame of the table in the image;

step 6: extracting the successfully classified images, and locating the areas to be identified in them (both inside and outside the table);

step 8: optimizing the recognized fields.

In this embodiment, the step 1 specifically includes the following steps:

(1) acquiring a fax image file, converting the image into the HSV color space, and removing pixels that fall within the red interval (that is, removing red seals);

Preferably, the step 2 is to find the longest straight line in the image, and perform rotation correction on the image according to an included angle between the longest straight line and the horizontal direction.

In this embodiment, the step 4 includes the following steps:

(2) carrying out outline retrieval on the header area, screening out the outline of which the length is within the range of the length threshold value of the font outline and the width is within the range of the width threshold value of the font outline, wherein the screened outline is the font outline;

(3) fusing the font outlines: extracting the colors of the font outlines, and merging font outlines that have the same color and are spaced less than half of the font outline width apart into a complete field.

In this embodiment, the step 5 includes the following steps:

(1) detecting the number of fields of the header area;

(2) if the number of the fields is 0, the images are not classified, and the process exits;

(3) if the number of fields is 1, classifying the image with a machine learning SVM classifier; the SVM classifier must be trained in advance on a large number of headers, and the process exits directly for bills that the SVM classifier cannot distinguish; this embodiment uses a prior-art machine learning SVM classifier;

(4) if the number of fields is greater than 1, recognizing the text of the header area by OCR and matching it against the type names in the image recognition library to classify the image; the number of correctly matched characters is divided by the total number of characters being matched, and the result is compared with a preset threshold Thr; if the result is greater than the preset threshold, classification succeeds; otherwise, classification fails and the process exits.

Preferably, the step 6 comprises the following steps:

(1) preparing template information in advance and loading it;

(2) extracting the successfully classified images, and finding all outlines having surrounding frames within the outline of the maximum surrounding frame in the image;

(3) setting a length threshold range and a width threshold range of the surrounding frame, and screening out the surrounding frames whose length is within the length threshold range and whose width is within the width threshold range;

(4) sorting all the surrounding frames from top to bottom and from left to right according to the position information of the screened surrounding frames, thereby locating the table, and searching for the region to be identified in the table according to the template information (the position of the region to be identified is judged from the template information, so as to determine whether the region lies outside the table);

(5) judging, according to the template information, whether information outside the form needs to be identified; if so, extracting the field outlines outside the form, screening and fusing the font outlines outside the form by the method of step 4 so that they are merged into complete fields, determining the area to be identified outside the form according to the relative position, recorded in the template information, between the field and the maximum surrounding frame in the image, and locating the fields to be identified outside the maximum surrounding frame according to the fields recorded in the template information.

In this embodiment, the step 7 includes the following steps:

(2) recognizing the located field by OCR.

In this embodiment, the step 8 includes the following steps:

(1) extracting a field identified by the OCR;

(2) optimizing by field type: for amount-in-figures (lowercase amount) fields, removing the non-digit parts; for date fields, filtering out blank spaces, non-digit characters, and the characters for 'year, month and day';

(3) dictionary optimization: a dictionary library is established, and the fields recognized by OCR are matched against the fields in the dictionary library; if the matching score is greater than a preset threshold scoreThr, the field recognized by OCR is replaced with the matching field in the dictionary library; manually confirmed correct fields are continuously added to the dictionary library so that it is optimized and updated; the matching score is equal to the number of characters matched in the dictionary library divided by the total number of characters recognized by OCR.

The scope of the present invention includes, but is not limited to, the above embodiments, and the present invention is defined by the appended claims, and any alterations, modifications, and improvements that may occur to those skilled in the art are all within the scope of the present invention.

Claims

1. A method for classifying and extracting information of formatted fax based on OCR is characterized in that: the method specifically comprises the following steps:

step 6: loading template information prepared in advance; extracting the successfully classified images; finding all outlines having enclosing frames within the outline of the largest enclosing frame in the image; setting a length threshold range and a width threshold range for the enclosing frames, and screening out the enclosing frames whose length is within the length threshold range and whose width is within the width threshold range; sorting all the enclosing frames from top to bottom and from left to right according to the position information of the screened enclosing frames, thereby locating the form; searching for the area to be identified in the form according to the template information; judging, according to the template information, whether information outside the form needs to be identified, and if so, extracting the field outlines outside the form, screening and fusing the font outlines outside the form by the method of step 4 so that they are merged into complete fields, determining the area to be identified outside the form according to the relative position, recorded in the template information, between the field and the largest enclosing frame in the image, and locating the fields to be identified outside the largest enclosing frame according to the fields recorded in the template information;

step 8: performing type optimization and dictionary optimization on the different recognized fields.

2. An OCR based formatted facsimile classification and information extraction method as recited in claim 1, further comprising:

the step 1 specifically comprises the following steps:

3. An OCR based formatted facsimile classification and information extraction method as recited in claim 2, further comprising:

the step 2 comprises finding the longest straight line in the image, and performing rotation correction on the image according to the included angle between the longest straight line and the horizontal direction.

4. An OCR-based formatted fax classification and information extraction method according to claim 3, wherein: the step 4 comprises the following steps:

5. An OCR-based formatted fax classification and information extraction method according to claim 4, wherein:

the step 5 comprises the following steps:

(1) detecting the number of fields of the header area;

(2) if the number of fields is 0, the image is not classified;

6. An OCR based formatted facsimile classification and information extraction method as recited in claim 1, further comprising:

the step 7 comprises the following steps:

(2) recognizing the located field by OCR.

7. An OCR based formatted facsimile classification and information extraction method as recited in claim 1, further comprising:

the step 8 comprises the following steps:

(1) extracting a field identified by the OCR;