CN107133621B - Method for classifying and extracting information of formatted fax based on OCR - Google Patents

Method for classifying and extracting information of formatted fax based on OCR

Info

Publication number
CN107133621B
CN107133621B CN201710334784.5A CN201710334784A CN107133621B CN 107133621 B CN107133621 B CN 107133621B CN 201710334784 A CN201710334784 A CN 201710334784A CN 107133621 B CN107133621 B CN 107133621B
Authority
CN
China
Prior art keywords
image
fields
ocr
font
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710334784.5A
Other languages
Chinese (zh)
Other versions
CN107133621A (en)
Inventor
于志文
车少帅
胡笳
吴洲洋
周玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Clp Hongxin Information Technology Co ltd
Original Assignee
Clp Hongxin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Clp Hongxin Information Technology Co ltd
Priority to CN201710334784.5A
Publication of CN107133621A
Application granted
Publication of CN107133621B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G06V10/242 Aligning, centring, orientation detection or correction of the image by image rotation, e.g. by 90 degrees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/28 Quantising the image, e.g. histogram thresholding for discrimination between background and foreground patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/158 Segmentation of character regions using character size, text spacings or pitch estimation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The invention discloses an OCR (optical character recognition) based method for classifying formatted faxes and extracting information from them, comprising the following steps: binarizing the fax image with an adaptive threshold; correcting the skew of the image; finding the outline of the largest bounding box of the table in the corrected image, and cropping the header area of the image from the region above that bounding box; screening the font outlines in the header area and fusing them into fields; counting the fields in the header area after fusion, and classifying the image; taking the successfully classified images, and locating the regions to be identified in them; recognizing the fields in the regions to be identified by OCR; and optimizing the recognized fields. The invention improves office efficiency, frees staff from manual work, and converts unstructured data into structured data. It is suitable for formatted faxes, that is, faxes of form images such as standardized contracts, homemade certificates, and bills.

Description

Method for classifying and extracting information of formatted fax based on OCR
Technical Field
The invention relates to the field of image processing, in particular to a method for classifying and extracting information of a formatted fax based on OCR.
Background
With the development of science and technology, business communication across countries and regions has become increasingly frequent, and faxes are widely used in office systems because, compared with other means of document transmission, they carry special legal effect. Formatted fax documents contain a large amount of useful information, yet at present they must be classified manually and their important information must be extracted manually, which is inefficient. An efficient and fast method for document classification and information extraction is urgently needed to improve staff efficiency, reduce labor costs, and free up productivity.
Chinese patent publication No. CN101876999 discloses a method for generating a fax index, a message analyzing device and a fax retrieval system, which perform layout analysis on a fax message, extract feature information in the fax message, establish a tag for the fax message according to the extracted feature information, and use the tag as an index of the fax message, so that a user can search for a corresponding fax message according to the tag. However, the system can only realize the classification and the indexing of the files, and the extraction of the key information in the files is difficult to realize.
Chinese patent publication No. CN102222289 discloses a mobile phone financial management method and system based on OCR, which analyzes and identifies financial bills by means of OCR technology, but cannot classify fax images and extract information for formatted scanned fax documents.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art by providing an OCR (optical character recognition) based method for classifying formatted faxes and extracting information from them. The method improves office efficiency, frees staff from manual work, and converts unstructured data into structured data.
In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:
A method for classifying and extracting information of formatted faxes based on OCR comprises the following steps:
step 1: acquiring a fax image file, and binarizing the image with an adaptive threshold to reduce noise interference;
step 2: determining the skew angle of the image, and correcting the image;
step 3: finding the outline of the largest bounding box of the table in the corrected image, and cropping the header area of the image from the region above that bounding box;
step 4: screening the font outlines in the header area and fusing them, so that the font outlines are merged into complete fields;
step 5: counting the fields in the header area after fusion, and classifying the image according to the number and the content of the header fields;
step 6: taking the successfully classified images, and locating the regions to be identified in them;
step 7: recognizing the fields of the regions to be identified by OCR (optical character recognition), according to the positions of those regions in the table;
step 8: optimizing the recognized fields.
As a further improved technical solution of the present invention, the step 1 specifically includes the following steps:
(1) acquiring a fax image file, converting the image into the HSV color space, and removing the pixels that fall within the red interval;
(2) determining the binarization threshold at each pixel position according to the distribution of pixel values in the neighborhood block of that pixel, and binarizing the image with this adaptive threshold to reduce noise interference.
As a further improved technical solution of the present invention, step 2 includes finding the longest straight line in the image and rotating the image to correct it according to the angle between that line and the horizontal direction.
As a further improved technical scheme of the invention, the step 4 comprises the following steps:
(1) setting a length threshold range and a width threshold range for font outlines;
(2) performing outline retrieval on the header area, and screening out the outlines whose length lies within the length threshold range and whose width lies within the width threshold range; the screened outlines are the font outlines;
(3) fusing the font outlines: extracting the color of each font outline, and merging into one complete field the characters whose font outlines have similar colors and are separated by less than half the width of a font outline.
As a further improved technical scheme of the invention, the step 5 comprises the following steps:
(1) counting the fields in the header area;
(2) if the number of fields is 0, the image is not classified;
(3) if the number of fields is 1, classifying the image with a machine-learning SVM classifier;
(4) if the number of fields is greater than 1, recognizing the characters of the header area by OCR and matching them against the type names in the image recognition library to perform classification; the number of correctly matched characters is divided by the total number of characters in the fields, and the result is compared with a preset threshold; if the result is greater than the preset threshold, classification succeeds; otherwise classification fails.
As a further improved technical scheme of the invention, the step 6 comprises the following steps:
(1) loading the template information prepared in advance;
(2) taking the images successfully classified in step 5, and finding all the bounding-box outlines contained within the outline of the largest bounding box in the image;
(3) setting a length threshold range and a width threshold range for the bounding boxes, and screening out the bounding boxes whose length and width lie within those ranges;
(4) according to the position information of the screened bounding boxes, scanning and sorting all the bounding boxes from top to bottom and from left to right to locate the table, and finding the regions to be identified in the table according to the template information;
(5) judging from the template information whether anything outside the table needs to be identified; if so, extracting the outlines of the fields outside the table, screening and fusing the font outlines outside the table by the method of step 4 so that they are merged into complete fields, determining the regions to be identified outside the table from the relative position, recorded in the template information, between each field and the largest bounding box in the image, and locating the fields to be identified outside the largest bounding box according to the fields recorded in the template information.
As a further improved technical scheme of the invention, the step 7 comprises the following steps:
(1) cropping the field image according to the position information of the region to be identified obtained in step 6;
(2) recognizing the located field by OCR.
As a further improved technical scheme of the invention, the step 8 comprises the following steps:
(1) extracting the fields recognized by OCR;
(2) optimizing by field type: removing the non-digit parts of lowercase amount fields, that is, amounts written in digits; removing spaces, non-digit characters, and the characters for 'year', 'month', and 'day' from date fields;
(3) dictionary optimization: a dictionary library is built and the fields recognized by the OCR are matched against the fields in it; if the matching score is greater than a preset threshold, the fields in the dictionary library are replaced with the fields recognized by the OCR so as to optimize and update the dictionary library, and manually confirmed correct fields are added to the dictionary library; the matching score is equal to the total number of words recognized by the OCR divided by the total number of matched words in the current dictionary library.
The method classifies formatted fax files and extracts their information quickly, with high classification speed, accurate classification, and high extraction accuracy. Some prior art retrieves and classifies fax images but cannot extract field information; other prior art recognizes images but cannot recognize formatted fax images. No effective method for extracting information from formatted fax files therefore exists at present. The method provided herein fills this technical gap, improves office efficiency, frees up productivity, and saves labor costs.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
An embodiment of the invention is further described below with reference to FIG. 1.
Referring to FIG. 1, this embodiment is applicable to any formatted fax, a formatted fax being a fax image that contains a table; this embodiment takes the fax of a bill as an example, specifically as follows:
A method for classifying and extracting information of formatted faxes based on OCR comprises the following steps:
step 1: acquiring the image file of a faxed bill, and binarizing the image with an adaptive threshold to reduce noise interference;
step 2: determining the skew angle of the image, and correcting the image;
step 3: finding the outline of the largest bounding box of the table in the corrected image, and cropping the header area of the image from the region above that bounding box;
step 4: screening the font outlines in the header area and fusing them, so that the font outlines are merged into complete fields;
step 5: counting the fields in the header area after fusion, and classifying the image according to the number and the content of the header fields;
step 6: taking the successfully classified images, and locating the regions to be identified in them (both inside and outside the table);
step 7: recognizing the fields of the regions to be identified by OCR (optical character recognition), according to the positions of those regions in the table;
step 8: optimizing the recognized fields.
In this embodiment, the step 1 specifically includes the following steps:
(1) acquiring a fax image file, converting the image into the HSV color space, and removing the pixels that fall within the red interval (that is, removing red seals);
(2) determining the binarization threshold at each pixel position according to the distribution of pixel values in the neighborhood block of that pixel, and binarizing the image with this adaptive threshold to reduce noise interference.
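The two sub-steps of step 1 can be illustrated with a small pure-Python sketch. It is not the patented implementation: the hue interval treated as red, the saturation and value floors, the 3x3 block size, and the offset are all assumed values, and the function names `remove_red`, `to_gray`, and `adaptive_binarize` are hypothetical. The thresholding shown is a plain local-mean rule, which is one common form of adaptive thresholding.

```python
import colorsys


def remove_red(rgb_image, sat_min=0.4, val_min=0.3):
    """Replace pixels whose HSV hue falls in an assumed red interval with white."""
    cleaned = []
    for row in rgb_image:
        new_row = []
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            # Red wraps around hue 0, so both ends of the hue circle are checked.
            is_red = (h <= 1 / 12 or h >= 11 / 12) and s >= sat_min and v >= val_min
            new_row.append((255, 255, 255) if is_red else (r, g, b))
        cleaned.append(new_row)
    return cleaned


def to_gray(rgb_image):
    """Convert RGB pixels to integer luma values in 0..255."""
    return [[(r * 299 + g * 587 + b * 114) // 1000 for r, g, b in row]
            for row in rgb_image]


def adaptive_binarize(gray, block=3, offset=5):
    """Mark a pixel as foreground (1) when it is darker than the mean of its
    neighborhood block minus an offset; everything else is background (0)."""
    height, width = len(gray), len(gray[0])
    half = block // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ys = range(max(0, y - half), min(height, y + half + 1))
            xs = range(max(0, x - half), min(width, x + half + 1))
            vals = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(vals) / len(vals)
            out[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return out
```

Because the threshold is computed per pixel from its own neighborhood, uneven fax background brightness affects the result less than a single global threshold would.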
Preferably, step 2 finds the longest straight line in the image and rotates the image to correct it according to the angle between that line and the horizontal direction.
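The angle computation behind this rotation correction can be sketched as follows. The text does not say how the straight lines are detected, so the sketch assumes that line segments `(x1, y1, x2, y2)` have already been produced by some line detector; the function names are hypothetical.

```python
import math


def longest_segment(segments):
    """Return the segment (x1, y1, x2, y2) with the greatest Euclidean length."""
    return max(segments, key=lambda s: math.hypot(s[2] - s[0], s[3] - s[1]))


def skew_angle_degrees(segment):
    """Angle between the segment and the horizontal axis, folded into (-90, 90]."""
    x1, y1, x2, y2 = segment
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    if angle > 90:
        angle -= 180
    elif angle <= -90:
        angle += 180
    return angle


def rotate_point(x, y, angle_deg, cx=0.0, cy=0.0):
    """Rotate (x, y) about (cx, cy) by -angle_deg, undoing the measured skew."""
    t = math.radians(-angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(t) - dy * math.sin(t),
            cy + dx * math.sin(t) + dy * math.cos(t))
```

Applying `rotate_point` with the measured angle to every pixel coordinate (or passing the same angle to an image library's rotation routine) brings the longest line back to horizontal.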
In this embodiment, the step 4 includes the following steps:
(1) setting a length threshold range and a width threshold range for font outlines;
(2) performing outline retrieval on the header area, and screening out the outlines whose length lies within the length threshold range and whose width lies within the width threshold range; the screened outlines are the font outlines;
(3) fusing the font outlines: extracting the color of each font outline, and merging into one complete field the characters whose font outlines have the same color and are separated by less than half the width of a font outline.
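The fusion rule of step 4 can be sketched on bounding boxes alone. This is a minimal illustration under several assumptions: each outline is given as a dict with a `box` `(x, y, w, h)` and a `color` `(r, g, b)`, all outlines lie on one text line, the "width of the font outline" is taken to be the width of the outline being added, and colors count as matching when each channel differs by at most a tolerance. None of these details are fixed by the text, and the names are hypothetical.

```python
def similar_color(c1, c2, tol=40):
    """True when every RGB channel differs by at most tol (assumed criterion)."""
    return all(abs(a - b) <= tol for a, b in zip(c1, c2))


def fuse_outlines(outlines, color_tol=40):
    """Merge left-to-right neighbors into fields when the horizontal gap is
    below half the current outline's width and the colors are similar."""
    fields = []
    for item in sorted(outlines, key=lambda o: o["box"][0]):
        x, y, w, h = item["box"]
        if fields:
            last = fields[-1]
            lx, ly, lw, lh = last["box"]
            gap = x - (lx + lw)
            if gap < w / 2 and similar_color(item["color"], last["color"], color_tol):
                # Grow the last field so that it also covers this outline.
                top = min(ly, y)
                bottom = max(ly + lh, y + h)
                right = max(lx + lw, x + w)
                last["box"] = (lx, top, right - lx, bottom - top)
                continue
        fields.append({"box": (x, y, w, h), "color": item["color"]})
    return [f["box"] for f in fields]
```

The number of boxes returned is the field count that step 5 uses to choose a classification path.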
In this embodiment, the step 5 includes the following steps:
(1) counting the fields in the header area;
(2) if the number of fields is 0, the image is not classified and the process exits;
(3) if the number of fields is 1, classifying the image with a machine-learning SVM classifier; the SVM classifier must be trained in advance on a large number of headers, and the process exits directly for bills that the SVM classifier cannot distinguish; this embodiment uses a machine-learning SVM classifier from the prior art;
(4) if the number of fields is greater than 1, recognizing the characters of the header area by OCR and matching them against the type names in the image recognition library to perform classification; the number of correctly matched characters is divided by the total number of characters in the fields, and the result is compared with a preset threshold Thr; if the result is greater than the preset threshold, classification succeeds; otherwise classification fails and the process exits.
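The three-way branch of step 5 can be sketched as below. The translated description of the matching ratio is ambiguous, so this sketch assumes the ratio is the number of type-name characters found in order in the OCR text, divided by the length of the type name. The `ocr` and `svm_predict` arguments are hypothetical stand-ins for a real OCR engine and a trained SVM classifier, and the threshold value 0.8 is an arbitrary example.

```python
def match_ratio(recognized, type_name):
    """Fraction of the characters of type_name found, in order, in recognized."""
    matched, pos = 0, 0
    for ch in type_name:
        idx = recognized.find(ch, pos)
        if idx != -1:
            matched += 1
            pos = idx + 1
    return matched / len(type_name) if type_name else 0.0


def classify_header(field_images, ocr, type_names, svm_predict, threshold=0.8):
    """Return the fax type, or None when the image cannot be classified."""
    count = len(field_images)
    if count == 0:
        return None                          # no header field: not classified
    if count == 1:
        return svm_predict(field_images[0])  # single field: SVM path
    # Several fields: OCR them and compare against the known type names.
    text = "".join(ocr(img) for img in field_images)
    best = max(type_names, key=lambda name: match_ratio(text, name))
    return best if match_ratio(text, best) > threshold else None
```

In-order matching tolerates isolated OCR errors such as a letter read as a digit, while the threshold rejects headers that resemble no known type.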
Preferably, the step 6 comprises the following steps:
(1) preparing template information, and loading the template information prepared in advance;
(2) taking the successfully classified images, and finding all the bounding-box outlines contained within the outline of the largest bounding box in the image;
(3) setting a length threshold range and a width threshold range for the bounding boxes, and screening out the bounding boxes whose length and width lie within those ranges;
(4) according to the position information of the screened bounding boxes, scanning and sorting all the bounding boxes from top to bottom and from left to right to locate the table, and finding the regions to be identified in the table according to the template information (the position of each region to be identified is judged from the template information, which determines whether the region lies inside or outside the table);
(5) determining from the template information whether anything outside the table needs to be identified; if so, extracting the outlines of the fields outside the table, screening and fusing the font outlines outside the table by the method of step 4 so that they are merged into complete fields, determining the regions to be identified outside the table from the relative position, recorded in the template information, between each field and the largest bounding box in the image, and locating the fields to be identified outside the largest bounding box according to the fields recorded in the template information.
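The cell location in step 6 can be sketched with plain bounding boxes `(x, y, w, h)`. Everything about the template here is an assumption made for illustration: it is modeled as a mapping from field name to cell index in reading order, and regions outside the table are modeled as offsets expressed as fractions of the table size. The row tolerance and the function names are likewise hypothetical.

```python
def inside(inner, outer):
    """True when box inner lies completely within box outer."""
    ix, iy, iw, ih = inner
    ox, oy, ow, oh = outer
    return ix >= ox and iy >= oy and ix + iw <= ox + ow and iy + ih <= oy + oh


def locate_cells(boxes, w_range, h_range, row_tol=10):
    """Pick the largest box as the table, keep inner boxes of plausible size,
    and order them top to bottom, then left to right."""
    table = max(boxes, key=lambda b: b[2] * b[3])
    cells = [b for b in boxes
             if b != table and inside(b, table)
             and w_range[0] <= b[2] <= w_range[1]
             and h_range[0] <= b[3] <= h_range[1]]
    cells.sort(key=lambda b: (b[1], b[0]))
    rows = []
    for cell in cells:
        # Boxes whose top edges differ by at most row_tol share a row.
        if rows and abs(cell[1] - rows[-1][0][1]) <= row_tol:
            rows[-1].append(cell)
        else:
            rows.append([cell])
    ordered = [c for row in rows for c in sorted(row, key=lambda b: b[0])]
    return table, ordered


def regions_from_template(ordered_cells, template):
    """Map each template field name to its cell (template format is assumed)."""
    return {name: ordered_cells[i] for name, i in template.items()
            if i < len(ordered_cells)}


def outside_region(table, relative_box):
    """Turn offsets given as fractions of the table size into pixel coordinates."""
    tx, ty, tw, th = table
    rx, ry, rw, rh = relative_box
    return (round(tx + rx * tw), round(ty + ry * th), round(rw * tw), round(rh * th))
```

Grouping by row before sorting by x keeps the reading order stable when the cells of one row differ by a few pixels in height after rotation correction.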
In this embodiment, the step 7 includes the following steps:
(1) cropping the field image according to the position information of the region to be identified obtained in step 6;
(2) recognizing the located field by OCR.
In this embodiment, the step 8 includes the following steps:
(1) extracting the fields recognized by OCR;
(2) optimizing by field type: removing the non-digit parts of lowercase amount fields, that is, amounts written in digits; removing spaces, non-digit characters, and the characters for 'year', 'month', and 'day' from date fields;
(3) dictionary optimization: a dictionary library is built and the fields recognized by the OCR are matched against the fields in it; if the matching score is greater than a preset threshold scorETR, the fields in the dictionary library are replaced with the fields recognized by the OCR so as to optimize and update the dictionary library, and manually confirmed correct fields are continuously added to the dictionary library; the matching score is equal to the total number of words recognized by the OCR divided by the total number of matched words in the dictionary library.
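The field clean-up of step 8 can be sketched as follows. The translated text is ambiguous about both the direction of the dictionary replacement and the numerator and denominator of the matching score, so this sketch makes its own assumptions: the score is the share of character positions on which the OCR result and a dictionary entry agree, and a sufficiently close dictionary entry is returned as the corrected value. Keeping the decimal point in amount fields is also an assumption, and all function names and the threshold 0.7 are hypothetical.

```python
import re


def clean_amount(text):
    """Keep digits and the decimal point of an amount field (keeping '.' is assumed)."""
    return re.sub(r"[^0-9.]", "", text)


def clean_date(text):
    """Keep only ASCII digits, dropping spaces and year/month/day markers."""
    return re.sub(r"[^0-9]", "", text)


def match_score(recognized, entry):
    """Share of positions on which the two strings agree (assumed definition)."""
    if not recognized or not entry:
        return 0.0
    same = sum(1 for a, b in zip(recognized, entry) if a == b)
    return same / max(len(recognized), len(entry))


def correct_with_dictionary(recognized, dictionary, threshold=0.7):
    """Return the closest dictionary entry when its score exceeds the threshold,
    otherwise return the OCR result unchanged."""
    best = max(dictionary, key=lambda e: match_score(recognized, e), default=None)
    if best is not None and match_score(recognized, best) > threshold:
        return best
    return recognized
```

Entries that a person has confirmed as correct can simply be appended to `dictionary`, which mirrors the manual supplementation described above.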
The method classifies formatted fax files and extracts their information quickly, with high classification speed, accurate classification, and high extraction accuracy. Some prior art retrieves and classifies fax images but cannot extract field information; other prior art recognizes images but cannot recognize formatted fax images. No effective method for extracting information from formatted fax files therefore exists at present. The method provided herein fills this technical gap, improves office efficiency, frees up productivity, and saves labor costs.
The scope of the present invention includes, but is not limited to, the above embodiments, and the present invention is defined by the appended claims, and any alterations, modifications, and improvements that may occur to those skilled in the art are all within the scope of the present invention.

Claims (7)

1. A method for classifying and extracting information of formatted fax based on OCR, characterized in that the method specifically comprises the following steps:
step 1: acquiring a fax image file, and binarizing the image with an adaptive threshold to reduce noise interference;
step 2: determining the skew angle of the image, and correcting the image;
step 3: finding the outline of the largest bounding box of the table in the corrected image, and cropping the header area of the image from the region above that bounding box;
step 4: screening the font outlines in the header area and fusing them, so that the font outlines are merged into complete fields;
step 5: counting the fields in the header area after fusion, and classifying the image according to the number and the content of the header fields;
step 6: loading template information prepared in advance; taking the successfully classified images and finding all the bounding-box outlines contained within the outline of the largest bounding box in the image; setting a length threshold range and a width threshold range for the bounding boxes, and screening out the bounding boxes whose length and width lie within those ranges; according to the position information of the screened bounding boxes, scanning and sorting all the bounding boxes from top to bottom and from left to right to locate the table; finding the regions to be identified in the table according to the template information; judging from the template information whether anything outside the table needs to be identified, and if so, extracting the outlines of the fields outside the table, screening and fusing the font outlines outside the table by the method of step 4 so that they are merged into complete fields, determining the regions to be identified outside the table from the relative position, recorded in the template information, between each field and the largest bounding box in the image, and locating the fields to be identified outside the largest bounding box according to the fields recorded in the template information;
step 7: recognizing the fields of the regions to be identified by OCR (optical character recognition), according to the positions of those regions in the table;
step 8: performing type optimization and dictionary optimization on the different recognized fields.
2. An OCR-based formatted fax classification and information extraction method according to claim 1, wherein:
the step 1 specifically comprises the following steps:
(1) acquiring a fax image file, converting the image into the HSV color space, and removing the pixels that fall within the red interval;
(2) determining the binarization threshold at each pixel position according to the distribution of pixel values in the neighborhood block of that pixel, and binarizing the image with this adaptive threshold to reduce noise interference.
3. An OCR-based formatted fax classification and information extraction method according to claim 2, wherein:
the step 2 comprises finding the longest straight line in the image, and rotating the image to correct it according to the angle between that line and the horizontal direction.
4. An OCR-based formatted fax classification and information extraction method according to claim 3, wherein: the step 4 comprises the following steps:
(1) setting a length threshold range and a width threshold range for font outlines;
(2) performing outline retrieval on the header area, and screening out the outlines whose length lies within the length threshold range and whose width lies within the width threshold range; the screened outlines are the font outlines;
(3) fusing the font outlines: extracting the color of each font outline, and merging into one complete field the characters whose font outlines have similar colors and are separated by less than half the width of a font outline.
5. An OCR-based formatted fax classification and information extraction method according to claim 4, wherein:
the step 5 comprises the following steps:
(1) counting the fields in the header area;
(2) if the number of fields is 0, the image is not classified;
(3) if the number of fields is 1, classifying the image with a machine-learning SVM classifier;
(4) if the number of fields is greater than 1, recognizing the characters of the header area by OCR and matching them against the type names in the image recognition library to perform classification; the number of correctly matched characters is divided by the total number of characters in the fields, and the result is compared with a preset threshold; if the result is greater than the preset threshold, classification succeeds; otherwise classification fails.
6. An OCR-based formatted fax classification and information extraction method according to claim 1, wherein:
the step 7 comprises the following steps:
(1) cropping the field image according to the position information of the region to be identified obtained in step 6;
(2) recognizing the located field by OCR.
7. An OCR-based formatted fax classification and information extraction method according to claim 1, wherein:
the step 8 comprises the following steps:
(1) extracting the fields recognized by OCR;
(2) optimizing by field type: removing the non-digit parts of lowercase amount fields, that is, amounts written in digits; removing spaces, non-digit characters, and the characters for 'year', 'month', and 'day' from date fields;
(3) dictionary optimization: a dictionary library is built and the fields recognized by the OCR are matched against the fields in it; if the matching score is greater than a preset threshold, the fields in the dictionary library are replaced with the fields recognized by the OCR so as to optimize and update the dictionary library, and manually confirmed correct fields are added to the dictionary library; the matching score is equal to the total number of words recognized by the OCR divided by the total number of matched words in the current dictionary library.
CN201710334784.5A 2017-05-12 2017-05-12 Method for classifying and extracting information of formatted fax based on OCR Active CN107133621B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710334784.5A CN107133621B (en) 2017-05-12 2017-05-12 Method for classifying and extracting information of formatted fax based on OCR

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710334784.5A CN107133621B (en) 2017-05-12 2017-05-12 Method for classifying and extracting information of formatted fax based on OCR

Publications (2)

Publication Number Publication Date
CN107133621A (en) 2017-09-05
CN107133621B (en) 2020-09-29

Family

ID=59733140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710334784.5A Active CN107133621B (en) 2017-05-12 2017-05-12 Method for classifying and extracting information of formatted fax based on OCR

Country Status (1)

Country Link
CN (1) CN107133621B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633239B (en) * 2017-10-18 2020-11-03 中电鸿信信息科技有限公司 Bill classification and bill field extraction method based on deep learning and OCR
CN107862303B (en) * 2017-11-30 2019-04-26 平安科技(深圳)有限公司 Information recognition method for form-type images, electronic device and readable storage medium
CN108038504B (en) * 2017-12-11 2019-12-27 深圳房讯通信息技术有限公司 Method for analyzing content of house property certificate photo
CN110119648A (en) * 2018-02-05 2019-08-13 国家计算机网络与信息安全管理中心 Facsimile signal classification method based on optical character recognition
CN108509401B (en) * 2018-03-05 2022-01-28 平安普惠企业管理有限公司 Contract generation method and device, computer equipment and storage medium
CN108830133B (en) * 2018-04-17 2020-02-21 平安科技(深圳)有限公司 Contract image picture identification method, electronic device and readable storage medium
CN109816118B (en) * 2019-01-25 2022-12-06 上海深杳智能科技有限公司 Method and terminal for creating structured document based on deep learning model
CN110674332B (en) * 2019-08-01 2022-11-15 南昌市微轲联信息技术有限公司 Motor vehicle digital electronic archive classification method based on OCR and text mining
CN111767769A (en) * 2019-08-14 2020-10-13 北京京东尚科信息技术有限公司 Text extraction method and device, electronic equipment and storage medium
CN112560859A (en) * 2020-11-20 2021-03-26 中电鸿信信息科技有限公司 Intelligent academic calendar information extraction method based on machine vision and natural language processing
CN112528984A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Image information extraction method, device, electronic equipment and storage medium
CN112733518A (en) * 2021-01-14 2021-04-30 卫宁健康科技集团股份有限公司 Table template generation method, device, equipment and storage medium
CN112732955A (en) * 2021-03-31 2021-04-30 国网浙江省电力有限公司 Financial certificate storage and recording method in standard cost accounting
CN115273111B (en) * 2022-06-27 2023-04-18 北京互时科技股份有限公司 Device for identifying drawing material sheet without template

Citations (3)

Publication number Priority date Publication date Assignee Title
CN102750541A (en) * 2011-04-22 2012-10-24 北京文通科技有限公司 Document image classifying distinguishing method and device
CN103208004A (en) * 2013-03-15 2013-07-17 北京英迈杰科技有限公司 Automatic recognition and extraction method and device for bill information area
CN103258198A (en) * 2013-04-26 2013-08-21 四川大学 Extraction method for characters in form document image

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
JP3864246B2 (en) * 2001-05-30 2006-12-27 インターナショナル・ビジネス・マシーンズ・コーポレーション Image processing method, image processing system, and program
JP4067799B2 (en) * 2001-09-07 2008-03-26 日立オムロンターミナルソリューションズ株式会社 Image recognition apparatus and stand type image scanner used therefor

Non-Patent Citations (4)

Title
A knowledge-based table recognition method for Chinese bank statement images;Liang Xu et al;《2016 IEEE International Conference on Image Processing (ICIP)》;20160928;3279-3283 *
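Research on an OCR-based document image detection and information extraction system;Zou Yajie (邹亚劼);《Wanfang Dissertation Database》;20160504;full text *
Research on document image stitching technology;Gao Hong (高鸿);《Wanfang Dissertation Database》;20111030;Sections 2.1 and 5.3 *
Research on a bank bill recognition system;Du Gang (杜刚);《China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology》;20160715;Sections 2.3 and 3.2.1, Chapters 4-6 *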

Also Published As

Publication number Publication date
CN107133621A (en) 2017-09-05

Similar Documents

Publication Publication Date Title
CN107133621B (en) Method for classifying and extracting information of formatted fax based on OCR
CN110766014B (en) Bill information positioning method, system and computer readable storage medium
CN107633239B (en) Bill classification and bill field extraction method based on deep learning and OCR
US11380113B2 (en) Methods for mobile image capture of vehicle identification numbers in a non-document
US11164027B2 (en) Deep learning based license plate identification method, device, equipment, and storage medium
EP1052593B1 (en) Form search apparatus and method
US8508756B2 (en) Image forming apparatus having capability for recognition and extraction of annotations and additionally written portions
US8565474B2 (en) Paragraph recognition in an optical character recognition (OCR) process
US20050196074A1 (en) Method and system for searching form features for form identification
CN113158808B (en) Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction
CN101855640B (en) Method for image analysis, especially for mobile wireless device
JP2002312385A (en) Document automated dividing device
CN112508011A (en) OCR (optical character recognition) method and device based on neural network
JP2013084071A (en) Form recognition method and form recognition device
CN112949471A (en) Domestic CPU-based electronic official document identification reproduction method and system
CN113191348A (en) Template-based text structured extraction method and tool
JP2018042067A (en) Image processing system, image processing method, and information processing device
CN111832497B (en) Text detection post-processing method based on geometric features
Karanje et al. Survey on text detection, segmentation and recognition from a natural scene images
US7865130B2 (en) Material processing apparatus, material processing method, and material processing program product
CN102682308A (en) Imaging processing method and device
Lang et al. Physical layout analysis of partly annotated newspaper images
Shafait Geometric Layout Analysis of scanned documents
KR102651258B1 (en) A device and a method for deep learning based character recognition
CN115063114A (en) Contract additional recording automation method, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 210005 No. 268, Hanzhong Road, Nanjing, Jiangsu

Applicant after: CLP Hongxin Information Technology Co., Ltd

Address before: 210005 No. 268, Hanzhong Road, Nanjing, Jiangsu

Applicant before: Jiangsu Hongxin System Integration Co., Ltd.

GR01 Patent grant