CN105528604B - An OCR-based automatic bill recognition and processing system - Google Patents

An OCR-based automatic bill recognition and processing system

Info

Publication number
CN105528604B
Authority
CN
China
Prior art keywords
text
bill
block
image
identification
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201610070970.8A
Other languages
Chinese (zh)
Other versions
CN105528604A (en)
Inventor
高学
金连文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Application filed by South China University of Technology SCUT
Priority to CN201610070970.8A
Publication of CN105528604A
Application granted
Publication of CN105528604B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G06V10/243 Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/28 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/28 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/293 Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of characters other than Kanji, Hiragana or Katakana

Abstract

The present invention provides an OCR-based system for the automatic recognition and processing of bills. It comprises an image acquisition module, a fast image binarization module, a text block detection and localization module, a precise localization module for single-column text blocks, a precise localization and segmentation module for multi-column text blocks, a text recognition module, and a bill image retrieval module. The invention not only captures invoice images in high definition and stores them in compressed form, but also robustly and accurately locates and recognizes the text on each bill, such as buyer and seller information, goods information, and the invoice date. Recognized bill images can be retrieved conveniently. The system offers high processing and recognition accuracy, low cost, good robustness, and a high degree of automation, and can be widely applied to the computerized management of bills, such as bill authentication, filing, and query.

Description

An OCR-based automatic bill recognition and processing system
Technical field
The invention belongs to the field of pattern recognition and artificial intelligence, and in particular relates to an OCR-based automatic bill recognition and processing system.
Background
OCR-based automatic bill recognition and processing refers to using computers and similar equipment, together with OCR (optical character recognition), to automatically extract and recognize the symbols on paper bills and process them accordingly. It is one of the key technologies for automated computer processing of bills. Although electronic payment and electronic bills are increasingly common, traditional paper bills, such as the various kinds of paper invoices and financial documents, remain widely used in work and daily life. Existing approaches to computer processing of paper bills generally fall into three categories. (1) Automatic acquisition and storage of bill images: dedicated equipment captures the paper bill and stores it in compressed form. Because the text and other information on the bill is not recognized or processed, it is difficult to retrieve bills by content or manage them effectively afterwards, for example to verify and check bill content by computer. (2) Manual entry of bill content: the bill information is typed in and saved by hand to support later computer management. This approach does not scale to large volumes of bills, is prone to entry errors, and has a high labor cost. (3) Automatic recognition and processing of bills with a simple layout: this approach generally targets bills with a relatively simple layout, such as bank checks. The information to be recognized usually sits at a fixed geometric position or next to special registration marks, so it can be extracted by means of the registration marks or a simple geometric transformation and then recognized with OCR. For bills with a complex layout, and invoices in particular, there are many kinds of bills and their layouts are complex and varied, so there is as yet no general method or device that can effectively and automatically recognize the text on them. In view of the above, the present invention addresses automatic information entry and processing for invoices with a complex layout, and in particular provides an effective automatic recognition and processing method and system for value-added tax (VAT) invoices.
Summary of the invention
The object of the invention is to overcome the shortcomings of the above bill processing methods and systems and to provide a fast, high-precision system for the automatic recognition and processing of VAT invoices. The system captures invoice images with a high-speed scanner and can quickly and accurately extract and recognize the bill information on a VAT invoice, including the names and taxpayer identification numbers of the seller and the buyer, the goods information (goods name, unit of measure, quantity, amount, and tax amount), and the invoice date. It can also store bill images in compressed form and retrieve them.
An OCR-based automatic bill recognition and processing system comprises a bill image acquisition module, a fast image binarization module, a text block detection and localization module, a single-column text block precise localization module, a multi-column text block precise localization and segmentation module, a text recognition module, and a bill image retrieval module. After the image acquisition module captures an invoice image, the fast image binarization module binarizes it; the text block detection and localization module detects and locates the text blocks and, using the inclination of the horizontal dividing lines found during detection, performs skew detection and correction on the image. Based on the located text blocks, the single-column text block precise localization module and the multi-column text block precise localization and segmentation module perform precise localization and segmentation to obtain the text lines of bill information to be recognized. The text recognition module divides text lines into two types, pure digit strings and Chinese character strings, and recognizes each type separately. The bill image retrieval module stores each captured bill image in association with its recognition results and supports retrieval over the bill image data; the searchable content includes the buyer information, seller information, goods information, and invoice date of each bill image.
Further, the system also includes a high-definition camera, which captures the invoice image and transfers it to a high-performance computer for storage. The user only needs to lay the invoice flat under the camera; the image acquisition module triggers the camera to capture the invoice image and saves it to a specified directory on the computer.
Further, after the user starts the system, the system enters a waiting state. The user lays the invoice flat under the camera; the image acquisition module computes a trigger signal from consecutive frames, triggers the camera to capture the invoice image, and saves it to a specified directory on the PC. The system then processes and recognizes the bill image automatically, extracting and recognizing the names and taxpayer identification numbers of the seller and the buyer, the goods information, and the invoice date on the VAT invoice, and it saves the bill image in compressed form and associates it with the recognized information. From the resulting bill images and recognition data, the system supports automated retrieval and filing of bill images and, using the recognized information, can authenticate the bill information against the database of the relevant tax authority.
Further, the bill image acquisition module controls the triggering of the high-definition camera and transfers the captured bill images to the high-performance computer for storage. The trigger signal is computed from the inter-frame difference of the image sequence captured by the camera: if the difference between adjacent frames is below a set threshold and the proportion of foreground pixels in the image exceeds a predetermined value, a trigger signal is sent and the bill image is captured.
Further, the fast image binarization module binarizes the invoice image using a method that combines the maximum between-class variance method (Otsu's method) with local blocks. The image is first converted to grayscale and divided into N sub-regions, where N is set according to the stroke width. In each sub-region, a binarization threshold T is determined with the maximum between-class variance method and the sub-region is binarized with T. If the difference between the maximum and minimum gray values in a sub-region is below a preset value, the whole sub-region is set to background.
Further, the text block detection and localization module divides the content to be recognized, according to the layout of the VAT invoice, into buyer information, seller information, goods information, and invoice date text blocks, and locates and segments the corresponding text image blocks using a method based on line detection. First, horizontal lines with an inclination of less than 45° are detected with the Hough transform line detection method, and the image skew is corrected according to the detected inclination. Next, the five longest horizontal lines are taken, and the horizontal dividing lines of the VAT invoice are located using geometric proportion constraints on the line spacing. Finally, the invoice text blocks to be recognized are located from the horizontal dividing lines and the geometric positions of the bill information blocks. The located text blocks are divided into single-column and multi-column text blocks. Single-column text blocks include the buyer name and taxpayer identification number, the seller name and taxpayer identification number, and the invoice date. The multi-column text block is the goods information block, which includes the goods name, unit of measure, quantity, amount, and tax amount columns. If text block segmentation fails, the bill is judged invalid.
Further, the single-column text block precise localization module handles the precise localization of single-column text blocks. Each single-column text image block is first scanned row by row and the stroke crossing count of each row is computed. The row with the largest stroke crossing count is then taken as the starting row, and scanning proceeds upward and downward from it; if K consecutive rows have a stroke crossing count below a predetermined threshold (K is determined experimentally), a line edge is considered found, which locates one line of text. The same steps are repeated on the remaining image until all text lines have been located. Finally, using the property that text lines within one text block have approximately equal heights, lines with abnormal height are rejected, which precisely locates the text lines to be recognized in each single-column text image block. If text line segmentation fails, the bill is judged invalid.
The multi-column text block precise localization and segmentation module handles the precise localization of the multi-column text block, that is, the goods information image block. First, the vertical dividing lines in the image block are located with a line detection method based on the Hough transform, and invalid dividing lines are rejected using geometric proportion constraints between the dividing lines, which locates the image regions of the goods name, unit of measure, quantity, amount, and tax amount columns. Finally, using the property that the fields of each goods entry are horizontally aligned, the text line images to be recognized are precisely located and segmented. If text line segmentation fails, the bill is judged invalid.
Further, the text recognition module divides text lines into two types, pure digit strings and Chinese character strings, and recognizes each type separately. The recognition process for a text line is as follows: character segmentation of the line is first performed by vertical projection analysis, and dynamic programming is then used to compute the optimal segmentation path, which yields the character recognition result. For pure digit strings, 8-direction gradient features are first computed for each candidate character segment and reduced in dimension with LDA; the segment is then classified in the reduced feature space with the nearest neighbor method to obtain a recognition confidence for each candidate character, and this confidence is fed into the optimal segmentation path computation for the text line. For Chinese character strings, 8-direction gradient features are likewise computed for each candidate character segment and reduced in dimension with LDA, and the segment is classified in the reduced feature space with the nearest neighbor method to obtain a recognition confidence for each candidate character. This confidence is then combined with bigram language model information and with geometric information such as the aspect ratio of adjacent candidate character segments, and the combined confidence is fed into the optimal segmentation path computation for the text line.
Compared with existing computer-based bill image processing methods, the present invention has the following advantages:
(1) Bill images are captured with a high-definition camera, so the system structure is simple, the system is easy to use, and the hardware cost is low. A high-speed scanner, by contrast, is not only costly but also less convenient to operate.
(2) Because robust image processing algorithms are used, including the image binarization method and the Hough-transform-based line detection and text localization methods, the system adapts well to changes in ambient lighting and to minor adjustments of the bill layout, and its stability is good.
(3) Because effective character segmentation and recognition algorithms tailored to the characteristics of bill images are used together with the high-definition camera, the system obtains clear bill images, stores and files them, and achieves high character recognition accuracy. Experimental results show that the character recognition rate of the system can reach 98% or more.
(4) The bill text information extracted by the system is effectively associated with the captured bill images, so the system can be widely applied to the computerized management of bills, such as bill authentication and bill query, and thus better meets the need for automated processing of bill images.
Brief description of the drawings
Fig. 1 is a schematic diagram of the processing flow of the automatic bill recognition and processing system.
Fig. 2 is a flow chart of the image processing and recognition module of the automatic bill recognition and processing system.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings, but embodiments of the invention are not limited to this description.
Fig. 1 is a schematic diagram of the bill processing flow of the automatic bill recognition and processing system. As shown in Fig. 1, the processing flow is as follows. After the user starts the hardware and software of the system, the system enters a waiting state. The user lays the invoice flat under the camera; the image acquisition module computes a trigger signal from consecutive frames, triggers the camera to capture the invoice image, and saves it to a specified directory on the PC. The system then processes and recognizes the bill image automatically, extracting and recognizing bill information on the VAT invoice such as the names and taxpayer identification numbers of the seller and the buyer, the goods information (goods name, unit of measure, quantity, amount, and tax amount), and the invoice date, and it saves the bill image in compressed form and associates it with the recognized information. From the bill images and recognition data it generates, the system supports automated retrieval and filing of bill images and, using the recognized information, can authenticate the bill information against the database of the relevant tax authority.
Fig. 2 is a flow chart of the image processing and recognition module of the automatic bill recognition and processing system. As shown in Fig. 2, the flow is as follows. After the image processing module receives an invoice image, it binarizes the image, detects and locates the text blocks, and, using the inclination of the horizontal dividing lines found during detection, performs skew detection and correction on the image. Based on the located text blocks, the system divides them into single-column and multi-column text blocks and precisely locates and segments each kind to obtain the text lines of bill information to be recognized. It then performs, in order, projection-based character segmentation of each line, computation of candidate character confidences, computation of the optimal segmentation path, and character recognition. Finally, the recognized information is associated with the captured bill image and stored.
The invention implements an OCR-based automatic bill recognition and processing system using a high-definition camera and a high-performance computer.
The high-definition camera captures the invoice image and transfers it to the high-performance computer for storage. The user only needs to lay the invoice flat under the camera; the image acquisition module triggers the camera to capture the invoice image and saves it to a specified directory on the computer.
(1) Image acquisition module
This module controls the triggering of the high-definition camera and transfers the captured bill images to the high-performance computer for storage. The trigger signal is computed from the inter-frame difference of the image sequence captured by the camera: if the difference between adjacent frames is below a certain threshold and the proportion of foreground pixels in the image exceeds a predetermined value, a trigger signal is sent and the bill image is captured.
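The trigger rule described above can be sketched roughly as follows. This is only an illustration, not the patented implementation: frames are assumed to be lists of rows of 8-bit gray values, "foreground" is assumed to mean pixels darker than a fixed gray level, and the function names and default thresholds (`diff_threshold`, `min_foreground`, `dark_below`) are hypothetical.

```python
def mean_abs_diff(prev, curr):
    """Mean absolute gray-level difference between two equally sized frames."""
    total = 0
    count = 0
    for row_p, row_c in zip(prev, curr):
        for p, c in zip(row_p, row_c):
            total += abs(p - c)
            count += 1
    return total / count


def foreground_ratio(frame, dark_below=128):
    """Fraction of pixels darker than `dark_below` (assumed foreground)."""
    pixels = [p for row in frame for p in row]
    return sum(1 for p in pixels if p < dark_below) / len(pixels)


def should_trigger(prev, curr, diff_threshold=2.0, min_foreground=0.05):
    """Trigger when the scene is still AND enough foreground is visible."""
    still = mean_abs_diff(prev, curr) < diff_threshold
    has_content = foreground_ratio(curr) > min_foreground
    return still and has_content
```

The two conditions play different roles: the frame difference rejects frames captured while the invoice is still being moved, and the foreground ratio rejects a still but empty scene.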
(2) Fast image binarization module
This module binarizes the invoice image. To cope with effects such as ambient lighting, a binarization method that combines the maximum between-class variance method (Otsu's method) with local blocks is used. The image is first converted to grayscale and divided into N sub-regions, where N is set according to the stroke width. In each sub-region, a binarization threshold T is determined with the maximum between-class variance method and the sub-region is binarized with T. If the difference between the maximum and minimum gray values in a sub-region is below a preset value, the whole sub-region is set to background.
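A minimal sketch of block-wise binarization with the maximum between-class variance criterion is given below. It makes several assumptions that the text does not specify: sub-regions are square blocks of a fixed size `block` (rather than N regions derived from the stroke width), ink is darker than paper, and `min_contrast` stands in for the preset gray-difference value. All names are hypothetical.

```python
def otsu_threshold(pixels):
    """Gray level t that maximizes the between-class variance of
    the classes {p <= t} and {p > t}."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of gray values in the dark class so far
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize_blocks(gray, block=4, min_contrast=30):
    """Binarize each block with its own threshold; 1 = ink, 0 = background."""
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            ys = range(y0, min(y0 + block, h))
            xs = range(x0, min(x0 + block, w))
            pixels = [gray[y][x] for y in ys for x in xs]
            if max(pixels) - min(pixels) < min_contrast:
                continue  # low contrast: the whole block stays background
            t = otsu_threshold(pixels)
            for y in ys:
                for x in xs:
                    out[y][x] = 1 if gray[y][x] <= t else 0
    return out
```

Because each block gets its own threshold, dark ink on a dim region and lighter ink on a bright region can both be separated, which a single global threshold would not manage under uneven lighting.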
(3) Text block detection and localization module
According to the layout of the VAT invoice, the content to be recognized is divided into text blocks such as buyer information, seller information, goods information, and invoice date, and the corresponding text image blocks are located and segmented using a method based on line detection. First, horizontal lines with an inclination of less than 45° are detected with the Hough transform line detection method, and the image skew is corrected according to the detected inclination. Next, the five longest horizontal lines are taken, and the horizontal dividing lines of the VAT invoice are located using geometric proportion constraints on the line spacing. Finally, the invoice text blocks to be recognized are located from the horizontal dividing lines and the geometric positions of the bill information blocks. The located text blocks are divided into single-column and multi-column text blocks. Single-column text blocks include the buyer name and taxpayer identification number, the seller name and taxpayer identification number, the invoice date, and so on; the multi-column text block is the goods information block (including columns such as goods name, unit of measure, quantity, amount, and tax amount). If text block segmentation fails, the bill is reported as invalid.
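The skew estimation step can be illustrated with the sketch below. It assumes that a line detector (for example a Hough transform) has already produced line segments as pairs of endpoints; the detector itself is not shown. Averaging the angles of the `top_n` longest near-horizontal segments, weighted by length, is an assumption made for this example, since the text only states that the correction follows the detected inclination. All names are hypothetical.

```python
import math


def segment_angle_deg(seg):
    """Angle of a segment against the x axis, in (-90, 90] degrees."""
    (x1, y1), (x2, y2) = seg
    if x2 < x1:  # order endpoints left to right
        x1, y1, x2, y2 = x2, y2, x1, y1
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def segment_length(seg):
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)


def estimate_skew_deg(segments, max_tilt=45.0, top_n=5):
    """Length-weighted mean angle of the longest near-horizontal segments.

    Returns None when no segment is tilted by less than `max_tilt` degrees.
    """
    horizontal = [s for s in segments if abs(segment_angle_deg(s)) < max_tilt]
    longest = sorted(horizontal, key=segment_length, reverse=True)[:top_n]
    if not longest:
        return None
    total = sum(segment_length(s) for s in longest)
    weighted = sum(segment_angle_deg(s) * segment_length(s) for s in longest)
    return weighted / total
```

Rotating the image by the negative of the returned angle would then bring the table lines back to horizontal before the dividing lines are matched against the invoice layout.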
(4) Single-column text block precise localization module
This module handles the precise localization of single-column text blocks. For each single-column text image block, the algorithm first scans row by row and computes the stroke crossing count of each row. The row with the largest stroke crossing count is then taken as the starting row, and scanning proceeds upward and downward from it; if K consecutive rows have a stroke crossing count below a predetermined threshold (K is determined experimentally), a line edge is considered found, which locates one line of text. The same steps are repeated on the remaining image until all text lines have been located. Finally, using the property that text lines within one text block have approximately equal heights, lines with abnormal height are rejected, which precisely locates the text lines to be recognized in each single-column text image block. If text line segmentation fails, the bill is reported as invalid.
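The row-scanning procedure above can be sketched as follows, under stated assumptions: the input is a binary image (1 = ink) given as a list of rows, a "stroke crossing" is counted as a background-to-ink transition along a row, and the height check compares each line against the median height with a hypothetical `tolerance`. The defaults for `min_crossings` and `k` are placeholders, since the text says K is determined experimentally.

```python
def stroke_crossings(row):
    """Number of background-to-ink transitions in one binary row."""
    count, prev = 0, 0
    for v in row:
        if v == 1 and prev == 0:
            count += 1
        prev = v
    return count


def _scan(counts, seed, step, min_crossings, k):
    """Walk from `seed` in direction `step`; return the last text row seen
    before k consecutive sparse rows (or the image border)."""
    edge, sparse = seed, 0
    i = seed + step
    while 0 <= i < len(counts):
        if counts[i] < min_crossings:
            sparse += 1
            if sparse >= k:
                break
        else:
            sparse = 0
            edge = i
        i += step
    return edge


def locate_text_lines(binary, min_crossings=1, k=2):
    """Return (top_row, bottom_row) pairs, one per located text line."""
    counts = [stroke_crossings(r) for r in binary]
    lines = []
    while counts:
        seed = max(range(len(counts)), key=lambda i: counts[i])
        if counts[seed] < min_crossings:
            break
        top = _scan(counts, seed, -1, min_crossings, k)
        bottom = _scan(counts, seed, +1, min_crossings, k)
        lines.append((top, bottom))
        for i in range(top, bottom + 1):
            counts[i] = 0  # remove the located line, then repeat
    return sorted(lines)


def reject_abnormal_heights(lines, tolerance=0.5):
    """Drop lines whose height deviates too far from the median height."""
    if not lines:
        return []
    heights = sorted(b - t + 1 for t, b in lines)
    median = heights[len(heights) // 2]
    return [(t, b) for t, b in lines
            if abs((b - t + 1) - median) <= tolerance * median]
```

Starting from the densest row makes the scan begin inside the body of a text line, so the K-row rule only has to find where the strokes thin out above and below it.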
(5) Multi-column text block precise localization and segmentation module
This module handles the precise localization of the multi-column text block, that is, the goods information image block. First, the vertical dividing lines in the image block are located with a line detection method based on the Hough transform, and invalid dividing lines are rejected using geometric proportion constraints between the dividing lines, which locates the image regions of columns such as goods name, unit of measure, quantity, amount, and tax amount. Finally, using the property that the fields of each goods entry are horizontally aligned, the text line images to be recognized are precisely located and segmented. If text line segmentation fails, the bill is reported as invalid.
(6) Text recognition module
Text lines are divided into two types, pure digit strings (such as the taxpayer identification number) and Chinese character strings, and each type is recognized separately. The recognition process for a text line is as follows: character segmentation of the line is first performed by vertical projection analysis, and dynamic programming is then used to compute the optimal segmentation path, which yields the character recognition result. For pure digit strings, 8-direction gradient features are first computed for each candidate character segment and reduced in dimension with LDA; the segment is then classified in the reduced feature space with the nearest neighbor method to obtain a recognition confidence for each candidate character, and this confidence is fed into the optimal segmentation path computation for the text line. For Chinese character strings, 8-direction gradient features are likewise computed for each candidate character segment and reduced in dimension with LDA, and the segment is classified in the reduced feature space with the nearest neighbor method to obtain a recognition confidence for each candidate character. This confidence is then combined with bigram language model information and with geometric information such as the aspect ratio of adjacent candidate character segments, and the combined confidence is fed into the optimal segmentation path computation for the text line.
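The optimal segmentation path computation can be illustrated with a small dynamic programming sketch. It assumes that projection analysis has produced a list of candidate cut positions numbered 0 to `num_cuts - 1`, and that a hypothetical callback `recognize(i, j)` returns a `(label, confidence)` pair for the image between cuts i and j. The path score used here is the sum of log confidences, which is an assumption; the bigram language model and aspect ratio terms described for Chinese strings are omitted.

```python
import math


def best_segmentation(num_cuts, recognize, max_span=3):
    """Choose the cut sequence from cut 0 to the last cut that maximizes the
    sum of log confidences; return the recognized string, or None."""
    NEG = -math.inf
    # best[j] = (score of best path ending at cut j, previous cut, label)
    best = [(NEG, None, None)] * num_cuts
    best[0] = (0.0, None, None)
    for j in range(1, num_cuts):
        for i in range(max(0, j - max_span), j):
            if best[i][0] == NEG:
                continue
            label, conf = recognize(i, j)
            if conf <= 0.0:
                continue
            score = best[i][0] + math.log(conf)
            if score > best[j][0]:
                best[j] = (score, i, label)
    last = num_cuts - 1
    if last < 1 or best[last][0] == NEG:
        return None
    labels = []
    j = last
    while j > 0:  # backtrack along the stored predecessors
        _, i, label = best[j]
        labels.append(label)
        j = i
    return "".join(reversed(labels))
```

A usage example with a made-up confidence table, in which the two middle pieces score poorly on their own but well when merged:

```python
table = {
    (0, 1): ("1", 0.90),
    (1, 2): ("l", 0.30),
    (2, 3): ("l", 0.30),
    (1, 3): ("0", 0.95),
    (3, 4): ("8", 0.90),
}
text = best_segmentation(5, lambda i, j: table.get((i, j), ("?", 0.01)))
```

Here the path 0, 1, 3, 4 wins because merging the middle pieces gives a far higher confidence than splitting them.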
(7) Bill image retrieval module
By storing each captured bill image in association with its recognition results, this module supports retrieval over the bill image data. The searchable content includes the buyer information, seller information, goods information, invoice date, and so on of each bill image. This makes it convenient to retrieve filed bill images.
The above embodiment is a preferred embodiment of the invention, but embodiments of the invention are not limited to it. Any other change, modification, or substitution made without departing from the spirit and principle of the invention shall be regarded as an equivalent replacement and falls within the scope of protection of the invention.

Claims (7)

1. An OCR-based automatic bill recognition and processing system, characterized in that it comprises a bill image acquisition module, a fast image binarization module, a text block detection and localization module, a single-column text block precise localization module, a multi-column text block precise localization and segmentation module, a text recognition module, and a bill image retrieval module; after the image acquisition module captures an invoice image, the fast image binarization module binarizes the image, and the text block detection and localization module detects and locates the text blocks and, using the inclination of the horizontal dividing lines found during detection, performs skew detection and correction on the image; based on the located text blocks, the single-column text block precise localization module and the multi-column text block precise localization and segmentation module perform precise localization and segmentation to obtain the text lines of bill information to be recognized; the text recognition module divides text lines into two types, pure digit strings and Chinese character strings, and recognizes each type separately; the bill image retrieval module stores each captured bill image in association with its recognition results and supports retrieval over the bill image data, the searchable content including the buyer information, seller information, goods information, and invoice date of each bill image;
after the user starts the system, the system enters a waiting state; the user lays the invoice flat under the camera, and the image acquisition module computes a trigger signal from consecutive frames, triggers the camera to capture the invoice image, and saves it to a specified directory on the PC; the system then processes and recognizes the bill image automatically, extracting and recognizing the names and taxpayer identification numbers of the seller and the buyer, the goods information, and the invoice date on the VAT invoice, and saves the bill image in compressed form and associates it with the recognized information; from the resulting bill images and recognition data, the system supports automated retrieval and filing of bill images and, using the recognized information, can authenticate the bill information against the database of the relevant tax authority.
2. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: it further comprises a high-definition camera that captures the invoice image and transfers the image to a high-performance computer for storage; the user only needs to place the flattened invoice under the camera, whereupon the image acquisition module triggers the camera to capture the invoice image and saves it to a specified directory on the computer.
3. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the bill image acquisition module is responsible for trigger control of the high-definition camera and transfers the captured bill images to the high-performance computer for storage; the trigger signal is computed from the inter-frame difference of the image sequence captured by the high-definition camera; if the image difference between adjacent frames is smaller than a set threshold and the proportion of foreground pixels in the image is greater than a predetermined value, a trigger signal is sent to capture the bill image.
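The trigger rule of claim 3 can be illustrated with a minimal sketch. The claim does not say how the inter-frame difference or the foreground pixels are measured, so this example assumes the mean absolute gray-level difference between adjacent frames and treats a pixel as foreground when it differs from a stored empty-scene frame. The function name `should_trigger`, the `background` frame, and all threshold values are hypothetical.

```python
def should_trigger(prev_frame, curr_frame, background,
                   diff_threshold=2.0, fg_delta=30, fg_ratio_min=0.2):
    """Return True when the scene is still and an invoice appears to be present.

    Frames are equally sized lists of rows of grayscale values (0-255).
    All thresholds are illustrative assumptions, not values from the patent.
    """
    total = 0
    diff_sum = 0
    foreground = 0
    for prev_row, curr_row, bg_row in zip(prev_frame, curr_frame, background):
        for p, c, b in zip(prev_row, curr_row, bg_row):
            diff_sum += abs(c - p)                # inter-frame difference
            foreground += abs(c - b) > fg_delta   # pixel differs from the empty scene
            total += 1
    if total == 0:
        return False
    is_still = diff_sum / total < diff_threshold      # adjacent frames nearly identical
    has_invoice = foreground / total > fg_ratio_min   # enough foreground pixels
    return is_still and has_invoice
```

Requiring both conditions avoids capturing while the invoice is still being moved (large frame difference) and avoids capturing an empty desk (too few foreground pixels).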
4. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the fast image binarization module binarizes the invoice image using a binarization method based on the maximum between-class variance (Otsu) method applied to local blocks; the image is first converted to grayscale and divided into N sub-regions, where the value of N is set according to the stroke width; then, in each sub-region, the binarization threshold T is determined by the maximum between-class variance method and the sub-region is binarized according to T; if the difference between the maximum and minimum gray values in a sub-region is less than a preset value, the sub-region is set to background.
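The block-wise binarization of claim 4 can be sketched as follows. This is a minimal pure-Python illustration, assuming dark text on a light background (pixels at or below the threshold become foreground, value 1) and square blocks of side `block_size`. The claim only states that the number of sub-regions depends on the stroke width, so `block_size` and `min_contrast` are hypothetical parameters.

```python
def otsu_threshold(pixels):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    w0 = 0          # number of pixels in the dark class (<= t)
    sum0 = 0        # gray-level sum of the dark class
    best_t, best_var = 0, -1.0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize_blocks(gray, block_size, min_contrast=20):
    """Binarize each block with its own Otsu threshold; flat blocks become background."""
    height, width = len(gray), len(gray[0])
    out = [[0] * width for _ in range(height)]
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            rows = range(top, min(top + block_size, height))
            cols = range(left, min(left + block_size, width))
            pixels = [gray[y][x] for y in rows for x in cols]
            if max(pixels) - min(pixels) < min_contrast:
                continue  # low-contrast block: leave as background (0)
            t = otsu_threshold(pixels)
            for y in rows:
                for x in cols:
                    out[y][x] = 1 if gray[y][x] <= t else 0
    return out
```

The low-contrast check matters because Otsu's method always returns some threshold, even for a block of blank paper, which would otherwise turn sensor noise into spurious foreground.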
5. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the text block detection and positioning module, according to the format of the VAT invoice, divides the content to be recognized into purchaser information, seller information, commodity information, and invoicing date text blocks, and locates and segments the corresponding text image blocks using a method based on straight-line detection; first, horizontal lines with an inclination angle of less than 45° are detected using the Hough transform line detection method, and image skew correction is performed according to the inclination angle of the detected horizontal lines; then, the five longest horizontal lines are taken and, using the geometric ratio constraints on line spacing, the horizontal dividing lines of the VAT invoice are located; finally, according to the geometric positions of the horizontal dividing lines and the bill information blocks, the invoice text information blocks to be recognized are located; the located text information blocks are divided into single-column text blocks and multi-column text blocks; the single-column text blocks comprise the purchaser name and taxpayer identification number, the seller name and taxpayer identification number, and the invoicing date; the multi-column text blocks comprise the commodity information block, which contains the commodity name, unit of measure, quantity, amount, and tax amount columns; if text block segmentation fails, the bill is judged to be invalid.
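The line-selection steps of claim 5 can be sketched as below. The example assumes that a Hough-based detector has already produced line segments as `(x1, y1, x2, y2)` tuples; it then keeps the near-horizontal ones, takes the five longest, estimates the skew angle, and checks the line spacing against expected proportions. Using the median angle as the skew estimate, the tolerance, and the `expected_ratios` values are assumptions, since the claim does not give the invoice's actual geometric ratios.

```python
import math
from statistics import median


def horizontal_lines(segments, max_angle_deg=45.0):
    """Return (length, angle_deg, mid_y) for segments inclined less than max_angle_deg."""
    lines = []
    for x1, y1, x2, y2 in segments:
        if x2 < x1:  # orient left to right so the angle lies in [-90, 90]
            x1, y1, x2, y2 = x2, y2, x1, y1
        length = math.hypot(x2 - x1, y2 - y1)
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if length > 0 and abs(angle) < max_angle_deg:
            lines.append((length, angle, (y1 + y2) / 2))
    return lines


def skew_and_dividers(segments, top_k=5):
    """Return (skew angle in degrees, sorted y positions) of the top_k longest lines."""
    lines = horizontal_lines(segments)
    if not lines:
        return None
    longest = sorted(lines, key=lambda line: line[0], reverse=True)[:top_k]
    skew = median(angle for _, angle, _ in longest)
    ys = sorted(mid_y for _, _, mid_y in longest)
    return skew, ys


def spacing_matches(ys, expected_ratios, tol=0.15):
    """Check that each gap between lines, as a fraction of the span, fits the template."""
    gaps = [b - a for a, b in zip(ys, ys[1:])]
    span = sum(gaps)
    if span <= 0 or len(gaps) != len(expected_ratios):
        return False
    return all(abs(gap / span - ratio) <= tol
               for gap, ratio in zip(gaps, expected_ratios))
```

If `spacing_matches` returns False for the candidate lines, a caller would treat text block segmentation as failed, which the claim maps to an invalid bill.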
6. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the single-column text block precise positioning module mainly handles the precise positioning of single-column text blocks; each single-column text image block is first scanned row by row, and the number of stroke crossings in each row is computed; then, taking the row with the maximum number of stroke crossings as the starting row, the block is scanned upward and downward; if there are K consecutive rows whose stroke crossing counts are below a predetermined threshold, they are regarded as a line edge, thereby locating one text line; the above steps are then repeated on the remaining image until all text lines have been located; finally, using the property that text lines within the same text block have approximately equal heights, lines of abnormal height are rejected, thereby precisely locating the text lines to be recognized in each single-column text image block; if text line segmentation fails, the bill is judged to be invalid;
The multi-column text block precise positioning and segmentation module mainly handles the precise positioning of the multi-column text block, i.e. the commodity information image block; first, the vertical dividing lines in the image block are located using the Hough-transform-based line detection method, and invalid dividing lines are rejected using the geometric ratio constraints between dividing lines, thereby locating the image positions corresponding to the commodity name, unit of measure, quantity, amount, and tax amount columns; finally, using the property that the image positions of each commodity entry are horizontally aligned, the text line images to be recognized are precisely located and segmented; if text line segmentation fails, the bill is judged to be invalid.
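The row-scanning procedure for single-column text blocks in claim 6 can be sketched as follows. The input is a binary image given as a list of rows with 1 for stroke pixels. A stroke crossing is counted here as a background-to-stroke transition along a row, which is one plausible reading of the claim; the parameters `k`, `min_crossings`, and the height tolerance `tol` are hypothetical.

```python
def stroke_crossings(row):
    """Count background-to-stroke (0 -> 1) transitions along one row."""
    count, prev = 0, 0
    for v in row:
        if v and not prev:
            count += 1
        prev = v
    return count


def _scan_edge(crossings, used, seed, step, k, min_crossings):
    """Walk from seed in direction step until k consecutive low-crossing rows."""
    edge, low_run = seed, 0
    i = seed + step
    while 0 <= i < len(crossings) and not used[i]:
        if crossings[i] < min_crossings:
            low_run += 1
            if low_run >= k:
                break
        else:
            low_run = 0
            edge = i          # last row that still belongs to the text line
        i += step
    return edge


def locate_text_lines(binary, k=2, min_crossings=1):
    """Return (top, bottom) row ranges of text lines, ordered top to bottom."""
    crossings = [stroke_crossings(row) for row in binary]
    used = [False] * len(binary)
    lines = []
    while True:
        candidates = [i for i, c in enumerate(crossings)
                      if not used[i] and c >= min_crossings]
        if not candidates:
            break
        seed = max(candidates, key=lambda i: crossings[i])
        top = _scan_edge(crossings, used, seed, -1, k, min_crossings)
        bottom = _scan_edge(crossings, used, seed, +1, k, min_crossings)
        for i in range(top, bottom + 1):
            used[i] = True
        lines.append((top, bottom))
    return sorted(lines)


def reject_abnormal_heights(lines, tol=0.5):
    """Drop lines whose height deviates from the median by more than tol * median."""
    if not lines:
        return []
    heights = sorted(bottom - top + 1 for top, bottom in lines)
    med = heights[len(heights) // 2]
    return [(top, bottom) for top, bottom in lines
            if abs((bottom - top + 1) - med) <= tol * med]
```

Starting from the row with the most crossings anchors each search in the densest part of a text line, so thin noise rows are located last and are then removed by the height check.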
7. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the text line recognition module divides the text lines into two types, pure digit strings and Chinese character strings, and recognizes each type separately; the text line recognition process is as follows: character segmentation of the line based on vertical projection analysis is performed first, and the optimal segmentation path is then computed by dynamic programming to obtain the character recognition result; for the recognition of pure digit strings, for each candidate character segmentation block, 8-direction gradient features are first computed and reduced in dimension by LDA; then, in the reduced feature vector space, classification is performed by the nearest-neighbor method to obtain the recognition confidence of each candidate character, which is fed into the optimal segmentation path computation of text line recognition; for the recognition of Chinese character strings, for each candidate character segmentation block, 8-direction gradient features are first computed and reduced in dimension by LDA; in the reduced feature vector space, classification is performed by the nearest-neighbor method to obtain the recognition confidence of each candidate character; the recognition confidence is then integrated with bigram language model information and the aspect-ratio geometric information of adjacent candidate character segmentation blocks, and the integrated recognition confidence is fed into the optimal segmentation path computation of text line recognition.
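The over-segmentation and dynamic-programming search of claim 7 can be illustrated with a small sketch. Candidate cuts are placed in the middle of blank column runs found by vertical projection, and the best path maximizes the sum of log-confidences of the merged segments. The scoring callback `score(left, right)` stands in for the gradient-feature, LDA, and nearest-neighbor classifier and is hypothetical, as are `max_span` and the choice of summing log-confidences. The bigram language model and aspect-ratio terms used for Chinese strings are not modeled here.

```python
import math


def projection_cuts(binary):
    """Return candidate cut columns: both line ends plus the middle of each blank gap."""
    width = len(binary[0])
    proj = [sum(row[x] for row in binary) for x in range(width)]
    cuts = [0]
    x = 0
    while x < width:
        if proj[x] == 0:
            start = x
            while x < width and proj[x] == 0:
                x += 1
            if start > 0 and x < width:      # interior gap between two ink runs
                cuts.append((start + x) // 2)
        else:
            x += 1
    cuts.append(width)
    return cuts


def best_segmentation(cuts, score, max_span=3):
    """Find the cut path with the highest total log-confidence.

    score(left, right) must return (label, confidence) with confidence in (0, 1].
    Returns (recognized_text, total_log_confidence), or None if no path exists.
    """
    n = len(cuts)
    best = [-math.inf] * n
    back = [None] * n
    best[0] = 0.0
    for j in range(1, n):
        for i in range(max(0, j - max_span), j):   # merge up to max_span pieces
            if best[i] == -math.inf:
                continue
            label, conf = score(cuts[i], cuts[j])
            if conf <= 0:
                continue
            total = best[i] + math.log(conf)
            if total > best[j]:
                best[j] = total
                back[j] = (i, label)
    if back[-1] is None:
        return None
    labels = []
    j = n - 1
    while j > 0:                                   # follow back-pointers to the start
        i, label = back[j]
        labels.append(label)
        j = i
    return "".join(reversed(labels)), best[-1]
```

Because each log-confidence is at most zero, this particular objective slightly favors paths with fewer segments; a production scorer would normally normalize for that, for example by adding the geometric and language-model terms that the claim describes.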
CN201610070970.8A 2016-01-31 2016-01-31 A kind of bill automatic identification and processing system based on OCR Expired - Fee Related CN105528604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610070970.8A CN105528604B (en) 2016-01-31 2016-01-31 A kind of bill automatic identification and processing system based on OCR

Publications (2)

Publication Number Publication Date
CN105528604A CN105528604A (en) 2016-04-27
CN105528604B true CN105528604B (en) 2018-12-11

Family

ID=55770818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610070970.8A Expired - Fee Related CN105528604B (en) 2016-01-31 2016-01-31 A kind of bill automatic identification and processing system based on OCR

Country Status (1)

Country Link
CN (1) CN105528604B (en)

Families Citing this family (47)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106485246B (en) * 2016-09-19 2019-07-16 北京小米移动软件有限公司 Character identifying method and device
CN106650714A (en) * 2016-10-08 2017-05-10 迪堡金融设备有限公司 Paper note serial number identification method and apparatus
CN106485243B (en) * 2016-10-31 2019-10-22 用友网络科技股份有限公司 A kind of bank slip recognition error correction method and device
CN108242050A (en) * 2016-12-27 2018-07-03 航天信息股份有限公司 The processing method and processing device of electronic invoice
CN106886776A (en) * 2017-02-23 2017-06-23 山东浪潮云服务信息科技有限公司 The application model of license electronization is realized in a kind of utilization image recognition
CN107133571A (en) * 2017-04-11 2017-09-05 上海众开信息科技有限公司 A kind of system and method that paper invoice is automatically generated to financial statement
CN107133618B (en) * 2017-04-24 2021-03-19 北京中安未来科技有限公司 Electronic certificate identification triggering method and device
CN107169488A (en) * 2017-05-03 2017-09-15 四川长虹电器股份有限公司 A kind of correction system and antidote of bill scan image
CN107194400B (en) * 2017-05-31 2019-12-20 北京天宇星空科技有限公司 Financial reimbursement full ticket image recognition processing method
CN107392260B (en) * 2017-06-08 2020-03-17 中国民生银行股份有限公司 Error calibration method and device for character recognition result
CN107679442A (en) * 2017-06-23 2018-02-09 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of document Data Enter
CN109299798A (en) * 2017-07-25 2019-02-01 阿里巴巴集团控股有限公司 Processing method, device and the electronic equipment of travel information
CN109426814B (en) * 2017-08-22 2023-02-24 顺丰科技有限公司 Method, system and equipment for positioning and identifying specific plate of invoice picture
CN107622266B (en) * 2017-09-21 2019-05-07 平安科技(深圳)有限公司 A kind of processing method, storage medium and the server of OCR identification
CN107633239B (en) * 2017-10-18 2020-11-03 中电鸿信信息科技有限公司 Bill classification and bill field extraction method based on deep learning and OCR
CN109840520A (en) * 2017-11-24 2019-06-04 中国移动通信集团广东有限公司 A kind of invoice key message recognition methods and system
CN110109907B (en) * 2017-12-27 2021-08-24 航天信息股份有限公司 Tax data storage and query method and device
CN109993619B (en) * 2017-12-29 2022-09-30 北京京东尚科信息技术有限公司 Data processing method
CN108446699A (en) * 2018-02-08 2018-08-24 东华大学 Identity card pictorial information identifying system under a kind of complex scene
CN108460418B (en) * 2018-03-07 2021-09-28 南京邮电大学 Invoice classification method based on character recognition and semantic analysis
CN108460381B (en) * 2018-03-13 2022-06-10 南京邮电大学 Invoice reimbursement information positioning and intercepting method based on image recognition
CN108549890A (en) * 2018-03-22 2018-09-18 南京邮电大学 Invoice tilt detection based on image recognition and geometric correction method
CN108549843A (en) * 2018-03-22 2018-09-18 南京邮电大学 A kind of VAT invoice recognition methods based on image procossing
CN108734849B (en) * 2018-04-25 2020-11-13 新浪网技术(中国)有限公司 Automatic invoice true-checking method and system
CN110457973A (en) * 2018-05-07 2019-11-15 北京中海汇银财税服务有限公司 A kind of method and system of bank slip recognition
CN108717543B (en) * 2018-05-14 2022-01-14 北京市商汤科技开发有限公司 Invoice identification method and device and computer storage medium
CN109034159B (en) * 2018-05-28 2021-05-28 北京捷通华声科技股份有限公司 Image information extraction method and device
CN109271910A (en) * 2018-09-04 2019-01-25 阿里巴巴集团控股有限公司 A kind of Text region, character translation method and apparatus
CN109544774A (en) * 2018-11-30 2019-03-29 上海贞众创空间管理有限公司 A kind of smart tickets archival device
CN109726710A (en) * 2018-12-27 2019-05-07 平安科技(深圳)有限公司 Invoice information acquisition method, electronic device and readable storage medium storing program for executing
CN110263239B (en) * 2019-05-31 2023-08-22 平安科技(深圳)有限公司 Invoice identification method and device, storage medium and computer equipment
CN110675270A (en) * 2019-09-05 2020-01-10 平安健康保险股份有限公司 Method and device for determining medical insurance deduction amount based on invoice information
CN110659607A (en) * 2019-09-23 2020-01-07 天津车之家数据信息技术有限公司 Data checking method, device and system and computing equipment
CN110895690A (en) * 2019-10-11 2020-03-20 南京邮电大学 Invoice positioning method based on openCV morphology
CN111126319A (en) * 2019-12-27 2020-05-08 山东旗帜信息有限公司 Invoice identification method and device
CN111209827B (en) * 2019-12-31 2023-07-14 中国南方电网有限责任公司 Method and system for OCR (optical character recognition) bill problem based on feature detection
CN111209865A (en) * 2020-01-06 2020-05-29 中科鼎富(北京)科技发展有限公司 File content extraction method and device, electronic equipment and storage medium
US11570099B2 (en) 2020-02-04 2023-01-31 Bank Of America Corporation System and method for autopartitioning and processing electronic resources
CN111444793A (en) * 2020-03-13 2020-07-24 安诚迈科(北京)信息技术有限公司 Bill recognition method, equipment, storage medium and device based on OCR
CN111291741B (en) * 2020-05-13 2020-11-03 太平金融科技服务(上海)有限公司 Receipt identification method and device, computer equipment and storage medium
CN112784014B (en) * 2021-01-15 2022-03-25 中国核动力研究设计院 Safe full-text retrieval system and method based on multi-source heterogeneous system
CN112949450B (en) * 2021-02-25 2024-01-23 北京百度网讯科技有限公司 Bill processing method, device, electronic equipment and storage medium
CN112966583A (en) * 2021-02-26 2021-06-15 深圳壹账通智能科技有限公司 Image processing method, image processing device, computer equipment and storage medium
CN112699860B (en) * 2021-03-24 2021-06-22 成都新希望金融信息有限公司 Method for automatically extracting and sorting effective information in personal tax APP operation video
CN114120322B (en) * 2022-01-26 2022-05-10 深圳爱莫科技有限公司 Order commodity quantity identification result correction method and processing equipment
CN114662462A (en) * 2022-03-10 2022-06-24 江西工程学院 Accounting data processing method and system
CN114550194B (en) * 2022-04-26 2022-08-19 北京北大软件工程股份有限公司 Method and device for identifying letters and visitors

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101447017A (en) * 2008-11-27 2009-06-03 浙江工业大学 Method and system for quickly identifying and counting votes on the basis of layout analysis
CN101751121A (en) * 2008-12-12 2010-06-23 汉王科技股份有限公司 OCR-based wireless scanning input device and method
CN104112128A (en) * 2014-06-19 2014-10-22 中国工商银行股份有限公司 Digital image processing system applied to bill image character recognition and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1843276A1 (en) * 2006-04-03 2007-10-10 Océ-Technologies B.V. Method for automated processing of hard copy text documents

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
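Research and Implementation of OCR-Based Express Waybill Recognition; Hu Tikun; China Master's Theses Full-text Database, Information Science and Technology; 20140915 (No. 09); I138-821 *
Preprocessing and Segmentation for Handwritten Digit String Recognition on Bank Bills; Liu Peigen; China Master's Theses Full-text Database, Information Science and Technology; 20120715 (No. 07); I138-2179 *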

Also Published As

Publication number Publication date
CN105528604A (en) 2016-04-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20181211