CN105528604B - A kind of bill automatic identification and processing system based on OCR - Google Patents
- Publication number
- CN105528604B (application CN201610070970.8A)
- Authority
- CN
- China
- Prior art keywords
- text
- bill
- block
- image
- identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/24—Aligning, centring, orientation detection or correction of the image
- G06V10/243—Aligning, centring, orientation detection or correction of the image by compensating for image skew or non-uniform image deformations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/28—Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
- G06V30/287—Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/28—Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
- G06V30/293—Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of characters other than Kanji, Hiragana or Katakana
Abstract
The present invention provides an OCR-based system for automatic bill identification and processing. The system comprises an image capture module, a fast image binarization module, a text block detection and localization module, a precise localization module for single-column text blocks, a precise localization and segmentation module for multi-column text blocks, a text recognition module, and a bill image retrieval module. The invention not only captures invoice images in high definition and stores them in compressed form, but also robustly and accurately locates and recognizes the text in each bill, such as buyer and seller information, goods information, and the invoice date. Recognized bill images can be retrieved conveniently. The system offers high processing and recognition accuracy, low cost, good robustness, and a high degree of automation, and can be widely applied to computer-automated bill management, such as bill authentication, filing, and query.
Description
Technical field
The invention belongs to the field of pattern recognition and artificial intelligence, and in particular relates to an OCR-based system for automatic bill identification and processing.
Background technique
OCR-based automatic bill identification and processing refers to using equipment such as computers, together with OCR (optical character recognition) technology, to automatically extract and recognize the characters on paper bills and to process them accordingly. It is one of the key technologies for computer-automated bill processing. Although electronic payment and electronic bills are growing, traditional paper bills, such as paper invoices and financial documents of all kinds, are still widely used in everyday work and life. Existing approaches to computer processing of paper bills generally fall into the following categories:
(1) Automatic capture and storage of bill images. Paper bills are generally captured and stored in compressed form by dedicated equipment. Because the text and other information in the bill is not recognized or processed automatically, it is difficult to retrieve bills by content or to manage them effectively afterwards, for example to check and verify bill contents by computer.
(2) Manual entry of bill contents. Bill information is typed in and saved by hand so that the bills can later be managed by computer. This approach is unsuitable for processing bills at large scale; manual entry is also error-prone and labor costs are high.
(3) Computer recognition and processing of bills with simple layouts. This approach generally targets bills with relatively simple formats, such as bank checks. The information to be recognized usually sits at a fixed geometric position or is accompanied by special positioning marks, so it can be extracted using the positioning marks or a simple geometric transformation, and the characters are then recognized with OCR.
For bills with complex layouts, invoices in particular, the wide variety of bill types and the complexity and diversity of page formats mean that there is currently no general method or device that can effectively and automatically recognize the text in a bill. In view of the above, the present invention addresses the automatic entry and processing of information from invoices with complex layouts, and provides an effective automatic identification and processing method and system, in particular for value-added tax (VAT) invoices.
Summary of the invention
The object of the invention is to overcome the shortcomings of the bill processing approaches and systems described above and to provide a fast, high-precision system for automatic identification and processing of VAT invoices. The system acquires invoice images with a high-speed scanner; quickly and accurately extracts and recognizes bill information in the VAT invoice, including the names and taxpayer identification numbers of the seller and the buyer, goods information (product name, unit of measurement, quantity, amount, and tax amount), and the invoice date; and saves the bill images in compressed form and supports their retrieval.
An OCR-based system for automatic bill identification and processing comprises a bill image capture module, a fast image binarization module, a text block detection and localization module, a precise localization module for single-column text blocks, a precise localization and segmentation module for multi-column text blocks, a text recognition module, and a bill image retrieval module. After the image capture module acquires an invoice image, the fast image binarization module binarizes it; the text block detection and localization module detects and locates text blocks and, using the inclination of the horizontal divider lines found during detection, detects and corrects the skew of the image. The located text blocks are then precisely localized and segmented by the single-column localization module and by the multi-column localization and segmentation module, respectively, to obtain the text lines of bill information to be recognized. The text recognition module divides text lines into two kinds, pure digit strings and Chinese character strings, and recognizes each kind separately. The bill image retrieval module stores the captured bill images in association with the recognition results and supports retrieval over the bill image data; retrievable content includes the buyer information, seller information, goods information, and invoice date of each bill image.
Further, the system also includes a high-definition camera, which captures the invoice image and transmits it to a high-performance computer for saving. The user need only lay the invoice flat below the camera; the image capture module triggers the camera to capture the invoice image and saves it to a specified directory on the computer.
Further, after the user starts the system, it enters a waiting state. The user lays the invoice flat below the camera; the image capture module computes a trigger signal from consecutive frames, triggers the camera to capture the invoice image, and saves it to a specified directory on the PC. The system then automatically processes and recognizes the bill image, extracting and recognizing the names and taxpayer identification numbers of the seller and the buyer, the goods information, and the invoice date in the VAT invoice, and saves the bill image in compressed form in association with the recognized information. Using the resulting bill images and recognition data, the system provides automated retrieval and filing of bill images and, based on the recognized information, can authenticate the bill information against the database of the relevant tax authority.
Further, the bill image capture module controls the triggering of the high-definition camera and transfers the captured bill image to the high-performance computer for storage. The trigger signal is computed from the inter-frame difference of the image sequence acquired by the camera: if the difference between adjacent frames is below a given threshold and the proportion of foreground pixels in the image exceeds a predetermined value, a trigger signal is sent and the bill image is captured.
Further, the fast image binarization module binarizes the invoice image using a method based on the maximum between-class variance method (Otsu's method) applied to local blocks. The image is first converted to grayscale and divided into N sub-regions, where N is set according to the stroke width. In each sub-region, the binarization threshold T is determined with the maximum between-class variance method and the sub-region is binarized according to T. If the difference between the maximum and minimum gray values within a sub-region is below a preset value, the sub-region is set to background.
Further, the text block detection and localization module divides the content to be recognized, according to the layout of the VAT invoice, into buyer information, seller information, goods information, and invoice date text blocks, and locates and segments the corresponding text image blocks using a method based on line detection. First, Hough transform line detection is used to detect horizontal lines with an inclination of less than 45°, and the skew of the image is corrected according to the inclination of the detected lines. Then the five longest horizontal lines are taken and, using geometric ratio constraints on the line spacing, the horizontal divider lines of the VAT invoice are located. Finally, the invoice text blocks to be recognized are located from the geometric positions of the horizontal divider lines and the bill information blocks. The located text blocks are divided into single-column and multi-column text blocks. Single-column text blocks include the buyer's name and taxpayer identification number, the seller's name and taxpayer identification number, and the invoice date; the multi-column text block is the goods information block, which contains the product name, unit of measurement, quantity, amount, and tax amount columns. If text block segmentation fails, the bill is judged invalid.
Further, the precise localization module for single-column text blocks handles the precise localization of single-column text blocks. Each single-column text image block is first scanned row by row and the stroke crossing count of every row is computed. The row with the largest stroke crossing count is then taken as the starting row, and the block is scanned upward and downward from it; if K consecutive rows have stroke crossing counts below a predetermined threshold (K is determined experimentally), a line edge is considered found, which locates one text line. These steps are repeated on the remaining image until all text lines have been located. Finally, because text lines within the same text block have approximately equal heights, lines with abnormal heights are rejected, giving the precise position of each text line to be recognized in the single-column text image block. If text line segmentation fails, the bill is judged invalid.
Further, the precise localization and segmentation module for multi-column text blocks handles the multi-column text block, that is, the goods information image block. First, a Hough transform line detection method is used to locate the vertical divider lines in the image block, and invalid divider lines are rejected using geometric ratio constraints between the divider lines, which gives the image positions of the product name, unit of measurement, quantity, amount, and tax amount columns. Finally, because the fields of each goods item are horizontally aligned, the text line images to be recognized are precisely located and segmented. If text line segmentation fails, the bill is judged invalid.
Further, the text recognition module divides text lines into two kinds, pure digit strings and Chinese character strings, and recognizes each kind separately. A text line is recognized as follows: the line is first segmented into candidate characters by vertical projection analysis, and dynamic programming is then used to compute the optimal segmentation path, which yields the character recognition result. For a pure digit string, 8-direction gradient features are computed for each candidate character block and reduced in dimension with LDA; in the reduced feature vector space, a nearest-neighbor classifier produces the recognition confidence of each candidate character, which is fed into the optimal segmentation path computation. For a Chinese character string, 8-direction gradient features are likewise computed for each candidate character block and reduced with LDA, and a nearest-neighbor classifier in the reduced feature vector space produces the recognition confidence of each candidate character; this confidence is then combined with bigram language model information and with geometric information such as the aspect ratio of adjacent candidate character blocks, and the combined confidence is fed into the optimal segmentation path computation.
Compared with existing computer processing methods for bill images, the present invention has the following advantages:
(1) Bill images are captured with a high-definition camera, so the system is structurally simple, easy to use, and low in hardware cost. A high-speed scanner would not only cost more but would also be less convenient to operate.
(2) Because robust image processing algorithms are used, including the image binarization method and the Hough transform based line detection and text localization method, the system adapts well to changes in ambient lighting and to moderate adjustments of the bill page layout, and is therefore stable.
(3) Because character segmentation and recognition algorithms tailored to the characteristics of bill images are used together with the high-definition camera, the system obtains clear bill images for storage and filing and achieves relatively high character recognition accuracy. Experimental results show that the character recognition rate of the system can exceed 98%.
(4) The bill text extracted by the system is effectively associated with the captured bill image, so the system can be widely applied to computer-based bill management, such as bill authentication and bill query, and thus better meets the need for automated processing of bill images.
Detailed description of the invention
Fig. 1 is a schematic diagram of the processing flow of the bill automatic identification and processing system.
Fig. 2 is a flow chart of the image processing and recognition modules of the bill automatic identification and processing system.
Specific embodiment
The invention is described in further detail below with reference to the accompanying drawings; embodiments of the invention are not limited thereto.
Fig. 1 is a schematic diagram of the bill processing flow of the system. As shown in Fig. 1, the processing flow is as follows. After the user starts the hardware and software, the system enters a waiting state. The user lays the invoice flat below the camera; the image capture module computes a trigger signal from consecutive frames, triggers the camera to capture the invoice image, and saves it to a specified directory on the PC. The system then automatically processes and recognizes the bill image, extracting and recognizing bill information in the VAT invoice, including the names and taxpayer identification numbers of the seller and the buyer, goods information (product name, unit of measurement, quantity, amount, and tax amount), and the invoice date, and saves the bill image in compressed form in association with the recognized information. Using the bill images and recognition data it has generated, the system can provide automated retrieval and filing of bill images and, based on the recognized information, can authenticate the bill information against the database of the relevant tax authority.
Fig. 2 is a flow chart of the image processing and recognition modules of the system. As shown in Fig. 2, the flow is as follows. After the image processing module receives an invoice image, it binarizes the image, detects and locates the text blocks, and, using the inclination of the horizontal divider lines found during detection, detects and corrects the skew of the image. The located text blocks are divided into single-column and multi-column text blocks, which are precisely localized and segmented separately to obtain the text lines of bill information to be recognized. The system then performs, in order, projection-based segmentation of each line into candidate characters, computation of the confidence of the candidate characters, computation of the optimal segmentation path, and character recognition. Finally, the recognized information is associated with the captured bill image and stored.
The invention implements an OCR-based system for automatic bill identification and processing using a high-definition camera and a high-performance computer.
The high-definition camera captures the invoice image and transmits it to the high-performance computer for saving. The user need only lay the invoice flat below the camera; the image capture module triggers the camera to capture the invoice image and saves it to a specified directory on the computer.
(1) image capture module
This module controls the triggering of the high-definition camera and transfers the captured bill image to the high-performance computer for storage. The trigger signal is computed from the inter-frame difference of the image sequence acquired by the camera. If the difference between adjacent frames is below a certain threshold and the proportion of foreground pixels in the image exceeds a predetermined value, a trigger signal is sent and the bill image is captured.
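The trigger rule above can be sketched as follows. This is a minimal illustration, not the patented implementation: frames are assumed to be grayscale images stored as lists of rows, "foreground" is assumed to mean pixels that differ from a stored empty-scene background frame, and the function names and threshold values are hypothetical.

```python
def frame_difference(prev, curr):
    """Mean absolute gray-level difference between two equally sized frames."""
    total = 0
    count = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += abs(a - b)
            count += 1
    return total / count


def foreground_ratio(frame, background, tolerance=30):
    """Fraction of pixels that differ from the empty-scene background (assumed definition)."""
    changed = 0
    total = 0
    for row_f, row_b in zip(frame, background):
        for f, b in zip(row_f, row_b):
            total += 1
            if abs(f - b) > tolerance:
                changed += 1
    return changed / total


def should_trigger(prev, curr, background, diff_threshold=2.0, min_foreground=0.05):
    """Trigger when the scene is still and enough foreground is present."""
    still = frame_difference(prev, curr) < diff_threshold
    has_document = foreground_ratio(curr, background) > min_foreground
    return still and has_document
```

Requiring both conditions means the camera fires neither while the invoice is still being moved into place nor when the scene below the camera is empty.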
(2) rapid image binarization block
This module binarizes the invoice image. To cope with effects such as ambient lighting, it uses a binarization method based on the maximum between-class variance method (Otsu's method) applied to local blocks. The image is first converted to grayscale and divided into N sub-regions, where N is set according to the stroke width. In each sub-region, the binarization threshold T is determined with the maximum between-class variance method and the sub-region is binarized according to T. If the difference between the maximum and minimum gray values within a sub-region is below a preset value, the sub-region is set to background.
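A minimal sketch of this block-wise binarization is shown below. It assumes dark ink on lighter paper, 8-bit gray values stored as lists of rows, and square blocks of a caller-chosen size standing in for the N sub-regions; the names binarize_blocks and min_contrast and the default values are illustrative assumptions.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximizes the between-class variance
    of the two classes {p <= t} and {p > t}."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray values in the dark class
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize_blocks(gray, block, min_contrast=40):
    """Binarize block by block; 1 marks ink, 0 marks background."""
    height, width = len(gray), len(gray[0])
    out = [[0] * width for _ in range(height)]
    for top in range(0, height, block):
        for left in range(0, width, block):
            rows = range(top, min(top + block, height))
            cols = range(left, min(left + block, width))
            pixels = [gray[r][c] for r in rows for c in cols]
            if max(pixels) - min(pixels) < min_contrast:
                continue  # low-contrast block: leave it entirely as background
            t = otsu_threshold(pixels)
            for r in rows:
                for c in cols:
                    out[r][c] = 1 if gray[r][c] <= t else 0
    return out
```

The contrast check matters because Otsu's method always returns some threshold, even for a block of blank paper, where it would otherwise turn sensor noise into spurious ink.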
(3) text block detection and locating module
According to the layout of the VAT invoice, the content to be recognized is divided into text blocks such as buyer information, seller information, goods information, and invoice date, and the corresponding text image blocks are located and segmented using a method based on line detection. First, Hough transform line detection is used to detect horizontal lines with an inclination of less than 45°, and the skew of the image is corrected according to the inclination of the detected lines. Then the five longest horizontal lines are taken and, using geometric ratio constraints on the line spacing, the horizontal divider lines of the VAT invoice are located. Finally, the invoice text blocks to be recognized are located from the geometric positions of the horizontal divider lines and the bill information blocks. The located text blocks are divided into single-column and multi-column text blocks. Single-column text blocks include the buyer's name and taxpayer identification number, the seller's name and taxpayer identification number, the invoice date, and so on; the multi-column text block is the goods information block (with columns such as product name, unit of measurement, quantity, amount, and tax amount). If text block segmentation fails, the bill is returned as invalid.
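The sketch below illustrates the steps that follow line detection. It assumes the line segments have already been produced by a Hough transform detector and are given as (x1, y1, x2, y2) tuples; the length-weighted averaging of inclinations is an assumption made here, since the text only states that the correction uses the inclination of the detected lines. All function names are hypothetical.

```python
import math


def segment_angle(seg):
    """Inclination of a segment in degrees, folded into (-90, 90]."""
    x1, y1, x2, y2 = seg
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
    if angle > 90:
        angle -= 180
    elif angle <= -90:
        angle += 180
    return angle


def segment_length(seg):
    x1, y1, x2, y2 = seg
    return math.hypot(x2 - x1, y2 - y1)


def horizontal_candidates(segments, max_tilt=45.0):
    """Keep segments whose inclination is below max_tilt degrees."""
    return [s for s in segments if abs(segment_angle(s)) < max_tilt]


def estimate_skew(segments):
    """Length-weighted mean inclination of the near-horizontal segments."""
    candidates = horizontal_candidates(segments)
    total = sum(segment_length(s) for s in candidates)
    if total == 0:
        return 0.0
    weighted = sum(segment_angle(s) * segment_length(s) for s in candidates)
    return weighted / total


def longest_lines(segments, count=5):
    """The `count` longest near-horizontal segments, ordered top to bottom."""
    candidates = sorted(horizontal_candidates(segments),
                        key=segment_length, reverse=True)[:count]
    return sorted(candidates, key=lambda s: (s[1] + s[3]) / 2)
```

The image would then be rotated by the negative of the estimated skew before the divider lines are matched against the expected spacing ratios of the invoice layout.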
(4) pinpoint module of single column text block
This module handles the precise localization of single-column text blocks. For each single-column text image block, the algorithm first scans row by row and computes the stroke crossing count of every row. The row with the largest stroke crossing count is then taken as the starting row, and the block is scanned upward and downward from it; if K consecutive rows have stroke crossing counts below a predetermined threshold (K is determined experimentally), a line edge is considered found, which locates one text line. These steps are repeated on the remaining image until all text lines have been located. Finally, because text lines within the same text block have approximately equal heights, lines with abnormal heights are rejected, giving the precise position of each text line to be recognized in the single-column text image block. If text line segmentation fails, the bill is returned as invalid.
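The procedure above can be sketched as follows, assuming a binary block stored as lists of rows with 1 for ink. The stroke crossing count is interpreted here as the number of background-to-ink transitions along a row, and "abnormal height" is interpreted as deviating from the median line height by more than a tolerance; both interpretations, the names, and the default values are assumptions.

```python
def crossing_counts(binary):
    """Per-row number of background-to-ink transitions."""
    counts = []
    for row in binary:
        prev, n = 0, 0
        for p in row:
            if p and not prev:
                n += 1
            prev = p
        counts.append(n)
    return counts


def find_edge(counts, start, step, k, threshold):
    """Walk from `start` in direction `step` (+1 or -1) until k consecutive
    rows fall below `threshold`; return the last row that still held text."""
    last_text = start
    low_run = 0
    r = start + step
    while 0 <= r < len(counts):
        if counts[r] < threshold:
            low_run += 1
            if low_run >= k:
                break
        else:
            low_run = 0
            last_text = r
        r += step
    return last_text


def locate_text_lines(binary, k=2, threshold=1):
    """Repeatedly seed at the busiest remaining row and grow a line around it.
    `threshold` must be at least 1 so that the loop terminates."""
    counts = crossing_counts(binary)
    lines = []
    while counts and max(counts) >= threshold:
        seed = counts.index(max(counts))
        top = find_edge(counts, seed, -1, k, threshold)
        bottom = find_edge(counts, seed, +1, k, threshold)
        lines.append((top, bottom))
        for r in range(top, bottom + 1):
            counts[r] = 0  # mark these rows as consumed
    return sorted(lines)


def reject_abnormal(lines, tolerance=0.5):
    """Drop lines whose height deviates too far from the median height."""
    if not lines:
        return []
    heights = sorted(b - t + 1 for t, b in lines)
    median = heights[len(heights) // 2]
    return [(t, b) for t, b in lines
            if abs((b - t + 1) - median) <= tolerance * median]
```

Requiring K consecutive low rows, rather than a single one, keeps a line from being split where a thin horizontal gap runs through the characters.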
(5) accurate positioning of multicolumn text block and segmentation module
This module handles the precise localization of the multi-column text block, that is, the goods information image block. First, a Hough transform line detection method is used to locate the vertical divider lines in the image block, and invalid divider lines are rejected using geometric ratio constraints between the divider lines, which gives the image positions of columns such as product name, unit of measurement, quantity, amount, and tax amount. Finally, because the fields of each goods item are horizontally aligned, the text line images to be recognized are precisely located and segmented. If text line segmentation fails, the bill is returned as invalid.
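One way to apply a geometric ratio constraint is to compare each candidate divider with the position a layout template predicts for it. The sketch below assumes candidate x positions are already available from line detection; the template fractions, the tolerance, and the function names are hypothetical, and returning None stands in for the invalid-bill case.

```python
def select_separators(candidates, width, expected_fractions, tolerance=0.02):
    """For each expected divider position (a fraction of the block width),
    pick the nearest candidate; return None if any divider is missing."""
    selected = []
    for fraction in expected_fractions:
        target = fraction * width
        best = min(candidates, key=lambda x: abs(x - target), default=None)
        if best is None or abs(best - target) > tolerance * width:
            return None
        selected.append(best)
    return selected


def column_spans(separators):
    """Consecutive divider pairs as (left, right) column spans."""
    return list(zip(separators, separators[1:]))


def cell_boxes(separators, row_bands):
    """Cross the column spans with the shared row bands, giving one
    (left, top, right, bottom) box per field of every goods item."""
    return [[(left, top, right, bottom) for left, right in column_spans(separators)]
            for top, bottom in row_bands]
```

Because the fields of one goods item are horizontally aligned, the row bands found in a single column can be reused for every column, which is what cell_boxes relies on.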
(6) text identification module
Text lines are divided into two kinds, pure digit strings (such as the taxpayer identification number) and Chinese character strings, and each kind is recognized separately. A text line is recognized as follows: the line is first segmented into candidate characters by vertical projection analysis, and dynamic programming is then used to compute the optimal segmentation path, which yields the character recognition result. For a pure digit string, 8-direction gradient features are computed for each candidate character block and reduced in dimension with LDA; in the reduced feature vector space, a nearest-neighbor classifier produces the recognition confidence of each candidate character, which is fed into the optimal segmentation path computation. For a Chinese character string, 8-direction gradient features are likewise computed for each candidate character block and reduced with LDA, and a nearest-neighbor classifier in the reduced feature vector space produces the recognition confidence of each candidate character; this confidence is then combined with bigram language model information and with geometric information such as the aspect ratio of adjacent candidate character blocks, and the combined confidence is fed into the optimal segmentation path computation.
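The optimal segmentation path computation can be sketched with a small dynamic program. The sketch assumes the candidate cut positions come from the vertical projection analysis and that a caller-supplied score function returns a (label, confidence) pair for the block between two cuts, standing in for the gradient feature, LDA, and nearest-neighbor stages. Weighting each confidence by block width is an assumption made here so that paths with different numbers of characters are comparable; the patent does not state the path score.

```python
def best_segmentation(cuts, score, max_merge=3):
    """Find the sequence of character blocks with the highest total score.

    cuts      -- sorted x positions of candidate cut points
    score     -- function (left, right) -> (label, confidence in [0, 1])
    max_merge -- at most this many primitive segments form one character
    """
    n = len(cuts)
    best = [float("-inf")] * n   # best[j]: best score of a path ending at cut j
    back = [None] * n            # back[j]: (previous cut index, label)
    best[0] = 0.0
    for j in range(1, n):
        for i in range(max(0, j - max_merge), j):
            label, confidence = score(cuts[i], cuts[j])
            total = best[i] + confidence * (cuts[j] - cuts[i])
            if total > best[j]:
                best[j] = total
                back[j] = (i, label)
    labels = []
    j = n - 1
    while j > 0:                 # follow the back pointers to recover the text
        i, label = back[j]
        labels.append(label)
        j = i
    return "".join(reversed(labels)), best[-1]
```

As a usage example, suppose a line holding "18" is over-segmented so that the "8" is split in two. With a toy scorer that rates the merged block highly, the path through the merged block wins:

```python
toy_scores = {
    (0, 10): ("1", 0.9), (10, 20): ("3", 0.4), (20, 30): ("1", 0.3),
    (10, 30): ("8", 0.95), (0, 20): ("4", 0.2), (0, 30): ("0", 0.1),
}
text, total = best_segmentation([0, 10, 20, 30], lambda a, b: toy_scores[(a, b)])
```

For Chinese character strings, the score function would return the combined confidence that also reflects the bigram language model and the aspect ratio of adjacent blocks.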
(7) bill images retrieval module
This module stores the captured bill images in association with the recognition results and supports retrieval over the bill image data. Retrievable content includes the buyer information, seller information, goods information, invoice date, and so on of each bill image, which makes it convenient to retrieve archived bill images.
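As an illustration of associated storage and retrieval, the sketch below keeps one record per bill image in a SQLite table and searches by any combination of recognized fields. The table layout, field names, and substring matching are assumptions made for the example; the patent does not specify a storage format.

```python
import sqlite3

SEARCHABLE = ("buyer", "seller", "goods", "issue_date")


def open_store(path=":memory:"):
    """Open (or create) the store linking image files to recognized fields."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bills ("
        "image_path TEXT PRIMARY KEY, buyer TEXT, seller TEXT, "
        "goods TEXT, issue_date TEXT)"
    )
    return conn


def add_bill(conn, image_path, buyer, seller, goods, issue_date):
    """Store the recognition result together with the path of its image."""
    conn.execute(
        "INSERT OR REPLACE INTO bills VALUES (?, ?, ?, ?, ?)",
        (image_path, buyer, seller, goods, issue_date),
    )
    conn.commit()


def find_bills(conn, **criteria):
    """Return image paths whose fields contain all the given substrings."""
    clauses, params = [], []
    for field, value in criteria.items():
        if field not in SEARCHABLE:
            raise ValueError("unknown field: " + field)
        clauses.append(field + " LIKE ?")
        params.append("%" + value + "%")
    where = " AND ".join(clauses) if clauses else "1"
    rows = conn.execute(
        "SELECT image_path FROM bills WHERE " + where + " ORDER BY image_path",
        params,
    )
    return [row[0] for row in rows]
```

Field names are checked against a fixed list before being placed in the query, while the search values are always passed as parameters.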
The embodiment described above is a preferred embodiment of the invention, but embodiments of the invention are not limited by it. Any other change, modification, or substitution made without departing from the spirit and principle of the invention shall be regarded as an equivalent replacement and is included within the protection scope of the invention.
Claims (7)
1. An OCR-based bill automatic identification and processing system, characterized by comprising a bill image capture module, a fast image binarization module, a text block detection and localization module, a precise localization module for single-column text blocks, a precise localization and segmentation module for multi-column text blocks, a text recognition module, and a bill image retrieval module; after the image capture module acquires an invoice image, the fast image binarization module binarizes the image, and the text block detection and localization module detects and locates text blocks and, according to the inclination of the horizontal divider lines determined during detection, detects and corrects the skew of the image; according to the located text blocks, precise localization and segmentation are carried out by the precise localization module for single-column text blocks and by the precise localization and segmentation module for multi-column text blocks, respectively, to obtain the text lines of bill information to be recognized; the text recognition module divides text lines into two kinds, pure digit strings and Chinese character strings, and recognizes each kind separately; the bill image retrieval module stores the captured bill images in association with the recognition results and supports retrieval over the bill image data, the retrievable content including the buyer information, seller information, goods information, and invoice date of the bill images;
after the user starts the system, the system enters a waiting state; the user lays the invoice flat below the camera, and the image capture module computes a trigger signal from consecutive frames, triggers the camera to capture the invoice image, and saves it to a specified directory on the PC; the system then automatically processes and recognizes the bill image, extracting and recognizing the names and taxpayer identification numbers of the seller and the buyer, the goods information, and the invoice date in the VAT invoice, and saves the bill image in compressed form in association with the recognized information; according to the generated bill images and recognition data, the system provides automated retrieval and filing of bill images and, according to the recognized information, can authenticate the bill information against the database of the relevant tax authority.
2. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: it further comprises a high-definition camera, which captures the invoice image and transfers it to a high-performance computer for storage; the user only needs to place the flattened invoice under the camera, whereupon the image acquisition module triggers the camera to capture the invoice image, which is saved to a specified directory on the computer.
3. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the bill image acquisition module is responsible for trigger control of the high-definition camera and for transferring the acquired bill images to the high-performance computer for storage; the trigger signal is computed from the inter-frame difference of the image sequence captured by the high-definition camera; if the difference between adjacent frames is less than a set threshold and the proportion of foreground pixels in the image is greater than a predetermined value, a trigger signal is sent to capture the bill image.
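The trigger rule of claim 3 (scene is still, and enough of the frame is covered by foreground) could be sketched as follows. The claim does not say how the inter-frame difference is measured or how foreground pixels are identified; this sketch assumes a mean absolute gray-level difference and a comparison against a reference image of the empty desk. The function name `should_trigger` and all default thresholds are hypothetical.

```python
def should_trigger(prev_frame, curr_frame, background,
                   diff_threshold=2.0, fg_pixel_delta=30, fg_ratio_min=0.2):
    """Decide whether to capture the invoice image.

    Frames are 2-D lists of gray levels (0-255) of identical size.
    `background` is an assumed reference image of the empty scene.
    Returns True when adjacent frames barely differ (the invoice is at rest)
    and the share of foreground pixels exceeds `fg_ratio_min`.
    """
    total = 0
    diff_sum = 0
    fg_count = 0
    for prev_row, curr_row, bg_row in zip(prev_frame, curr_frame, background):
        for p, c, b in zip(prev_row, curr_row, bg_row):
            diff_sum += abs(c - p)          # inter-frame difference
            if abs(c - b) > fg_pixel_delta:  # pixel differs from empty scene
                fg_count += 1
            total += 1
    if total == 0:
        return False
    mean_diff = diff_sum / total
    fg_ratio = fg_count / total
    return mean_diff < diff_threshold and fg_ratio > fg_ratio_min
```

Requiring both conditions keeps the camera from firing while the invoice is still being moved into place (large inter-frame difference) and while nothing is under the camera (low foreground ratio).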
4. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the fast image binarization module binarizes the invoice image using a method based on the maximum between-class variance method and local blocks; the image is first converted to grayscale and divided into N sub-regions, where the value of N is set according to the stroke width; then, within each sub-region, the binarization threshold T is determined by the maximum between-class variance method and the sub-region is binarized according to T; if the difference between the maximum and minimum gray values within a sub-region is less than a preset value, the sub-region is set to background.
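A minimal pure-Python sketch of the block-wise binarization in claim 4 follows. The maximum between-class variance method is commonly known as Otsu's method. The claim sets the number of sub-regions from the stroke width but gives no formula, so the square `block_size` parameter, the `min_contrast` default, and the convention that dark pixels become foreground (1) are assumptions.

```python
def otsu_threshold(pixels):
    """Return the gray level T that maximizes between-class variance.

    Pixels with value <= T form one class, pixels > T the other.
    """
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class
    sum0 = 0    # sum of gray levels in the dark class
    for t in range(256):
        w0 += hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2  # proportional to between-class variance
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize_blockwise(gray, block_size, min_contrast=30):
    """Binarize a 2-D list of gray levels block by block.

    Blocks whose gray range is below `min_contrast` are left as background (0).
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for top in range(0, height, block_size):
        for left in range(0, width, block_size):
            rows = range(top, min(top + block_size, height))
            cols = range(left, min(left + block_size, width))
            pixels = [gray[r][c] for r in rows for c in cols]
            if max(pixels) - min(pixels) < min_contrast:
                continue  # low contrast: treat the whole block as background
            t = otsu_threshold(pixels)
            for r in rows:
                for c in cols:
                    out[r][c] = 1 if gray[r][c] <= t else 0
    return out
```

The low-contrast check matters because Otsu's method always returns some threshold, even for a block of plain paper, which would otherwise turn sensor noise into spurious foreground.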
5. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the text block detection and positioning module, based on the layout of the VAT invoice, divides the content to be recognized into buyer information, seller information, goods information, and invoice date text blocks, and positions and segments the corresponding text image blocks using a method based on line detection; first, Hough-transform line detection is used to detect horizontal lines with an inclination angle of less than 45°, and the image is skew-corrected according to the detected inclination of the horizontal lines; then, the 5 longest horizontal lines are taken and, using geometric ratio constraints on the line spacing, the horizontal dividing lines of the VAT invoice are located; finally, the invoice text information blocks to be recognized are located from the geometric positions of the horizontal dividing lines and the bill information blocks; the located text information blocks are divided into single-column text blocks and multi-column text blocks; the single-column text blocks comprise the buyer name and taxpayer identification number, the seller name and taxpayer identification number, and the invoice date; the multi-column text block comprises the goods information block, which contains the goods name, unit of measurement, quantity, amount, and tax amount columns; if text block segmentation fails, the bill is judged invalid.
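The line selection and skew estimation steps of claim 5 could be sketched as below, assuming a line detector (for instance a Hough transform) has already produced line segments as `(x1, y1, x2, y2)` tuples. The claim does not say how a single skew angle is derived from several lines; the length-weighted mean used here is an assumption, and the spacing-ratio check that identifies the dividing lines is not covered. All function names are hypothetical.

```python
import math


def segment_angle_deg(seg):
    """Angle of a segment relative to the x axis, in degrees, in [-90, 90]."""
    x1, y1, x2, y2 = seg
    if x2 < x1:  # order the endpoints left to right
        x1, y1, x2, y2 = x2, y2, x1, y1
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def segment_length(seg):
    """Euclidean length of a segment."""
    x1, y1, x2, y2 = seg
    return math.hypot(x2 - x1, y2 - y1)


def longest_horizontal_lines(segments, max_angle=45.0, keep=5):
    """Keep segments inclined by less than `max_angle`, longest first."""
    horizontal = [s for s in segments if abs(segment_angle_deg(s)) < max_angle]
    return sorted(horizontal, key=segment_length, reverse=True)[:keep]


def estimate_skew_deg(lines):
    """Length-weighted mean inclination of the given lines (assumed rule)."""
    total = sum(segment_length(s) for s in lines)
    if total == 0:
        return 0.0
    weighted = sum(segment_angle_deg(s) * segment_length(s) for s in lines)
    return weighted / total
```

The image would then be rotated by the negative of the estimated angle before the dividing lines are matched against the expected invoice layout. Note that in image coordinates the y axis points downward, so the sign convention of the angle should be checked against the rotation routine that is used.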
6. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the single-column text block precise positioning module mainly handles the precise positioning of single-column text blocks; each single-column text image block is first scanned row by row and the number of stroke crossings of each row is computed; then, taking the row with the largest number of stroke crossings as the starting row, the block is scanned upward and downward; if K consecutive rows have a number of stroke crossings below a predetermined threshold, a line edge is deemed to be found, thereby locating one text line; the above steps are then repeated on the remaining image until all text lines have been located; finally, using the property that text lines within the same text block have approximately equal heights, lines of abnormal height are rejected, so that the text lines to be recognized in each single-column text image block are precisely located; if text line segmentation fails, the bill is judged invalid;
The multi-column text block precise positioning and segmentation module mainly handles the precise positioning of multi-column text blocks, i.e. the goods information image block; first, a Hough-transform-based line detection method is used to locate the vertical dividing lines in the image block, and invalid dividing lines are rejected using geometric ratio constraints between the dividing lines, thereby locating the image positions corresponding to the goods name, unit of measurement, quantity, amount, and tax amount columns; finally, using the property that the image positions of each goods item are horizontally aligned, the text line images to be recognized are precisely located and segmented; if text line segmentation fails, the bill is judged invalid.
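The row-scanning procedure for single-column text blocks in claim 6 could be sketched as follows, operating on a binary image given as a 2-D list with 1 for stroke pixels. A "stroke crossing" is taken here to be a background-to-foreground transition along a row, which is an interpretation of the claim. The function names, the default values of `k` and `min_crossings`, and the median-based height test are assumptions.

```python
def stroke_crossings(row):
    """Count background-to-foreground (0 -> 1) transitions in one pixel row."""
    count, prev = 0, 0
    for v in row:
        if v and not prev:
            count += 1
        prev = v
    return count


def find_edge(counts, start, step, k, min_crossings):
    """Walk from `start` in direction `step` (+1 or -1).

    Returns the last text row seen before k consecutive low-count rows
    (or before the image border).
    """
    last_text, low_run = start, 0
    i = start + step
    while 0 <= i < len(counts):
        if counts[i] < min_crossings:
            low_run += 1
            if low_run >= k:
                break
        else:
            low_run = 0
            last_text = i
        i += step
    return last_text


def locate_text_lines(binary, k=2, min_crossings=1):
    """Return (top_row, bottom_row) pairs for every text line, top to bottom."""
    if min_crossings < 1:
        raise ValueError("min_crossings must be at least 1")
    counts = [stroke_crossings(row) for row in binary]
    lines = []
    while counts and max(counts) >= min_crossings:
        seed = counts.index(max(counts))  # row with the most stroke crossings
        top = find_edge(counts, seed, -1, k, min_crossings)
        bottom = find_edge(counts, seed, +1, k, min_crossings)
        lines.append((top, bottom))
        for r in range(top, bottom + 1):
            counts[r] = 0  # exclude located rows from the next pass
    return sorted(lines)


def reject_abnormal_heights(lines, tolerance=0.5):
    """Drop lines whose height deviates from the median by more than tolerance."""
    if not lines:
        return []
    heights = sorted(b - t + 1 for t, b in lines)
    median = heights[len(heights) // 2]
    return [(t, b) for t, b in lines
            if abs((b - t + 1) - median) <= tolerance * median]
```

Allowing up to `k - 1` low-count rows inside a line keeps characters with thin horizontal gaps from being split, while the final height filter removes fragments such as stamps or ruling lines that survive the scan.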
7. The OCR-based automatic bill recognition and processing system according to claim 1, characterized in that: the text line recognition module divides text lines into two types, pure digit strings and Chinese character strings, and recognizes each type separately; the text line recognition procedure is: first, character segmentation of the line based on vertical projection analysis is performed, and then the optimal segmentation path is computed by dynamic programming to obtain the character recognition result; for pure digit strings, 8-direction gradient features are first computed for each candidate character segmentation block and reduced in dimension by LDA; then, in the reduced feature vector space, a nearest-neighbor classifier yields the recognition confidence of each candidate character, which is substituted into the optimal segmentation path calculation of text line recognition; for Chinese character strings, 8-direction gradient features are first computed for each candidate character segmentation block and reduced in dimension by LDA; in the reduced feature vector space, a nearest-neighbor classifier yields the recognition confidence of each candidate character; this confidence is combined with bigram language model information and the aspect-ratio geometric information of adjacent candidate character segmentation blocks, and the combined confidence is substituted into the optimal segmentation path calculation of text line recognition.
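The dynamic-programming search for the optimal segmentation path in claim 7 could be sketched as follows. Vertical projection is assumed to have produced a list of candidate cut positions; any block between cut `i` and cut `j` is a candidate character with a confidence supplied by a caller-provided function. The claim does not define the path score, so summing log confidences is an assumption, as are the names `best_segmentation`, `block_score`, and `max_span`.

```python
import math


def best_segmentation(num_cuts, block_score, max_span=3):
    """Find the highest-scoring way to split a text line into characters.

    num_cuts    : number of candidate cut positions, indexed 0 .. num_cuts - 1
    block_score : function (i, j) -> confidence in (0, 1] for the block
                  between cuts i and j, or None if the block is rejected
    max_span    : maximum number of elementary pieces merged into one block
    Returns a list of (i, j) blocks, or None if no complete path exists.
    """
    if num_cuts < 2:
        return []
    neg_inf = float("-inf")
    best = [neg_inf] * num_cuts  # best[j]: best log-score of a path ending at j
    back = [-1] * num_cuts       # back[j]: previous cut on that best path
    best[0] = 0.0
    for j in range(1, num_cuts):
        for i in range(max(0, j - max_span), j):
            if best[i] == neg_inf:
                continue
            conf = block_score(i, j)
            if conf is None or conf <= 0:
                continue
            candidate = best[i] + math.log(conf)
            if candidate > best[j]:
                best[j], back[j] = candidate, i
    if best[-1] == neg_inf:
        return None
    path, j = [], num_cuts - 1
    while j > 0:  # follow back-pointers from the last cut to the first
        path.append((back[j], j))
        j = back[j]
    path.reverse()
    return path
```

A summed log-confidence score slightly favors paths with fewer blocks, since every block contributes a non-positive term; a practical system might normalize by path length or add the language-model and geometry terms mentioned in the claim to counter this.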
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610070970.8A CN105528604B (en) | 2016-01-31 | 2016-01-31 | A kind of bill automatic identification and processing system based on OCR |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105528604A CN105528604A (en) | 2016-04-27 |
CN105528604B true CN105528604B (en) | 2018-12-11 |
Family
ID=55770818
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610070970.8A Expired - Fee Related CN105528604B (en) | 2016-01-31 | 2016-01-31 | A kind of bill automatic identification and processing system based on OCR |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105528604B (en) |
Families Citing this family (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106485246B (en) * | 2016-09-19 | 2019-07-16 | 北京小米移动软件有限公司 | Character identifying method and device |
CN106650714A (en) * | 2016-10-08 | 2017-05-10 | 迪堡金融设备有限公司 | Paper note serial number identification method and apparatus |
CN106485243B (en) * | 2016-10-31 | 2019-10-22 | 用友网络科技股份有限公司 | A kind of bank slip recognition error correction method and device |
CN108242050A (en) * | 2016-12-27 | 2018-07-03 | 航天信息股份有限公司 | The processing method and processing device of electronic invoice |
CN106886776A (en) * | 2017-02-23 | 2017-06-23 | 山东浪潮云服务信息科技有限公司 | The application model of license electronization is realized in a kind of utilization image recognition |
CN107133571A (en) * | 2017-04-11 | 2017-09-05 | 上海众开信息科技有限公司 | A kind of system and method that paper invoice is automatically generated to financial statement |
CN107133618B (en) * | 2017-04-24 | 2021-03-19 | 北京中安未来科技有限公司 | Electronic certificate identification triggering method and device |
CN107169488A (en) * | 2017-05-03 | 2017-09-15 | 四川长虹电器股份有限公司 | A kind of correction system and antidote of bill scan image |
CN107194400B (en) * | 2017-05-31 | 2019-12-20 | 北京天宇星空科技有限公司 | Financial reimbursement full ticket image recognition processing method |
CN107392260B (en) * | 2017-06-08 | 2020-03-17 | 中国民生银行股份有限公司 | Error calibration method and device for character recognition result |
CN107679442A (en) * | 2017-06-23 | 2018-02-09 | 平安科技(深圳)有限公司 | Method, apparatus, computer equipment and the storage medium of document Data Enter |
CN109299798A (en) * | 2017-07-25 | 2019-02-01 | 阿里巴巴集团控股有限公司 | Processing method, device and the electronic equipment of travel information |
CN109426814B (en) * | 2017-08-22 | 2023-02-24 | 顺丰科技有限公司 | Method, system and equipment for positioning and identifying specific plate of invoice picture |
CN107622266B (en) * | 2017-09-21 | 2019-05-07 | 平安科技(深圳)有限公司 | A kind of processing method, storage medium and the server of OCR identification |
CN107633239B (en) * | 2017-10-18 | 2020-11-03 | 中电鸿信信息科技有限公司 | Bill classification and bill field extraction method based on deep learning and OCR |
CN109840520A (en) * | 2017-11-24 | 2019-06-04 | 中国移动通信集团广东有限公司 | A kind of invoice key message recognition methods and system |
CN110109907B (en) * | 2017-12-27 | 2021-08-24 | 航天信息股份有限公司 | Tax data storage and query method and device |
CN109993619B (en) * | 2017-12-29 | 2022-09-30 | 北京京东尚科信息技术有限公司 | Data processing method |
CN108446699A (en) * | 2018-02-08 | 2018-08-24 | 东华大学 | Identity card pictorial information identifying system under a kind of complex scene |
CN108460418B (en) * | 2018-03-07 | 2021-09-28 | 南京邮电大学 | Invoice classification method based on character recognition and semantic analysis |
CN108460381B (en) * | 2018-03-13 | 2022-06-10 | 南京邮电大学 | Invoice reimbursement information positioning and intercepting method based on image recognition |
CN108549890A (en) * | 2018-03-22 | 2018-09-18 | 南京邮电大学 | Invoice tilt detection based on image recognition and geometric correction method |
CN108549843A (en) * | 2018-03-22 | 2018-09-18 | 南京邮电大学 | A kind of VAT invoice recognition methods based on image procossing |
CN108734849B (en) * | 2018-04-25 | 2020-11-13 | 新浪网技术(中国)有限公司 | Automatic invoice true-checking method and system |
CN110457973A (en) * | 2018-05-07 | 2019-11-15 | 北京中海汇银财税服务有限公司 | A kind of method and system of bank slip recognition |
CN108717543B (en) * | 2018-05-14 | 2022-01-14 | 北京市商汤科技开发有限公司 | Invoice identification method and device and computer storage medium |
CN109034159B (en) * | 2018-05-28 | 2021-05-28 | 北京捷通华声科技股份有限公司 | Image information extraction method and device |
CN109271910A (en) * | 2018-09-04 | 2019-01-25 | 阿里巴巴集团控股有限公司 | A kind of Text region, character translation method and apparatus |
CN109544774A (en) * | 2018-11-30 | 2019-03-29 | 上海贞众创空间管理有限公司 | A kind of smart tickets archival device |
CN109726710A (en) * | 2018-12-27 | 2019-05-07 | 平安科技(深圳)有限公司 | Invoice information acquisition method, electronic device and readable storage medium storing program for executing |
CN110263239B (en) * | 2019-05-31 | 2023-08-22 | 平安科技(深圳)有限公司 | Invoice identification method and device, storage medium and computer equipment |
CN110675270A (en) * | 2019-09-05 | 2020-01-10 | 平安健康保险股份有限公司 | Method and device for determining medical insurance deduction amount based on invoice information |
CN110659607A (en) * | 2019-09-23 | 2020-01-07 | 天津车之家数据信息技术有限公司 | Data checking method, device and system and computing equipment |
CN110895690A (en) * | 2019-10-11 | 2020-03-20 | 南京邮电大学 | Invoice positioning method based on openCV morphology |
CN111126319A (en) * | 2019-12-27 | 2020-05-08 | 山东旗帜信息有限公司 | Invoice identification method and device |
CN111209827B (en) * | 2019-12-31 | 2023-07-14 | 中国南方电网有限责任公司 | Method and system for OCR (optical character recognition) bill problem based on feature detection |
CN111209865A (en) * | 2020-01-06 | 2020-05-29 | 中科鼎富(北京)科技发展有限公司 | File content extraction method and device, electronic equipment and storage medium |
US11570099B2 (en) | 2020-02-04 | 2023-01-31 | Bank Of America Corporation | System and method for autopartitioning and processing electronic resources |
CN111444793A (en) * | 2020-03-13 | 2020-07-24 | 安诚迈科(北京)信息技术有限公司 | Bill recognition method, equipment, storage medium and device based on OCR |
CN111291741B (en) * | 2020-05-13 | 2020-11-03 | 太平金融科技服务(上海)有限公司 | Receipt identification method and device, computer equipment and storage medium |
CN112784014B (en) * | 2021-01-15 | 2022-03-25 | 中国核动力研究设计院 | Safe full-text retrieval system and method based on multi-source heterogeneous system |
CN112949450B (en) * | 2021-02-25 | 2024-01-23 | 北京百度网讯科技有限公司 | Bill processing method, device, electronic equipment and storage medium |
CN112966583A (en) * | 2021-02-26 | 2021-06-15 | 深圳壹账通智能科技有限公司 | Image processing method, image processing device, computer equipment and storage medium |
CN112699860B (en) * | 2021-03-24 | 2021-06-22 | 成都新希望金融信息有限公司 | Method for automatically extracting and sorting effective information in personal tax APP operation video |
CN114120322B (en) * | 2022-01-26 | 2022-05-10 | 深圳爱莫科技有限公司 | Order commodity quantity identification result correction method and processing equipment |
CN114662462A (en) * | 2022-03-10 | 2022-06-24 | 江西工程学院 | Accounting data processing method and system |
CN114550194B (en) * | 2022-04-26 | 2022-08-19 | 北京北大软件工程股份有限公司 | Method and device for identifying letters and visitors |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101447017A (en) * | 2008-11-27 | 2009-06-03 | 浙江工业大学 | Method and system for quickly identifying and counting votes on the basis of layout analysis |
CN101751121A (en) * | 2008-12-12 | 2010-06-23 | 汉王科技股份有限公司 | OCR-based wireless scanning input device and method |
CN104112128A (en) * | 2014-06-19 | 2014-10-22 | 中国工商银行股份有限公司 | Digital image processing system applied to bill image character recognition and method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1843276A1 (en) * | 2006-04-03 | 2007-10-10 | Océ-Technologies B.V. | Method for automated processing of hard copy text documents |
Non-Patent Citations (2)
Title |
---|
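Research and Implementation of OCR-Based Express Waybill Recognition; 胡提坤; 《中国优秀硕士学位论文全文数据库 信息科技辑》; 20140915 (No. 09); I138-821 * |
Preprocessing and Segmentation for Handwritten Digit String Recognition on Bank Bills; 刘培根; 《中国优秀硕士学位论文全文数据库 信息科技辑》; 20120715 (No. 07); I138-2179 * |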
Also Published As
Publication number | Publication date |
---|---|
CN105528604A (en) | 2016-04-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105528604B (en) | A kind of bill automatic identification and processing system based on OCR | |
US10943105B2 (en) | Document field detection and parsing | |
CN108491799B (en) | Intelligent sales counter commodity management method and system based on image recognition | |
Grüning et al. | Read-bad: A new dataset and evaluation scheme for baseline detection in archival documents | |
RU2679209C2 (en) | Processing of electronic documents for invoices recognition | |
CN108717543B (en) | Invoice identification method and device and computer storage medium | |
US9396404B2 (en) | Robust industrial optical character recognition | |
CN103914680B (en) | A kind of spray printing character picture identification and check system and method | |
CN104217203A (en) | Complex background card face information identification method and system | |
CN113963147B (en) | Key information extraction method and system based on semantic segmentation | |
CN105809205A (en) | Classification method and system for hyperspectral images | |
JP2023536174A (en) | OCR-based document analysis system and method using virtual cells | |
JP3078318B2 (en) | Character recognition method and apparatus including locating and extracting predetermined data from a document | |
CN114511866A (en) | Data auditing method, device, system, processor and machine-readable storage medium | |
CN113469005A (en) | Recognition method of bank receipt, related device and storage medium | |
US20220036063A1 (en) | Document information extraction for computer manipulation | |
CN117037198A (en) | Bank statement identification method | |
CN111428725A (en) | Data structuring processing method and device and electronic equipment | |
Tran et al. | A novel approach for text detection in images using structural features | |
Shweka et al. | Automatic extraction of catalog data from digital images of historical manuscripts | |
Ai et al. | Geometry preserving active polygon-incorporated sign detection algorithm | |
Fiel et al. | Writer identification on historical Glagolitic documents | |
CN116343237A (en) | Bill identification method based on deep learning and knowledge graph | |
Nehra et al. | Benchmarking of text segmentation in devnagari handwritten document | |
Bogahawatte et al. | Online Digital Cheque Clearance and Verification System using Block Chain |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20181211 |