CN110427853B - Intelligent bill information extraction processing method - Google Patents

Info

Publication number
CN110427853B
Authority
CN
China
Prior art keywords
bill
information
keyword
identification
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910672641.4A
Other languages
Chinese (zh)
Other versions
CN110427853A (en
Inventor
郭其超
毅力奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yinuo Prospect Finance And Taxation Technology Co ltd
Original Assignee
Beijing Yinuo Prospect Finance And Taxation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yinuo Prospect Finance And Taxation Technology Co ltd
Priority to CN201910672641.4A
Publication of CN110427853A
Application granted
Publication of CN110427853B
Active legal status
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G06V30/418Document matching, e.g. of document images

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Character Discrimination (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses an intelligent bill information extraction processing method, which comprises the following steps: shooting a bill to generate an electronic bill picture; determining an identification range by identifying the maximum range occupied by the bill in the electronic bill picture; defining an identification sequence according to the bill typesetting rule, and sequencing the identification results; identifying bill information by scanning the effective electronic area according to the identification sequence, judging the current bill type, extracting the bill type data, and at the same time acquiring the keyword classification information contained in the current bill type; defining an information extraction rule according to the bill type; and implementing the information extraction operation by searching for key information in the bill picture according to the information extraction rule. The scheme extracts the structured information of the bill by dynamically defining spatial relationships, and can handle line breaks, truncation and vertical text in the bill well.

Description

Intelligent bill information extraction processing method
Technical Field
The embodiment of the invention relates to the technical field of picture information identification, in particular to an intelligent bill information extraction processing method.
Background
A bill (negotiable instrument) is a written voucher reflecting a creditor-debtor relationship. Bills arise in market exchange and circulation and reflect the creditor-debtor relationship between the parties. Specifically, in an exchange of property (merchandise, currency and other property rights), each party holds certain rights and obligations with respect to the property, that is, a creditor-debtor relationship arises, and this relationship needs to be recorded and evidenced in writing to ensure that both parties can exercise their rights and fulfil their obligations. Bills are produced on this basis. Without a genuine creditor-debtor relationship, there is no bill. Being a written voucher that reflects a creditor-debtor relationship is therefore one of the basic properties of a bill.
As enterprises grow and their transaction flow increases, the number of enterprise bills also increases, including bills issued by suppliers and bills submitted for reimbursement by employees on business trips. Faced with thousands of enterprise bills, how to identify them quickly, efficiently and automatically by computer has become a focus of attention.
However, the existing bill processing mode has the following defect: bill information is extracted with a fixed template, so line breaks, truncation, vertical text and similar conditions in the bill cannot be handled, and the extracted information differs considerably from the actual content.
Disclosure of Invention
Therefore, the embodiment of the invention provides an intelligent bill information extraction processing method, which identifies bill pictures by scanning from top to bottom and from left to right and extracts structured bill information by dynamically defining spatial relationships. The method can handle line breaks, truncation and vertical text in bills well, and solves the problem that the prior art has difficulty extracting complex bill information correctly.
In order to achieve the above object, an embodiment of the present invention provides the following: an intelligent bill information extraction processing method, comprising the following steps:
step 100, shooting a bill to generate a bill electronic picture;
step 200, determining an identification range, and identifying the maximum range occupied by the electronic bill picture;
step 300, defining an identification sequence, defining the identification sequence according to the bill typesetting rule, and sequencing the identification result;
step 400, identifying bill information, scanning effective electronic areas according to an identification sequence, judging the current bill type, extracting bill type data, and simultaneously acquiring keyword classification information contained in the current bill type;
step 500, defining an information extraction rule, and specifying the information extraction rule according to the bill type;
and step 600, implementing the information extraction operation, identifying the bill information, and searching for key information in the bill picture according to the information extraction rule.
As a preferred embodiment of the present invention, in step 200, the specific steps of determining the identification range are:
step 201, determining the position of the row where the bill header is located and the position of the row where the bill ends, according to the bill typesetting sequence from top to bottom;
step 202, determining the position of the column where the leftmost side of the bill is located and the position of the column where the rightmost side of the bill is located, according to the bill typesetting sequence from left to right;
step 203, determining the area occupied by the bill in the electronic bill picture according to the row and column positions from step 201 and step 202, and cutting the electronic bill picture along those row and column positions to generate a bill information graph that is convenient to identify;
and step 204, establishing a rectangular coordinate system with the row-column intersection at the upper left corner as the origin, and rotationally correcting the bill information graph in the rectangular coordinate system.
In step 201 and step 202, the header row position is 0-2 depth units from the bill header information, the end row position is 0-2 depth units from the bill tail information, the leftmost column position is 0-2 width units from the leftmost information of the bill, and the rightmost column position is 0-2 width units from the rightmost information of the bill.
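Steps 201 to 203 can be sketched as follows. This is a minimal illustration rather than the patented implementation: it assumes the picture has already been binarized into a grid where 1 marks ink and 0 marks blank, and the names `identification_range`, `crop`, `char_h`, `char_w` and `margin_units` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, bottom, left, right), inclusive


def identification_range(ink: List[List[int]], char_h: int, char_w: int,
                         margin_units: int = 1) -> Box:
    """Find the rows/columns bounding the bill content, expanded by a margin.

    Assumption: one depth unit = char_h pixels, one width unit = char_w pixels,
    and the margin is the same number of units (0-2) on all four sides.
    """
    if not 0 <= margin_units <= 2:
        raise ValueError("margin_units is expected to lie in the range 0-2")
    height, width = len(ink), len(ink[0])
    # Rows and columns that contain at least one ink pixel.
    rows = [r for r, row in enumerate(ink) if any(row)]
    cols = [c for c in range(width) if any(row[c] for row in ink)]
    if not rows:
        raise ValueError("no bill content found in the picture")
    top = max(0, rows[0] - margin_units * char_h)
    bottom = min(height - 1, rows[-1] + margin_units * char_h)
    left = max(0, cols[0] - margin_units * char_w)
    right = min(width - 1, cols[-1] + margin_units * char_w)
    return top, bottom, left, right


def crop(ink: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the picture along the row and column positions in box."""
    top, bottom, left, right = box
    return [row[left:right + 1] for row in ink[top:bottom + 1]]
```

Clamping to the picture borders keeps the expanded region valid when the content already touches an edge.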
As a preferred scheme of the present invention, in step 204, the specific steps of correcting the position of the bill picture are:
determining the included angle between the row where the bill header is located and the X axis of the rectangular coordinate system;
and rotating the whole bill information graph about the origin of the rectangular coordinate system until the edge rows and columns of the bill information graph coincide with the coordinate axes of the rectangular coordinate system.
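The rotation correction can be illustrated with plain coordinate geometry. The sketch below assumes the header row is described by two points on it and that the graph is represented by a list of points; `header_angle` and `rotate_about_origin` are hypothetical names, and a real system would rotate the pixel data with an image library instead.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def header_angle(p0: Point, p1: Point) -> float:
    """Angle in radians between the header row (p0 -> p1) and the X axis."""
    return math.atan2(p1[1] - p0[1], p1[0] - p0[0])


def rotate_about_origin(points: List[Point], angle: float) -> List[Point]:
    """Rotate points by -angle about the origin, undoing the measured skew."""
    c, s = math.cos(-angle), math.sin(-angle)
    return [(x * c - y * s, x * s + y * c) for x, y in points]
```

After the rotation, two points on the header row share the same Y coordinate, so the row is parallel to the X axis.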
As a preferred scheme of the present invention, in step 300, the identification sequence is specifically a top-to-bottom, left-to-right sequence over the bill information: rows are identified from top to bottom, and within each row the bill information is identified from left to right.
As a preferred embodiment of the present invention, in step 400, the specific steps of identifying the ticket information are:
step 401, determining the scanning recognition depth and the scanning recognition width for the electronic bill map, and defining the area given by the scanning recognition depth and the scanning recognition width as a scanning recognition unit;
step 402, scanning the bill information graph with the scanning recognition unit, starting from the origin of the rectangular coordinate system, either from top to bottom and then from left to right, or from left to right and then from top to bottom;
step 403, splicing the scanning results into one picture, identifying the character information of the spliced picture in real time, comparing the identified information with the existing bill types in real time, and determining the type of the current bill;
step 404, according to the determined bill type, confirming the keyword classification contained in the current bill type, while the scanning recognition unit keeps working; when the keyword information on the splicing map is complete, cutting the complete keyword and the information within the keyword range out of the splicing map and storing them as an information graph, with incomplete character information kept on the original splicing map;
and step 405, the scanning recognition unit continues working and step 404 is repeated until the whole bill information graph has been scanned.
As a preferred scheme of the present invention, in step 403, after complete bill type information appears on the splicing map, the bill type information is cut out and stored, and the specific steps are as follows:
selecting, on the splicing map, the position adjacent to the leftmost character and the lowermost character of the bill type information as absolute anchor point A;
selecting, on the splicing map, the position adjacent to the rightmost character and the topmost character of the bill type information as absolute anchor point B;
setting a rectangular wire frame between absolute anchor point A and absolute anchor point B to delineate the bill type information content;
offsetting absolute anchor point A and absolute anchor point B in the left and right directions respectively to obtain offset anchor point A1 and offset anchor point B1, and delineating the bill type information content again;
and calculating the useful information of each delineated range, and determining the keyword content from that useful information.
As a preferred scheme of the present invention, in step 404, the specific step of cutting the ticket information keyword is as follows:
determining complete keyword classification contained in information of one line of the splicing diagram, and taking the leftmost side of each complete keyword as a cutting demarcation point;
taking the distance between the lower end of each keyword and the upper end of the corresponding keyword range as cutting depth, taking the total length of the keyword range and the keywords as cutting width, and performing rectangular cutting on each keyword and the keyword range to obtain an information graph for storage;
each cutting information graph is used for storing a keyword and corresponding keyword information.
As a preferred embodiment of the present invention, when determining the cutting depth, a position 1 to 2 depth units below the lower end of the keyword is used as the lower boundary, a position 0 to 1 depth unit above the upper end of the keyword range is used as the upper boundary, and the distance between the upper boundary and the lower boundary is used as the cutting depth.
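The cutting depth rule reduces to a small calculation. The sketch below assumes image coordinates in which Y grows downward, so the upper boundary has the smaller Y value; the function name and parameters are hypothetical.

```python
from typing import Tuple


def cutting_depth(keyword_bottom: float, range_top: float, depth_unit: float,
                  lower_units: float = 1.0,
                  upper_units: float = 0.0) -> Tuple[float, float, float]:
    """Return (upper boundary, lower boundary, cutting depth).

    Assumption: Y grows downward, so margins are added below the keyword
    and subtracted above the keyword range.
    """
    if not 1 <= lower_units <= 2:
        raise ValueError("lower margin is expected to be 1-2 depth units")
    if not 0 <= upper_units <= 1:
        raise ValueError("upper margin is expected to be 0-1 depth unit")
    upper = range_top - upper_units * depth_unit
    lower = keyword_bottom + lower_units * depth_unit
    return upper, lower, lower - upper
```

For example, with a 20-pixel depth unit, a keyword whose lower end is at Y = 120 and a keyword range whose upper end is at Y = 100, margins of 1 and 0.5 units give boundaries at 90 and 140.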
As a preferred embodiment of the present invention, in step 600, the specific steps of implementing the information extraction operation are:
step 601, converting all the information graphs from step 400 into computer-recognizable characters by adopting an attention-based image-to-text model from deep learning, and generating a corresponding keyword information set, wherein each keyword information element in the set is represented as "category: content";
step 602, extracting the required keyword category under the current bill type according to the information extraction rule defined in step 500;
step 603, outputting and displaying the extracted keyword category and the keyword content.
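Steps 602 and 603 can be sketched once step 601 has produced text. The example below skips the image-to-text model entirely and starts from already recognized "category: content" strings; the rule table `EXTRACTION_RULES`, its categories and the sample values are hypothetical placeholders, not data from the patent.

```python
from typing import Dict, List

# Hypothetical extraction rules: bill type -> keyword categories to extract.
EXTRACTION_RULES: Dict[str, List[str]] = {
    "transfer check": ["payee", "amount"],
}


def parse_keyword_info(elements: List[str]) -> Dict[str, str]:
    """Turn 'category: content' elements into a category -> content mapping."""
    info: Dict[str, str] = {}
    for element in elements:
        category, sep, content = element.partition(":")
        if sep:  # skip elements that contain no separator
            info[category.strip()] = content.strip()
    return info


def extract(bill_type: str, elements: List[str]) -> Dict[str, str]:
    """Keep only the categories required by the rule for this bill type."""
    info = parse_keyword_info(elements)
    wanted = EXTRACTION_RULES.get(bill_type, [])
    return {category: info[category] for category in wanted if category in info}
```

Keeping the rules in a table keyed by bill type means a new bill type only needs a new entry, not new code.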
The embodiment of the invention has the following advantages:
(1) When the identification range is determined, the bill picture is cut to remove the blank marginal areas, and the angle of the whole bill picture is corrected, so that the bill picture is displayed normally and can be filed separately as bill evidence;
(2) When the bill information is extracted, the bill picture is scanned from top to bottom and from left to right so that the bill information is scanned completely, and the structured bill information is extracted by dynamically defining spatial relationships, which handles line breaks, truncation and vertical text in the bill well; the bill information can thus be extracted accurately for later processing, errors, omissions and mismatches in the bill information are prevented, and the stability and precision of intelligent bill information identification are improved;
(3) The invention only requires the bill picture to be input manually; the cutting of the bill and the extraction of the keywords and keyword ranges are carried out by intelligent identification technology, so operation is convenient, little manual intervention is needed, and implementation is simple, which facilitates the repeated entry and organization of large amounts of bill information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.
The structures, proportions, sizes and the like shown in this specification are only used to accompany the contents disclosed in the specification so that those skilled in the art can understand and read them; they are not intended to limit the conditions under which the present invention can be implemented and therefore have no substantive technical significance. Any structural modification, change in proportion or adjustment of size that does not affect the effects and objectives achievable by the present invention should still fall within the scope covered by the technical contents disclosed in the present invention.
Fig. 1 is a schematic flow chart of a bill information extraction processing method according to an embodiment of the present invention;
Detailed Description
The present invention is described below by way of particular embodiments, and other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure. It is to be understood that the described embodiments are merely exemplary of the invention and are not intended to limit the invention to the particular embodiments disclosed. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
As shown in fig. 1, the invention provides an intelligent bill information extraction processing method, which adopts an intelligent identification mode to replace manual information input, effectively avoids manual errors, and improves the processing efficiency of bill information.
When the intelligent recognition technology is used for processing the bill information, the invention is mainly characterized in that:
(1) When the identification range is determined, the bill picture is cut to remove the blank marginal areas, and the angle of the whole bill picture is corrected, so that the bill picture is displayed normally in a rectangular coordinate system and can be filed separately as bill evidence;
(2) When the bill information is extracted, the bill picture is scanned from top to bottom and from left to right, and the bill type and the information to be extracted are classified individually and cut out of the scanned picture; the bill information can thus be extracted accurately for later processing, errors, omissions and mismatches in the bill information are prevented, and the stability and precision of intelligent bill information identification are improved;
(3) The invention only requires the bill picture to be input manually; the cutting of the bill and the extraction of the keywords and keyword ranges are carried out by intelligent identification technology, so operation is convenient, little manual intervention is needed, and implementation is simple, which facilitates the repeated entry and organization of large amounts of bill information.
The method specifically comprises the following steps:
Step 100, shooting the bill to generate an electronic bill picture.
When the bill is shot, attention must be paid to the sharpness and brightness of the electronic bill picture, so that poor picture clarity does not affect the later reading of the bill information.
Step 200, determining an identification range, and identifying the maximum range occupied by the bill in the electronic bill picture.
The purpose of this step is to cut away the blank marginal area of the electronic bill picture and reduce redundant scanning and recognition work later, which reduces the complexity of processing and extracting the bill information and improves the efficiency of bill information identification.
In step 200, the specific steps of determining the identification range are as follows:
step 201, determining the position of the row where the bill header is located and the position of the row where the bill ends, according to the bill typesetting sequence from top to bottom;
step 202, determining the position of the column where the leftmost side of the bill is located and the position of the column where the rightmost side of the bill is located, according to the bill typesetting sequence from left to right;
step 203, determining the area occupied by the bill in the electronic bill picture according to the row and column positions from step 201 and step 202, and cutting the electronic bill picture along those row and column positions to generate a bill information graph that is convenient to identify.
The three steps above delimit the whole content of the bill information through the top, bottom, left and right row and column positions, and cut away the blank area of the electronic bill picture.
In addition, when the area occupied by the bill in the electronic bill picture is set, the cutting rows and columns are respectively parallel to the peripheral edges of the bill, which ensures that the bill information graph is cut as a rectangle and makes it easy to determine the angular relationship between the bill information graph and the rectangular coordinate system.
The header row position is 0-2 depth units from the bill header information, the end row position is 0-2 depth units from the bill tail information, the leftmost column position is 0-2 width units from the leftmost information of the bill, and the rightmost column position is 0-2 width units from the rightmost information of the bill.
The depth unit and the width unit of the invention are related to the size of the text characters of the bill information. Generally, the height of one text character is taken as one depth unit and the width of one text character is taken as one width unit; the character size can be determined from the font of each bill's content, or a universal character size can be defined directly.
When the bill information graph is cut out, the cutting area is not exactly equal to the area occupied by the bill information content but is expanded outward relative to it, which improves fault tolerance and prevents bill information from being lost.
Step 204, establishing a rectangular coordinate system with the row-column intersection at the upper left corner as the origin, and rotationally correcting the bill information graph in the rectangular coordinate system.
The purpose of this step is to correct the orientation of the bill information graph and ensure that it is spatially aligned with the rectangular coordinate system.
The specific steps for correcting the position of the bill picture are as follows:
determining the included angle between the row where the bill header is located and the X axis of the rectangular coordinate system;
and rotating the whole bill information graph about the origin of the rectangular coordinate system until the edge rows and columns of the bill information graph coincide with the coordinate axes of the rectangular coordinate system.
Therefore, as one of the main characteristic points of the invention, when the identification range is determined, the bill picture is cut to remove the blank marginal areas and the angle of the whole bill picture is corrected, so that the bill picture is displayed normally in the rectangular coordinate system and can be filed separately as bill evidence.
Step 300, defining an identification sequence, defining the identification sequence according to the bill typesetting rule, and sequencing the identification result.
The identification sequence is specifically a top-to-bottom, left-to-right sequence over the bill information: rows are identified from top to bottom, and within each row the bill information is identified from left to right.
That is, when the cut bill information graph is identified, the whole graph is covered in order from top to bottom and from left to right, so that no information is missed or misread; at the same time, the bill information can be matched and extracted promptly according to the identification sequence, which avoids mismatches when information is extracted.
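Sequencing identification results in this order can be sketched as a reading-order sort. The example assumes each result is a tuple of (x, y, text) giving the top-left corner of a recognized fragment, and that fragments whose Y values differ by less than half a depth unit belong to the same row; both the representation and the tolerance are assumptions made for illustration.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text), top-left corner


def reading_order(fragments: List[Fragment], depth_unit: float) -> List[Fragment]:
    """Order fragments top to bottom, then left to right within each row."""
    rows: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        # Assumed tolerance: same row if within half a depth unit of the
        # first fragment of the current row.
        if rows and abs(frag[1] - rows[-1][0][1]) < depth_unit / 2:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [f for row in rows for f in sorted(row, key=lambda f: f[0])]
```

Grouping into rows before sorting by X keeps slightly uneven baselines from interleaving fragments of neighbouring rows.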
Step 400, identifying the bill information, scanning the effective electronic area according to the identification sequence, judging the current bill type, extracting the bill type data, and simultaneously acquiring the keyword classification information contained in the current bill type.
This step has two parts: judging the bill type and acquiring the bill information graphs. The bill types comprise three categories: drafts, promissory notes and checks. Drafts are divided into bank drafts and commercial drafts; promissory notes are divided into commercial promissory notes and bank promissory notes; checks are divided into registered checks, unregistered checks, crossed checks, cash checks and transfer checks. Different bill types therefore contain different key information.
Therefore, in step 400, the specific steps of identifying the ticket information are:
First, the scanning recognition depth and the scanning recognition width for the electronic bill map are determined, and the area defined by the scanning recognition depth and the scanning recognition width is taken as one scanning recognition unit.
The scanning recognition depth and the scanning recognition width are determined in proportion to the depth unit and the width unit. Generally, scanning recognition depth = K · depth unit and scanning recognition width = M · width unit, where K, M ≥ 1. The scanning recognition depth and width together form a scanning recognition unit, and moving this unit over the electronic bill map traverses all of its information.
Then, the scanning recognition unit scans the bill information graph starting from the origin of the rectangular coordinate system, in order from top to bottom and from left to right.
When the bill information graph is scanned, it can first be divided into several rows according to the scanning recognition depth and each row scanned from left to right, or first be divided into several columns according to the scanning recognition width and each column scanned from top to bottom, which effectively accommodates line breaks, truncation and vertical text in the bill.
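The two scan orders can be sketched as a generator of scanning recognition units. This is an illustration under stated assumptions: K and M are taken to be integers, units are measured in pixels, and the names `scan_windows` and `row_first` are hypothetical.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height)


def scan_windows(height: int, width: int, depth_unit: int, width_unit: int,
                 k: int = 1, m: int = 1, row_first: bool = True) -> Iterator[Window]:
    """Yield scanning recognition units that cover the whole graph.

    Unit size: depth = k * depth_unit, width = m * width_unit, with k, m >= 1.
    row_first=True scans each row left to right; False scans each column
    top to bottom.
    """
    if k < 1 or m < 1:
        raise ValueError("k and m must be at least 1")
    step_y, step_x = k * depth_unit, m * width_unit
    ys, xs = range(0, height, step_y), range(0, width, step_x)
    if row_first:
        positions = ((y, x) for y in ys for x in xs)
    else:
        positions = ((y, x) for x in xs for y in ys)
    for y, x in positions:
        # Units on the right and bottom edges are clipped to the graph.
        yield x, y, min(step_x, width - x), min(step_y, height - y)
```

Column-first scanning is the order that suits vertical text, since each column is read from top to bottom before the unit moves right.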
Next, the scanning results are spliced, the character information of the splicing map is identified in real time, and the identified information is compared in real time with the existing bill types to determine the type of the current bill.
Each time the scanning recognition unit advances by one width unit, the information it has identified is spliced; likewise, each time the scanning recognition unit advances by one depth unit, the identified information is spliced.
The bill types stored in the existing bill type database are combinations of a bank name and a bill classification. By comparing the bill type information with the existing bill type database, the type of the current bill can be determined, with the combination of bank and bill classification serving as the valid bill category information.
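The comparison against the bill type database can be sketched as a lookup over (bank name, bill classification) pairs. The database contents below are invented placeholders, and substring matching on recognized text is an assumption; the patent does not specify how the comparison is performed.

```python
from typing import List, Optional, Tuple

BillType = Tuple[str, str]  # (bank name, bill classification)


def match_bill_type(recognized_text: str,
                    database: List[BillType]) -> Optional[BillType]:
    """Return the first database entry whose bank and classification both
    appear in the recognized text, or None when no combination matches."""
    text = recognized_text.lower()
    for bank, classification in database:
        if bank.lower() in text and classification.lower() in text:
            return bank, classification
    return None
```

A `None` result corresponds to the case where the database lacks the current combination of bank and classification.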
By using the top-to-bottom, left-to-right identification sequence, the method extracts the structured bill information by dynamically defining spatial relationships and can handle line breaks, truncation and vertical text in the bill well. Once the bill type is determined, the basic information contained in the bill (in particular a check, draft or promissory note) can be determined, including the payee, the remitter, the remittance time and so on, which determines the keywords of the bill information to be extracted and facilitates the subsequent scanning and recognition.
The specific steps of cutting, extracting and storing the information of the bill types are as follows:
selecting, on the splicing map, the position adjacent to the leftmost character and the lowermost character of the bill type information as absolute anchor point A;
selecting, on the splicing map, the position adjacent to the rightmost character and the topmost character of the bill type information as absolute anchor point B;
and setting a rectangular wire frame between absolute anchor point A and absolute anchor point B to delineate the bill type information content.
The above steps are carried out by comparing the current bill type information with the existing bill type database to determine the classification of the current bill (promissory note, draft or check). However, because there are many banks, if the existing bill type database is not updated in time, part of the bill type information is lost when it is cut out and extracted.
To avoid losing bill type information, a further operation is carried out: absolute anchor point A and absolute anchor point B are offset in the left and right directions respectively to obtain offset anchor point A1 and offset anchor point B1, and the bill type information content is delineated again;
this operation is repeated, the useful information of each delineated range is calculated, and the complete bill type information is determined.
That is, the existing bill type database may not have been updated in time for the current bill type. For example, if the bill is a transfer check of the "wide bank" and the database has not been updated with the combination of "wide bank" and "transfer check", only "transfer check" is delineated and "wide bank" is not. By extending the delineated cutting area a second time, the invention can still determine the complete bill type information.
The bill type information of "issuing bank" and "transfer check" will be specifically exemplified below:
firstly, comparing the existing database, using an absolute anchor point A and an absolute anchor point B to define the 'transfer check' information, and expanding the absolute anchor point A and the absolute anchor point B due to the lack of bank information;
and then, offsetting the absolute anchor point A and the absolute anchor point B along the left and right directions respectively to obtain an offset anchor point A1 and an offset anchor point B1, wherein the offset of the absolute anchor point A and the absolute anchor point B is related to the size of the character of the transfer check, and the offset is generally selected to be 1-1.5 times of the size of the character of the transfer check.
And finally, redefining the bill type information content to obtain complete bill type information.
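The anchor expansion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Glyph` record, the `expand_type_region` helper, and the stopping rule (stop once a further shift captures no additional characters) are hypothetical, since the text does not specify how "useful information" is calculated. Only the horizontal coordinates of the anchors are modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One recognized character and its horizontal extent (hypothetical record)."""
    text: str
    left: float
    right: float


def text_in_range(glyphs, x_min, x_max):
    """Concatenate the glyphs that lie fully inside [x_min, x_max]."""
    return "".join(g.text for g in glyphs if g.left >= x_min and g.right <= x_max)


def expand_type_region(glyphs, a_x, b_x, char_size, factor=1.0, max_rounds=10):
    """Shift anchor A left and anchor B right by factor * char_size per round.

    Assumption: "useful information" is approximated by the number of
    characters captured; expansion stops when a round adds nothing new.
    """
    if not 1.0 <= factor <= 1.5:
        raise ValueError("the text suggests an offset of 1-1.5 character sizes")
    step = factor * char_size
    best = text_in_range(glyphs, a_x, b_x)
    for _ in range(max_rounds):
        a1, b1 = a_x - step, b_x + step          # offset anchors A1 and B1
        candidate = text_in_range(glyphs, a1, b1)
        if len(candidate) <= len(best):          # no new information captured
            break
        a_x, b_x, best = a1, b1, candidate
    return a_x, b_x, best


# Usage: two bank-name characters precede the two bill-type characters.
glyphs = [Glyph("W", 0, 10), Glyph("B", 10, 20), Glyph("T", 20, 30), Glyph("C", 30, 40)]
a1, b1, full_text = expand_type_region(glyphs, a_x=20, b_x=40, char_size=10)
```

In this toy layout the initial range covers only the two bill-type characters, and two rounds of shifting are needed before the bank-name characters to the left are included.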
Thus, as the third main feature of the invention, extending the delineation and cutting of the bill type ensures that complete bill type information is extracted. This allows the bill information to be extracted accurately for later processing, prevents it from being misread or mismatched, improves the stability and precision of intelligent bill information recognition, and facilitates the subsequent extraction of keyword information.
After the bill type information is determined, the content of the bill must also be extracted. The keyword classifications contained in the current bill type are therefore confirmed according to the determined bill type, and the scanning identification unit continues to work. When the keyword information on the splicing map is complete, the complete keyword and the information within its keyword range are cut from the splicing map and stored as an information graph, while incomplete text information remains on the original splicing map.
The specific steps for cutting the bill information keywords are as follows:
the complete keyword classifications contained in one line of information on the splicing map are determined, and the leftmost side of each complete keyword is taken as a cutting boundary point.
Because the scanning identification unit scans from top to bottom and from left to right, a complete line of bill content may contain two or more keywords; taking the leftmost side of each complete keyword as a cutting boundary point therefore allows the keyword content to be cut apart.
The distance between the lower end of each keyword and the upper end of the corresponding keyword range is taken as the cutting depth, the total length of the keyword and its keyword range is taken as the cutting width, and each keyword together with its keyword range is cut out as a rectangle and stored as an information graph.
When the cutting depth is determined, a position 1-2 depth units below the lower end of the keyword is taken as the lower boundary, a position 0-1 depth unit above the upper end of the keyword range is taken as the upper boundary, and the distance between the upper and lower boundaries is taken as the cutting depth, which avoids incomplete capture of information.
Each cut information graph stores one keyword and its corresponding keyword information.
The above steps complete the extraction of the bill content.
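The rectangular cut described above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Box` record and the `cut_rectangle` helper are hypothetical names, image coordinates are assumed to grow downward, and the margins are assumed to extend outward from the text (below the keyword and above the keyword range).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in image coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def depth(self) -> float:
        return self.bottom - self.top

    @property
    def width(self) -> float:
        return self.right - self.left


def cut_rectangle(keyword, value_range, depth_unit, lower_units=1.0, upper_units=0.0):
    """Return the rectangle to cut for one keyword and its keyword range.

    lower_units: margin below the keyword's lower end (1-2 depth units).
    upper_units: margin above the keyword range's upper end (0-1 depth unit).
    """
    if not 1.0 <= lower_units <= 2.0:
        raise ValueError("lower margin should be 1-2 depth units")
    if not 0.0 <= upper_units <= 1.0:
        raise ValueError("upper margin should be 0-1 depth unit")
    return Box(
        left=keyword.left,                            # cutting boundary point
        top=value_range.top - upper_units * depth_unit,
        right=max(keyword.right, value_range.right),  # keyword plus its range
        bottom=keyword.bottom + lower_units * depth_unit,
    )


# Usage: one keyword followed by its keyword range on the same line.
keyword = Box(left=10, top=100, right=50, bottom=120)
value_range = Box(left=50, top=98, right=200, bottom=122)
cut = cut_rectangle(keyword, value_range, depth_unit=4, lower_units=1, upper_units=1)
```

For a line holding several keywords, the same helper would be called once per keyword, so that each resulting information graph stores exactly one keyword and its content.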
Step 500: define an information extraction rule, formulating the information extraction rule according to the bill type.
Because the content extracted in step 400 is all of the bill information containing keywords, further filtering is needed to screen out the useful content; defining the information extraction rule for each bill type determines the screening conditions.
Step 600: perform the information extraction operation, recognize the bill information, and search for key information in the bill picture according to the information extraction rule.
The specific steps for implementing the information extraction operation are as follows:
step 601, using an attention-based image-to-text model from deep learning, converting all the information graphs from step 400 into computer-recognizable text to generate a corresponding keyword information set, wherein each keyword information element in the keyword information set is represented as "category: content";
step 602, extracting the required keyword category under the current bill type according to the information extraction rule defined in step 500;
and step 603, outputting and displaying the extracted keyword category and the keyword content.
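Steps 602 and 603 can be sketched as follows, assuming step 601 has already produced the "category: content" strings. The rule table, the category names, and the helper names are hypothetical and serve only to illustrate rule-based filtering; the attention-based recognition model itself is not shown.

```python
# Hypothetical extraction rules (step 500): required categories per bill type.
EXTRACTION_RULES = {
    "transfer check": ["payee", "amount", "date"],
}


def parse_element(element):
    """Split one 'category: content' element at the first colon."""
    category, _, content = element.partition(":")
    return category.strip(), content.strip()


def extract_key_information(bill_type, elements, rules=EXTRACTION_RULES):
    """Keep only the categories that the rule for this bill type requires."""
    wanted = rules.get(bill_type, [])
    parsed = dict(parse_element(e) for e in elements)
    return {category: parsed[category] for category in wanted if category in parsed}


# Usage: the keyword information set as it might come out of step 601.
keyword_information_set = ["payee: ACME Ltd", "amount: 1,000.00", "memo: rent"]
result = extract_key_information("transfer check", keyword_information_set)
for category, content in result.items():   # step 603: output and display
    print(f"{category}: {content}")
```

Splitting at the first colon only keeps content that itself contains a colon, such as a time of day, intact.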
The invention requires only that the bill picture be input manually; cutting the bill and extracting the keywords and keyword ranges are handled by intelligent recognition technology. The method is convenient to operate, requires little manual intervention, and is simple to implement, which eases the repetitive entry and organization of large volumes of bill information.
Although the invention has been described in detail with respect to the general description and the specific embodiments, it will be apparent to those skilled in the art that modifications and improvements may be made based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims (8)

1. A method for extracting and processing intelligent bill information is characterized by comprising the following steps:
step 100, shooting a bill to generate a bill electronic picture;
step 200, determining an identification range, and identifying the maximum range occupied by the electronic bill picture;
step 300, defining an identification sequence, defining the identification sequence according to the bill typesetting rule, and sequencing the identification result;
step 400, identifying bill information, scanning effective electronic areas according to an identification sequence, judging the current bill type, extracting bill type data, and simultaneously acquiring keyword classification information contained in the current bill type;
in step 400, the specific steps of identifying the bill information are:
step 401, determining the scanning identification depth and the scanning identification width of the electronic bill map, and defining the area of the scanning identification depth and the scanning identification width as a scanning identification unit;
step 402, scanning the bill information graph by the scanning recognition unit from the origin of the rectangular coordinate system according to the sequence from top to bottom and then from left to right, or the sequence from left to right and then from top to bottom;
step 403, splicing the scanning results into a picture, identifying the character information of the spliced picture in real time, comparing the identification information with the type of the existing bill in real time, and determining the type of the current bill;
step 404, according to the determined bill type, confirming the keyword classification contained in the current bill type, continuously working by the scanning and identifying unit, and when the keyword information of the splicing map is complete, cutting the complete keyword and the information in the keyword range from the splicing map into an information map for storage, wherein incomplete text information is reserved on the original splicing map;
step 405, the scanning identification unit continues working, and step 404 is repeated until the whole bill information image is scanned;
in step 403, when the complete bill type appears in the spliced graph, the bill type information is cut and stored, and the specific steps are as follows:
selecting the position adjacent to the leftmost character and the lowermost character of the bill type information as absolute anchor point A on the splicing map;
selecting the position adjacent to the rightmost character and the topmost character of the bill type information as absolute anchor point B on the splicing map;
setting a rectangular wire frame between the absolute anchor point A and the absolute anchor point B, and delineating the bill type information content;
offsetting the absolute anchor point A and the absolute anchor point B leftward and rightward, respectively, to obtain an offset anchor point A1 and an offset anchor point B1, and delineating the bill type information content again;
calculating the useful information of each delineated range, and determining the keyword content using the useful information;
step 500, defining an information extraction rule, and formulating the information extraction rule according to the bill type;
and step 600, implementing information extraction operation, identifying the bill information, and searching key information in the bill picture according to the information extraction rule.
2. The method for extracting and processing intelligent bill information according to claim 1, wherein in step 200, the specific steps of determining the identification range are:
step 201, determining the position of a bill head-up line and the position of a bill tail line according to the order of bill typesetting from top to bottom;
step 202, determining the position of the column where the leftmost bill is located and the position of the column where the rightmost bill is located according to the bill typesetting sequence from left to right;
step 203, determining the area occupied by the bill electronic picture according to the row-column relationship of the step 201 and the step 202, and cutting the bill electronic picture along the row-column position to generate a bill information graph convenient to identify;
and step 204, establishing a rectangular coordinate system by taking the row-column intersection position of the upper left corner as an origin, and rotationally correcting the bill information graph in the rectangular coordinate system.
3. The method for extracting and processing intelligent bill information according to claim 2, wherein: in step 201 and step 202, the position of the row where the bill head-up is located is 0-2 depth units from the bill head-up information, the position of the row where the bill tail is located is 0-2 depth units from the bill tail information, the position of the column where the leftmost side of the bill is located is 0-2 width units from the leftmost bill information, and the position of the column where the rightmost side of the bill is located is 0-2 width units from the rightmost bill information.
4. The method for extracting and processing intelligent bill information according to claim 2, wherein in step 204, the specific steps of correcting the bill picture position are:
determining an included angle between a line where the bill head-up is located and an X axis of a rectangular coordinate system;
and rotating the whole bill information graph about the origin of the rectangular coordinate system until the edge rows and columns of the bill information graph coincide with the coordinate axes of the rectangular coordinate system.
5. The method of claim 1, wherein in step 300, the identification sequence proceeds through the rows from top to bottom and, within each row, identifies the bill information from left to right.
6. The method for extracting and processing intelligent bill information according to claim 1, wherein in step 404, the specific steps of cutting the bill information keywords are:
determining complete keyword classification contained in one line of information of the splicing map, and taking the leftmost side of each complete keyword as a cutting boundary point;
taking the distance between the lower end of each keyword and the upper end of the corresponding keyword range as a cutting depth, taking the total length of the keyword range and the keywords as a cutting width, and performing rectangular cutting on each keyword and the keyword range to obtain an information graph for storage;
each cutting information graph is used for storing a keyword and corresponding keyword information.
7. The method for extracting and processing the intelligent bill information according to claim 6, wherein: when the cutting depth is determined, 1-2 depth units at the lower end of the keyword are used as a lower boundary, 0-1 depth unit at the upper end of the keyword range is used as an upper boundary, and the distance between the upper boundary and the lower boundary is used as the cutting depth.
8. The method for extracting and processing intelligent bill information according to claim 1, wherein in step 600, the specific steps for implementing the information extraction operation are as follows:
step 601, using an attention-based image-to-text model from deep learning, converting all the information graphs from step 400 into computer-recognizable text to generate a corresponding keyword information set, wherein each keyword information element in the keyword information set is represented as "category: content";
step 602, extracting the required keyword category under the current bill type according to the information extraction rule defined in step 500;
and step 603, outputting and displaying the extracted keyword category and the keyword content.
CN201910672641.4A 2019-07-24 2019-07-24 Intelligent bill information extraction processing method Active CN110427853B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910672641.4A CN110427853B (en) 2019-07-24 2019-07-24 Intelligent bill information extraction processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910672641.4A CN110427853B (en) 2019-07-24 2019-07-24 Intelligent bill information extraction processing method

Publications (2)

Publication Number Publication Date
CN110427853A CN110427853A (en) 2019-11-08
CN110427853B true CN110427853B (en) 2022-11-01

Family

ID=68412215

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910672641.4A Active CN110427853B (en) 2019-07-24 2019-07-24 Intelligent bill information extraction processing method

Country Status (1)

Country Link
CN (1) CN110427853B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111047261B (en) * 2019-12-11 2023-06-16 青岛盈智科技有限公司 Warehouse logistics order identification method and system
CN111275037B (en) * 2020-01-09 2021-06-08 上海知达教育科技有限公司 Bill identification method and device
CN111444793A (en) * 2020-03-13 2020-07-24 安诚迈科(北京)信息技术有限公司 Bill recognition method, equipment, storage medium and device based on OCR
CN113011249B (en) * 2021-01-29 2024-05-28 招商银行股份有限公司 Bill auditing method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108777021A (en) * 2018-05-18 2018-11-09 北京大账房网络科技股份有限公司 It is a kind of to mix the bank slip recognition method and system swept based on scanner
CN109146846A (en) * 2018-07-17 2019-01-04 深圳大学 No-reference image quality evaluation system and method

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8255787B2 (en) * 2009-06-29 2012-08-28 International Business Machines Corporation Automated configuration of location-specific page anchors
CN102841885B (en) * 2011-06-21 2016-02-17 北大方正集团有限公司 Set up the method and apparatus of object anchoring relationship
CN103208004A (en) * 2013-03-15 2013-07-17 北京英迈杰科技有限公司 Automatic recognition and extraction method and device for bill information area
CN107194400B (en) * 2017-05-31 2019-12-20 北京天宇星空科技有限公司 Financial reimbursement full ticket image recognition processing method
CN107766809B (en) * 2017-10-09 2020-05-19 平安科技(深圳)有限公司 Electronic device, bill information identification method, and computer-readable storage medium
CN107958249B (en) * 2017-11-21 2020-09-11 众安信息技术服务有限公司 Text entry method based on image
CN108717545B (en) * 2018-05-18 2020-12-18 北京大账房网络科技股份有限公司 Bill identification method and system based on mobile phone photographing

Also Published As

Publication number Publication date
CN110427853A (en) 2019-11-08

Similar Documents

Publication Publication Date Title
CN110427853B (en) Intelligent bill information extraction processing method
CN108960223B (en) Method for automatically generating voucher based on intelligent bill identification
RU2737720C1 (en) Retrieving fields using neural networks without using templates
CN109657665B (en) Invoice batch automatic identification system based on deep learning
US8515208B2 (en) Method for document to template alignment
CN110442744B (en) Method and device for extracting target information in image, electronic equipment and readable medium
CN111489487B (en) Bill identification method, device, equipment and storage medium
CN108921166A (en) Medical bill class text detection recognition method and system based on deep neural network
CN111507251A (en) Method and device for positioning answer area in test question image and electronic equipment
CN112651289B (en) Value-added tax common invoice intelligent recognition and verification system and method thereof
CN109034155A (en) A kind of text detection and the method and system of identification
CN107358232A (en) Invoice recognition methods and identification and management system based on plug-in unit
JP2004139484A (en) Form processing device, program for implementing it, and program for creating form format
CN109977723A (en) Big bill picture character recognition methods
US9633256B2 (en) Methods and systems for efficient automated symbol recognition using multiple clusters of symbol patterns
CN113158895B (en) Bill identification method and device, electronic equipment and storage medium
CN111191652A (en) Certificate image identification method and device, electronic equipment and storage medium
CN110263616A (en) A kind of character recognition method, device, electronic equipment and storage medium
US20140268250A1 (en) Systems and methods for receipt-based mobile image capture
CN116092231A (en) Ticket identification method, ticket identification device, terminal equipment and storage medium
US20210240932A1 (en) Data extraction and ordering based on document layout analysis
CN116311299A (en) Method, device and system for identifying structured data of table
JPH08297704A (en) Automatic health insurance card recognition method and device and automatic aged person health insurance card recognition method and device
JP3463008B2 (en) Medium processing method and medium processing apparatus
CN114694159A (en) Engineering drawing BOM identification method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant