CN110427853B

CN110427853B - Intelligent bill information extraction processing method

Info

Publication number: CN110427853B
Application number: CN201910672641.4A
Authority: CN
Inventors: 郭其超; 毅力奇
Original assignee: Beijing Yinuo Prospect Finance And Taxation Technology Co ltd
Current assignee: Beijing Yinuo Prospect Finance And Taxation Technology Co ltd
Priority date: 2019-07-24
Filing date: 2019-07-24
Publication date: 2022-11-01
Anticipated expiration: 2039-07-24
Also published as: CN110427853A

Abstract

An embodiment of the invention discloses an intelligent bill information extraction processing method comprising the following steps: photographing a bill to generate an electronic bill picture; determining an identification range by identifying the maximum range occupied by the bill in the electronic bill picture; defining an identification sequence according to the bill typesetting rule and ordering the identification results; identifying bill information by scanning the effective electronic area according to the identification sequence, judging the current bill type, extracting bill type data, and at the same time acquiring the keyword classification information contained in the current bill type; defining an information extraction rule according to the bill type; and performing the information extraction operation by searching for key information in the bill picture according to the information extraction rule. The scheme extracts structured bill information by dynamically defining spatial relationships, and handles line feeds, truncation and vertical text in bills well.

Description

Intelligent bill information extraction processing method

Technical Field

The embodiment of the invention relates to the technical field of picture information identification, in particular to an intelligent bill information extraction processing method.

Background

A bill (negotiable instrument) is a written voucher reflecting a creditor-debtor relationship. Bills arise in the exchange and circulation of the market and reflect the creditor-debtor relationship between the parties. Specifically, in an exchange of property (merchandise, currency and other property rights), each party acquires certain rights and obligations with respect to the property, i.e. a creditor-debtor relationship arises, and this relationship must be recorded and evidenced in writing to ensure that both parties realize their respective rights and fulfil their obligations. It is on this basis that bills are produced. Without a genuine creditor-debtor relationship there is no bill. Being a written voucher that reflects a creditor-debtor relationship is therefore one of the basic properties of a bill.

As enterprises grow and their transaction volume increases, the number of enterprise bills also increases, including bills issued by suppliers and bills submitted for reimbursement by employees travelling on business. Faced with thousands of enterprise bills, how to identify them quickly, efficiently and automatically using computer technology has become a focus of attention.

However, existing bill processing has the following defect: bill information is extracted with a fixed template, which cannot handle line feeds, truncation, vertical text and similar conditions in a bill, so the extracted information differs considerably from the actual content.

Disclosure of Invention

Therefore, the embodiment of the invention provides an intelligent bill information extraction processing method, which adopts a scanning mode from top to bottom and from left to right to identify bill pictures and adopts a mode of dynamically defining spatial relationship to extract structured bill information, can well process line feed, truncation and vertical texts in bills, and solves the problem that the existing technology is difficult to correctly extract complex bill information.

In order to achieve the above object, an embodiment of the present invention provides the following: a method for extracting and processing intelligent bill information comprises the following steps:

step 100, shooting a bill to generate a bill electronic picture;

step 200, determining an identification range, and identifying the maximum range occupied by the electronic bill picture;

step 300, defining an identification sequence, defining the identification sequence according to the bill typesetting rule, and sequencing the identification result;

step 400, identifying bill information, scanning effective electronic areas according to an identification sequence, judging the current bill type, extracting bill type data, and simultaneously acquiring keyword classification information contained in the current bill type;

step 500, defining an information extraction rule, and specifying the information extraction rule according to the bill type;

and step 600, implementing the information extraction operation, identifying the bill information, and searching for key information in the bill picture according to the information extraction rule.

As a preferred embodiment of the present invention, in step 200, the specific steps of determining the identification range are:

step 201, determining the position of the row containing the bill header and the position of the row containing the bill footer according to the top-to-bottom bill typesetting order;

step 202, determining the position of the column at the leftmost side of the bill and the position of the column at the rightmost side of the bill according to the left-to-right bill typesetting order;

step 203, determining the area occupied by the bill electronic picture according to the row-column relationship of the step 201 and the step 202, and cutting the bill electronic picture along the row-column position to generate a bill information graph convenient to identify;

and step 204, establishing a rectangular coordinate system by taking the row-column intersection position of the upper left corner as an origin, and rotationally correcting the bill information graph in the rectangular coordinate system.

In step 201 and step 202, the header row position lies 0-2 depth units from the bill header information, the footer row position lies 0-2 depth units from the bill footer information, the leftmost column position lies 0-2 width units from the leftmost information of the bill, and the rightmost column position lies 0-2 width units from the rightmost information of the bill.

As a preferable scheme of the present invention, in the step 204, the specific step of correcting the position of the bill image is:

determining the included angle between the line on which the bill header lies and the X axis of the rectangular coordinate system;

and rotating the whole bill information graph about the origin of the rectangular coordinate system until the edge rows and columns of the bill information graph coincide with the coordinate axes of the rectangular coordinate system.
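The rotation correction described above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: it assumes the header line is available as two points in the rectangular coordinate system, and it rotates point coordinates rather than image pixels. The names `header_angle` and `rotate_about_origin` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def header_angle(start: Point, end: Point) -> float:
    """Angle in radians between the header line (start -> end) and the X axis."""
    return math.atan2(end[1] - start[1], end[0] - start[0])


def rotate_about_origin(points: List[Point], angle: float) -> List[Point]:
    """Rotate points by -angle about the origin so the header line becomes parallel to the X axis."""
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]


# Example: a header line tilted by 45 degrees is brought back onto the X axis.
angle = header_angle((0.0, 0.0), (1.0, 1.0))
corrected = rotate_about_origin([(1.0, 1.0)], angle)
```

In a real system the same angle would typically be passed to an image rotation routine; that step is left out here because the source does not name one.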

As a preferred aspect of the present invention, in step 300, the identification sequence proceeds over the bill information from top to bottom and from left to right; that is, the rows are identified from top to bottom, and within each row the bill information is identified from left to right.

As a preferred embodiment of the present invention, in step 400, the specific steps of identifying the ticket information are:

step 401, determining the scanning identification depth and the scanning identification width of the electronic bill map, and defining the area of the scanning identification depth and the scanning identification width as a scanning identification unit;

step 402, scanning the bill information graph by the scanning recognition unit from the origin of the rectangular coordinate system according to the sequence from top to bottom and then from left to right, or the sequence from left to right and then from top to bottom;

step 403, splicing the scanning results into a picture, identifying the character information of the spliced picture in real time, comparing the identification information with the type of the existing bill in real time, and determining the type of the current bill;

step 404, according to the determined bill type, confirming the keyword classification contained in the current bill type, continuously working by the scanning and identifying unit, and when the keyword information of the splicing map is complete, cutting the complete keyword and the information in the keyword range from the splicing map into an information map for storage, wherein incomplete character information is reserved on the original splicing map;

and step 405, the scanning identification unit continues working, and step 404 is repeated until the whole bill information image is scanned.

As a preferred scheme of the present invention, in step 403, after a complete bill type appears in the mosaic, the bill type information is cut and stored, and the specific steps are as follows:

selecting the adjacent positions of the leftmost character and the lowermost character of the bill type information as absolute anchor points A on the splicing map;

selecting the adjacent positions of the rightmost character and the topmost character of the bill type information as absolute anchor points B on the splicing map;

setting a rectangular wire frame between the absolute anchor point A and the absolute anchor point B, and delineating the bill type information content;

the absolute anchor point A and the absolute anchor point B are respectively deviated along the left and right directions to obtain an offset anchor point A1 and an offset anchor point B1, and the bill type information content is defined again;

and calculating useful information of each range result, and determining the keyword content using the useful information.

As a preferred scheme of the present invention, in step 404, the specific step of cutting the ticket information keyword is as follows:

determining complete keyword classification contained in information of one line of the splicing diagram, and taking the leftmost side of each complete keyword as a cutting demarcation point;

taking the distance between the lower end of each keyword and the upper end of the corresponding keyword range as cutting depth, taking the total length of the keyword range and the keywords as cutting width, and performing rectangular cutting on each keyword and the keyword range to obtain an information graph for storage;

each cutting information graph is used for storing a keyword and corresponding keyword information.

As a preferred embodiment of the present invention, when determining the cutting depth, a position 1 to 2 depth units below the lower end of the keyword is used as the lower boundary, a position 0 to 1 depth unit above the upper end of the keyword range is used as the upper boundary, and the distance between the upper boundary and the lower boundary is used as the cutting depth.

As a preferred embodiment of the present invention, in step 600, the specific steps of implementing the information extraction operation are:

step 601, converting all the information graphs in step 400 into computer-recognizable characters by adopting an attention-based image-to-character model in deep learning, and generating a corresponding keyword information set, wherein each keyword information element in the keyword information set is represented as "category: content";

step 602, extracting the required keyword category under the current bill type according to the information extraction rule defined in step 500;

step 603, outputting and displaying the extracted keyword category and the keyword content.

The embodiment of the invention has the following advantages:

(1) When the identification range is determined, the blank margin areas of the bill picture are cut off and the angle of the whole bill picture is corrected, so that the bill picture is displayed normally and can be filed separately as a bill record;

(2) When extracting the bill information, scanning the bill pictures from top to bottom and from left to right to realize complete scanning of the bill information, extracting the bill structural information by adopting a mode of dynamically defining a spatial relationship, and well processing line feed, truncation and vertical texts in the bill, so that the bill information can be accurately extracted for post-processing use, the conditions of error and leakage of the bill information or unmatched information are prevented, and the stability and the precision of intelligent identification of the bill information are improved;

(3) The invention only needs to manually input the bill picture, realizes the cutting processing of the bill and the extraction of the keywords and the keyword range by means of an intelligent identification technology, has convenient operation, does not need excessive manual interference, and is simple to realize, thereby facilitating the repeated input and arrangement operation of a large amount of bill information.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be apparent that the drawings in the following description are merely exemplary, and that other embodiments can be derived from the drawings provided by those of ordinary skill in the art without inventive effort.

The structures, proportions, sizes and the like shown in this specification are only intended to accompany the content disclosed in the specification so that those skilled in the art can understand and read it; they are not intended to limit the conditions under which the invention can be implemented and therefore have no substantive technical significance. Any structural modification, change of proportion or adjustment of size that does not affect the effects and purposes achievable by the invention shall still fall within the scope covered by the technical content disclosed in the invention.

Fig. 1 is a schematic flow chart of a bill information extraction processing method according to an embodiment of the present invention;

Detailed Description

The present invention is described in terms of particular embodiments, other advantages and features of the invention will become apparent to those skilled in the art from the following disclosure, and it is to be understood that the described embodiments are merely exemplary of the invention and that it is not intended to limit the invention to the particular embodiments disclosed. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, the invention provides an intelligent bill information extraction processing method, which adopts an intelligent identification mode to replace manual information input, effectively avoids manual errors, and improves the processing efficiency of bill information.

When the intelligent recognition technology is used for processing the bill information, the invention is mainly characterized in that:

(1) When the identification range is determined, the blank margin areas of the bill picture are cut off and the angle of the whole bill picture is corrected, so that the bill picture is displayed normally in a rectangular coordinate system and can be filed separately as a bill record;

(2) When extracting the bill information, scanning the bill pictures from top to bottom and from left to right, and individually classifying the bill types and the information to be extracted and intercepting the bill types and the information to be extracted from the scanned pictures, so that the bill information can be accurately extracted for post-processing use, the condition that the bill information is wrong and missed or the information is not matched is prevented, and the stability and the precision of intelligently identifying the bill information are improved;

The method specifically comprises the following steps:

and step 100, shooting the bill to generate an electronic bill picture.

When the bill is photographed, attention must be paid to the sharpness and brightness of the electronic bill picture, so that poor picture clarity does not impair the later reading of the bill information.

Step 200, determining an identification range, and identifying the maximum range occupied by the electronic bill picture.

The purpose of this step is to cut away the blank margin areas of the electronic bill picture and so reduce redundant scanning and recognition later, which lowers the complexity of processing and extracting the bill information and improves recognition efficiency.

In step 200, the specific steps of determining the identification range are as follows:

step 202, determining the position of the column where the leftmost side of the bill is located and the position of the column where the rightmost side of the bill is located according to the bill typesetting sequence from left to right;

step 203, determining the area occupied by the bill electronic picture according to the row-column relationship of the step 201 and the step 202, and cutting the bill electronic picture along the row-column position to generate a bill information map convenient to identify.

The three steps can define the whole content of the bill information through the row and column relationship of up, down, left and right, and cut and remove the blank area of the electronic picture of the bill.

In addition, when the area occupied by the electronic bill picture is set, the cutting rows and columns are kept parallel to the corresponding edges of the bill, which ensures that the bill information graph is cut as a rectangle and makes it easy to determine the angular relationship between the bill information graph and the rectangular coordinate system.

Furthermore, the header row position lies 0-2 depth units from the bill header information, the footer row position lies 0-2 depth units from the bill footer information, the leftmost column position lies 0-2 width units from the leftmost information of the bill, and the rightmost column position lies 0-2 width units from the rightmost information of the bill.

The depth unit and the width unit of the invention are related to the size of the text characters of the bill information. Generally, the height of one character is taken as one depth unit and the width of one character as one width unit. The character size can be determined from the font used in each bill, or a universal character size can be defined directly.

When the bill information graph is cropped, the cropping region is not exactly equal to the area occupied by the bill information content but is expanded outwards relative to it, which improves fault tolerance and prevents bill information from being lost.
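The cropping with an outward margin can be illustrated by the following minimal sketch. It assumes the electronic bill picture has already been binarized into a list of rows in which 1 marks bill content and 0 marks blank background; this representation and the names `content_bounds` and `crop_with_margin` are assumptions made for illustration and do not come from the patent.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 1 = bill content, 0 = blank background (assumed representation)


def content_bounds(mask: Mask) -> Tuple[int, int, int, int]:
    """Return (top, bottom, left, right) indices of the rows and columns that hold content."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("no bill content found")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], rows[-1], cols[0], cols[-1]


def crop_with_margin(mask: Mask, depth_unit: int, width_unit: int,
                     margin_units: int = 1) -> Mask:
    """Crop to the content, expanded outwards by 0-2 depth/width units on each side."""
    if not 0 <= margin_units <= 2:
        raise ValueError("margin must be 0-2 units")
    top, bottom, left, right = content_bounds(mask)
    height, width = len(mask), len(mask[0])
    # Clamp the expanded boundaries to the picture so the crop stays valid.
    top = max(0, top - margin_units * depth_unit)
    bottom = min(height - 1, bottom + margin_units * depth_unit)
    left = max(0, left - margin_units * width_unit)
    right = min(width - 1, right + margin_units * width_unit)
    return [row[left:right + 1] for row in mask[top:bottom + 1]]
```

Here one depth unit and one width unit are given in pixels, corresponding to the height and width of one character.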

The bill information graph is then corrected in direction to ensure that it coincides spatially with the rectangular coordinate system.

The specific steps for correcting the position of the bill picture are as follows:

Therefore, as one of the main characteristic points of the invention, when the identification range is determined by cutting the bill picture, the blank area of the edge of the bill picture is cut off, and the angle correction is performed on the whole bill picture, so that the bill picture can be normally displayed in a rectangular coordinate system and can be separately filed as the bill basis.

Step 300, defining an identification sequence, defining the identification sequence according to the bill typesetting rule, and sequencing the identification result.

The identification sequence proceeds over the bill information from top to bottom and from left to right; that is, the rows are identified from top to bottom, and within each row the bill information is identified from left to right.

That is, when the cropped bill information graph is identified, the whole graph is covered in order from top to bottom and from left to right, so that no information is missed. At the same time, the bill information can be matched and extracted promptly in identification order, which avoids mismatching information during extraction.
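As an illustration of this ordering, the sketch below sorts recognized text boxes into top-to-bottom, left-to-right order. Grouping boxes into rows by dividing the vertical position by one depth unit is an assumption made for this sketch, as are the names `TextBox` and `reading_order`.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    text: str
    x: float  # left edge
    y: float  # top edge; y grows downward from the origin at the upper left corner


def reading_order(boxes: List[TextBox], depth_unit: float) -> List[TextBox]:
    """Sort boxes by row (top to bottom), then left to right within each row."""
    return sorted(boxes, key=lambda b: (int(b.y // depth_unit), b.x))


boxes = [TextBox("amount", 50, 12), TextBox("payee", 5, 11), TextBox("title", 20, 1)]
ordered = [b.text for b in reading_order(boxes, depth_unit=10)]
```

With a depth unit of 10, "payee" and "amount" fall into the same row even though their top edges differ slightly, so they are ordered by their horizontal position.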

Step 400, identifying the bill information, scanning the effective electronic area according to the identification sequence, judging the current bill type, extracting the bill type data, and simultaneously acquiring the keyword classification information contained in the current bill type.

This step has two parts: classifying the bill to judge its type, and acquiring the bill information graphs. Bill types fall into three categories: drafts, promissory notes and checks. Drafts are divided into bank drafts and commercial drafts; promissory notes are divided into commercial promissory notes and bank promissory notes; checks are divided into registered checks, unregistered checks, crossed checks, cash checks and transfer checks. Different bill types therefore contain different key information.

Therefore, in step 400, the specific steps of identifying the ticket information are:

firstly, the scanning recognition depth and the scanning recognition width of the electronic bill map are determined, and the area of the scanning recognition depth and the scanning recognition width is defined as a scanning recognition unit.

The scanning recognition depth and the scanning recognition width are determined in proportion to the depth unit and the width unit. In general, scanning recognition depth = K · depth unit and scanning recognition width = M · width unit, where K, M ≥ 1. The scanning recognition depth and the scanning recognition width together form a scanning recognition unit, and moving this unit across the bill electronic map traverses all of its information.

Then, the scanning recognition unit scans the bill information graph from the origin of the rectangular coordinate system in the order from top to bottom and from left to right.

When the bill information graph is scanned, it can first be divided into several rows according to the scanning recognition depth and each row scanned from left to right, or first be divided into several columns according to the scanning recognition width and each column scanned from top to bottom, which effectively accommodates line feeds, truncation and vertical text in the bill.
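The traversal by the scanning recognition unit can be sketched as a generator of window positions. This is an illustrative sketch only: it assumes K and M are integers, that units are given in pixels, and that windows do not overlap; the name `scan_windows` and its parameters are hypothetical.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (top, left, height, width) in pixels


def scan_windows(img_h: int, img_w: int, depth_unit: int, width_unit: int,
                 k: int = 1, m: int = 1, row_major: bool = True) -> Iterator[Window]:
    """Yield scanning windows of size (K * depth unit) x (M * width unit).

    row_major=True  : split into rows, scan each row from left to right.
    row_major=False : split into columns, scan each column from top to bottom.
    """
    if k < 1 or m < 1:
        raise ValueError("K and M must be at least 1")
    win_h, win_w = k * depth_unit, m * width_unit
    tops = range(0, img_h, win_h)
    lefts = range(0, img_w, win_w)
    if row_major:
        order = ((t, l) for t in tops for l in lefts)
    else:
        order = ((t, l) for l in lefts for t in tops)
    for top, left in order:
        # Windows at the right or bottom edge are clipped to the picture.
        yield top, left, min(win_h, img_h - top), min(win_w, img_w - left)
```

The column-first order is the one that suits vertical text, since each column is read from top to bottom before the unit moves right.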

And then splicing the scanning results, identifying the character information of the spliced graph in real time, comparing the identification information with the existing bill type in real time, and determining the type of the current bill.

Each time the scanning recognition unit advances by one width unit, the information it has recognized is stitched on; likewise, each time it advances by one depth unit, the recognized information is stitched on.

Therefore, each bill type stored in the existing bill type database is the combination of a bank name and a bill classification. Comparing the bill type information recognized from the bill with the existing bill type database determines the type of the current bill, with the combination of bank and bill classification serving as the effective bill category information.
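A minimal sketch of such a comparison follows. It assumes the database is a set of (bank name, bill classification) pairs and that matching is a plain substring test on the stitched text; both are assumptions for illustration, and the bank names used are placeholders, not entries from any real database.

```python
from typing import Optional, Set, Tuple

BillType = Tuple[str, str]  # (bank name, bill classification)


def match_bill_type(stitched_text: str, known_types: Set[BillType]) -> Optional[BillType]:
    """Return the first known (bank, classification) pair fully present in the text."""
    for bank, classification in sorted(known_types):
        if bank in stitched_text and classification in stitched_text:
            return bank, classification
    return None  # text still incomplete, or the database lacks this type


# Placeholder database entries, purely illustrative.
known = {("Example Bank", "transfer check"), ("Other Bank", "cash check")}
found = match_bill_type("Example Bank transfer check No. 001", known)
missing = match_bill_type("Unknown Bank transfer check", known)
```

A `None` result corresponds to the case where stitching must continue or where the database has not been updated for the bill at hand.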

This step extracts structured bill information by dynamically defining spatial relationships through the top-to-bottom, left-to-right recognition sequence, and handles line feeds, truncation and vertical text in the bill well. Once the bill type has been determined, the basic information contained in the bill (in particular checks, drafts and promissory notes) can be determined, including the payee, the remitter, the remittance time and so on, which fixes the keywords of the bill information to be extracted and facilitates the subsequent scanning and recognition.

The specific steps of cutting, extracting and storing the information of the bill types are as follows:

and setting a rectangular wire frame between the absolute anchor A and the absolute anchor B to define the content of the bill type information.

These steps work by comparing the current bill type information with the existing bill type database to determine the classification of the current bill (promissory note, draft or check). However, because there are many banks, bill type information may be lost during cutting and extraction if the existing bill type database has not been updated in time.

To avoid losing bill type information, a further operation is carried out: the absolute anchor point A and the absolute anchor point B are offset in the left and right directions respectively to obtain an offset anchor point A1 and an offset anchor point B1, and the bill type information content is delineated again;

and this operation is repeated, the useful information of each delineated range is calculated, and the complete bill type information to be used is determined.

That is, the existing bill type database may not have been updated in time with the current bill type. For example, if the bill is a transfer check of "Wide Bank" and the database does not yet hold the combination of "Wide Bank" and "transfer check", only "transfer check" is delineated and "Wide Bank" is left out; by extending the delineated cutting area a second time, the invention can still determine the complete bill type information.

The bill type information "Wide Bank" and "transfer check" is taken as a specific example below:

firstly, the existing database is compared and the absolute anchor point A and the absolute anchor point B are used to delineate the "transfer check" information; because the bank information is missing, the absolute anchor point A and the absolute anchor point B need to be expanded;

and then, the absolute anchor point A and the absolute anchor point B are offset in the left and right directions respectively to obtain an offset anchor point A1 and an offset anchor point B1, wherein the offset is related to the character size of "transfer check" and is generally chosen as 1-1.5 times that character size.

And finally, redefining the bill type information content to obtain complete bill type information.
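The anchor expansion in this example can be sketched geometrically as follows. The sketch assumes that anchor A is the lower-left corner and anchor B the upper-right corner of the frame, with y growing downward, and it leaves the test for "complete bill type information" to a caller-supplied function, since the source does not specify how useful information is calculated. All names are hypothetical.

```python
from typing import Callable, NamedTuple, Tuple


class Anchor(NamedTuple):
    x: float
    y: float


class Frame(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def frame_between(a: Anchor, b: Anchor) -> Frame:
    """Rectangular frame spanned by anchor A (lower left) and anchor B (upper right)."""
    return Frame(left=a.x, top=b.y, right=b.x, bottom=a.y)


def offset_anchors(a: Anchor, b: Anchor, char_size: float,
                   factor: float = 1.0) -> Tuple[Anchor, Anchor]:
    """Shift A to the left and B to the right by 1-1.5 character sizes, giving A1 and B1."""
    if not 1.0 <= factor <= 1.5:
        raise ValueError("offset factor must be within 1-1.5")
    shift = factor * char_size
    return Anchor(a.x - shift, a.y), Anchor(b.x + shift, b.y)


def expand_until_complete(a: Anchor, b: Anchor, char_size: float,
                          is_complete: Callable[[Frame], bool],
                          factor: float = 1.0, max_steps: int = 5) -> Frame:
    """Repeat the offset until the caller's completeness test passes or max_steps is reached."""
    frame = frame_between(a, b)
    for _ in range(max_steps):
        if is_complete(frame):
            break
        a, b = offset_anchors(a, b, char_size, factor)
        frame = frame_between(a, b)
    return frame
```

In practice `is_complete` would run recognition on the framed region and check whether a bank name now accompanies the bill classification.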

Therefore, as the third main characteristic point of the invention, the invention can ensure that complete bill type information is extracted, accurately extract the bill information for post-processing use, prevent the bill information from being mistaken or unmatched, improve the stability and precision of intelligent identification bill information and facilitate the subsequent extraction of keyword information by performing extension delineation and cutting on the bill types.

After determining the bill type information, the content part of the bill also needs to be extracted, so that the keyword classification contained in the current bill type is determined according to the determined bill type, the scanning and identifying unit continues to work, when the keyword information of the splicing map is complete, the complete keyword and the information in the keyword range are cut from the splicing map as the information map to be stored, and incomplete character information is reserved on the original splicing map.

The method comprises the following specific steps of cutting the bill information keywords:

and determining the complete keyword classification contained in one line of information of the splicing map, and taking the leftmost side of each complete keyword as a cutting boundary point.

Because the scanning recognition unit works from top to bottom and from left to right, a completed line of bill content may contain two or more keywords. Taking the leftmost side of each complete keyword as a cutting demarcation point therefore allows the keyword content to be cut apart.

And taking the distance between the lower end of each keyword and the upper end of the corresponding keyword range as cutting depth, taking the total length of the keyword range and the keywords as cutting width, and performing rectangular cutting on each keyword and the keyword range to store the information graph.

When the cutting depth is determined, 1-2 depth units at the lower end of the keyword are used as a lower boundary, 0-1 depth unit at the upper end of the keyword range is used as an upper boundary, and the distance between the upper boundary and the lower boundary is used as the cutting depth, so that the condition of incomplete information interception can be avoided.
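The cutting rectangle for one keyword and its keyword range can be computed as in the sketch below. It assumes both regions are axis-aligned boxes with y growing downward and that the depth unit is given in pixels; the names `Box` and `cut_rectangle` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def cut_rectangle(keyword: Box, keyword_range: Box, depth_unit: float,
                  lower_units: float = 1, upper_units: float = 0) -> Box:
    """Rectangle that cuts out a keyword together with its keyword range.

    Lower boundary: 1-2 depth units below the lower end of the keyword.
    Upper boundary: 0-1 depth unit above the upper end of the keyword range.
    Left boundary : leftmost side of the keyword (the cutting demarcation point).
    """
    if not 1 <= lower_units <= 2:
        raise ValueError("lower boundary must be 1-2 depth units")
    if not 0 <= upper_units <= 1:
        raise ValueError("upper boundary must be 0-1 depth unit")
    lower = keyword.bottom + lower_units * depth_unit
    upper = max(0, keyword_range.top - upper_units * depth_unit)
    right = max(keyword.right, keyword_range.right)
    return Box(left=keyword.left, top=upper, right=right, bottom=lower)


rect = cut_rectangle(Box(10, 20, 50, 30), Box(55, 18, 150, 32),
                     depth_unit=10, lower_units=1, upper_units=1)
cut_depth = rect.bottom - rect.top
cut_width = rect.right - rect.left
```

The extra depth units on both boundaries are what keep a value that wraps onto a second line inside the cut.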

According to the steps, the extraction of the bill content can be completed.

Step 500, defining an information extraction rule, and specifying the information extraction rule according to the bill type.

Because the content extracted in step 400 comprises all bill information that contains keywords, further filtering is needed to screen out the useful content; defining the information extraction rule for each bill type determines the screening conditions.

The specific steps for implementing the information extraction operation are as follows:

step 601, converting all the information graphs in step 400 into computer-recognizable characters by adopting an attention-based image-to-character model in deep learning to generate a corresponding keyword information set, wherein each keyword information element in the keyword information set is represented as "category: content";

and step 603, outputting and displaying the extracted keyword category and the keyword content.
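The filtering of the keyword information set by an extraction rule can be illustrated as follows. The sketch starts from already recognized "category: content" strings and does not model the attention-based recognition itself. The rule table, its categories and the function names are hypothetical examples, not rules taken from the patent.

```python
from typing import Dict, List

# Hypothetical extraction rules: required keyword categories per bill type.
EXTRACTION_RULES: Dict[str, List[str]] = {
    "transfer check": ["payee", "amount"],
}


def parse_elements(elements: List[str]) -> Dict[str, str]:
    """Split each "category: content" element at the first colon."""
    parsed: Dict[str, str] = {}
    for element in elements:
        category, sep, content = element.partition(":")
        if sep:  # skip elements without a colon
            parsed[category.strip()] = content.strip()
    return parsed


def extract(bill_type: str, elements: List[str],
            rules: Dict[str, List[str]] = EXTRACTION_RULES) -> Dict[str, str]:
    """Keep only the keyword categories required for the given bill type."""
    parsed = parse_elements(elements)
    return {c: parsed[c] for c in rules.get(bill_type, []) if c in parsed}


result = extract("transfer check", ["payee: Alice", "amount: 100.00", "memo: rent"])
```

Here "memo" is recognized but dropped, because the rule for the hypothetical "transfer check" type does not ask for it.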

The invention only needs to input the bill picture manually, realizes the cutting processing of the bill and the extraction of the keywords and the keyword range by means of an intelligent identification technology, has convenient operation, does not need excessive manual interference, and is simple to realize, thereby facilitating the repeated input and arrangement operation of a large amount of bill information.

Although the invention has been described in detail with respect to the general description and the specific embodiments, it will be apparent to those skilled in the art that modifications and improvements may be made based on the invention. Accordingly, such modifications and improvements are intended to be within the scope of the invention as claimed.

Claims

1. A method for extracting and processing intelligent bill information, characterized by comprising the following steps:

step 100, shooting a bill to generate a bill electronic picture;

in step 400, the specific steps of identifying the bill information are:

step 404, confirming, according to the determined bill type, the keyword classifications contained in the current bill type; the scanning identification unit continues working, and when the keyword information in the splicing map is complete, the complete keyword and the information in its keyword range are cut from the splicing map into an information graph for storage, while incomplete text information is retained on the original splicing map;

step 405, the scanning identification unit continues working, and step 404 is repeated until the whole bill information image is scanned;

in step 403, when the complete bill type appears in the spliced graph, the bill type information is cut and stored, and the specific steps are as follows:

the absolute anchor point A and the absolute anchor point B are respectively offset in the left and right directions to obtain an offset anchor point A1 and an offset anchor point B1, and the bill type information content is redefined;

calculating useful information of each range result, and determining the keyword content using the useful information;

2. The method for extracting and processing the intelligent bill information according to claim 1, wherein in the step 200, the specific step of determining the identification range is:

step 201, determining the position of a bill head-up line and the position of a bill tail line according to the order of bill typesetting from top to bottom;

3. The method for extracting and processing the intelligent bill information according to claim 2, wherein: in step 201 and step 202, the bill head-up line is located 0-2 depth units from the bill head-up information, the bill tail line is located 0-2 depth units from the bill tail information, the leftmost column of the bill is located 0-2 width units from the leftmost information of the bill, and the rightmost column of the bill is located 0-2 width units from the rightmost information of the bill.

4. The method for extracting and processing the intelligent bill information according to claim 2, wherein in the step 204, the specific step of correcting the bill picture position is:

and rotating the whole bill information graph about the origin of the rectangular coordinate system until the edge rows and columns of the bill information graph coincide with the coordinate axes of the rectangular coordinate system.

5. The method for extracting and processing the intelligent bill information according to claim 1, wherein in step 300, the identification sequence is from top to bottom and from left to right: the rows are identified in order from top to bottom, and the bill information within each row is identified from left to right.

6. The method for extracting and processing the intelligent bill information according to claim 1, wherein in step 404, the specific step of cutting the bill information keyword comprises:

determining complete keyword classification contained in one line of information of the splicing map, and taking the leftmost side of each complete keyword as a cutting boundary point;

taking the distance between the lower end of each keyword and the upper end of the corresponding keyword range as a cutting depth, taking the total length of the keyword range and the keywords as a cutting width, and performing rectangular cutting on each keyword and the keyword range to obtain an information graph for storage;

7. The method for extracting and processing the intelligent bill information according to claim 6, wherein: when the cutting depth is determined, 1-2 depth units at the lower end of the keyword are used as a lower boundary, 0-1 depth unit at the upper end of the keyword range is used as an upper boundary, and the distance between the upper boundary and the lower boundary is used as the cutting depth.

8. The method for extracting and processing the intelligent bill information according to claim 1, wherein in step 600, the specific steps for implementing the information extraction operation are as follows: