CN111401002A - Method, device and computer storage medium for automatically identifying PDF electronic receipt information - Google Patents
- Publication number
- CN111401002A
- Authority
- CN
- China
- Prior art keywords
- field
- template
- electronic receipt
- bank
- pdf electronic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/02—Banking, e.g. interest calculation or account maintenance
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- General Physics & Mathematics (AREA)
- Development Economics (AREA)
- Technology Law (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Marketing (AREA)
- Economics (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
Abstract
At present, many group companies match hundreds or even thousands of bank receipts to fund settlement sheets and business report bills by hand at the end of each month; work efficiency urgently needs to be improved and costs reduced. To address this problem, the invention provides a method for automatically identifying PDF electronic receipt information, comprising: presetting bank templates in a database; receiving a PDF electronic receipt task sent by a user; determining the corresponding bank template; reading the required service content; inserting the required service content into a database business table; and automatically matching the fund settlement sheet and the business report bill. Using templates preset per bank, a text-based bank PDF electronic receipt file is recognized and parsed into formatted data, which is then automatically associated with the fund settlement sheet and the business report bill in turn. This resolves the pain point of heavy workload, long processing time and low efficiency when a cashier reconciles accounts manually with paper bank receipts.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a method, a device and a storage medium for automatically identifying PDF electronic receipt information.
Background
The bank receipt is the original basis on which an enterprise prepares its bookkeeping vouchers, and an enterprise keeps a corresponding receipt as evidence for every receipt and payment of funds. A receipt mainly contains detailed information such as the date, serial number, account number, currency and amount, and each entry has its own receipt. Fund management at a group company therefore involves processing a large number of receipts.
At present, group companies control the receipts and payments of their subordinate enterprises ever more tightly. At the end of each month, hundreds or even thousands of bank receipts are matched by hand to fund settlement sheets and business report bills. This simple, repetitive labor is very time-consuming and is a pain point for cashiers; work efficiency urgently needs to be improved and costs reduced.
Disclosure of Invention
Based on the above problems, the invention provides a method for automatically identifying PDF electronic receipt information, which aims to obtain the required text content accurately and to free cashiers from manual reconciliation with paper bank receipts.
There are currently many ways to read a text-based PDF document; for example, iTextSharp and PDFBox can both read the content out as a character string. However, the formats of electronic PDF receipts differ between banks, and even within a single bank the format is not uniform, so the order of the characters read out varies and the required text content cannot be accurately identified and obtained in a single fixed way.
The extracted character string therefore has to be parsed automatically according to logical rules: by presetting templates and, for each field, the field that precedes it and the field that follows it, the required text content can be obtained more accurately.
In order to achieve the above object, the present invention provides a method for automatically identifying PDF electronic receipt information, comprising:
S1: receiving a PDF electronic receipt task sent by a user;
S2: determining the corresponding bank template;
S3: reading the required service content;
S4: inserting the service content into a business table of the database;
S5: automatically matching the fund settlement sheet and the business report bill.
Preferably, step S1 is preceded by the following step:
S0: presetting the bank templates.
Further, step S0 includes:
S101: reading the PDF electronic receipt text information of each bank;
S102: establishing a template preset table according to the PDF electronic receipt text information of each bank;
S103: establishing a field preset table according to the PDF electronic receipt text information of each bank;
S104: analyzing the text data and presetting the preceding-field and following-field data.
Preferably, the template preset table in step S102 includes the following fields, all of data type VARCHAR: inner code, bank number, bank name, template number and template name.
Preferably, the field preset table in step S103 includes the following fields, all of data type VARCHAR: inner code, field name, field number, start field, end field and start field sequence number.
Further, step S2 includes:
looping over the rows of the template preset table; for each template, obtaining the field name data of the corresponding field preset table and searching the read text content for each name in turn, until a unique template is matched; if multiple templates match, prompting that multiple bank templates were found and that the template configuration should be checked; and if no template matches, prompting that no corresponding bank template can be found.
Further, step S3 includes:
after the template is determined, searching the read text content using the start field and the end field in the field preset table; if the start field text occurs more than once in the text content, using the start field sequence number to determine which occurrence marks the start position; then locating the first matching end field after that position, the content between the two fields being the required service content.
The invention also provides a device for automatically identifying the PDF electronic receipt information, which comprises:
a memory for storing a computer program;
and the processor is used for executing the computer program to realize any one of the above methods for automatically identifying the PDF electronic receipt information.
The invention also provides a computer storage medium storing a computer program which, when executed, causes the device in which the computer storage medium is located to perform any one of the above methods for automatically identifying PDF electronic receipt information.
Using templates preset per bank, the invention reads text-based bank PDF electronic receipt files into the system and recognizes them as formatted data. The formatted data is parsed according to the preset format and then automatically associated with the fund settlement sheet and the business report bill in turn, which resolves the pain point of heavy workload, long processing time and low efficiency when a cashier reconciles accounts manually with paper bank receipts.
In addition, the invention allows template formats to be defined flexibly per bank; where one bank issues PDF receipts in several formats, a corresponding template can be defined for each. By presetting the field that precedes and the field that follows a given field, the required text content can be identified flexibly and obtained accurately, and the fund settlement sheet and the business report bill can then be matched automatically from the obtained content.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram illustrating text contents searched and read according to a start field and an end field in a field preset table according to the present invention;
FIG. 3 is a schematic diagram of a PDF electronic receipt of a certain bank in the embodiment.
Detailed Description
In order to better illustrate and facilitate the understanding of the process of the invention, examples are presented for the purpose of illustration. It should be noted that the examples are only for illustration and should not be taken as a basis for limiting the scope of the present invention.
The invention provides a method for automatically identifying PDF electronic receipt information, which comprises the following steps:
For the PDF electronic receipt of a certain bank (FIG. 3), the text content read using the GetTextFromPage method of the PdfTextExtractor class in the iTextSharp assembly for C# is as follows:
The template preset table structure is designed as in Table 1:
Serial number | Field name | Field identifier | Data type |
---|---|---|---|
1 | Inner code | ZJYHDZHDYSZB_NM | VARCHAR(40) |
2 | Bank number | ZJYHDZHDYSZB_YHBH | VARCHAR(100) |
3 | Bank name | ZJYHDZHDYSZB_YHMC | VARCHAR(100) |
4 | Template number | ZJYHDZHDYSZB_MBBH | VARCHAR(100) |
5 | Template name | ZJYHDZHDYSZB_MBMC | VARCHAR(100) |
TABLE 1
The field preset table structure is designed as in Table 2:
Serial number | Field name | Field identifier | Data type |
---|---|---|---|
1 | Inner code | ZJYHDZHDYS_NM | VARCHAR(40) |
2 | Field name | ZJYHDZHDYS_ZDMC | VARCHAR(100) |
3 | Field number | ZJYHDZHDYS_ZDBH | VARCHAR(40) |
4 | Start field | ZJYHDZHDYS_KSZD | VARCHAR(100) |
5 | End field | ZJYHDZHDYS_ZZZD | VARCHAR(100) |
6 | Start field sequence number | ZJYHDZHDYS_KSZDXH | VARCHAR(10) |
TABLE 2
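As a rough illustration, Tables 1 and 2 could be created as follows. This is only a sketch using Python's standard-library sqlite3 module: the table names ZJYHDZHDYSZB and ZJYHDZHDYS are assumptions inferred from the column prefixes, the helper names are hypothetical, and the source does not state which database is used or how field rows are linked to a template.

```python
import sqlite3

# Table names are assumptions inferred from the column prefixes;
# the column identifiers and types follow Tables 1 and 2.
SCHEMA = """
CREATE TABLE ZJYHDZHDYSZB (            -- template preset table (Table 1)
    ZJYHDZHDYSZB_NM    VARCHAR(40),    -- inner code
    ZJYHDZHDYSZB_YHBH  VARCHAR(100),   -- bank number
    ZJYHDZHDYSZB_YHMC  VARCHAR(100),   -- bank name
    ZJYHDZHDYSZB_MBBH  VARCHAR(100),   -- template number
    ZJYHDZHDYSZB_MBMC  VARCHAR(100)    -- template name
);
CREATE TABLE ZJYHDZHDYS (              -- field preset table (Table 2)
    ZJYHDZHDYS_NM      VARCHAR(40),    -- inner code
    ZJYHDZHDYS_ZDMC    VARCHAR(100),   -- field name
    ZJYHDZHDYS_ZDBH    VARCHAR(40),    -- field number
    ZJYHDZHDYS_KSZD    VARCHAR(100),   -- start field
    ZJYHDZHDYS_ZZZD    VARCHAR(100),   -- end field
    ZJYHDZHDYS_KSZDXH  VARCHAR(10)     -- start field sequence number
);
"""


def create_preset_tables(conn: sqlite3.Connection) -> None:
    """Create both preset tables on the given connection."""
    conn.executescript(SCHEMA)


def column_names(conn: sqlite3.Connection, table: str) -> list[str]:
    """Return the column names of a table, in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
```

Calling `create_preset_tables(sqlite3.connect(":memory:"))` builds both tables in a throwaway database, which is enough to try out the lookup steps described later.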
Analyzing the character sequence and presetting the template, together with the preceding-field and following-field data of each service field, yields the template preset table shown in Table 3:
TABLE 3
The field preset table is as in table 4:
TABLE 4
Other bank templates are preset in the database in the same way.
When a PDF electronic receipt task is received from the user, the program parses the receipt according to the preset data, in the following steps.
Determining the template: loop over the rows of the template preset table; for each template, fetch the ZJYHDZHDYS_ZDMC (field name) column data of the corresponding field preset table and search the read text content for each name in turn, until a unique template is matched. If multiple templates match, prompt that multiple bank templates were found and that the template configuration should be checked; if no template matches, prompt that no corresponding bank template can be found.
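The template lookup can be sketched in Python as below. It rests on two assumptions that the source does not spell out: a template counts as matched only when every one of its preset field names occurs in the text, and the preset data is already loaded into a dictionary keyed by template number. `determine_template` and `TemplateError` are hypothetical names.

```python
class TemplateError(Exception):
    """Raised when zero or several bank templates match a receipt."""


def determine_template(text: str, templates: dict[str, list[str]]) -> str:
    """Return the number of the single template whose field names all occur in text.

    templates maps a template number to the field names (ZJYHDZHDYS_ZDMC values)
    preset for that template.
    """
    matches = [
        number
        for number, field_names in templates.items()
        # Assumption: a template with no preset fields never matches.
        if field_names and all(name in text for name in field_names)
    ]
    if len(matches) > 1:
        raise TemplateError(
            "Multiple bank templates matched, check the template configuration: "
            + ", ".join(matches)
        )
    if not matches:
        raise TemplateError("No corresponding bank template found")
    return matches[0]
```

Raising an exception for both failure cases keeps the two prompts described above distinguishable by message while forcing the caller to handle them before any field is extracted.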
Reading the required service content: after the template is determined, search the read text content using ZJYHDZHDYS_KSZD (start field) and ZJYHDZHDYS_ZZZD (end field) from the field preset table, as shown in FIG. 2. If the start field text occurs more than once in the text content, ZJYHDZHDYS_KSZDXH (start field sequence number) determines which occurrence marks the start position; the first matching end field after that position is then located, and the content between the two fields is the required service content.
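A minimal Python sketch of this start-field and end-field search follows. It assumes the start field sequence number is 1-based (1 means the first occurrence) and that surrounding whitespace is not part of the value; neither detail is stated in the source, and `extract_field` is a hypothetical name.

```python
from typing import Optional


def extract_field(text: str, start: str, end: str, start_seq: int = 1) -> Optional[str]:
    """Return the text between the start_seq-th occurrence of start and the next end.

    Returns None when the start field does not occur start_seq times or when
    no end field follows it.
    """
    pos = -1
    for _ in range(start_seq):
        # Step past the previous hit and look for the next occurrence.
        pos = text.find(start, pos + 1)
        if pos == -1:
            return None
    begin = pos + len(start)
    stop = text.find(end, begin)  # first end field after the chosen start
    if stop == -1:
        return None
    return text[begin:stop].strip()


sample = "Account: 111 Name: A Account: 222 Name: B"
payer = extract_field(sample, "Account:", "Name:", 1)  # "111"
payee = extract_field(sample, "Account:", "Name:", 2)  # "222"
```

The sequence number is what separates the two "Account:" labels in the sample string, which mirrors the repeated-start-field case described above.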
Inserting into the database business table: form an SQL statement from the service content read in the previous step and the corresponding ZJYHDZHDYS_ZDBH (field number), and insert the content into the business table.
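One way to form such a statement is sketched below in Python. It assumes that each field number names a column of the business table, which the source implies but does not state; the function name is hypothetical. Values are bound as parameters rather than concatenated into the SQL text, and identifiers are validated because they cannot be bound.

```python
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_insert(table: str, values: dict[str, str]) -> tuple[str, list[str]]:
    """Build a parameterized INSERT from {field number: extracted content}.

    Assumption: each field number is the name of a column in the business table.
    """
    if not values:
        raise ValueError("no extracted content to insert")
    for name in (table, *values):
        # Identifiers cannot be passed as bound parameters, so validate them.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe SQL identifier: {name!r}")
    columns = ", ".join(values)
    placeholders = ", ".join("?" for _ in values)
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    return sql, list(values.values())
```

The returned pair can be passed straight to a DB-API call such as `cursor.execute(sql, params)`; the `?` placeholder style shown here is the one used by sqlite3, and other drivers may expect a different marker.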
Automatically matching the fund settlement sheet and the business report bill: look up the fund settlement sheet and the business report bill using the business table data formed in the previous step.
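The source does not say which fields are compared during matching. Purely as an illustration, the Python sketch below pairs receipts with settlement sheets on a hypothetical key of account number and amount, and leaves ambiguous or missing matches for manual review; all names and the choice of key are assumptions.

```python
def match_receipts(receipts, settlements, keys=("account", "amount")):
    """Pair each receipt with the one settlement sheet sharing the same key values.

    receipts and settlements are lists of dicts. Receipts with no candidate,
    or with more than one, are returned separately for manual review.
    """
    index = {}
    for sheet in settlements:
        index.setdefault(tuple(sheet[k] for k in keys), []).append(sheet)

    matched, unmatched = [], []
    for receipt in receipts:
        candidates = index.get(tuple(receipt[k] for k in keys), [])
        if len(candidates) == 1:
            matched.append((receipt, candidates[0]))
        else:
            unmatched.append(receipt)
    return matched, unmatched
```

Indexing the settlement sheets once keeps the lookup roughly linear in the number of receipts, which matters at the month-end volumes of hundreds or thousands of receipts mentioned in the background.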
The above is only one embodiment of the present invention, and is not intended to limit the scope of protection. All equivalents made by using the contents of the specification and the attached drawings of the present invention fall within the protection scope of the present invention.
Claims (9)
1. A method for automatically identifying PDF electronic receipt information is characterized by comprising the following steps:
S1: receiving a PDF electronic receipt task sent by a user;
S2: determining the corresponding bank template;
S3: reading the required service content;
S4: inserting the service content into a business table of the database;
S5: automatically matching the fund settlement sheet and the business report bill.
2. The method for automatically identifying PDF electronic receipt information according to claim 1, further comprising, before step S1, the step of:
S0: presetting the bank templates.
3. The method for automatically identifying PDF electronic receipt information according to claim 2, wherein step S0 comprises:
S101: reading the PDF electronic receipt text information of each bank;
S102: establishing a template preset table according to the PDF electronic receipt text information of each bank;
S103: establishing a field preset table according to the PDF electronic receipt text information of each bank;
S104: analyzing the text data and presetting the preceding-field and following-field data.
4. The method for automatically identifying PDF electronic receipt information according to claim 3, wherein the template preset table in step S102 includes the following fields, all of data type VARCHAR: inner code, bank number, bank name, template number and template name.
5. The method for automatically identifying PDF electronic receipt information according to claim 3, wherein the field preset table in step S103 includes the following fields, all of data type VARCHAR: inner code, field name, field number, start field, end field and start field sequence number.
6. The method for automatically identifying the PDF electronic receipt information according to claim 5, wherein step S2 comprises:
looping over the rows of the template preset table; for each template, obtaining the field name data of the corresponding field preset table and searching the read text content for each name in turn, until a unique template is matched; if multiple templates match, prompting that multiple bank templates were found and that the template configuration should be checked; and if no template matches, prompting that no corresponding bank template can be found.
7. The method for automatically identifying the PDF electronic receipt information according to claim 5, wherein step S3 comprises:
after the template is determined, searching the read text content using the start field and the end field in the field preset table; if the start field text occurs more than once in the text content, using the start field sequence number to determine which occurrence marks the start position; then locating the first matching end field after that position, the content between the two fields being the required service content.
8. An apparatus for automatically identifying PDF electronic receipt information, comprising:
a memory for storing a computer program;
a processor for executing said computer program to implement the method of automatically identifying PDF electronic receipt information according to any of the preceding claims 1 to 7.
9. A computer storage medium storing a computer program, wherein the computer program, when executed, causes the apparatus in which the computer storage medium is located to perform the method for automatically identifying PDF electronic receipt information according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010164140.8A CN111401002A (en) | 2020-03-11 | 2020-03-11 | Method, device and computer storage medium for automatically identifying PDF electronic receipt information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111401002A true CN111401002A (en) | 2020-07-10 |
Family
ID=71430765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010164140.8A Pending CN111401002A (en) | 2020-03-11 | 2020-03-11 | Method, device and computer storage medium for automatically identifying PDF electronic receipt information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111401002A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6131092A (en) * | 1992-08-07 | 2000-10-10 | Masand; Brij | System and method for identifying matches of query patterns to document text in a document textbase |
US8392472B1 (en) * | 2009-11-05 | 2013-03-05 | Adobe Systems Incorporated | Auto-classification of PDF forms by dynamically defining a taxonomy and vocabulary from PDF form fields |
CN108960223A (en) * | 2018-05-18 | 2018-12-07 | 北京大账房网络科技股份有限公司 | The method for automatically generating voucher based on bill intelligent recognition |
CN109271410A (en) * | 2018-08-31 | 2019-01-25 | 平安科技(深圳)有限公司 | Extracting method, device and the computer readable storage medium of bank receipt |
CN109685477A (en) * | 2018-12-28 | 2019-04-26 | 北京爱康鼎科技有限公司 | Accounting process systems and processing method |
CN110390000A (en) * | 2019-07-30 | 2019-10-29 | 同方赛威讯信息技术有限公司 | A kind of legal documents automatic identification generates system and method |
CN110727703A (en) * | 2019-09-23 | 2020-01-24 | 苏宁云计算有限公司 | Method and device for automatically identifying comments in JSON (Java Server object notation) code |
CN110826991A (en) * | 2019-10-30 | 2020-02-21 | 中国电信集团工会上海市委员会 | Electronic receipt processing system and method |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112465618A (en) * | 2020-12-22 | 2021-03-09 | 航天信息股份有限公司企业服务分公司 | Universal importing method and system for bank statement |
CN113065936A (en) * | 2021-03-03 | 2021-07-02 | 浙江工贸职业技术学院 | Financial cloud network reimbursement system and equipment |
CN113065936B (en) * | 2021-03-03 | 2022-06-07 | 浙江工贸职业技术学院 | Financial cloud network reimbursement system and equipment |
CN113741995A (en) * | 2021-08-09 | 2021-12-03 | 太逗科技集团有限公司 | Method, device, equipment and medium for automatically confirming receipt by bypassing bank control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200710 |