CN111401002A - Method, device and computer storage medium for automatically identifying PDF electronic receipt information - Google Patents

Publication number
CN111401002A
Authority
CN
China
Prior art keywords
field
template
electronic receipt
bank
pdf electronic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010164140.8A
Other languages
Chinese (zh)
Inventor
秦涛
王士勇
钟如玉
李海彬
司慧杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Inspur Genersoft Information Technology Co Ltd
Original Assignee
Shandong Inspur Genersoft Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Inspur Genersoft Information Technology Co Ltd
Priority to CN202010164140.8A
Publication of CN111401002A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/02 Banking, e.g. interest calculation or account maintenance

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Technology Law (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

At present, many group companies match hundreds or even thousands of bank receipts to fund settlement sheets and business statement bills by hand at the end of each month; this work urgently needs to become more efficient and less costly. To address this problem, the present invention provides a method for automatically identifying PDF electronic receipt information, comprising: presetting bank templates in a database; receiving a PDF electronic receipt task sent by a user; determining the corresponding bank template; reading the required business content; inserting the required business content into a database business table; and automatically matching the fund settlement sheet and the business statement bill. Using templates preset per bank, a text-based bank electronic PDF receipt file is recognized as formatted data and parsed, and the fund settlement sheet and the business statement bill are then automatically associated in turn. This removes the pain point that manual reconciliation by a cashier holding paper bank receipts involves a heavy workload, takes a long time, and is inefficient.

Description

Method, device and computer storage medium for automatically identifying PDF electronic receipt information
Technical Field
The invention relates to the technical field of computers, in particular to a method, a device and a storage medium for automatically identifying PDF electronic receipt information.
Background
The bank receipt is the original basis on which an enterprise prepares its bookkeeping vouchers; for every receipt or payment, the enterprise holds a corresponding bank receipt as evidence. A receipt mainly contains details such as the date, serial number, account number, currency, and amount, and every account entry has a receipt. Fund management at a group company therefore involves processing a large number of receipts.
At present, group companies exercise ever tighter control over the receipts and payments of their subordinate enterprises. At the end of each month, hundreds or even thousands of bank receipts are matched by hand to fund settlement sheets and business statement bills. This simple, repetitive labor is very time-consuming and a pain point for cashiers, so there is an urgent need to improve work efficiency and reduce cost.
Disclosure of Invention
To address the above problems, the present invention provides a method for automatically identifying PDF electronic receipt information, which aims to obtain the required text content accurately and to end the current practice of cashiers reconciling accounts manually with paper bank receipts.
There are currently many ways to read a text-based PDF document; for example, iTextSharp and PDFBox can read its content out as a character string. However, the format of electronic PDF receipts is not uniform between banks, and even within a single bank the formats differ, so the order in which the text is read out varies and the required text content cannot be accurately identified and obtained in one fixed way.
The extracted character string therefore has to be parsed automatically according to certain logic rules. By presetting templates and, for each field, the field that precedes it and the field that follows it, the required text content can be obtained more accurately.
In order to achieve the above object, the present invention provides a method for automatically identifying PDF electronic receipt information, comprising:
S1: receiving a PDF electronic receipt task sent by a user;
S2: determining the corresponding bank template;
S3: reading the required business content;
S4: inserting the business content into a database business table;
S5: automatically matching the fund settlement sheet and the business statement bill.
Preferably, step S1 is preceded by the following step:
S0: presetting the bank template.
Further, step S0 includes:
S101: reading the PDF electronic receipt text information of each bank;
S102: establishing a template preset table according to the PDF electronic receipt text information of each bank;
S103: establishing a field preset table according to the PDF electronic receipt text information of each bank;
S104: analyzing the text data and presetting the preceding-field and following-field data.
Preferably, the template preset table in step S102 includes the following fields, all of data type VARCHAR: inner code, bank number, bank name, template number, and template name.
Preferably, the field preset table in step S103 includes the following fields, all of data type VARCHAR: inner code, field name, field number, start field, end field, and start field sequence number.
Further, step S2 includes:
traversing the data of the template preset table in a loop, obtaining for each template the field name data of the corresponding field preset table, and searching the read text content for them one by one until a unique template is matched. If multiple templates are found, the user is prompted that multiple bank templates were found and that the template configuration should be checked; if no matching template is found, the user is prompted that no corresponding bank template could be found.
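The template lookup in step S2 can be illustrated with a minimal Python sketch. The `Template` class, the `find_template` function, and the matching rule (a template matches when every one of its preset field names occurs in the extracted text) are assumptions made for illustration; the source only states that the field names are searched one by one until a unique template is matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """One row of the template preset table plus its field names (hypothetical shape)."""
    template_no: str
    bank_name: str
    field_names: tuple


class TemplateMatchError(Exception):
    """Raised when zero or several templates match the receipt text."""


def find_template(text, templates):
    """Return the single template whose field names all occur in `text`.

    Assumption: "matching" means every preset field name is a substring of
    the extracted text; a template without field names never matches.
    """
    matches = [
        t for t in templates
        if t.field_names and all(name in text for name in t.field_names)
    ]
    if not matches:
        raise TemplateMatchError("No corresponding bank template was found.")
    if len(matches) > 1:
        raise TemplateMatchError(
            "Multiple bank templates were found; check the template configuration."
        )
    return matches[0]
```

Raising an error in both failure cases mirrors the two prompts described above and keeps an ambiguous configuration from silently selecting the wrong template.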
Further, step S3 includes:
after the template is determined, searching the read text content according to the start field and the end field in the field preset table. If the start field occurs more than once in the text content, the start position is determined by the start field sequence number; the first matching end field after that position is then located, and the content between the two fields is the required business content.
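The extraction rule in step S3 can be sketched in Python as follows. The function name `extract_field` and its parameters are hypothetical. The sketch assumes the start field sequence number is 1-based and has already been converted from its stored VARCHAR form to an integer.

```python
def extract_field(text, start_field, end_field, start_occurrence=1):
    """Return the text between the n-th `start_field` and the next `end_field`.

    `start_occurrence` plays the role of the start field sequence number
    (assumed to be 1-based). Returns None when either delimiter is missing.
    """
    if start_occurrence < 1:
        raise ValueError("start_occurrence must be 1 or greater")

    # Walk forward to the requested occurrence of the start field.
    pos = -1
    for _ in range(start_occurrence):
        pos = text.find(start_field, pos + 1)
        if pos == -1:
            return None

    # The value begins right after the start field and ends at the first
    # end field found after that point.
    value_start = pos + len(start_field)
    value_end = text.find(end_field, value_start)
    if value_end == -1:
        return None
    return text[value_start:value_end].strip()
```

For example, with the text `"Account: 111 Name: A Account: 222 Name: B"`, the call `extract_field(text, "Account:", "Name:", 2)` skips the first `Account:` label and reads the value that follows the second one.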
The invention also provides a device for automatically identifying the PDF electronic receipt information, which comprises:
a memory for storing a computer program;
a processor for executing the computer program to implement any one of the above methods for automatically identifying PDF electronic receipt information.
The invention also provides a computer storage medium storing a computer program which, when executed, causes the device in which the computer storage medium is located to perform any one of the above methods for automatically identifying PDF electronic receipt information.
Using templates preset per bank, the invention reads a text-based bank electronic PDF receipt file into the system and recognizes it as formatted data. The formatted data is parsed according to the preset format, and the fund settlement sheet and the business statement bill are then automatically associated in turn, which removes the pain points of heavy workload, long processing time, and low efficiency when a cashier reconciles accounts manually with paper bank receipts.
In addition, the invention allows the template format to be defined flexibly per bank; where a single bank issues PDF receipts in different formats, a corresponding template can be defined for each. By presetting, for each field, the field that precedes it and the field that follows it, the required text content can be identified flexibly and obtained accurately, and the fund settlement sheet and the business statement bill are then matched automatically according to the obtained content.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of searching the read text content according to the start field and the end field in the field preset table;
FIG. 3 is a schematic diagram of a PDF electronic receipt of a certain bank in the embodiment.
Detailed Description
To better illustrate the method of the invention and make it easier to understand, an example is presented below. The example is for illustration only and should not be taken as limiting the scope of the present invention.
The invention provides a method for automatically identifying PDF electronic receipt information, which comprises the following steps.
For a PDF electronic receipt of a certain bank (FIG. 3), the text content read using the GetTextFromPage method of the PdfTextExtractor class in the iTextSharp assembly for C# is as follows:
[Figure BDA0002406807340000031: text content extracted from the receipt; image not reproduced]
The structure of the template preset table is designed as shown in Table 1:
No.  Field name       Field identifier     Data type
1    Inner code       ZJYHDZHDYSZB_NM      VARCHAR(40)
2    Bank number      ZJYHDZHDYSZB_YHBH    VARCHAR(100)
3    Bank name        ZJYHDZHDYSZB_YHMC    VARCHAR(100)
4    Template number  ZJYHDZHDYSZB_MBBH    VARCHAR(100)
5    Template name    ZJYHDZHDYSZB_MBMC    VARCHAR(100)
TABLE 1
The structure of the field preset table is designed as shown in Table 2:
No.  Field name                   Field identifier     Data type
1    Inner code                   ZJYHDZHDYS_NM        VARCHAR(40)
2    Field name                   ZJYHDZHDYS_ZDMC      VARCHAR(100)
3    Field number                 ZJYHDZHDYS_ZDBH      VARCHAR(40)
4    Start field                  ZJYHDZHDYS_KSZD      VARCHAR(100)
5    End field                    ZJYHDZHDYS_ZZZD      VARCHAR(100)
6    Start field sequence number  ZJYHDZHDYS_KSZDXH    VARCHAR(10)
TABLE 2
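The two table structures above could be created as in the following Python sketch, which uses the standard sqlite3 module purely for illustration. The table names `ZJYHDZHDYSZB` and `ZJYHDZHDYS` are assumptions inferred from the column-name prefixes, and the PRIMARY KEY constraints are likewise assumed; the source gives only the column identifiers and data types, and does not name the database product.

```python
import sqlite3

# Column identifiers and types follow Tables 1 and 2. Table names and
# PRIMARY KEY constraints are assumptions made for this sketch.
PRESET_SCHEMA = """
CREATE TABLE IF NOT EXISTS ZJYHDZHDYSZB (       -- template preset table
    ZJYHDZHDYSZB_NM   VARCHAR(40) PRIMARY KEY,  -- inner code
    ZJYHDZHDYSZB_YHBH VARCHAR(100),             -- bank number
    ZJYHDZHDYSZB_YHMC VARCHAR(100),             -- bank name
    ZJYHDZHDYSZB_MBBH VARCHAR(100),             -- template number
    ZJYHDZHDYSZB_MBMC VARCHAR(100)              -- template name
);
CREATE TABLE IF NOT EXISTS ZJYHDZHDYS (         -- field preset table
    ZJYHDZHDYS_NM     VARCHAR(40) PRIMARY KEY,  -- inner code
    ZJYHDZHDYS_ZDMC   VARCHAR(100),             -- field name
    ZJYHDZHDYS_ZDBH   VARCHAR(40),              -- field number
    ZJYHDZHDYS_KSZD   VARCHAR(100),             -- start field
    ZJYHDZHDYS_ZZZD   VARCHAR(100),             -- end field
    ZJYHDZHDYS_KSZDXH VARCHAR(10)               -- start field sequence number
);
"""


def create_preset_tables(conn):
    """Create both preset tables on an open sqlite3 connection."""
    conn.executescript(PRESET_SCHEMA)


def column_names(conn, table):
    """List the column names of `table` in declaration order."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
```

Because the start field sequence number is stored as VARCHAR(10), code that consumes it would need to convert it to an integer before counting occurrences.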
Analyzing the text order, the template, and the preceding and following fields of each business field yields the preset data. The template preset data is shown in Table 3:
[Figure BDA0002406807340000041: template preset data; image not reproduced]
TABLE 3
The field preset data is shown in Table 4:
[Figures BDA0002406807340000042 and BDA0002406807340000051: field preset data; images not reproduced]
TABLE 4
Other bank templates are preset in the database in the same way.
When a PDF electronic receipt task is received from the user, the program parses it according to the preset data in the following steps.
1. Determining the template: traverse the data of the template preset table in a loop, obtain the ZJYHDZHDYS_ZDMC (field name) data of the field preset table corresponding to each template, and search the read text content for them one by one until a unique template is matched. If multiple templates are found, prompt that multiple bank templates were found and that the template configuration should be checked; if no matching template is found, prompt that no corresponding bank template could be found.
2. Reading the required business content: after the template is determined, search the read text content according to ZJYHDZHDYS_KSZD (start field) and ZJYHDZHDYS_ZZZD (end field) in the field preset table, as shown in FIG. 2. If the start field occurs more than once in the text content, determine the start position according to ZJYHDZHDYS_KSZDXH (start field sequence number), then locate the first matching end field after that position; the content between the two fields is the required business content.
3. Inserting into the database business table: form an SQL statement from the business content read in step 2 and the corresponding ZJYHDZHDYS_ZDBH (field number), and insert the content into the business table.
4. Automatically matching the fund settlement sheet and the business statement bill: look up the fund settlement sheet and the business statement bill according to the business table data formed in step 3.
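The last two steps, inserting into the business table and looking up the settlement sheet, can be sketched in Python with the standard sqlite3 module. Everything named here is hypothetical: the source does not list the business table's columns, how a field number maps to a column, or the matching key. The sketch assumes that each field number resolves to a column name and that a receipt is matched to a settlement sheet by serial number.

```python
import sqlite3

# Hypothetical business-table columns; the source does not enumerate them.
ALLOWED_COLUMNS = {"SERIAL_NO", "ACCOUNT_NO", "AMOUNT"}


def build_insert(table, values):
    """Build a parameterized INSERT from {column: extracted value}.

    Column names are checked against a whitelist and values are bound as
    parameters, so extracted receipt text is never spliced into the SQL.
    `table` is expected to be a fixed name chosen by the caller.
    """
    unknown = set(values) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"Unknown columns: {sorted(unknown)}")
    columns = sorted(values)
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    return sql, [values[c] for c in columns]


def match_settlement(conn, serial_no):
    """Return the id of the settlement sheet with this serial number, or None.

    SETTLEMENT_SHEET and its columns are assumed names for illustration.
    """
    row = conn.execute(
        "SELECT SHEET_ID FROM SETTLEMENT_SHEET WHERE SERIAL_NO = ?",
        (serial_no,),
    ).fetchone()
    return row[0] if row else None
```

Binding the extracted values as parameters is a design choice of this sketch rather than something the source specifies; it avoids malformed statements when receipt text contains quotes.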
The above is only one embodiment of the present invention, and is not intended to limit the scope of protection. All equivalents made by using the contents of the specification and the attached drawings of the present invention fall within the protection scope of the present invention.

Claims (9)

1. A method for automatically identifying PDF electronic receipt information is characterized by comprising the following steps:
S1: receiving a PDF electronic receipt task sent by a user;
S2: determining the corresponding bank template;
S3: reading the required business content;
S4: inserting the business content into a database business table;
S5: automatically matching the fund settlement sheet and the business statement bill.
2. The method for automatically identifying PDF electronic receipt information according to claim 1, further comprising, before step S1, the step of:
S0: presetting the bank template.
3. The method for automatically identifying PDF electronic receipt information according to claim 2, wherein step S0 comprises:
S101: reading the PDF electronic receipt text information of each bank;
S102: establishing a template preset table according to the PDF electronic receipt text information of each bank;
S103: establishing a field preset table according to the PDF electronic receipt text information of each bank;
S104: analyzing the text data and presetting the preceding-field and following-field data.
4. The method for automatically identifying PDF electronic receipt information according to claim 3, wherein the template preset table in step S102 includes the following fields, all of data type VARCHAR: inner code, bank number, bank name, template number, and template name.
5. The method for automatically identifying PDF electronic receipt information according to claim 3, wherein the field preset table in step S103 includes the following fields, all of data type VARCHAR: inner code, field name, field number, start field, end field, and start field sequence number.
6. The method for automatically identifying PDF electronic receipt information according to claim 5, wherein step S2 comprises:
traversing the data of the template preset table in a loop, obtaining for each template the field name data of the corresponding field preset table, and searching the read text content for them one by one until a unique template is matched; if multiple templates are found, prompting that multiple bank templates were found and that the template configuration should be checked; and if no matching template is found, prompting that no corresponding bank template could be found.
7. The method for automatically identifying PDF electronic receipt information according to claim 5, wherein step S3 comprises:
after the template is determined, searching the read text content according to the start field and the end field in the field preset table; if the start field occurs more than once in the text content, determining the start position according to the start field sequence number; and then locating the first matching end field after that position, wherein the content between the two fields is the required business content.
8. A device for automatically identifying PDF electronic receipt information, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the method for automatically identifying PDF electronic receipt information according to any one of claims 1 to 7.
9. A computer storage medium storing a computer program, wherein the computer program, when executed, causes the device in which the computer storage medium is located to perform the method for automatically identifying PDF electronic receipt information according to any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010164140.8A CN111401002A (en) 2020-03-11 2020-03-11 Method, device and computer storage medium for automatically identifying PDF electronic receipt information

Publications (1)

Publication Number Publication Date
CN111401002A (en) 2020-07-10

Family

ID=71430765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010164140.8A Pending CN111401002A (en) 2020-03-11 2020-03-11 Method, device and computer storage medium for automatically identifying PDF electronic receipt information

Country Status (1)

Country Link
CN (1) CN111401002A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6131092A (en) * 1992-08-07 2000-10-10 Masand; Brij System and method for identifying matches of query patterns to document text in a document textbase
US8392472B1 (en) * 2009-11-05 2013-03-05 Adobe Systems Incorporated Auto-classification of PDF forms by dynamically defining a taxonomy and vocabulary from PDF form fields
CN108960223A (en) * 2018-05-18 2018-12-07 北京大账房网络科技股份有限公司 The method for automatically generating voucher based on bill intelligent recognition
CN109271410A (en) * 2018-08-31 2019-01-25 平安科技(深圳)有限公司 Extracting method, device and the computer readable storage medium of bank receipt
CN109685477A (en) * 2018-12-28 2019-04-26 北京爱康鼎科技有限公司 Accounting process systems and processing method
CN110390000A (en) * 2019-07-30 2019-10-29 同方赛威讯信息技术有限公司 A kind of legal documents automatic identification generates system and method
CN110727703A (en) * 2019-09-23 2020-01-24 苏宁云计算有限公司 Method and device for automatically identifying comments in JSON (Java Server object notation) code
CN110826991A (en) * 2019-10-30 2020-02-21 中国电信集团工会上海市委员会 Electronic receipt processing system and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112465618A (en) * 2020-12-22 2021-03-09 航天信息股份有限公司企业服务分公司 Universal importing method and system for bank statement
CN113065936A (en) * 2021-03-03 2021-07-02 浙江工贸职业技术学院 Financial cloud network reimbursement system and equipment
CN113065936B (en) * 2021-03-03 2022-06-07 浙江工贸职业技术学院 Financial cloud network reimbursement system and equipment
CN113741995A (en) * 2021-08-09 2021-12-03 太逗科技集团有限公司 Method, device, equipment and medium for automatically confirming receipt by bypassing bank control

Similar Documents

Publication Publication Date Title
US10614527B2 (en) System and method for automatic generation of reports based on electronic documents
CN111401002A (en) Method, device and computer storage medium for automatically identifying PDF electronic receipt information
US11062132B2 (en) System and method for identification of missing data elements in electronic documents
US20040193520A1 (en) Automated understanding and decomposition of table-structured electronic documents
CN105243117B (en) A kind of data processing system and method
CN111178836A (en) Batch archiving method, device and equipment for electronic documents and storage medium
CN110599319B (en) Automatic auditing method, device, terminal and storage medium
CN111931780A (en) Intelligent management method and equipment for accounting documents
WO2021259080A1 (en) Bill information archiving method and apparatus, computer device, and storage medium
US11138372B2 (en) System and method for reporting based on electronic documents
CN110956166A (en) Bill marking method and device
CN111914729A (en) Voucher association method and device, computer equipment and storage medium
CN109002425B (en) Method for acquiring upstream and downstream relations of enterprise, terminal device and medium
CN112364645A (en) Method and equipment for automatically auditing ERP financial system business documents
CN112785404A (en) Invoice issuing management system
TWI716761B (en) Intelligent accounting system and identification method for accounting documents
CN111768565B (en) Method for identifying and post-processing invoice codes in value-added tax invoices
CN107832278A (en) A kind of method and device of real time scan taxation informatization data
CN111428497A (en) Method, device and equipment for automatically extracting financing information
CN109325045B (en) Method and device for opening bank
CN111400187A (en) Parameter dynamic verification system and method based on customized data source
US20190057456A1 (en) System and methods thereof for associating electronic documents to evidence
TWM575887U (en) Intelligent accounting system
US20170185832A1 (en) System and method for verifying extraction of multiple document images from an electronic document
CN113807901A (en) Electronic invoice detection method, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2020-07-10)