CN114255335A - Electronic invoice recognition method, system, electronic device and medium - Google Patents

Info

Publication number
CN114255335A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111391678.3A
Other languages
Chinese (zh)
Inventor
张帆
黄鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd

Abstract

The invention discloses an electronic invoice identification method, system, electronic device and medium. The method comprises the following steps: acquiring the invoice name of an electronic invoice; selecting a corresponding invoice identification template according to the invoice name; and determining and identifying the areas to be identified of the electronic invoice according to the invoice identification template. Invoice identification templates are prepared for electronic invoices of different formats, and each template defines its own areas to be identified. The invoice name is passed as a parameter to select the matching template, the areas to be identified are cropped from the electronic invoice according to that template, and the invoice information within each area is then identified. Dividing the invoice into areas reduces the interfering items in the electronic invoice data, so the required data can be extracted quickly and accurately, and the method adapts to different invoice formats, which improves both the efficiency and the accuracy of electronic invoice identification.

Description

Electronic invoice recognition method, system, electronic device and medium
Technical Field
The invention relates to the technical field of information identification, and in particular to an electronic invoice identification method, system, electronic device and medium.
Background
With the development of the information era, more and more merchants choose to issue electronic invoices. Compared with traditional paper invoices, electronic invoices are paperless, consume less energy and are easy to store. However, checking invoice information still requires manual intervention to identify the content of each electronic invoice, which means a heavy workload and low efficiency. The prior art provides an electronic invoice identification method in which OCR (optical character recognition) converts the entire electronic invoice into electronic data, from which the useful invoice information is then extracted. Because an electronic invoice contains many interfering items, the required invoice content cannot be obtained quickly and identification efficiency is very low. In addition, electronic invoice formats differ between provinces, for example in their use of spaces and line breaks, so the required invoice content cannot be obtained accurately, which lowers identification accuracy and can even make an electronic invoice impossible to identify.
Disclosure of Invention
The invention aims to overcome the defects of low identification efficiency and low identification accuracy rate in the prior art for identifying electronic invoices, and provides an electronic invoice identification method, an electronic invoice identification system, electronic equipment and a medium.
The invention solves the technical problems through the following technical scheme:
according to a first aspect of the present invention, there is provided an electronic invoice identification method, comprising the steps of:
acquiring the invoice name of the electronic invoice;
selecting a corresponding invoice identification template according to the invoice name;
and determining and identifying the area to be identified of the electronic invoice according to the invoice identification template.
Preferably, the step of obtaining the invoice name of the electronic invoice comprises:
identifying the file type of the electronic invoice, and determining an extraction method according to the file type;
extracting text data of the electronic invoice according to the extraction method;
and determining the invoice name according to the text data.
Preferably, the file types include PDF (Portable Document Format) files and OFD (Open Fixed-layout Document, a document format developed in China) files.
Preferably, the step of determining and identifying the region to be identified of the electronic invoice according to the invoice identification template includes:
intercepting the electronic invoice according to the invoice identification template to obtain the area to be identified;
identifying the area to be identified to obtain a keyword;
and acquiring invoice information according to the key words.
Preferably, the keyword includes at least one of an invoice code, an invoice number, an invoicing date, a name, a taxpayer identification number, an item name, an amount, a tax rate, and a remark;
the invoice information comprises at least one of invoice code information, invoice number information, invoicing date information, purchaser name information, purchaser taxpayer identification number information, project name information, amount information, tax rate information, seller name information, seller taxpayer identification number information and remark information.
Preferably, the step of obtaining the invoice information according to the keyword further includes checking the invoice information.
Preferably, the step of checking the invoice information comprises:
comparing the invoice information with invoice data in a preset invoice database, and if the invoice information is consistent with the invoice data in comparison, determining that the electronic invoice is valid;
and if the comparison is inconsistent, determining that the electronic invoice is invalid.
According to a second aspect of the present invention, there is provided an electronic invoice identification system, comprising an obtaining module, a selection module and an identification module:
the obtaining module is used for obtaining the invoice name of the electronic invoice;
the selection module is used for selecting a corresponding invoice identification template according to the invoice name;
the identification module is used for determining and identifying the area to be identified of the electronic invoice according to the invoice identification template.
According to a third aspect of the present invention, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the electronic invoice recognition method of the present invention when executing the computer program.
According to a fourth aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the electronic invoice recognition method of the present invention.
The positive progress effects of the invention are as follows:
The invention prepares invoice identification templates for electronic invoices of different formats, and each template defines its own areas to be identified. The invoice name is passed as a parameter to select the matching template, the areas to be identified are cropped from the electronic invoice according to that template, and the invoice information within each area is identified. Electronic invoices that have different invoice names but the same format can reuse one invoice identification template, which reduces development workload. Dividing the invoice into areas reduces the interfering items in the electronic invoice data, so the required data can be extracted quickly and accurately, and the method adapts to different invoice formats, which effectively improves the efficiency and accuracy of electronic invoice identification.
Drawings
Fig. 1 is a schematic flow chart of an electronic invoice recognition method according to embodiment 1 of the present invention.
Fig. 2 is a schematic diagram of an invoice identification template in the electronic invoice identification method according to embodiment 1 of the present invention.
Fig. 3 is a schematic diagram of a rectangular box in the invoice identification template according to embodiment 1 of the present invention.
Fig. 4 is a flowchart illustrating step 103 in the electronic invoice recognition method according to embodiment 1 of the present invention.
Fig. 5 is a schematic structural diagram of an electronic invoice recognition system according to embodiment 2 of the present invention.
Fig. 6 is a schematic structural diagram of an electronic device according to embodiment 3 of the present invention.
Detailed Description
The invention is further illustrated by the following examples, which are not intended to limit the scope of the invention.
Example 1
The embodiment provides an electronic invoice identification method, as shown in fig. 1, the electronic invoice identification method includes the following steps:
Step 101, obtaining an invoice name of an electronic invoice.
The invoice name is the full name of the electronic invoice and serves as its identification information, by which different invoices are classified. Referring to fig. 2, the invoice name shows that the electronic invoice is a Shenzhen value-added tax electronic general invoice.
In order to obtain the invoice name, the text data of the electronic invoice is extracted first, and the invoice name is then identified from that text data. In this embodiment the electronic invoice file is imported directly; as an optional implementation, both single-file import and batch import of multiple files are supported.
At present, the electronic invoice files issued by electronic invoice service platforms are mainly in PDF format or OFD format. As an optional implementation, different extraction methods can be selected according to the file type of the electronic invoice. If the electronic invoice is a PDF file, its text data can be extracted directly through OCR. As another optional implementation, if the electronic invoice is an OFD file, the file is first decompressed to obtain a file in XML (Extensible Markup Language) format, and the text data of the electronic invoice is then extracted directly from the XML file.
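The choice of extraction method by file type could be sketched as follows. This is a minimal illustration, not the patent's implementation: the magic-byte check and the function names `detect_file_type` and `extract_ofd_text` are assumptions, and the OFD branch simply collects the text of every XML element in the decompressed package.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def detect_file_type(data: bytes) -> str:
    """Guess the invoice file type from its leading bytes (assumed heuristic)."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"PK"):  # an OFD package is a ZIP archive of XML files
        return "ofd"
    return "unknown"


def extract_ofd_text(data: bytes) -> list:
    """Decompress an OFD package and collect the text of every XML element."""
    lines = []
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in sorted(archive.namelist()):
            if not name.lower().endswith(".xml"):
                continue
            root = ET.fromstring(archive.read(name))
            for element in root.iter():
                text = (element.text or "").strip()
                if text:
                    lines.append(text)
    return lines
```

A PDF file would be routed to an OCR or PDF text extraction step instead; that step is left out here because the text does not name a specific library.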
In this embodiment, the full name of the electronic invoice is the first line of the text data, so the invoice name can be obtained simply by reading the first line of the text data. As an optional implementation, it is not necessary to extract all the text data of the electronic invoice for this purpose; only at least one line of text data needs to be extracted.
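Reading the invoice name from the first line can be sketched in a few lines. The function name `get_invoice_name` is hypothetical, and skipping blank leading lines is an added assumption to tolerate extraction noise.

```python
def get_invoice_name(text_data: str) -> str:
    """Return the first non-empty line of the extracted text as the invoice name."""
    for line in text_data.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    raise ValueError("no text found in the electronic invoice")
```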
Step 102, selecting a corresponding invoice identification template according to the invoice name.
The invoice name serves as the identification information of the electronic invoice, and each invoice name corresponds to one invoice identification template. As an optional implementation, different invoice names can be assigned the same invoice identification template. For example, the value-added tax electronic general invoices of most provinces share the same invoice format, so although their invoice names differ, they can be identified with the same template. Some provinces also issue electronic invoices whose format differs from that of the common value-added tax electronic general invoice; these require a different invoice identification template.
In this embodiment, a standard electronic invoice file issued by an electronic invoice service platform is selected to make the invoice identification template. The areas to be identified on the electronic invoice are specified according to actual requirements, and a rectangular frame is drawn for each area with a Rectangle tool. As shown in fig. 3, the upper-left corner of the electronic invoice is taken as the coordinate origin, the horizontal sides of the rectangular frame are parallel to the horizontal sides of the electronic invoice, and the vertical sides of the rectangular frame are parallel to the vertical sides of the electronic invoice.
Each rectangular frame has four parameters: left, top, right and bottom. Here left is the horizontal coordinate of the upper-left corner of the rectangle, top is the vertical coordinate of the upper-left corner, right is the horizontal coordinate of the lower-right corner, and bottom is the vertical coordinate of the lower-right corner. The horizontal side of the frame therefore has length right - left and the vertical side has length bottom - top, so saving these four parameters saves both the position and the size of the frame.
In addition, each rectangular frame is marked with label information describing the nature of the area. The label information can be user-defined according to the invoice content of the area to be identified, and it is stored as well.
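One possible way to store such a frame is a small record holding the label and the four parameters. The class name `Region` and the sample values are hypothetical; the coordinate convention (origin at the upper-left corner) follows the description above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Region:
    """A labelled rectangular frame; the origin is the invoice's upper-left corner."""
    label: str
    left: float    # horizontal coordinate of the upper-left corner
    top: float     # vertical coordinate of the upper-left corner
    right: float   # horizontal coordinate of the lower-right corner
    bottom: float  # vertical coordinate of the lower-right corner

    @property
    def width(self) -> float:
        return self.right - self.left

    @property
    def height(self) -> float:
        return self.bottom - self.top


# Made-up coordinates for illustration only.
invoice_code_area = Region("invoice_code_area", left=400, top=20, right=580, bottom=80)
saved = asdict(invoice_code_area)  # a plain dict that can be written to a template file
```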
As an alternative embodiment, the rectangular frames are drawn in proportion to the size of the display area of the electronic invoice. The stored parameters of a rectangular frame are then not fixed values but four proportionality coefficients: the actual values of left and right are the length of the invoice multiplied by their respective coefficients, and the actual values of top and bottom are the width of the invoice multiplied by their respective coefficients.
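The conversion from proportionality coefficients to actual coordinates is a pair of multiplications per axis. A minimal sketch, assuming the coefficients are fractions between 0 and 1 and using the hypothetical name `scale_rectangle`:

```python
def scale_rectangle(coefficients, invoice_length, invoice_width):
    """Turn (left, top, right, bottom) coefficients into actual coordinates.

    invoice_length is the horizontal extent of the invoice and invoice_width
    the vertical extent, following the wording of the text above.
    """
    left, top, right, bottom = coefficients
    return (
        left * invoice_length,
        top * invoice_width,
        right * invoice_length,
        bottom * invoice_width,
    )
```

For example, coefficients (0.5, 0.25, 0.75, 0.5) on an invoice 600 units long and 400 units wide give the rectangle (300.0, 100.0, 450.0, 200.0), and the same template scales to any other page size.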
As an optional implementation, each drawn rectangular frame contains at least one complete piece of invoice information, and pieces of invoice information located close together can be placed in the same area, so that the integrity of the invoice information is preserved as far as possible. As shown in fig. 2, the electronic invoice is divided into seven areas, each marked with label information characterizing it, specifically:
area ① is the invoice code area, containing invoice code information, invoice number information, invoicing date information and the like;
area ② is the buyer area, containing buyer name information, buyer taxpayer identification number information and the like;
area ③ is the item name area, containing item name information;
area ④ is the tax rate area, containing tax rate information;
area ⑤ is the amount area, containing amount information;
area ⑥ is the seller area, containing seller name information, seller taxpayer identification number information and the like;
area ⑦ is the remark area, containing remark information.
Of course, the invoice identification template of this embodiment is not limited to this division; different electronic invoices can be divided differently according to actual requirements.
After the invoice identification template is determined, the correspondence between invoice names and invoice identification templates is stored. As an optional implementation, one invoice identification template may correspond to one or more invoice names. Once the invoice name of an electronic invoice has been identified, the corresponding invoice identification template is loaded.
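The stored correspondence can be pictured as a lookup table in which several invoice names point at one template. Everything in this sketch is hypothetical: the table name, the template identifiers, and the invoice names other than the Shenzhen one mentioned in the text.

```python
# Several invoice names may share one template identifier.
TEMPLATE_BY_INVOICE_NAME = {
    "Shenzhen VAT Electronic General Invoice": "standard_vat",
    "Shanghai VAT Electronic General Invoice": "standard_vat",
    "Example Special-Format Electronic Invoice": "special_format",
}


def select_template(invoice_name: str) -> str:
    """Return the template identifier registered for an invoice name."""
    try:
        return TEMPLATE_BY_INVOICE_NAME[invoice_name]
    except KeyError:
        raise LookupError(
            f"no invoice identification template registered for {invoice_name!r}"
        ) from None
```

Supporting a new invoice name that uses an existing format then only requires one more table entry rather than a new template.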
Step 103, determining and identifying the area to be identified of the electronic invoice according to the invoice identification template.
As shown in fig. 4, step 103 specifically includes the following steps:
Step 1031, intercepting the electronic invoice according to the invoice identification template to obtain the area to be identified.
Each rectangular frame of the invoice identification template corresponds to one area to be identified on the electronic invoice. In this embodiment, the length and height of the electronic invoice are obtained after it is imported, and the proportionality coefficients of each rectangular frame are obtained by loading the corresponding invoice identification template. Multiplying the length and height of the electronic invoice by the proportionality coefficients gives the concrete rectangle parameter values, that is, the coordinates of the lower-left and upper-right corners of the rectangular frame. The position and size of the resulting rectangular frame determine the position and size of the area to be identified on the electronic invoice.
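The text does not say how the interception itself is carried out. One plausible sketch, assuming the text extraction step yields each text fragment with the coordinates of its anchor point, is to keep only the fragments that fall inside the rectangle. The name `crop_text`, the (x, y, text) tuple layout and the (left, top, right, bottom) rectangle with an upper-left origin are all assumptions.

```python
def crop_text(fragments, rectangle):
    """Return the text fragments located inside a rectangle, in reading order.

    fragments: iterable of (x, y, text) tuples (assumed extractor output).
    rectangle: (left, top, right, bottom) in the same coordinate system.
    """
    left, top, right, bottom = rectangle
    inside = [
        (y, x, text)
        for x, y, text in fragments
        if left <= x <= right and top <= y <= bottom
    ]
    inside.sort()  # top-to-bottom, then left-to-right
    return [text for _, _, text in inside]
```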
Step 1032, identifying the area to be identified to obtain a keyword.
Keywords are the words that label invoice information, such as invoice code, invoice number, invoicing date, name, taxpayer identification number, item name, amount, tax rate and remark; this embodiment is not limited to these keywords.
Dividing the invoice into areas makes it easy to obtain the keywords from the text data of each area to be identified, and different areas generally contain different keywords. As an optional implementation, the keywords to be recognized in each area are preset; the area to be identified is located by its label information and is then searched for the preset keywords. For example, referring to fig. 2, the invoice code area is set as area ① and contains three keywords: invoice code, invoice number and invoicing date. When an electronic invoice is identified, it is first determined whether the area to be identified is area ①, and then whether the invoice code, invoice number and invoicing date are present in it; if they match, the keywords have been identified successfully. Some areas with different label information contain the same keywords. For example, in fig. 2 the buyer area and the seller area both contain the keywords name and taxpayer identification number. For the invoice information in such areas, the label information is used together with the keywords to tell the invoice information apart: after the areas are determined from their label information to be area ② and area ⑥, each is searched for the two keywords name and taxpayer identification number.
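The preset keywords per area can be sketched as a table keyed by the area's label. The labels, the function name `find_keywords` and the English keyword strings are illustrative stand-ins (a real invoice would carry Chinese keywords).

```python
# Preset keywords for each labelled area (illustrative values).
EXPECTED_KEYWORDS = {
    "invoice_code_area": ["Invoice code", "Invoice number", "Invoicing date"],
    "buyer_area": ["Name", "Taxpayer identification number"],
    "seller_area": ["Name", "Taxpayer identification number"],
}


def find_keywords(label: str, region_text: str) -> dict:
    """Report, for each keyword preset for this area, whether it occurs in the text."""
    return {
        keyword: keyword in region_text
        for keyword in EXPECTED_KEYWORDS.get(label, [])
    }
```

Because the buyer and seller areas share the same keywords, the label passed in is what tells a buyer name apart from a seller name.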
Step 1033, acquiring invoice information according to the keywords.
An electronic invoice contains many kinds of invoice information, each of which corresponds to a keyword. Once a keyword has been identified, the position of the invoice information can be determined from the exact position of the keyword in the invoice text data, and the invoice information corresponding to the keyword can be extracted. As an optional implementation, dividing the areas sensibly allows the positions of keywords and invoice information to be determined more quickly while ensuring that each area contains complete keywords and invoice information.
The invoice information includes, but is not limited to, invoice code information, invoice number information, invoice date information, purchaser name information, purchaser taxpayer identification number information, project name information, amount information, tax rate information, seller name information, seller taxpayer identification number information, and remark information.
Referring to fig. 2, take invoice code information as an example. The invoice code information is located in area ①. The text data of area ① is obtained first and its lines are traversed in order: the first line is the invoice code and the second line is the invoice number. If the keyword "invoice code" is identified, the whole line containing it is taken and the characters "invoice code:" are removed to obtain the invoice code information. The invoice number information is obtained in the same way. The third line is the invoicing date; after the same steps, the non-numeric characters are replaced with spaces, and the year, month and day are obtained in turn by splitting on the spaces to form the invoicing date information.
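The line-by-line extraction described above might look like the following sketch. The function names and the English keyword strings are hypothetical; the date handling follows the text by replacing non-digits with spaces and splitting.

```python
import re


def value_after_keyword(line: str, keyword: str):
    """Strip the keyword and its colon from a line, returning the remaining value."""
    if keyword not in line:
        return None
    value = line.split(keyword, 1)[1]
    return value.lstrip(":\uff1a ").strip()  # ASCII colon, full-width colon, space


def parse_invoicing_date(line: str, keyword: str = "Invoicing date"):
    """Return (year, month, day) by replacing non-digits with spaces and splitting."""
    value = value_after_keyword(line, keyword)
    if value is None:
        return None
    parts = re.sub(r"\D", " ", value).split()
    if len(parts) != 3:
        return None
    year, month, day = (int(part) for part in parts)
    return year, month, day
```

The same `value_after_keyword` call serves for both the invoice code and the invoice number, since only the keyword differs.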
As an optional implementation, some keywords in an area to be identified are not on the same line as their invoice information. For example, area ③ in fig. 2 is the item name area. Its lines are traversed in order; after the keyword "item name" is identified, the line following the one containing "item name" is located and all of its data is taken as the item name information. If that line holds several pieces of invoice information, it is split on spaces to obtain each piece in turn.
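For a keyword whose value sits on the following line, a minimal sketch (hypothetical function name and sample strings) is:

```python
def values_on_next_line(lines, keyword):
    """Find the line containing the keyword and split the following line on spaces."""
    for index, line in enumerate(lines[:-1]):
        if keyword in line:
            return lines[index + 1].split()
    return []
```

Splitting on whitespace assumes that individual values contain no spaces; an item name with embedded spaces would need a column-position rule instead.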
As an optional implementation, the data of some precisely divided areas can be taken directly as invoice information. For example, area ⑤ in fig. 2 is the amount area and contains only amount information, so the data of area ⑤ is taken directly as the amount information. This embodiment is not limited to the above ways of identifying invoice information by keyword: because the invoice is divided into areas, a different identification method can be set for each area and adjusted according to actual requirements.
As an alternative embodiment, referring to fig. 4, after step 1033, the method further includes:
Step 1034, checking the invoice information.
The method can connect directly to the invoice database preset in an enterprise management system to check whether the invoice information is correct: the invoice data is first queried from the preset invoice database according to the keywords, and the identified invoice information is then compared with the invoice data item by item. The invoice data includes, but is not limited to, invoice code data, invoice number data, invoicing date data, purchaser name data, purchaser taxpayer identification number data, item name data, amount data, tax rate data, seller name data, seller taxpayer identification number data and remark data.
In this embodiment, the invoice information is obtained area by area, so the invoice information of the individual areas to be identified can be checked at the same time, which greatly improves the efficiency of checking invoice information.
If, during checking, the invoice information of every area to be identified finds matching invoice data in the invoice database, the electronic invoice is confirmed to be valid and a result indicating that it is valid is output. As an optional implementation, after a message indicating that the invoice has been checked as correct is output, the next electronic invoice is loaded automatically.
If, during checking, the invoice information of an area to be identified cannot be identified, or the invoice information obtained from an area is inconsistent with the invoice data in the invoice database, the electronic invoice is invalid. As an optional implementation, a message indicating that the invoice check failed is output together with the specific invoice content that did not match the invoice database. For example, when no data corresponding to the buyer's taxpayer name can be found in the invoice database, the invoice is marked as an identification error, the taxpayer name shown on the invoice and the taxpayer name in the corresponding invoice database record are output, and manual follow-up processing is requested.
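The comparison against the invoice database can be sketched with an in-memory dictionary standing in for the database. The function name `check_invoice`, the field names, the choice of invoice code plus invoice number as lookup key, and the sample values are all assumptions made for illustration.

```python
def check_invoice(recognized, database, key_fields=("invoice_code", "invoice_number")):
    """Compare recognized invoice information with the stored invoice data.

    Returns (is_valid, mismatches), where mismatches maps each differing field
    to a (recognized_value, database_value) pair for manual follow-up.
    """
    key = tuple(recognized.get(field) for field in key_fields)
    record = database.get(key)
    if record is None:
        return False, {"record": (key, None)}
    mismatches = {
        field: (recognized.get(field), expected)
        for field, expected in record.items()
        if recognized.get(field) != expected
    }
    return not mismatches, mismatches


# Made-up data for illustration only.
database = {
    ("144032109110", "00000001"): {
        "invoice_code": "144032109110",
        "invoice_number": "00000001",
        "buyer_name": "ACME Ltd",
    }
}
recognized = {"invoice_code": "144032109110", "invoice_number": "00000001", "buyer_name": "ACME Limited"}
is_valid, mismatches = check_invoice(recognized, database)
```

Returning the mismatching fields alongside the verdict mirrors the text's suggestion of outputting both the value shown on the invoice and the value held in the database.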
In addition, electronic invoices with different invoice names but the same format can reuse one invoice identification template, which reduces development workload. Dividing the invoice into areas reduces the interfering items in the electronic invoice data, so the required data can be extracted quickly and accurately, and the method adapts to different invoice formats, which effectively improves the efficiency and accuracy of electronic invoice identification.
Example 2
The present embodiment provides an electronic invoice identification system, as shown in fig. 5, the electronic invoice identification system includes an obtaining module 21, a selecting module 22 and an identifying module 23.
The obtaining module 21 is used for obtaining the invoice name of the electronic invoice. To do so, the obtaining module 21 first extracts the text data of the electronic invoice and then identifies the invoice name from that text data.
As an alternative embodiment, the obtaining module 21 may select different extraction methods according to the file type of the electronic invoice. If the electronic invoice is a PDF file, its text data can be extracted directly through OCR. As another optional implementation, if the electronic invoice is an OFD file, the file is first decompressed to obtain a file in XML format, and the text data of the electronic invoice is then extracted directly from the XML file.
In this embodiment, the invoice name of the electronic invoice is the first line of the text data, so the obtaining module 21 can obtain the invoice name simply by reading the first line of the text data. As an optional embodiment, the obtaining module 21 need not extract all the text data of the electronic invoice for this purpose; only at least one line of text data needs to be extracted.
The selection module 22 is used for selecting the corresponding invoice identification template according to the invoice name. The invoice name serves as the identification information of the electronic invoice, and each invoice name corresponds to one invoice identification template. As an optional implementation, different invoice names can be assigned the same invoice identification template. For example, the value-added tax electronic general invoices of most provinces share the same invoice format, so although their invoice names differ, they can be identified with the same template. Some provinces also issue electronic invoices whose format differs from that of the common value-added tax electronic general invoice; these require a different invoice identification template.
In this embodiment, the selection module 22 selects a standard electronic invoice file issued by an electronic invoice service platform to make the invoice identification template. The selection module 22 specifies the areas to be identified on the electronic invoice according to actual requirements and draws a rectangular frame for each area with a Rectangle tool. As shown in fig. 3, the upper-left corner of the electronic invoice is taken as the coordinate origin, the horizontal sides of the rectangular frame are parallel to the horizontal sides of the electronic invoice, and the vertical sides of the rectangular frame are parallel to the vertical sides of the electronic invoice.
Each rectangular frame has four parameters: left, top, right and bottom. Here left is the horizontal coordinate of the upper-left corner of the rectangle, top is the vertical coordinate of the upper-left corner, right is the horizontal coordinate of the lower-right corner, and bottom is the vertical coordinate of the lower-right corner. The horizontal side of the frame therefore has length right - left and the vertical side has length bottom - top, and the selection module 22 saves the position and size of the frame by saving these four parameters.
In addition, each rectangular frame is marked with label information describing the nature of the area. The label information can be user-defined according to the invoice content of the area to be identified, and it is stored as well.
As an alternative embodiment, the selection module 22 draws the rectangular frames in proportion to the size of the display area of the electronic invoice. The stored parameters of a rectangular frame are then not fixed values but four proportionality coefficients: the actual values of left and right are the length of the invoice multiplied by their respective coefficients, and the actual values of top and bottom are the width of the invoice multiplied by their respective coefficients.
As an optional embodiment, each rectangular frame drawn by the selection module 22 contains at least one complete piece of invoice information, and pieces of invoice information located close together can be placed in the same area, so that the integrity of the invoice information is preserved as far as possible. As shown in fig. 2, the electronic invoice is divided into seven areas, each marked with label information characterizing it, specifically:
area ① is the invoice code area, containing invoice code information, invoice number information, invoicing date information and the like;
area ② is the buyer area, containing buyer name information, buyer taxpayer identification number information and the like;
area ③ is the item name area, containing item name information;
area ④ is the tax rate area, containing tax rate information;
area ⑤ is the amount area, containing amount information;
area ⑥ is the seller area, containing seller name information, seller taxpayer identification number information and the like;
area ⑦ is the remark area, containing remark information.
Of course, the invoice identification template of this embodiment is not limited to this division; different electronic invoices can be divided differently according to actual requirements.
After the invoice identification templates are determined, the selection module 22 stores the correspondence between invoice names and invoice identification templates. As an alternative embodiment, one invoice identification template may correspond to one or more invoice names. After the acquisition module 21 identifies the invoice name of the electronic invoice, the selection module 22 loads the corresponding invoice identification template.
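The name-to-template correspondence described above can be sketched as a simple lookup table. This is only an illustrative sketch, not the patented implementation: the class names `InvoiceTemplate` and `TemplateRegistry`, the template identifier, and the sample invoice names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class InvoiceTemplate:
    """Hypothetical template: a label -> proportionality-coefficient mapping."""
    template_id: str
    # label -> (left, bottom, right, top) coefficients in the range 0..1
    regions: dict = field(default_factory=dict)


class TemplateRegistry:
    """Stores which template belongs to which invoice name (many names -> one template)."""

    def __init__(self):
        self._by_name = {}

    def register(self, template, invoice_names):
        # One template may serve one or more invoice names.
        for name in invoice_names:
            self._by_name[name] = template

    def select(self, invoice_name):
        # Load the template that corresponds to the recognized invoice name.
        try:
            return self._by_name[invoice_name]
        except KeyError:
            raise LookupError(f"no template registered for {invoice_name!r}") from None


# Hypothetical invoice names, used only for illustration.
registry = TemplateRegistry()
vat_template = InvoiceTemplate("vat-v1", {"invoice_code_area": (0.6, 0.85, 1.0, 1.0)})
registry.register(vat_template, ["VAT electronic general invoice",
                                 "VAT electronic special invoice"])
```

Keying the table by invoice name keeps the selection step a constant-time lookup, and an unknown name fails loudly instead of silently falling back to a wrong template.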
The identification module 23 is used for determining and identifying the area to be identified of the electronic invoice according to the invoice identification template. Each rectangular frame of the invoice identification template corresponds to an area to be identified of the electronic invoice. In this embodiment, after the electronic invoice is imported, the acquisition module 21 acquires the length and the height of the electronic invoice. The identification module 23 loads the corresponding invoice identification template to obtain the proportionality coefficients of each rectangular frame, multiplies them by the length and the height of the electronic invoice to calculate the specific rectangle parameter values, and thereby obtains the coordinates of the lower left and upper right corners of each rectangular frame. The acquisition module 21 then determines the position and size of each area to be identified from the position and size of the corresponding rectangular frame.
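As a minimal sketch of this calculation, the function below scales four stored coefficients by the page dimensions to get the lower-left and upper-right corners. The names `Ratios`, `Rect` and `to_rect` are hypothetical, and the coordinate origin is assumed to be the lower-left corner of the page with y increasing upward (as in PDF user space); the source does not state the convention.

```python
from typing import NamedTuple


class Ratios(NamedTuple):
    """Stored proportionality coefficients of one rectangular frame (each 0..1)."""
    left: float
    bottom: float
    right: float
    top: float


class Rect(NamedTuple):
    """Concrete frame: lower-left corner (x0, y0) and upper-right corner (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def to_rect(ratios: Ratios, page_length: float, page_height: float) -> Rect:
    """Scale the coefficients by the invoice length and height."""
    if not (0 <= ratios.left < ratios.right <= 1 and 0 <= ratios.bottom < ratios.top <= 1):
        raise ValueError(f"invalid coefficients: {ratios}")
    return Rect(
        ratios.left * page_length,    # left and right scale with the length
        ratios.bottom * page_height,  # bottom and top scale with the height
        ratios.right * page_length,
        ratios.top * page_height,
    )


# Example: a frame covering the right half and the middle band of an 800 x 400 page.
frame = to_rect(Ratios(0.5, 0.25, 1.0, 0.75), 800, 400)
```

Because only ratios are stored, the same template applies unchanged to invoices rendered at different sizes.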
After the areas to be identified are obtained, as an optional implementation manner, the identification module 23 extracts the text data in each area to be identified separately. Because the invoice is divided into areas, the identification module 23 can easily obtain keywords such as invoice code, invoice number, invoicing date, name, taxpayer identification number, item name, amount, tax rate and remarks from the text data of the areas to be identified; this embodiment is not limited to these keywords. Different areas to be identified contain different keywords. As an optional implementation manner, the keywords to be identified in each area are preset, and the identification module 23 locates an area to be identified by its label information and queries whether the preset keywords are present in it.
The electronic invoice contains various pieces of invoice information, and each keyword corresponds to a piece of invoice information. If the identification module 23 successfully identifies a keyword, the position of the invoice information can be determined from the exact position of the keyword in the invoice text data, and the invoice information corresponding to the keyword can then be easily extracted. As an optional implementation manner, provided that the areas are divided reasonably so that the keywords and the invoice information within each area remain complete, the identification module 23 can determine the positions of the keywords and the invoice information more quickly.
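The per-area keyword query can be illustrated with a small sketch. The table `PRESET_KEYWORDS`, the area labels, and the English keyword strings are hypothetical stand-ins for whatever labels and keywords a real template would preset; the function simply reports the line on which each preset keyword first appears.

```python
# Hypothetical preset keywords per area label (stand-ins for the real invoice wording).
PRESET_KEYWORDS = {
    "invoice_code_area": ["Invoice code", "Invoice number", "Invoicing date"],
    "buyer_area": ["Name", "Taxpayer identification number"],
    "item_name_area": ["Item name"],
}


def locate_keywords(label, region_text):
    """Return {keyword: index of the first line containing it, or None if absent}."""
    lines = region_text.splitlines()
    found = {}
    for keyword in PRESET_KEYWORDS.get(label, ()):
        found[keyword] = next(
            (i for i, line in enumerate(lines) if keyword in line), None
        )
    return found


sample = "Invoice code: 031001900111\nInvoice number: 12345678"
positions = locate_keywords("invoice_code_area", sample)
```

Restricting the search to the keywords preset for one area keeps a keyword such as "Name" from matching text in an unrelated part of the invoice.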
The invoice information includes, but is not limited to, invoice code information, invoice number information, invoicing date information, purchaser name information, purchaser taxpayer identification number information, item name information, amount information, tax rate information, seller name information, seller taxpayer identification number information, and remark information.
Referring to fig. 2 and taking the invoice code information as an example: the invoice code information is located in area ①. The text data of area ① is obtained first, and the identification module 23 then traverses its lines in order; the first line is the invoice code, the second line is the invoice number, and the third line is the invoicing date. If the keyword "invoice code" is successfully identified, the identification module 23 intercepts all data of the line containing "invoice code" and removes the characters "invoice code:" to obtain the invoice code information. The invoice number information is obtained in the same way. For the invoicing date on the third line, after the above steps the identification module 23 further replaces the non-numeric characters with spaces and then takes the numeric values in order, split on the spaces, as the invoicing date information.
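The same-line extraction and the date handling can be sketched as follows. The helper names `value_after_keyword` and `parse_date_parts`, the English keywords, and the sample values are hypothetical; the sketch assumes the keyword is followed by an ASCII or full-width colon.

```python
import re


def value_after_keyword(lines, keyword):
    """Find the line containing the keyword and strip the 'keyword:' prefix."""
    for line in lines:
        if keyword in line:
            # Keep only what follows the keyword, then drop the colon and padding.
            _, _, rest = line.partition(keyword)
            return rest.lstrip(":： ").strip()
    return None


def parse_date_parts(text):
    """Replace every non-digit with a space, then split on whitespace."""
    return re.sub(r"\D", " ", text).split()


area_1 = [
    "Invoice code: 031001900111",
    "Invoice number: 12345678",
    "Invoicing date: 2021年11月23日",
]
code = value_after_keyword(area_1, "Invoice code")
number = value_after_keyword(area_1, "Invoice number")
date_parts = parse_date_parts(value_after_keyword(area_1, "Invoicing date"))
```

Splitting the date on whatever non-digit separators appear means the same routine handles "2021-11-23" and "2021年11月23日" without a format-specific parser.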
As an optional implementation manner, some keywords in an area to be identified are not on the same line as their invoice information. For example, area ③ in fig. 2 is the item name area. The lines are traversed in order; after the keyword "item name" is identified, the identification module 23 finds the line following the line that contains "item name" and intercepts all data of that next line as the item name information. If the next line contains multiple pieces of invoice information, the identification module 23 splits its data on spaces to obtain the corresponding pieces of invoice information in order.
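A minimal sketch of this next-line case is shown below. The function name `values_on_next_line` and the sample text are hypothetical, and the sketch assumes that individual values contain no internal spaces, since splitting on spaces would otherwise break a value apart.

```python
def values_on_next_line(lines, keyword):
    """Return the space-separated values on the line after the keyword line."""
    for i, line in enumerate(lines):
        if keyword in line:
            if i + 1 >= len(lines):
                return []  # keyword found on the last line: nothing follows it
            return lines[i + 1].split()
    return []  # keyword not present in this area


area_3 = ["Item name Quantity", "*Service*Lodging 2"]
item_values = values_on_next_line(area_3, "Item name")
```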
As an optional implementation manner, for some precisely divided areas to be identified, the data can be taken directly as the invoice information. For example, area ⑤ in fig. 2 is the amount area and contains only the amount information, so the identification module 23 directly takes the data of area ⑤ as the amount information. Of course, this embodiment is not limited to the above ways of identifying invoice information by keywords; by dividing the invoice into areas, this embodiment can set a different identification method for each area, and these can be adjusted according to actual requirements.
As shown in fig. 5, the electronic invoice identification system further includes a checking module 24. After the invoice information of the electronic invoice is identified, the checking module 24 is configured to check whether the invoice information is correct. As an optional implementation manner, the checking module 24 may connect directly to invoice databases preset in enterprise management systems to check the invoice information: it first queries the invoice data from the preset invoice database according to the keywords, and then compares the identified invoice information with the invoice data item by item. The invoice data includes, but is not limited to, invoice code data, invoice number data, invoicing date data, purchaser name data, purchaser taxpayer identification number data, item name data, amount data, tax rate data, seller name data, seller taxpayer identification number data and remark data.
In this embodiment, the identification module 23 obtains the invoice information of each area by dividing the invoice into areas to be identified, and the checking module 24 can check the invoice information of all areas to be identified simultaneously and separately, which greatly improves the efficiency of checking the invoice information.
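One way to check the areas simultaneously is to hand each area to a worker in a thread pool. The source only says the areas are checked at the same time, so the use of `ThreadPoolExecutor` here is an assumption, and `check_regions` and `check_one` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def check_regions(region_info, check_one):
    """Run check_one on every area's invoice information concurrently.

    region_info maps an area label to the information recognized in that area;
    check_one is any callable returning True when that information checks out.
    """
    with ThreadPoolExecutor() as pool:
        # map() preserves input order, so results line up with the labels.
        results = pool.map(check_one, region_info.values())
        return dict(zip(region_info.keys(), results))


# Toy check: an area passes when its recognized value is non-empty.
outcome = check_regions(
    {"invoice_code_area": "031001900111", "remark_area": ""},
    lambda value: bool(value),
)
```

Threads suit this step when each check is dominated by a database round trip; for CPU-bound checks a process pool would be the more usual choice.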
If, during checking, corresponding invoice data can be found in the invoice database for all invoice information in the areas to be identified, the electronic invoice is confirmed to be valid and the checking module 24 outputs a check result indicating that the electronic invoice is valid. As an optional implementation manner, after the checking module 24 outputs the words "invoice check correct", the acquisition module 21 automatically loads the next electronic invoice.
If, during checking, the invoice information of an area to be identified cannot be identified, or the invoice information acquired from an area to be identified is inconsistent with the invoice data in the invoice database, the electronic invoice is invalid. As an optional implementation manner, after outputting the words "invoice check error", the checking module 24 may also output the specific invoice content on which the electronic invoice and the invoice database disagree. For example, when no corresponding data can be found in the invoice database for the buyer's taxpayer name, the invoice is judged to be an identification error; the checking module 24 outputs the taxpayer name shown on the invoice together with the taxpayer name in the corresponding invoice database, and notifies staff for manual follow-up processing.
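The valid and invalid outcomes described above can be sketched as an item-by-item comparison. In this sketch a plain dictionary stands in for the record queried from the preset invoice database; the function name `check_invoice`, the field names, and the sample values are hypothetical.

```python
def check_invoice(recognized, database_record):
    """Compare recognized invoice information with the database record item by item.

    Returns (is_valid, mismatches), where mismatches maps each failing field to
    the pair (value recognized on the invoice, value in the database).
    """
    mismatches = {}
    for name, value in recognized.items():
        expected = database_record.get(name)
        # A field fails if it was not recognized, is missing from the database,
        # or differs from the stored value.
        if value is None or expected is None or value != expected:
            mismatches[name] = (value, expected)
    return (not mismatches), mismatches


record = {"invoice_code": "031001900111", "buyer_name": "Acme Trading Ltd"}
ok, diffs = check_invoice(
    {"invoice_code": "031001900111", "buyer_name": "ACME Ltd"}, record
)
```

Returning both sides of each mismatch gives the person doing the follow-up the value on the invoice and the value in the database, as the example about the taxpayer name describes.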
Example 3
Fig. 6 is a schematic structural diagram of an electronic device provided in this embodiment. The electronic equipment comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, and the processor executes the program to realize the electronic invoice identification method of embodiment 1. The electronic device 30 shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiment of the present invention.
As shown in fig. 6, the electronic device 30 may be embodied in the form of a general purpose computing device, which may be, for example, a server device. The components of the electronic device 30 may include, but are not limited to: the at least one processor 31, the at least one memory 32, and a bus 33 connecting the various system components (including the memory 32 and the processor 31).
The bus 33 includes a data bus, an address bus, and a control bus.
The memory 32 may include volatile memory, such as Random Access Memory (RAM) 321 and/or cache memory 322, and may further include Read Only Memory (ROM) 323.
Memory 32 may also include a program/utility 325 having a set (at least one) of program modules 324, such program modules 324 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 31 executes various functional applications and data processing, such as the electronic invoice recognition method according to embodiment 1 of the present invention, by executing the computer program stored in the memory 32.
The electronic device 30 may also communicate with one or more external devices 34 (e.g., keyboard, pointing device, etc.). Such communication may be through input/output (I/O) interfaces 35. The electronic device 30 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via network adapter 36. As shown in fig. 6, network adapter 36 communicates with the other modules of the electronic device 30 via bus 33. It should be understood that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 30, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
Example 4
The present embodiment provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the steps of the electronic invoice recognition method of embodiment 1.
More specific examples of the readable storage medium may include, but are not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the invention can also be implemented in the form of a program product comprising program code which, when the program product runs on a terminal device, causes the terminal device to carry out the steps of the electronic invoice recognition method of embodiment 1.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (10)

1. An electronic invoice identification method is characterized by comprising the following steps:
acquiring the invoice name of the electronic invoice;
selecting a corresponding invoice identification template according to the invoice name;
and determining and identifying the area to be identified of the electronic invoice according to the invoice identification template.
2. The electronic invoice recognition method of claim 1, wherein the step of obtaining an invoice name for the electronic invoice comprises:
identifying the file type of the electronic invoice, and determining an extraction method according to the file type;
extracting text data of the electronic invoice according to the extraction method;
and determining the invoice name according to the text data.
3. The electronic invoice recognition method of claim 2, wherein the file types include PDF files and OFD files.
4. The electronic invoice recognition method of claim 1, wherein the step of determining and recognizing the area to be recognized of the electronic invoice according to the invoice recognition template comprises:
intercepting the electronic invoice according to the invoice identification template to obtain the area to be identified;
identifying the area to be identified to obtain a keyword;
and acquiring invoice information according to the key words.
5. The electronic invoice recognition method of claim 4, wherein the keywords include at least one of invoice code, invoice number, invoice date, name, taxpayer identification number, item name, amount, tax rate, and remarks;
the invoice information comprises at least one of invoice code information, invoice number information, invoicing date information, purchaser name information, purchaser taxpayer identification number information, item name information, amount information, tax rate information, seller name information, seller taxpayer identification number information and remark information.
6. The electronic invoice recognition method of claim 4, wherein the step of obtaining invoice information according to the keywords is further followed by:
and checking invoice information.
7. The electronic invoice recognition method of claim 6, wherein the step of checking invoice information comprises:
comparing the invoice information with invoice data in a preset invoice database, and if the invoice information is consistent with the invoice data in comparison, determining that the electronic invoice is valid;
and if the comparison is inconsistent, determining that the electronic invoice is invalid.
8. An electronic invoice recognition system, comprising an acquisition module, a selection module and a recognition module:
the acquisition module is used for acquiring the invoice name of the electronic invoice;
the selection module is used for selecting a corresponding invoice identification template according to the invoice name;
the identification module is used for determining and identifying the area to be identified of the electronic invoice according to the invoice identification template.
9. An electronic device comprising a memory and a processor coupled to the memory, the processor implementing the electronic invoice recognition method of any one of claims 1-7 when executing a computer program stored on the memory.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the electronic invoice recognition method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111391678.3A CN114255335A (en) 2021-11-23 2021-11-23 Electronic invoice recognition method, system, electronic device and medium

Publications (1)

Publication Number Publication Date
CN114255335A true CN114255335A (en) 2022-03-29

Family

ID=80791025

Country Status (1)

Country Link
CN (1) CN114255335A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination