CN114328831A

CN114328831A - Bill information identification and error correction method and device

Info

Publication number: CN114328831A
Application number: CN202111601719.7A
Authority: CN
Inventors: 王飞
Original assignee: Jiangsu Yincheng Network Technology Co Ltd
Current assignee: Jiangsu Yincheng Network Technology Co Ltd
Priority date: 2021-12-24
Filing date: 2021-12-24
Publication date: 2022-04-12

Abstract

The invention provides a bill information identification and error correction method and device. Text content information, text position information and context information are acquired from a bill to be identified; bill element information is extracted based on the text content information, the text position information and the context information; based on a rule for identifying text to be corrected, the bill element information is classified into first bill element information that needs error correction and second bill element information that does not; and the type of the first bill element information is determined, and the first bill element information is corrected using the error correction mode corresponding to that type. The invention can provide a targeted error correction scheme for different bill information, further correct the text information recognized by OCR, and quickly identify and acquire the effective information in the bill, thereby improving both the accuracy and the efficiency of bill information identification.

Description

Bill information identification and error correction method and device

Technical Field

The application relates to the technical field of image recognition, in particular to a bill information recognition and error correction method, a bill information recognition and error correction device, computer equipment and a storage medium.

Background

As the economy evolves, many companies use financial systems to handle various financial-related issues for the company. In order to reduce the workload of financial staff, the financial system usually adopts an OCR (Optical Character Recognition) technology to recognize bills and extract effective information in the bills.

However, OCR recognition of bills suffers from insufficient precision, and the recognized text information is usually not corrected further. Existing bill recognition schemes either perform no error correction on the OCR-recognized text or, where error correction exists, rely only on the internal relations among the elements on the bill face for verification, and lack a targeted error correction scheme for different bill information.

Therefore, there is a need for a bill information identification and error correction method and apparatus capable of performing a targeted error correction process for different bill information.

Disclosure of Invention

The embodiments of the invention provide a bill information identification and error correction method, a bill information identification and error correction device, computer equipment and a storage medium, which address the problem that existing bill recognition schemes either perform no error correction on the text recognized by OCR (optical character recognition) or, where error correction exists, rely only on the internal relations among the elements on the bill face for verification and lack a targeted error correction scheme for different bill information.

In order to achieve the above object, in a first aspect of the embodiments of the present invention, there is provided a ticket information identifying and correcting method, including:

acquiring text content information, text position information and context information from a bill to be identified;

extracting bill element information based on the text content information, the text position information and the context information;

based on a rule for identifying text to be corrected, classifying the bill element information into first bill element information needing to be corrected and second bill element information not needing to be corrected;

and determining the type of the first bill element information, and correcting the first bill element information by adopting an error correction mode corresponding to the type.

Optionally, in a possible implementation manner of the first aspect, the obtaining text content information, text location information, and context information from a ticket to be recognized includes:

and identifying the image information of the bill to be identified based on the OCR model to obtain the text content information, the text position information, the context information and the confidence coefficient of the text content information of the bill to be identified.

Optionally, in a possible implementation manner of the first aspect, the classifying the ticket element information into first ticket element information requiring error correction and second ticket element information not requiring error correction based on a rule for identifying a text to be corrected includes:

if the bill element information is a fixed vocabulary, judging whether the fixed vocabulary exists in a fixed vocabulary word bank;

if the fixed vocabulary does not exist in the fixed vocabulary word bank, the bill element information is the first bill element information needing error correction;

if the fixed vocabulary exists in the fixed vocabulary word bank, the bill element information is the second bill element information which does not need error correction;

if the bill element information is a common text, judging whether the confidence of the common text is lower than a preset threshold value;

if the confidence of the common text is lower than a preset threshold value, the bill element information is the first bill element information needing error correction;

and if the confidence of the common text is not lower than a preset threshold value, the bill element information is second bill element information which does not need to be corrected.

Optionally, in a possible implementation manner of the first aspect, the determining a type of the first ticket element information, and performing error correction on the first ticket element information by using an error correction method corresponding to the type includes:

if the first bill element information is a fixed vocabulary, replacing the text information to be corrected according to the constructed confusion word bank to obtain target text information;

and if the first bill element information is a common text, correcting the text information to be corrected by adopting a BERT model.

Optionally, in a possible implementation manner of the first aspect, the correcting the text information to be corrected by using the BERT model includes:

inputting the text information to be corrected into the BERT model to obtain a plurality of candidate words and confidence degrees of the candidate words;

calculating the similarity between the text information to be corrected and the candidate words;

and when the confidence degree and the similarity meet a replacement condition, using the candidate word to replace the text information to be corrected.

In a second aspect of the embodiments of the present invention, there is provided a ticket information identification and error correction apparatus, including:

the text information acquisition module is used for acquiring text content information, text position information and context information from the bill to be identified;

the bill element information extraction module is used for extracting bill element information based on the text content information, the text position information and the context information;

the to-be-corrected text judging module is used for classifying the bill element information into first bill element information needing to be corrected and second bill element information not needing to be corrected based on a rule for identifying text to be corrected;

and the error correction module is used for determining the type of the first bill element information and correcting the first bill element information by adopting an error correction mode corresponding to the type.

Optionally, in a possible implementation manner of the second aspect, the text information obtaining module is further configured to perform the following steps, including:

In a third aspect of the embodiments of the present invention, a computer device is provided, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor implements the steps in the above method embodiments when executing the computer program.

A fourth aspect of the embodiments of the present invention provides a readable storage medium, in which a computer program is stored, which, when being executed by a processor, is adapted to carry out the steps of the method according to the first aspect of the present invention and various possible designs of the first aspect of the present invention.

The invention provides a bill information identification and error correction method, a device, computer equipment and a storage medium, which are characterized in that text content information, text position information and context information are acquired from a bill to be identified; extracting bill element information based on the text content information, the text position information and the context information; based on a text rule to be corrected, classifying the bill element information into first bill element information needing to be corrected and second bill element information not needing to be corrected; and determining the type of the first bill element information, and correcting the first bill element information by adopting an error correction mode corresponding to the type. The invention can provide a targeted error correction scheme according to different bill information, further correct the text information identified by the OCR, and quickly identify and acquire effective information in the bill information, thereby improving the accuracy of bill information identification and the efficiency of bill information identification.

Drawings

FIG. 1 is a flow chart of a first embodiment of a ticket information identification and correction method;

FIG. 2 is a flowchart of a first embodiment of a ticket information identification and correction method;

FIG. 3 is a block diagram of a first embodiment of a bill information identification and correction device.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.

It should be understood that, in various embodiments of the present invention, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

It should be understood that in the present application, "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

It should be understood that, in the present invention, "a plurality" means two or more. "And/or" merely describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "Comprising A, B and C" and "comprising A, B, C" mean that all three of A, B and C are included; "comprising A, B or C" means that one of A, B and C is included; "comprising A, B and/or C" means that any one, any two, or all three of A, B and C are included.

It should be understood that in the present invention, "B corresponding to A", "A corresponds to B", or "B corresponds to A" means that B is associated with A and that B can be determined from A. Determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information. The matching of A and B means that the similarity of A and B is greater than or equal to a preset threshold value.

As used herein, "if" may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context.

The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.

The invention provides a bill information identification and error correction method which, as shown in the flow charts of FIG. 1 and FIG. 2, comprises the following steps:

Step S110, acquiring text content information, text position information and context information from the bill to be identified.

In this step, the bill information identification and error correction method of the present application can be applied to text recognition of bills in the financial industry as well as to bill text recognition in other fields. The bills may be contracts, receipts, invoices, tickets, agreements, bills and the like, all of which fall within the protection scope of the present invention. Before text information is acquired from the bill to be identified, image information of the bill must be obtained; the image information may be a picture or a scanned copy of the bill, and may be captured on site or retrieved from a preset database. The image information of the bill to be identified is then recognized by an OCR model to obtain the content information of the text (words or characters), the position information of the text, the context information at the position of the text, and the confidence of the text in the bill. The OCR model used in the method is the open-source recognition scheme PaddleOCR, which avoids a large amount of image annotation and training work.
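The output of step S110 can be pictured as one record per recognized text line. The following is a minimal sketch, not the patented implementation: it assumes the OCR stage yields entries of the form `[box, (text, score)]`, where `box` is four corner points, and the names `OcrLine`, `parse_ocr_result` and `neighbours` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


@dataclass
class OcrLine:
    text: str          # text content information
    box: List[Point]   # text position information: four corner points (x, y)
    confidence: float  # recognition confidence in [0, 1]

    @property
    def top(self) -> float:
        return min(y for _, y in self.box)

    @property
    def left(self) -> float:
        return min(x for x, _ in self.box)


def parse_ocr_result(raw: Sequence) -> List[OcrLine]:
    """Convert raw entries of the ASSUMED form [box, (text, score)] into
    OcrLine records, sorted top-to-bottom and then left-to-right."""
    lines = [
        OcrLine(text=text, box=[tuple(p) for p in box], confidence=float(score))
        for box, (text, score) in raw
    ]
    lines.sort(key=lambda line: (line.top, line.left))
    return lines


def neighbours(
    lines: List[OcrLine], index: int
) -> Tuple[Optional[OcrLine], Optional[OcrLine]]:
    """Context information: the previous and next lines in reading order."""
    prev_line = lines[index - 1] if index > 0 else None
    next_line = lines[index + 1] if index + 1 < len(lines) else None
    return prev_line, next_line
```

Sorting into reading order up front is a design choice of this sketch; it makes the adjacent-line lookup needed for context information a simple index operation.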

Step S120, extracting bill element information based on the text content information, the text position information and the context information.

In step S120, the text information recognized by the OCR model does not indicate the type to which each piece of text belongs, and it also contains useless text that is not needed. The text information recognized by the OCR model therefore has to be further extracted to obtain the required bill element information. The text content information refers to content such as characters and symbols. The text position information refers to the position of the text content on the bill, such as the top, the bottom or the upper left. The context information refers to the text adjacent to a given text, which must be taken into account during recognition; this addresses the case in which a text is too long and wraps onto a second line, so that its two lines would otherwise not be assigned to the same category. In the extraction process, a layout template corresponding to the format of the bill is selected and combined with the text content information, the text position information and the context information output by the OCR model, so that the bill element information of the bill to be identified is extracted.
For example, if the top position of an invoice holds the ticket number and the ticket number has a length of 30, the numeric string of length 30 at the top position is identified as the ticket number and extracted as bill element information of that text type (ticket number class). As another example, the bottom position of the invoice holds address information that is too long and occupies two lines of text; the type of the text information is determined from the layout template of the bill and the position information of the text, the context information of the text is analyzed, and the upper and lower lines of text are joined to obtain bill element information of that text type (address class).
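The two examples above (a 30-digit ticket number at the top, a two-line address at the bottom) can be sketched as a template lookup. This is an illustrative sketch under stated assumptions: the template format, the normalised vertical coordinate `y`, the region bounds, and the names `Line`, `TEMPLATE` and `extract_elements` are all hypothetical and not taken from the patent.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Line:
    text: str
    y: float           # vertical centre, normalised to [0, 1] (0 = top of bill)
    confidence: float


# Hypothetical layout template: element type -> (y_min, y_max, pattern or None).
TEMPLATE: Dict[str, Tuple[float, float, Optional[re.Pattern]]] = {
    "ticket_number": (0.0, 0.15, re.compile(r"\d{30}")),
    "address": (0.85, 1.0, None),
}


def extract_elements(
    lines: List[Line],
    template: Dict[str, Tuple[float, float, Optional[re.Pattern]]],
) -> Dict[str, str]:
    """Assign OCR lines to element types by position and content, joining
    adjacent lines that fall in the same region (wrapped text)."""
    elements: Dict[str, str] = {}
    for name, (y_min, y_max, pattern) in template.items():
        in_region = sorted(
            (line for line in lines if y_min <= line.y <= y_max),
            key=lambda line: line.y,
        )
        if pattern is not None:
            # Keep only lines whose whole text matches the expected format.
            in_region = [l for l in in_region if pattern.fullmatch(l.text)]
        if in_region:
            elements[name] = "".join(l.text for l in in_region)
    return elements
```

Text that falls outside every template region, or that fails the format check, is simply not extracted, which mirrors the removal of useless text described above.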

Step S130, classifying the bill element information into first bill element information needing error correction and second bill element information needing no error correction based on the rule for identifying the text to be corrected.

In this step, the extracted bill element information may contain recognition errors and needs to be corrected, and the erroneous words must first be located. The invention locates them with a two-stage (primary and secondary) classification marking method. First, the extracted bill element information is classified into fixed vocabulary and ordinary text. In this primary classification, the type of each piece of bill element information is determined from the layout template of the bill and the position information of the text; for example, the top position of the bill holds the bill number, the lower right position holds the address, and the middle position holds the bill amount. Recognizing the image information of the bill thus yields a number of pieces of bill element information together with their types, and all types on the bill are then divided into two categories. The first is fixed vocabulary, such as the issue amount (the amount in capital numerals, such as the capital forms of "one" and "two") and the acceptance signature (a company financial seal and/or a personal seal). The second is ordinary text, such as the payer's address and the payer's full name.
The primary classification yields bill element information of the fixed vocabulary category and bill element information of the ordinary text category. A secondary classification is then performed on each of the two categories, dividing the bill element information into first bill element information that needs error correction and second bill element information that does not. For bill element information that contains only fixed vocabulary, a fixed vocabulary word bank containing all fixed vocabulary of the bill is established in advance, and it is judged whether the fixed vocabulary exists in that word bank: if it does not exist, the bill element information is first bill element information that needs error correction; if it exists, the bill element information is second bill element information that does not need error correction.
For bill element information of the ordinary text category, the judgment is made according to the confidence of the text content information output by the OCR model in step S110. The OCR model outputs a confidence for each recognized character, and the higher the confidence, the more reliable the recognition result. A preset confidence threshold is therefore set, and its value can be chosen according to the actual situation. If the confidence of the text is higher than or equal to the preset threshold, the text (i.e., the bill element information) is second bill element information that does not need error correction; if the confidence of the text is lower than the preset threshold, the text is first bill element information that needs error correction. With this two-stage classification marking method, different judgment methods can be applied to different types of bill information, which makes targeted error correction for different bill information possible and thereby improves the accuracy and efficiency of bill information identification.
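The secondary classification reduces to two simple tests, one per category. The sketch below is illustrative only: the `Element` record, the sample word bank entries and the threshold value of 0.9 are assumptions made for the example (the text above leaves the threshold to be set according to the actual situation).

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass
class Element:
    kind: str          # "fixed" (fixed vocabulary) or "plain" (ordinary text)
    text: str
    confidence: float  # OCR confidence for this element


# Hypothetical fixed vocabulary word bank (a few capital numerals only).
FIXED_WORD_BANK: FrozenSet[str] = frozenset({"壹", "贰", "叁"})


def needs_correction(
    element: Element, word_bank: FrozenSet[str], threshold: float = 0.9
) -> bool:
    """Fixed vocabulary: flag when absent from the word bank.
    Ordinary text: flag when the confidence is below the threshold."""
    if element.kind == "fixed":
        return element.text not in word_bank
    return element.confidence < threshold


def split_elements(
    elements: List[Element], word_bank: FrozenSet[str], threshold: float = 0.9
) -> Tuple[List[Element], List[Element]]:
    """Return (first, second): elements needing / not needing correction."""
    first: List[Element] = []
    second: List[Element] = []
    for element in elements:
        target = first if needs_correction(element, word_bank, threshold) else second
        target.append(element)
    return first, second
```

Note that the OCR confidence is deliberately ignored for fixed vocabulary in this sketch, following the description: membership in the word bank is the only criterion for that category.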

Step S140, determining the type of the first ticket element information, and performing error correction on the first ticket element information by using an error correction method corresponding to the type.

In step S140, when the type of the first ticket element information is determined, the text information to be corrected is corrected according to the correction method corresponding to the type, specifically as follows:

When the bill element text information to be corrected is judged to be fixed vocabulary, it is corrected using the constructed confusion word bank. The word bank is constructed from statistics over a large amount of training data: certain characters or terms are frequently misrecognized as another character or term, so a confusion word bank can be built from the training data that the machine easily confuses. When error correction is performed based on the confusion word bank, the fixed-vocabulary text to be corrected that was recognized by the OCR model is detected and replaced. For example, because amount text such as the issue amount in the bill information involves capital numerals, a simple confusion word bank {零, 壹, 贰, 叁, 肆, 伍, 陆, 柒, 捌, 玖} (the capital numerals zero through nine) is established so that recognition and replacement can be carried out.
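A confusion word bank can be represented as a mapping from a commonly misrecognized character to the character it most likely stands for. The following sketch assumes character-level replacement within a capital-numeral amount; the three confusion pairs are invented purely for illustration and are not taken from the patent or from any training statistics.

```python
from typing import Dict, Set

# Valid capital numerals (zero through nine).
CAPITAL_NUMERALS: Set[str] = set("零壹贰叁肆伍陆柒捌玖")

# Hypothetical confusion pairs: misrecognized character -> intended character.
CONFUSION: Dict[str, str] = {"柴": "柒", "玫": "玖", "式": "贰"}


def correct_fixed(text: str, confusion: Dict[str, str], valid: Set[str]) -> str:
    """Replace each character that is not a valid fixed-vocabulary character
    with its counterpart from the confusion word bank, if one is known."""
    return "".join(
        ch if ch in valid else confusion.get(ch, ch)
        for ch in text
    )
```

Characters with no entry in the confusion word bank are left unchanged in this sketch, so an unknown error is passed through rather than silently replaced by a guess.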

When the bill element text information to be corrected is judged to be ordinary text, it is corrected using the pre-trained natural language model BERT. The specific error correction process is as follows: the text information to be corrected is input into the BERT model to obtain a plurality of candidate words and the confidence of each candidate word; the similarity between the text information to be corrected and each candidate word is calculated; and when the confidence and the similarity satisfy a replacement condition, the candidate word replaces the text information to be corrected. In detail, the first bill element information (ordinary text) to be corrected is identified using the preset confidence threshold of step S130, and the ordinary text to be corrected is input into the trained BERT model as the predicted word. The model predicts the predicted word from the vocabulary surrounding it and outputs the prediction results as a plurality of candidate words, each with a corresponding confidence (denoted c). The similarity between each candidate word and the predicted word (denoted p) is then calculated using an edit distance method. Finally, whether a candidate word meets the replacement requirement is judged from its confidence and its similarity to the ordinary text (predicted word) to be corrected, and the predicted word is replaced by the candidate word when the requirement is met. The specific replacement requirement is as follows:

c + p ≥ 1 and c ≥ 0.05 and p ≥ 0.4

That is, three requirements must be met simultaneously: the sum of the confidence and the similarity is greater than or equal to 1, the confidence is greater than or equal to 0.05, and the similarity is greater than or equal to 0.4. After the different error correction methods have been used to correct and replace the different bill information, the correct text information can be output.
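The replacement condition can be checked with a few lines of code once the candidates and their confidences are available. The sketch below does not call a BERT model; the candidate list is passed in as plain `(word, confidence)` pairs. Two points are assumptions of this sketch rather than statements of the patent: the similarity p is taken as 1 minus the edit distance divided by the longer string length, and when several candidates qualify the one with the largest c + p is chosen.

```python
from typing import Iterable, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """ASSUMED normalisation: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def should_replace(c: float, p: float) -> bool:
    """Replacement condition: c + p >= 1 and c >= 0.05 and p >= 0.4."""
    return c + p >= 1 and c >= 0.05 and p >= 0.4


def pick_replacement(original: str, candidates: Iterable[Tuple[str, float]]) -> str:
    """Return the qualifying candidate with the largest c + p (a tie-breaking
    choice made for this sketch), or the original text if none qualifies."""
    best, best_score = original, None
    for word, c in candidates:
        p = similarity(original, word)
        if should_replace(c, p) and (best_score is None or c + p > best_score):
            best, best_score = word, c + p
    return best
```

The similarity floor keeps a confident but unrelated prediction from overwriting the OCR text, while the sum condition lets a less confident prediction through only when it is very close to what was recognized.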

The bill information identification and error correction method provided by the invention obtains text content information, text position information and context information from a bill to be identified; extracting bill element information based on the text content information, the text position information and the context information; based on a text rule to be corrected, classifying the bill element information into first bill element information needing to be corrected and second bill element information not needing to be corrected; and determining the type of the first bill element information, and correcting the first bill element information by adopting an error correction mode corresponding to the type. The invention can provide a targeted error correction scheme according to different bill information, further correct the text information identified by the OCR, and quickly identify and acquire effective information in the bill information, thereby improving the accuracy of bill information identification and the efficiency of bill information identification.

An embodiment of the present invention further provides a device for identifying and correcting ticket information, as shown in fig. 3, including:

In one embodiment, the text information obtaining module is further configured to perform the following steps, including:

The bill information identification and correction device provided by the invention acquires text content information, text position information and context information from a bill to be identified; extracting bill element information based on the text content information, the text position information and the context information; based on a text rule to be corrected, classifying the bill element information into first bill element information needing to be corrected and second bill element information not needing to be corrected; and determining the type of the first bill element information, and correcting the first bill element information by adopting an error correction mode corresponding to the type. The invention can provide a targeted error correction scheme according to different bill information, further correct the text information identified by the OCR, and quickly identify and acquire effective information in the bill information, thereby improving the accuracy of bill information identification and the efficiency of bill information identification.

The readable storage medium may be a computer storage medium or a communication medium. Communication media include any medium that facilitates transfer of a computer program from one place to another. Computer storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, a readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application-Specific Integrated Circuit (ASIC). Additionally, the ASIC may reside in user equipment. Of course, the processor and the readable storage medium may also reside as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

The present invention also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the device may read the execution instructions from the readable storage medium, and the execution of the execution instructions by the at least one processor causes the device to implement the methods provided by the various embodiments described above.

In the above embodiments of the terminal or the server, it should be understood that the Processor may be a Central Processing Unit (CPU), other general-purpose processors, a Digital Signal Processor (DSP), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A bill information identification and error correction method is characterized by comprising the following steps:

2. The ticket information identification and correction method according to claim 1, wherein said obtaining text content information, text position information, and context information from the ticket to be identified comprises:

3. The ticket information identification and correction method according to claim 1, wherein said classifying the ticket element information into a first ticket element information requiring correction and a second ticket element information not requiring correction based on the rule of identifying the text to be corrected comprises:

4. The ticket information identification and correction method according to claim 1, wherein said classifying the ticket element information into a first ticket element information requiring correction and a second ticket element information not requiring correction based on the rule of identifying the text to be corrected comprises:

5. The ticket information identifying and error correcting method of claim 3, wherein the determining the type of the first ticket element information and performing error correction on the first ticket element information in an error correction manner corresponding to the type comprises:

6. The method for identifying and correcting the bill information according to claim 5, wherein the correcting the text information to be corrected by using the BERT model comprises:

7. A bill information identification and correction device is characterized by comprising:

8. The ticket information identification and correction device of claim 7, wherein the text information obtaining module is further configured to perform the following steps, including:

9. A computer device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.