CN116484844A - Document OCR recognition result error correction method, system, equipment and medium - Google Patents

Document OCR recognition result error correction method, system, equipment and medium

Info

Publication number
CN116484844A
Authority
CN
China
Prior art keywords
recognition result
ocr recognition
character
ocr
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310469838.4A
Other languages
Chinese (zh)
Inventor
侯昶宇
王俊
王晓锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202310469838.4A
Publication of CN116484844A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/232 Orthographic correction, e.g. spell checking or vowelisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/141 Image acquisition using multiple overlapping images; Image stitching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the field of medical health, and in particular to a document OCR recognition result error correction method, system, equipment and medium. The method comprises the following steps: acquiring an OCR recognition result of a document; obtaining the Unicode code and glyph structure information of each character; obtaining a confidence for each character; splicing the Unicode code, the glyph structure information and the confidence to obtain a character code; acquiring a position code; and inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result. By combining upstream model outputs (the OCR point location information and the recognition confidence of each character) with NLP information (Unicode codes and IDS representations), the method greatly improves the OCR results of document recognition.

Description

Document OCR recognition result error correction method, system, equipment and medium
Technical Field
The invention relates to the field of medical health, in particular to a document OCR recognition result error correction method, system, equipment and medium.
Background
OCR (Optical Character Recognition) refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. In many fields, document recognition is an important way to reduce labor and material costs and improve working efficiency, and recognition accuracy is an important index for evaluating the quality of a system.
Although OCR models now generally achieve high recognition accuracy, recognition errors still occur frequently in practice, and accuracy drops markedly for visually similar characters. Moreover, in the special scenario of document OCR recognition, sufficient context information is lacking, so error detection and correction with a common language model does not work well enough and usually yields a low recall rate. Solving this problem is key to improving OCR recognition results.
Disclosure of Invention
The invention aims to provide a document OCR recognition result error correction method, system, equipment and medium capable of improving the error correction accuracy of document OCR recognition results.
To achieve the above object and other related objects, the present invention provides a document OCR recognition result error correction method, including the steps of:
acquiring an OCR recognition result of a document;
preprocessing the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result;
acquiring the confidence corresponding to each character in the OCR recognition result;
splicing the Unicode code, the glyph structure information and the confidence to obtain a character code;
acquiring point location information corresponding to each character in the OCR recognition result, and taking the point location information corresponding to each character as a position code;
and inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result.
To achieve the above object and other related objects, the present invention also provides a document OCR recognition result error correction device, including:
a first acquisition module, used for acquiring an OCR recognition result of a document;
a first processing module, used for preprocessing the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result;
a second processing module, used for obtaining the confidence corresponding to each character in the OCR recognition result;
a third processing module, used for splicing the Unicode code, the glyph structure information and the confidence to obtain a character code;
a second acquisition module, used for acquiring point location information corresponding to each character in the OCR recognition result, and taking the point location information corresponding to each character as a position code;
and an execution module, used for inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result.
To achieve the above and other related objects, the present invention also provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method when executing the computer program.
To achieve the above and other related objects, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method.
The technical effects of the invention are as follows. The method provides a solution for error correction of document OCR recognition results. By combining upstream model outputs, such as the OCR point location information and the recognition confidence of each character, with NLP information such as Unicode codes and IDS representations, it effectively addresses the sharp drop in performance that current models suffer on documents because the context is too short. Combining the glyph information of the IDS with the position information of the OCR supplements the model with as much additional information as possible to assist error correction; the glyph information of the IDS also yields more accurate results for OCR recognition output in particular. With suitable training data and pre-training tasks, the OCR results of document recognition are greatly improved.
Drawings
FIG. 1 is an application scenario diagram of a document OCR recognition result error correction method provided by an embodiment of the present invention;
FIG. 2 is a flow chart of a document OCR recognition result error correction method provided by an embodiment of the present invention;
FIG. 3 is a flow chart of a document OCR recognition result preprocessing process provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a character encoding stitching process provided by an embodiment of the present invention;
FIG. 5 is a flow chart of a position code acquisition method provided by an embodiment of the present invention;
FIG. 6 is a flow chart of an error correction model training process provided by an embodiment of the present invention;
FIG. 7 is a functional block diagram of a document OCR recognition result error correction apparatus provided by an embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied in other, different embodiments, and the details in this description may be modified or varied without departing from the spirit and scope of the present invention.
Please refer to FIGS. 1-8. It should be noted that the illustrations provided in this embodiment merely illustrate the basic concept of the invention schematically. The drawings show only the components related to the invention and are not drawn according to the number, shape and size of the components in an actual implementation; the form, number and proportion of the components in an actual implementation may be changed arbitrarily, and the component layout may be more complex.
The embodiments of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Fig. 1 shows an application scenario diagram of a preferred embodiment of the document OCR recognition result error correction method of the present invention.
The document OCR recognition result error correction method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices and the like.
The electronic device may be any electronic product that can interact with a user in a human-computer manner, such as a personal computer, tablet computer, smart phone, personal digital assistant (Personal Digital Assistant, PDA), game console, interactive internet protocol television (Internet Protocol Television, IPTV), smart wearable device, etc.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of a plurality of network servers, or a cloud based on cloud computing and composed of a large number of hosts or network servers.
The network in which the electronic device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
The document OCR recognition result error correction method of the present invention will be described in detail below with reference to fig. 1, and the document OCR recognition result error correction method may be applied to, for example, automatic recognition of documents, and is particularly suitable for medical document OCR recognition or medical platform inquiry, for example, recognition of medical payment documents, medical detection, inspection documents, medical diagnosis cases, and the like.
OCR (optical character recognition) text recognition refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper and then translates their shapes into computer text using a character recognition method; that is, the printed material is scanned, and the resulting image file is analyzed and processed to obtain the text and layout information. How to tune the system or use auxiliary information to improve recognition accuracy is the most important issue in OCR. The main indexes for measuring the performance of an OCR system are: rejection rate, false recognition rate, recognition speed, user interface friendliness, product stability, usability, feasibility and the like.
OCR character recognition generally includes image input, preprocessing, layout analysis, character cutting, character recognition, layout recovery, post-processing and proofreading. Image input: images in different formats have different storage formats and compression modes; open source projects for reading images currently include OpenCV, CxImage and the like. Preprocessing mainly comprises binarization, noise removal and tilt correction. In most cases, pictures taken with a camera are color images, which contain very rich information and need to be simplified. Binarization divides the content of a picture into foreground and background: so that the computer can recognize characters faster and better, the color image is first processed so that it contains only foreground and background information, with the foreground simply defined as black and the background as white; the result is a binarized picture. The definition of noise may differ between documents, and eliminating it according to its characteristics is called noise removal. Users generally take photographs casually, so the photographed document is likely to be tilted, and character recognition software is then needed to correct it. Dividing the document picture into paragraphs and lines is called layout analysis; because of the diversity and complexity of actual documents, no fixed, best segmentation model currently exists. Limited photographing conditions often cause touching characters and broken strokes, which greatly limit the performance of the recognition system, so character recognition software needs a character cutting function.
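The binarization step described above can be illustrated with a minimal sketch. It assumes the image has already been converted to grayscale and uses a fixed global threshold; the function name `binarize` and the threshold value 128 are illustrative choices, not taken from the patent, and practical systems often choose the threshold adaptively.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 0 (black foreground) or 255 (white background)."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


# A hypothetical 3x3 grayscale patch: dark strokes on a light background.
image = [
    [250, 240, 30],
    [245, 20, 25],
    [10, 235, 255],
]
binary = binarize(image)
for row in binary:
    print(row)
```

After this step every pixel carries only foreground or background information, which is the simplification the paragraph above refers to.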
Character recognition is mainly based on feature extraction. Usually, people want the recognized characters to remain arranged as in the original document picture, with paragraphs, positions and order unchanged, and then output to a Word or PDF document; this process is called layout recovery. The logical order of language differs between language environments, so the recognized result needs to be corrected according to the context of the specific language; this process is post-processing.
Referring to FIG. 1, the present invention converts the characters recognized by OCR into Unicode codes and glyph structure information according to the Unicode character list and the IDS representation of Chinese characters, and combines these with the confidence of the OCR recognition result to obtain multidimensional character information. The character information is then combined with the point location information of the characters as the input of a BERT model. Because the Unicode code, glyph structure, confidence and point location information of each character are combined, the BERT model can identify errors in the text more accurately.
By combining upstream model outputs, such as the OCR point location information and the recognition confidence of each character, with NLP information such as Unicode codes and IDS representations, the method effectively addresses the sharp drop in performance that current models suffer on documents because the context is too short. Combining the glyph information of the IDS with the position information of the OCR supplements the model with as much additional information as possible to assist error correction; the glyph information of the IDS also yields more accurate results for OCR recognition output in particular. With suitable training data and pre-training tasks, the OCR results of document recognition are greatly improved.
Unicode, also known as the universal code or single code, is developed by the Unicode Consortium and is an industry standard in the field of computer science, covering character sets, encoding schemes and the like. Unicode was created to overcome the limitations of traditional character encoding schemes: it assigns a unified and unique binary code to every character in every language, meeting the requirements of cross-language, cross-platform text conversion and processing.
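As a small illustration of the unique code assigned to each character, the sketch below prints the Unicode code point and the UTF-8 byte encoding of each character in a sample string. The sample string is an arbitrary choice for illustration.

```python
text = "发票A"  # arbitrary sample: two Chinese characters and one Latin letter

for ch in text:
    # ord() returns the Unicode code point; encode() gives one possible byte encoding.
    print(ch, f"U+{ord(ch):04X}", ch.encode("utf-8").hex())

code_points = [ord(ch) for ch in text]
```

The code point identifies the character independently of how it is later serialized, which is why it can serve as a stable per-character feature.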
The Ideographic Description Sequence (IDS) is a grammar for describing the structure of Chinese characters defined by the Unicode standard. A description sequence is formed by combining one description character with two or more component characters (mainly Chinese characters) and represents the abstract structure of a Chinese character. The main purpose of IDS is to express the abstract structure of Chinese characters, not to compose characters dynamically as combining characters do.
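Unicode defines 12 ideographic description characters (U+2FF0 to U+2FFB), as follows:
⿰ left to right
⿱ above to below
⿲ left to middle and right
⿳ above to middle and below
⿴ full surround
⿵ surround from above
⿶ surround from below
⿷ surround from left
⿸ surround from upper left
⿹ surround from upper right
⿺ surround from lower left
⿻ overlaid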
Unicode does not define a unique IDS for each Chinese character. Under current proposals, one Chinese character can be expressed by several IDSs; for example, a character built from the components 工 (worker) and 人 (person) can be described by more than one sequence.
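To make the IDS grammar concrete, here is a minimal sketch of a prefix-notation parser. It assumes the 12 description characters U+2FF0 to U+2FFB, of which ⿲ and ⿳ take three operands and the rest take two; the function name `parse_ids` and the sample sequence (想 written as 相 above 心, with 相 as 木 beside 目) are illustrative choices.

```python
import unicodedata

# The 12 ideographic description characters (IDCs), U+2FF0..U+2FFB.
IDCS = [chr(cp) for cp in range(0x2FF0, 0x2FFC)]
# U+2FF2 and U+2FF3 take three operands; the other ten take two.
ARITY = {idc: 3 if idc in "\u2ff2\u2ff3" else 2 for idc in IDCS}


def parse_ids(seq, pos=0):
    """Parse a prefix-notation IDS into a nested tuple; return (tree, next_pos)."""
    ch = seq[pos]
    if ch not in ARITY:
        return ch, pos + 1  # a leaf component
    children = []
    nxt = pos + 1
    for _ in range(ARITY[ch]):
        child, nxt = parse_ids(seq, nxt)
        children.append(child)
    return (ch, *children), nxt


tree, end = parse_ids("⿱⿰木目心")
print(tree)
for idc in IDCS:
    print(idc, unicodedata.name(idc))
```

The nested tuple makes the abstract structure explicit: the outer operator says how the parts are arranged, and each operand is either a component or another sub-structure.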
Referring to fig. 2, a document OCR recognition result error correction method includes the following steps:
s10: and acquiring an OCR recognition result of the bill.
S20: preprocessing the OCR recognition result to obtain unified code and font structure information corresponding to each character in the OCR recognition result.
Referring to fig. 3, in an embodiment, the step S20 further includes:
S21: Acquiring the characters in the OCR recognition result;
S22: Determining the Unicode code corresponding to each character according to the Unicode character list;
S23: Generating the IDS representation corresponding to each character, wherein the IDS representation comprises the glyph structure information of the character.
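Steps S21 to S23 can be sketched as a per-character lookup. The IDS table below is a tiny hand-written stand-in; a real system would load a complete IDS database, whose source and format the patent does not specify. The names `CharInfo`, `IDS_TABLE` and `preprocess` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical, hand-written IDS table used only for illustration.
IDS_TABLE = {"明": "⿰日月", "好": "⿰女子", "想": "⿱相心"}


@dataclass
class CharInfo:
    char: str
    code_point: int
    ids: str


def preprocess(ocr_text):
    """Return the Unicode code point and IDS representation of each character."""
    infos = []
    for ch in ocr_text:                  # S21: take each recognized character
        code_point = ord(ch)             # S22: look up its Unicode code
        ids = IDS_TABLE.get(ch, ch)      # S23: IDS; fall back to the character itself
        infos.append(CharInfo(ch, code_point, ids))
    return infos


infos = preprocess("明好x")
for info in infos:
    print(info)
```

Falling back to the character itself for entries missing from the table is an assumption of this sketch; it keeps digits, Latin letters and punctuation usable without a decomposition.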
The document OCR recognition result error correction method further comprises the following steps:
S30: Obtaining the confidence corresponding to each character in the OCR recognition result.
The document OCR recognition result error correction method further comprises the following steps:
S40: Splicing the Unicode code, the glyph structure information and the confidence to obtain a character code.
Referring to fig. 4, in an embodiment, the step S40 further includes:
S41: Converting the Unicode code, the glyph structure information and the confidence into feature vectors respectively;
S42: Splicing the feature vector of the Unicode code, the feature vector of the glyph structure information and the feature vector of the confidence to obtain the feature vector of the character code.
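Steps S41 and S42 can be sketched as follows. The patent does not say how each item is vectorized, so the choices here are assumptions made only for illustration: the code point becomes a 21-bit binary vector, the IDS becomes a count of each of the 12 description characters, and the confidence becomes a one-element vector. In a trained model these would more plausibly be learned embeddings.

```python
IDCS = [chr(cp) for cp in range(0x2FF0, 0x2FFC)]  # the 12 description characters


def unicode_vector(ch, bits=21):
    """Binary expansion of the code point (21 bits cover U+0000..U+10FFFF)."""
    cp = ord(ch)
    return [float((cp >> i) & 1) for i in range(bits)]


def ids_vector(ids):
    """Count how often each description character occurs in the IDS."""
    return [float(ids.count(idc)) for idc in IDCS]


def confidence_vector(conf):
    return [float(conf)]


def character_code(ch, ids, conf):
    """S42: concatenate the three feature vectors into one character code."""
    return unicode_vector(ch) + ids_vector(ids) + confidence_vector(conf)


vec = character_code("明", "⿰日月", 0.93)
print(len(vec), vec[-1])
```

Concatenation keeps the three information sources in separate coordinates, so a downstream layer can weight them independently.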
The document OCR recognition result error correction method further comprises the following steps:
S50: Acquiring point location information corresponding to each character in the OCR recognition result, and taking the point location information corresponding to each character as a position code.
Referring to fig. 5, in an embodiment, the step S50 further includes:
S51: Acquiring the pixel values of the pixel points located, on the document, at specified positions relative to each character in the OCR recognition result; for example, the pixel values of the pixel points at the upper left corner and the lower right corner of each character.
S52: Taking the pixel value of each pixel point as the position code.
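A minimal sketch of steps S51 and S52 follows. It assumes that the point location information of a character is the pair of pixel coordinates of its upper left and lower right corners, as OCR engines commonly report for bounding boxes; this reading, the normalization by page size, and the name `position_code` are assumptions of the sketch rather than details stated in the patent.

```python
def position_code(box, page_width, page_height):
    """Turn (x1, y1, x2, y2) corner coordinates into a 4-element position code."""
    x1, y1, x2, y2 = box
    # Normalizing by the page size makes codes comparable across scans of different resolution.
    return [x1 / page_width, y1 / page_height, x2 / page_width, y2 / page_height]


# Hypothetical boxes for two adjacent characters on a 1000 x 500 pixel scan.
boxes = [(100, 50, 140, 90), (140, 50, 180, 90)]
codes = [position_code(b, page_width=1000, page_height=500) for b in boxes]
for code in codes:
    print(code)
```

Characters in the same table row share their vertical coordinates, and characters in the same column share horizontal ones, which is the kind of layout signal a position code of this form can expose.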
The document OCR recognition result error correction method further comprises the following steps:
S60: Inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result.
Referring to FIG. 6, in a specific embodiment, the natural language processing model may be a BERT model, which may be obtained by training with, for example, the following method:
S61: Acquiring OCR recognition results of historical documents, manually marking the erroneous characters in them, and correcting the erroneous characters to form a model training set;
S62: Preprocessing the OCR recognition results of the historical documents to obtain the Unicode code and glyph structure information corresponding to each character;
S63: Acquiring the confidence corresponding to each character in the OCR recognition results of the historical documents;
S64: Splicing the Unicode code, the glyph structure information and the confidence of each character in the OCR recognition results of the historical documents to obtain their character codes;
S65: Acquiring the point location information corresponding to each character in the OCR recognition results of the historical documents, and taking it as the position codes of those results;
S66: Inputting the character codes and position codes of the OCR recognition results of the historical documents into a BERT pre-training model to obtain an error correction result, and calculating the loss value of the BERT pre-training model according to the manually marked erroneous characters;
S67: Adjusting the parameters of the BERT pre-training model and repeating the above steps until the loss value is smaller than a preset threshold, thereby obtaining the trained BERT model.
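The stopping rule in S66 and S67 (adjust parameters, recompute the loss, stop once it falls below a preset threshold) can be shown on a deliberately tiny stand-in model. The sketch below is not BERT: it is a one-feature logistic regression that predicts whether a character is an error from its confidence alone, trained by plain gradient descent. The data, the threshold 0.3, the learning rate and the step cap are all made up for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, xs, ys):
    """Mean binary cross-entropy of the stand-in model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def train(xs, ys, threshold=0.3, lr=1.0, max_steps=5000):
    w = b = 0.0
    loss = mean_loss(w, b, xs, ys)
    steps = 0
    # Repeat "predict, compute loss, adjust parameters" until the loss is below the threshold.
    while loss >= threshold and steps < max_steps:
        n = len(xs)
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(sigmoid(w * x + b) - y for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
        loss = mean_loss(w, b, xs, ys)
        steps += 1
    return w, b, loss, steps


# Hypothetical data: OCR confidences and whether the character was marked as an error.
confidences = [0.99, 0.95, 0.97, 0.40, 0.55, 0.30]
is_error = [0, 0, 0, 1, 1, 1]
w, b, final_loss, steps = train(confidences, is_error)
print(final_loss, steps)
```

The step cap is a safeguard added in this sketch so that the loop terminates even if the threshold is never reached; the patent only describes the threshold condition.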
The method builds a complete OCR recognition result error correction system, and greatly improves the recognition accuracy.
For encoding, the model is divided into two layers: the first layer combines Unicode, the IDS representation and the OCR recognition results, and the second layer is the point location information of the OCR recognition results. The two parts are explained in detail below:
IDS codes record the writing structure and strokes of Chinese characters and contain complete glyph structure information, so the model can learn the visually-similar-character problem in OCR recognition. However, using IDS alone may cause problems: some pairs of characters have nearly identical glyph structures, and an IDS representation alone may fail to distinguish them. Unicode encoding is therefore added so that different characters with the same structure can be told apart during encoding. Finally, the confidence with which OCR recognizes each individual character is also a very important additional feature, because characters that OCR misrecognizes generally have lower confidence. The character code can thus be seen as the concatenation of Unicode, the IDS representation and the confidence.
For position encoding, although a document does not carry much context, information such as the table header can provide some additional context. Therefore, unlike conventional BERT and with reference to the structure of SpanBERT, the point location information of the upper left corner and the lower right corner of each character is used as the position code. During learning, the model can then pick up contextual information, such as the headers of the document's table, through the point location information.
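The reason for pairing IDS with Unicode can be seen in a small sketch. The two entries below give 土 and 士 the same illustrative decomposition (a 十 above a 一); the table, the confidence values and the name `character_key` are assumptions for illustration only.

```python
# Illustrative entries: at this level of detail both characters decompose identically.
IDS_TABLE = {"土": "⿱十一", "士": "⿱十一"}


def character_key(ch, conf):
    """Combine code point, IDS and confidence, mirroring the character code's three parts."""
    return (ord(ch), IDS_TABLE[ch], conf)


a = character_key("土", 0.62)
b = character_key("士", 0.58)

same_structure = a[1] == b[1]   # the IDS alone cannot separate the two characters
distinct = a[0] != b[0]         # the Unicode code point can
print(same_structure, distinct)
```

The low confidences in this made-up example also reflect the third point of the paragraph: look-alike characters are exactly where an OCR engine tends to be unsure.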
The rest of the model follows the structure of the BERT pre-training model: the sum of the improved position code and the character code information is used as the input of the BERT layers, and the output of the model is the character it predicts at each position.
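Summing the two encodings requires them to have the same dimension. One way to arrange this, sketched below, is to project each into a shared hidden size with a linear map and add the results element-wise. The dimensions (34 for the character code, 4 for the position code, 8 for the hidden size) and the random weights standing in for learned parameters are assumptions of this sketch, not values from the patent.

```python
import random


def make_projection(in_dim, out_dim, seed):
    """Random matrix standing in for a learned linear layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


HIDDEN = 8
char_proj = make_projection(34, HIDDEN, seed=0)
pos_proj = make_projection(4, HIDDEN, seed=1)


def model_input(char_code, pos_code):
    """Element-wise sum of the projected character code and position code."""
    c = project(char_proj, char_code)
    p = project(pos_proj, pos_code)
    return [ci + pi for ci, pi in zip(c, p)]


x = model_input([0.5] * 34, [0.1, 0.1, 0.14, 0.18])
print(len(x))
```

This mirrors how standard BERT adds token and position embeddings of equal width before the encoder layers.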
The training data is built from OCR results produced in actual use: the erroneous characters in those results are marked manually and corrected, and the corrected results serve as the model training set. The training objective is to predict the errors in the actual OCR output text and provide corrections.
It should be noted that the steps of the above methods are divided for clarity of description. In implementation they may be combined into one step, or a step may be split into several, and as long as the same logical relationship is preserved they fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow, also falls within the protection scope of this patent.
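A training example of this kind can be sketched as a pair of aligned sequences. The sketch assumes that the manual correction only substitutes characters, so the OCR text and the corrected text have equal length; the sample strings and the name `build_example` are hypothetical.

```python
def build_example(ocr_text, corrected_text):
    """Pair each OCR character with its corrected label and mark where they differ."""
    if len(ocr_text) != len(corrected_text):
        raise ValueError("this sketch only handles substitution errors")
    return {
        "input": list(ocr_text),
        "labels": list(corrected_text),
        "error_mask": [o != c for o, c in zip(ocr_text, corrected_text)],
    }


# Hypothetical OCR output where the digit 0 was misread as the letter O.
example = build_example("金额：1O0.00元", "金额：100.00元")
print(example["error_mask"])
```

The per-position labels match the model output described above, which is a predicted character at each position.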
FIG. 7 is a functional block diagram of a preferred embodiment of the document OCR recognition result error correction device according to the present invention. The document OCR recognition result error correction device comprises: the first acquisition module 10, the first processing module 20, the second processing module 30, the third processing module 40, the second acquisition module 50, and the execution module 60. The modules and sub-modules of the present invention refer to a series of computer program segments that are stored in the memory 200, can be executed by the processor 100, and perform a fixed function.
The first acquisition module 10 is configured to acquire an OCR recognition result of a document; the first processing module 20 is configured to preprocess the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result; the second processing module 30 is configured to obtain the confidence corresponding to each character in the OCR recognition result; the third processing module 40 is configured to splice the Unicode code, the glyph structure information and the confidence to obtain a character code; the second acquisition module 50 is configured to obtain point location information corresponding to each character in the OCR recognition result and take it as a position code; the execution module 60 is configured to input the character code and the position code into a natural language processing model and obtain an error-corrected text result.
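The way the six modules hand data to one another can be sketched as a small pipeline object. Everything concrete here is a stand-in: the class and field names are hypothetical, the acquisition function returns fixed data instead of calling an OCR engine, and the "model" simply echoes its input rather than correcting anything.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class OcrChar:
    char: str
    confidence: float
    box: Tuple[int, int, int, int]


@dataclass
class ErrorCorrectionDevice:
    acquire: Callable[[], List[OcrChar]]          # first acquisition module 10
    preprocess: Callable[[str], Tuple[int, str]]  # first processing module 20
    model: Callable[[list, list], str]            # execution module 60

    def run(self) -> str:
        chars = self.acquire()
        char_codes, pos_codes = [], []
        for c in chars:
            code_point, ids = self.preprocess(c.char)   # module 20: Unicode code and IDS
            conf = c.confidence                         # module 30: confidence
            char_codes.append([code_point, ids, conf])  # module 40: splice into a character code
            pos_codes.append(list(c.box))               # module 50: position code
        return self.model(char_codes, pos_codes)        # module 60: corrected text


device = ErrorCorrectionDevice(
    acquire=lambda: [OcrChar("A", 0.99, (0, 0, 10, 10)), OcrChar("B", 0.41, (10, 0, 20, 10))],
    preprocess=lambda ch: (ord(ch), ch),
    model=lambda char_codes, pos_codes: "".join(chr(cc[0]) for cc in char_codes),
)
result = device.run()
print(result)
```

Passing the modules in as callables reflects the remark that they may be implemented separately or integrated together; swapping the stand-in model for a trained one would not change the wiring.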
Note that, the document OCR recognition result error correction device of this embodiment is a device corresponding to the above-described document OCR recognition result error correction method, and the function modules in the document OCR recognition result error correction device or the respective steps in the document OCR recognition result error correction method correspond to each other. The document OCR recognition result error correction device of the embodiment can be implemented in cooperation with the document OCR recognition result error correction method. Accordingly, the related technical details mentioned in the document OCR recognition result error correction device of the present embodiment may also be applied to the above-mentioned document OCR recognition result error correction method.
It should be noted that each of the above functional modules may be fully or partially integrated into one physical entity or may be physically separate. The modules may all be implemented as software invoked by a processing element, or all be implemented in hardware; alternatively, some modules may be implemented as software invoked by a processing element and the others in hardware. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In implementation, some or all of the steps of the above method, or the above functional modules, may be carried out by integrated logic circuits of hardware in the processing element or by instructions in the form of software.
Fig. 8 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention for implementing the document OCR recognition result error correction method.
The electronic device may include a memory 200, a processor 100, and a bus, and may also include a computer program stored in the memory and executable on the processor, such as a document OCR recognition result error correction program.
The memory includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments the memory may be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. In other embodiments the memory may be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device. Further, the memory may include both an internal storage unit and an external storage device of the electronic device. The memory may be used not only to store application software installed in the electronic device and various kinds of data, such as the code of the document OCR recognition result error correction program, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor may be composed of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor is the control unit of the electronic device: it connects the components of the entire electronic device through various interfaces and lines, and it performs the functions of the electronic device and processes data by running or executing programs or modules stored in the memory (for example, the document OCR recognition result error correction program) and by calling data stored in the memory.
The processor executes the operating system of the electronic device and the installed application programs. By executing the application program, the processor implements the steps of each embodiment of the document OCR recognition result error correction method described above, for example, the steps shown in the figure.
For example, the computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules may be a series of computer program instruction segments capable of performing specified functions, and these instruction segments describe the execution of the computer program in the electronic device. For example, the computer program may be divided into a first acquisition module, a first processing module, a second processing module, a third processing module, a second acquisition module, and an execution module.
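As a non-limiting illustration, the division into six modules might be sketched as six functions chained by a driver. This is only a sketch under stated assumptions: the function names, the `OcrChar` record layout, the hard-coded OCR output, and the placeholder model that returns its input unchanged are all hypothetical and are not taken from the embodiment.

```python
from typing import List, Tuple

# Hypothetical record for one recognized character:
# (text, confidence, (x1, y1, x2, y2)).
OcrChar = Tuple[str, float, Tuple[int, int, int, int]]


def first_acquisition(document_id: str) -> List[OcrChar]:
    """Acquire the OCR result (hard-coded stand-in for a real OCR engine)."""
    return [("1", 0.98, (10, 10, 20, 30)), ("O", 0.41, (22, 10, 34, 30))]


def first_processing(chars: List[OcrChar]) -> List[int]:
    """Preprocess: only the Unicode code point here; glyph structure is omitted."""
    return [ord(text) for text, _, _ in chars]


def second_processing(chars: List[OcrChar]) -> List[float]:
    """Obtain the confidence of each character."""
    return [conf for _, conf, _ in chars]


def third_processing(codes: List[int], confs: List[float]) -> List[Tuple[int, float]]:
    """Splice the per-character features into character codes."""
    return list(zip(codes, confs))


def second_acquisition(chars: List[OcrChar]) -> List[Tuple[int, int, int, int]]:
    """Use each character's point location information as its position code."""
    return [box for _, _, box in chars]


def execution(char_codes, pos_codes) -> str:
    """Placeholder for the NLP model: echoes the input characters unchanged."""
    return "".join(chr(code) for code, _ in char_codes)


def run(document_id: str) -> str:
    """Driver that calls the six modules in order."""
    chars = first_acquisition(document_id)
    char_codes = third_processing(first_processing(chars), second_processing(chars))
    return execution(char_codes, second_acquisition(chars))
```

Keeping each module as a separate function mirrors the statement that the modules may be integrated together or implemented independently; a real implementation would replace the stand-ins with an OCR engine and a trained model.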
The integrated units implemented in the form of software functional modules described above may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include several instructions that cause a computer device (which may be a personal computer, a network device, etc.) or a processor to execute part of the functions of the document OCR recognition result error correction method according to the embodiments of the present invention.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods, where each data block contains a batch of network transaction information that is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and the like.
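The cryptographic linking of data blocks can be illustrated with a minimal hash chain. This is only a sketch under stated assumptions: the block fields, the use of SHA-256, and the sample payloads are hypothetical choices and are not specified by the embodiment.

```python
import hashlib
import json


def make_block(data: str, prev_hash: str) -> dict:
    """Create a block whose hash covers its data and the previous block's hash."""
    payload = json.dumps({"data": data, "prev_hash": prev_hash}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"data": data, "prev_hash": prev_hash, "hash": digest}


def is_valid(chain: list) -> bool:
    """Check that every block's hash is correct and links to its predecessor."""
    for i, block in enumerate(chain):
        expected = make_block(block["data"], block["prev_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True


# Hypothetical payloads; the first block points at an all-zero hash.
genesis = make_block("corrected OCR result #1", "0" * 64)
second = make_block("corrected OCR result #2", genesis["hash"])
chain = [genesis, second]
```

Because each hash covers the previous block's hash, altering the data of an earlier block makes `is_valid` fail, which is the anti-counterfeiting property referred to above.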
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one arrow is shown in FIG. 8, but this does not mean that there is only one bus or only one type of bus. The bus is arranged to enable connection and communication between the memory, the at least one processor, and other components.
The method provides a solution for correcting document OCR recognition results. By combining upstream model outputs, such as the OCR point location information and the confidence of the recognized text, with NLP information such as Unicode codes and IDS representations, it effectively addresses the problem that the performance of current models degrades sharply during document recognition because the available context is too short.
By combining the glyph information of the IDS with the position information of the OCR, the method supplies the model with as much additional information as possible to assist error correction. The glyph information of the IDS also allows more accurate results to be given for the OCR recognition output, and constructing suitable training data and pre-training tasks greatly improves the OCR results of document recognition.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A document OCR recognition result error correction method, characterized by comprising the following steps:
acquiring an OCR recognition result of a document;
preprocessing the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result;
acquiring the confidence corresponding to each character in the OCR recognition result;
concatenating the Unicode code, the glyph structure information, and the confidence to obtain a character code;
acquiring point location information corresponding to each character in the OCR recognition result, and using the point location information corresponding to each character as a position code; and
inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result.
2. The document OCR recognition result error correction method according to claim 1, wherein the step of preprocessing the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result comprises:
acquiring the characters in the OCR recognition result;
determining, according to the Unicode character table, the Unicode code corresponding to each character; and
generating, from each character, the IDS representation corresponding to that character, wherein the IDS representation contains the glyph structure information of the character.
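As a non-limiting illustration of the preprocessing in claim 2, the following sketch looks up the Unicode code point with Python's built-in `ord` and takes the IDS (Ideographic Description Sequence) from a small hand-written table. The table `IDS_TABLE`, the function name `preprocess`, and the fallback for characters missing from the table are assumptions; a real system would load a complete IDS database.

```python
# Hypothetical, hand-written IDS table covering only three characters.
IDS_TABLE = {
    "休": "⿰亻木",
    "体": "⿰亻本",
    "明": "⿰日月",
}


def preprocess(text: str):
    """Return (character, Unicode code point, IDS string) for each character."""
    result = []
    for ch in text:
        code_point = ord(ch)  # position in the Unicode character table
        ids = IDS_TABLE.get(ch, ch)  # assumption: fall back to the character itself
        result.append((ch, code_point, ids))
    return result


for ch, code_point, ids in preprocess("休体"):
    print(ch, f"U+{code_point:04X}", ids)
```

The characters 休 and 体 differ only in their right-hand component, so their IDS strings expose a structural similarity that the code points alone do not.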
3. The document OCR recognition result error correction method according to claim 1, wherein the step of concatenating the Unicode code, the glyph structure information, and the confidence to obtain a character code comprises:
converting the Unicode code, the glyph structure information, and the confidence into feature vectors respectively; and
concatenating the feature vector of the Unicode code, the feature vector of the glyph structure information, and the feature vector of the confidence to obtain the feature vector of the character code.
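As a non-limiting illustration of the concatenation in claim 3, the following sketch converts each of the three inputs into a small vector and joins them. The claim does not say how the conversion is done, so the encodings used here are assumptions chosen for brevity: a 21-bit binary expansion of the code point, a count vector over a tiny hypothetical component vocabulary, and the raw confidence value. A practical model would more likely use learned embeddings.

```python
from typing import List


def unicode_features(code_point: int, bits: int = 21) -> List[float]:
    """Binary expansion of the code point (21 bits cover the Unicode range)."""
    return [float((code_point >> i) & 1) for i in range(bits)]


def ids_features(ids: str, vocab: List[str]) -> List[float]:
    """Count of each vocabulary component appearing in the IDS string."""
    return [float(ids.count(component)) for component in vocab]


def confidence_features(confidence: float) -> List[float]:
    """The OCR confidence as a one-element vector."""
    return [confidence]


def character_code(code_point: int, ids: str, confidence: float,
                   vocab: List[str]) -> List[float]:
    """Concatenate the three feature vectors into one character code."""
    return (unicode_features(code_point)
            + ids_features(ids, vocab)
            + confidence_features(confidence))


# Hypothetical component vocabulary and sample values.
VOCAB = ["⿰", "亻", "木", "本"]
vec = character_code(ord("休"), "⿰亻木", 0.87, VOCAB)
```

With this vocabulary the resulting vector has 21 + 4 + 1 = 26 entries, and the confidence is always the last entry.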
4. The document OCR recognition result error correction method according to claim 1, wherein the step of acquiring point location information corresponding to each character in the OCR recognition result and using the point location information corresponding to each character as a position code comprises:
acquiring the pixel values, on the document, of the pixel points at specified positions relative to each character in the OCR recognition result; and
using the pixel value of each pixel point as the position code.
5. The document OCR recognition result error correction method according to claim 4, wherein the step of acquiring the pixel values, on the document, of the pixel points at specified positions relative to each character in the OCR recognition result comprises:
acquiring the pixel values, on the document, of the pixel points at the upper-left corner and the lower-right corner of each character in the OCR recognition result.
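As a non-limiting illustration of claims 4 and 5, the following sketch builds a position code from the upper-left and lower-right corner points of a character. It assumes that the "pixel values" of the corner points are their pixel coordinates on the document image, which the claims do not state explicitly. The normalization to a fixed 0 to 1000 grid is a further assumption, added so that documents of different sizes yield comparable codes.

```python
from typing import Tuple


def position_code(box: Tuple[int, int, int, int],
                  page_width: int, page_height: int,
                  scale: int = 1000) -> Tuple[int, int, int, int]:
    """Map (x1, y1, x2, y2) corner pixels onto a scale-by-scale grid.

    (x1, y1) is the upper-left corner and (x2, y2) the lower-right corner
    of the character on the document image.
    """
    x1, y1, x2, y2 = box
    return (x1 * scale // page_width,
            y1 * scale // page_height,
            x2 * scale // page_width,
            y2 * scale // page_height)


# Hypothetical character box on a 1000 x 500 pixel document image.
code = position_code((100, 50, 200, 100), page_width=1000, page_height=500)
```

Two corner points are enough to recover both the location and the size of the character, which is presumably why the claim selects them.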
6. The document OCR recognition result error correction method according to claim 1, wherein the step of inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result comprises:
inputting the character code and the position code into a BERT model to obtain the error-corrected text result.
7. The document OCR recognition result error correction method according to claim 6, wherein the BERT model is obtained by training through the following steps:
acquiring an OCR recognition result of a historical document, manually marking the erroneous characters in the OCR recognition result of the historical document, and correcting the erroneous characters, to serve as a model training set;
preprocessing the OCR recognition result of the historical document to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result of the historical document;
acquiring the confidence corresponding to each character in the OCR recognition result of the historical document;
concatenating the Unicode code, the glyph structure information, and the confidence of each character in the OCR recognition result of the historical document to obtain the character code of the OCR recognition result of the historical document;
acquiring point location information corresponding to each character in the OCR recognition result of the historical document, and using the point location information corresponding to each character as the position code of the OCR recognition result of the historical document;
inputting the character code and the position code of the OCR recognition result of the historical document into a BERT pre-training model to obtain an error correction result, and calculating a loss value of the BERT pre-training model according to the manually marked erroneous characters; and
adjusting the parameters of the BERT pre-training model and repeating the above steps until the loss value is smaller than a preset threshold, to obtain the trained BERT model.
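As a non-limiting illustration of the last two steps of claim 7, the following sketch shows only the stopping rule: compute the loss, adjust the parameters, and repeat until the loss falls below a preset threshold. It does not train a BERT model. The one-parameter linear model, the mean squared error loss, the learning rate, and the `max_steps` safeguard are all assumptions made to keep the example self-contained.

```python
from typing import List, Tuple


def train_until_threshold(samples: List[Tuple[float, float]],
                          threshold: float = 1e-4,
                          lr: float = 0.05,
                          max_steps: int = 10_000) -> Tuple[float, float]:
    """Fit y = w * x by gradient descent, stopping once loss < threshold."""
    w = 0.0
    loss = float("inf")
    for _ in range(max_steps):  # safeguard against non-convergence
        # Loss of the current parameters against the labeled targets.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < threshold:
            break
        # Adjust the parameter along the negative gradient, then repeat.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w, loss


# Hypothetical training pairs generated by y = 2x.
w, loss = train_until_threshold([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

In the claimed method the parameters would be those of the BERT pre-training model and the targets would be the manually corrected characters; the loop structure is the only part this sketch is meant to convey.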
8. A document OCR recognition result error correction device, comprising:
a first acquisition module, configured to acquire an OCR recognition result of a document;
a first processing module, configured to preprocess the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result;
a second processing module, configured to acquire the confidence corresponding to each character in the OCR recognition result;
a third processing module, configured to concatenate the Unicode code, the glyph structure information, and the confidence to obtain a character code;
a second acquisition module, configured to acquire point location information corresponding to each character in the OCR recognition result and to use the point location information corresponding to each character as a position code; and
an execution module, configured to input the character code and the position code into a natural language processing model to obtain an error-corrected text result.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202310469838.4A 2023-04-24 2023-04-24 Document OCR recognition result error correction method, system, equipment and medium Pending CN116484844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310469838.4A CN116484844A (en) 2023-04-24 2023-04-24 Document OCR recognition result error correction method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310469838.4A CN116484844A (en) 2023-04-24 2023-04-24 Document OCR recognition result error correction method, system, equipment and medium

Publications (1)

Publication Number Publication Date
CN116484844A true CN116484844A (en) 2023-07-25

Family

ID=87221158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310469838.4A Pending CN116484844A (en) 2023-04-24 2023-04-24 Document OCR recognition result error correction method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN116484844A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination