CN116484844A - Document OCR recognition result error correction method, system, equipment and medium - Google Patents
- Publication number
- CN116484844A (application CN202310469838.4A)
- Authority
- CN
- China
- Prior art keywords
- recognition result
- ocr recognition
- character
- ocr
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/141—Image acquisition using multiple overlapping images; Image stitching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/42—Document-oriented image-based pattern recognition based on the type of document
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the field of medical health, and in particular to a document OCR recognition result error correction method, system, equipment and medium. The method comprises the following steps: acquiring an OCR recognition result of a document; obtaining the Unicode code and glyph structure information of each character; obtaining a confidence for each character; splicing the Unicode code, the glyph structure information and the confidence to obtain a character code; acquiring a position code; and inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result. By combining upstream model outputs (the OCR point location information and the confidence of each recognized character) with NLP information (Unicode codes and IDS representations), the method greatly improves the OCR result of document recognition.
Description
Technical Field
The invention relates to the field of medical health, in particular to a document OCR recognition result error correction method, system, equipment and medium.
Background
OCR (Optical Character Recognition) refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. In many fields, document recognition is an important way to reduce labor and material costs and improve working efficiency, and recognition accuracy is an important index for evaluating such a system.
Although OCR models currently achieve generally high recognition accuracy, recognition errors still occur often in practical applications, and accuracy drops markedly for visually similar characters. Moreover, in the special scenario of document OCR recognition, detecting and correcting errors with a general language model does not work well enough because sufficient context is lacking, which usually results in a low recall rate. Solving this problem is key to improving OCR recognition results.
Disclosure of Invention
The invention aims to provide a document OCR recognition result error correction method, system, equipment and medium capable of improving the error correction precision of document OCR recognition results.
To achieve the above object and other related objects, the present invention provides a document OCR recognition result error correction method, including the following steps:
acquiring an OCR recognition result of a document;
preprocessing the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result;
acquiring the confidence corresponding to each character in the OCR recognition result;
splicing the Unicode code, the glyph structure information and the confidence to obtain a character code;
acquiring point location information corresponding to each character in the OCR recognition result, and taking the point location information corresponding to each character as a position code;
and inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result.
To achieve the above object and other related objects, the present invention also provides a document OCR recognition result error correction device, including:
a first acquisition module, used for acquiring an OCR recognition result of a document;
a first processing module, used for preprocessing the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result;
a second processing module, used for obtaining the confidence corresponding to each character in the OCR recognition result;
a third processing module, used for splicing the Unicode code, the glyph structure information and the confidence to obtain a character code;
a second acquisition module, used for acquiring point location information corresponding to each character in the OCR recognition result, and taking the point location information corresponding to each character as a position code;
and an execution module, used for inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result.
To achieve the above and other related objects, the present invention also provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method when executing the computer program.
To achieve the above and other related objects, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method.
The technical effects of the invention are as follows. The method provides a solution for error correction of document OCR recognition results. By combining upstream model outputs, such as the OCR point location information and the confidence of the recognized text, with NLP information such as Unicode codes and IDS representations, it effectively addresses the sharp drop in performance that current models suffer in document recognition because the context is too short. Combining the glyph information from IDS with the position information from OCR supplies as much additional information as possible to help the model correct errors; the IDS glyph information in particular allows more accurate results on OCR output. With suitable training data and pre-training tasks, the OCR result of document recognition is greatly improved.
Drawings
FIG. 1 is an application scenario diagram of a document OCR recognition result error correction method provided by an embodiment of the present invention;
FIG. 2 is a flow chart of a document OCR recognition result error correction method provided by an embodiment of the present invention;
FIG. 3 is a flow chart of a document OCR recognition result preprocessing process provided by an embodiment of the present invention;
FIG. 4 is a flow chart of a character encoding stitching process provided by an embodiment of the present invention;
FIG. 5 is a flow chart of a position code acquisition method provided by an embodiment of the present invention;
FIG. 6 is a flow chart of an error correction model training process provided by an embodiment of the present invention;
FIG. 7 is a functional block diagram of a document OCR recognition result error correction apparatus provided by an embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device provided by an embodiment of the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied in other embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the present invention.
Please refer to FIGS. 1-8. It should be noted that the illustrations provided in the present embodiment merely illustrate the basic concept of the present invention in a schematic way. The drawings show only the components related to the present invention and are not drawn according to the number, shape and size of the components in actual implementation; the form, number and proportion of the components may be changed arbitrarily in actual implementation, and the component layout may be more complex.
The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Fig. 1 shows an application scenario diagram of a preferred embodiment of the document OCR recognition result error correction method of the present invention.
The document OCR recognition result error correction method is applied to one or more electronic devices. An electronic device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices and the like.
The electronic device may be any electronic product that can interact with a user in a human-computer manner, such as a personal computer, tablet computer, smart phone, personal digital assistant (Personal Digital Assistant, PDA), game console, interactive internet protocol television (Internet Protocol Television, IPTV), smart wearable device, etc.
The electronic device may also include a network device and/or a user device. The network device includes, but is not limited to, a single network server, a server group composed of a plurality of network servers, or a cloud based on cloud computing and composed of a large number of hosts or network servers.
The network in which the electronic device is located includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a virtual private network (Virtual Private Network, VPN), and the like.
The document OCR recognition result error correction method of the present invention is described in detail below with reference to FIG. 1. The method may be applied to, for example, automatic recognition of documents, and is particularly suitable for medical document OCR recognition or medical platform inquiry, for example recognition of medical payment documents, medical test and examination reports, medical diagnosis records, and the like.
OCR (optical character recognition) refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper and then translates their shapes into computer text using a character recognition method; that is, the text material is scanned, and the resulting image file is analyzed and processed to obtain the text and layout information. How to improve recognition accuracy through error removal or auxiliary information is the most important issue in OCR. The main indexes for measuring the performance of an OCR system are: rejection rate, false recognition rate, recognition speed, user interface friendliness, product stability, usability, feasibility and the like.
OCR character recognition generally includes image input, preprocessing, layout analysis, character cutting, character recognition, layout recovery, post-processing, and proofreading.
Image input: images in different formats use different storage formats and compression methods; open source projects for reading images currently include OpenCV, CxImage and the like.
Preprocessing mainly comprises binarization, noise removal and skew correction. In most cases, pictures taken with a camera are color images, which contain very rich information and need to be simplified. Binarization divides the picture content into foreground and background: so that the computer can recognize characters faster and better, the color image is first processed so that only foreground and background information remains, with the foreground simply defined as black and the background as white. The result is the binarized picture. The definition of noise may differ between documents, and removing it according to its characteristics is called noise removal. Because users generally photograph documents rather casually, the photographed document is likely to be tilted, and the character recognition software then needs to correct the skew.
Layout analysis is the process of dividing the document picture into paragraphs and lines; because actual documents are diverse and complex, there is currently no fixed, optimal segmentation model.
Character cutting: photographing conditions often cause touching or broken characters, which greatly limits the performance of the recognition system, so character recognition software needs a character cutting function.
Character recognition is mainly based on feature extraction.
Layout recovery: people usually expect the recognized text to remain arranged as in the original document picture, with paragraphs, positions and order unchanged, when it is output to a Word or PDF document; this process is called layout recovery.
Post-processing: the logical order of language differs between language environments, so the recognized result needs to be corrected according to the context of the specific language; this process is post-processing.
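As an illustration of the binarization step described above, the following minimal Python sketch maps a grayscale image to pure black and white. The function name `binarize`, the list-of-rows image layout and the fixed threshold of 128 are assumptions made for this example; the source does not specify how binarization is performed.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to black (0) or white (255).

    Pixels darker than `threshold` are treated as foreground (black text);
    everything else becomes background (white). The fixed threshold is an
    assumption; real systems often choose it adaptively.
    """
    return [[0 if px < threshold else 255 for px in row] for row in gray]


# Tiny 3x3 "image": a dark diagonal stroke on a light background.
image = [
    [250, 240, 30],
    [245, 20, 235],
    [10, 230, 225],
]
binary = binarize(image)
```

After this step only foreground and background remain, which is the simplification the preprocessing stage aims for.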
Referring to FIG. 1, the present invention converts the characters recognized by OCR into Unicode codes and glyph structure information according to the Unicode character list and the IDS representation of Chinese characters, and combines these with the confidence of the OCR recognition result to obtain multidimensional character information. The character information is then combined with the point location information of the characters as input to the BERT model. Because the Unicode code, glyph structure, confidence and point location information of each character are combined, the BERT model can identify erroneous characters more accurately.
By combining upstream model outputs, such as the OCR point location information and the confidence of the recognized text, with NLP information such as Unicode codes and IDS representations, the method effectively addresses the sharp drop in performance that current models suffer in document recognition because the context is too short. Combining the glyph information from IDS with the position information from OCR supplies as much additional information as possible to help the model correct errors; the IDS glyph information in particular allows more accurate results on OCR output. With suitable training data and pre-training tasks, the OCR result of document recognition is greatly improved.
Unicode, also known as the unified code, universal code or single code, is developed by the Unicode Consortium and is an industry standard in the field of computer science that includes a character set, encoding schemes, and the like. Unicode was created to overcome the limitations of traditional character encoding schemes; it assigns a unified and unique binary code to each character in each language, to meet the requirements of cross-language and cross-platform text conversion and processing.
The Ideographic Description Sequence (IDS) is a syntax for describing the structure of Chinese characters defined by the Unicode standard. A description sequence is formed by combining a description character with two or more specific characters (mainly Chinese characters) and represents the abstract structure of a Chinese character. IDS is primarily intended to express the abstract structure of Chinese characters, rather than to compose characters dynamically as combining characters do.
Unicode defines 12 ideographic description characters (U+2FF0 to U+2FFB), which describe layouts such as left-to-right, above-to-below, surrounding and overlaid arrangements.
Unicode does not define a unique IDS for each Chinese character; under current proposals, one Chinese character may be expressed by several IDSs [example sequences illegible in the source].
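The twelve description characters mentioned above can be listed directly from the Unicode character database that ships with Python. This is only a convenience sketch for inspecting them; the variable names `IDC_RANGE` and `idc_table` are introduced here for illustration.

```python
import unicodedata

# The twelve ideographic description characters referred to above occupy
# the code points U+2FF0 through U+2FFB.
IDC_RANGE = range(0x2FF0, 0x2FFC)

# Each entry: (code point label, the character itself, its official name).
idc_table = [
    (f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp))) for cp in IDC_RANGE
]

for code, char, name in idc_table:
    print(code, char, name)
```

The official names (for example, LEFT TO RIGHT or OVERLAID) indicate the spatial layout each operator describes.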
Referring to FIG. 2, the document OCR recognition result error correction method includes the following steps:
S10: acquiring an OCR recognition result of a document.
S20: preprocessing the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result.
Referring to FIG. 3, in an embodiment, step S20 further includes:
S21: acquiring the characters in the OCR recognition result;
S22: determining the Unicode code corresponding to each character according to the Unicode character list;
S23: generating the IDS representation corresponding to each character, wherein the IDS representation contains the glyph structure information of the character.
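Steps S21 to S23 can be sketched as follows. The lookup table `IDS_TABLE` and the function `preprocess` are hypothetical names introduced for this example; the source does not say where the IDS data comes from, and a real system would presumably load a complete IDS database rather than a two-entry dictionary.

```python
# Hypothetical IDS lookup table (assumption for this sketch only).
IDS_TABLE = {
    "好": "⿰女子",  # left-to-right: 女 + 子
    "明": "⿰日月",  # left-to-right: 日 + 月
}


def preprocess(text):
    """Return (character, Unicode code point, IDS) for each character.

    Characters missing from the table fall back to themselves, which is an
    assumption made for this sketch.
    """
    return [(ch, ord(ch), IDS_TABLE.get(ch, ch)) for ch in text]


records = preprocess("明好A")
for ch, code_point, ids in records:
    print(ch, f"U+{code_point:04X}", ids)
```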
The document OCR recognition result error correction method further includes the following step:
S30: obtaining the confidence corresponding to each character in the OCR recognition result.
The document OCR recognition result error correction method further includes the following step:
S40: splicing the Unicode code, the glyph structure information and the confidence to obtain a character code.
Referring to FIG. 4, in an embodiment, step S40 further includes:
S41: converting the Unicode code, the glyph structure information and the confidence into feature vectors respectively;
S42: splicing the feature vector of the Unicode code, the feature vector of the glyph structure information and the feature vector of the confidence to obtain the feature vector of the character code.
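Steps S41 and S42 can be illustrated with the minimal sketch below. The source does not state how each feature is turned into a vector, so the three encodings used here (a 21-bit binary expansion of the code point, a count of IDS operators, and the raw confidence as a one-element vector) are assumptions chosen only to make the concatenation concrete; a trained model would more likely use learned embeddings.

```python
IDC_START, IDC_COUNT = 0x2FF0, 12  # the 12 IDS operators U+2FF0..U+2FFB


def unicode_vector(ch, bits=21):
    """Binary expansion of the code point (21 bits cover all of Unicode)."""
    cp = ord(ch)
    return [float((cp >> i) & 1) for i in range(bits)]


def ids_vector(ids):
    """Count how often each of the 12 IDS operators occurs in the sequence."""
    vec = [0.0] * IDC_COUNT
    for ch in ids:
        offset = ord(ch) - IDC_START
        if 0 <= offset < IDC_COUNT:
            vec[offset] += 1.0
    return vec


def confidence_vector(conf):
    """Wrap the OCR confidence in a one-element vector."""
    return [float(conf)]


def character_code(ch, ids, conf):
    """S42: splice the three feature vectors into one character code."""
    return unicode_vector(ch) + ids_vector(ids) + confidence_vector(conf)


code = character_code("明", "⿰日月", 0.93)
print(len(code))
```

Keeping the three parts as separate segments of one vector lets the model weigh identity, shape and recognition certainty independently.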
The document OCR recognition result error correction method further includes the following step:
S50: acquiring point location information corresponding to each character in the OCR recognition result, and taking the point location information corresponding to each character as a position code.
Referring to FIG. 5, in an embodiment, step S50 further includes:
S51: acquiring the pixel values of the pixel points at specified relative positions of each character of the OCR recognition result on the document; for example, acquiring the pixel values of the pixel points at the upper left corner and the lower right corner of each character.
S52: taking the pixel value of each pixel point as the position code.
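The text speaks of the "pixel values" of corner points; the sketch below assumes this means the pixel coordinates of the upper left and lower right corners of each character box, which matches the later description of point location information. The function name `position_code` and the normalization by page size are further assumptions made for illustration.

```python
def position_code(box, page_width, page_height):
    """Build a position code from the corners of a character's bounding box.

    `box` is (x1, y1, x2, y2) in pixels: upper left and lower right corners.
    Scaling to the range [0, 1] is an assumption, made so that documents
    scanned at different resolutions stay comparable.
    """
    x1, y1, x2, y2 = box
    return [x1 / page_width, y1 / page_height, x2 / page_width, y2 / page_height]


pos = position_code((100, 50, 140, 90), page_width=1000, page_height=500)
print(pos)
```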
The document OCR recognition result error correction method further includes the following step:
S60: inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result.
Referring to FIG. 6, in a specific embodiment, the natural language processing model may be a BERT model, which may be obtained, for example, by training as follows:
S61: acquiring OCR recognition results of historical documents, manually marking the erroneous characters in them, and correcting the erroneous characters, to serve as a model training set;
S62: preprocessing the OCR recognition results of the historical documents to obtain the Unicode code and glyph structure information corresponding to each character in them;
S63: acquiring the confidence corresponding to each character in the OCR recognition results of the historical documents;
S64: splicing the Unicode code, the glyph structure information and the confidence of each character in the OCR recognition results of the historical documents to obtain the character codes of those results;
S65: acquiring point location information corresponding to each character in the OCR recognition results of the historical documents, and taking it as the position codes of those results;
S66: inputting the character codes and position codes of the OCR recognition results of the historical documents into a BERT pre-training model to obtain an error correction result, and calculating the loss value of the BERT pre-training model according to the manually marked erroneous characters;
S67: adjusting the parameters of the BERT pre-training model and repeating the above steps until the loss value is smaller than a preset threshold, to obtain the trained BERT model.
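The control flow of S66 and S67 (compute a loss, adjust parameters, repeat until the loss falls below a preset threshold) can be shown with a deliberately tiny stand-in model. The sketch below is not BERT: it is a one-feature logistic regression that flags a character as erroneous from its OCR confidence alone, and the samples, learning rate and threshold are all hypothetical values chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(w, b, samples):
    """Mean cross-entropy between predictions and manual error labels."""
    total = 0.0
    for x, y in samples:
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(samples)


def train(samples, threshold=0.3, lr=1.0, max_steps=10000):
    """Gradient descent that stops once the loss is below `threshold`."""
    w, b = 0.0, 0.0
    for _ in range(max_steps):
        if mean_loss(w, b, samples) < threshold:
            break
        n = len(samples)
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in samples) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in samples) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, mean_loss(w, b, samples)


# Hypothetical (confidence, is_error) pairs from manually marked OCR output.
samples = [(0.95, 0), (0.90, 0), (0.40, 1), (0.30, 1)]
w, b, final_loss = train(samples)
print(final_loss)
```

In the method described here the stand-in would be replaced by the BERT model, with the character codes and position codes as input and the manually corrected characters as targets.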
The method builds a complete OCR recognition result error correction system, and greatly improves the recognition accuracy.
The encoding used by the model is divided into two layers: the first layer is the combination of the Unicode code, the IDS representation and the OCR recognition confidence, and the second layer is the point location information of the OCR recognition result. These two parts are explained in detail below:
IDS codes record the writing structure and strokes of Chinese characters and contain complete glyph structure information, so the model can learn the problem of visually similar characters in OCR recognition. However, using IDS alone may cause problems; for example, two different characters with nearly identical glyph structures may not be distinguishable by their IDS representations alone. Unicode encoding is therefore additionally adopted to distinguish different characters of the same structure during encoding. Finally, the confidence with which OCR recognizes each individual character is also a very important additional feature, because characters that OCR recognizes incorrectly generally have lower confidence. Thus, the character code can be seen as the concatenation of the Unicode code, the IDS representation and the confidence.
In position coding, although a document does not carry much context, information such as the table header can provide some additional context. Therefore, unlike conventional BERT and with reference to the structure of SpanBERT, the position code uses the point location information of the upper left and lower right corners of each character. In the learning process, the model thus learns contextual information such as the headers of the document's table from the point location information.
The rest of the model structure follows the BERT pre-training model structure: the sum of the improved position code and the character code is used as the input of the BERT layers, and the output of the model is the predicted character at each position.
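Adding the two encodings requires them to have the same dimension. The sketch below assumes that the four-element position code is first projected to the character code's dimension by a weight matrix; that projection, the names `project` and `model_input`, and the toy numbers are assumptions for illustration, since the source only states that the two encodings are added.

```python
def project(vec, weights):
    """Multiply a vector by a weight matrix given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def model_input(char_code, pos_code, pos_weights):
    """Element-wise sum of the character code and the projected position code."""
    pos_embedding = project(pos_code, pos_weights)
    return [c + p for c, p in zip(char_code, pos_embedding)]


char_code = [0.5, -1.0, 2.0]          # toy 3-dimensional character code
pos_code = [0.1, 0.1, 0.14, 0.18]     # (x1, y1, x2, y2), normalized
pos_weights = [                       # hypothetical 3x4 projection matrix
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
x = model_input(char_code, pos_code, pos_weights)
print(x)
```

In a trained model the projection weights would be learned together with the other parameters.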
When constructing training data, OCR results from actual use are adopted; the erroneous characters in them are manually marked and corrected to serve as the model training set. The training objective is to predict the errors in the actual OCR output text and provide corrections.
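One way to represent such a training example is sketched below. The function `build_example`, the dictionary layout and the sample strings are hypothetical; the sketch also assumes substitution errors only, so that the OCR text and its manual correction have the same length.

```python
def build_example(ocr_text, corrected_text):
    """Pair raw OCR output with its manual correction.

    Assumes substitution errors only, so both strings have equal length.
    """
    if len(ocr_text) != len(corrected_text):
        raise ValueError("sketch handles same-length substitutions only")
    error_flags = [int(a != b) for a, b in zip(ocr_text, corrected_text)]
    return {"input": ocr_text, "target": corrected_text, "error_flags": error_flags}


# Hypothetical sample: the last character was misread as a similar shape.
example = build_example("门诊收赞", "门诊收费")
print(example["error_flags"])
```

The error flags mark the manually identified erroneous characters, and the target string carries their corrections.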
It should be noted that the division of the above methods into steps is for clarity of description only. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, they fall within the protection scope of this patent. Adding insignificant modifications to the algorithm or flow, or introducing insignificant designs, without changing the core design of the algorithm and flow also falls within the protection scope of this patent.
Fig. 7 is a functional block diagram showing a preferred embodiment of the document OCR recognition result error correction device according to the present invention. The document OCR recognition result error correction device comprises: the first acquisition module 10, the first processing module 20, the second processing module 30, the third processing module 40, the second acquisition module 50, and the execution module 60. The modules of the present invention and sub-modules described below refer to a series of computer program segments capable of being executed by the processor 100 and performing a fixed function, which are stored in the memory 200.
The first acquiring module 10 is configured to acquire an OCR recognition result of a document; the first processing module 20 is configured to pre-process the OCR recognition result to obtain unified code and font structure information corresponding to each word in the OCR recognition result; the second processing module 30 is configured to obtain a confidence level corresponding to each text in the OCR recognition result; the third processing module 40 is configured to splice the unicode, the glyph structural information and the confidence coefficient to obtain a character code; the second obtaining module 50 is configured to obtain point location information corresponding to each word in the OCR recognition result, and take the point location information corresponding to each word as a position code; the execution module 60 is configured to input the character code and the position code into a natural language processing model, and obtain a corrected text result.
Note that, the document OCR recognition result error correction device of this embodiment is a device corresponding to the above-described document OCR recognition result error correction method, and the function modules in the document OCR recognition result error correction device or the respective steps in the document OCR recognition result error correction method correspond to each other. The document OCR recognition result error correction device of the embodiment can be implemented in cooperation with the document OCR recognition result error correction method. Accordingly, the related technical details mentioned in the document OCR recognition result error correction device of the present embodiment may also be applied to the above-mentioned document OCR recognition result error correction method.
It should be noted that the above functional modules may be fully or partially integrated into one physical entity, or may be physically separate. The modules may all be implemented as software called by a processing element, or all be implemented in hardware; alternatively, some modules may be implemented as software called by a processing element while the others are implemented in hardware. In addition, all or some of the modules may be integrated together, or they may be implemented independently. The processing element described herein may be an integrated circuit with signal processing capability. In implementation, some or all of the steps of the above method, or the above functional modules, may be carried out by integrated logic circuits of hardware in the processing element or by instructions in the form of software.
Fig. 8 is a schematic structural diagram of an electronic device according to a preferred embodiment of the present invention for implementing the document OCR recognition result error correction method.
The electronic device may include a memory 200, a processor 100, and a bus, and may also include a computer program stored in the memory and executable on the processor, such as a document OCR recognition result error correction program.
The memory includes at least one type of readable storage medium, including flash memory, a removable hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments the memory may be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. In other embodiments the memory may be an external storage device of the electronic device, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device. Further, the memory may include both an internal storage unit and an external storage device of the electronic device. The memory may be used not only to store application software installed in the electronic device and various kinds of data, such as the code of the document OCR recognition result error correction program, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor may consist of integrated circuits, for example a single packaged integrated circuit, or multiple packaged integrated circuits with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor is the control unit (Control Unit) of the electronic device: it connects the components of the entire electronic device through various interfaces and lines, and it performs the functions of the electronic device and processes data by running or executing the programs or modules stored in the memory (for example, the document OCR recognition result error correction program) and by calling data stored in the memory.
The processor runs the operating system of the electronic device and the installed application programs. The processor executes the application program to implement the steps in the embodiments of the document OCR recognition result error correction method described above, for example, the steps shown in the figure.
For example, the computer program may be divided into one or more modules, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules may be series of computer program instruction segments capable of performing specified functions, and they describe the execution of the computer program in the electronic device. For example, the computer program may be divided into a first acquisition module, a first processing module, a second processing module, a third processing module, a second acquisition module, and an execution module.
The integrated units described above, when implemented in the form of software functional modules, may be stored in a computer-readable storage medium. The software functional modules are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a computer device, a network device, or the like) or a processor to execute part of the functions of the document OCR recognition result error correction method according to the embodiments of the present invention.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains a batch of network transaction information that is used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
The bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, and so on. For ease of illustration, only one arrow is shown in Fig. 8, but this does not mean that there is only one bus or only one type of bus. The bus is arranged to enable connection and communication between the memory, the at least one processor, and other components.
The method provides a solution for correcting errors in document OCR recognition results. By combining upstream model results, such as the OCR point location information and the confidence of the recognized text, with the Unicode code and the IDS representation as NLP information, it effectively addresses the sharp drop in performance that current models suffer on documents because the available context is too short.
Combining the IDS glyph information with the OCR position information supplies as much additional information as possible to help the model correct errors. The IDS glyph information also allows a more accurate result to be given for the OCR recognition result, and constructing suitable training data and pre-training tasks greatly improves the OCR results for document recognition.
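The source does not spell out how the training data is built beyond manually marking and correcting erroneous characters in historical OCR results. As a hedged illustration of that single step, the sketch below derives per-character error labels from an OCR string and its manually corrected counterpart. The function name `build_training_sample` and the restriction to same-length substitution errors are assumptions of this sketch.

```python
from typing import List, Tuple


def build_training_sample(ocr_text: str, corrected_text: str) -> List[Tuple[str, str, bool]]:
    """Pair each OCR character with its corrected character and an is-error flag.

    Assumption of this sketch: OCR errors are one-for-one substitutions, so the
    two strings have the same length. Insertions and deletions would need a
    proper alignment step that is not shown here.
    """
    if len(ocr_text) != len(corrected_text):
        raise ValueError("this sketch only handles same-length substitution errors")
    return [(o, c, o != c) for o, c in zip(ocr_text, corrected_text)]
```

For example, pairing the OCR output "发栗金额" with the corrected text "发票金额" flags only the second character as an error, which is the kind of supervision a loss computed on manually marked characters would use.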
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention.
Claims (10)
1. A document OCR recognition result error correction method, characterized by comprising the following steps:
acquiring an OCR recognition result of a document;
preprocessing the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result;
acquiring the confidence corresponding to each character in the OCR recognition result;
concatenating the Unicode code, the glyph structure information, and the confidence to obtain a character code;
acquiring the point location information corresponding to each character in the OCR recognition result, and using the point location information corresponding to each character as a position code;
and inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result.
2. The document OCR recognition result error correction method according to claim 1, wherein the step of preprocessing the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result comprises:
acquiring the characters in the OCR recognition result;
determining the Unicode code corresponding to each character according to the Unicode character table;
and generating, for each character, the corresponding IDS representation, wherein the IDS representation contains the glyph structure information of the character.
3. The document OCR recognition result error correction method according to claim 1, wherein the step of concatenating the Unicode code, the glyph structure information, and the confidence to obtain a character code comprises:
converting the Unicode code, the glyph structure information, and the confidence into feature vectors respectively;
and concatenating the feature vector of the Unicode code, the feature vector of the glyph structure information, and the feature vector of the confidence to obtain the feature vector of the character code.
4. The document OCR recognition result error correction method according to claim 1, wherein the step of acquiring the point location information corresponding to each character in the OCR recognition result and using the point location information corresponding to each character as a position code comprises:
acquiring the pixel values of the pixel points on the document that are located at specified positions relative to each character in the OCR recognition result;
and using the pixel value of each pixel point as the position code.
5. The document OCR recognition result error correction method according to claim 4, wherein the step of acquiring the pixel values of the pixel points on the document that are located at specified positions relative to each character in the OCR recognition result comprises:
acquiring the pixel values of the pixel points on the document that are located at the upper-left corner and the lower-right corner of each character in the OCR recognition result.
6. The document OCR recognition result error correction method according to claim 1, wherein the step of inputting the character code and the position code into a natural language processing model to obtain an error-corrected text result comprises:
inputting the character code and the position code into a BERT model to obtain the error-corrected text result.
7. The document OCR recognition result error correction method according to claim 6, wherein the BERT model is obtained by training through the following steps:
acquiring an OCR recognition result of a historical document, manually marking the erroneous characters in the OCR recognition result of the historical document, and correcting the erroneous characters, to serve as a model training set;
preprocessing the OCR recognition result of the historical document to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result of the historical document;
acquiring the confidence corresponding to each character in the OCR recognition result of the historical document;
concatenating the Unicode code, the glyph structure information, and the confidence of each character in the OCR recognition result of the historical document to obtain the character code of the OCR recognition result of the historical document;
acquiring the point location information corresponding to each character in the OCR recognition result of the historical document, and using it as the position code of the OCR recognition result of the historical document;
inputting the character code and the position code of the OCR recognition result of the historical document into a BERT pre-training model to obtain an error correction result, and calculating a loss value of the BERT pre-training model according to the manually marked erroneous characters;
and adjusting the parameters of the BERT pre-training model and repeating the above steps until the loss value is smaller than a preset threshold, to obtain the trained BERT model.
8. A document OCR recognition result error correction device, comprising:
a first acquisition module, configured to acquire an OCR recognition result of a document;
a first processing module, configured to preprocess the OCR recognition result to obtain the Unicode code and glyph structure information corresponding to each character in the OCR recognition result;
a second processing module, configured to obtain the confidence corresponding to each character in the OCR recognition result;
a third processing module, configured to concatenate the Unicode code, the glyph structure information, and the confidence to obtain a character code;
a second acquisition module, configured to acquire the point location information corresponding to each character in the OCR recognition result and to use the point location information corresponding to each character as a position code;
and an execution module, configured to input the character code and the position code into a natural language processing model to obtain an error-corrected text result.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
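As an illustration of the concatenation step recited in claim 3, the following Python sketch converts the Unicode code, the IDS glyph structure, and the confidence into separate feature vectors and joins them. The encodings chosen here (a 21-bit binary expansion of the code point, a count of IDS layout operators, and a one-element confidence vector) are assumptions made so that the example is self-contained; the claims do not specify how each vector is produced, and a trained model would more likely use learned embeddings.

```python
from typing import List

# The twelve classic IDS layout operators occupy U+2FF0..U+2FFB.
IDS_OPERATORS: List[str] = [chr(cp) for cp in range(0x2FF0, 0x2FFC)]


def unicode_vector(ch: str, bits: int = 21) -> List[float]:
    """Binary expansion of the code point, least significant bit first.

    21 bits are enough for every Unicode code point (maximum U+10FFFF).
    """
    cp = ord(ch)
    return [float((cp >> i) & 1) for i in range(bits)]


def glyph_vector(ids: str) -> List[float]:
    """Count of each IDS layout operator: a crude stand-in for a glyph embedding."""
    return [float(ids.count(op)) for op in IDS_OPERATORS]


def confidence_vector(confidence: float) -> List[float]:
    """The OCR confidence as a one-element vector."""
    return [float(confidence)]


def character_code_vector(ch: str, ids: str, confidence: float) -> List[float]:
    """Concatenate the three feature vectors into one character-code vector."""
    return unicode_vector(ch) + glyph_vector(ids) + confidence_vector(confidence)
```

With these assumed encodings, `character_code_vector("明", "\u2ff0日月", 0.93)` returns a vector of 21 + 12 + 1 = 34 values; plain list concatenation keeps each source feature in a fixed, identifiable slice of the result.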
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310469838.4A CN116484844A (en) | 2023-04-24 | 2023-04-24 | Document OCR recognition result error correction method, system, equipment and medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116484844A (en) | 2023-07-25 |
Family ID: 87221158
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310469838.4A Pending CN116484844A (en) | 2023-04-24 | 2023-04-24 | Document OCR recognition result error correction method, system, equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116484844A (en) |
- 2023-04-24: CN application CN202310469838.4A (CN116484844A, en), active, Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||