CN113936130A - Document information intelligent acquisition and error correction method, system and equipment based on OCR technology - Google Patents

Document information intelligent acquisition and error correction method, system and equipment based on OCR technology

Info

Publication number
CN113936130A
Authority
CN
China
Prior art keywords
text
bulletin
target
error correction
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111151913.XA
Other languages
Chinese (zh)
Inventor
肖甜
徐从洋
刘大航
杨忱宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weikun Shanghai Technology Service Co Ltd
Original Assignee
Weikun Shanghai Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weikun Shanghai Technology Service Co Ltd
Priority to CN202111151913.XA
Publication of CN113936130A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Discrimination (AREA)

Abstract

The invention relates to the field of artificial intelligence, and provides an OCR technology-based document information intelligent acquisition and error correction method, which comprises the following steps: acquiring a bulletin file, and analyzing the bulletin file to obtain a bulletin text; obtaining the confidence of each character in the bulletin text through a pre-trained language representation model; judging whether the bulletin text is a preset type text or not; if the bulletin text is not the preset type text, inputting the bulletin text and the confidence degrees of the characters into a preset OCR (optical character recognition) error correction model, and performing error correction operation on the characters with the confidence degrees lower than a preset threshold value in the bulletin text through the preset OCR error correction model to obtain an error-corrected target bulletin text; matching a plurality of target element values from a pre-configured information element table according to the target bulletin text; and extracting a plurality of bulletin data from the target bulletin text according to the plurality of target element values. The invention improves the efficiency and the accuracy of acquiring the required bulletin data from the bulletin file.

Description

Document information intelligent acquisition and error correction method, system and equipment based on OCR technology
Technical Field
The embodiment of the invention relates to the field of artificial intelligence, in particular to a method, a system and equipment for intelligently acquiring and correcting document information based on an OCR technology.
Background
Some platforms publish bulletin documents for the products they sell; financial platforms, for example, publish fund-related bulletins. Acquiring the important information in these bulletins quickly and accurately matters. With the rapid development of internet technology, the amount of information on the network is growing explosively. It is difficult for users to obtain all product element information simply by reading the bulletins published by each financial platform, and manually retrieving product element information from bulletins is prone to omissions and entry errors. Existing tools for extracting product element information from bulletins often suffer from low accuracy and low efficiency. How to improve the extraction accuracy and efficiency of such tools has therefore become a technical problem that urgently needs to be solved.
Disclosure of Invention
In view of the above, it is necessary to provide an OCR technology-based document information intelligent acquisition and error correction method, system, device and readable storage medium, so as to solve the problems of low extraction accuracy and low extraction efficiency of extraction tools for product element information in various announcements in the prior art.
In order to achieve the above object, an embodiment of the present invention provides a document information intelligent acquisition and error correction method based on an OCR technology, where the method includes:
acquiring a bulletin file, and analyzing the bulletin file to obtain a bulletin text;
obtaining the confidence of each character in the bulletin text through a pre-trained language representation model;
judging whether the announcement text is a preset type text or not;
if the bulletin text is not the preset type text, inputting the bulletin text and the confidence degrees of the characters into a preset OCR (optical character recognition) error correction model, and performing error correction operation on the characters with the confidence degrees lower than a preset threshold value in the bulletin text through the preset OCR error correction model to obtain an error-corrected target bulletin text;
matching a plurality of target element values from a pre-configured information element table according to the target bulletin text; and
extracting a plurality of bulletin data from the target bulletin text according to the plurality of target element values.
Optionally, the step of obtaining the bulletin file and analyzing the bulletin file to obtain the bulletin text includes:
performing a format conversion operation on the bulletin file to obtain a bulletin file in a picture format; and
extracting the text content of the bulletin file in the picture format based on a character recognition technology, and generating the bulletin text according to the text content.
Optionally, the method further includes:
and if the bulletin text is the preset type text, performing error correction operation on the bulletin text according to a preset keyword table to obtain the error-corrected target bulletin text.
Optionally, if the bulletin text is the preset type text, the step of performing error correction operation on the bulletin text according to a preset keyword table to obtain the error-corrected target bulletin text includes:
acquiring a plurality of target characters with confidence degrees lower than a preset threshold value from the bulletin text;
extracting a plurality of candidate words from the bulletin text according to the plurality of target characters; and
performing an error correction operation on the candidate words according to the preset keyword table to obtain the error-corrected target bulletin text.
Optionally, the step of extracting a plurality of bulletin data from the target bulletin text according to the plurality of target element values includes:
acquiring the element corresponding to each target element value from the information element table according to the plurality of target element values and a preset matching rule to obtain a plurality of target elements; and
extracting a plurality of bulletin data from the target bulletin text according to the plurality of target elements.
Optionally, the step of obtaining the elements corresponding to each target element value from the information element table according to the plurality of target element values and a preset matching rule to obtain the plurality of target elements includes:
converting the plurality of target element values into a plurality of regular expressions according to the matching rule; and
acquiring the element corresponding to each target element value from the information element table according to the regular expressions to obtain the plurality of target elements.
Optionally, the method further includes: uploading the plurality of bulletin data to a blockchain.
In order to achieve the above object, an embodiment of the present invention further provides an OCR technology-based document information intelligent acquisition and error correction system, including:
the analysis module is used for acquiring the bulletin files and analyzing the bulletin files to obtain bulletin texts;
the acquisition module is used for acquiring the confidence coefficient of each character in the bulletin text through a pre-trained language representation model;
the judging module is used for judging whether the bulletin text is a preset type text;
the input module is used for inputting the bulletin text and the confidence coefficient of each character into a preset OCR (optical character recognition) error correction model if the bulletin text is not the preset type text, so that the characters with the confidence coefficient lower than a preset threshold value in the bulletin text are subjected to error correction operation through the preset OCR error correction model to obtain an error-corrected target bulletin text;
the matching module is used for matching a plurality of target element values from a pre-configured information element table according to the target bulletin text; and
the extraction module is used for extracting a plurality of bulletin data from the target bulletin text according to the plurality of target element values.
In order to achieve the above object, an embodiment of the present invention further provides a computer device, where the computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the computer program, when executed by the processor, implements the steps of the document information intelligent acquisition and error correction method based on the OCR technology.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, where the computer program is executable by at least one processor, so as to cause the at least one processor to execute the steps of the OCR technology-based document information intelligent acquisition and error correction method.
According to the OCR technology-based document information intelligent acquisition and error correction method, system, computer device and computer-readable storage medium provided by the embodiments of the present invention, a bulletin file whose content cannot be read directly is parsed, the parsed bulletin text is error-corrected to obtain a corrected bulletin text, and the required bulletin data is then located in the corrected bulletin text by means of the information element table. This improves the efficiency and accuracy of acquiring the required bulletin data from the bulletin file, and solves the prior-art problems of low extraction accuracy and low extraction efficiency in tools that extract product element information from bulletins.
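The overall flow summarized above can be sketched as a small orchestration function. This is a minimal sketch, not the patented implementation: every stage is passed in as a callable, and all parameter names (`parse`, `score`, `is_preset_type`, and so on) are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List


def run_pipeline(
    file_bytes: bytes,
    parse: Callable[[bytes], str],
    score: Callable[[str], List[float]],
    is_preset_type: Callable[[str], bool],
    keyword_correct: Callable[[str, List[float]], str],
    model_correct: Callable[[str, List[float]], str],
    match_values: Callable[[str], List[str]],
    extract: Callable[[str, List[str]], Dict[str, str]],
) -> Dict[str, str]:
    """Chain the stages: parse, score, correct, match, extract."""
    text = parse(file_bytes)                # bulletin file -> bulletin text
    confidences = score(text)               # one confidence per character
    if is_preset_type(text):
        # Preset type text: keyword-table correction.
        target_text = keyword_correct(text, confidences)
    else:
        # Any other text: OCR error correction model.
        target_text = model_correct(text, confidences)
    element_values = match_values(target_text)
    return extract(target_text, element_values)
```

Passing the stages as callables keeps the sketch independent of any particular OCR engine or language model, which the text does not fix.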
Drawings
FIG. 1 is a schematic flow chart of a first embodiment of the document information intelligent acquisition and error correction method based on OCR technology according to the present invention;
FIG. 2 is a schematic diagram of the program modules of a second embodiment of the document information intelligent acquisition and error correction system based on OCR technology according to the present invention;
FIG. 3 is a schematic diagram of the hardware structure of a third embodiment of the computer device according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that descriptions involving "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided that a person skilled in the art can implement the combination; where a combination of technical solutions is contradictory or cannot be implemented, the combination should be considered not to exist and falls outside the protection scope of the present invention.
Example one
Referring to fig. 1, a flowchart illustrating steps of an OCR technology based document information intelligent acquisition and error correction method according to an embodiment of the present invention is shown. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. The intelligent document information acquisition and error correction system based on the OCR technology in the present embodiment may be implemented in the computer device 2, and the following description is made by taking the computer device 2 as an execution subject. The details are as follows.
Step S100, obtaining the bulletin file, and analyzing the bulletin file to obtain a bulletin text.
The bulletin file may be a fund bulletin published by a financial platform such as a securities broker, and may contain information such as fund dividends, fee-rate discounts and fund changes. The computer device 2 may download the bulletin file, which is typically a PDF file, from the financial platform that published the target bulletin. After the bulletin file is acquired, the computer device 2 may parse the information in it to obtain a bulletin text.
In an exemplary embodiment, step S100 further includes steps S200 to S202. Step S200: performing a format conversion operation on the bulletin file to obtain a bulletin file in a picture format. Step S202: extracting the text content of the bulletin file in the picture format based on a character recognition technology, and generating the bulletin text according to the text content. In this embodiment, character recognition may be performed on the bulletin file by OCR (optical character recognition), and a bulletin text corresponding to the target bulletin is generated. The bulletin text may be an editable document, such as a Word document. Converting the bulletin file into a picture format and recognizing its text content by OCR improves the efficiency of extracting information from the bulletin file.
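Steps S200 and S202 can be sketched as follows. This is a minimal sketch under stated assumptions: the text names no rendering or OCR library, so both are injected as callables, and `render_pages` and `recognize` are hypothetical names.

```python
from typing import Callable, Iterable

# Assumption: a page image is represented as raw bytes; a real backend
# would more likely return an image object.
PageImage = bytes


def parse_bulletin(
    file_bytes: bytes,
    render_pages: Callable[[bytes], Iterable[PageImage]],
    recognize: Callable[[PageImage], str],
) -> str:
    """Convert a bulletin file to page images, then OCR each page."""
    pages = render_pages(file_bytes)                       # step S200
    page_texts = [recognize(img).strip() for img in pages]  # step S202
    # Join non-empty pages into one bulletin text.
    return "\n".join(t for t in page_texts if t)
```

In practice `render_pages` would wrap a PDF rasterizer and `recognize` an OCR engine; the sketch deliberately leaves both open.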
It should be noted that existing approaches to extracting information from bulletin files are often inefficient and inaccurate, and require manual rechecking after extraction. For example, after the bulletin information is acquired, a service person enters it into the product system and notifies a rechecker offline; the information is displayed to users only after the rechecker finds no problems. This process is time-consuming and labor-intensive, and manually retrieving product element information from a bulletin is prone to omissions and entry errors. To address the low efficiency and low accuracy of information extraction, this embodiment may further perform error correction on the bulletin text, as follows:
Step S102, obtaining the confidence of each character in the bulletin text through a pre-trained language representation model.
This embodiment may further obtain a single-character confidence for each character in the bulletin text through a pre-trained BERT (Bidirectional Encoder Representations from Transformers) language representation model. For example, the bulletin text may be input into the BERT model, which predicts the character at each position of the input sentence and thereby yields the single-character confidence of the character at each position in the bulletin text.
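As a rough illustration of a single-character confidence, the sketch below assumes the model exposes one score (logit) per vocabulary character at each position, and takes the confidence to be the softmax probability assigned to the character actually observed. The logits are hypothetical inputs; no real BERT model is loaded here.

```python
import math
from typing import Dict, List


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over a character -> logit mapping."""
    peak = max(logits.values())
    exps = {ch: math.exp(v - peak) for ch, v in logits.items()}
    total = sum(exps.values())
    return {ch: e / total for ch, e in exps.items()}


def char_confidences(
    text: str, logits_per_position: List[Dict[str, float]]
) -> List[float]:
    """Probability the model assigns to each observed character."""
    if len(text) != len(logits_per_position):
        raise ValueError("need exactly one logit mapping per character")
    confidences = []
    for ch, logits in zip(text, logits_per_position):
        # A character outside the (toy) vocabulary gets confidence 0.0.
        confidences.append(softmax(logits).get(ch, 0.0))
    return confidences
```

A character the model finds implausible in its context receives a low probability, which is what the later thresholding step relies on.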
Step S104, judging whether the bulletin text is a preset type text.
The preset type text may be a bill-type text, that is, a short text covering a narrow range of content.
Step S106, if the bulletin text is not the preset type text, inputting the bulletin text and the confidence of each character into a preset OCR error correction model, and performing an error correction operation on the characters whose confidence is lower than a preset threshold in the bulletin text through the preset OCR error correction model to obtain an error-corrected target bulletin text.
Because the bulletin text recognized by OCR may contain errors, this embodiment further provides a preconfigured OCR error correction model to perform error correction on the bulletin text. For example, the error correction operation may be performed by a pre-trained error correction model, which may be an OCR error corrector configured in advance based on an OCR error correction algorithm. Specifically, after the bulletin text and the confidence of each character are obtained, they may be input into the error correction model, which performs error detection on the bulletin text, performs error correction where an error is detected, and outputs the corrected bulletin text.
For convenience of understanding, the present embodiment further provides a specific example of an error detection process of an error correction model, which specifically includes:
For example:
Input:
text = ['only see Yangtze river interplanetary flow']
probs = [[0.99, 0.99, 0.99, 0.99, 0.56, 0.99, 0.99]]
Output:
text_corrected = ['only view the day stream of the Yangtze river']
In this embodiment, obtaining the single-character confidence of each character in the bulletin text reduces the error detection work of the OCR error correction model and improves both the accuracy and the speed of error correction. Practice shows that, with a configured confidence threshold, the error detection recall rate can reach 100% and the error correction rate is 0.1; without this approach there is a risk of erroneous corrections.
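The role of the confidence threshold in the example above can be sketched as a simple filter over the `probs` values. The threshold of 0.9 is an assumption made for illustration; the text does not state the value used.

```python
from typing import List


def low_confidence_positions(probs: List[float], threshold: float) -> List[int]:
    """Return the indices of characters whose confidence is below threshold."""
    return [i for i, p in enumerate(probs) if p < threshold]


# Confidences taken from the example; 0.9 is a hypothetical threshold.
probs = [0.99, 0.99, 0.99, 0.99, 0.56, 0.99, 0.99]
flagged = low_confidence_positions(probs, threshold=0.9)
print(flagged)  # [4]
```

Only the flagged positions are handed to the corrector, so confidently recognized characters are never rewritten.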
In an exemplary embodiment, the OCR technology-based document information intelligent acquisition and error correction method further includes step S300: if the bulletin text is the preset type text, performing an error correction operation on the bulletin text according to a preset keyword table to obtain the error-corrected target bulletin text. To further improve the efficiency of correcting bulletin texts, this embodiment may apply different error correction operations to different types of bulletin text; specifically, certain special texts (the preset type texts) may be corrected by a keyword-based correction technique. For example, for bill-type text, a dictionary of similar-shaped characters may be established in advance, and the preset keyword table may be configured according to that dictionary so that the bill-type text is corrected according to the preset keyword table. The keyword table includes a plurality of similar-shaped characters, and the error correction range is limited to those characters. By judging whether the bulletin text is a preset type text and configuring a corresponding preset keyword table for that type, this embodiment improves the error correction efficiency for special texts (the preset type texts).
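One way to read "the error correction range is limited to similar-shaped characters" is sketched below: substitutions are tried only at low-confidence positions, only with characters listed as similar in shape, and a variant is accepted only if it appears in the keyword table. The Latin-script dictionary and keywords are hypothetical stand-ins for the Chinese tables the text implies.

```python
from itertools import product
from typing import Dict, Set


def correct_with_keywords(
    word: str,
    low_positions: Set[int],
    similar: Dict[str, Set[str]],
    keywords: Set[str],
) -> str:
    """Return the first similar-shape variant of word found in keywords."""
    if word in keywords:
        return word
    options = []
    for i, ch in enumerate(word):
        if i in low_positions:
            # Original character first, then its similar-shaped alternatives.
            options.append([ch] + sorted(similar.get(ch, set())))
        else:
            options.append([ch])  # confident characters are left untouched
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate in keywords:
            return candidate
    return word  # no keyword matched: leave the word unchanged


# Hypothetical similar-shape dictionary and keyword table.
SIMILAR = {"1": {"i", "l"}, "0": {"o"}}
KEYWORDS = {"dividend", "discount"}
```

Restricting substitutions this way keeps the search small for short bill-type texts, at the cost of missing errors that are not shape confusions.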
In an exemplary embodiment, the step S300 further includes a step S400 to a step S404, where in the step S400, a plurality of target characters with confidence levels lower than a preset threshold are acquired from the bulletin text; step S402, extracting a plurality of candidate words from the bulletin text according to the target characters; and step S404, performing error correction operation on the candidate words according to the preset keyword list to obtain the error-corrected target bulletin text. In this embodiment, the error correction operation is performed on the plurality of candidate words through the preset keyword table, so that the error correction efficiency of the bulletin text is further improved.
For ease of understanding: in this embodiment the confidence of each character in the bulletin text is obtained through a pre-trained language representation model (a BERT model, Bidirectional Encoder Representations from Transformers). The confidence of each character is the probability value output by the softmax layer of the BERT model, and the target characters whose confidence is lower than the preset threshold can be obtained from the bulletin text according to these probability values. In some embodiments, extracting a plurality of candidate words from the bulletin text may include two steps: first, determining candidate characters through an MLM (Masked Language Model); second, filtering the candidate characters through a CSD (character similarity decoder). The first step may serve as the encoder part, for which the masked language model may be a trained BERT model; the second step may serve as the decoder part, for which the decoder may be the CSD. The character similarity decoder may improve the confidence of the context.
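The two-step candidate selection can be illustrated with a toy filter. This is a sketch under assumptions, not the decoder described in the text: the masked language model output is given as a hypothetical character-to-probability mapping, and character similarity comes from a small hand-written table.

```python
from typing import Callable, Dict, List

# Hypothetical similarity scores for visually confusable character pairs.
CONFUSABLE = {
    frozenset("0O"): 0.9,
    frozenset("1l"): 0.9,
    frozenset("5S"): 0.8,
}


def toy_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; identical characters score 1.0."""
    if a == b:
        return 1.0
    return CONFUSABLE.get(frozenset((a, b)), 0.0)


def filter_candidates(
    original: str,
    mlm_candidates: Dict[str, float],
    similarity: Callable[[str, str], float],
    min_similarity: float,
) -> List[str]:
    """Keep candidates similar to the original character, ranked by MLM probability."""
    kept = [
        (ch, p)
        for ch, p in mlm_candidates.items()
        if similarity(original, ch) >= min_similarity
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [ch for ch, _ in kept]
```

The filter discards candidates that fit the context but look nothing like the recognized character, since an OCR error usually preserves the visual shape.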
After obtaining the target characters, the computer device 2 may extract a plurality of candidate words from the bulletin text by a BK-tree (Burkhard-Keller tree) fuzzy-matching algorithm, determine the positions of the candidate words in the bulletin text, and finally match and replace the candidate words according to the preset keywords in the preset keyword table to obtain the error-corrected target bulletin text.
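A BK-tree supports fuzzy lookup by indexing words under an edit distance. The sketch below assumes the preset keywords are indexed in the tree and each candidate word is used as a query; the text does not specify the distance metric, so Levenshtein distance is an assumption.

```python
from typing import List, Optional, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, children), children keyed by distance to the node."""

    def __init__(self) -> None:
        self.root: Optional[tuple] = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return  # word already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query: str, max_dist: int) -> List[Tuple[int, str]]:
        """Return (distance, word) pairs within max_dist, closest first."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = edit_distance(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only these subtrees can contain matches.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)
```

For instance, after adding the hypothetical keywords "dividend", "discount" and "redemption", `search("divldend", 1)` returns the single match "dividend" at distance 1.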
In some embodiments, after the corrected target bulletin text is obtained, the confidence of each character in the target bulletin text can be computed again; if the target bulletin text still contains characters whose confidence is lower than the preset threshold, the error correction operation is repeated on the target bulletin text until the confidence of every character is higher than the preset threshold.
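The repeated correction can be sketched as a loop over hypothetical `score` and `correct` callables. The `max_rounds` guard and the early exit when a round changes nothing are assumptions added so the loop always terminates; the text only states the stopping condition on confidence.

```python
from typing import Callable, List


def correct_until_confident(
    text: str,
    score: Callable[[str], List[float]],
    correct: Callable[[str, List[float]], str],
    threshold: float,
    max_rounds: int = 5,
) -> str:
    """Re-score and re-correct until no character is below threshold."""
    for _ in range(max_rounds):
        confidences = score(text)
        if all(c >= threshold for c in confidences):
            break  # every character is confident enough
        corrected = correct(text, confidences)
        if corrected == text:
            break  # the corrector made no progress; stop early
        text = corrected
    return text
```

Without such a guard, a character the model never becomes confident about would keep the loop running indefinitely.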
Step S108, matching a plurality of target element values from a pre-configured information element table according to the target bulletin text.
The information element table includes a plurality of elements. This embodiment may configure the information element table according to the data to be acquired, with an element value configured for each item of data (element) to be acquired. It should be noted that the elements (bulletin data) extracted from the bulletin text may be specialized attribute terms of the product. These terms are fixed and generally do not change, so they can be configured in the information element table in advance, with one element value (key) configured for each attribute term.
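A minimal sketch of an information element table and of step S108 follows, assuming the table maps each element value (key) to one attribute term and that a key matches when its term occurs in the corrected text. The keys and terms are hypothetical examples.

```python
from typing import Dict, List

# Hypothetical information element table: element value (key) -> attribute term.
ELEMENT_TABLE: Dict[str, str] = {
    "dividend_per_share": "Dividend per share",
    "record_date": "Record date",
    "subscription_fee_rate": "Subscription fee rate",
}


def match_element_values(text: str, table: Dict[str, str]) -> List[str]:
    """Return the keys whose attribute term appears in the bulletin text."""
    lowered = text.lower()
    return [key for key, term in table.items() if term.lower() in lowered]
```

Because the attribute terms are fixed, the table can be maintained as configuration rather than code.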
Step S110, extracting a plurality of announcement data from the target announcement text according to the plurality of target element values.
Once the corrected bulletin text is obtained, a plurality of bulletin data may be extracted from it according to the element values: when elements need to be extracted from the bulletin text, the plurality of bulletin data is extracted from the target bulletin text according to the plurality of target element values.
In an exemplary embodiment, step S110 further includes steps S500 to S502. Step S500: acquiring the element corresponding to each target element value from the information element table according to the plurality of target element values and a preconfigured matching rule, to obtain a plurality of target elements. Step S502: extracting a plurality of bulletin data from the target bulletin text according to the plurality of target elements. Because some attribute terms may be in English or another language, this embodiment may also configure a single unified key for the same term in multiple languages, so that all values returned for that key during retrieval are treated as values of the same attribute. The multi-language configuration of an element can be extended in real time as needed. By acquiring the element corresponding to each target element value from the information element table and extracting the bulletin data from the target bulletin text according to the target elements, this embodiment improves the accuracy of extracting bulletin data from the target bulletin text.
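The unified key for a term in several languages can be sketched as a reverse index from term to key. The class name, keys and terms below are hypothetical; the text does not describe how the configuration is stored.

```python
from typing import Dict, Iterable, Optional


class ElementAliases:
    """Maps attribute terms in any language to one unified key."""

    def __init__(self) -> None:
        self._key_by_term: Dict[str, str] = {}

    def add(self, key: str, terms: Iterable[str]) -> None:
        """Register terms for a key; can be called at any time to add more."""
        for term in terms:
            self._key_by_term[term.casefold()] = key

    def key_for(self, term: str) -> Optional[str]:
        """Return the unified key for a term, or None if unknown."""
        return self._key_by_term.get(term.casefold())


aliases = ElementAliases()
# Hypothetical Chinese and English terms for the same attribute.
aliases.add("record_date", ["权益登记日", "Record date"])
```

Adding a further language later is a single `add` call with the same key, which matches the idea of extending the configuration as needed.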
In an exemplary embodiment, the step S500 further includes a step S600 to a step S602, where the step S600 converts the plurality of target element values into a plurality of regular expressions according to the matching rule; step S602, obtaining elements corresponding to each target element value from the information element table according to the regular expressions, so as to obtain a plurality of target elements. In this embodiment, the accuracy and the safety of obtaining the elements corresponding to the target element values from the information element table are improved by converting the target element values into regular expressions.
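One possible reading of the regular-expression step is sketched below: each attribute term is escaped and turned into a pattern that captures the value following the term on the same line. The "term, colon, value" layout is an assumption about how bulletins are written, and the function names are hypothetical.

```python
import re
from typing import Dict


def value_to_regex(term: str) -> re.Pattern:
    """Build a pattern matching 'term: value' up to the end of the line."""
    # re.escape keeps characters in the term from being read as regex syntax.
    # The separator may be an ASCII or a full-width colon.
    return re.compile(re.escape(term) + r"\s*[:：]\s*(?P<value>[^\n]+)")


def extract_values(text: str, terms: Dict[str, str]) -> Dict[str, str]:
    """Map each key to the value found after its term, when present."""
    found = {}
    for key, term in terms.items():
        match = value_to_regex(term).search(text)
        if match:
            found[key] = match.group("value").strip()
    return found
```

Escaping the term before compiling is what makes it safe to build patterns from configured strings, which may be what the text means by improved safety.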
After the plurality of bulletin data is extracted from the target bulletin text, the fund product corresponding to the target bulletin can be looked up, and the plurality of bulletin data and the error-corrected bulletin document can be uploaded; the plurality of bulletin data can also be entered into the information element table.
This embodiment parses a bulletin file whose content cannot be searched directly, such as a file in picture or PDF format, into a Word-format document whose content can be searched directly, and inputs the text content of the Word-format document into the BERT model so that text error correction can be performed by the OCR error corrector. After the overall error correction is complete, keyword error detection can be performed based on the keywords to be retrieved, the required element values are obtained, and the values are automatically entered into the corresponding product information. This series of operations requires no manual intervention. By combining the file parsing capability of OCR with the processing of the OCR error correction model, operating efficiency is improved by 85% and manual operation errors are reduced by 90%, which improves the efficiency and accuracy of acquiring bulletin data and further improves its timeliness.
In an exemplary embodiment, the OCR technology-based document information intelligent acquisition and error correction method further includes step S700: uploading the plurality of bulletin data to a blockchain.
For example, uploading the plurality of bulletin data to the blockchain can ensure its security as well as fairness and transparency. The blockchain referred to in this example is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, each of which contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
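The idea of data blocks linked by cryptographic methods can be illustrated with a toy hash chain. This is only a local illustration of the linkage, not a blockchain client: there is no network, consensus or platform layer, and the block layout is hypothetical.

```python
import hashlib
import json
from typing import Dict, List

GENESIS_HASH = "0" * 64  # placeholder predecessor for the first block


def make_block(data: Dict[str, str], prev_hash: str) -> dict:
    """Create a block whose hash covers its data and its predecessor's hash."""
    payload = json.dumps(
        {"data": data, "prev_hash": prev_hash}, sort_keys=True, ensure_ascii=False
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"data": data, "prev_hash": prev_hash, "hash": digest}


def verify_chain(chain: List[dict]) -> bool:
    """Check every block's hash and its link to the previous block."""
    prev = GENESIS_HASH
    for block in chain:
        expected = make_block(block["data"], block["prev_hash"])["hash"]
        if block["prev_hash"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True
```

Changing the data in any block changes its hash and breaks the link to every later block, which is the property behind the validity check described above.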
Example two
FIG. 2 is a schematic diagram of the program modules of a second embodiment of the OCR technology-based document information intelligent acquisition and error correction system. The OCR technology-based document information intelligent acquisition and error correction system 20 may comprise, or be divided into, one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the present invention and the method described above. A program module in the embodiments of the present invention refers to a series of computer program instruction segments capable of performing a specific function, and is better suited than the program itself to describing the execution of the system 20 in the storage medium. The functions of the program modules of this embodiment are described below:
a parsing module 200, configured to acquire a bulletin file and parse the bulletin file to obtain a bulletin text;
an acquisition module 202, configured to obtain a confidence level of each character in the bulletin text through a pre-trained language representation model;
a judging module 204, configured to judge whether the bulletin text is a preset type of text;
an input module 206, configured to, if the bulletin text is not the preset type of text, input the bulletin text and the confidence level of each character into a preset OCR error correction model, so that the preset OCR error correction model performs an error correction operation on the characters in the bulletin text whose confidence level is lower than a preset threshold, thereby obtaining an error-corrected target bulletin text;
a matching module 208, configured to match a plurality of target element values from a preconfigured information element table according to the target bulletin text; and
an extracting module 210, configured to extract a plurality of bulletin data from the target bulletin text according to the plurality of target element values.
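The way modules 200 to 210 hand data to one another can be sketched as a small pipeline. This is an illustrative skeleton, not the implementation of the embodiment: the class name `BulletinPipeline`, the field names, the default threshold of 0.8, and all of the toy stand-in functions are assumptions. In particular, the language representation model and the OCR error correction model are replaced by trivial placeholders so that the control flow can be run.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class BulletinPipeline:
    parse: Callable[[bytes], str]                             # module 200
    char_confidences: Callable[[str], List[float]]            # module 202
    is_preset_type: Callable[[str], bool]                     # module 204
    model_correct: Callable[[str, List[float], float], str]   # module 206
    match_element_values: Callable[[str], List[str]]          # module 208
    extract: Callable[[str, List[str]], Dict[str, str]]       # module 210
    threshold: float = 0.8                                    # assumed value

    def run(self, bulletin_file: bytes) -> Dict[str, str]:
        text = self.parse(bulletin_file)
        confidences = self.char_confidences(text)
        if self.is_preset_type(text):
            # Preset-type texts take a different correction branch,
            # which this sketch leaves out; the text passes through as-is.
            target_text = text
        else:
            target_text = self.model_correct(text, confidences, self.threshold)
        values = self.match_element_values(target_text)
        return self.extract(target_text, values)


# Toy stand-ins (all hypothetical) so the skeleton can be exercised.
FIXES = {"0": "o"}


def toy_model_correct(text, confidences, threshold):
    # Replace only characters whose confidence is below the threshold.
    return "".join(
        FIXES.get(ch, ch) if conf < threshold else ch
        for ch, conf in zip(text, confidences)
    )


def toy_extract(text, values):
    data = {}
    for value in values:
        m = re.search(re.escape(value) + r":\s*(.+)", text)
        if m:
            data[value] = m.group(1).strip()
    return data


pipeline = BulletinPipeline(
    parse=lambda raw: raw.decode("utf-8"),
    char_confidences=lambda t: [0.1 if ch == "0" else 0.99 for ch in t],
    is_preset_type=lambda t: False,
    model_correct=toy_model_correct,
    match_element_values=lambda t: [v for v in ("Issuer", "Amount") if v in t],
    extract=toy_extract,
)
result = pipeline.run(b"Issuer: Acme C0rp")
```

Passing each module in as a callable keeps the skeleton independent of any particular OCR engine or model, which the description leaves open.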
Illustratively, the parsing module 200 is further configured to: perform a format conversion operation on the bulletin file to obtain a bulletin file in a picture format; and extract the text content of the bulletin file in the picture format based on a text recognition technology, and generate the bulletin text according to the text content.
Illustratively, the system 20 for intelligently acquiring and correcting document information based on OCR technology further includes an error correction module, where the error correction module is configured to, if the bulletin text is the preset type of text, perform an error correction operation on the bulletin text according to a preset keyword table to obtain the error-corrected target bulletin text.
Illustratively, the error correction module is further configured to: acquire, from the bulletin text, a plurality of target characters whose confidence level is lower than a preset threshold; extract a plurality of candidate words from the bulletin text according to the plurality of target characters; and perform an error correction operation on the candidate words according to the preset keyword table to obtain the error-corrected target bulletin text.
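The three steps of the keyword-table branch (find low-confidence characters, collect the candidate words that contain them, correct those words against the keyword table) can be sketched as follows. This is a minimal sketch under several assumptions: the function name `correct_with_keywords` is hypothetical, words are taken to be whitespace-delimited (Chinese text would need a word segmenter instead), and the standard-library `difflib.get_close_matches` stands in for whatever similarity measure the embodiment actually uses.

```python
import difflib
import re


def correct_with_keywords(text, confidences, keywords, threshold=0.8, cutoff=0.6):
    """Replace low-confidence words with their closest keyword, if any."""
    # Step 1: positions of target characters below the confidence threshold.
    low = {i for i, conf in enumerate(confidences) if conf < threshold}

    pieces, last = [], 0
    for m in re.finditer(r"\S+", text):
        word = m.group()
        # Step 2: a word is a candidate if it contains a target character.
        is_candidate = any(i in low for i in range(m.start(), m.end()))
        if is_candidate and word not in keywords:
            # Step 3: look up the closest entry in the keyword table.
            close = difflib.get_close_matches(word, keywords, n=1, cutoff=cutoff)
            if close:
                word = close[0]
        pieces.append(text[last:m.start()])  # keep original spacing
        pieces.append(word)
        last = m.end()
    pieces.append(text[last:])
    return "".join(pieces)


# Hypothetical input: the OCR engine misread "l" as "1" at index 5.
sample = "Annua1 report of the board"
sample_conf = [0.99] * len(sample)
sample_conf[5] = 0.2
corrected = correct_with_keywords(sample, sample_conf, ["Annual", "Quarterly", "report"])
```

Words that contain no low-confidence character are never touched, so a high-confidence word that merely resembles a keyword is left as recognized.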
Illustratively, the extracting module 210 is further configured to: acquire, from the information element table, the elements corresponding to the respective target element values according to the plurality of target element values and a preconfigured matching rule, to obtain a plurality of target elements; and extract the plurality of bulletin data from the target bulletin text according to the plurality of target elements.
Illustratively, the extracting module 210 is further configured to: convert the plurality of target element values into a plurality of regular expressions according to the matching rule; and acquire, from the information element table, the elements corresponding to the respective target element values according to the plurality of regular expressions, to obtain the plurality of target elements.
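A possible reading of this matching and extraction flow is sketched below. The layout of the information element table (element name mapped to the element values that may appear in a bulletin), the matching rule (escape the value, tolerate variable whitespace, ignore case), the `value: data` line format, and all function names are assumptions made for illustration; the description does not specify them.

```python
import re

# Hypothetical information element table: element -> element values.
ELEMENT_TABLE = {
    "issuer": ["Issuer", "Issuing company"],
    "amount": ["Amount", "Total amount"],
}


def value_to_regex(value):
    """Assumed matching rule: literal words, flexible whitespace, any case."""
    parts = [re.escape(part) for part in value.split()]
    return re.compile(r"\s*".join(parts), re.IGNORECASE)


def match_target_values(text, table):
    """Element values from the table that occur in the target text."""
    return [
        value
        for values in table.values()
        for value in values
        if value_to_regex(value).search(text)
    ]


def lookup_elements(target_values, table):
    """Map each target element value back to its element via its regex."""
    found = {}
    for target in target_values:
        pattern = value_to_regex(target)
        for element, values in table.items():
            if any(pattern.fullmatch(v) for v in values):
                found.setdefault(element, []).append(pattern)
    return found


def extract_data(text, elements):
    """Pull the data that follows each matched element value on its line."""
    data = {}
    for element, patterns in elements.items():
        for pattern in patterns:
            m = re.search(pattern.pattern + r"\s*[:：]\s*(.+)", text, re.IGNORECASE)
            if m:
                data[element] = m.group(1).strip()
                break
    return data


target_text = "Issuing  company: Acme Corp\nTotal amount: 500 million"
target_values = match_target_values(target_text, ELEMENT_TABLE)
bulletin_data = extract_data(target_text, lookup_elements(target_values, ELEMENT_TABLE))
```

Going through regular expressions rather than exact string comparison lets one table entry cover small layout differences, such as the doubled space in the sample text.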
Illustratively, the system 20 for intelligently acquiring and correcting document information based on OCR technology further includes an uploading module, configured to upload the plurality of bulletin data to a blockchain.
Example three
Fig. 3 is a schematic diagram of a hardware architecture of a computer device according to a third embodiment of the present invention. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions. The computer device 2 may be a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers), and the like. As shown in Fig. 3, the computer device 2 includes, but is not limited to, at least a memory 21, a processor 22, a network interface 23, and an OCR technology-based document information intelligent acquisition and correction system 20, communicatively coupled to each other via a system bus.
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), or the like provided on the computer device 2. Of course, the memory 21 may also comprise both internal and external storage units of the computer device 2. In this embodiment, the memory 21 is generally used for storing an operating system installed on the computer device 2 and various application software, such as the program code of the document information intelligent acquisition and correction system 20 based on the OCR technology in the second embodiment. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to run the program code stored in the memory 21 or process data, for example, run the OCR technology based document information intelligent acquisition and correction system 20, so as to implement the OCR technology based document information intelligent acquisition and correction method according to the first embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, and the network interface 23 is typically used for establishing a communication connection between the computer device 2 and other electronic apparatuses. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, establish a data transmission channel and a communication connection between the computer device 2 and the external terminal, and the like. The network may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like.
It is noted that Fig. 3 only shows the computer device 2 with components 20-23, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In the present embodiment, the OCR technology based document information intelligent acquisition and correction system 20 stored in the memory 21 may also be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in the present embodiment, the processor 22) to complete the present invention.
For example, Fig. 2 is a schematic diagram of program modules for implementing the OCR technology based document information intelligent acquisition and correction system 20 according to the second embodiment of the present invention, in which the OCR technology based document information intelligent acquisition and correction system 20 can be divided into a parsing module 200, an acquisition module 202, a judging module 204, an input module 206, a matching module 208, and an extracting module 210. The program module referred to in the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable than a program for describing the execution process of the OCR technology based document information intelligent acquisition and correction system 20 in the computer device 2. The specific functions of the program modules 200 to 210 have been described in detail in the second embodiment, and are not described herein again.
Example four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc., on which a computer program is stored that, when executed by a processor, implements corresponding functions. The computer-readable storage medium of the present embodiment is used to store the OCR technology-based document information intelligent acquisition and correction system 20, which, when executed by a processor, implements the OCR technology-based document information intelligent acquisition and correction method of the first embodiment.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An OCR technology-based intelligent document information acquisition and error correction method is characterized by comprising the following steps:
acquiring a bulletin file, and analyzing the bulletin file to obtain a bulletin text;
obtaining the confidence of each character in the bulletin text through a pre-trained language representation model;
judging whether the announcement text is a preset type text or not;
if the bulletin text is not the preset type text, inputting the bulletin text and the confidence degrees of the characters into a preset OCR (optical character recognition) error correction model, and performing error correction operation on the characters with the confidence degrees lower than a preset threshold value in the bulletin text through the preset OCR error correction model to obtain an error-corrected target bulletin text;
matching a plurality of target element values from a pre-configured information element table according to the target bulletin text; and
extracting a plurality of bulletin data from the target bulletin text according to the plurality of target element values.
2. An OCR technology-based intelligent document information acquisition and correction method according to claim 1, wherein the step of acquiring a bulletin file and parsing the bulletin file to obtain a bulletin text comprises:
carrying out format conversion operation on the announcement file to obtain an announcement file in a picture format; and
extracting the text content of the announcement file in the picture format based on a text recognition technology, and generating the announcement text according to the text content.
3. An OCR technology-based document information intelligent acquisition and error correction method as claimed in claim 1, further comprising:
if the bulletin text is the preset type text, performing error correction operation on the bulletin text according to a preset keyword table to obtain the error-corrected target bulletin text.
4. The OCR technology-based document information intelligent acquisition and correction method as claimed in claim 3, wherein the step of performing error correction operation on the announcement text according to a preset keyword table to obtain the corrected target announcement text if the announcement text is the preset type text comprises:
acquiring a plurality of target characters with confidence degrees lower than a preset threshold value from the bulletin text;
extracting a plurality of candidate words from the bulletin text according to the plurality of target characters; and
performing error correction operation on the candidate words according to the preset keyword table so as to obtain the error-corrected target bulletin text.
5. An OCR technology-based document information intelligent acquisition and correction method as claimed in claim 1, wherein the step of extracting a plurality of bulletin data from the target bulletin text according to the plurality of target element values comprises:
acquiring elements corresponding to the target element values from the information element table according to the target element values and a preset matching rule to obtain a plurality of target elements; and
extracting a plurality of announcement data from the target announcement text according to the plurality of target elements.
6. An OCR technology-based intelligent document information acquisition and error correction method according to claim 5, wherein the step of acquiring elements corresponding to the respective target element values from the information element table according to the plurality of target element values and a pre-configured matching rule to obtain a plurality of target elements comprises:
converting the target element values into regular expressions according to the matching rules; and
acquiring elements corresponding to all target element values from the information element table according to the regular expressions to obtain a plurality of target elements.
7. An OCR technology-based document information intelligent acquisition and error correction method as claimed in any one of claims 1 to 6, further comprising: uploading the plurality of bulletin data to a blockchain.
8. An OCR technology-based intelligent document information acquisition and error correction system is characterized by comprising:
the analysis module is used for acquiring the bulletin files and analyzing the bulletin files to obtain bulletin texts;
the acquisition module is used for acquiring the confidence coefficient of each character in the bulletin text through a pre-trained language representation model;
the judging module is used for judging whether the bulletin text is a preset type text;
the input module is used for inputting the bulletin text and the confidence coefficient of each character into a preset OCR (optical character recognition) error correction model if the bulletin text is not the preset type text, so that the characters with the confidence coefficient lower than a preset threshold value in the bulletin text are subjected to error correction operation through the preset OCR error correction model to obtain an error-corrected target bulletin text;
the matching module is used for matching a plurality of target element values from a pre-configured information element table according to the target bulletin text; and
the extraction module is used for extracting a plurality of announcement data from the target announcement text according to the target element values.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the computer program when executed by the processor implements the steps of the OCR technology based document information intelligent acquisition and correction method according to any one of claims 1 to 7.
10. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and the computer program is executable by at least one processor to cause the at least one processor to execute the steps of the OCR technology based document information intelligent acquisition and correction method according to any one of claims 1 to 7.
CN202111151913.XA 2021-09-29 2021-09-29 Document information intelligent acquisition and error correction method, system and equipment based on OCR technology Pending CN113936130A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111151913.XA CN113936130A (en) 2021-09-29 2021-09-29 Document information intelligent acquisition and error correction method, system and equipment based on OCR technology

Publications (1)

Publication Number Publication Date
CN113936130A true CN113936130A (en) 2022-01-14

Family

ID=79277346

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111151913.XA Pending CN113936130A (en) 2021-09-29 2021-09-29 Document information intelligent acquisition and error correction method, system and equipment based on OCR technology

Country Status (1)

Country Link
CN (1) CN113936130A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294588A (en) * 2022-08-17 2022-11-04 湖北鑫英泰系统技术股份有限公司 Data processing method and system based on RPA process robot
CN115294588B (en) * 2022-08-17 2024-04-19 湖北鑫英泰系统技术股份有限公司 Data processing method and system based on RPA flow robot

Similar Documents

Publication Publication Date Title
US9690788B2 (en) File type recognition analysis method and system
CN112036145A (en) Financial statement identification method and device, computer equipment and readable storage medium
CN111695439A (en) Image structured data extraction method, electronic device and storage medium
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
AU2019204444A1 (en) System and method for enrichment of ocr-extracted data
CN113627168B (en) Method, device, medium and equipment for checking component packaging conflict
CN112052305A (en) Information extraction method and device, computer equipment and readable storage medium
CN112149387A (en) Visualization method and device for financial data, computer equipment and storage medium
CN111859093A (en) Sensitive word processing method and device and readable storage medium
CN112580363A (en) Requirement document processing method and device, computer equipment and storage medium
CN113936130A (en) Document information intelligent acquisition and error correction method, system and equipment based on OCR technology
CN114417798A (en) Document structured extraction method and device, computer equipment and storage medium
CN113642569A (en) Unstructured data document processing method and related equipment
CN112069808A (en) Financing wind control method and device, computer equipment and storage medium
CN110781404A (en) Friend relationship chain matching method, system, computer equipment and readable storage medium
CN115481599A (en) Document processing method and device, electronic equipment and storage medium
US20220044048A1 (en) System and method to recognise characters from an image
CN114491134A (en) Trademark registration success rate analysis method and system
CN113535938A (en) Standard data construction method, system, device and medium based on content identification
CN112749258A (en) Data searching method and device, electronic equipment and storage medium
CN110599338A (en) Transaction data processing method and device, computer equipment and storage medium
CN115563941B (en) Composite document processing method and device, storage medium and computer equipment
CN113850085B (en) Enterprise grade evaluation method and device, electronic equipment and readable storage medium
US20210295031A1 (en) Automated classification and interpretation of life science documents

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination