CN116758565B - OCR text restoration method, equipment and storage medium based on decision tree
- Publication number
- CN116758565B (application CN202311064174.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- text box
- decision tree
- ocr
- keywords
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V30/26: Techniques for post-processing, e.g. correcting the recognition result
- G06N5/01: Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
- G06V30/19153: Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation using rules for classification or partitioning the feature space
- G06V30/413: Classification of content, e.g. text, photographs or tables
- G06V30/414: Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
Abstract
The application provides a decision-tree-based OCR text restoration method, device, and storage medium. The method comprises: preprocessing the text boxes recognized by OCR; extracting text box features and constructing a decision tree based on those features; and classifying and merging the text boxes according to the decision tree to restore the original layout of the text. The application post-processes the OCR recognition result: a decision tree analyzes multiple features of each text box to identify its content category (such as title, chapter, page number, or paragraph), and the boxes are then classified and merged to restore the original layout of the text. This avoids text boxes in the OCR result being misclassified, misarranged, or overlapped, and addresses the problems of incoherent text content and easily disordered format and layout.
Description
Technical Field
The present application relates to the field of text recognition technologies, and in particular to a decision-tree-based OCR text restoration method, device, and storage medium.
Background
To improve the accessibility of document information and simplify its management, the text content of documents needs to be recognized, converting text in images and scans into editable, searchable text. The earliest document recognition technology was based on OCR, which uses optical character recognition to extract text from a document. In recent years, document recognition technologies based on deep learning and computer vision have gradually appeared. Although deep-learning-based document recognition has made significant progress in image processing, it requires training on extensive data sets and consumes large amounts of computing resources and time. Computer-vision-based document recognition is widely used in form parsing, but it also requires substantial training resources, and parsing errors or partial information loss may still occur for forms with special structures. OCR technology, by contrast, is mature and stable, can be applied to many types of documents, has become highly accurate as algorithms have improved, supports multiple languages, and is available in many commercial and open-source engines. OCR therefore remains the most commonly used document recognition technology.
Although the recognition accuracy of OCR has advanced significantly, challenging cases such as complex text, blurred or distorted images, and low-resolution images may still yield recognized text that does not fully preserve the format and layout of the original document, so the recognition result is inconsistent with the original. Post-processing methods address this. For documents with known styles and templates, restoration can follow style rules and template information, but this approach cannot handle documents of unknown format. Alternatively, natural language processing can perform semantic analysis and entity recognition on the OCR result, extracting key information, named entities, relations, and so on, to restore the semantic structure and information of the original document; however, this approach consumes substantial resources for model training and requires domain-specific entity knowledge. The most commonly used post-processing approach for OCR text is therefore text layout analysis, which restores the layout structure of the original document by analyzing the relative positions of text blocks in the OCR result and performing distance calculation or clustering over multiple text boxes. However, many current text layout analysis methods consider only the relative position of text boxes and rarely consider other features such as font, digit proportion, and specific keywords.
Given this state of research, current document-oriented OCR post-processing methods have the following problems:
1. existing post-processing techniques restore the recognized text structure poorly and may classify or merge text incorrectly, which affects the accuracy and continuity of the recognition result;
2. they pay little attention to features other than position, such as font, digit proportion, and specific keywords.
Disclosure of Invention
To address the problems in the prior art, a decision-tree-based OCR text restoration method, device, and storage medium are provided. A decision tree analyzes multiple features of the text boxes and classifies and merges them to restore the text, which solves the problem of text boxes being misclassified, misarranged, or overlapped.
The technical scheme adopted by the application is as follows: an OCR text restoration method based on a decision tree, comprising:
preprocessing the text box recognized by OCR;
extracting text box characteristics, and constructing a decision tree based on the text box characteristics;
and classifying and merging the text boxes according to the decision tree, and restoring the original layout of the text.
Further, the preprocessing includes:
numbering each text box, and recording the initial content of each text box;
converting all English characters of the text box into lowercase;
and removing special characters in the text box.
Further, the special characters are all characters other than digits, letters, Chinese characters, punctuation, and spaces.
Further, the text box feature extraction process includes:
extracting the number of words, the number of lines and the position of each text box in the whole document;
extracting the length, width and font of each text box;
extracting the digit proportion, the letter proportion and the contained keywords of each text box.
Further, the keywords are keywords that can represent meaning of the text box content, for example, "fig. 1", "table 2", "1.1", "2.1", and the like. The format of these keywords is empirically formulated by an expert and can be identified by regular expressions.
Further, the constructing the decision tree includes:
root node: judging whether the text box contains keywords; if yes, classifying the text box according to the keyword type, including:
chapter node judgment: subdividing chapter levels according to the width, the font and the number of keywords of the text box;
figure/table node judgment: determining which figure or table the text box belongs to according to features such as the font, the position and the keywords of the text box;
otherwise, classifying the text box directly according to its length, width, font, position and the like;
title node judgment: the text box has the widest width and is located at the highest position in the page;
page number node judgment: if a keyword is included, all content other than the keyword is digits, and if no keyword is included, the content is all digits; the length is less than one line, and the text box is located at the highest or lowest position in the page;
paragraph node judgment: determining the paragraph type according to the digit proportion and letter proportion features.
Further, the classifying and merging process includes:
classifying all text boxes according to the decision tree;
restoring the initial content and position arrangement of each text box according to the serial numbers of the text boxes;
and merging text boxes with adjacent positions, consistent fonts and identical width in the same category.
A second aspect of the present application proposes an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor executing the computer program to implement the above decision-tree-based OCR text restoration method.
A third aspect of the present application proposes a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the decision-tree-based OCR text restoration method described above.
Compared with the prior art, the beneficial effects of this technical scheme are as follows: the application considers multiple features of the text boxes in addition to position and uses a decision tree to classify and merge the text boxes, which avoids misclassifying text boxes that are close in position and allows targeted restoration based on the different categories of text.
Drawings
FIG. 1 is a flow chart of the decision-tree-based OCR text restoration method.
FIG. 2 is a flowchart of preprocessing in an embodiment of the present application.
FIG. 3 is a flow chart of feature extraction in an embodiment of the application.
FIG. 4 is a flowchart of decision tree construction in accordance with an embodiment of the present application.
FIG. 5 is a flow chart of classification and merging according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar modules or modules having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. On the contrary, the embodiments of the application include all alternatives, modifications and equivalents as may be included within the spirit and scope of the appended claims.
Example 1
An OCR (optical character recognition) algorithm recognizes the text in an image or scan as text boxes that carry features such as text content, length, width, and position, but the text boxes must be rearranged before the result reads smoothly. Existing OCR post-processing easily misclassifies or incorrectly merges text boxes because it considers only position features and ignores others. To solve this problem, this embodiment of the application provides a decision-tree-based OCR text restoration method that post-processes the OCR recognition result: a decision tree analyzes multiple features of the text boxes, and the text boxes of titles, chapters, page numbers, paragraphs, and figures and tables are classified and merged to restore the original layout of the text. This avoids text boxes in the OCR result being misclassified, misarranged, or overlapped, and addresses incoherent text content and easily disordered format and layout. As shown in FIG. 1, the specific scheme is as follows:
Step S101, preprocessing the text boxes recognized by OCR.
As shown in fig. 2, in this embodiment, the preprocessing mainly includes: each text box is numbered first, and initial content is recorded, so that subsequent recovery is facilitated.
Simultaneously, all English characters in the text box are converted into lowercase, and special characters in the text box are removed. Through the preprocessing process, the interference items in the text boxes can be effectively removed, the text box features can be extracted more accurately, and the accuracy of text box classification is improved.
In one embodiment, the special characters are characters that are not digits, letters, Chinese characters, punctuation, or spaces.
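The preprocessing step can be sketched as follows. This is a minimal illustration, not the patented implementation: the `TextBox` container, the `preprocess` function, and the exact set of punctuation retained by the regular expression are assumptions introduced here, since the patent does not enumerate which punctuation marks are kept.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical container; the patent does not prescribe a data structure."""
    index: int      # number assigned during preprocessing
    original: str   # initial content, kept so it can be restored later
    cleaned: str    # normalized content used for feature extraction


# Keep digits, lowercase ASCII letters, CJK ideographs, whitespace, and an
# assumed set of ASCII and Chinese punctuation; everything else is "special".
_SPECIAL = re.compile(
    r"[^0-9a-z\u4e00-\u9fff\s.,;:!?'\"()\-"
    r"\u3001\u3002\uff0c\uff1a\uff1b\uff01\uff1f\uff08\uff09]"
)


def preprocess(raw_texts):
    """Number each box, record its initial content, lowercase, strip specials."""
    boxes = []
    for i, text in enumerate(raw_texts):
        lowered = text.lower()
        boxes.append(TextBox(index=i, original=text,
                             cleaned=_SPECIAL.sub("", lowered)))
    return boxes
```

Keeping `original` alongside `cleaned` mirrors the stated purpose of recording the initial content: classification works on the normalized text, while the restored layout uses the untouched text.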
Step S102, extracting text box features and constructing a decision tree based on the text box features.
As shown in FIG. 3, to classify and merge the text boxes, their features must first be extracted. In this embodiment, this includes:
for each text box, extracting the number of words, the number of lines, and the position in the entire document;
for each text box, extracting the length, width and font;
for each text box, extracting the digit proportion, the letter proportion, and the keywords contained.
In this embodiment, the keywords are keywords that can indicate the meaning of the text box content, for example, "fig. 1", "table 2", "1.1", "2.1", and the like. The format of these keywords is empirically formulated by an expert and can be identified by regular expressions.
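A sketch of the feature extraction described above, assuming text that has already been lowercased. The patterns in `KEYWORD_PATTERNS` are hypothetical examples (the patent only states that experts define the formats empirically), and the function names and the feature dictionary layout are likewise assumptions made for illustration.

```python
import re

# Hypothetical keyword formats, matched against lowercased text.
KEYWORD_PATTERNS = {
    "figure": re.compile(r"^fig(?:ure)?\.?\s*\d+"),
    "table": re.compile(r"^table\s*\d+"),
    "chapter": re.compile(r"^\d+(?:\.\d+)+"),
    "page": re.compile(r"\bpage\b"),
}


def keyword_types(text):
    """Return the names of all keyword patterns found in the text."""
    return [name for name, pat in KEYWORD_PATTERNS.items() if pat.search(text)]


def char_ratios(text):
    """Digit and ASCII-letter proportions over non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0, 0.0
    digits = sum(c.isdigit() for c in chars)
    letters = sum(c.isascii() and c.isalpha() for c in chars)
    return digits / len(chars), letters / len(chars)


def extract_features(text, position, length, width, font):
    """Collect the features named in the embodiment into one dictionary."""
    digit_ratio, letter_ratio = char_ratios(text)
    return {
        "word_count": len(text.split()),
        "line_count": text.count("\n") + 1,
        "position": position,
        "length": length,
        "width": width,
        "font": font,
        "digit_ratio": digit_ratio,
        "letter_ratio": letter_ratio,
        "keywords": keyword_types(text),
    }
```

Counting words by whitespace splitting suits English text; Chinese text would need a different word or character count, which this sketch does not attempt.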
After determining the features contained in the text box, a decision tree is further built based on the extracted features. The specific process is as follows:
As shown in FIG. 4, this embodiment first counts the keywords, font types, text box width ranges, and the like that occur in the entire document.
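The document-level counting can be sketched with `collections.Counter`. The dictionary keys, the `width_step` bucket size, and the `document_statistics` name are assumptions for illustration; the patent does not say how width ranges are formed.

```python
from collections import Counter


def document_statistics(boxes, width_step=50):
    """Count keywords, fonts, and width ranges over all text boxes.

    Each box is assumed to be a dict with "keywords", "font", and "width".
    Widths are grouped into buckets of `width_step` units (an assumption).
    """
    return {
        "keywords": Counter(k for b in boxes for k in b["keywords"]),
        "fonts": Counter(b["font"] for b in boxes),
        "width_ranges": Counter(
            (b["width"] // width_step) * width_step for b in boxes
        ),
        "max_width": max((b["width"] for b in boxes), default=0),
    }
```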
A decision tree is then constructed from these statistics:
Root node: judge whether the text box contains keywords (such as "fig. 1", "table 2", "1.1", "2.1", and the like). If it does, classify the text box according to the keyword type, including:
Chapter node judgment: the chapter level is further subdivided according to features of the text box such as width, font, and number of keywords.
Figure/table node judgment: which figure or table the text box belongs to is further determined according to features such as font, position, and keywords.
Otherwise, the text box is classified according to its length, width, font, position, and so on:
Title node judgment: the text box is the widest, is usually at the highest position in the page, and is usually one line or less in length, although more than one line is not excluded.
Page number node judgment: if no keyword is included, the content is all digits; if a keyword such as "page" is included, all content other than the keyword is digits. The length is less than one line, and the position is usually the lowest in the page.
Paragraph node judgment: the paragraph type (e.g., body text, references, etc.) is determined according to features such as the digit proportion and letter proportion.
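The node judgments above can be sketched as a hand-written rule tree. Everything numeric here is an assumption: the patent gives no thresholds, so the page-edge margins (0.05 and 0.95), the title band (0.15), and the digit-ratio cutoff for references (0.15) are placeholders, and deriving the chapter level from the numbering depth stands in for the width, font, and keyword-count criteria the embodiment names. The `y` field is assumed to be a normalized vertical position with 0 at the top of the page.

```python
def classify(box, stats):
    """Classify one text box with a small hand-written decision tree.

    `box` is assumed to hold: keywords, text (lowercased), width,
    y (0 = top of page, 1 = bottom), line_count, digit_ratio.
    `stats` is assumed to hold max_width for the whole document.
    """
    kws = box["keywords"]
    text = box["text"]

    # Keyword branch: chapter headings and figure/table captions.
    if "chapter" in kws:
        # Assumption: level = depth of the leading number, "1.1" -> 2.
        level = text.split()[0].strip(".").count(".") + 1
        return f"chapter_level_{level}"
    if "figure" in kws:
        return "figure_caption"
    if "table" in kws:
        return "table_caption"

    # Page number: only digits remain once an optional "page" keyword
    # is removed, at most one line, at the top or bottom edge of the page.
    remainder = text.replace("page", "").replace(" ", "")
    at_edge = box["y"] <= 0.05 or box["y"] >= 0.95
    if remainder.isdigit() and box["line_count"] <= 1 and at_edge:
        return "page_number"

    # Title: widest box in the document, near the top of the page.
    if box["width"] >= stats["max_width"] and box["y"] <= 0.15:
        return "title"

    # Paragraph type from character proportions (threshold is a placeholder).
    if box["digit_ratio"] >= 0.15:
        return "reference"
    return "body"
```

A learned tree (for example, one trained on labeled text boxes) could replace these rules; the patent text describes the node judgments but not a training procedure, so the sketch keeps them as explicit conditions.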
Step S103, classifying and merging the text boxes according to the decision tree, and restoring the original layout of the text.
Referring to fig. 5, in this embodiment, all text boxes are classified directly by using the constructed decision tree; restoring the initial content and position arrangement of each text box according to the serial numbers of the text boxes; and merging text boxes with adjacent positions, consistent fonts and identical width in the same category.
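The merging rule (same category, adjacent position, consistent font, identical width) can be sketched as a single pass over the boxes in index order. The dictionary keys, the `max_gap` adjacency tolerance, and the use of vertical `top`/`bottom` coordinates to decide adjacency are assumptions; the patent does not define "adjacent" numerically.

```python
def merge_boxes(boxes, max_gap=10):
    """Merge consecutive boxes that share category, font, and width.

    Each box is assumed to be a dict with: index, original, category,
    font, width, top, bottom. Boxes are restored to their numbered order
    first, then neighbours within `max_gap` units are concatenated.
    """
    ordered = sorted(boxes, key=lambda b: b["index"])
    merged = []
    for box in ordered:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["category"] == box["category"]
            and prev["font"] == box["font"]
            and prev["width"] == box["width"]
            and abs(box["top"] - prev["bottom"]) <= max_gap
        ):
            # Extend the previous box with the original (unnormalized) text.
            prev["original"] += "\n" + box["original"]
            prev["bottom"] = box["bottom"]
        else:
            merged.append(dict(box))  # copy so the input is not mutated
    return merged
```

Merging on the `original` field reflects the step in which the initial content of each box is restored by its number before merging.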
The application considers multiple features of the text boxes in addition to position (such as digit/letter proportion and specific keywords) and then uses a decision tree to classify and merge the text boxes. This avoids misclassifying text boxes that are close in position and allows targeted restoration based on the different categories of text.
Example 2
The present embodiment proposes an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor executing the computer program to implement the decision-tree-based OCR text restoration method described in embodiment 1.
The processor may be a central processing unit (CPU), another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the present application by running the computer program and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart memory card, secure digital card, flash memory card, at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
Having described the basic concept of the application, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations to the present disclosure may occur to one skilled in the art. Such modifications, improvements, and modifications are intended to be suggested within this specification, and therefore, such modifications, improvements, and modifications are intended to be included within the spirit and scope of the exemplary embodiments of the present application.
Meanwhile, the specification uses specific words to describe the embodiments of the specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.
Furthermore, those skilled in the art will appreciate that the various aspects of the specification can be illustrated and described in terms of several patentable categories or circumstances, including any novel and useful procedures, machines, products, or materials, or any novel and useful modifications thereof. Accordingly, aspects of the present description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system."
Example 3
The present embodiment proposes a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the decision-tree-based OCR text restoration method described in embodiment 1.
The computer readable storage medium may contain a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take on a variety of forms, including electro-magnetic, optical, etc., or any suitable combination thereof. A computer storage medium may be any computer readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated through any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or a combination of any of the foregoing.
The computer program code necessary for operation of portions of the present description may be written in any one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, a dynamic programming language such as Python, Ruby, and Groovy, or other programming languages, and the like. The program code may execute entirely on the user's computer or as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or through the use of services such as software as a service (SaaS) in a cloud computing environment.
Furthermore, the order in which the elements and sequences are processed, the use of numerical letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.
Likewise, it should be noted that in order to simplify the presentation disclosed in this specification and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not intended to imply that more features than are presented in the claims are required for the present description. Indeed, the claimed subject matter may lie in less than all of the features of a single embodiment disclosed above.
Each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., referred to in this specification is incorporated herein by reference in its entirety. Excluded are application history documents that are inconsistent with or conflict with the content of this specification, as well as documents (currently or later attached to this specification) that limit the broadest scope of the claims of this specification. It is noted that, if the description, definition, and/or use of a term in material attached to this specification does not conform to or conflicts with what is described in this specification, the description, definition, and/or use of the term in this specification controls.
Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
Claims (5)
1. An OCR text restoration method based on a decision tree, comprising:
preprocessing the text box recognized by OCR;
extracting text box characteristics, and constructing a decision tree based on the text box characteristics;
classifying and merging the text boxes according to the decision tree, and restoring the original layout of the text;
the preprocessing comprises the following steps:
numbering each text box, and recording the initial content of each text box;
converting all English characters of the text box into lowercase;
removing special characters in the text box;
the text box feature extraction process comprises the following steps:
extracting the number of words, the number of lines and the position of each text box in the whole document;
extracting the length, width and font of each text box;
extracting the digit proportion, the letter proportion and the contained keywords of each text box;
the constructing a decision tree includes:
root node: judging whether the text box contains keywords; if yes, classifying the text box according to the keyword type, including:
chapter node judgment: subdividing chapter levels according to the width, the font and the number of keywords of the text box;
figure/table node judgment: determining which figure or table the text box belongs to according to features of the font, the position and the keywords of the text box;
otherwise, classifying the text box directly according to the length, width, font and position of the text box;
title node judgment: the text box has the widest width and is located at the highest position in the page;
page number node judgment: if a keyword is included, all content other than the keyword is digits, and if no keyword is included, the content is all digits; the length is less than one line, and the text box is located at the highest or lowest position in the page;
paragraph node judgment: determining a specific paragraph type according to the digit proportion and letter proportion features;
the classifying and merging process comprises the following steps:
classifying all text boxes according to the decision tree;
restoring the initial content and positional arrangement of each text box according to its number;
and merging text boxes of the same category that are adjacent in position, consistent in font, and identical in width.
2. The decision tree based OCR text restoration method of claim 1, wherein the special characters are characters that are not digits, letters, Chinese characters, punctuation marks, or spaces.
3. The decision tree based OCR text restoration method according to claim 1, wherein the keywords are words that indicate the meaning of the text box content and are identified by regular expressions.
4. An electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor executing the computer program to implement the decision tree based OCR text restoration method of any one of claims 1-3.
5. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the decision tree based OCR text restoration method according to any one of claims 1-3.
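For illustration only, the steps recited in claims 1-3 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the `TextBox` fields, the keyword patterns, the 10% page-edge threshold, the 0.5 digit-ratio cut-off, and the `gap` tolerance are all hypothetical choices that the claims do not specify. Position is reduced to a vertical coordinate, and the tree is hand-written rather than learned.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword patterns; the claims only state that keywords are
# identified by regular expressions.
KEYWORD_PATTERNS = {
    "chapter": re.compile(r"^(chapter|section)\s+\d+"),
    "chart": re.compile(r"^(figure|fig|table)\s*\d+"),
    "page": re.compile(r"^page\s*\d+$"),
}

# Keeps digits, lowercase ASCII letters, CJK unified ideographs, whitespace,
# and a small assumed set of punctuation; everything else counts as "special".
SPECIAL = re.compile(r"[^0-9a-z\u4e00-\u9fff\s.,;:!?()，。；：！？、（）-]")


@dataclass
class TextBox:
    raw: str                     # initial content, kept for restoration
    y: float                     # distance of the box top from the page top
    width: float
    height: float
    font: str
    page_height: float = 1000.0  # assumed page size
    number: int = -1             # serial number assigned in preprocessing
    clean: str = ""              # normalized text used for classification
    label: str = ""


def preprocess(boxes):
    """Number each box, lowercase its text, and strip special characters."""
    for i, box in enumerate(boxes):
        box.number = i
        box.clean = SPECIAL.sub("", box.raw.lower()).strip()
    return boxes


def features(box, max_width):
    """Extract the features the tree tests (thresholds are assumptions)."""
    text = box.clean
    n = max(len(text.replace(" ", "")), 1)
    top = box.y < 0.1 * box.page_height
    bottom = box.y > 0.9 * box.page_height
    return {
        "keyword": next((k for k, p in KEYWORD_PATTERNS.items()
                         if p.search(text)), None),
        "digit_ratio": sum(c.isdigit() for c in text) / n,
        "letter_ratio": sum(c.isalpha() for c in text) / n,
        "widest": box.width >= max_width,
        "at_top": top,
        "at_edge": top or bottom,
    }


def classify(f):
    """Hand-written tree: keyword branch first, then layout-only branch."""
    if f["keyword"] == "chapter":
        return "chapter"
    if f["keyword"] == "chart":
        return "chart"
    if f["keyword"] == "page" and f["at_edge"]:
        return "page_number"
    if f["widest"] and f["at_top"]:
        return "title"
    if f["digit_ratio"] == 1.0 and f["at_edge"]:
        return "page_number"
    return "numeric_paragraph" if f["digit_ratio"] > 0.5 else "text_paragraph"


def restore(boxes, gap=5.0):
    """Classify every box, then merge adjacent boxes of the same category."""
    preprocess(boxes)
    max_width = max(box.width for box in boxes)
    for box in boxes:
        box.label = classify(features(box, max_width))
    merged = []
    for box in sorted(boxes, key=lambda b: b.number):
        prev = merged[-1] if merged else None
        if (prev is not None and prev["label"] == box.label
                and prev["font"] == box.font and prev["width"] == box.width
                and box.y - prev["bottom"] <= gap):
            prev["text"] += " " + box.raw      # restore the initial content
            prev["bottom"] = box.y + box.height
        else:
            merged.append({"label": box.label, "font": box.font,
                           "width": box.width, "text": box.raw,
                           "bottom": box.y + box.height})
    return merged
```

Calling `restore` on a list of `TextBox` objects returns one dictionary per merged block, in the original reading order, with the category label and the untouched original text. A production system would likely replace the fixed thresholds with values tuned on real documents.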
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311064174.XA CN116758565B (en) | 2023-08-23 | 2023-08-23 | OCR text restoration method, equipment and storage medium based on decision tree |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116758565A (en) | 2023-09-15 |
CN116758565B (en) | 2023-11-24 |
Family
ID=87951980
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311064174.XA Active CN116758565B (en) | 2023-08-23 | 2023-08-23 | OCR text restoration method, equipment and storage medium based on decision tree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116758565B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106250830A (en) * | 2016-07-22 | 2016-12-21 | 浙江大学 | Digital book structured analysis processing method |
CN107145479A (en) * | 2017-05-04 | 2017-09-08 | 北京文因互联科技有限公司 | Structure of an article analysis method based on text semantic |
US10636074B1 (en) * | 2015-09-18 | 2020-04-28 | Amazon Technologies, Inc. | Determining and executing application functionality based on text analysis |
CN111768820A (en) * | 2020-06-04 | 2020-10-13 | 上海森亿医疗科技有限公司 | Paper medical record digitization and target detection model training method, device and storage medium |
CN113221735A (en) * | 2021-05-11 | 2021-08-06 | 润联软件系统(深圳)有限公司 | Multimodal-based scanned part paragraph structure restoration method and device and related equipment |
CN114186533A (en) * | 2021-11-04 | 2022-03-15 | 北京百度网讯科技有限公司 | Model training method and device, knowledge extraction method and device, equipment and medium |
CN114220114A (en) * | 2021-12-28 | 2022-03-22 | 科大讯飞股份有限公司 | Text image recognition method, device, equipment and storage medium |
CN114238575A (en) * | 2021-12-15 | 2022-03-25 | 平安科技(深圳)有限公司 | Document parsing method, system, computer device and computer-readable storage medium |
CN114303140A (en) * | 2019-07-03 | 2022-04-08 | 马里兰怡安风险服务有限公司 | Analysis of intellectual property data related to products and services |
CN114495147A (en) * | 2022-01-25 | 2022-05-13 | 北京百度网讯科技有限公司 | Identification method, device, equipment and storage medium |
CN115147841A (en) * | 2022-06-01 | 2022-10-04 | 兴业银行股份有限公司杭州分行 | Data intelligent identification and extraction system, method and medium based on deep learning |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9098471B2 (en) * | 2011-12-29 | 2015-08-04 | Chegg, Inc. | Document content reconstruction |
WO2014127535A1 (en) * | 2013-02-22 | 2014-08-28 | Google Inc. | Systems and methods for automated content generation |
Non-Patent Citations (4)
Title |
---|
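Email Spam Detection Using Machine Learning Algorithms; Nikhil Kumar et al.; Proceedings of the Second International Conference on Inventive Research in Computing Applications (ICIRCA-2020); pp. 108-113 * |
Design and Implementation of a Criminal Prosecution Case-Handling Assistance System; Li Yanlu; China Master's Theses Full-text Database, Information Science and Technology (No. 5); p. I138-627 * |
Research on Web Image Retrieval Based on Web Page Information and Image Features; Huang Zhihu; China Doctoral Dissertations Full-text Database, Information Science and Technology (No. 7); p. I138-34 * |
Research on Structural Knowledge Extraction and Organization for Multimodal Official Documents; Xu Ruilin et al.; Systems Engineering and Electronics; Vol. 44 (No. 7); pp. 2241-2250 * |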
Also Published As
Publication number | Publication date |
---|---|
CN116758565A (en) | 2023-09-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |