CN116758565B

CN116758565B - OCR text restoration method, equipment and storage medium based on decision tree

Info

Publication number: CN116758565B
Application number: CN202311064174.XA
Authority: CN
Inventors: 刘法; 白建亮; 阎德劲; 郑大安; 雷文强; 向元新; 熊可欣; 袁焦; 丁栋威; 邓欣; 顾海燕; 奂锐; 谢明华; 孙国东
Original assignee: CETC 10 Research Institute
Current assignee: CETC 10 Research Institute
Priority date: 2023-08-23
Filing date: 2023-08-23
Publication date: 2023-11-24
Anticipated expiration: 2043-08-23
Also published as: CN116758565A

Abstract

The application provides a decision-tree-based OCR text restoration method, device, and storage medium. The method comprises: preprocessing the text boxes recognized by OCR; extracting text box features and constructing a decision tree based on those features; and classifying and merging the text boxes according to the decision tree to restore the original layout of the text. The application post-processes the OCR recognition result: a decision tree analyzes multiple features of each text box to identify its content category (such as title, chapter, page number, or paragraph), and the text boxes are then classified and merged to restore the original layout of the text. This avoids text boxes in the OCR recognition result being misclassified, misarranged, or overlapped, and addresses the problems of incoherent text content and disordered text format and layout.

Description

OCR text restoration method, equipment and storage medium based on decision tree

Technical Field

The present application relates to the field of text recognition technologies, and in particular to a decision-tree-based OCR text restoration method, device, and storage medium.

Background

To improve the accessibility of document information and simplify its management, the text content of documents must be recognized, converting text in images and scanned pages into editable, searchable text. The earliest document recognition technology was OCR, which uses optical character recognition to extract text from a document. In recent years, document recognition technologies based on deep learning and computer vision have gradually appeared. Although deep-learning-based document recognition has made significant progress in image processing, it requires training on extensive data sets and consumes large amounts of computing resources and time. Computer-vision-based document recognition is widely used in form parsing, but it also requires substantial resources for training, and parsing errors or partial information loss may still occur for forms with special structures. OCR technology, by contrast, is more mature and stable, can be applied to many types of documents, has become highly accurate as algorithms have improved, supports multiple languages, and is available in many commercial and open-source engines. OCR therefore remains the most commonly used document recognition technology.

Although the recognition accuracy of OCR has advanced significantly, challenging cases remain, such as complex, blurred, or distorted text and low-resolution images, in which the recognized text may not fully preserve the format and layout of the original document, so that the recognition result is inconsistent with the original. This is where post-processing methods come in. For documents with known styles and templates, restoration can be performed according to style rules and template information, but this approach cannot handle documents with unknown formats. Alternatively, natural language processing can be used to perform semantic analysis and entity recognition on the OCR result, extracting key information, named entities, relations, and the like in order to restore the semantic structure and information of the original document; however, this approach consumes a large amount of resources for model training and requires domain-specific entity knowledge. The most commonly used post-processing method for OCR text is therefore text layout analysis, which restores the layout structure of the original document by analyzing the relative positions of text blocks in the OCR result and performing distance calculation or clustering on the text boxes. However, many current text layout analysis methods consider only the relative position of text boxes and rarely consider other features such as fonts, digit proportions, and specific keywords.

In view of the current state of research, existing document-oriented post-processing methods for OCR have the following problems:

1. existing post-processing techniques are poor at restoring the structure of the recognized text and may classify or merge text incorrectly, which affects the accuracy and continuity of the recognition result;

2. they pay little attention to other features such as fonts, digit proportions, and specific keywords.

Disclosure of Invention

To address the problems in the prior art, a decision-tree-based OCR text restoration method, device, and storage medium are provided. A decision tree analyzes multiple features of the text boxes, which are then classified and merged to restore the text, solving the problem of text boxes being misclassified, misarranged, or overlapped.

The technical scheme adopted by the application is as follows: a decision-tree-based OCR text restoration method, comprising:

preprocessing the text box recognized by OCR;

extracting text box characteristics, and constructing a decision tree based on the text box characteristics;

and classifying and merging the text boxes according to the decision tree, and restoring the original layout of the text.

Further, the preprocessing includes:

numbering each text box, and recording the initial content of each text box;

converting all English characters of the text box into lowercase;

and removing special characters in the text box.

Further, the special characters include characters other than digits, letters, Chinese characters, punctuation, and spaces.

Further, the text box feature extracting process includes:

extracting the number of words, the number of lines and the position of each text box in the whole document;

extracting the length, width and font of each text box;

extracting the digit proportion, the letter proportion, and the contained keywords of each text box.

Further, the keywords are those that indicate the meaning of the text box content, for example "Fig. 1", "Table 2", "1.1", "2.1", and the like. The formats of these keywords are formulated by experts based on experience and can be identified by regular expressions.

Further, the constructing the decision tree includes:

root node: judging whether keywords are contained; if yes, classifying the text box according to the keyword type, including:

chapter node judgment: subdividing chapter levels according to the width, font, and number of keywords of the text box;

figure/table node judgment: determining which figure or table the text box belongs to according to features such as its font, position, and keywords;

otherwise, classifying the text box directly according to its length, width, font, position, and the like;

title node judgment: the text box has the widest width and is located at the highest position in the page;

page number node judgment: if a keyword is contained, the remaining content is all digits; if no keyword is contained, the entire content is all digits; the length is less than one line, and the text box is located at the highest or lowest position in the page;

paragraph node judgment: determining the paragraph type according to the digit proportion and letter proportion features.

Further, the classifying and merging process includes:

classifying all text boxes according to the decision tree;

restoring the initial content and position arrangement of each text box according to the serial numbers of the text boxes;

and merging text boxes with adjacent positions, consistent fonts and identical width in the same category.

A second aspect of the present application proposes an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor executing the computer program to implement the above decision-tree-based OCR text restoration method.

A third aspect of the present application proposes a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the decision-tree-based OCR text restoration method described above.

Compared with the prior art, the beneficial effects of this technical scheme are as follows: the application considers multiple features of the text boxes beyond position and uses a decision tree to classify and merge the text boxes, which avoids misclassifying text boxes with similar positions and allows targeted restoration based on the different categories of text.

Drawings

FIG. 1 is a flow chart of the decision-tree-based OCR text restoration method.

FIG. 2 is a flowchart of preprocessing in an embodiment of the present application.

FIG. 3 is a flow chart of feature extraction in an embodiment of the application.

FIG. 4 is a flowchart of decision tree construction in accordance with an embodiment of the present application.

FIG. 5 is a flow chart of classification and merging according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar modules or modules having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative only and are not to be construed as limiting the application. On the contrary, the embodiments of the application include all alternatives, modifications and equivalents as may be included within the spirit and scope of the appended claims.

Example 1

OCR (optical character recognition) algorithms recognize the text in an image or scanned page as text boxes, each carrying features such as text content, length, width, and position; the text boxes must still be rearranged before the text can be read smoothly. In existing OCR post-processing, text boxes are easily misclassified or wrongly merged, and because only position features are considered, other features receive little attention. To solve this problem, the embodiment of the application provides a decision-tree-based OCR text restoration method that post-processes the OCR recognition result: a decision tree analyzes multiple features of the text boxes, and the text boxes belonging to titles, chapters, page numbers, paragraphs, and figures and tables are classified and merged to restore the original layout of the text. This avoids text boxes in the OCR recognition result being misclassified, misarranged, or overlapped, and addresses the problems of incoherent text content and disordered text format and layout. As shown in FIG. 1, the specific scheme is as follows:

Step S101, preprocessing the text boxes recognized by OCR.

As shown in FIG. 2, in this embodiment the preprocessing mainly includes: first numbering each text box and recording its initial content, which facilitates subsequent restoration.

In addition, all English characters in the text box are converted to lowercase, and special characters in the text box are removed. This preprocessing effectively removes interfering items from the text boxes, so that text box features can be extracted more accurately and the accuracy of text box classification is improved.

In one embodiment, the special characters are characters that are not digits, letters, Chinese characters, punctuation, or spaces.
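The preprocessing of step S101 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the names `TextBox`, `PUNCTUATION`, and `preprocess` are hypothetical, the punctuation set is an assumption (the text only names the character categories to keep), and Chinese characters are approximated by the CJK Unified Ideographs range U+4E00 to U+9FFF.

```python
import re
from dataclasses import dataclass

# Assumed punctuation set; the source does not enumerate which marks count.
PUNCTUATION = ".,;:!?'\"()[]-，。；：！？、（）《》“”‘’"

# Anything that is not a digit, lowercase letter, CJK ideograph, whitespace,
# or listed punctuation mark is treated as a special character.
_SPECIAL = re.compile("[^0-9a-z\u4e00-\u9fff\\s" + re.escape(PUNCTUATION) + "]")


@dataclass
class TextBox:
    index: int      # serial number assigned during preprocessing
    original: str   # untouched OCR content, kept for later restoration
    cleaned: str    # normalized content used for feature extraction


def preprocess(texts):
    """Number each box, record its initial content, lowercase it, strip specials."""
    boxes = []
    for index, text in enumerate(texts):
        lowered = text.lower()               # English letters to lowercase
        cleaned = _SPECIAL.sub("", lowered)  # drop special characters
        boxes.append(TextBox(index=index, original=text, cleaned=cleaned))
    return boxes
```

Keeping `original` next to `cleaned` mirrors the stated purpose of recording the initial content: classification works on the normalized text, while the restored layout can reuse the untouched text.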

Step S102, extracting text box features and constructing a decision tree based on the text box features.

As shown in FIG. 3, in order to classify and merge text boxes, various features of the text boxes must first be extracted. In this embodiment, this includes:

for each text box, extracting the number of words, the number of lines, and the position in the entire document;

for each text box, extracting the length, width, and font;

for each text box, extracting the digit proportion, the letter proportion, and the contained keywords.
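The content-derived features listed above can be computed directly from the text of a box, as in the sketch below. It is illustrative only: `BoxFeatures` and `extract_features` are hypothetical names, the proportions are taken over non-whitespace characters (the source does not define the denominator), and geometric features such as position, length, width, and font are assumed to come from the OCR engine and are therefore omitted.

```python
from dataclasses import dataclass


@dataclass
class BoxFeatures:
    char_count: int      # number of non-whitespace characters
    line_count: int      # number of text lines in the box
    digit_ratio: float   # share of digits among non-whitespace characters
    letter_ratio: float  # share of ASCII letters among non-whitespace characters


def extract_features(text: str) -> BoxFeatures:
    visible = [c for c in text if not c.isspace()]
    total = len(visible)
    digits = sum(c.isdigit() for c in visible)
    # Restrict to ASCII so that Chinese characters are not counted as letters.
    letters = sum(c.isascii() and c.isalpha() for c in visible)
    return BoxFeatures(
        char_count=total,
        line_count=text.count("\n") + 1 if text else 0,
        digit_ratio=digits / total if total else 0.0,
        letter_ratio=letters / total if total else 0.0,
    )
```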

In this embodiment, the keywords are those that indicate the meaning of the text box content, for example "Fig. 1", "Table 2", "1.1", "2.1", and the like. The formats of these keywords are formulated by experts based on experience and can be identified by regular expressions.
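As an illustration of keyword identification by regular expressions, the sketch below defines a few patterns of the kind an expert might write. The patterns, the category names, and the function `find_keyword` are assumptions made for this example; the source gives only sample keywords, not the actual expressions.

```python
import re

# Hypothetical expert-written patterns, one per keyword type.
KEYWORD_PATTERNS = {
    "figure": re.compile(r"^(?:fig\.?|figure|图)\s*\d+", re.IGNORECASE),
    "table": re.compile(r"^(?:table|表)\s*\d+", re.IGNORECASE),
    # Dotted section numbers such as "1.1" or "2.1.3" at the start of the box.
    "chapter": re.compile(r"^\d+(?:\.\d+)+(?=\s|$)"),
    "page": re.compile(r"\bpage\b|第\s*\d+\s*页", re.IGNORECASE),
}


def find_keyword(text):
    """Return (keyword_type, matched_text) for the first matching pattern, else None."""
    stripped = text.strip()
    for name, pattern in KEYWORD_PATTERNS.items():
        match = pattern.search(stripped)
        if match:
            return name, match.group(0)
    return None
```

The chapter pattern deliberately requires at least one dot so that a bare number, which is more likely a page number, is not reported as a chapter keyword.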

After determining the features contained in the text box, a decision tree is further built based on the extracted features. The specific process is as follows:

As shown in FIG. 4, in this embodiment the keywords, font types, text box width intervals, and the like contained in the entire document are counted first.
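A rough sketch of this document-level counting is given below. The dictionary keys `font` and `width`, the bucket size of 50 units, and the function name `document_statistics` are all assumptions for illustration; the source does not say how width intervals are chosen.

```python
from collections import Counter


def document_statistics(boxes, bucket=50):
    """Count font types and width intervals over all text boxes of a document."""
    fonts = Counter(box["font"] for box in boxes)
    # Group widths into fixed-size intervals identified by their lower bound.
    width_buckets = Counter((box["width"] // bucket) * bucket for box in boxes)
    max_width = max((box["width"] for box in boxes), default=0)
    return {"fonts": fonts, "width_buckets": width_buckets, "max_width": max_width}
```

Statistics like the most common font or the maximum width can then serve as reference values for the node tests of the decision tree, for example when deciding which box is the widest.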

A decision tree is then constructed according to these statistics:

Root node: judging whether keywords (such as "Fig. 1", "Table 2", "1.1", "2.1", and the like) are contained; if so, classifying the text box according to the keyword type, including:

Chapter node judgment: the chapter level is further subdivided according to features of the text box such as width, font, and number of keywords.

Figure/table node judgment: which figure or table the text box belongs to is further determined according to features such as font, position, and keywords.

Otherwise, the text box is classified according to its length, width, font, position, and the like.

Title node judgment: the text box has the widest width, is usually at the highest position in the page, and is usually no more than one line long, although longer titles are not excluded.

Page number node judgment: if no keyword is contained, the content is all digits; if a keyword such as "page" is contained, the content other than the keyword is all digits. The length is less than one line, and the position is usually the lowest in the page.

Paragraph node judgment: the paragraph type (e.g., body text, references) is determined according to features such as the digit proportion and letter proportion.
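The node judgments above can be read as a hand-written decision tree, sketched below under stated assumptions. The class `Box`, the function `classify`, the category labels, and the thresholds `edge` and `ref_digit_ratio` are hypothetical; the source describes which features each node examines but gives no numeric thresholds. Chapter-level subdivision and figure/table matching are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    text: str
    width: float
    top: float           # distance from the top edge of the page
    line_count: int
    digit_ratio: float
    keyword: Optional[str] = None   # "chapter", "figure", "table", "page", or None


def classify(box, max_width, page_height, edge=0.1, ref_digit_ratio=0.2):
    # Root node: boxes with a chapter or figure/table keyword go to keyword branches.
    if box.keyword == "chapter":
        return "chapter"
    if box.keyword in ("figure", "table"):
        return "caption"

    # Page number node: all digits apart from an optional "page" keyword,
    # at most one line, near the top or bottom edge of the page.
    body = box.text.replace("page", "").strip()
    near_edge = box.top <= edge * page_height or box.top >= (1 - edge) * page_height
    if body.isdigit() and box.line_count <= 1 and near_edge:
        return "page_number"

    # Title node: the widest box, located near the top of the page.
    if box.width >= max_width and box.top <= edge * page_height:
        return "title"

    # Paragraph node: a high digit proportion suggests a reference entry.
    return "reference" if box.digit_ratio >= ref_digit_ratio else "body_text"
```

Writing the tree as explicit conditions keeps every decision inspectable; a learned tree would be a different technique from the expert-defined rules described here.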

Step S103, classifying and merging the text boxes according to the decision tree, and restoring the original layout of the text.

Referring to FIG. 5, in this embodiment all text boxes are classified directly using the constructed decision tree; the initial content and positional arrangement of each text box are restored according to the text box numbers; and text boxes of the same category that are adjacent in position, consistent in font, and identical in width are merged.
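The merging rule of step S103 can be sketched as a single pass over the boxes in numbering order. The dictionary keys, the function name `merge_boxes`, and the tolerances `max_gap` and `width_tol` are assumptions for this example; the source states the merge conditions (same category, adjacent position, consistent font, identical width) without quantifying "adjacent".

```python
def merge_boxes(boxes, max_gap=5.0, width_tol=1.0):
    """Merge consecutive boxes that share category, font, and width and are adjacent.

    Each box is a dict with keys: index, category, font, width, top, bottom, text.
    """
    merged = []
    # Sorting by the preprocessing number restores the original arrangement.
    for box in sorted(boxes, key=lambda b: b["index"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["category"] == box["category"]
            and prev["font"] == box["font"]
            and abs(prev["width"] - box["width"]) <= width_tol
            and box["top"] - prev["bottom"] <= max_gap
        ):
            prev["text"] += "\n" + box["text"]   # append content to the previous box
            prev["bottom"] = box["bottom"]       # extend the merged box downward
        else:
            merged.append(dict(box))             # copy so the input stays unchanged
    return merged
```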

The application considers multiple features of the text boxes beyond position (such as the digit/letter proportion and specific keywords) and then uses a decision tree to classify and merge the text boxes. This avoids misclassifying text boxes with similar positions and allows targeted restoration based on the different categories of text.

Example 2

The present embodiment proposes an electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor executing the computer program to implement the decision-tree-based OCR text restoration method described in embodiment 1.

The processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.

The memory may be used to store the computer program and/or the modules, and the processor may implement the various functions of the electronic device of the present application by running the computer program and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart memory card, secure digital card, flash memory card, at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.

Having described the basic concept of the application, it will be apparent to those skilled in the art that the foregoing detailed disclosure is by way of example only and is not intended to be limiting. Although not explicitly described herein, various modifications, improvements, and adaptations to the present disclosure may occur to one skilled in the art. Such modifications, improvements, and modifications are intended to be suggested within this specification, and therefore, such modifications, improvements, and modifications are intended to be included within the spirit and scope of the exemplary embodiments of the present application.

Meanwhile, the specification uses specific words to describe the embodiments of the specification. Reference to "one embodiment," "an embodiment," and/or "some embodiments" means that a particular feature, structure, or characteristic is associated with at least one embodiment of the present description. Thus, it should be emphasized and should be appreciated that two or more references to "an embodiment" or "one embodiment" or "an alternative embodiment" in various positions in this specification are not necessarily referring to the same embodiment. Furthermore, certain features, structures, or characteristics of one or more embodiments of the present description may be combined as suitable.

Furthermore, those skilled in the art will appreciate that the various aspects of the specification can be illustrated and described in terms of several patentable categories or circumstances, including any novel and useful procedures, machines, products, or materials, or any novel and useful modifications thereof. Accordingly, aspects of the present description may be performed entirely by hardware, entirely by software (including firmware, resident software, micro-code, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "data block," "module," "engine," "unit," "component," or "system."

Example 3

The present embodiment proposes a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the decision-tree-based OCR text restoration method described in embodiment 1.

The computer readable storage medium may contain a propagated data signal with the computer program code embodied therein, for example, on a baseband or as part of a carrier wave. The propagated signal may take on a variety of forms, including electro-magnetic, optical, etc., or any suitable combination thereof. A computer storage medium may be any computer readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code located on a computer storage medium may be propagated through any suitable medium, including radio, cable, fiber optic cable, RF, or the like, or a combination of any of the foregoing.

The computer program code necessary for operation of portions of the present description may be written in any one or more programming languages, including an object-oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python, and the like, a conventional programming language such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, a dynamic programming language such as Python, Ruby, and Groovy, or other programming languages. The program code may execute entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any form of network, such as a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet), or the code may be used in a cloud computing environment or provided as a service such as software as a service (SaaS).

Furthermore, the order in which the elements and sequences are processed, the use of numerical letters, or other designations in the description are not intended to limit the order in which the processes and methods of the description are performed unless explicitly recited in the claims. While certain presently useful inventive embodiments have been discussed in the foregoing disclosure, by way of various examples, it is to be understood that such details are merely illustrative and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover all modifications and equivalent arrangements included within the spirit and scope of the embodiments of the present disclosure. For example, while the system components described above may be implemented by hardware devices, they may also be implemented solely by software solutions, such as installing the described system on an existing server or mobile device.

Likewise, it should be noted that in order to simplify the presentation disclosed in this specification and thereby aid in understanding one or more inventive embodiments, various features are sometimes grouped together in a single embodiment, figure, or description thereof. This method of disclosure, however, is not intended to imply that more features than are presented in the claims are required for the present description. Indeed, an embodiment may have fewer than all of the features of a single embodiment disclosed above.

Each patent, patent application publication, and other material, such as articles, books, specifications, publications, documents, etc., referred to in this specification is incorporated herein by reference in its entirety. Excluded are application history documents that are inconsistent with or conflict with the content of this specification, as well as documents (currently or later attached to this specification) that limit the broadest scope of the claims of this specification. It is noted that, if the description, definition, and/or use of a term in material attached to this specification is inconsistent with or conflicts with what is described in this specification, the description, definition, and/or use of the term in this specification controls.

Finally, it should be understood that the embodiments described in this specification are merely illustrative of the principles of the embodiments of this specification. Other variations are possible within the scope of this description. Thus, by way of example, and not limitation, alternative configurations of embodiments of the present specification may be considered as consistent with the teachings of the present specification. Accordingly, the embodiments of the present specification are not limited to only the embodiments explicitly described and depicted in the present specification.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A decision-tree-based OCR text restoration method, comprising:

preprocessing the text box recognized by OCR;

classifying and merging the text boxes according to the decision tree, and restoring the original layout of the text;

the preprocessing comprises the following steps:

numbering each text box, and recording the initial content of each text box;

converting all English characters of the text box into lowercase;

removing special characters in the text box;

the text box feature extraction process comprises the following steps:

extracting the length, width and font of each text box;

extracting the digit proportion, the letter proportion and the contained keywords in each text box;

the constructing a decision tree includes:

otherwise, classifying the text boxes directly according to the length, width, font and position of the text boxes;

paragraph node judgment: determining a specific paragraph type according to the digit proportion and the letter proportion features;

the classifying and merging process comprises the following steps:

classifying all text boxes according to the decision tree;

2. The decision-tree-based OCR text restoration method of claim 1, wherein the special characters comprise characters that are not digits, letters, Chinese characters, punctuation, or spaces.

3. The decision-tree-based OCR text restoration method according to claim 1, wherein the keywords are keywords that indicate the meaning of the text box content and are identified by regular expressions.

4. An electronic device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor executing the computer program to implement the decision-tree-based OCR text restoration method of any one of claims 1-3.

5. A computer readable storage medium, on which a computer program is stored, which computer program, when executed by a processor, implements the decision-tree-based OCR text restoration method according to any one of claims 1-3.