CN110889401A

CN110889401A - Text layout identification method based on opencv library

Info

Publication number: CN110889401A
Application number: CN201911059127.XA
Authority: CN
Inventors: 潘定; 梁倬骞; 温秋华; 曹志鹏; 翁秀木
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2019-11-01
Filing date: 2019-11-01
Publication date: 2020-03-17

Abstract

The invention discloses a text layout identification method based on the OpenCV library, which comprises the following specific steps: step 1, converting the required PDF file into a plurality of pictures through Smallpdf, one picture per PDF page; step 2, performing a dilation operation on the pictures by using OpenCV; step 3, calling the open-source Tesseract OCR API to perform character recognition. The method addresses the problems that non-standard, inconsistent title formats make automatic identification by a computer difficult, that the final extraction result is not accurate enough and often contains redundant titles, and that existing approaches are therefore difficult to apply widely.

Description

Text layout identification method based on opencv library

Technical Field

The invention relates to the technical field of text layout identification, in particular to a text layout identification method based on an opencv library.

Background

With the development of information technology, especially the popularization of the internet, the amount of information is increasing dramatically, like a vast ocean. Amid this complex information, people need to perform operations such as retrieval, classification and information filtering on text resources in order to improve access efficiency and to identify valuable information and data. The table of contents of a text embodies its structure, summarizes its content and distills its subject information, and is the basis for these operations.

Although many documents have a clear structure, they have no generated table of contents, so users search and filter information inefficiently and waste time and resources unnecessarily; extracting the text frame is therefore very important.

In the field of text frame extraction, two methods are mainly adopted: extraction based on title patterns and extraction based on title font size. However, title formats are often non-standard and inconsistent, which makes automatic identification by a computer difficult; the final extraction result is not accurate enough, redundant titles are produced, and the methods are difficult to apply widely.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a text layout identification method based on the OpenCV library, which addresses the problems that non-standard, inconsistent title formats make automatic identification by a computer difficult, that the final extraction result is not accurate enough and often contains redundant titles, and that wide application is difficult.

In order to achieve the purpose, the invention is realized by the following technical scheme: a text layout identification method based on an opencv library comprises the following specific steps:

step 1, converting a needed pdf file into a plurality of pictures through Smallpdf, wherein each page of pdf has one picture;

step 2, performing a dilation operation on the pictures by using OpenCV;

step 3, calling an open-source Tesseract OCR API to perform character recognition;

step 4, extracting the longest common substring of the recognition result and the rule-based extraction result, and thereby pruning the redundant titles extracted by the rules.

Preferably: the operations are performed on the digital image converted from the PDF file in step 1; the digital image is an image represented as a two-dimensional array whose basic unit is the pixel, and pixels are the basic elements of the digital image, obtained by discretizing continuous space when an analog image is digitized.

Preferably: in step 2, the dilation operation convolves the image with a structuring element of any shape, usually a square or a circle; the structuring element is slid across the image, the maximum pixel value within the area it covers is extracted, and the pixel at the anchor position is replaced with that value.

Preferably: in step 2, OpenCV comprises no fewer than five hundred functions, and the main body of OpenCV is divided into five modules: the CV module, the MLL module, the HighGUI module, the CxCore module and the CvAux module.

Preferably: the CV module contains image processing and vision algorithms; the MLL module contains statistical classifiers; the HighGUI module handles the GUI and the input and output of images and videos; the CxCore module contains basic structures and algorithms, XML support and drawing functions; the CvAux module holds algorithms and functions that are to be phased out.

Preferably, there are two ways to call Tesseract OCR in step 3: one is the CMD way, in which Tesseract is executed on the command line; the other is calling the API, in which tess4j, a Java API that wraps Tesseract, is introduced as a JAR dependency.

Preferably, the specific method of step 4 is:

A. reading the document based on the rule extraction result line by line;

B. extracting the maximum common substring;

C. outputting the new file for reading.

Preferably, in step A, since the rule-based title document contains one title per line, splitting it into lines at line feed characters is the optimal choice.

Advantageous effects

The invention provides a text layout identification method based on an opencv library. The method has the following beneficial effects:

The text layout identification method based on the OpenCV library improves the accuracy of text frame extraction from the perspective of computer vision. The difference between titles and body text that is visible to the human eye provides a new idea and direction for this work; this difference is also present in the digital image as seen by computer vision, so titles can be extracted accurately by exploiting the difference between title characters and body characters in computer vision.

Drawings

FIG. 1 is a flow chart of text layout identification based on opencv library in the present invention;

FIG. 2 is a diagram of OpenCV major modules in accordance with the present invention;

FIG. 3 is a cross-matching flow chart based on OpenCV processing results and rule extraction results according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1-3, the present invention provides a technical solution: a text layout identification method based on an opencv library comprises the following specific steps:

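step 1, converting the required PDF file into a plurality of pictures through Smallpdf, one picture per PDF page;

step 2, performing a dilation operation on the pictures by using OpenCV;

step 3, calling the open-source Tesseract OCR API to perform character recognition;

step 4, extracting the longest common substring of the recognition result and the rule-based extraction result, and thereby pruning the redundant titles extracted by the rules.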

Further, the operations are performed on the digital image converted from the PDF file in step 1; the digital image is an image represented as a two-dimensional array whose basic unit is the pixel, and pixels are the basic elements of the digital image, obtained by discretizing continuous space when an analog image is digitized;

further, in step 2, the dilation operation convolves the image with a structuring element of any shape, usually a square or a circle; the structuring element is slid across the image, the maximum pixel value within the area it covers is extracted, and the pixel at the anchor position is replaced with that value;

further, OpenCV in step 2 comprises no fewer than five hundred functions, and the main body of OpenCV is divided into five modules: the CV module, the MLL module, the HighGUI module, the CxCore module and the CvAux module;

further, the CV module contains image processing and vision algorithms; the MLL module contains statistical classifiers; the HighGUI module handles the GUI and the input and output of images and videos; the CxCore module contains basic structures and algorithms, XML support and drawing functions; the CvAux module holds algorithms and functions that are to be phased out;

furthermore, there are two ways to call Tesseract OCR in step 3: one is the CMD way, in which Tesseract is executed on the command line; the other is calling the API, in which tess4j, a Java API that wraps Tesseract, is introduced as a JAR dependency;

further, the specific method of step 4 is:

A. reading the document based on the rule extraction result line by line;

B. extracting the maximum common substring;

C. outputting a new file for reading;

further, since the rule-based title document in step A contains one title per line, splitting it into lines at line feed characters is the optimal choice.

A text layout identification method based on an opencv library comprises the following specific steps: step 1, converting a needed pdf file into a plurality of pictures through Smallpdf, wherein each page of pdf has one picture;

the operations are performed on the digital image converted from the PDF file in step 1. A digital image is an image represented as a two-dimensional array whose basic unit is the pixel. Pixels are the basic elements of the digital image and are obtained by discretizing continuous space when an analog image is digitized; each pixel has integer row (height) and column (width) position coordinates, and each pixel has an integer gray value or color value;

each pixel of an image typically corresponds to a particular "location" in two-dimensional space and carries one or more sampled values associated with that point. According to the number and characteristics of these samples, digital images can be divided into binary images, grayscale images, color images, pseudo-color images, three-dimensional images, and the like;

binary image: the luminance value of each pixel in the image can only be 0 or 1. Grayscale image, also called gray-level image: each pixel in the image is represented by a luminance value from 0 to 255, where the values between 0 and 255 represent different gray levels, 0 represents black, and 255 represents white;
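The two value ranges described above can be illustrated with a small sketch that stores an image as a two-dimensional array indexed by row and column. The class name `PixelDemo`, the method `toBinary` and the threshold value are assumptions made only for illustration; the patent does not specify a binarization step.

```java
// Illustrative sketch only: names and the threshold are assumptions, not part of the patent.
class PixelDemo {
    // Convert a grayscale image (rows x columns, values 0..255)
    // into a binary image whose pixels are either 0 or 1.
    static int[][] toBinary(int[][] gray, int threshold) {
        int[][] binary = new int[gray.length][];
        for (int row = 0; row < gray.length; row++) {
            binary[row] = new int[gray[row].length];
            for (int col = 0; col < gray[row].length; col++) {
                // Pixels at or above the threshold become 1, all others 0.
                binary[row][col] = gray[row][col] >= threshold ? 1 : 0;
            }
        }
        return binary;
    }
}
```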

step 2, performing a dilation operation on the pictures by using OpenCV, wherein the dilation operation convolves the image with a structuring element of any shape, usually a square or a circle: the structuring element is slid across the image, the maximum pixel value within the area it covers is extracted, and the pixel at the anchor position is replaced with that value;
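The dilation (expansion) operation described above can be sketched in plain Java as a maximum filter. This is a minimal illustration under stated assumptions, not OpenCV's implementation: the class `DilationDemo`, the square structuring element with its anchor at the centre, and the choice to skip positions outside the image are all assumptions, and the image is assumed to be a non-empty rectangular array.

```java
// Illustrative sketch only: a pure-Java dilation, not the OpenCV implementation.
class DilationDemo {
    // Dilate a grayscale image with a (2*radius+1) x (2*radius+1) square
    // structuring element anchored at its centre: each output pixel becomes
    // the maximum value covered by the element. Positions that fall outside
    // the image are skipped (an assumption about border handling).
    static int[][] dilate(int[][] image, int radius) {
        int rows = image.length;
        int cols = image[0].length;
        int[][] out = new int[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int max = image[r][c];
                for (int dr = -radius; dr <= radius; dr++) {
                    for (int dc = -radius; dc <= radius; dc++) {
                        int rr = r + dr;
                        int cc = c + dc;
                        if (rr >= 0 && rr < rows && cc >= 0 && cc < cols) {
                            max = Math.max(max, image[rr][cc]);
                        }
                    }
                }
                out[r][c] = max; // replace the pixel at the anchor position
            }
        }
        return out;
    }
}
```

With the OpenCV Java bindings, the corresponding calls would be `Imgproc.getStructuringElement` and `Imgproc.dilate`; the sketch above avoids that dependency so that it stays self-contained.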

regarding the dilation operation in image morphology, the principles of two morphological operations are mainly introduced here: dilation and its counterpart, erosion (Dilation and Erosion).

Dilation and erosion can achieve a variety of functions, mainly the following: eliminating noise; isolating individual image elements and joining adjacent elements in the image; finding obvious maximum or minimum regions in the image; and computing the gradient of the image;

in morphology, the structuring element is the most important and basic concept. The role of the structuring element in a morphological transformation is equivalent to that of the "filtering window" in signal processing. With the structuring element denoted by B(x), erosion and dilation are defined for each point x in the working space E as follows.

Dilation:

$$Y = E \oplus B = \{\, y : B(y) \cap E \neq \varnothing \,\}$$

Erosion:

$$X = E \ominus B = \{\, x : B(x) \subseteq E \,\}$$

Dilating E by B(x) yields the set of all points at which B, after translation, has a non-empty intersection with E; eroding E by B(x) yields the set of all points at which B, after translation, is contained in E;
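The set definitions of dilation and erosion given above can be checked on a small binary grid. The sketch below is an illustration under assumptions: the class `BinaryMorphology`, the representation of E as a `boolean[][]` grid, and the representation of the structuring element B as a list of (row, column) offsets are all hypothetical choices, and points outside the grid are treated as not belonging to E.

```java
// Illustrative sketch only: set-based binary morphology on a small grid.
class BinaryMorphology {
    // True if (r, c) lies inside the grid.
    static boolean inside(boolean[][] e, int r, int c) {
        return r >= 0 && r < e.length && c >= 0 && c < e[r].length;
    }

    // True if B translated to (r, c) has a non-empty intersection with E.
    static boolean hits(boolean[][] e, int[][] b, int r, int c) {
        for (int[] off : b) {
            int rr = r + off[0];
            int cc = c + off[1];
            if (inside(e, rr, cc) && e[rr][cc]) return true;
        }
        return false;
    }

    // True if B translated to (r, c) is entirely contained in E.
    static boolean fits(boolean[][] e, int[][] b, int r, int c) {
        for (int[] off : b) {
            int rr = r + off[0];
            int cc = c + off[1];
            if (!inside(e, rr, cc) || !e[rr][cc]) return false;
        }
        return true;
    }

    // Dilation: all points where the translated B intersects E.
    static boolean[][] dilate(boolean[][] e, int[][] b) {
        boolean[][] out = new boolean[e.length][];
        for (int r = 0; r < e.length; r++) {
            out[r] = new boolean[e[r].length];
            for (int c = 0; c < e[r].length; c++) out[r][c] = hits(e, b, r, c);
        }
        return out;
    }

    // Erosion: all points where the translated B is contained in E.
    static boolean[][] erode(boolean[][] e, int[][] b) {
        boolean[][] out = new boolean[e.length][];
        for (int r = 0; r < e.length; r++) {
            out[r] = new boolean[e[r].length];
            for (int c = 0; c < e[r].length; c++) out[r][c] = fits(e, b, r, c);
        }
        return out;
    }
}
```

For a 3 x 3 filled square inside a 5 x 5 grid and a cross-shaped element, erosion keeps only the centre point, while dilation grows the square by one pixel along each edge but not at the grid corners.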

in step 2, OpenCV comprises no fewer than five hundred functions, and the main body of OpenCV is divided into five modules: the CV module, the MLL module, the HighGUI module, the CxCore module and the CvAux module. The CV module contains image processing and vision algorithms; the MLL module contains statistical classifiers; the HighGUI module handles the GUI and the input and output of images and videos; the CxCore module contains basic structures and algorithms, XML support and drawing functions; the CvAux module holds algorithms and functions that are to be phased out;

OpenCV (Open Source Computer Vision Library) is an open-source computer vision library. OpenCV is written in C/C++ and runs on operating systems such as Linux, Windows and Mac; it also provides interfaces for Python, Ruby, MATLAB and other languages;

step 3, calling the open-source Tesseract OCR API to perform character recognition. Tesseract is an open-source optical character recognition (OCR) engine and tool that can readily detect text information in an image and can also recognize image verification codes. For example, given a text picture in TIF format, Tesseract recognizes the characters in the picture and writes them into a text file, with high recognition accuracy. To recognize text pictures in other languages, the corresponding language data file must be downloaded. Its characteristics are: it is open source; it is lightweight; it can be trained, so that users can build their own character sets; and it supports layout analysis.

Tesseract is available directly in many Linux distributions. The package is generally named 'tesseract' or 'tesseract-ocr' and can be found by searching the distribution's software repository. Language training data (language packs) are also provided: users can train their own data, or download the corresponding training data, decompress it, and copy the .traineddata file to the 'tessdata' directory;
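Recognizing a page picture and writing the text to a file, as described above, can be sketched as a command-line call launched from Java. This is a minimal sketch under assumptions: the class `TesseractCli`, the file names and the language code in the usage note are hypothetical, and running the command requires a `tesseract` executable on the PATH with the matching .traineddata file installed.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: class and method names are assumptions.
class TesseractCli {
    // Build the command line "tesseract <image> <outputBase> -l <language>".
    // Tesseract writes the recognized text to <outputBase>.txt.
    static List<String> buildCommand(String imagePath, String outputBase, String language) {
        return Arrays.asList("tesseract", imagePath, outputBase, "-l", language);
    }

    // Start the process, wait for it to finish, and return its exit code.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }
}
```

For example, `TesseractCli.run(TesseractCli.buildCommand("page1.tif", "page1", "chi_sim"))` would be expected to produce `page1.txt` when the Simplified Chinese language data is present in the 'tessdata' directory.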

there are two ways to call Tesseract OCR in step 3: one is the CMD way, in which Tesseract is executed on the command line; the other is calling the API, in which tess4j, a Java API that wraps Tesseract, is introduced as a JAR dependency;

step 4, extracting the longest common substring of the recognition result and the rule-based extraction result, and thereby pruning the redundant titles extracted by the rules, wherein the specific method of step 4 comprises the following steps:

step A, reading the document containing the rule-based extraction result line by line. Because the rule-based title document contains one title per line, splitting it into lines at line feed characters is the optimal choice. readLine() of BufferedReader in Java IO is a method commonly used to read data from a stream: it ends the current line when it reads a line feed '\n' or a carriage return '\r', returns the data of that line as a character string, and returns null when all data have been read. With this method, one line of title is returned and processed at a time;
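Reading the rule-based titles one line at a time can be sketched as follows. The class `TitleReader` and the method `readTitles` are hypothetical names, and skipping blank lines is an assumption that the patent does not state.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: class and method names are assumptions.
class TitleReader {
    // Read one title per line until readLine() returns null (end of input).
    static List<String> readTitles(Reader source) {
        List<String> titles = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Skipping blank lines is an assumption, not part of the patent.
                if (!line.trim().isEmpty()) {
                    titles.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return titles;
    }
}
```

Passing a `java.io.FileReader` for the rule-based title file yields the titles in file order, each ready to be matched against the recognized text.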

step B, extracting the longest common substring: the longest common substring is extracted between each title line that has been read and the OpenCV-based recognition result. Because some of the redundant titles that satisfy the title rule appear only as garbled text in the OpenCV-based result, no longest common substring is found for them during matching, so they can be filtered out of the final result, giving a more accurate text frame extraction. The longest common substring can be obtained with a brute-force matching algorithm. Let S be the character string extracted from the short text catalogue and T the recognized text; let the variable i denote the number of characters removed from S in each round, begin the position in S at which the extracted substring starts, and end the position at which it ends, where L[S] is the length of S, 0 ≤ i ≤ L[S], and 0 ≤ begin < end ≤ L[S] [24]. The process is as follows:

a. initializing the variable i to 0;

b. initializing begin to 0 and end to L[S] - i;

c. obtaining the continuous character substring S[begin..end] from the begin position to the end position of the string S;

d. comparing the continuous character substring S[begin..end] obtained in step c with the text T by calling contains() in Java to judge whether T contains S[begin..end], and if so, returning S[begin..end] as the function value;

e. increasing begin and end by a step length of 1 and starting again from step c, until end > L[S];

f. when end > L[S], increasing i by 1 and starting again from step b, until the function value is successfully returned in step d, or i equals L[S] and null is returned;
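The brute-force procedure above can be sketched in Java. This is a minimal illustration under assumptions: the class `LongestCommonSubstring` and the method `find` are hypothetical names, S is the rule-based title and T the recognized text, and when several substrings of the same length match, the leftmost one in S is returned.

```java
// Illustrative sketch only: class and method names are assumptions.
class LongestCommonSubstring {
    // Return the longest substring of s that also occurs in t,
    // or null when s and t share no character.
    static String find(String s, String t) {
        int length = s.length();
        // i is the number of characters removed from s in this round,
        // so the window examined has length (length - i).
        for (int i = 0; i < length; i++) {
            int begin = 0;
            int end = length - i;
            // Slide the window one position at a time until it passes the end of s.
            while (end <= length) {
                String candidate = s.substring(begin, end);
                if (t.contains(candidate)) {
                    return candidate; // longest match found
                }
                begin++;
                end++;
            }
        }
        return null; // i reached the length of s without a match
    }
}
```

Because the window shrinks by one character per round, the first match found is also the longest; the cost is cubic in the worst case, which is acceptable for short title strings.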

step C, outputting the new file for reading.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. A text layout identification method based on an opencv library comprises the following specific steps:

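step 1, converting the required PDF file into a plurality of pictures through Smallpdf, one picture per PDF page;

step 2, performing a dilation operation on the pictures by using OpenCV;

step 3, calling the open-source Tesseract OCR API to perform character recognition;

step 4, extracting the longest common substring of the recognition result and the rule-based extraction result, and thereby pruning the redundant titles extracted by the rules.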

2. The opencv-library-based text layout recognition method as claimed in claim 1, wherein: the operations are performed on the digital image converted from the PDF file in step 1; the digital image is an image represented as a two-dimensional array whose basic unit is the pixel, and pixels are the basic elements of the digital image, obtained by discretizing continuous space when an analog image is digitized.

3. The opencv-library-based text layout recognition method as claimed in claim 1, wherein: in step 2, the dilation operation convolves the image with a structuring element of any shape, usually a square or a circle; the structuring element is slid across the image, the maximum pixel value within the area it covers is extracted, and the pixel at the anchor position is replaced with that value.

4. The opencv-library-based text layout recognition method as claimed in claim 1, wherein: in step 2, OpenCV comprises no fewer than five hundred functions, and the main body of OpenCV is divided into five modules: the CV module, the MLL module, the HighGUI module, the CxCore module and the CvAux module.

5. The opencv-library-based text layout recognition method as claimed in claim 4, wherein: the CV module contains image processing and vision algorithms; the MLL module contains statistical classifiers; the HighGUI module handles the GUI and the input and output of images and videos; the CxCore module contains basic structures and algorithms, XML support and drawing functions; the CvAux module holds algorithms and functions that are to be phased out.

6. The opencv-library-based text layout recognition method as claimed in claim 1, wherein there are two ways to call Tesseract OCR in step 3: one is the CMD way, in which Tesseract is executed on the command line; the other is calling the API, in which tess4j, a Java API that wraps Tesseract, is introduced as a JAR dependency.

7. The opencv library-based text layout recognition method according to claim 1, wherein the specific method in step 4 is as follows:

A. reading the document based on the rule extraction result line by line;

B. extracting the maximum common substring;

C. outputting the new file for reading.

8. The opencv-library-based text layout recognition method as claimed in claim 7, wherein in step A, since the rule-based title document contains one title per line, splitting it into lines at line feed characters is the optimal choice.