CN110889401A - Text layout identification method based on opencv library

Info

Publication number
CN110889401A
CN110889401A CN201911059127.XA CN201911059127A CN110889401A CN 110889401 A CN110889401 A CN 110889401A CN 201911059127 A CN201911059127 A CN 201911059127A CN 110889401 A CN110889401 A CN 110889401A
Authority
CN
China
Legal status
Pending
Application number
CN201911059127.XA
Other languages
Chinese (zh)
Inventor
潘定
梁倬骞
温秋华
曹志鹏
翁秀木
Current Assignee
Jinan University
Original Assignee
Jinan University
Priority date
2019-11-01
Filing date
2019-11-01
Publication date
2020-03-17
2019-11-01 Application filed by Jinan University
2019-11-01 Priority to CN201911059127.XA
2020-03-17 Publication of CN110889401A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Input (AREA)

Abstract

The invention discloses a text layout identification method based on the OpenCV library, comprising the following specific steps: step 1, converting the required PDF file into a set of pictures through Smallpdf, one picture per PDF page; step 2, performing a dilation operation on each picture using OpenCV; step 3, calling the open-source Tesseract OCR API to perform character recognition. The method addresses the problems that non-standard, inconsistent title formats make automatic identification by computer difficult, that the final extraction result is not accurate enough and often contains redundant entries, and that existing approaches are therefore hard to apply widely.

Description

Text layout identification method based on opencv library
Technical Field
The invention relates to the technical field of text layout identification, in particular to a text layout identification method based on an opencv library.
Background
With the development of information technology, and especially the spread of the internet, the amount of information is growing dramatically and has become as vast as an ocean. To cope with this mass of information, people need to perform operations such as retrieval, classification and information filtering on text resources, so as to access those resources more efficiently and to identify valuable information and data. The table of contents of a text reflects its structure, summarizes its content and distills its subject information, and is therefore the basis for these operations.
Many documents have a clear structure but come without a generated table of contents, which lowers the efficiency of retrieval and information filtering and wastes time and resources. Extracting the text framework is therefore very important.
In the field of text framework extraction, two methods are mainly used: extraction based on title patterns and extraction based on title font size. However, title formats are often non-standard and inconsistent, which makes automatic identification by computer difficult; the final extraction result is not accurate enough and contains redundant entries, and the methods are hard to apply widely.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a text layout identification method based on the OpenCV library. It addresses the problems that non-standard, inconsistent title formats make automatic identification by computer difficult, that the final extraction result is not accurate enough and often contains redundant entries, and that existing methods are hard to apply widely.
To achieve this purpose, the invention adopts the following technical scheme: a text layout identification method based on the OpenCV library, comprising the following specific steps:
step 1, converting the required PDF file into a set of pictures through Smallpdf, one picture per PDF page;
step 2, performing a dilation operation on each picture using OpenCV;
step 3, calling the open-source Tesseract OCR API to perform character recognition;
step 4, extracting the longest common substring of the recognition result and the rule-based extraction result, and pruning the redundant titles produced by the rule-based extraction.
Preferably: the subsequent operations work on the digital image converted from the PDF file in step 1; the digital image is an image represented as a two-dimensional array whose unit is the pixel, and the pixel is the basic element of the digital image, obtained by discretizing continuous space when an analog image is digitized.
Preferably: in step 2, the dilation operation convolves the image with a structuring element of arbitrary shape, usually a square or a circle; the structuring element is slid across the image, the maximum pixel value within the area it covers is extracted, and the pixel at the anchor position is replaced with that maximum.
Preferably: in step 2, OpenCV comprises no fewer than five hundred functions, and the body of OpenCV is divided into five modules: the CV module, the MLL module, the HighGUI module, the CxCore module and the CvAux module.
Preferably: the CV module contains image processing and vision algorithms; the MLL module contains statistical classifiers; the HighGUI module handles the GUI and the input and output of images and videos; the CxCore module contains basic structures and algorithms as well as XML support and drawing functions; the CvAux module holds algorithms and functions that are due to be retired.
Preferably, there are two ways to call Tesseract OCR in step 3: one is the CMD way, in which Tesseract is executed on the command line; the other is through an API, where tess4j is a Java API that wraps Tesseract and its JAR is brought in as a dependency.
Preferably, the specific method of step 4 is:
A. reading the document containing the rule-based extraction result line by line;
B. extracting the maximum common substring;
C. outputting the new file for reading.
Preferably, in step A, because the title document produced by the rule-based extraction holds one title per line, operating on lines delimited by the line-feed character is the best choice.
Advantageous effects
The invention provides a text layout identification method based on an opencv library. The method has the following beneficial effects:
The text layout recognition method based on the OpenCV library improves the accuracy of text framework extraction by approaching the problem from the angle of computer vision. The difference between titles and body text as perceived by the human eye suggests a new idea and direction for this work; that difference is rooted in how the text appears in the digital image seen by computer vision, so titles can be extracted accurately by exploiting the difference between title characters and body characters in computer vision.
Drawings
FIG. 1 is a flow chart of text layout identification based on opencv library in the present invention;
FIG. 2 is a diagram of OpenCV major modules in accordance with the present invention;
FIG. 3 is a cross-matching flow chart based on OpenCV processing results and rule extraction results according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-3, the present invention provides a technical solution: a text layout identification method based on the OpenCV library, comprising the following specific steps:
step 1, converting the required PDF file into a set of pictures through Smallpdf, one picture per PDF page;
step 2, performing a dilation operation on each picture using OpenCV;
step 3, calling the open-source Tesseract OCR API to perform character recognition;
step 4, extracting the longest common substring of the recognition result and the rule-based extraction result, and pruning the redundant titles produced by the rule-based extraction.
Further, the subsequent operations work on the digital image converted from the PDF file in step 1; the digital image is an image represented as a two-dimensional array whose unit is the pixel, and the pixel is the basic element of the digital image, obtained by discretizing continuous space when an analog image is digitized;
further, in step 2, the dilation operation convolves the image with a structuring element of arbitrary shape, usually a square or a circle; the structuring element is slid across the image, the maximum pixel value within the area it covers is extracted, and the pixel at the anchor position is replaced with that maximum;
further, the OpenCV in step 2 comprises no fewer than five hundred functions, and the body of OpenCV is divided into five modules: the CV module, the MLL module, the HighGUI module, the CxCore module and the CvAux module;
further, the CV module contains image processing and vision algorithms; the MLL module contains statistical classifiers; the HighGUI module handles the GUI and the input and output of images and videos; the CxCore module contains basic structures and algorithms as well as XML support and drawing functions; the CvAux module holds algorithms and functions that are due to be retired;
furthermore, there are two ways to call Tesseract OCR in step 3: one is the CMD way, in which Tesseract is executed on the command line; the other is through an API, where tess4j is a Java API that wraps Tesseract and its JAR is brought in as a dependency;
further, the specific method of step 4 is:
A. reading the document containing the rule-based extraction result line by line;
B. extracting the maximum common substring;
C. outputting a new file for reading;
further, because the title document produced by the rule-based extraction in step A holds one title per line, operating on lines delimited by the line-feed character is the best choice.
A text layout identification method based on the OpenCV library comprises the following specific steps. Step 1: convert the required PDF file into a set of pictures through Smallpdf, one picture per PDF page.
The subsequent operations work on the digital images converted from the PDF file in step 1. A digital image is an image represented as a two-dimensional array whose unit is the pixel. Pixels are the basic elements of a digital image and are obtained by discretizing continuous space when an analog image is digitized. Each pixel has integer row (height) and column (width) coordinates and an integer gray value or color value.
Each pixel typically corresponds to a particular location in two-dimensional space and carries one or more sampled values associated with that point. According to the number and character of these samples, digital images can be divided into binary images, grayscale images, color images, pseudo-color images, three-dimensional images, and so on.
Binary image: the luminance value of each pixel can only be 0 or 1. Grayscale image, also called a gray-level image: each pixel is represented by a luminance value from 0 to 255, where 0 represents black, 255 represents white, and the values in between represent different gray levels.
Step 2: perform a dilation operation on the picture using OpenCV. Dilation convolves the image with a structuring element of arbitrary shape, usually a square or a circle: the structuring element is slid across the image, the maximum pixel value within the area it covers is extracted, and the pixel at the anchor position is replaced with that maximum.
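The maximum-over-neighbourhood rule described for step 2 can be illustrated without any library. The following is a minimal pure-Python sketch, not the OpenCV implementation: the function name `dilate`, the rectangular element, the centred anchor and the choice to ignore positions outside the image are all assumptions made for illustration.

```python
def dilate(image, kernel_h=3, kernel_w=3):
    """Grayscale dilation with a rectangular structuring element.

    image is a list of rows of integer gray values. Assumptions: the
    element has odd height and width, its anchor is its centre, and
    positions that fall outside the image are ignored.
    """
    rows, cols = len(image), len(image[0])
    ah, aw = kernel_h // 2, kernel_w // 2  # offsets from anchor to element edge
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Gather every pixel covered by the element anchored at (r, c).
            covered = [
                image[rr][cc]
                for rr in range(max(0, r - ah), min(rows, r + ah + 1))
                for cc in range(max(0, c - aw), min(cols, c + aw + 1))
            ]
            # The anchor pixel is replaced by the maximum covered value.
            out[r][c] = max(covered)
    return out


# A single bright pixel grows into a 3x3 bright block.
page = [
    [0, 0, 0, 0, 0],
    [0, 0, 255, 0, 0],
    [0, 0, 0, 0, 0],
]
dilated = dilate(page)
```

In practice OpenCV's own dilation routine would be used on the page images; the loop above only makes the rule explicit.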
Image morphology rests on two basic operations: dilation and its counterpart, erosion.
Dilation and erosion serve a variety of purposes, mainly: eliminating noise; isolating individual image elements and joining adjacent elements in the image; finding prominent maximum or minimum regions in the image; and computing the gradient of the image.
In morphology, the structuring element is the most important and most basic concept. Its role in a morphological transformation is equivalent to that of the filtering window in signal processing. Denoting the structuring element by B and its translate to the point x by B(x), the dilation and erosion of a set E in the working space are defined as follows.
Dilation:
$$E \oplus B = \{\, x : B(x) \cap E \neq \emptyset \,\}$$
Erosion:
$$E \ominus B = \{\, x : B(x) \subseteq E \,\}$$
That is, the dilation of E by B is the set of all points x for which the translated element B(x) has a non-empty intersection with E, and the erosion of E by B is the set of all points x for which the translated element B(x) is contained in E.
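The two set definitions above translate almost word for word into code. The sketch below is illustrative only; the names `translate`, `dilate_set`, `erode_set` and the finite `domain` of candidate points are assumptions introduced here, and points are modelled as (row, column) tuples.

```python
def translate(b, x):
    """Return the structuring element b shifted so that its origin sits at x."""
    return {(x[0] + dr, x[1] + dc) for dr, dc in b}


def dilate_set(e, b, domain):
    """All points x in domain whose translated element B(x) meets E."""
    return {x for x in domain if translate(b, x) & e}


def erode_set(e, b, domain):
    """All points x in domain whose translated element B(x) lies inside E."""
    return {x for x in domain if translate(b, x) <= e}


# Working space: a 5x5 grid. E is the central 3x3 square.
domain = {(r, c) for r in range(5) for c in range(5)}
E = {(r, c) for r in range(1, 4) for c in range(1, 4)}
# Horizontal 1x3 structuring element with its origin at the centre.
B = {(0, -1), (0, 0), (0, 1)}

dilated = dilate_set(E, B, domain)  # E widened by one column on each side
eroded = erode_set(E, B, domain)    # only the middle column of E survives
```

With a horizontal element, dilation merges characters on the same text line into wider blobs, which is consistent with using dilation to make text regions easier to locate.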
The OpenCV used in step 2 comprises no fewer than five hundred functions, and the body of OpenCV is divided into five modules: the CV module, the MLL module, the HighGUI module, the CxCore module and the CvAux module. The CV module contains image processing and vision algorithms; the MLL module contains statistical classifiers; the HighGUI module handles the GUI and the input and output of images and videos; the CxCore module contains basic structures and algorithms as well as XML support and drawing functions; the CvAux module holds algorithms and functions that are due to be retired.
OpenCV (Open Source Computer Vision Library) is an open-source computer vision library. It is written in C/C++, runs on operating systems such as Linux, Windows and Mac, and also provides interfaces for Python, Ruby, MATLAB and other languages.
Step 3: call the open-source Tesseract OCR API to perform character recognition. Tesseract is an OCR recognition tool and engine that can detect text information in an image, and as an open-source optical character recognition (OCR) project it can also recognize image verification codes. For example, given a text picture in TIF format, Tesseract recognizes the characters in the picture and writes them to a text file, with accurate results. To recognize text pictures in other languages, the corresponding language data file must be downloaded. Its characteristics are: it is open-source; it is lightweight; it can be trained, so that a user can build up a custom set of common character strings; and it supports layout analysis.
Tesseract is available directly from many Linux distributions. The package is generally named 'tesseract' or 'tesseract-ocr' and can be found by searching the distribution's software repositories. Packages of language training data are also provided (search the repositories). Alternatively, a user can train Tesseract or download the corresponding training data, decompress it, and copy the .traineddata file into the 'tessdata' directory.
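For the command-line route to Tesseract, a call can be wrapped as in the following sketch. It assumes the `tesseract` executable is installed and on the PATH, and uses the standard `tesseract <image> <outputbase> -l <lang>` form; the helper names and the file names `page_001.png` / `page_001` are hypothetical.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", image_path, output_base, "-l", lang]


def run_tesseract(image_path, output_base, lang="eng"):
    """Run Tesseract and return the path of the text file it writes.

    Raises RuntimeError if the executable cannot be found, and
    CalledProcessError if Tesseract exits with an error.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    return output_base + ".txt"  # Tesseract appends .txt to the output base


# Hypothetical page image; chi_sim selects the Simplified Chinese language data.
cmd = build_tesseract_command("page_001.png", "page_001", lang="chi_sim")
```

The language code passed with `-l` must match a .traineddata file present in the 'tessdata' directory.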
There are two ways to call Tesseract OCR in step 3. One is the CMD way, in which Tesseract is executed on the command line; the other is through an API: tess4j is a Java API that wraps Tesseract, and its JAR is brought in as a dependency.
Step 4: extract the longest common substring of the recognition result and the rule-based extraction result, and use it to prune the redundant titles produced by the rule-based extraction. The specific method of step 4 is as follows.
Step A: read the document containing the rule-based extraction result line by line. Because that document holds one title per line, operating on lines delimited by the line-feed character is the best choice. readLine() of BufferedReader in Java IO is a commonly used method for reading data from a stream: it ends the line when it reads a line-feed '\n' or carriage-return '\r' character and returns the data of that line as a string, and it returns null once all data has been read. With this method, one title line is returned and processed at a time.
Step B: extract the maximum common substring. The longest common substring is extracted between each title line that is read and the result extracted with OpenCV. Some redundant titles that satisfy the title rule appear only as garbled text in the OpenCV-based result, so no common substring is found when they are matched and they can be filtered out of the final result, which gives a more accurate text framework extraction. The longest common substring can be obtained with a brute-force matching algorithm. Let S be the character string extracted from the shorter text (the catalog), T the recognized text, i the number of characters dropped from S in the current round, begin the start position of the substring taken from S, and end its end position, with 0 ≤ i ≤ L[S] and, initially, begin = 0 and end = L[S] - i [24]. The process is as follows:
a. initialize the variable i to 0;
b. initialize begin = 0 and end = L[S] - i;
c. obtain the continuous substring S[begin..end] from position begin to position end of S;
d. compare the substring S[begin..end] obtained in step c with the text T by calling contains() in Java to judge whether T contains S[begin..end]; if so, return S[begin..end] as the function value;
e. increase begin and end by a step of 1 and repeat from step c until end > L[S];
f. when end > L[S], increase i by 1 and repeat from step b until step d returns successfully, or return null when i equals L[S].
Step C: output the new file for reading.
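Steps A to C can be sketched as follows. This is a hedged illustration in Python rather than the Java the description refers to: `in` stands in for Java's contains(), and the names `longest_common_substring`, `filter_titles` and the `min_length` threshold are assumptions introduced here (the source does not say how short a match may be before a title is discarded, nor whether the title or the matched substring is written out; this sketch keeps the title).

```python
import io


def longest_common_substring(s, t):
    """Brute-force search following steps a-f: longest windows of s are tried first."""
    n = len(s)
    for i in range(n):                 # i = number of characters dropped from s
        length = n - i                 # window length for this round
        for begin in range(i + 1):     # slide the window one position at a time
            candidate = s[begin:begin + length]
            if candidate in t:         # counterpart of Java's contains()
                return candidate
    return None                        # i reached len(s): nothing in common


def filter_titles(rule_titles, recognized_text, min_length=2):
    """Keep rule-based titles that share a long enough substring with the OCR text."""
    kept = []
    for line in rule_titles:           # step A: one title per line
        title = line.rstrip("\r\n")
        if not title:
            continue
        common = longest_common_substring(title, recognized_text)  # step B
        if common is not None and len(common) >= min_length:
            kept.append(title)
    return kept


# Hypothetical inputs: three rule-based candidates, one of them a page footer.
rule_based = io.StringIO("1 Introduction\nPage 3 of 20\n2 Methods\n")
recognized = "1 Introduction 2 Methods"
kept = filter_titles(rule_based, recognized, min_length=8)
new_file = "\n".join(kept)             # step C: content of the new output file
```

The brute-force search runs in roughly cubic time in the length of S, which is acceptable here because each S is a single title line.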
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (8)

1. A text layout identification method based on the OpenCV library, comprising the following specific steps:
step 1, converting the required PDF file into a set of pictures through Smallpdf, one picture per PDF page;
step 2, performing a dilation operation on each picture using OpenCV;
step 3, calling the open-source Tesseract OCR API to perform character recognition;
step 4, extracting the longest common substring of the recognition result and the rule-based extraction result, and pruning the redundant titles produced by the rule-based extraction.
2. The OpenCV-library-based text layout recognition method as claimed in claim 1, wherein: the subsequent operations work on the digital image converted from the PDF file in step 1; the digital image is an image represented as a two-dimensional array whose unit is the pixel, and the pixel is the basic element of the digital image, obtained by discretizing continuous space when an analog image is digitized.
3. The OpenCV-library-based text layout recognition method as claimed in claim 1, wherein: in step 2, the dilation operation convolves the image with a structuring element of arbitrary shape, usually a square or a circle; the structuring element is slid across the image, the maximum pixel value within the area it covers is extracted, and the pixel at the anchor position is replaced with that maximum.
4. The OpenCV-library-based text layout recognition method as claimed in claim 1, wherein: in step 2, OpenCV comprises no fewer than five hundred functions, and the body of OpenCV is divided into five modules: the CV module, the MLL module, the HighGUI module, the CxCore module and the CvAux module.
5. The OpenCV-library-based text layout recognition method as claimed in claim 4, wherein: the CV module contains image processing and vision algorithms; the MLL module contains statistical classifiers; the HighGUI module handles the GUI and the input and output of images and videos; the CxCore module contains basic structures and algorithms as well as XML support and drawing functions; the CvAux module holds algorithms and functions that are due to be retired.
6. The OpenCV-library-based text layout recognition method as claimed in claim 1, wherein there are two ways to call Tesseract OCR in step 3: one is the CMD way, in which Tesseract is executed on the command line; the other is through an API, where tess4j is a Java API that wraps Tesseract and its JAR is brought in as a dependency.
7. The OpenCV-library-based text layout recognition method according to claim 1, wherein the specific method in step 4 is as follows:
A. reading the document containing the rule-based extraction result line by line;
B. extracting the maximum common substring;
C. outputting the new file for reading.
8. The OpenCV-library-based text layout recognition method as claimed in claim 7, wherein in step A, because the title document produced by the rule-based extraction holds one title per line, operating on lines delimited by the line-feed character is the best choice.
CN201911059127.XA 2019-11-01 2019-11-01 Text layout identification method based on opencv library Pending CN110889401A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911059127.XA CN110889401A (en) 2019-11-01 2019-11-01 Text layout identification method based on opencv library

Publications (1)

Publication Number Publication Date
CN110889401A true CN110889401A (en) 2020-03-17

Family

ID=69746735

Country Status (1)

Country Link
CN (1) CN110889401A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328825A (en) * 2020-10-15 2021-02-05 苏州零泉科技有限公司 Picture construction method based on natural language processing

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287784A (en) * 2019-05-20 2019-09-27 暨南大学 A kind of annual report text structure recognition methods

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 2020-03-17)