CN112699712A - Document image region separation method and device and storage medium - Google Patents

Document image region separation method and device and storage medium

Info

Publication number
CN112699712A
Authority
CN
China
Prior art keywords
connected region
regions
region
text
condition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911008335.7A
Other languages
Chinese (zh)
Inventor
王祺尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201911008335.7A priority Critical patent/CN112699712A/en
Publication of CN112699712A publication Critical patent/CN112699712A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/28Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet
    • G06V30/287Character recognition specially adapted to the type of the alphabet, e.g. Latin alphabet of Kanji, Hiragana or Katakana characters

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Character Input (AREA)

Abstract

The invention provides a method and a device for separating regions of a document image, and a storage medium. The method comprises: preprocessing a document image to obtain a binarized image; analyzing the connected regions of the binarized image to obtain a first connected region set; dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of the connected regions in the first connected region set, wherein the second connected region set comprises connected regions of non-text elements and the third connected region set comprises connected regions of text elements; and further determining the connected regions of text elements from the third connected region set according to the attribute characteristics and adjacent position information of the connected regions in the third connected region set. The method can extract more non-text elements from the text region, and improves the accuracy of text/non-text separation of the document image.

Description

Document image region separation method and device and storage medium
Technical Field
The present invention relates to the field of layout analysis technologies, and in particular, to a method and an apparatus for separating regions of a document image, and a storage medium.
Background
With the popularization of electronic reading materials such as electronic periodicals and electronic books, more and more traditional paper reading materials are converted into digital images through image acquisition equipment such as scanners and cameras, and these digital images are shared over networks. Such digital images, which primarily carry document content such as text, pictures, and tables, are referred to as document images. A document image is derived from a paper file, has intuitive content, is convenient to carry and transmit, and is widely used in various industries.
Layout analysis of a document image mainly uses a computer to automatically process and divide the layout data of the document image, and to identify the position and attribute of regions such as text, pictures, figures, and tables on the layout. Current methods for separating regions of document images mainly fall into the following two types. The first is a pixel-by-pixel judgment method: for each pixel point in an image, the edge intensity value of the pixel point is calculated from its gray value; a binary image corresponding to the image to be processed is then determined from the edge intensity values; a text region judgment value is determined for each pixel point from the binary image and the gray value of the pixel point; and finally the text region and the non-text region of the image are determined from the judgment values. The second is text detection based on deep learning, which uses a deep learning network to detect characters in an image, for example with detection methods such as Fast RCNN, EAST, and TextBoxes++, and then uses the detected text as the text region.
The first method computes on a single pixel and its surrounding pixel values, so it attends only to local parts of the image and cannot grasp the overall structure of the image or the positional relationships between elements; its region separation has low accuracy and robustness. The second method can only detect characters, and cannot distinguish characters inside pictures and tables from the paragraph text of a normal document.
Disclosure of Invention
The invention provides a method and a device for separating regions of a document image, and a storage medium, which improve the accuracy of text/non-text separation of the document image.
The first aspect of the present invention provides a method for separating regions of a document image, including:
preprocessing a document image to obtain a binary image;
analyzing a connected region of the binarized image to obtain a first connected region set;
acquiring the attribute characteristics of each connected region in the first connected region set, and dividing the first connected region set into a second connected region set and a third connected region set according to those attribute characteristics; the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements;
and acquiring the attribute characteristics and the adjacent position information of each connected region in the third connected region set, and determining the connected regions of text elements from the third connected region set according to them.
Optionally, the attribute characteristics of each of the connected regions include the number of pixels, the pixel density, the aspect ratio of each of the connected regions, and a first number of other connected regions nested in a minimum bounding rectangle of each of the connected regions.
In a possible implementation manner, the dividing, according to an attribute feature of each connected region in the first connected region set, the first connected region set into a second connected region set and a third connected region set includes:
and judging whether the attribute characteristics of each connected region in the first connected region set meet a first non-text element condition, and dividing the connected regions of which the attribute characteristics meet the first non-text element condition into a second connected region set.
In one possible implementation, the first non-text element condition includes at least one of:
the number of pixels of the connected area is less than the preset number of pixels;
the pixel density of the connected region is less than the preset pixel density;
the aspect ratio of the connected region is smaller than the preset aspect ratio;
the first number of other connected regions nested within the smallest circumscribed rectangle of connected regions is greater than a first value.
In a possible implementation manner, the obtaining adjacent position information of each connected region in the third connected region set includes:
performing blank area analysis on each connected area in the third connected area set to determine adjacent connected areas corresponding to each connected area in the third connected area set;
and acquiring adjacent position information corresponding to each connected region in the third connected region set, wherein the adjacent position information comprises the distance between each connected region and the adjacent connected region corresponding to each connected region, and the number of left connected regions and the number of right connected regions of each connected region.
In a possible implementation manner, the determining a connected region of a text element from the third connected region set according to the attribute feature and the adjacent position information of each connected region in the third connected region set includes:
judging whether the attribute characteristics of each connected region in the third connected region set meet a second non-text element condition or not;
judging whether the adjacent position information of each connected region in the third connected region set meets a third non-text element condition;
determining, as connected regions of text elements, the connected regions in the third connected region set that do not satisfy the second non-text element condition, together with the connected regions that satisfy the second non-text element condition but do not satisfy the third non-text element condition.
In one possible implementation, the second non-text element condition includes:
a first condition and a second condition; or
the first condition and a third condition; or
the first condition, the second condition, and the third condition;
wherein the first condition is $A_i = \max(\Omega_1) \cap A_i > k_1 \times \mathrm{median}(\Omega_1)$; the second condition is $H_i = \max(\Omega_2) \cap H_i > k_2 \times \mathrm{median}(\Omega_2)$; and the third condition is $W_i = \max(\Omega_3) \cap W_i > k_3 \times \mathrm{median}(\Omega_3)$;
where
Figure BDA0002243441410000031
$\Omega_1$ denotes the set of pixel counts of the connected regions in the third connected region set, $\Omega_2$ denotes the set of heights of the connected regions in the third connected region set, $\Omega_3$ denotes the set of widths of the connected regions in the third connected region set, mean denotes taking the average value, median denotes taking the median, $A_i$ denotes the number of pixels of the $i$-th connected region in the third connected region set, $H_i$ denotes the height of the minimum bounding rectangle of the $i$-th connected region in the third connected region set, and $W_i$ denotes the width of the minimum bounding rectangle of the $i$-th connected region in the third connected region set.
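The second non-text element condition can be illustrated with a minimal Python sketch. It reads the symbol ∩ in the conditions as a logical "and", and it notes that the three listed combinations reduce to "first condition and (second condition or third condition)". The coefficients k1, k2, and k3 are defined by a formula image that is not reproduced in the text, so the sketch takes them as plain inputs; the function names and the sample values are hypothetical.

```python
from statistics import median


def meets(value, values, k):
    """True if `value` is the maximum of `values` and exceeds k times their median."""
    return value == max(values) and value > k * median(values)


def is_second_non_text(i, areas, heights, widths, k1, k2, k3):
    """Second non-text element condition for region i of the third set.

    areas, heights, widths play the roles of Omega_1, Omega_2, Omega_3.
    k1, k2, k3 are assumed inputs; their definition is not given in the text.
    """
    first = meets(areas[i], areas, k1)
    second = meets(heights[i], heights, k2)
    third = meets(widths[i], widths, k3)
    # (1 and 2) or (1 and 3) or (1 and 2 and 3) == 1 and (2 or 3)
    return first and (second or third)


# Hypothetical data: three character-sized regions and one large region.
areas = [40, 42, 38, 900]
heights = [10, 11, 10, 60]
widths = [8, 9, 8, 9]
print(is_second_non_text(3, areas, heights, widths, 3.0, 3.0, 3.0))
print(is_second_non_text(0, areas, heights, widths, 3.0, 3.0, 3.0))
```

In this sample only the large region is both the maximum of the set and well above the median, so only it is flagged.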
In one possible implementation, the third non-text element condition includes:
the distance between each connected region and its corresponding adjacent connected region is greater than or equal to a preset distance; and/or
the second number of adjacent connected regions corresponding to each connected region is greater than or equal to a second value, the second number being the maximum of the number of left connected regions and the number of right connected regions.
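As a rough illustration, the third non-text element condition and the text selection rule stated earlier can be sketched in Python as follows. The text joins the two tests with "and/or", so the sketch exposes both readings through a hypothetical `require_both` flag. The thresholds `t_distance` and `t_count` stand for the preset distance and the second value, whose magnitudes are not given here.

```python
def is_third_non_text(distance, num_left, num_right,
                      t_distance, t_count, require_both=False):
    """Third non-text element condition on adjacent position information.

    distance: distance between the region and its adjacent region.
    num_left / num_right: numbers of left and right connected regions.
    t_distance, t_count: assumed thresholds (values not given in the text).
    """
    far = distance >= t_distance
    # The "second number" is the larger of the left and right counts.
    crowded = max(num_left, num_right) >= t_count
    return (far and crowded) if require_both else (far or crowded)


def is_text(second_non_text, third_non_text):
    """Text selection rule: keep a region unless it meets both conditions.

    Regions failing the second condition are text; regions meeting the
    second but not the third condition are also text.
    """
    return (not second_non_text) or (not third_non_text)


print(is_third_non_text(50, 1, 0, t_distance=30, t_count=5))
print(is_text(second_non_text=True, third_non_text=False))
```

A region is therefore discarded as non-text only when both the second and the third conditions hold.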
In one possible implementation, the method further includes:
acquiring a first pixel number of a minimum bounding rectangle of a connected region overlapped with the connected region of the non-text element;
acquiring a second pixel number of the connected region of the non-text element after expansion operation;
and if the first pixel quantity is larger than the second pixel quantity, determining a connected region overlapped with the connected region of the non-text element as the connected region of the text element.
A second aspect of the present invention provides a region separating apparatus of a document image, including:
the image preprocessing module is used for preprocessing the document image to obtain a binary image;
the connected region analysis module is used for analyzing the connected regions of the binary image to obtain a first connected region set;
an obtaining module, configured to obtain attribute features of each connected region in the first connected region set;
the connected region dividing module is used for dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of each connected region in the first connected region set; the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements;
the acquiring module is further configured to acquire attribute features and adjacent position information of each connected region in the third connected region set;
and the connected region dividing module is further used for determining the connected regions of the text elements from the third connected region set according to the attribute characteristics and the adjacent position information of each connected region in the third connected region set.
A third aspect of the present invention provides a region separating apparatus for a document image, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method according to any one of the first aspect of the invention.
A fourth aspect of the invention provides a computer readable storage medium having stored thereon a computer program for execution by a processor to perform the method according to any one of the first aspect of the invention.
The embodiment of the invention provides a method and a device for separating regions of a document image, and a storage medium. The method comprises: preprocessing a document image to obtain a binarized image; analyzing the connected regions of the binarized image to obtain a first connected region set; dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of the connected regions in the first connected region set, wherein the second connected region set comprises connected regions of non-text elements and the third connected region set comprises connected regions of text elements; and further determining the connected regions of text elements from the third connected region set according to the attribute characteristics and adjacent position information of the connected regions in the third connected region set. The method can extract more non-text elements from the text region, and improves the accuracy of text/non-text separation of the document image.
Drawings
FIG. 1 is a flowchart illustrating a method for separating regions of a document image according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a pixel adjacency in an image according to an embodiment of the present invention;
FIG. 3 is a labeled diagram of connected component analysis provided in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of the connected regions adjacent to a given connected region according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a position relationship between two connected regions according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a blank area analysis of a Chinese document according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating a text non-text separation result of a region separation method for a document image according to an embodiment of the present invention;
FIG. 8 is a flowchart illustrating a method for separating regions of a document image according to another embodiment of the present invention;
FIG. 9 is a functional structure diagram of an apparatus for separating regions of a document image according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a hardware structure of a region separating apparatus for document images according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terms "comprising" and "having," and any variations thereof, in the description and claims of this invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference throughout this specification to "one embodiment" or "another embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in some embodiments" or "in this embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The method for separating the regions of the document image, provided by the embodiment of the invention, is used for processing the connected regions in the document image as a unit, separating the text and the non-text regions in the document image based on the attribute characteristics of the connected regions and the position relation between the connected regions, effectively separating the regions of various document layouts and documents in different languages, and has high separation accuracy and stable effect. In addition, the region separation method for the document image provided by the embodiment can not only distinguish the text region and the non-text region, but also further remove more non-text elements from the connected region overlapped with the connected region of the non-text elements, and finally obtain an accurate text region.
The following describes the region separation method of a document image in detail with reference to several specific embodiments, which may be combined with each other, and the description of the same or similar contents in different embodiments is not repeated.
Fig. 1 is a flowchart illustrating a method for separating regions of a document image according to an embodiment of the present invention. The method may be executed by any device capable of performing region separation on document images; the device may be implemented in software and/or hardware, and may be, for example, a camera or an intelligent analysis server. As shown in fig. 1, the method provided by this embodiment includes the following steps:
Step 101, preprocessing a document image to obtain a binarized image.
In this embodiment, the document image may be a photographed document or a scanned document, and the embodiment is not limited thereto. Specifically, the document image includes a text region and a non-text region. The text region includes text elements such as Chinese characters, English letters, numbers, symbols, and formulas; the non-text region includes pictures, table pictures, icons, and the like.
After the document image is acquired, preprocessing is carried out on the document image, and the preprocessing process comprises the steps of image binarization, rotation correction, perspective correction and the like.
The image binarization is a process of setting the gray value of a pixel point on a document image to be 0 or 255, that is, the whole image presents an obvious black-and-white effect. The binarization of the image greatly reduces the data volume in the image, and improves the speed of image processing.
In planar image processing, a captured or scanned document image is prone to tilt and distortion, caused by the lens angle or by the scanner failing to scan along the normal horizontal and vertical lines. Rotation correction, perspective correction, and the like are therefore applied to ensure that the document image to be processed is free of distortion such as rotation and bending, for example with a rotation angle of less than 2 degrees.
In this embodiment, the order of steps of preprocessing the document image is not limited. The document image can be firstly subjected to binarization processing, and then the binarized image is subjected to rotation correction and perspective correction; or the document image can be firstly subjected to rotation correction and perspective correction, and then the corrected document image is subjected to binarization processing.
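The binarization part of the preprocessing can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how the threshold is chosen, so a fixed global threshold of 128 is used as a hypothetical value, and the image is represented as a plain list of rows of gray values. Rotation and perspective correction are not covered.

```python
def binarize(gray, threshold=128):
    """Map each gray value to 0 or 255 using a fixed global threshold.

    gray: list of rows of integers in [0, 255].
    threshold: hypothetical value; the text does not specify its selection.
    """
    return [[255 if value >= threshold else 0 for value in row]
            for row in gray]


gray = [
    [12, 200, 130],
    [90, 127, 128],
]
print(binarize(gray))  # [[0, 255, 255], [0, 0, 255]]
```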
Step 102, analyzing the connected regions of the binarized image to obtain a first connected region set.
The connected region analysis is a process for extracting and labeling the connected regions in the binary image. The connected region is an image region formed by foreground pixel points which have the same pixel value and are adjacent in position in the image.
Fig. 2 is a schematic diagram of the adjacency relationships of pixels in an image according to an embodiment of the present invention. As shown in fig. 2, there are two common adjacency relationships between image pixels: 4-adjacency and 8-adjacency. This embodiment uses 8-adjacency to determine whether pixels belong to the same connected region.
In this embodiment, a sketch library in Python is used to label a connected region in an image. Fig. 3 is a labeled schematic diagram of connected component analysis according to an embodiment of the present invention. As shown in fig. 3, each connected region corresponds to a circumscribed rectangle.
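To make the labeling step concrete, the following sketch labels connected regions with 8-adjacency using only the Python standard library. It is a stand-in for the library call mentioned above, not a reproduction of it. Treating the value 0 (black) as the foreground is an assumption, and the function name `label_regions` is hypothetical.

```python
from collections import deque

# Offsets of the 8 neighbours of a pixel (8-adjacency).
NEIGHBOURS_8 = [(-1, -1), (-1, 0), (-1, 1),
                (0, -1),           (0, 1),
                (1, -1),  (1, 0),  (1, 1)]


def label_regions(binary, foreground=0):
    """Return the connected regions as lists of (row, col) pixels.

    binary: list of rows holding 0 or 255.
    foreground: assumed to be 0 (black); pass 255 for inverted images.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != foreground or seen[r][c]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in NEIGHBOURS_8:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx]
                            and binary[ny][nx] == foreground):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(pixels)
    return regions


image = [
    [0,   255, 255, 255],
    [255, 0,   255, 0],
    [255, 255, 255, 0],
]
regions = label_regions(image)
print(len(regions))  # 2: the diagonal pair is merged under 8-adjacency
```

With 4-adjacency the two diagonal pixels in the upper left would form separate regions, which is why the choice of adjacency matters for thin or slanted strokes.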
Let the $i$-th connected region be denoted $CC_i$. A connected region has the following attributes:
(1) $B(CC_i)$ denotes the circumscribed rectangle of $CC_i$, whose sides are parallel to the coordinate axes. $(Xl_i, Yl_i)$ and $(Xr_i, Yr_i)$ are the coordinates of the upper-left corner and the lower-right corner of the rectangle, respectively.
(2) $A_i$ denotes the number of pixels in $CC_i$.
(3) $A_i^B$ denotes the size of $B(CC_i)$; $W_i$ and $H_i$ denote the width and height of $B(CC_i)$, respectively.
(4) $\lambda_i$ denotes the density of $CC_i$, calculated as $\lambda_i = A_i / A_i^B$.
(5) $Inc_i$ denotes the number of other rectangles $B(CC_j)$, $j \neq i$, nested in $B(CC_i)$ (note: before computing $Inc_i$, the rectangles are sorted by their $Yl_i$ component, which greatly reduces computational complexity).
(6) $HW_i^{rate}$ denotes the aspect ratio of $CC_i$, calculated as $HW_i^{rate} = \min(W_i, H_i) / \max(W_i, H_i)$.
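The attributes listed above can be computed from the pixel list of a region, as in the following Python sketch. Two points are assumptions rather than statements from the text: the width and height of the rectangle are counted in pixels inclusive of both edges, and the nesting count uses a plain pairwise scan instead of the sorted approach mentioned in the note. All function and key names are hypothetical.

```python
def region_attributes(pixels):
    """Compute bounding box, pixel count, density and aspect ratio.

    pixels: list of (row, col) coordinates of one connected region.
    """
    rows = [p[0] for p in pixels]
    cols = [p[1] for p in pixels]
    xl, yl, xr, yr = min(cols), min(rows), max(cols), max(rows)
    width, height = xr - xl + 1, yr - yl + 1   # W_i, H_i (inclusive, assumed)
    area = len(pixels)                         # A_i
    box_area = width * height                  # A_i^B
    return {
        "box": (xl, yl, xr, yr),               # (Xl_i, Yl_i, Xr_i, Yr_i)
        "A": area,
        "W": width,
        "H": height,
        "density": area / box_area,            # lambda_i
        "hw_rate": min(width, height) / max(width, height),
    }


def nested_counts(boxes):
    """Inc_i: number of other boxes lying entirely inside box i.

    Simple O(n^2) scan; the sorting speed-up from the text is omitted.
    """
    def inside(inner, outer):
        return (outer[0] <= inner[0] and outer[1] <= inner[1]
                and inner[2] <= outer[2] and inner[3] <= outer[3])

    return [sum(1 for j, other in enumerate(boxes)
                if j != i and inside(other, box))
            for i, box in enumerate(boxes)]


# An L-shaped region of four pixels inside a 2 x 3 bounding box.
attrs = region_attributes([(0, 0), (1, 0), (2, 0), (2, 1)])
print(attrs["box"], attrs["A"], attrs["W"], attrs["H"])
print(nested_counts([(0, 0, 9, 9), (1, 1, 3, 3), (5, 5, 6, 6), (8, 8, 12, 12)]))
```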
It should be noted that, a connected region analysis is performed on the binarized image to obtain a first connected region set, where the first connected region set includes a plurality of connected regions including text regions and non-text regions, and the plurality of connected regions in the first connected region set need to be identified and divided according to steps 103 to 106, so as to obtain an accurate text region finally, and a specific implementation process is as follows.
Step 103, acquiring the attribute characteristics of each connected region in the first connected region set.
Referring to step 102, the attribute characteristics of each connected region in the first set of connected regions include the number of pixels, the pixel density, the aspect ratio of each connected region, and the first number of other connected regions nested in the minimum bounding rectangle of each connected region.
Step 104, dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of each connected region in the first connected region set.
And the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements.
Specifically, whether the attribute characteristics of each connected region in the first connected region set meet the first non-text element condition or not is judged, and the connected regions of which the attribute characteristics meet the first non-text element condition are divided into the second connected region set. The first non-text element condition is used to perform a preliminary filtering of non-text elements in the document image. If the connected region satisfies the first non-text element condition, then the connected region is considered to be a connected region of the non-text element.
Wherein the first non-text element condition comprises at least one of:
(1) The number of pixels of the connected region is less than the preset number of pixels. This can be expressed as $A_i < T_{area}$, where $T_{area}$ is the preset number of pixels, usually set to 5-7 pixels, e.g. $T_{area} = 6$ pixels; that is, a connected region with very few pixels is treated as non-text.
(2) The pixel density of the connected region is less than the preset pixel density. This can be expressed as $\lambda_i < T_{dens}$, where $T_{dens}$ is the preset pixel density, usually set within [0.05, 0.07], e.g. $T_{dens} = 0.06$. A connected region whose pixel density is too low may be noise, a diagonal element, a rectangular box, and so on.
(3) The aspect ratio of the connected region is less than the preset aspect ratio.
Specifically, when $H_i < W_i$ (the height of the connected region is less than its width), the condition is $HW_i^{rate} < T_{rate}^{1}$, where $T_{rate}^{1}$ is the first preset aspect ratio (the preset aspect ratio for horizontal filtering), usually set within [0.05, 0.07], e.g. $T_{rate}^{1} = 0.06$. In general, the width of a text element cannot be larger than 16.66 times its height.
When $H_i > W_i$ (the height of the connected region is greater than its width), the condition is $HW_i^{rate} < T_{rate}^{2}$, where $T_{rate}^{2}$ is the second preset aspect ratio (the preset aspect ratio for vertical filtering), usually set within [0.02, 0.04], e.g. $T_{rate}^{2} = 0.03$.
Typically, the preset aspect ratio for vertical filtering is stricter than that for horizontal filtering.
(4) The first number of other connected regions nested within the minimum bounding rectangle of the connected region is greater than a first value. This can be expressed as $Inc_i > T_{inside}$, where $T_{inside}$ is the first value, usually set to 4; that is, a connected region whose minimum bounding rectangle nests more than 4 other connected regions is considered non-text. This condition applies to both Latin and Chinese documents.
It should be noted that the preset parameters (including the preset number of pixels, the preset pixel density, the preset aspect ratios, and the first value) are empirical values obtained through extensive calculation and verification, and are suitable for documents of various types and resolutions.
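The four tests of the first non-text element condition can be combined as in the following Python sketch. The default thresholds are the example values quoted above. The dictionary keys and the function name are hypothetical, and applying the horizontal threshold when the width equals the height is an assumption, since the text only covers the two strict cases.

```python
def is_first_non_text(attrs, inc, t_area=6, t_dens=0.06,
                      t_rate_h=0.06, t_rate_v=0.03, t_inside=4):
    """True if a region meets at least one first non-text element condition.

    attrs: dict with A (pixel count), W, H and density of the region.
    inc:   Inc_i, the number of regions nested in its bounding rectangle.
    """
    width, height = attrs["W"], attrs["H"]
    hw_rate = min(width, height) / max(width, height)
    # Tall regions use the vertical threshold, others the horizontal one.
    # Using the horizontal threshold when W == H is an assumption.
    t_rate = t_rate_v if height > width else t_rate_h
    return (attrs["A"] < t_area
            or attrs["density"] < t_dens
            or hw_rate < t_rate
            or inc > t_inside)


character = {"A": 120, "W": 12, "H": 16, "density": 0.625}
ruler_line = {"A": 200, "W": 200, "H": 1, "density": 1.0}
frame = {"A": 400, "W": 100, "H": 100, "density": 0.04}
print(is_first_non_text(character, inc=0))   # False: kept as candidate text
print(is_first_non_text(ruler_line, inc=0))  # True: aspect ratio 0.005
print(is_first_non_text(frame, inc=10))      # True: low density, many nested
```

Regions for which the function returns True would go to the second connected region set; the rest remain in the third set for further checks.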
The purpose of this step is to eliminate the obvious noise or non-text elements in the binary image and obtain the preliminary separation result. It should be noted that, in addition to the connected regions of the text elements, the third connected region set also includes connected regions of non-text elements identified as text elements, that is, in the above process, there may be a case where the non-text elements are misjudged as text elements. In order to improve the accuracy of text non-text separation, step 105 and step 106 need to be further executed to further separate out non-text elements in the text area, which is described in the following.
Step 105, acquiring the attribute characteristics and the adjacent position information of each connected region in the third connected region set.
Referring to step 102, the attribute characteristics of each connected region in the third set of connected regions include the number of pixels, the pixel density, the aspect ratio of each connected region, and the first number of other connected regions nested in the minimum bounding rectangle of each connected region.
In this step, obtaining the adjacent position relationship of each connected region in the third connected region set includes the following steps: performing blank area analysis on each connected area in the third connected area set to determine adjacent connected areas corresponding to each connected area in the third connected area set; and acquiring adjacent position information corresponding to each connected region in the third connected region set.
The adjacent position information comprises the distance between each connected region and the adjacent connected region corresponding to each connected region, the left connected region number and the right connected region number of each connected region.
For ease of understanding, the blank area analysis will be described below with reference to fig. 4 and 5.
Fig. 4 is a schematic diagram of the connected regions adjacent to a given connected region according to an embodiment of the present invention. Referring to FIG. 4, in the blank area analysis described above, the following variables need to be calculated for a connected region $CC_i$:
(1) $LNN_i$ and $RNN_i$: $LNN_i$ denotes the left connected region of $CC_i$ (i.e., the left neighbor of $CC_i$), and $RNN_i$ denotes the right connected region of $CC_i$ (i.e., the right neighbor of $CC_i$).
In addition, $CC_i$ may have one or more left connected regions or right connected regions. When there are several, $LNN_i$ specifically denotes the left connected region closest to $CC_i$, and $RNN_i$ specifically denotes the right connected region closest to $CC_i$.
For example, in FIG. 4, $CC_3$ has $LNN_3 = \{CC_2\}$ and $RNN_3 = \{CC_4\}$; $CC_4$ has $LNN_4 = \{CC_3\}$,
Figure BDA0002243441410000103
It should be noted that the left and right connected regions of $CC_i$ must lie in the same row direction as $CC_i$.
(2) $LNWS_i$ and $RNWS_i$: $LNWS_i$ denotes the distance (blank size) between $CC_i$ and its left connected region, and $RNWS_i$ denotes the distance (blank size) between $CC_i$ and its right connected region. Let the left neighbor of $CC_i$ be $LNN_i=\{CC_j\}$ and the right neighbor be $RNN_i=\{CC_k\}$; then $LNWS_i$ and $RNWS_i$ are calculated as follows:
$LNWS_i = Xl_i - Xr_j$
$RNWS_i = Xl_k - Xr_i$
(3) $LN_i$ and $RN_i$: $LN_i$ denotes the set of other connected regions for which $CC_i$ is the right connected region; $RN_i$ denotes the set of other connected regions for which $CC_i$ is the left connected region.
Figure BDA0002243441410000101
Figure BDA0002243441410000102
For example, in Fig. 4, $CC_3$ has $LN_3=\{CC_1, CC_2\}$ and $RN_3=\{CC_4, CC_5\}$.
(4) $numLN_i$ and $numRN_i$: $numLN_i$ denotes the number of other connected regions for which $CC_i$ is the right connected region; $numRN_i$ denotes the number of other connected regions for which $CC_i$ is the left connected region.
$numLN_i = |LN_i|$
$numRN_i = |RN_i|$
For example, in Fig. 4, $CC_3$ has $numLN_3 = numRN_3 = 2$.
(5) $WS$: the set of all blank regions.
$WS = \{RNWS_i > 0 \mid CC_i \in CC_u\}$ or $WS = \{LNWS_i > 0 \mid CC_i \in CC_u\}$
where $CC_u$ denotes all connected regions in the document image.
By performing blank region analysis on each connected region in the third connected region set, the above variables are obtained for each connected region, which yields the adjacent position information of each connected region and its adjacent connected regions.
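The variables of the blank region analysis can be sketched in Python as follows, for the separated case only. This is a minimal sketch under several assumptions that are not stated in the text: each connected region is represented by its minimum bounding rectangle (the hypothetical `Box` type, with `xl` and `xr` as its left and right edges), two regions count as being in the same row when their vertical extents overlap, and $LN_i$ is taken to be the set of regions whose nearest right neighbor is $CC_i$ (the defining formulas are given only in figures that are not reproduced here).

```python
from typing import NamedTuple


class Box(NamedTuple):
    xl: int  # left edge of the minimum bounding rectangle
    yt: int  # top edge
    xr: int  # right edge
    yb: int  # bottom edge


def same_row(a, b):
    """Assumed row test: the vertical extents of the two boxes overlap."""
    return a.yt < b.yb and b.yt < a.yb


def nearest_left(i, boxes):
    """Index of LNN_i, or None if CC_i has no left neighbor."""
    me = boxes[i]
    cands = [j for j, b in enumerate(boxes)
             if j != i and same_row(me, b) and b.xr <= me.xl]
    return max(cands, key=lambda j: boxes[j].xr, default=None)


def nearest_right(i, boxes):
    """Index of RNN_i, or None if CC_i has no right neighbor."""
    me = boxes[i]
    cands = [j for j, b in enumerate(boxes)
             if j != i and same_row(me, b) and b.xl >= me.xr]
    return min(cands, key=lambda j: boxes[j].xl, default=None)


def blank_region_analysis(boxes):
    """Return per-region variables and the blank-size set WS."""
    n = len(boxes)
    lnn = [nearest_left(i, boxes) for i in range(n)]
    rnn = [nearest_right(i, boxes) for i in range(n)]
    result = []
    for i, box in enumerate(boxes):
        ln = {j for j in range(n) if rnn[j] == i}  # assumed definition of LN_i
        rn = {j for j in range(n) if lnn[j] == i}  # assumed definition of RN_i
        result.append({
            "LNN": lnn[i],
            "RNN": rnn[i],
            "LNWS": None if lnn[i] is None else box.xl - boxes[lnn[i]].xr,
            "RNWS": None if rnn[i] is None else boxes[rnn[i]].xl - box.xr,
            "LN": ln,
            "RN": rn,
            "numLN": len(ln),
            "numRN": len(rn),
        })
    # WS collects the positive right-neighbor blank sizes.
    ws = [r["RNWS"] for r in result if r["RNWS"] is not None and r["RNWS"] > 0]
    return result, ws
```

Under these assumptions, a tall region with two small regions stacked to its left gets `numLN` equal to 2, which is the kind of multi-line span that the later neighbor-count condition is meant to detect.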
The blank region analysis described above considers only the case in which connected regions in the document image are separated from each other, as shown in Fig. 4. However, connected regions in a document image may also overlap. Fig. 5 is a schematic diagram of the positional relationships between two connected regions provided in the embodiment of the present invention; as shown in Fig. 5, two connected regions may overlap each other.
Specifically, the overlap case of Fig. 5(b) may occur in an English document, and the overlap cases of Fig. 5(b), (c), and (d) may occur in a Chinese document. It should be noted that the case of Fig. 5(e) occurs in neither English nor Chinese documents, because in that case $CC_i$ and $CC_j$ would necessarily be joined into a single connected region.
Illustratively, the italic characters in Chinese and English have the overlapping condition of FIG. 5(b), the Chinese character "painting" has the overlapping condition of FIG. 5(c), and the Chinese character "Hui" has the overlapping condition of FIG. 5 (d).
For the overlap case of Fig. 5(b), the left-right positional relationship between $CC_i$ and $CC_j$ is determined: in Fig. 5(b), $CC_i$ is the left connected region of $CC_j$, $CC_j$ is the right connected region of $CC_i$, and the distance between them is set to 0. For the overlap case of Fig. 5(c), $CC_i$ and $CC_j$ are each other's left/right connected regions, and the distance between them is set to 0. For the overlap case of Fig. 5(d), nesting suppression is applied: $CC_j$ has no left/right connected region, and the distance between $CC_i$ and $CC_j$ is set to 0.
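The distance rule for overlapping regions can be illustrated with a small Python sketch. It assumes that the two regions are already known to lie in the same row, that each is given as a bounding rectangle `(xl, yt, xr, yb)`, and that overlap and nesting are judged on those rectangles; the function name and the labels it returns are hypothetical, and it does not distinguish the Fig. 5(b) case from the Fig. 5(c) case.

```python
def horizontal_relation(a, b):
    """Classify two boxes (xl, yt, xr, yb) and return (kind, distance).

    The distance is 0 for overlapping or nested boxes, following the
    rule that overlapping regions have their blank size set to 0.
    """
    axl, ayt, axr, ayb = a
    bxl, byt, bxr, byb = b
    a_in_b = bxl <= axl and byt <= ayt and axr <= bxr and ayb <= byb
    b_in_a = axl <= bxl and ayt <= byt and bxr <= axr and byb <= ayb
    if a_in_b or b_in_a:
        return "nested", 0
    if axl < bxr and bxl < axr:  # horizontal extents intersect
        return "overlapping", 0
    gap = bxl - axr if axr <= bxl else axl - bxr
    return "separated", gap
```

A caller would use the returned kind to decide whether the inner region of a nested pair should be excluded from the left/right neighbor search.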
It should be noted that, in the embodiment, when determining an adjacent connected region corresponding to a connected region, not only the position relationship of the connected regions that are separated from each other is considered, but also the position relationship of the connected regions that are overlapped left and right, overlapped up and down, and nested is considered, so that a more accurate determination basis is provided for subsequent condition determination, and therefore, the english letters or chinese characters in the chinese and english documents are prevented from being erroneously determined as non-text elements.
Step 106: determine the connected regions of the text elements from the third connected region set according to the attribute characteristics and the adjacent position information of each connected region in the third connected region set.
In this embodiment, the connected regions in the third set of connected regions may include non-text elements that were not identified in step 104, such as small icons, noise, etc. near the text elements. To further improve the accuracy of text non-text separation, the following process may be used to find and separate non-text elements near the text region:
judging whether the attribute characteristics of each connected region in the third connected region set meet a second non-text element condition or not; judging whether the adjacent position information of each connected region in the third connected region set meets a third non-text element condition; and determining the connected regions which do not satisfy the second non-text element condition and the connected regions which satisfy the second non-text element condition but do not satisfy the third non-text element condition in the third connected region set as the connected regions of the text elements.
The process can further improve the accuracy of text non-text separation.
The second non-text element condition comprises: the first condition and the second condition; or the first condition and the third condition; or the first condition, the second condition, and the third condition.
The first condition can be expressed as $A_i = \max(\Omega_1) \cap A_i > k_1 \times \mathrm{median}(\Omega_1)$;
the second condition can be expressed as $H_i = \max(\Omega_2) \cap H_i > k_2 \times \mathrm{median}(\Omega_2)$;
the third condition can be expressed as $W_i = \max(\Omega_3) \cap W_i > k_3 \times \mathrm{median}(\Omega_3)$,
where:
Figure BDA0002243441410000121
$\Omega_1$ denotes the set of pixel counts of the connected regions in the third connected region set;
$\Omega_2$ denotes the set of heights of the connected regions in the third connected region set;
$\Omega_3$ denotes the set of widths of the connected regions in the third connected region set;
mean denotes the mean, and median denotes the median;
$A_i$ denotes the number of pixels of $CC_i$ (the i-th connected region) in the third connected region set;
$H_i$ denotes the height of the minimum bounding rectangle of $CC_i$ in the third connected region set;
$W_i$ denotes the width of the minimum bounding rectangle of $CC_i$ in the third connected region set.
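The three conditions above can be sketched in Python as follows. The coefficients $k_1$, $k_2$, and $k_3$ are defined in a figure that is not reproduced in this text, so the sketch takes them as caller-supplied parameters; that choice, the function names, and the list-based inputs are assumptions made for illustration.

```python
from statistics import median


def is_outlier_max(value, values, k):
    """True if value is the maximum of values and exceeds k times the median."""
    return value == max(values) and value > k * median(values)


def meets_second_condition(i, areas, heights, widths, k1, k2, k3):
    """Second non-text element condition for region i.

    Holds when the first condition is met together with the second
    condition, the third condition, or both.
    """
    first = is_outlier_max(areas[i], areas, k1)
    second = is_outlier_max(heights[i], heights, k2)
    third = is_outlier_max(widths[i], widths, k3)
    return first and (second or third)
```

Because each condition requires the region to be the maximum of its set, at most the largest regions in the third connected region set can become candidates in a single pass.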
The third non-text element condition includes:
the distance between each connected region and its corresponding adjacent connected region is greater than or equal to a preset distance; and/or
the second number of adjacent connected regions corresponding to each connected region is greater than or equal to a second value, where the second number is the maximum of the left connected region number and the right connected region number.
The third non-text element condition described above can be expressed by the following two conditional formulas:
Conditional formula 1: $\min(LNWS_i, RNWS_i) \geq 10 \times \max(\mathrm{median}_{WS}, \mathrm{mean}_{WS})$
Conditional formula 2: $\max(numLN_i, numRN_i) \geq 3$
The condition formula 1 can be understood as that if the distance between a connected region in the fourth connected region set and its adjacent connected region is too large, the connected region is identified as a connected region of a non-text element; conditional formula 2 can be understood that if the height of the connected component in the fourth connected component set is too large, and the connected component spans three lines of text, the connected component is regarded as a connected component of the non-text element.
As long as at least one of the two conditional formulas above is satisfied, the corresponding connected region can be identified as a connected region of a non-text element.
Correspondingly, the connected regions with the distance smaller than the preset distance and the second number smaller than the second value are determined as the connected regions of the text elements.
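A minimal Python sketch of the two conditional formulas is given below. It assumes that both blank sizes are available as numbers for the region being tested and that `ws` is the non-empty list of all blank sizes; how a region with a missing left or right neighbor should be handled is not specified in the text and is left to the caller. The function and parameter names are hypothetical.

```python
from statistics import mean, median


def meets_third_condition(lnws, rnws, num_ln, num_rn, ws,
                          distance_factor=10, min_neighbors=3):
    """Third non-text element condition: either conditional formula holds."""
    # Conditional formula 1: the region is far from both of its neighbors.
    far_from_neighbors = min(lnws, rnws) >= distance_factor * max(median(ws), mean(ws))
    # Conditional formula 2: the region is the neighbor of many other regions.
    many_neighbors = max(num_ln, num_rn) >= min_neighbors
    return far_from_neighbors or many_neighbors
```

The defaults 10 and 3 are the constants appearing in conditional formulas 1 and 2.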
Optionally, for a Chinese document, conditional formula 2 may be changed to $\max(numLN_i, numRN_i) > 4$ to prevent misrecognition of the Chinese document. Fig. 6 is a schematic diagram of the blank region analysis of a Chinese document according to an embodiment of the present invention. As shown in Fig. 6, when the cases of Fig. 5(c) and 5(d) are not considered, the "received" character in Fig. 6(a) has 5 connected regions, the "body" character has 1 connected region, and the left connected region of 4 of the connected regions of the "received" character is the "body" character; according to $\max(numLN_i, numRN_i) > 3$, the "body" character would be misidentified as a connected region of a non-text element. When Fig. 5(c) and 5(d) are considered, the connected regions inside the "received" character in Fig. 6(a) are left and right connected regions of one another, only one connected region of the "received" character has the "body" character as its left connected region, and $\max(numLN_i, numRN_i) > 3$ is not satisfied, so the "body" character is not misidentified as a connected region of a non-text element.
When Fig. 5(b) and 5(d) are not considered, the "asking" character in Fig. 6(b) has 4 connected regions in total, the "to" character has 1 connected region, and the left connected region of all 4 connected regions of the "asking" character is the "to" character; according to $\max(numLN_i, numRN_i) > 3$, the "to" character would be misidentified as a connected region of a non-text element. When Fig. 5(b) and 5(d) are considered, the connected regions inside the "asking" character in Fig. 6(b) are left and right connected regions of one another, only one connected region of the "asking" character has the "to" character as its left connected region, and $\max(numLN_i, numRN_i) > 3$ is not satisfied, so the "to" character is not misidentified as a connected region of a non-text element.
In summary, the blank region analysis can avoid the Chinese characters being misjudged as the connected regions of the non-text elements.
Optionally, in some embodiments, if a connected region in the third connected region set satisfies both the first condition and the second condition, or both the first condition and the third condition, the connected region may be taken as a connected region of a candidate non-text element. If a connected region in the third connected region set satisfies none of the condition combinations of the second non-text element condition, it is determined to be a connected region of a text element.
Further, the connected region of the candidate non-text element may be further analyzed to determine whether the adjacent position relationship of the connected region of the candidate non-text element satisfies a third non-text element condition, if the third non-text element condition is not satisfied, the connected region is determined as the connected region of the text element, and if the third non-text element condition is satisfied, the connected region is determined as the connected region of the non-text element.
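The two-stage decision just described can be summarized in a short Python sketch. It assumes that the outcomes of the second and third non-text element conditions have already been computed for a region and are passed in as booleans; the function name and the string labels are hypothetical.

```python
def classify_region(meets_second, meets_third):
    """Two-stage decision for one region of the third connected region set."""
    if not meets_second:
        # Not a candidate non-text element: keep it as text.
        return "text"
    # Candidate non-text element: confirm with the adjacent position information.
    return "non-text" if meets_third else "text"
```

Only regions that pass both stages are removed from the text region, so a large but normally spaced character remains classified as text.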
Fig. 7 is a schematic diagram of a text non-text separation result of the region separation method for the document image according to the embodiment of the present invention. As shown in fig. 7, fig. 7(a) is an original document image, fig. 7(b) is a connected region of separated non-text elements, and fig. 7(c) is a text region, it can be seen that by the region separation method provided in this embodiment, more non-text elements (for example, bar-shaped non-text elements between the text regions in fig. 7 (a)) can be accurately extracted from the text region, thereby improving the accuracy of text non-text separation of the document image.
The method for separating the regions of the document image, provided by the embodiment of the invention, includes the steps of preprocessing the document image to obtain a binary image, analyzing the connected regions of the binary image to obtain a first connected region set, and dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of each connected region in the first connected region set, wherein the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements and connected regions of non-text elements identified as the text elements. And further determining the connected regions of the text elements from the third connected region set according to the attribute characteristics and the adjacent position information of each connected region in the third connected region set. The method can extract more non-text elements from the text area, and improves the accuracy of text non-text separation of the document image.
Fig. 8 is a flowchart illustrating a method for separating regions of a document image according to another embodiment of the present invention. On the basis of the embodiment shown in fig. 1, as shown in fig. 8, after step 106, the method further includes:
step 201, a first pixel number of a minimum bounding rectangle of a connected region overlapped with a connected region of a non-text element is obtained.
The connected component of the non-text element in this embodiment includes: the connected component of the non-text element in the second set of connected components determined in step 104 of the above embodiment, and the connected component of the non-text element determined in step 106 of the above embodiment.
A connected region overlapping a connected region of a non-text element refers to a connected region within a preset range near the connected region of the non-text element, including the cases of Fig. 5(a), 5(b), 5(c), and 5(d). It should be noted that when the distance between the two connected regions in Fig. 5(a) is less than or equal to the preset distance, the two connected regions can be regarded as overlapping.
After a connected region overlapping the connected region of the non-text element is determined, the number of pixels of the minimum bounding rectangle corresponding to that connected region, i.e., the first pixel number, is acquired.
Step 202, obtaining a second number of pixels after the expansion operation is performed on the connected region of the non-text element.
In this embodiment, the dilation operation on the connected region of the non-text element uses a small kernel of size (mf, mf), where mf is a value related to the image resolution, $mf = \min(h, w)/200$, and h and w denote the height and width of the document image, respectively. In general, mf takes values in the range [3, 10].
After the dilation operation is performed on the connected region of the non-text element, the number of pixels of the minimum bounding rectangle corresponding to the dilated connected region of the non-text element, i.e., the second pixel number, is acquired.
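The kernel size rule can be sketched in Python as follows. The text gives only the formula and the typical range, so rounding to the nearest integer and clamping to [3, 10] are assumptions of this sketch, and the function name is hypothetical.

```python
def dilation_kernel_size(height, width, lo=3, hi=10):
    """Return the (mf, mf) kernel size for an image of the given size.

    mf = min(height, width) / 200, rounded and clamped to [lo, hi];
    the rounding and clamping are assumptions, not stated in the text.
    """
    mf = round(min(height, width) / 200)
    mf = max(lo, min(hi, mf))
    return (mf, mf)
```

For a 1000 x 800 image this gives a 4 x 4 kernel, while very small or very large images fall back to the ends of the range.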
Step 203, if the first number of pixels is larger than the second number of pixels, determining a connected region overlapped with the connected region of the non-text element as the connected region of the text element.
Correspondingly, if the first pixel number is less than or equal to the second pixel number, the connected region overlapped with the connected region of the non-text element is determined as the connected region of the non-text element.
The above process may be regarded as a noise elimination process, and a small connected region near a connected region of a non-text element may be correctly determined as a connected region of the non-text element, and other connected regions that do not satisfy the above condition may be determined as connected regions of a text element, thereby further improving the accuracy of detecting a text region of a document image.
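Steps 201 to 203 can be illustrated with the Python sketch below. It works on bounding rectangles `(xl, yt, xr, yb)` only and makes two assumptions: the "pixel number" of a minimum bounding rectangle is taken to be its area, and dilating a region with an (mf, mf) kernel is approximated as growing its bounding rectangle by mf // 2 on every side, ignoring clipping at the image border. All function names are hypothetical.

```python
def rect_area(box):
    """Area in pixels of a bounding rectangle (xl, yt, xr, yb)."""
    xl, yt, xr, yb = box
    return (xr - xl) * (yb - yt)


def dilated_rect_area(box, mf):
    """Approximate bounding-rectangle area after dilation with an (mf, mf) kernel."""
    r = mf // 2
    xl, yt, xr, yb = box
    return (xr - xl + 2 * r) * (yb - yt + 2 * r)


def reclassify_neighbor(neighbor_box, non_text_box, mf):
    """Decide whether a region near a non-text element is text or noise."""
    first_pixel_number = rect_area(neighbor_box)
    second_pixel_number = dilated_rect_area(non_text_box, mf)
    return "text" if first_pixel_number > second_pixel_number else "non-text"
```

With a 100 x 20 non-text region and mf = 4, a 5 x 5 speck next to it is absorbed as non-text, whereas a 200 x 30 text line is kept as text.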
The method for separating the regions of the document image provided by the embodiment can be used for further identifying and filtering a noise region with smaller size near the non-text elements after the non-text elements are identified and filtered on the document image, so as to obtain a more accurate text region.
Fig. 9 is a functional structure diagram of an area separating apparatus for document images according to an embodiment of the present invention. As shown in fig. 9, the present embodiment provides a region separating apparatus 300 for document images, comprising:
the image preprocessing module 301 is configured to preprocess the document image to obtain a binarized image;
a connected region analysis module 302, configured to perform connected region analysis on the binarized image to obtain a first connected region set;
an obtaining module 303, configured to obtain attribute features of each connected region in the first connected region set;
a connected region dividing module 304, configured to divide the first connected region set into a second connected region set and a third connected region set according to an attribute feature of each connected region in the first connected region set; the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements;
the obtaining module 303 is further configured to obtain attribute features and adjacent position information of each connected region in the third connected region set;
the connected region dividing module 304 is further configured to determine a connected region of a text element from the third connected region set according to the attribute feature and the adjacent position information of each connected region in the third connected region set.
Optionally, the attribute characteristics of each of the connected regions include the number of pixels, the pixel density, the aspect ratio of each of the connected regions, and a first number of other connected regions nested in a minimum bounding rectangle of each of the connected regions.
Optionally, the connected region dividing module 304 is specifically configured to:
and judging whether the attribute characteristics of each connected region in the first connected region set meet a first non-text element condition, and dividing the connected regions of which the attribute characteristics meet the first non-text element condition into a second connected region set.
Optionally, the first non-text element condition includes at least one of the following conditions:
the number of pixels of the connected area is less than the preset number of pixels;
the pixel density of the connected region is less than the preset pixel density;
the aspect ratio of the connected region is smaller than the preset aspect ratio;
the first number of other connected regions nested within the smallest circumscribed rectangle of connected regions is greater than a first value.
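The first non-text element condition can be sketched in Python as a simple threshold test. The preset values are not given in this passage, so they are passed in as keyword parameters; the function name, the parameter names, and the dictionary layout of the attributes are hypothetical.

```python
def meets_first_condition(attrs, nested, *, min_pixels, min_density,
                          min_aspect_ratio, max_nested):
    """First non-text element condition: at least one threshold is violated.

    attrs: dict with "num_pixels", "density" and "aspect_ratio".
    nested: number of other regions nested in the minimum bounding rectangle.
    """
    return (
        attrs["num_pixels"] < min_pixels
        or attrs["density"] < min_density
        or attrs["aspect_ratio"] < min_aspect_ratio
        or nested > max_nested
    )
```

Regions for which this test returns true would go to the second connected region set, and the remaining regions to the third connected region set.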
Optionally, the connected component analyzing module 302 is further configured to:
performing blank area analysis on each connected area in the third connected area set to determine adjacent connected areas corresponding to each connected area in the third connected area set;
the obtaining module 303 is specifically configured to obtain adjacent position information corresponding to each connected region in the third connected region set, where the adjacent position information includes a distance between each connected region and an adjacent connected region corresponding to each connected region, and a left connected region number and a right connected region number of each connected region.
Optionally, the connected region dividing module 304 is specifically configured to:
judging whether the attribute characteristics of each connected region in the third connected region set meet a second non-text element condition or not;
judging whether the adjacent position information of each connected region in the third connected region set meets a third non-text element condition;
determining connected regions in the third connected region set, wherein the connected regions do not satisfy the second non-text element condition, and the connected regions which satisfy the second non-text element condition but do not satisfy the third non-text element condition, as connected regions of text elements.
Optionally, the second non-text element condition includes:
the first condition and the second condition; or
the first condition and the third condition; or
the first condition, the second condition, and the third condition;
wherein the first condition is $A_i = \max(\Omega_1) \cap A_i > k_1 \times \mathrm{median}(\Omega_1)$; the second condition is $H_i = \max(\Omega_2) \cap H_i > k_2 \times \mathrm{median}(\Omega_2)$; the third condition is $W_i = \max(\Omega_3) \cap W_i > k_3 \times \mathrm{median}(\Omega_3)$;
where:
Figure BDA0002243441410000171
$\Omega_1$ denotes the set of pixel counts of the connected regions in the third connected region set, $\Omega_2$ denotes the set of heights of the connected regions in the third connected region set, $\Omega_3$ denotes the set of widths of the connected regions in the third connected region set, mean denotes the mean, median denotes the median, $A_i$ denotes the number of pixels of the i-th connected region in the third connected region set, $H_i$ denotes the height of the minimum bounding rectangle of the i-th connected region in the third connected region set, and $W_i$ denotes the width of the minimum bounding rectangle of the i-th connected region in the third connected region set.
Optionally, the third non-text element condition includes:
the distance between each connected region and its corresponding adjacent connected region is greater than or equal to a preset distance; and/or
And the second number of adjacent connected regions corresponding to each connected region is greater than or equal to a second value, and the second number is the maximum value of the left connected region number and the right connected region number.
Optionally, the obtaining module 303 is further configured to obtain a first pixel number of a minimum bounding rectangle of a connected region overlapped with the connected region of the non-text element;
the obtaining module 303 is further configured to obtain a second number of pixels after performing dilation operation on the connected region of the non-text element;
the connected region dividing module 304 is further configured to: and if the first pixel quantity is larger than the second pixel quantity, determining a connected region overlapped with the connected region of the non-text element as the connected region of the text element.
The area separation apparatus for document images provided in this embodiment may implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 10 is a schematic diagram of a hardware structure of a region separating apparatus for document images according to an embodiment of the present invention. As shown in fig. 10, the present embodiment provides a region separating apparatus 400 for document images, comprising:
a memory 401;
a processor 402; and
a computer program;
wherein the computer program is stored in the memory 401 and configured to be executed by the processor 402 to implement the technical solution of any one of the foregoing method embodiments, and the implementation principle and technical effect thereof are similar, and are not described herein again.
Optionally, the memory 401 may be separate or integrated with the processor 402.
When the memory 401 is a device independent of the processor 402, the region separating apparatus 400 for document images further includes:
a bus 403 for connecting the memory 401 and the processor 402.
Embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor 402 to implement the steps performed by the area separating apparatus 400 for document images in the above embodiments.
It should be understood that the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in the region separating apparatus for document images.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (12)

1. A method for separating regions of a document image, comprising:
preprocessing a document image to obtain a binary image;
analyzing a connected region of the binarized image to obtain a first connected region set;
acquiring the attribute characteristics of each connected region in the first connected region set, and dividing the first connected region set into a second connected region set and a third connected region set according to those attribute characteristics; the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements;
and acquiring the attribute characteristics and the adjacent position information of each connected region in the third connected region set, and determining the connected regions of the text elements from the third connected region set according to those attribute characteristics and that adjacent position information.
2. The method of claim 1, wherein the attribute features of each of the connected regions include a number of pixels, a pixel density, an aspect ratio of each of the connected regions, and a first number of other connected regions nested in a smallest bounding rectangle of each of the connected regions.
3. The method according to claim 1, wherein the dividing the first set of connected regions into a second set of connected regions and a third set of connected regions according to the attribute characteristics of each connected region in the first set of connected regions comprises:
and judging whether the attribute characteristics of each connected region in the first connected region set meet a first non-text element condition, and dividing the connected regions of which the attribute characteristics meet the first non-text element condition into a second connected region set.
4. The method of claim 3, wherein the first non-text element condition comprises at least one of:
the number of pixels of the connected area is less than the preset number of pixels;
the pixel density of the connected region is less than the preset pixel density;
the aspect ratio of the connected region is smaller than the preset aspect ratio;
the first number of other connected regions nested within the smallest circumscribed rectangle of connected regions is greater than a first value.
5. The method of claim 1, wherein obtaining neighbor location information for each connected region in the third set of connected regions comprises:
performing blank area analysis on each connected area in the third connected area set to determine adjacent connected areas corresponding to each connected area in the third connected area set;
and acquiring adjacent position information corresponding to each connected region in the third connected region set, wherein the adjacent position information comprises the distance between each connected region and the adjacent connected region corresponding to each connected region, and the number of left connected regions and the number of right connected regions of each connected region.
6. The method according to claim 1, wherein the determining the connected regions of the text element from the third set of connected regions according to the attribute characteristics and the adjacent position information of each connected region in the third set of connected regions comprises:
judging whether the attribute characteristics of each connected region in the third connected region set meet a second non-text element condition or not;
judging whether the adjacent position information of each connected region in the third connected region set meets a third non-text element condition;
determining connected regions in the third connected region set, wherein the connected regions do not satisfy the second non-text element condition, and the connected regions which satisfy the second non-text element condition but do not satisfy the third non-text element condition, as connected regions of text elements.
7. The method of claim 6, wherein the second non-text element condition comprises:
the first condition and the second condition; or
the first condition and the third condition; or
the first condition, the second condition, and the third condition;
wherein the first condition is $A_i = \max(\Omega_1) \cap A_i > k_1 \times \mathrm{median}(\Omega_1)$; the second condition is $H_i = \max(\Omega_2) \cap H_i > k_2 \times \mathrm{median}(\Omega_2)$; the third condition is $W_i = \max(\Omega_3) \cap W_i > k_3 \times \mathrm{median}(\Omega_3)$;
where:
Figure FDA0002243441400000021
$\Omega_1$ denotes the set of pixel counts of the connected regions in the third connected region set, $\Omega_2$ denotes the set of heights of the connected regions in the third connected region set, $\Omega_3$ denotes the set of widths of the connected regions in the third connected region set, mean denotes the mean, median denotes the median, $A_i$ denotes the number of pixels of the i-th connected region in the third connected region set, $H_i$ denotes the height of the minimum bounding rectangle of the i-th connected region in the third connected region set, and $W_i$ denotes the width of the minimum bounding rectangle of the i-th connected region in the third connected region set.
8. The method of claim 6, wherein the third non-text element condition comprises:
the distance between each connected region and its corresponding adjacent connected region is greater than or equal to a preset distance; and/or
a second number of adjacent connected regions corresponding to each connected region is greater than or equal to a second value, the second number being the maximum of the number of left connected regions and the number of right connected regions.
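A minimal sketch of the third non-text element condition of claim 8 follows. The claims do not state how the distance to the adjacent connected region is measured, so it is taken as an input. The function name and the `use_distance` / `use_count` flags are hypothetical; the flags model the "and/or" wording by choosing which sub-conditions are applied.

```python
def third_non_text_condition(neighbor_distance, left_count, right_count,
                             preset_distance, second_value,
                             use_distance=True, use_count=True):
    """Check the sub-conditions of claim 8 for one connected region.

    neighbor_distance: distance to the adjacent connected region (how it is
        measured is an assumption left to the caller).
    left_count, right_count: numbers of adjacent connected regions on the
        left and on the right.
    """
    checks = []
    if use_distance:
        checks.append(neighbor_distance >= preset_distance)
    if use_count:
        # The "second number" is the larger of the left and right counts.
        checks.append(max(left_count, right_count) >= second_value)
    # With no sub-condition selected, the condition is treated as not met.
    return bool(checks) and all(checks)
```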
9. The method according to any one of claims 1 to 8, further comprising:
acquiring a first pixel number of a minimum bounding rectangle of a connected region that overlaps the connected region of the non-text element;
acquiring a second pixel number of the connected region of the non-text element after a dilation operation;
and if the first pixel number is greater than the second pixel number, determining the connected region that overlaps the connected region of the non-text element to be a connected region of a text element.
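The comparison in claim 9 can be illustrated with regions stored as sets of (row, column) pixel coordinates. This is a sketch under several assumptions: the claim's expansion operation is read as morphological dilation, the structuring element is a square of side 2 × radius + 1 (the claims do not specify one), and the dilated region is not clipped to the image bounds. All function names are hypothetical.

```python
def bounding_rect_pixel_count(pixels):
    """Number of pixels in the minimum axis-aligned bounding rectangle."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (max(rows) - min(rows) + 1) * (max(cols) - min(cols) + 1)


def dilate(pixels, radius=1):
    """Dilate a pixel set with a square structuring element (assumed shape)."""
    offsets = range(-radius, radius + 1)
    return {(r + dr, c + dc) for r, c in pixels for dr in offsets for dc in offsets}


def overlapping_region_is_text(overlapping_pixels, non_text_pixels, radius=1):
    """Apply the rule of claim 9 to one overlapping connected region."""
    first_pixel_number = bounding_rect_pixel_count(overlapping_pixels)
    second_pixel_number = len(dilate(non_text_pixels, radius))
    return first_pixel_number > second_pixel_number
```

The intuition suggested by the rule is that a region whose bounding rectangle is larger than the dilated non-text region extends well beyond it, so it is treated as text rather than as part of the non-text element.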
10. An area separating apparatus for a document image, comprising:
an image preprocessing module, configured to preprocess the document image to obtain a binary image;
a connected region analysis module, configured to perform connected region analysis on the binary image to obtain a first connected region set;
an obtaining module, configured to obtain attribute features of each connected region in the first connected region set;
a connected region dividing module, configured to divide the first connected region set into a second connected region set and a third connected region set according to the attribute features of each connected region in the first connected region set; the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements;
the obtaining module is further configured to obtain attribute features and adjacent position information of each connected region in the third connected region set;
and the connected region dividing module is further configured to determine the connected regions of the text elements from the third connected region set according to the attribute features and the adjacent position information of each connected region in the third connected region set.
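As an illustration of what the connected region analysis module and the obtaining module might compute, the sketch below labels foreground pixels of a binary image by breadth-first search and records each region's pixel count and bounding-rectangle height and width. It is a generic labeling routine, not the patented implementation. The function name, the foreground value of 1, the default 8-connectivity and the dictionary layout are all assumptions.

```python
from collections import deque


def connected_regions(binary, connectivity=8):
    """Return one dict per connected region of 1-valued pixels in `binary`.

    `binary` is a list of equal-length rows containing 0 or 1.
    """
    height = len(binary)
    width = len(binary[0]) if height else 0
    if connectivity == 8:
        offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                   if (dr, dc) != (0, 0)]
    else:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    seen = [[False] * width for _ in range(height)]
    regions = []
    for r in range(height):
        for c in range(width):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects every pixel of this region.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for dr, dc in offsets:
                    nr, nc = pr + dr, pc + dc
                    if (0 <= nr < height and 0 <= nc < width
                            and binary[nr][nc] == 1 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            rows = [p[0] for p in pixels]
            cols = [p[1] for p in pixels]
            regions.append({
                "pixels": pixels,
                "area": len(pixels),                  # number of pixels
                "height": max(rows) - min(rows) + 1,  # bounding-rectangle height
                "width": max(cols) - min(cols) + 1,   # bounding-rectangle width
            })
    return regions
```

The `area`, `height` and `width` fields correspond to the kinds of attribute features the claims refer to (pixel count and minimum bounding rectangle dimensions).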
11. An area separating apparatus for a document image, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of claims 1 to 9.
12. A computer-readable storage medium, having stored thereon a computer program for execution by a processor to perform the method of any one of claims 1 to 9.
CN201911008335.7A 2019-10-22 2019-10-22 Document image region separation method and device and storage medium Pending CN112699712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911008335.7A CN112699712A (en) 2019-10-22 2019-10-22 Document image region separation method and device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911008335.7A CN112699712A (en) 2019-10-22 2019-10-22 Document image region separation method and device and storage medium

Publications (1)

Publication Number Publication Date
CN112699712A true CN112699712A (en) 2021-04-23

Family

ID=75504921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911008335.7A Pending CN112699712A (en) 2019-10-22 2019-10-22 Document image region separation method and device and storage medium

Country Status (1)

Country Link
CN (1) CN112699712A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101520783A (en) * 2008-02-29 2009-09-02 富士通株式会社 Method and device for searching keywords based on image content
US8009928B1 (en) * 2008-01-23 2011-08-30 A9.Com, Inc. Method and system for detecting and recognizing text in images
US20140072219A1 (en) * 2012-09-08 2014-03-13 Konica Minolta Laboratory U.S.A., Inc. Document image binarization and segmentation using image phase congruency
KR101571681B1 (en) * 2014-12-29 2015-11-25 주식회사 디오텍 Method for analysing structure of document using homogeneous region
CN109460763A (en) * 2018-10-29 2019-03-12 南京大学 A kind of text area extraction method positioned based on multi-level document component with growth
CN109948598A (en) * 2019-05-15 2019-06-28 达而观信息科技(上海)有限公司 Document layout intelligent analysis method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何耘娴: "印刷体文档图像的中文字符识别", 《中国优秀硕士学位论文全文数据库 信息科技辑》, no. 2011 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination