CN112699712A

CN112699712A - Document image region separation method and device and storage medium

Info

Publication number: CN112699712A
Application number: CN201911008335.7A
Authority: CN
Inventors: 王祺尧
Original assignee: Hangzhou Hikvision Digital Technology Co Ltd
Current assignee: Hangzhou Hikvision Digital Technology Co Ltd
Priority date: 2019-10-22
Filing date: 2019-10-22
Publication date: 2021-04-23

Abstract

The invention provides a method and a device for separating areas of a document image and a storage medium. The method comprises the steps of preprocessing a document image to obtain a binary image, analyzing connected regions of the binary image to obtain a first connected region set, dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of the connected regions in the first connected region set, wherein the second connected region set comprises connected regions of non-text elements, the third connected region set comprises connected regions of text elements, and determining the connected regions of the text elements from the third connected region set further according to the attribute characteristics and adjacent position information of the connected regions in the third connected region set. The method can extract more non-text elements from the text area, and improves the accuracy of text non-text separation of the document image.

Description

Document image region separation method and device and storage medium

Technical Field

The present invention relates to the field of layout analysis technologies, and in particular, to a method and an apparatus for separating regions of a document image, and a storage medium.

Background

With the popularization of electronic reading materials such as electronic periodicals, electronic books and the like, people convert more and more traditional paper reading materials into digital images through image acquisition equipment such as image-text scanners, cameras and the like, and the digital images are shared in a network. We refer to such digital images, which primarily carry textual information such as words, pictures, tables, etc., as document images. The document image is derived from a paper file, has the characteristics of intuitive content, convenience in carrying and transmission and the like, and is widely applied to various industrial fields.

Layout analysis of a document image mainly uses a computer to automatically process and segment the layout data of the document image, and to identify the position and attributes of regions such as text, pictures, figures, and tables on the page. Current methods for separating regions of document images fall mainly into two categories. The first is pixel-by-pixel judgment: for each pixel in the image, an edge intensity value is calculated from the gray value of the pixel; a binary image corresponding to the image to be processed is then determined from the edge intensity values; a text-region judgment value is determined for each pixel from the binary image and the gray value of the pixel; and finally the text region and the non-text region of the image are determined from the judgment values of the pixels. The second is text detection based on deep learning, which uses a deep learning network to detect characters in the image, for example detection methods such as Fast R-CNN, EAST, and TextBoxes++, and then takes the detected text as the text region.

The first method computes on a single pixel and its surrounding pixel values, so it attends only to local parts of the image and cannot capture the overall structure of the image or the positional relationships between elements; its region separation therefore has low accuracy and robustness. The second method can only detect characters, and cannot distinguish characters inside pictures and tables from the paragraph text of a normal document.

Disclosure of Invention

The invention provides a method and a device for separating a region of a document image and a storage medium, which improve the accuracy of text non-text separation of the document image.

The first aspect of the present invention provides a method for separating regions of a document image, including:

preprocessing a document image to obtain a binary image;

analyzing a connected region of the binarized image to obtain a first connected region set;

acquiring the attribute characteristics of each connected region in the first connected region set, and dividing the first connected region set into a second connected region set and a third connected region set according to those attribute characteristics; the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements;

and acquiring the attribute characteristics and the adjacent position information of each connected region in the third connected region set, and determining the connected regions of the text elements from the third connected region set according to them.

Optionally, the attribute characteristics of each of the connected regions include the number of pixels, the pixel density, the aspect ratio of each of the connected regions, and a first number of other connected regions nested in a minimum bounding rectangle of each of the connected regions.

In a possible implementation manner, the dividing, according to an attribute feature of each connected region in the first connected region set, the first connected region set into a second connected region set and a third connected region set includes:

and judging whether the attribute characteristics of each connected region in the first connected region set meet a first non-text element condition, and dividing the connected regions of which the attribute characteristics meet the first non-text element condition into a second connected region set.

In one possible implementation, the first non-text element condition includes at least one of:

the number of pixels of the connected area is less than the preset number of pixels;

the pixel density of the connected region is less than the preset pixel density;

the aspect ratio of the connected region is smaller than the preset aspect ratio;

the first number of other connected regions nested within the smallest circumscribed rectangle of connected regions is greater than a first value.

In a possible implementation manner, the obtaining adjacent position information of each connected region in the third connected region set includes:

performing blank area analysis on each connected area in the third connected area set to determine adjacent connected areas corresponding to each connected area in the third connected area set;

and acquiring adjacent position information corresponding to each connected region in the third connected region set, wherein the adjacent position information comprises the distance between each connected region and the adjacent connected region corresponding to each connected region, and the number of left connected regions and the number of right connected regions of each connected region.

In a possible implementation manner, the determining a connected region of a text element from the third connected region set according to the attribute feature and the adjacent position information of each connected region in the third connected region set includes:

judging whether the attribute characteristics of each connected region in the third connected region set meet a second non-text element condition or not;

judging whether the adjacent position information of each connected region in the third connected region set meets a third non-text element condition;

determining connected regions in the third connected region set, wherein the connected regions do not satisfy the second non-text element condition, and the connected regions which satisfy the second non-text element condition but do not satisfy the third non-text element condition, as connected regions of text elements.

In one possible implementation, the second non-text element condition includes:

a first condition and a second condition; or

the first condition and a third condition; or

the first condition, the second condition, and the third condition;

wherein the first condition is $A_i = \max(\Omega_1) \cap A_i > k_1 \times \mathrm{median}(\Omega_1)$; the second condition is $H_i = \max(\Omega_2) \cap H_i > k_2 \times \mathrm{median}(\Omega_2)$; and the third condition is $W_i = \max(\Omega_3) \cap W_i > k_3 \times \mathrm{median}(\Omega_3)$;

where $\Omega_1$ denotes the set of pixel counts of the connected regions in the third connected region set, $\Omega_2$ denotes the set of heights of the connected regions in the third connected region set, $\Omega_3$ denotes the set of widths of the connected regions in the third connected region set, mean denotes taking the average value, median denotes taking the median, $A_i$ denotes the number of pixels of the $i$-th connected region in the third connected region set, $H_i$ denotes the height of the minimum bounding rectangle of the $i$-th connected region in the third connected region set, and $W_i$ denotes the width of the minimum bounding rectangle of the $i$-th connected region in the third connected region set.
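The three conditions above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes the symbol ∩ in the formulas means a logical "and", and the function names and the coefficient values used in the example (`k1 = k2 = k3 = 3`) are hypothetical, since the text above does not give values for the coefficients.

```python
from statistics import median

def exceeds_max_median(value, values, k):
    """True when `value` is the maximum of `values` and exceeds k times their median."""
    return value == max(values) and value > k * median(values)

def second_non_text_condition(i, areas, heights, widths, k1, k2, k3):
    """Evaluate the second non-text element condition for region i.

    areas, heights, widths play the roles of the sets Omega_1, Omega_2, Omega_3.
    """
    first = exceeds_max_median(areas[i], areas, k1)
    second = exceeds_max_median(heights[i], heights, k2)
    third = exceeds_max_median(widths[i], widths, k3)
    # (first and second) or (first and third) or (first and second and third)
    # reduces to: first and (second or third).
    return first and (second or third)

# Hypothetical data: three character-sized regions and one large region.
areas = [30, 32, 28, 400]
heights = [10, 11, 10, 40]
widths = [8, 9, 8, 12]
large_flag = second_non_text_condition(3, areas, heights, widths, 3, 3, 3)
small_flag = second_non_text_condition(0, areas, heights, widths, 3, 3, 3)
```

In this example only the large region (index 3) meets the condition, because it is both the largest in pixel count and in height and exceeds three times the respective medians.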

In one possible implementation, the third non-text element condition includes:

the distance between each connected region and its corresponding adjacent connected region is greater than or equal to a preset distance; and/or

the second number of adjacent connected regions corresponding to each connected region is greater than or equal to a second value, the second number being the larger of the left connected region number and the right connected region number.
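A minimal sketch of the third non-text element condition is given below. The function name, the `require_both` switch (modelling the "and/or" wording), and the threshold values in the example are all assumptions made for illustration; the text above does not specify the preset distance or the second value.

```python
def third_non_text_condition(distance, num_ln, num_rn,
                             preset_distance, second_value, require_both=False):
    """Check the distance test and the neighbour-count test.

    distance: blank distance to the adjacent connected region.
    num_ln, num_rn: left and right connected region numbers.
    require_both: True models "and", False models "or" (assumed switch).
    """
    second_number = max(num_ln, num_rn)          # larger of the two counts
    far = distance >= preset_distance
    many = second_number >= second_value
    return (far and many) if require_both else (far or many)

# Hypothetical thresholds: preset distance 20 pixels, second value 3.
isolated = third_non_text_condition(25, 1, 0, 20, 3)
ordinary = third_non_text_condition(5, 1, 2, 20, 3)
```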

In one possible implementation, the method further includes:

acquiring a first pixel number of a minimum bounding rectangle of a connected region overlapped with the connected region of the non-text element;

acquiring a second pixel number of the connected region of the non-text element after expansion operation;

and if the first pixel quantity is larger than the second pixel quantity, determining a connected region overlapped with the connected region of the non-text element as the connected region of the text element.
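The comparison above might be sketched as follows. Several points are assumptions rather than statements of the source: the dilation uses a 3x3 square structuring element, the first pixel number is taken to be the area of the minimum bounding rectangle, and the names `dilate` and `overlapped_region_is_text` are hypothetical.

```python
def dilate(mask):
    """Binary dilation with a 3x3 square structuring element (assumed choice)."""
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if mask[r][c]:
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        nr, nc = r + dr, c + dc
                        if 0 <= nr < rows and 0 <= nc < cols:
                            out[nr][nc] = 1
    return out

def overlapped_region_is_text(overlap_box, non_text_mask):
    """Compare the bounding-rectangle area with the dilated non-text pixel count."""
    xl, yl, xr, yr = overlap_box                      # inclusive pixel coordinates
    first = (xr - xl + 1) * (yr - yl + 1)             # first pixel number
    second = sum(map(sum, dilate(non_text_mask)))     # second pixel number
    return first > second

# A single non-text pixel in the middle of a 5x5 mask dilates to 9 pixels.
mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1
dilated_count = sum(map(sum, dilate(mask)))
```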

A second aspect of the present invention provides a region separating apparatus of a document image, including:

the image preprocessing module is used for preprocessing the document image to obtain a binary image;

the connected region analysis module is used for analyzing the connected regions of the binary image to obtain a first connected region set;

an obtaining module, configured to obtain attribute features of each connected region in the first connected region set;

the connected region dividing module is used for dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of each connected region in the first connected region set; the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements;

the obtaining module is further configured to obtain the attribute features and adjacent position information of each connected region in the third connected region set;

and the connected region dividing module is further used for determining the connected regions of the text elements from the third connected region set according to the attribute characteristics and the adjacent position information of each connected region in the third connected region set.

A third aspect of the present invention provides a region separating apparatus for a document image, comprising:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method according to any one of the first aspect of the invention.

A fourth aspect of the invention provides a computer readable storage medium having stored thereon a computer program for execution by a processor to perform the method according to any one of the first aspect of the invention.

The embodiment of the invention provides a method and a device for separating areas of a document image and a storage medium. The method comprises the steps of preprocessing a document image to obtain a binary image, analyzing connected regions of the binary image to obtain a first connected region set, dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of the connected regions in the first connected region set, wherein the second connected region set comprises connected regions of non-text elements, the third connected region set comprises connected regions of text elements, and determining the connected regions of the text elements from the third connected region set further according to the attribute characteristics and adjacent position information of the connected regions in the third connected region set. The method can extract more non-text elements from the text area, and improves the accuracy of text non-text separation of the document image.

Drawings

FIG. 1 is a flowchart illustrating a method for separating regions of a document image according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating a pixel adjacency in an image according to an embodiment of the present invention;

FIG. 3 is a labeled diagram of connected component analysis provided in accordance with an embodiment of the present invention;

FIG. 4 is a schematic diagram of the connected regions neighboring a given connected region according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a position relationship between two connected regions according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating a blank area analysis of a Chinese document according to an embodiment of the present invention;

FIG. 7 is a diagram illustrating a text non-text separation result of a region separation method for a document image according to an embodiment of the present invention;

FIG. 8 is a flowchart illustrating a method for separating regions of a document image according to another embodiment of the present invention;

FIG. 9 is a functional structure diagram of an apparatus for separating regions of a document image according to an embodiment of the present invention;

FIG. 10 is a schematic diagram of a hardware structure of a region separating apparatus for document images according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

The terms "comprising" and "having," and any variations thereof, in the description and claims of this invention are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.

Reference throughout this specification to "one embodiment" or "another embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in some embodiments" or "in this embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.

The method for separating the regions of the document image, provided by the embodiment of the invention, is used for processing the connected regions in the document image as a unit, separating the text and the non-text regions in the document image based on the attribute characteristics of the connected regions and the position relation between the connected regions, effectively separating the regions of various document layouts and documents in different languages, and has high separation accuracy and stable effect. In addition, the region separation method for the document image provided by the embodiment can not only distinguish the text region and the non-text region, but also further remove more non-text elements from the connected region overlapped with the connected region of the non-text elements, and finally obtain an accurate text region.

The following describes the region separation method of a document image in detail with reference to several specific embodiments, which may be combined with each other, and the description of the same or similar contents in different embodiments is not repeated.

FIG. 1 is a flowchart illustrating a method for separating regions of a document image according to an embodiment of the present invention. The method may be executed by any region separating apparatus for document images. The apparatus may be implemented in software and/or hardware, and may be, for example, a camera or an intelligent analysis server. As shown in FIG. 1, the method provided by this embodiment includes the following steps:

Step 101, preprocessing a document image to obtain a binary image.

In this embodiment, the document image may be a photographed document or a scanned document, and the embodiment is not limited thereto. Specifically, the document image includes a text region and a non-text region. The text region includes text elements such as Chinese characters, English letters, numbers, symbols, and formulas, and the non-text region includes pictures, table images, icons, and the like.

After the document image is acquired, preprocessing is carried out on the document image, and the preprocessing process comprises the steps of image binarization, rotation correction, perspective correction and the like.

The image binarization is a process of setting the gray value of a pixel point on a document image to be 0 or 255, that is, the whole image presents an obvious black-and-white effect. The binarization of the image greatly reduces the data volume in the image, and improves the speed of image processing.
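As an illustration of the binarization described above, the following minimal sketch maps each gray value to 0 or 255. It assumes a fixed global threshold; the source does not say how the threshold is chosen, so the function name `binarize` and the default `threshold=128` are hypothetical.

```python
def binarize(gray, threshold=128):
    """Set every pixel to 0 or 255 using a fixed global threshold (assumed rule)."""
    return [[255 if value >= threshold else 0 for value in row] for row in gray]

# A tiny 2x3 grayscale image given as nested lists of gray values.
gray = [
    [12, 200, 190],
    [30, 40, 220],
]
binary = binarize(gray)
```

After this step every pixel carries one of two values, which is what allows the later connected region analysis to treat the image as foreground and background.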

In planar image processing, a captured or scanned document image is prone to tilt and distortion, caused for example by the lens angle or by the scanner failing to scan horizontally and vertically along the rows and columns. Rotation correction, perspective correction, and similar operations therefore need to be performed on the document image to ensure that the document image to be processed is free of distortion such as rotation and bending, for example so that the rotation angle is less than 2 degrees.

In this embodiment, the order of steps of preprocessing the document image is not limited. The document image can be firstly subjected to binarization processing, and then the binarized image is subjected to rotation correction and perspective correction; or the document image can be firstly subjected to rotation correction and perspective correction, and then the corrected document image is subjected to binarization processing.

Step 102, performing connected region analysis on the binary image to obtain a first connected region set.

The connected region analysis is a process for extracting and labeling the connected regions in the binary image. The connected region is an image region formed by foreground pixel points which have the same pixel value and are adjacent in position in the image.

FIG. 2 is a schematic diagram of pixel adjacency relationships in an image according to an embodiment of the present invention. As shown in FIG. 2, there are two common adjacency relationships between image pixels: 4-adjacency and 8-adjacency. This embodiment uses 8-adjacency to determine whether pixels belong to the same connected region.

In this embodiment, a sketch library in Python is used to label a connected region in an image. Fig. 3 is a labeled schematic diagram of connected component analysis according to an embodiment of the present invention. As shown in fig. 3, each connected region corresponds to a circumscribed rectangle.
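The labeling step can be illustrated with a small pure-Python stand-in for a library routine. This is only a sketch under stated assumptions: foreground pixels are taken to have the value 255, regions are grown by breadth-first search over 8-adjacent pixels, and the names `label_regions` and `bounding_box` are hypothetical.

```python
from collections import deque

# 8-adjacency: the eight neighbours of a pixel. 4-adjacency would keep only the
# four offsets in which dr == 0 or dc == 0.
NEIGHBOURS_8 = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

def label_regions(binary, foreground=255):
    """Return a list of regions; each region is a list of (row, col) pixels."""
    rows, cols = len(binary), len(binary[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != foreground or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for dr, dc in NEIGHBOURS_8:
                    nr, nc = pr + dr, pc + dc
                    inside = 0 <= nr < rows and 0 <= nc < cols
                    if inside and binary[nr][nc] == foreground and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            regions.append(pixels)
    return regions

def bounding_box(pixels):
    """Axis-aligned bounding rectangle as (Xl, Yl, Xr, Yr), inclusive."""
    xs = [c for _, c in pixels]
    ys = [r for r, _ in pixels]
    return min(xs), min(ys), max(xs), max(ys)

image = [
    [255, 0,   0, 0],
    [0,   255, 0, 0],
    [0,   0,   0, 255],
]
regions = label_regions(image)
boxes = [bounding_box(p) for p in regions]
```

The two diagonally touching pixels form one region under 8-adjacency; under 4-adjacency they would be two separate regions.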

Let the $i$-th connected region be denoted $CC_i$. A connected region has the following properties:

(1) $B(CC_i)$ denotes the bounding rectangle of $CC_i$, whose sides are parallel to the coordinate axes. $(Xl_i, Yl_i)$ and $(Xr_i, Yr_i)$ are the coordinates of the upper-left and lower-right corners of the rectangle, respectively.

(2) $A_i$ denotes the number of pixels in $CC_i$.

(3) $A_i^B$ denotes the size of $B(CC_i)$; $W_i$ and $H_i$ denote the width and height of $B(CC_i)$, respectively.

(4) $\lambda_i$ denotes the density of $CC_i$, calculated as $\lambda_i = A_i / A_i^B$.

(5) $Inc_i$ denotes the number of other rectangles $B(CC_j)$, $j \neq i$, nested in $B(CC_i)$ (note: before computing $Inc_i$, the rectangles are sorted by their $Yl$ component, which can greatly reduce the computational complexity).

(6) $HW_i^{rate}$ denotes the aspect ratio of $CC_i$, calculated as $HW_i^{rate} = \min(W_i, H_i) / \max(W_i, H_i)$.
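The properties listed above can be computed from bounding rectangles and pixel counts as in the sketch below. It assumes inclusive pixel coordinates (so the width is Xr - Xl + 1), takes the size of the bounding rectangle to be its area, and uses the sort by Yl only to stop the nesting scan early; the function and key names are hypothetical.

```python
def region_features(boxes, pixel_counts):
    """Compute A^B, density, aspect ratio and nested-box count for each region.

    boxes: list of (Xl, Yl, Xr, Yr) with inclusive pixel coordinates.
    pixel_counts: list of A_i values, one per box.
    """
    # Sort indices by Yl: a box that starts below the bottom edge of box i
    # cannot be nested inside it, so the scan can stop there.
    order = sorted(range(len(boxes)), key=lambda idx: boxes[idx][1])
    features = []
    for i, (xl, yl, xr, yr) in enumerate(boxes):
        w, h = xr - xl + 1, yr - yl + 1
        area_box = w * h                          # A_i^B
        density = pixel_counts[i] / area_box      # lambda_i
        hw_rate = min(w, h) / max(w, h)           # HW_i^rate
        inc = 0                                   # Inc_i
        for j in order:
            jxl, jyl, jxr, jyr = boxes[j]
            if jyl > yr:
                break
            if j != i and jxl >= xl and jyl >= yl and jxr <= xr and jyr <= yr:
                inc += 1
        features.append({"A": pixel_counts[i], "A_B": area_box, "W": w, "H": h,
                         "density": density, "hw_rate": hw_rate, "inc": inc})
    return features

# Hypothetical boxes: one frame containing two small boxes, plus a tall box.
boxes = [(0, 0, 9, 4), (1, 1, 2, 2), (5, 1, 7, 3), (20, 0, 21, 9)]
counts = [20, 4, 6, 10]
feats = region_features(boxes, counts)
```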

It should be noted that, a connected region analysis is performed on the binarized image to obtain a first connected region set, where the first connected region set includes a plurality of connected regions including text regions and non-text regions, and the plurality of connected regions in the first connected region set need to be identified and divided according to steps 103 to 106, so as to obtain an accurate text region finally, and a specific implementation process is as follows.

Step 103, acquiring the attribute characteristics of each connected region in the first connected region set.

Referring to step 102, the attribute characteristics of each connected region in the first set of connected regions include the number of pixels, the pixel density, the aspect ratio of each connected region, and the first number of other connected regions nested in the minimum bounding rectangle of each connected region.

Step 104, dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of each connected region in the first connected region set.

And the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements.

Specifically, whether the attribute characteristics of each connected region in the first connected region set meet the first non-text element condition or not is judged, and the connected regions of which the attribute characteristics meet the first non-text element condition are divided into the second connected region set. The first non-text element condition is used to perform a preliminary filtering of non-text elements in the document image. If the connected region satisfies the first non-text element condition, then the connected region is considered to be a connected region of the non-text element.

Wherein the first non-text element condition comprises at least one of:

(1) The number of pixels of the connected region is less than the preset number of pixels. This can be expressed as $A_i < T^{area}$, where $T^{area}$ is the preset number of pixels, usually set to 5-7 pixels, e.g. $T^{area} = 6$ pixels; that is, a connected region with a very small number of pixels is treated as non-text.

(2) The pixel density of the connected region is less than the preset pixel density. This can be expressed as $\lambda_i < T^{dens}$, where $T^{dens}$ is the preset pixel density, usually set in $[0.05, 0.07]$, e.g. $T^{dens} = 0.06$. A connected region whose pixel density is too low may be noise, a diagonal element, a rectangular frame, and so on.

(3) The aspect ratio of the connected region is less than the preset aspect ratio.

Specifically, when $H_i < W_i$ (the height of the connected region is less than its width), the condition is $HW_i^{rate} < T_1^{rate}$, where $T_1^{rate}$ is the first preset aspect ratio (the preset aspect ratio for horizontal filtering), usually set in $[0.05, 0.07]$, e.g. $T_1^{rate} = 0.06$. In general, the width of a text connected region cannot be larger than about 16.66 times its height (that is, $1/0.06$).

When $H_i > W_i$ (the height of the connected region is greater than its width), the condition is $HW_i^{rate} < T_2^{rate}$, where $T_2^{rate}$ is the second preset aspect ratio (the preset aspect ratio for vertical filtering), usually set in $[0.02, 0.04]$, e.g. $T_2^{rate} = 0.03$.

Typically, the preset aspect ratio for vertical filtering is stricter than that for horizontal filtering.

(4) The first number of other connected regions nested in the minimum bounding rectangle of the connected region is greater than a first value. This can be expressed as $Inc_i > T^{inside}$, where $T^{inside}$ is the first value, usually set to 4; that is, a connected region whose minimum bounding rectangle has more than 4 other connected regions nested in it is considered non-text. This condition applies to both Latin and Chinese documents.

It should be noted that the preset parameters (including the preset number of pixels, the preset pixel density, the preset aspect ratio, and the first value) are empirical values obtained through a large number of calculations and verifications, and are suitable for documents of various types and resolutions.
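The preliminary filter of step 104 might be sketched as follows. The threshold constants are the example values quoted in the text; the function name is hypothetical, and the sketch assumes that all four sub-conditions are enabled and combined with a logical "or" (the text only requires at least one of them to be part of the condition).

```python
T_AREA = 6       # preset number of pixels (example value from the text)
T_DENS = 0.06    # preset pixel density (example value from the text)
T_RATE_1 = 0.06  # preset aspect ratio when H < W (example value from the text)
T_RATE_2 = 0.03  # preset aspect ratio when H > W (example value from the text)
T_INSIDE = 4     # first value (example value from the text)

def is_non_text_first_pass(a, w, h, inc):
    """Return True if the region meets any of the four first-pass conditions.

    a: pixel count A_i; w, h: bounding-rectangle width and height;
    inc: number of other bounding rectangles nested inside this one.
    """
    density = a / (w * h)
    hw_rate = min(w, h) / max(w, h)
    if a < T_AREA:                       # (1) too few pixels
        return True
    if density < T_DENS:                 # (2) too sparse
        return True
    if h < w and hw_rate < T_RATE_1:     # (3) too elongated horizontally
        return True
    if h > w and hw_rate < T_RATE_2:     # (3) too elongated vertically
        return True
    return inc > T_INSIDE                # (4) too many nested regions

# Hypothetical regions: a speck, a character-like blob, and a long rule line.
speck = is_non_text_first_pass(3, 2, 2, 0)
glyph = is_non_text_first_pass(40, 10, 8, 0)
rule_line = is_non_text_first_pass(200, 100, 4, 0)
```

Regions for which the function returns True would go into the second connected region set, and the rest into the third.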

The purpose of this step is to eliminate the obvious noise or non-text elements in the binary image and obtain the preliminary separation result. It should be noted that, in addition to the connected regions of the text elements, the third connected region set also includes connected regions of non-text elements identified as text elements, that is, in the above process, there may be a case where the non-text elements are misjudged as text elements. In order to improve the accuracy of text non-text separation, step 105 and step 106 need to be further executed to further separate out non-text elements in the text area, which is described in the following.

Step 105, acquiring the attribute characteristics and the adjacent position information of each connected region in the third connected region set.

Referring to step 102, the attribute characteristics of each connected region in the third set of connected regions include the number of pixels, the pixel density, the aspect ratio of each connected region, and the first number of other connected regions nested in the minimum bounding rectangle of each connected region.

In this step, obtaining the adjacent position information of each connected region in the third connected region set includes the following steps: performing blank area analysis on each connected region in the third connected region set to determine the adjacent connected regions corresponding to each connected region in the third connected region set; and acquiring the adjacent position information corresponding to each connected region in the third connected region set.

The adjacent position information comprises the distance between each connected region and the adjacent connected region corresponding to each connected region, the left connected region number and the right connected region number of each connected region.

For ease of understanding, the blank area analysis will be described below with reference to fig. 4 and 5.

FIG. 4 is a schematic diagram of the connected regions neighboring a given connected region according to an embodiment of the present invention. Referring to FIG. 4, in the blank area analysis described above, the following variables need to be calculated for a connected region $CC_i$:

(1) $LNN_i$ and $RNN_i$: $LNN_i$ denotes the left connected region of $CC_i$ (i.e., the left neighbor of $CC_i$), and $RNN_i$ denotes the right connected region of $CC_i$ (i.e., the right neighbor of $CC_i$).

In addition, $CC_i$ may have one or more left connected regions or right connected regions. When there are several left or right connected regions, $LNN_i$ specifically denotes the left connected region closest to $CC_i$, and $RNN_i$ specifically denotes the right connected region closest to $CC_i$.

For example, in FIG. 4, $CC_3$ has $LNN_3 = \{CC_2\}$ and $RNN_3 = \{CC_4\}$; $CC_4$ has $LNN_4 = \{CC_3\}$.

Note that the left and right connected regions of $CC_i$ are required to lie in the same row direction as $CC_i$.

(2) $LNWS_i$, $RNWS_i$: $LNWS_i$ denotes the distance (blank size) between $CC_i$ and its left connected region, and $RNWS_i$ denotes the distance (blank size) between $CC_i$ and its right connected region. Let the left neighbor of $CC_i$ be $LNN_i = \{CC_j\}$ and the right neighbor be $RNN_i = \{CC_k\}$; then $LNWS_i$ and $RNWS_i$ are calculated as follows:

$LNWS_i = Xl_i - Xr_j$

$RNWS_i = Xl_k - Xr_i$

(3) $LN_i$, $RN_i$: $LN_i$ denotes the set of other connected regions for which $CC_i$ is a right connected region; $RN_i$ denotes the set of other connected regions for which $CC_i$ is a left connected region.

For example, in FIG. 4, $CC_3$ has $LN_3 = \{CC_1, CC_2\}$ and $RN_3 = \{CC_4, CC_5\}$.

(4) $numLN_i$, $numRN_i$: $numLN_i$ denotes the number of other connected regions for which $CC_i$ is a right connected region; $numRN_i$ denotes the number of other connected regions for which $CC_i$ is a left connected region.

$numLN_i = |LN_i|$

$numRN_i = |RN_i|$

For example, in FIG. 4, $numLN_3 = numRN_3 = 2$.

(5) $WS$: denotes the set of all blank areas.

$WS = \{RNWS_i > 0 \mid CC_i \in CC_u\}$ or $WS = \{LNWS_i > 0 \mid CC_i \in CC_u\}$

where $CC_u$ denotes all connected regions in the document image.

By performing blank area analysis on each connected region in the third connected region set, the above variables are obtained for each connected region, which yields the adjacent position information between each connected region and its adjacent connected regions.
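The variables of the blank area analysis can be sketched for a single text row as follows. The sketch assumes the bounding rectangles have already been grouped into one row and are mutually separated; grouping into rows is not shown, and the function and key names are hypothetical. Neighbors are reported as list indices rather than as sets.

```python
def blank_area_analysis(boxes):
    """Compute neighbour variables for boxes (Xl, Yl, Xr, Yr) lying in one row."""
    result = []
    for i, (xl_i, _, xr_i, _) in enumerate(boxes):
        # LN_i: regions entirely to the left; RN_i: regions entirely to the right.
        ln = [j for j, b in enumerate(boxes) if j != i and b[2] < xl_i]
        rn = [k for k, b in enumerate(boxes) if k != i and b[0] > xr_i]
        # LNN_i / RNN_i: the closest region on each side, or None at a row end.
        lnn = max(ln, key=lambda j: boxes[j][2], default=None)
        rnn = min(rn, key=lambda k: boxes[k][0], default=None)
        result.append({
            "LNN": lnn,
            "RNN": rnn,
            "LNWS": xl_i - boxes[lnn][2] if lnn is not None else None,  # Xl_i - Xr_j
            "RNWS": boxes[rnn][0] - xr_i if rnn is not None else None,  # Xl_k - Xr_i
            "numLN": len(ln),
            "numRN": len(rn),
        })
    return result

# Five hypothetical boxes in one row, playing the roles of CC_1 .. CC_5.
row = [(0, 0, 4, 9), (8, 0, 12, 9), (15, 0, 19, 9), (22, 0, 26, 9), (40, 0, 44, 9)]
info = blank_area_analysis(row)
# WS built from the positive right-hand blanks.
ws = [d["RNWS"] for d in info if d["RNWS"] is not None and d["RNWS"] > 0]
```

For the middle box (index 2, the analogue of the third region) the sketch gives two regions on each side, matching the counts quoted for FIG. 4.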

The blank area analysis described above considers only the case in which the connected regions in the document image are separated from each other, as shown in FIG. 4. However, connected regions may also overlap in a document image. FIG. 5 is a schematic diagram of the positional relationships between two connected regions according to an embodiment of the present invention. As shown in FIG. 5, two connected regions may overlap each other.

Specifically, for an English document, the overlap case of FIG. 5(b) may occur, and for a Chinese document, the overlap cases of FIG. 5(b), (c), and (d) may occur. It should be noted that the case of FIG. 5(e) occurs in neither English nor Chinese documents, because in that case $CC_i$ and $CC_j$ would necessarily be connected into a single connected region.

Illustratively, the italic characters in Chinese and English have the overlapping condition of FIG. 5(b), the Chinese character "painting" has the overlapping condition of FIG. 5(c), and the Chinese character "Hui" has the overlapping condition of FIG. 5 (d).

For the overlap case of FIG. 5(b), the left-right positional relationship between CC_i and CC_j is determined: in FIG. 5(b), CC_i is the left connected region of CC_j, CC_j is the right connected region of CC_i, and the distance between them is set to 0. For the overlap case of FIG. 5(c), CC_i and CC_j are each other's left/right connected regions, and the distance between them is set to 0. For the overlap case of FIG. 5(d), nesting suppression is performed: CC_j has no left/right connected region, and the distance between CC_i and CC_j is set to 0.
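One way to picture the zero-distance rule is with bounding boxes. The sketch below is an illustration under stated assumptions, not the procedure of the embodiment: each connected region is represented by a hypothetical `(x0, y0, x1, y1)` bounding box with inclusive pixel coordinates, any horizontal overlap is treated as distance 0, and the helper `is_nested` stands in for detecting the nesting of FIG. 5(d).

```python
def horizontal_gap(box_a, box_b):
    """Blank width between two boxes; 0 when their x-ranges overlap or touch."""
    left, right = sorted((box_a, box_b), key=lambda b: b[0])
    # Count the blank columns strictly between the two boxes.
    return max(0, right[0] - left[2] - 1)

def is_nested(inner, outer):
    """True when `inner` lies entirely inside `outer` (nesting as in FIG. 5(d))."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

separated = horizontal_gap((0, 0, 9, 9), (15, 0, 24, 9))   # columns 10..14 are blank
overlapping = horizontal_gap((0, 0, 9, 9), (7, 0, 16, 9))  # x-ranges overlap
nested = is_nested((3, 3, 6, 6), (0, 0, 9, 9))
print(separated, overlapping, nested)  # prints "5 0 True"
```

Clamping the gap at 0 means that overlapping or nested regions never produce a negative blank width.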

It should be noted that, when determining the adjacent connected regions of a connected region, this embodiment considers not only connected regions that are separated from each other but also connected regions that overlap left and right, overlap up and down, or are nested. This provides a more accurate basis for the subsequent condition checks and prevents English letters or Chinese characters in Chinese and English documents from being erroneously determined as non-text elements.

Step 106, determining the connected regions of text elements from the third connected region set according to the attribute features and the adjacent position information of each connected region in the third connected region set.

In this embodiment, the connected regions in the third set of connected regions may include non-text elements that were not identified in step 104, such as small icons, noise, etc. near the text elements. To further improve the accuracy of text non-text separation, the following process may be used to find and separate non-text elements near the text region:

judging whether the attribute characteristics of each connected region in the third connected region set meet a second non-text element condition or not; judging whether the adjacent position information of each connected region in the third connected region set meets a third non-text element condition; and determining the connected regions which do not satisfy the second non-text element condition and the connected regions which satisfy the second non-text element condition but do not satisfy the third non-text element condition in the third connected region set as the connected regions of the text elements.

The process can further improve the accuracy of text non-text separation.

The second non-text element condition comprises the first condition and the second condition; or the first condition and the third condition; or the first condition, the second condition, and the third condition.

The first condition may be denoted as A_i = max(Ω₁) ∩ A_i > k₁ × median(Ω₁);

The second condition may be denoted as H_i = max(Ω₂) ∩ H_i > k₂ × median(Ω₂);

The third condition may be denoted as W_i = max(Ω₃) ∩ W_i > k₃ × median(Ω₃).

In the formulas,

Ω₁ denotes the set of pixel counts of the connected regions in the third connected region set;

Ω₂ denotes the set of heights of the connected regions in the third connected region set;

Ω₃ denotes the set of widths of the connected regions in the third connected region set;

max denotes taking the maximum value, and median denotes taking the median;

A_i denotes the number of pixels of CC_i (the i-th connected region) in the third connected region set;

H_i denotes the height of the minimum bounding rectangle of CC_i in the third connected region set;

W_i denotes the width of the minimum bounding rectangle of CC_i in the third connected region set.
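The three conditions and their permitted combinations can be checked as in the following Python sketch. It is only an illustration: the function names are hypothetical, and the values used for k₁, k₂, and k₃ in the example are assumptions, since this passage does not fix them.

```python
from statistics import median

def is_max_and_large(value, values, k):
    """value = max(values) and value > k * median(values)."""
    return value == max(values) and value > k * median(values)

def is_candidate_non_text(i, areas, heights, widths, k1, k2, k3):
    """Second non-text element condition for region i.

    areas, heights, widths play the roles of the sets of pixel counts,
    heights, and widths over the third connected region set.
    """
    first = is_max_and_large(areas[i], areas, k1)
    second = is_max_and_large(heights[i], heights, k2)
    third = is_max_and_large(widths[i], widths, k3)
    # Covers "first and second", "first and third", and all three together.
    return first and (second or third)

# Four regions; the last one is far larger and taller than the others.
areas = [40, 45, 50, 900]
heights = [10, 11, 12, 60]
widths = [8, 9, 10, 12]
big = is_candidate_non_text(3, areas, heights, widths, k1=5, k2=5, k3=5)
small = is_candidate_non_text(0, areas, heights, widths, k1=5, k2=5, k3=5)
print(big, small)  # prints "True False"
```

Because each condition requires the value to equal the maximum of its set, at most the largest region (or regions tied for largest) can qualify in a single pass.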

The third non-text element condition includes:

the distance between each connected region and its corresponding adjacent connected region is greater than or equal to a preset distance; and/or

the second number of adjacent connected regions corresponding to each connected region is greater than or equal to a second value, where the second number is the maximum of the left connected region number and the right connected region number.

The third non-text element condition described above can be expressed by the following two conditional formulas:

Conditional formula 1: min(LNWS_i, RNWS_i) ≥ 10 × max(medianWS, meanWS)

Conditional formula 2: max(numLN_i, numRN_i) ≥ 3

Conditional formula 1 can be understood as follows: if the distance between a connected region in the fourth connected region set and its adjacent connected regions is too large, the connected region is identified as a connected region of a non-text element. Conditional formula 2 can be understood as follows: if the height of a connected region in the fourth connected region set is so large that the region spans three lines of text, the connected region is identified as a connected region of a non-text element.

As long as at least one of the above two conditional formulas is satisfied, the corresponding connected region is identified as a connected region of a non-text element.

Correspondingly, the connected regions with the distance smaller than the preset distance and the second number smaller than the second value are determined as the connected regions of the text elements.
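The two conditional formulas can be evaluated as in the following Python sketch. It assumes that medianWS and meanWS are the median and the mean of the blank widths in WS; the function name and the sample numbers are hypothetical.

```python
from statistics import mean, median

def meets_third_condition(lnws_i, rnws_i, num_ln_i, num_rn_i, ws,
                          count_threshold=3):
    """Third non-text element condition for one connected region.

    lnws_i, rnws_i     -- blank widths to the left and right of the region
    num_ln_i, num_rn_i -- left and right connected region numbers
    ws                 -- all positive blank widths in the image (the set WS)
    """
    # Conditional formula 1: the region is unusually isolated.
    formula_1 = min(lnws_i, rnws_i) >= 10 * max(median(ws), mean(ws))
    # Conditional formula 2: the region borders many regions on one side.
    formula_2 = max(num_ln_i, num_rn_i) >= count_threshold
    return formula_1 or formula_2

ws = [4, 5, 5, 6, 30]  # median 5, mean 10, so the distance threshold is 100
isolated = meets_third_condition(120, 150, 1, 1, ws)
ordinary = meets_third_condition(5, 6, 1, 1, ws)
tall = meets_third_condition(5, 6, 3, 1, ws)
print(isolated, ordinary, tall)  # prints "True False True"
```

Making `count_threshold` a parameter keeps the sketch usable with a different second value than the 3 of conditional formula 2.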

Alternatively, for a Chinese document, conditional formula 2 may be changed to max(numLN_i, numRN_i) > 4 to prevent misrecognition in Chinese documents.

FIG. 6 is a schematic diagram of the blank area analysis of a Chinese document according to an embodiment of the present invention. As shown in FIG. 6, when the cases of FIGS. 5(c) and 5(d) are not considered, the "received" character in FIG. 6(a) has 5 connected regions, the "body" character has 1 connected region, and the left connected region of 4 connected regions of the "received" character is the "body" character; according to max(numLN_i, numRN_i) > 3, the "body" character would be misjudged as a connected region of a non-text element. When the cases of FIGS. 5(c) and 5(d) are considered, the connected regions inside the "received" character in FIG. 6(a) are left/right connected regions of one another, only one connected region of the "received" character has the "body" character as its left connected region, and max(numLN_i, numRN_i) > 3 is not satisfied, so the "body" character is not misjudged as a connected region of a non-text element.

When the cases of FIGS. 5(b) and 5(d) are not considered, the "asking" character in FIG. 6(b) has 4 connected regions, the "to" character has 1 connected region, and the left connected region of all 4 connected regions of the "asking" character is the "to" character; according to max(numLN_i, numRN_i) > 3, the "to" character would be misjudged as a connected region of a non-text element. When the cases of FIGS. 5(b) and 5(d) are considered, the connected regions inside the "asking" character in FIG. 6(b) are left/right connected regions of one another, only one connected region of the "asking" character has the "to" character as its left connected region, and max(numLN_i, numRN_i) > 3 is not satisfied, so the "to" character is not misjudged as a connected region of a non-text element.

In summary, the blank area analysis prevents Chinese characters from being misjudged as connected regions of non-text elements.

Optionally, in some embodiments, if a connected region in the third connected region set satisfies both the first condition and the second condition, or both the first condition and the third condition, the connected region is taken as a connected region of a candidate non-text element. If a connected region in the third connected region set satisfies none of the condition combinations of the second non-text element condition, it is determined as a connected region of a text element.

Further, the connected region of the candidate non-text element is analyzed to determine whether its adjacent position information satisfies the third non-text element condition. If the third non-text element condition is not satisfied, the connected region is determined as a connected region of a text element; if it is satisfied, the connected region is determined as a connected region of a non-text element.
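The two-stage decision described above reduces to a small truth table, sketched here in Python. The function name and the string labels are hypothetical; the two Boolean inputs stand for the outcomes of the second and third non-text element conditions.

```python
def classify_region(meets_second_condition, meets_third_condition):
    """Two-stage text / non-text decision for one connected region."""
    if not meets_second_condition:
        # Not even a candidate non-text element.
        return "text"
    # Candidate non-text element: the adjacency check makes the final call.
    return "non-text" if meets_third_condition else "text"

for second in (False, True):
    for third in (False, True):
        print(second, third, classify_region(second, third))
```

Only a region that satisfies both conditions is labeled as a non-text element; every other combination is kept as text.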

FIG. 7 is a schematic diagram of a text/non-text separation result of the region separation method for document images according to the embodiment of the present invention. As shown in FIG. 7, FIG. 7(a) is the original document image, FIG. 7(b) shows the separated connected regions of non-text elements, and FIG. 7(c) shows the text region. It can be seen that the region separation method provided in this embodiment accurately extracts more non-text elements from the text region (for example, the bar-shaped non-text elements between the text regions in FIG. 7(a)), thereby improving the accuracy of text/non-text separation of the document image.

The method for separating the regions of the document image, provided by the embodiment of the invention, includes the steps of preprocessing the document image to obtain a binary image, analyzing the connected regions of the binary image to obtain a first connected region set, and dividing the first connected region set into a second connected region set and a third connected region set according to the attribute characteristics of each connected region in the first connected region set, wherein the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements and connected regions of non-text elements identified as the text elements. And further determining the connected regions of the text elements from the third connected region set according to the attribute characteristics and the adjacent position information of each connected region in the third connected region set. The method can extract more non-text elements from the text area, and improves the accuracy of text non-text separation of the document image.

Fig. 8 is a flowchart illustrating a method for separating regions of a document image according to another embodiment of the present invention. On the basis of the embodiment shown in fig. 1, as shown in fig. 8, after step 106, the method further includes:

Step 201, obtaining a first pixel number of the minimum bounding rectangle of a connected region that overlaps a connected region of a non-text element.

The connected component of the non-text element in this embodiment includes: the connected component of the non-text element in the second set of connected components determined in step 104 of the above embodiment, and the connected component of the non-text element determined in step 106 of the above embodiment.

A connected region that overlaps a connected region of a non-text element refers to a connected region within a preset range of the connected region of the non-text element, including the cases of FIGS. 5(a), 5(b), 5(c), and 5(d). It should be noted that when the distance between the two connected regions in FIG. 5(a) is less than or equal to the preset distance, the two connected regions can be regarded as overlapping.

After a connected region overlapping the connected region of the non-text element is determined, the number of pixels of its minimum bounding rectangle, namely the first pixel number, is obtained.

Step 202, obtaining a second pixel number after a dilation operation is performed on the connected region of the non-text element.

In this embodiment, the connected region of the non-text element is dilated with a small kernel of size (mf, mf), where mf is a value related to the image resolution: mf = min(h, w)/200, where h and w denote the height and width of the document image, respectively. In general, mf takes values in the range [3, 10].
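The kernel size rule can be sketched in Python as follows. Two details are assumptions of this sketch rather than statements of the embodiment: the division is rounded down to an integer, and results outside [3, 10] are clamped to that range. The function name is hypothetical.

```python
def dilation_kernel_size(h, w, lo=3, hi=10):
    """Side length mf of the square dilation kernel for an h x w image."""
    mf = min(h, w) // 200          # assumed: integer division
    return max(lo, min(hi, mf))    # assumed: clamp to the usual range [3, 10]

print(dilation_kernel_size(2000, 1400))  # prints 7
print(dilation_kernel_size(400, 300))    # prints 3
print(dilation_kernel_size(4000, 3000))  # prints 10
```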

After the dilation operation is performed on the connected region of the non-text element, the number of pixels of the minimum bounding rectangle of the dilated connected region, namely the second pixel number, is obtained.

Step 203, if the first number of pixels is larger than the second number of pixels, determining a connected region overlapped with the connected region of the non-text element as the connected region of the text element.

Correspondingly, if the first pixel number is less than or equal to the second pixel number, the connected region overlapped with the connected region of the non-text element is determined as the connected region of the non-text element.
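The comparison in steps 201 to 203 can be sketched in Python as below. The sketch rests on several assumptions: the pixel number of a minimum bounding rectangle is taken to be its area, boxes are hypothetical `(x0, y0, x1, y1)` tuples with inclusive coordinates, and clipping at the image border is ignored. It uses the fact that dilating a region with an mf × mf kernel enlarges its bounding box by mf − 1 pixels in each dimension.

```python
def bbox_pixel_count(box):
    """Number of pixels inside an inclusive (x0, y0, x1, y1) bounding box."""
    x0, y0, x1, y1 = box
    return (x1 - x0 + 1) * (y1 - y0 + 1)

def dilated_bbox_pixel_count(box, mf):
    """Bounding-box pixel count after dilation with an mf x mf kernel."""
    x0, y0, x1, y1 = box
    # Width and height each grow by mf - 1 (border clipping ignored).
    return (x1 - x0 + mf) * (y1 - y0 + mf)

def reclassify_overlapping_region(neighbor_box, non_text_box, mf):
    """Label a region that overlaps a non-text element (step 203)."""
    first = bbox_pixel_count(neighbor_box)                # first pixel number
    second = dilated_bbox_pixel_count(non_text_box, mf)   # second pixel number
    return "text" if first > second else "non-text"

bar = (0, 0, 99, 4)  # a 100 x 5 bar-shaped non-text element
speck = reclassify_overlapping_region((100, 0, 102, 2), bar, mf=5)
block = reclassify_overlapping_region((0, 10, 199, 59), bar, mf=5)
print(speck, block)  # prints "non-text text"
```

Here the 3 × 3 speck (9 pixels) is smaller than the dilated bar (104 × 9 = 936 pixels) and is absorbed as noise, while the 200 × 50 block is kept as text.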

The above process may be regarded as a noise elimination process, and a small connected region near a connected region of a non-text element may be correctly determined as a connected region of the non-text element, and other connected regions that do not satisfy the above condition may be determined as connected regions of a text element, thereby further improving the accuracy of detecting a text region of a document image.

The method for separating the regions of the document image provided by the embodiment can be used for further identifying and filtering a noise region with smaller size near the non-text elements after the non-text elements are identified and filtered on the document image, so as to obtain a more accurate text region.

Fig. 9 is a functional structure diagram of an area separating apparatus for document images according to an embodiment of the present invention. As shown in fig. 9, the present embodiment provides a region separating apparatus 300 for document images, comprising:

the image preprocessing module 301 is configured to preprocess the document image to obtain a binarized image;

a connected region analysis module 302, configured to perform connected region analysis on the binarized image to obtain a first connected region set;

an obtaining module 303, configured to obtain attribute features of each connected region in the first connected region set;

a connected region dividing module 304, configured to divide the first connected region set into a second connected region set and a third connected region set according to an attribute feature of each connected region in the first connected region set; the second connected region set comprises connected regions of non-text elements, and the third connected region set comprises connected regions of text elements;

the obtaining module 303 is further configured to obtain attribute features and adjacent position information of each connected region in the third connected region set;

the connected region dividing module 304 is further configured to determine a connected region of a text element from the third connected region set according to the attribute feature and the adjacent position information of each connected region in the third connected region set.

Optionally, the connected region dividing module 304 is specifically configured to:

Optionally, the first non-text element condition includes at least one of the following conditions:

Optionally, the connected component analyzing module 302 is further configured to:

the obtaining module 303 is specifically configured to obtain adjacent position information corresponding to each connected region in the third connected region set, where the adjacent position information includes a distance between each connected region and an adjacent connected region corresponding to each connected region, and a left connected region number and a right connected region number of each connected region.

Optionally, the second non-text condition includes:

the first condition, the second condition, and the third condition;

wherein the first condition is A_i = max(Ω₁) ∩ A_i > k₁ × median(Ω₁); the second condition is H_i = max(Ω₂) ∩ H_i > k₂ × median(Ω₂); the third condition is W_i = max(Ω₃) ∩ W_i > k₃ × median(Ω₃);

In the formulas, the symbols have the same meanings as in the method embodiment above.

Optionally, the third non-text element condition includes:

Optionally, the obtaining module 303 is further configured to obtain a first pixel number of a minimum bounding rectangle of a connected region overlapped with the connected region of the non-text element;

the obtaining module 303 is further configured to obtain a second number of pixels after performing dilation operation on the connected region of the non-text element;

the connected region dividing module 304 is further configured to: and if the first pixel quantity is larger than the second pixel quantity, determining a connected region overlapped with the connected region of the non-text element as the connected region of the text element.

The area separation apparatus for document images provided in this embodiment may implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.

Fig. 10 is a schematic diagram of a hardware structure of a region separating apparatus for document images according to an embodiment of the present invention. As shown in fig. 10, the present embodiment provides a region separating apparatus 400 for document images, comprising:

a memory 401;

a processor 402; and

a computer program;

wherein the computer program is stored in the memory 401 and configured to be executed by the processor 402 to implement the technical solution of any one of the foregoing method embodiments, and the implementation principle and technical effect thereof are similar, and are not described herein again.

Optionally, the memory 401 may be separate or integrated with the processor 402.

When the memory 401 is a device independent of the processor 402, the region separating apparatus 400 for document images further includes:

a bus 403 for connecting the memory 401 and the processor 402.

Embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor 402 to implement the steps performed by the area separating apparatus 400 for document images in the above embodiments.

It should be understood that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), or the like. A general purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of hardware and software modules within the processor.

The memory may comprise high-speed RAM and may further comprise non-volatile memory (NVM), such as at least one magnetic disk memory; it may also be a USB flash drive, a removable hard disk, a read-only memory, a magnetic disk, or an optical disk.

The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.

The storage medium may be implemented by any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or magnetic or optical disks. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer.

An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an Application Specific Integrated Circuit (ASIC). Of course, the processor and the storage medium may also reside as discrete components in the region separating apparatus for document images.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for separating regions of a document image, comprising:

preprocessing a document image to obtain a binary image;

2. The method of claim 1, wherein the attribute features of each of the connected regions include a number of pixels, a pixel density, an aspect ratio of each of the connected regions, and a first number of other connected regions nested in a smallest bounding rectangle of each of the connected regions.

3. The method according to claim 1, wherein the dividing the first set of connected regions into a second set of connected regions and a third set of connected regions according to the attribute characteristics of each connected region in the first set of connected regions comprises:

4. The method of claim 3, wherein the first non-text element condition comprises at least one of:

5. The method of claim 1, wherein obtaining neighbor location information for each connected region in the third set of connected regions comprises:

6. The method according to claim 1, wherein the determining the connected regions of the text element from the third set of connected regions according to the attribute characteristics and the adjacent position information of each connected region in the third set of connected regions comprises:

7. The method of claim 6, wherein the second non-textual condition comprises:

the first condition, the second condition, and the third condition;

In the formulas,

8. The method of claim 6, wherein the third non-text element condition comprises:

9. The method according to any one of claims 1 to 8, further comprising:

10. An area separating apparatus for a document image, comprising:

11. An area separating apparatus for a document image, comprising:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of claims 1 to 9.

12. A computer-readable storage medium, having stored thereon a computer program for execution by a processor to perform the method of any one of claims 1 to 9.