CN106096592B - A kind of printed page analysis method of digital book - Google Patents

A kind of printed page analysis method of digital book

Publication number
CN106096592B
Authority
CN
China
Legal status
Active
Application number
CN201610584126.7A
Other languages
Chinese (zh)
Other versions
CN106096592A (en)
Inventor
鲁伟明
刘佳卉
庄越挺
吴飞
魏宝刚
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Application filed by Zhejiang University ZJU
Priority to CN201610584126.7A
Publication of CN106096592A
Application granted
Publication of CN106096592B

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING > G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition > G06V30/40 Document-oriented image-based pattern recognition > G06V30/41 Analysis of document content > G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T5/00 Image enhancement or restoration > G06T5/20 Image enhancement or restoration by the use of local operators > G06T5/30 Erosion or dilatation, e.g. thinning
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T5/00 Image enhancement or restoration > G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING > G06V10/00 Arrangements for image or video recognition or understanding > G06V10/40 Extraction of image or video features
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T2207/00 Indexing scheme for image analysis or image enhancement > G06T2207/10 Image acquisition modality > G06T2207/10004 Still image; Photographic image > G06T2207/10008 Still image; Photographic image from scanner, fax or copier
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T2207/00 Indexing scheme for image analysis or image enhancement > G06T2207/20 Special algorithmic details > G06T2207/20024 Filtering details
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T2207/00 Indexing scheme for image analysis or image enhancement > G06T2207/20 Special algorithmic details > G06T2207/20212 Image combination
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T2207/00 Indexing scheme for image analysis or image enhancement > G06T2207/30 Subject of image; Context of image processing > G06T2207/30176 Document

Abstract

The invention discloses a layout analysis (printed page analysis) method for digital books. Layout analysis based on region segmentation is essentially image segmentation and region classification applied to book pages stored in JPEG or TIF format. The invention first combines morphological operations with a linear Gabor filter to extract image edges and perform a basic merge of the over-segmented regions, thereby segmenting the regions of the book page. It then fills and sorts the segmented image blocks to reconstruct the reading order. Finally, it extracts features from the image regions and trains a classifier to obtain the attribute of each region. This separates the regions of the book page and improves both the recognition accuracy of the OCR engine and the accuracy of book search.

Description

A kind of printed page analysis method of digital book
Technical field
The present invention relates to layout analysis of pages that exist in image form in a digital library, and in particular to a layout analysis technique based on region segmentation and classification.
Background
With the research and development of computer and network technologies, digital libraries are gradually evolving from information-based processing and simple human-machine interfaces toward knowledge-based processing and broad mutual understanding between humans and machines. This lets people use computers and networks to extend their intellectual capabilities more widely, and digital libraries play an extremely important role in every field that needs to exchange, disseminate, store and use knowledge, including e-commerce, education and telemedicine.
Because the books in the CADAL digital library exist as images, they must be processed by OCR to recognize the text in them, and layout analysis must be carried out before in-depth services can be provided. Professional OCR software currently available includes Han Wang OCR, ABBYY FineReader, SimpleOCR, TopOCR and FreeOCR.
The professional OCR software used by the digital library is Han Wang OCR. It supports grayscale, black-and-white and color image files in formats such as JPG, PDF, TIF and BMP, recognizes Simplified Chinese, Traditional Chinese and English, and converts the result into output formats such as TXT, RTF, HTM and XLS. In actual scanning, however, we find that an image block embedded in a page significantly degrades the result. As shown in Fig. 1, (a) is the original image to be processed by OCR, and (b) is the text file obtained by running OCR directly, without separating text from images. The image regions of the original cannot be recognized correctly and come out as garbled characters after OCR, because the page file contains image blocks.
Therefore, to improve the processing results of OCR software and obtain more accurate text files, further layout analysis must be performed on the image file to separate text from image blocks and obtain the coordinates of the image blocks, so that they can be filtered out during OCR processing. This is what the present design sets out to accomplish.
Layout analysis is usually combined with OCR: the image must undergo layout analysis before OCR recognition, and the accuracy of the layout analysis result directly affects document recognition and restoration. The layout analysis problems found in the OCR software during experiments (see Fig. 1(b)) are:
(1) Formula recognition fails: complete formulas cannot be extracted, or formulas are extracted as text blocks.
(2) Illustrations such as flowcharts cannot be recognized, or are recognized incompletely.
(3) Because of blurred scans, text blocks are misclassified as illustration regions.
(4) Captions beside figures are not separated correctly.
Summary of the invention
The purpose of the present invention is to provide a digital book layout analysis method based on region segmentation and classification, so as to improve the processing results of OCR software and obtain more accurate text files.
The technical solution adopted by the invention to solve the technical problem is a layout analysis method for digital books, comprising the following steps:
1) Region segmentation of the book layout: region segmentation is performed on the digital book page. First, morphological operations (erosion, opening and closing, edge detection, dilation and removal) are applied to the original image to obtain an over-segmented image, which contains many region fragments. Next, a linear Gabor filter re-extracts the edges of the original image and the fragmented regions are preliminarily merged. Finally, the preliminarily merged regions are merged again using the position information of the fragmented regions;
2) Reading order reconstruction: the regions segmented in step 1) are filled, a virtual point and its distance relations to the regions are added, and an optimization problem with constraints is designed to obtain the reading order;
3) Region type classification: the features of the segmented regions are extracted and screened to form reliable feature vectors, the type attribute of each region is obtained, and the useful regions are stored;
4) The coordinates of each region are fed back to the OCR engine so that filtering can be applied during OCR processing, improving the software's processing results.
Further, the morphological operations, linear Gabor filtering and fragmented-region merging in the book layout region segmentation of step 1) are specifically:
1.1) Edge-based morphological method: after the original image is converted to grayscale, erosion, dilation, opening and closing operations are used to extract the image edges, and isolated pixels are removed to obtain an over-segmented image;
1.2) Linear Gabor filtering: the real part of the Gabor filter performs smoothing, and the imaginary part performs edge detection and initial merging.
Complex form of the Gabor function:
$$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\exp\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right)$$
Real part:
$$g_{re}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x'}{\lambda} + \psi\right)$$
Imaginary part:
$$g_{im}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\sin\left(2\pi\frac{x'}{\lambda} + \psi\right)$$
where:
$$x' = x\cos\theta + y\sin\theta$$
$$y' = -x\sin\theta + y\cos\theta$$
Meaning and configuration of the parameters in the formulas:
x, y: position of the pixel in the spatial domain;
Wavelength (λ): specified in pixels, with 2 ≤ λ ≤ (input image size)/5;
Orientation (θ): the orientation of the parallel stripes of the Gabor function, in the range 0° to 360°;
Phase offset (ψ): in the range −180° to 180°; 0° corresponds to the center-symmetric center-on function, 180° to the center-symmetric center-off function, and −90° and 90° to antisymmetric functions;
Aspect ratio (γ): the spatial aspect ratio, which determines the ellipticity of the Gabor function; for γ = 1 the shape is circular, and for γ < 1 the shape is elongated along the direction of the parallel stripes;
Bandwidth (b): the half-response spatial frequency bandwidth of the Gabor filter; b is related to the ratio σ/λ, where σ is the standard deviation of the Gaussian factor of the Gabor function:
$$b = \log_2\frac{\frac{\sigma}{\lambda}\pi + \sqrt{\frac{\ln 2}{2}}}{\frac{\sigma}{\lambda}\pi - \sqrt{\frac{\ln 2}{2}}}, \qquad \frac{\sigma}{\lambda} = \frac{1}{\pi}\sqrt{\frac{\ln 2}{2}}\cdot\frac{2^b + 1}{2^b - 1}$$
1.3) Re-merging of fragmented regions:
1.3.1) Row merging
Steps 1.1) and 1.2) yield the preliminarily merged segmented regions and the position coordinates (left, top, right, bottom) of each region. At this stage the text lines are incomplete: a single line of text is cut into several separate region blocks. The position coordinates and heights of the regions are compared, allowing for the scanning error of the image. Suppose the coordinates of region x are (left_x, top_x, right_x, bottom_x) and the coordinates of region y are (left_y, top_y, right_y, bottom_y); regions that simultaneously satisfy the following three coordinate conditions are merged into one text line:
[formula not reproduced in the source text]
By traversing the regions, the fragments belonging to the same text line are merged;
1.3.2) Overlapping-region merging
The segmented regions include many coincident or duplicated regions, which are merged according to their coordinates. Regions that simultaneously satisfy the following two coordinate conditions are defined as overlapping regions and are merged:
max(left_x, left_y) ≤ min(right_x, right_y)
max(top_x, top_y) ≤ min(bottom_x, bottom_y)
Further, the reading order reconstruction specifically comprises the following sub-steps:
2.1) Row filling of the known regions. Suppose step 1) has produced N regions X = {x_1, x_2, x_3, ..., x_N}, where x_i denotes any one region and the x_i are sorted in increasing order of their bottom value. Step 1) gives the position of each x_i. First, let height = top − bottom, and make a preliminary judgment of whether a region is an image region according to a threshold on height. Then, by judging the positional relationship and translation rules of the regions, expand the regions to obtain the expanded position (left, top, right, bottom) of x_i. Text lines must be expanded in the following two cases, where the coordinates of image region x are (left_x, top_x, right_x, bottom_x) and the coordinates of region y are (left_y, top_y, right_y, bottom_y):
2.1.1) A text region appears near an image region:
[formula not reproduced in the source text]
2.1.2) Expansion of adjacent text regions:
bottom_x < top_{x+1}
2.2) Selection of neighboring blocks: the coordinates (center_x, center_y) of the center point are calculated from the position of region x_i, and from these the distance dist_ij between two regions x_i and x_j is calculated:
[formula not reproduced in the source text]
If the positional relationship between two regions does not conform to normal reading habits, that is, if either of the following two conditions holds, then dist_ij = 1.7976931348623157E308:
[formula not reproduced in the source text]
where the error value depends on the scanning result;
2.3) The optimization problem is designed as follows:
Integer linear programming is used to construct the reading order such that the sum of the distances between all regions is minimized, expressed as:
[formula not reproduced in the source text]
This yields a single circuit that covers all regions;
2.4) Constraint design, where:
[formula not reproduced in the source text]
The constraints are designed as follows:
(1) each region can be connected to only one other region;
(2) bidirectional edges between two regions are not allowed;
(3) considering reading order, the order between two regions cannot run toward the upper left;
(4) a virtual point, a start region and an end region are constructed manually, giving the constraint:
u_i − u_j + n·x_ij ≤ n − 1, for 1 ≤ i ≠ j ≤ n
Each feasible solution contains exactly one closed region sequence covering all regions. The distance from the virtual point to the start region and to the end region is defined as 0, and the distance from the virtual point to every other region is 1.7976931348623157E308. The start region is defined as the region with the smallest bottom value, and the end region as the region with the largest bottom value.
Further, step 3) specifically comprises the following sub-steps:
3.1) The connected components in the binary image are labeled; after the label matrix L of the binary image is obtained, a series of attributes is measured for each labeled region in L;
3.2) Region texture features are computed with functions from the flow cytometer toolbox. These functions are built on the gray-level co-occurrence matrix (GLCM), which reflects comprehensive information about the image gray levels with respect to direction, adjacent interval and amplitude of variation.
3.3) Feature selection: feature values are selected for classification analysis of the regions;
3.4) A random forest algorithm assigns the region type, including body text, image, table, formula, header/footer, page number, invalid region and formula label.
5. The digital book layout analysis method according to claim 4, characterized in that in step 3.3), 29 feature values are selected for classification analysis of the regions, namely: area, maxal, minal, eccent, equivDia, solidity, extent, peri, autoc, contr, corrm, corrp, cprom, cshad, dissi, energ, entro, homom, homop, maxpr, sosvh, savgh, svarh, senth, dvarh, inf1h, inf2h, indnc, idmnc. The meaning of each feature is as follows:
[table not reproduced in the source text]
Compared with the prior art, the method of the present invention has the following advantages:
1. Morphological operations and Gabor edge filtering are combined to process the image: after the page is over-segmented by the morphological method, Gabor edge filtering re-detects the edges of the over-segmented regions, achieving a basic merge of the segmented regions;
2. Row merging of text regions and merging of overlapping regions improve the accuracy and completeness of region segmentation;
3. Text lines are filled by judging the positional relationships and translation rules of the regions; a virtual point, a start region and an end region are constructed; and an optimization problem with constraints is designed. Together these improve the correctness of reading order reconstruction and the reading experience of the reader;
4. For image classification, texture features based on the gray-level co-occurrence matrix extend the features extracted from the connected regions of the binary image, and are computed with functions from the flow cytometer toolbox. A random forest algorithm is then used to handle the classification problem. Experiments show that this method extracts image regions with relatively high accuracy and can distinguish the regions of each attribute well even for images with complex backgrounds or irregular layouts.
Brief description of the drawings
Fig. 1 shows the effect of OCR layout analysis;
Fig. 2 shows the effect of existing layout analysis;
Fig. 3 shows the modules of the method of the present invention;
Fig. 4 is the flowchart of the present invention;
Fig. 5 is a sample from the CADAL digital library;
Fig. 6 shows the result after region segmentation and reading order reconstruction;
Fig. 7 shows the region type classification results.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and specific embodiments.
The layout analysis method for digital books provided by the invention comprises the following steps:
1) Region segmentation of the book layout: region segmentation is performed on the digital book page. First, morphological operations (erosion, opening and closing, edge detection, dilation and removal) are applied to the original image to obtain an over-segmented image, which contains many region fragments. Next, a linear Gabor filter re-extracts the edges of the original image and the fragmented regions are preliminarily merged. Finally, the preliminarily merged regions are merged again using the position information of the fragmented regions;
2) Reading order reconstruction: the regions segmented in step 1) are filled, a virtual point and its distance relations to the regions are added, and an optimization problem with constraints is designed to obtain the reading order;
3) Region type classification: the features of the segmented regions are extracted and screened to form reliable feature vectors, the type attribute of each region is obtained, and the useful regions are stored;
4) The coordinates of each region are fed back to the OCR engine so that filtering can be applied during OCR processing, improving the software's processing results.
Further, the morphological operations, linear Gabor filtering and fragmented-region merging in the book layout region segmentation of step 1) are specifically:
1.1) Edge-based morphological method: after the original image is converted to grayscale, erosion, dilation, opening and closing operations are used to extract the image edges, and isolated pixels are removed to obtain an over-segmented image. The specific implementation steps are as follows:
(1) The grayscale image is thresholded to obtain a binary image.
(2) The binary image is eroded: a flat disk-shaped structuring element of radius 8 is created and the image is eroded repeatedly.
(3) The eroded image is dilated. To achieve the desired effect, an 8×8 square structuring element is created for the dilation.
(4) The image background is estimated with the morphological opening imopen and closing imclose. Opening smooths the image contour and breaks narrow connections, thereby removing fine protrusions; a disk-shaped structuring element of radius 5 is built for the opening. Closing is the counterpart of opening: it bridges narrow gaps so that the parts form a whole; the closing uses a 5×5 square structuring element.
(5) Points where the gray value changes are marked using the Sobel edge detection operator. The Sobel operator smooths noise and is effective at suppressing its influence.
The Sobel operator comprises a horizontal template and a vertical template. In practice, the following two templates are commonly used to detect image edges.
Horizontal template, which detects horizontal edges:
$$S_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix}$$
Vertical template, which detects vertical edges:
$$S_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix}$$
Gradient magnitude, where G_x and G_y are the responses of the image to S_x and S_y:
$$G = \sqrt{G_x^2 + G_y^2}$$
Gradient direction:
$$\theta = \arctan\left(\frac{G_y}{G_x}\right)$$
(6) Isolated pixels are removed. The function bwmorph is used for the removal; it applies the specified morphological operation to the image and repeats the operation n times.
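The erosion, dilation, opening and closing used in steps (2) to (4) can be sketched as follows. This is a minimal pure-Python illustration and not the MATLAB implementation referred to in the text: it assumes a binary image stored as a list of 0/1 rows and a square structuring element of side 2r + 1, and the function names are hypothetical.

```python
def _apply(img, r, rule):
    """Slide a (2r+1) x (2r+1) square window over a binary image.

    Pixels outside the image are treated as background (0); this border
    handling is an assumption of the sketch.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = 1 if rule(window) else 0
    return out


def erode(img, r):
    # A pixel survives only if every pixel under the window is foreground.
    return _apply(img, r, all)


def dilate(img, r):
    # A pixel is set if any pixel under the window is foreground.
    return _apply(img, r, any)


def opening(img, r):
    # Erosion followed by dilation: removes small isolated specks.
    return dilate(erode(img, r), r)


def closing(img, r):
    # Dilation followed by erosion: bridges narrow gaps.
    return erode(dilate(img, r), r)
```

Opening a page mask with a small radius removes specks smaller than the window, while closing joins strokes separated by a gap narrower than the window, which matches the roles the text assigns to imopen and imclose.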
1.2) Linear Gabor filtering: the Gabor filter is optimally localized in both space and frequency and has properties similar to human biological vision. It represents and describes well the local structure information corresponding to spatial frequency, spatial position and orientation selectivity. Gabor filters are also self-similar: a mother wavelet produces the Gabor filters with the different parameters needed in the experiments through dilation and rotation. In practice, Gabor filters extract relevant features at different scales and orientations in the frequency domain.
The Gabor function has two parts, real and imaginary. The real part of the Gabor filter performs smoothing, and the imaginary part performs edge detection and initial merging.
Complex form of the Gabor function:
$$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\exp\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right)$$
Real part:
$$g_{re}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x'}{\lambda} + \psi\right)$$
Imaginary part:
$$g_{im}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\sin\left(2\pi\frac{x'}{\lambda} + \psi\right)$$
where:
$$x' = x\cos\theta + y\sin\theta$$
$$y' = -x\sin\theta + y\cos\theta$$
Meaning and configuration of the parameters in the formulas:
x, y: position of the pixel in the spatial domain;
Wavelength (λ): specified in pixels, with 2 ≤ λ ≤ (input image size)/5;
Orientation (θ): the orientation of the parallel stripes of the Gabor function, in the range 0° to 360°;
Phase offset (ψ): in the range −180° to 180°; 0° corresponds to the center-symmetric center-on function, 180° to the center-symmetric center-off function, and −90° and 90° to antisymmetric functions;
Aspect ratio (γ): the spatial aspect ratio, which determines the ellipticity of the Gabor function; for γ = 1 the shape is circular, and for γ < 1 the shape is elongated along the direction of the parallel stripes;
Bandwidth (b): the half-response spatial frequency bandwidth of the Gabor filter; b is related to the ratio σ/λ, where σ is the standard deviation of the Gaussian factor of the Gabor function:
$$b = \log_2\frac{\frac{\sigma}{\lambda}\pi + \sqrt{\frac{\ln 2}{2}}}{\frac{\sigma}{\lambda}\pi - \sqrt{\frac{\ln 2}{2}}}, \qquad \frac{\sigma}{\lambda} = \frac{1}{\pi}\sqrt{\frac{\ln 2}{2}}\cdot\frac{2^b + 1}{2^b - 1}$$
The specific implementation steps are as follows:
(1) The image im is converted to type double;
(2) A cosine (even) filter and a sine (odd) filter are constructed, and the filter parameters are adjusted and tested;
(3) The filters are rotated with the imrotate function:
evenFilter = imrotate(evenFilter, angle, 'bilinear');
oddFilter = imrotate(oddFilter, angle, 'bilinear');
(4) The filters are applied to the image:
Eim = filter2(evenFilter, im);  % even filter result
Oim = filter2(oddFilter, im);   % odd filter result
Aim = sqrt(Eim.^2 + Oim.^2);    % amplitude
The amplitude image Aim is returned.
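The even/odd filtering and amplitude computation above can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the patent's code: the orientation θ is built directly into the kernel instead of rotating it with imrotate, the correlation uses zero padding and returns an output of the same size as the input, and all function names and parameter values are hypothetical.

```python
import math


def gabor_kernels(size, lam, theta, psi, sigma, gamma):
    """Return (even, odd) kernels: real and imaginary parts of the Gabor function.

    size is assumed to be odd so that the kernel has a center pixel.
    """
    half = size // 2
    even, odd = [], []
    for y in range(-half, half + 1):
        even_row, odd_row = [], []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr) / (2 * sigma * sigma))
            phase = 2 * math.pi * xr / lam + psi
            even_row.append(envelope * math.cos(phase))
            odd_row.append(envelope * math.sin(phase))
        even.append(even_row)
        odd.append(odd_row)
    return even, odd


def correlate_same(kernel, img):
    """2-D correlation with zero padding; the output has the size of img."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(img), len(img[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(kh):
                for i in range(kw):
                    yy, xx = y + j - oy, x + i - ox
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[j][i] * img[yy][xx]
            out[y][x] = acc
    return out


def gabor_amplitude(img, size, lam, theta, psi, sigma, gamma):
    """Amplitude image sqrt(E^2 + O^2) of the even and odd filter responses."""
    even, odd = gabor_kernels(size, lam, theta, psi, sigma, gamma)
    e_img = correlate_same(even, img)
    o_img = correlate_same(odd, img)
    return [[math.hypot(e, o) for e, o in zip(e_row, o_row)]
            for e_row, o_row in zip(e_img, o_img)]
```

Combining the two responses into one amplitude makes the result insensitive to whether an edge is better matched by the even or the odd filter, which is why a single amplitude image can be thresholded afterwards.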
1.3) Re-merging of fragmented regions:
1.3.1) Row merging
Steps 1.1) and 1.2) yield the preliminarily merged segmented regions and the position coordinates (left, top, right, bottom) of each region. At this stage the text lines are incomplete: a single line of text is cut into several separate region blocks. The position coordinates and heights of the regions are compared, allowing for the scanning error of the image. Suppose the coordinates of region x are (left_x, top_x, right_x, bottom_x) and the coordinates of region y are (left_y, top_y, right_y, bottom_y); regions that simultaneously satisfy the following three coordinate conditions are merged into one text line:
[formula not reproduced in the source text]
By traversing the regions, the fragments belonging to the same text line are merged;
1.3.2) Overlapping-region merging
The segmented regions include many coincident or duplicated regions, which are merged according to their coordinates. Regions that simultaneously satisfy the following two coordinate conditions are defined as overlapping regions and are merged:
max(left_x, left_y) ≤ min(right_x, right_y)
max(top_x, top_y) ≤ min(bottom_x, bottom_y)
The specific steps are as follows:
Given: two rectangles A and B. The top-left corner of rectangle A is (A.left, A.top) and its bottom-right corner is (A.right, A.bottom); the top-left corner of rectangle B is (B.left, B.top) and its bottom-right corner is (B.right, B.bottom).
Output: if rectangle A coincides with or intersects rectangle B, A and B are merged and the coordinates of the merged rectangle C are obtained.
Any point (x, y) in rectangle A, including the points on its four sides, satisfies the following inequalities:
A.left ≤ x ≤ A.right ①
A.top ≤ y ≤ A.bottom ②
Similarly, any point in B satisfies:
B.left ≤ x ≤ B.right ③
B.top ≤ y ≤ B.bottom ④
If A and B overlap, there must be a point that satisfies ①, ②, ③ and ④ simultaneously, so:
max(A.left, B.left) ≤ min(A.right, B.right)
max(A.top, B.top) ≤ min(A.bottom, B.bottom)
After merging, more complete image blocks and text blocks are obtained.
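The overlap test derived above translates directly into code. In the sketch below, rectangles are (left, top, right, bottom) tuples in image coordinates with top ≤ bottom. The text does not state how the merged rectangle C is computed, so taking the bounding box of A and B is an assumption, as are the function names.

```python
def overlaps(a, b):
    """True if rectangles a and b share at least one point (edges included)."""
    return (max(a[0], b[0]) <= min(a[2], b[2]) and
            max(a[1], b[1]) <= min(a[3], b[3]))


def merge(a, b):
    """Bounding box of a and b (assumed definition of the merged rectangle C)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_overlapping(rects):
    """Merge overlapping pairs repeatedly until no two rectangles overlap."""
    rects = list(rects)
    changed = True
    while changed:
        changed = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if overlaps(rects[i], rects[j]):
                    rects[i] = merge(rects[i], rects[j])
                    del rects[j]
                    changed = True
                    break
            if changed:
                break
    return rects
```

The loop restarts after every merge because a merged rectangle can grow to overlap a region that neither of its parts touched before.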
Further, the reading order reconstruction specifically comprises the following sub-steps:
2.1) Row filling of the known regions. Suppose step 1) has produced N regions X = {x_1, x_2, x_3, ..., x_N}, where x_i denotes any one region and the x_i are sorted in increasing order of their bottom value. Step 1) gives the position of each x_i. First, let height = top − bottom, and make a preliminary judgment of whether a region is an image region according to a threshold on height. Then, by judging the positional relationship and translation rules of the regions, expand the regions to obtain the expanded position (left, top, right, bottom) of x_i. Text lines must be expanded in the following two cases, where the coordinates of image region x are (left_x, top_x, right_x, bottom_x) and the coordinates of region y are (left_y, top_y, right_y, bottom_y):
2.1.1) A text region appears near an image region:
[formula not reproduced in the source text]
2.1.2) Expansion of adjacent text regions:
bottom_x < top_{x+1}
2.2) Selection of neighboring blocks: the coordinates (center_x, center_y) of the center point are calculated from the position of region x_i, and from these the distance dist_ij between two regions x_i and x_j is calculated:
[formula not reproduced in the source text]
If the positional relationship between two regions does not conform to normal reading habits, that is, if either of the following two conditions holds, then dist_ij = 1.7976931348623157E308:
[formula not reproduced in the source text]
where the error value depends on the scanning result;
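A possible reading of step 2.2) is sketched below. The exact distance formula and the two reading-habit conditions are not available in this text, so the sketch assumes the Euclidean distance between region centers and leaves the reading-habit check to a caller-supplied predicate; the names center, distance and violates_reading_order are hypothetical.

```python
import math

# Sentinel distance used in the text for pairs that violate reading habits
# (the largest finite double-precision value).
MAX_DIST = 1.7976931348623157e308


def center(region):
    """Center point of a (left, top, right, bottom) region."""
    left, top, right, bottom = region
    return ((left + right) / 2.0, (top + bottom) / 2.0)


def distance(ri, rj, violates_reading_order=None):
    """Distance between two regions.

    Assumption: Euclidean distance between center points. If the optional
    predicate reports that going from ri to rj contradicts normal reading
    habits, the sentinel MAX_DIST is returned instead.
    """
    if violates_reading_order is not None and violates_reading_order(ri, rj):
        return MAX_DIST
    (xi, yi), (xj, yj) = center(ri), center(rj)
    return math.hypot(xi - xj, yi - yj)
```

Using a huge finite sentinel rather than removing the edge keeps the distance matrix complete, so an optimizer can still run but will never prefer a forbidden transition.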
2.3) The optimization problem is designed as follows:
Integer linear programming is used to construct the reading order such that the sum of the distances between all regions is minimized, expressed as:
[formula not reproduced in the source text]
This yields a single circuit that covers all regions;
2.4) Constraint design, where:
[formula not reproduced in the source text]
The constraints are designed as follows:
(1) each region can be connected to only one other region;
(2) bidirectional edges between two regions are not allowed;
(3) considering reading order, the order between two regions cannot run toward the upper left;
(4) a virtual point, a start region and an end region are constructed manually, giving the constraint:
u_i − u_j + n·x_ij ≤ n − 1, for 1 ≤ i ≠ j ≤ n
Each feasible solution contains exactly one closed region sequence covering all regions. The distance from the virtual point to the start region and to the end region is defined as 0, and the distance from the virtual point to every other region is 1.7976931348623157E308. The start region is defined as the region with the smallest bottom value, and the end region as the region with the largest bottom value.
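The constraint in (4) has the form of the Miller-Tucker-Zemlin subtour-elimination constraint from the travelling salesman problem. One formulation consistent with sub-steps 2.3) and 2.4) is sketched below. It is an assumption and not the patent's own formula: x_ij is taken to be a binary variable equal to 1 when region j directly follows region i, u_i is an auxiliary ordering variable, and index 0 is assumed to denote the virtual point.

```latex
\begin{aligned}
\min\ & \sum_{i=0}^{n}\ \sum_{j=0,\ j \neq i}^{n} dist_{ij}\, x_{ij} \\
\text{s.t.}\ & \sum_{j \neq i} x_{ij} = 1 \quad \forall i,
  \qquad \sum_{i \neq j} x_{ij} = 1 \quad \forall j, \\
& x_{ij} + x_{ji} \le 1 \quad \forall i \neq j, \\
& u_i - u_j + n\, x_{ij} \le n - 1 \quad 1 \le i \neq j \le n, \\
& x_{ij} \in \{0, 1\}.
\end{aligned}
```

The degree constraints correspond to item (1), the pairwise constraint to item (2), and item (3) is enforced through the sentinel distance rather than a separate constraint. The ordering constraint ranges over the real regions only, which leaves the virtual point free to close the circuit between the end region and the start region at zero cost.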
Step 3) specifically comprises the following sub-steps:
3.1) The connected components in the binary image are labeled; after the label matrix L of the binary image is obtained, a series of attributes is measured for each labeled region in L;
3.2) Region texture features are computed with functions from the flow cytometer toolbox. These functions are built on the gray-level co-occurrence matrix (GLCM), which reflects comprehensive information about the image gray levels with respect to direction, adjacent interval and amplitude of variation.
3.3) Feature selection: 29 feature values are selected for classification analysis of the regions, namely: area, maxal, minal, eccent, equivDia, solidity, extent, peri, autoc, contr, corrm, corrp, cprom, cshad, dissi, energ, entro, homom, homop, maxpr, sosvh, savgh, svarh, senth, dvarh, inf1h, inf2h, indnc, idmnc. The meaning of each feature is as follows:
[table not reproduced in the source text]
3.4) A random forest algorithm assigns the region type, including body text, image, table, formula, header/footer, page number, invalid region and formula label.
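To make the role of the gray-level co-occurrence matrix in step 3.2) concrete, the sketch below builds a normalized, symmetric GLCM for one pixel offset and derives four common texture statistics from it. The statistics use their textbook definitions, which presumably correspond to abbreviations such as contr, energ, entro and homom in the feature list; that mapping, the function names and the choice of offset are assumptions.

```python
import math


def glcm(img, levels, dx=1, dy=0):
    """Normalized symmetric gray-level co-occurrence matrix.

    img is a list of rows of integer gray levels in [0, levels); (dx, dy)
    is the pixel offset. The image must contain at least one pixel pair
    for the given offset.
    """
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    for y in range(h):
        for x in range(w):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:
                a, b = img[y][x], img[yy][xx]
                counts[a][b] += 1
                counts[b][a] += 1  # count both directions (symmetric GLCM)
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def texture_features(p):
    """Contrast, energy, entropy and homogeneity of a normalized GLCM p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "energy": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }
```

Text regions tend to produce high contrast at small offsets because of the many stroke transitions, whereas smooth photographic regions concentrate mass near the GLCM diagonal; a classifier can exploit this difference.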
Embodiment
The specific steps of this example implementation are described in detail below with reference to method of the invention, here with the library CADAL number Word library scanning e-book in certain one page as an example, as shown in figure 5, all processes of explanatory diagram 3.
1) original image is read, gray proces are carried out to color image, read in gray scale picture
2) image edge is extracted.The square structure element for creating a 8*8 carries out expansion process to image.With form It learns opening operation imopen and closed operation imclose and estimates image background.Opening operation is to make narrow company for smoothed image profile The place of connecing disconnects realizing the deletion of tiny protrusion, the disc structure that one radius of building is 5 when opening operation.Closed operation phase When in the inverse process of opening operation, connecting the part that narrow notch makes an entirety, closed operation is to use 5*5 just Square structure element.The different point of gray value, utilizes edge detection operator Sobel in tag image.
3) linear Gabor filter filtering is carried out, the edge detection of a closer step is carried out on over-segmentation picture basis
4) cut zone of coincidence and inclusion relation is merged, utilizes the coordinate feature of each cut zone.
5) line of text that step 4) obtains is expanded, after being expanded each region coordinate (left, right, top, Bottom), (centerx, centery) is calculated, to calculate distij
Construction virtual point, the region start-, the region end-, and to above-mentioned N+3 regional structure constraint condition
ui-uj+nxij≤n-1 when1≤i!=j≤n
And solving optimization problem:
6) Extract features of the segmented regions: based on the gray-level co-occurrence matrix, extract feature values of the merged segmented regions, including feature values based on the connected components of the binary image and texture features.
7) Classify the region attributes using the feature values obtained in step 6), and store the region positions.
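
The morphological part of step 2) can be illustrated with a small sketch. It is a pure-Python stand-in for the imopen, imclose, and Sobel operations named above, not the embodiment's actual code: the function names are hypothetical, the image is a binary nested list, and only odd-sized square structuring elements are handled (the disk element of radius 5 is not reproduced).

```python
def _window(img, y, x, k):
    """Pixels covered by a k*k square centred on (y, x); k is assumed odd."""
    r = k // 2
    h, w = len(img), len(img[0])
    return [img[yy][xx]
            for yy in range(max(0, y - r), min(h, y + r + 1))
            for xx in range(max(0, x - r), min(w, x + r + 1))]

def dilate(img, k):
    """A pixel becomes 1 if any pixel under the structuring element is 1."""
    return [[int(any(_window(img, y, x, k))) for x in range(len(img[0]))]
            for y in range(len(img))]

def erode(img, k):
    """A pixel stays 1 only if every in-image pixel under the element is 1."""
    return [[int(all(_window(img, y, x, k))) for x in range(len(img[0]))]
            for y in range(len(img))]

def opening(img, k):
    """Erosion followed by dilation: removes specks smaller than the element."""
    return dilate(erode(img, k), k)

def closing(img, k):
    """Dilation followed by erosion: bridges gaps narrower than the element."""
    return erode(dilate(img, k), k)

def sobel(img):
    """Sobel gradient magnitude |gx| + |gy| on interior pixels (border left at 0)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out

# A 3*3 block plus one isolated speck; opening keeps the block and drops the speck.
page = [[0] * 7 for _ in range(7)]
for y in range(1, 4):
    for x in range(1, 4):
        page[y][x] = 1
page[5][5] = 1
opened = opening(page, 3)
```

In practice a library routine would replace these loops; the sketch only shows why opening deletes small protrusions while closing joins fragments separated by narrow gaps.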

Claims (4)

1. A page layout analysis method for digital books, characterized by comprising the following steps:
1) Region segmentation of the book layout: perform region segmentation on the digital book page. First, apply morphological operations to the original image (erosion, opening and closing, edge detection, dilation, and removal of isolated components) to obtain an over-segmented picture, which contains many region fragments. Next, use a linear Gabor filter to re-extract the edges of the original image and preliminarily merge the fragmented regions. Finally, merge the preliminarily merged regions again using the position information of the fragmented regions;
2) Reading order reconstruction: fill the regions obtained from the segmentation in step 1), add distance relations between a virtual point and the regions, design the optimization problem and the constraint conditions, and obtain the reading order. The reading order reconstruction specifically comprises the following sub-steps:
2.1) Perform line filling on the known regions. Assume that step 1) has produced N regions $X = \{x_1, x_2, x_3, \ldots, x_N\}$, where $x_i$ denotes any one region and the regions are sorted in ascending order of their bottom value after segmentation; the position of region $x_i$ is obtained from step 1). First, let height = top - bottom and preliminarily judge from a threshold on height whether a region is an image region. By judging the positional relationships and translation specifications of the regions, expand the regions to obtain the position (left, top, right, bottom) of $x_i$ after expansion. Text-line expansion is required in the following two cases, assuming that the coordinates of image region x are $(\mathit{left}_x, \mathit{top}_x, \mathit{right}_x, \mathit{bottom}_x)$ and the coordinates of region y are $(\mathit{left}_y, \mathit{top}_y, \mathit{right}_y, \mathit{bottom}_y)$:
2.1.1) A text region appears next to an image region:
2.1.2) Expansion of adjacent text regions:
$\mathit{bottom}_x < \mathit{top}_{x+1}$
2.2) Neighbor block selection: from the position of region $x_i$, compute the coordinates of its center point, and from these compute the distance between two regions $x_i$ and $x_j$:
Here, if the positional relationship between the two regions does not satisfy normal reading habits, that is, if either of the following two conditions is met, let $\mathit{dist}_{ij}$ = 1.7976931348623157E308 (the largest finite double-precision value);
where the error value depends on the scanning result;
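
A minimal sketch of step 2.2) follows. The distance formula and the two reading-habit conditions are not reproduced in this text, so the sketch makes two labeled assumptions: the distance is the Euclidean distance between region centers, and the reading-habit check is a hypothetical predicate supplied by the caller. The names `center`, `region_distance`, and `goes_backwards` are illustrative only.

```python
import math

# Penalty value from the claim; it equals the largest finite double.
MAX_DIST = 1.7976931348623157e308

def center(box):
    """Center point of a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    return ((left + right) / 2.0, (top + bottom) / 2.0)

def region_distance(a, b, violates_reading_order):
    """Distance from region a to region b, or MAX_DIST for a forbidden transition.

    Assumption: Euclidean distance between centers; the source formula is missing.
    """
    if violates_reading_order(a, b):
        return MAX_DIST
    (ax, ay), (bx, by) = center(a), center(b)
    return math.hypot(ax - bx, ay - by)

def goes_backwards(a, b, err=5):
    """Hypothetical stand-in for the missing conditions: b lies above a.

    Assumes y grows downward and that err absorbs scanning error.
    """
    return center(b)[1] < center(a)[1] - err

upper = (0, 0, 10, 10)
lower = (0, 30, 10, 40)
forward = region_distance(upper, lower, goes_backwards)
backward = region_distance(lower, upper, goes_backwards)
```

Using the largest finite double instead of infinity keeps the distance matrix numeric while still making a forbidden transition more expensive than any ordinary one.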
2.3) The optimization problem is designed as follows:
The reading order is constructed using integer linear programming so that the sum of the distances between all regions is minimized; the formula is as follows:
yielding independent circuits, each of which covers all regions;
2.4) Constraint design, where:
The constraint conditions are designed as follows:
(1) each region can be connected to only one other region;
(2) bidirectional edges between two regions are not allowed;
(3) considering the reading order, the order between two regions must not run toward the upper left;
(4) a virtual point, a start region, and an end region are constructed artificially, giving the constraint condition:
$u_i - u_j + n\,x_{ij} \le n - 1$, for $1 \le i \ne j \le n$
Each feasible solution contains exactly one closed region sequence that covers all regions. The distance from the virtual point to the start region and to the end region is defined as 0, and the distance from the virtual point to every other region is 1.7976931348623157E308. The start region is defined as the region with the smallest bottom value, and the end region as the region with the largest bottom value;
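
The optimization in steps 2.3) and 2.4) can be illustrated on a toy instance. The sketch below does not call an integer linear programming solver; as a labeled substitution, it enumerates all orderings by brute force, which gives the same optimum for a handful of regions. Because the virtual point is at distance 0 from the start and end regions, the closed circuit through the virtual point is equivalent to the cheapest start-to-end path over all regions. The function names and the example matrix are hypothetical.

```python
from itertools import permutations

BIG = 1.7976931348623157e308  # distance used for forbidden transitions

def reading_order(dist, start, end):
    """Cheapest path that visits every region once, from start to end.

    Assumes at least two regions and start != end.
    """
    n = len(dist)
    middle = [i for i in range(n) if i not in (start, end)]
    best_cost, best_path = float("inf"), None
    for perm in permutations(middle):
        path = [start, *perm, end]
        cost = sum(dist[a][b] for a, b in zip(path, path[1:]))
        if cost < best_cost:
            best_cost, best_path = cost, path
    return best_path, best_cost

def satisfies_constraint(path):
    """Check u_i - u_j + n*x_ij <= n - 1 with u_i set to the position of region i."""
    n = len(path)
    u = {region: pos for pos, region in enumerate(path)}
    edges = set(zip(path, path[1:]))
    return all(u[i] - u[j] + n * ((i, j) in edges) <= n - 1
               for i in path for j in path if i != j)

# Four regions stacked top to bottom; moving back up the page is forbidden.
dist = [[(j - i) * 10 if j > i else BIG for j in range(4)] for i in range(4)]
order, cost = reading_order(dist, start=0, end=3)
```

The check in `satisfies_constraint` shows why the constraint rules out sub-circuits: along any chosen edge the position variable must increase, so a sequence cannot return to a region it has already visited without passing through the virtual point.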
3) Region type classification: extract and screen the features of the segmented regions to form a reliable feature vector, obtain the region type attribute, and store the useful regions;
4) Feed the coordinates of each region back to the OCR engine to provide a filtering function during OCR processing and analysis, improving the software's processing results.
2. The page layout analysis method for digital books according to claim 1, characterized in that the morphological operations, linear Gabor filtering, and merging of fragmented regions in the book layout region segmentation of step 1) are specifically:
1.1) Morphological method based on image edges: after converting the original image to grayscale, extract the image edges using erosion, dilation, and opening and closing operations; after removing isolated components, the over-segmented picture is obtained;
1.2) Linear Gabor filtering: the real part of the Gabor filter performs smoothing, and the imaginary part performs edge detection and initial merging:
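Gabor complex expression (standard form):
$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \exp\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right)$
Real part:
$g_{re}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \cos\left(2\pi\frac{x'}{\lambda} + \psi\right)$
Imaginary part:
$g_{im}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right) \sin\left(2\pi\frac{x'}{\lambda} + \psi\right)$
where:
$x' = x\cos\theta + y\sin\theta$
$y' = -x\sin\theta + y\cos\theta$
Meaning of the parameters in the formulas and how they are set:
x, y: the position of the pixel in the spatial domain;
Wavelength λ: its value is in pixels, with 2 ≤ λ ≤ input image size / 5;
Orientation θ: specifies the orientation of the parallel stripes of the Gabor function; value range 0° to 360°;
Phase offset ψ: value range 0° to 180°; 0° corresponds to a center-symmetric center-on function, -90° and 90° correspond to antisymmetric functions, and 180° corresponds to a center-symmetric center-off function;
Aspect ratio γ: the spatial aspect ratio, which determines the ellipticity of the Gabor function; when γ = 1 the shape is circular, and when γ < 1 the shape is elongated along the direction of the parallel stripes;
Bandwidth b: the half-response spatial frequency bandwidth of the Gabor filter; b is related to the ratio σ/λ, where σ is the standard deviation of the Gaussian factor of the Gabor function:
$b = \log_2 \frac{\frac{\sigma}{\lambda}\pi + \sqrt{\frac{\ln 2}{2}}}{\frac{\sigma}{\lambda}\pi - \sqrt{\frac{\ln 2}{2}}}, \qquad \frac{\sigma}{\lambda} = \frac{1}{\pi}\sqrt{\frac{\ln 2}{2}} \cdot \frac{2^b + 1}{2^b - 1}$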
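
A minimal sketch of sampling the real and imaginary parts of the filter, assuming the standard form of the Gabor function with the parameters λ, θ, ψ, γ, and b described above. The function names, the odd kernel size, and the default bandwidth of one octave are assumptions made for illustration; angles are given in radians.

```python
import math

def sigma_from_bandwidth(lam, b):
    """Standard deviation of the Gaussian factor from wavelength lam and bandwidth b."""
    return lam / math.pi * math.sqrt(math.log(2) / 2) * (2 ** b + 1) / (2 ** b - 1)

def gabor_kernel(size, lam, theta, psi, gamma, b=1.0):
    """Return (real, imag) kernels as size*size nested lists; size is assumed odd."""
    sigma = sigma_from_bandwidth(lam, b)
    half = size // 2
    real, imag = [], []
    for y in range(-half, half + 1):
        row_re, row_im = [], []
        for x in range(-half, half + 1):
            # Rotate the coordinates by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope, stretched along the stripes when gamma < 1.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            phase = 2 * math.pi * xr / lam + psi
            row_re.append(envelope * math.cos(phase))  # even part: smoothing
            row_im.append(envelope * math.sin(phase))  # odd part: edge response
        real.append(row_re)
        imag.append(row_im)
    return real, imag

real, imag = gabor_kernel(size=5, lam=4.0, theta=0.0, psi=0.0, gamma=0.5)
```

With ψ = 0 the real kernel is symmetric about its center and the imaginary kernel is antisymmetric, which matches the division of labor in step 1.2): the even part averages neighboring pixels, while the odd part responds to intensity steps.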
1.3) Re-merging of fragmented regions:
1.3.1) Line merging
Steps 1.1) and 1.2) preliminarily merge the segmented regions and give the position coordinates (left, top, right, bottom) of each region. At this stage the text lines are incomplete: one line of text is cut into several separate region blocks. The position coordinates and heights of the regions are compared, allowing for image scanning error. Assuming that the coordinates of region x are $(\mathit{left}_x, \mathit{top}_x, \mathit{right}_x, \mathit{bottom}_x)$ and the coordinates of region y are $(\mathit{left}_y, \mathit{top}_y, \mathit{right}_y, \mathit{bottom}_y)$, regions that simultaneously satisfy the following three coordinate conditions are merged into one text line:
By traversing the regions, text on the same line is merged into a single line;
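
The three coordinate conditions are not reproduced in this text, so the following sketch of line merging uses hypothetical ones: roughly aligned tops, similar heights, and a small horizontal gap. The thresholds `err` and `max_gap` and both function names are assumptions, not values from the claim.

```python
def same_line(a, b, err=3, max_gap=20):
    """Hypothetical test that boxes a and b (left, top, right, bottom) share a text line."""
    height_a, height_b = abs(a[3] - a[1]), abs(b[3] - b[1])
    gap = max(a[0], b[0]) - min(a[2], b[2])  # negative when the boxes overlap in x
    return (abs(a[1] - b[1]) <= err               # tops roughly aligned
            and abs(height_a - height_b) <= err   # similar heights
            and gap <= max_gap)                   # close together horizontally

def merge_lines(boxes, err=3, max_gap=20):
    """Traverse the boxes left to right and grow a bounding box per text line."""
    lines = []
    for box in sorted(boxes):
        for k, line in enumerate(lines):
            if same_line(line, box, err, max_gap):
                lines[k] = (min(line[0], box[0]), min(line[1], box[1]),
                            max(line[2], box[2]), max(line[3], box[3]))
                break
        else:
            lines.append(box)
    return lines

fragments = [(0, 0, 40, 10), (45, 1, 90, 11), (0, 30, 50, 40)]
merged = merge_lines(fragments)
```

The tolerance `err` plays the role of the scanning error mentioned above: fragments of one line rarely have identical top coordinates on a scanned page.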
1.3.2) Overlapping region merging
The segmented regions contain many coinciding or duplicated regions, which are merged according to the coordinate features of the duplicated regions. Regions that simultaneously satisfy the following two coordinate conditions are defined as overlapping regions and are merged:
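
Since the two coordinate conditions are not reproduced here, the sketch below assumes the usual rectangle test: two boxes overlap when their extents intersect on both axes, which also covers one box containing the other. It further assumes left ≤ right and top ≤ bottom for every box; the function names are illustrative.

```python
def overlaps(a, b):
    """True when boxes (left, top, right, bottom) intersect or one contains the other."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def union(a, b):
    """Smallest box covering both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_overlapping(boxes):
    """Repeatedly merge any overlapping pair until no pair overlaps."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if overlaps(boxes[i], boxes[j]):
                    boxes[i] = union(boxes[i], boxes[j])
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes

regions = [(0, 0, 10, 10), (5, 5, 20, 20), (2, 2, 4, 4), (50, 50, 60, 60)]
cleaned = merge_overlapping(regions)
```

The scan restarts after every merge because a grown box may overlap regions it did not touch before.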
3. The page layout analysis method for digital books according to claim 1, characterized in that step 3) specifically comprises the following sub-steps:
3.1) Label the connected components in the binary image; after obtaining the label matrix of the binary image, measure a series of attributes of each labeled region in the label matrix L;
3.2) Compute the regional texture features with functions from a flow cytometry toolbox; these functions are built on the gray-level co-occurrence matrix (GLCM), which reflects aggregate information about the image gray levels with respect to direction, adjacent interval, and amplitude of variation;
3.3) Feature selection: select feature values for the classification analysis of the regions;
3.4) Use the random forest algorithm to assign the region type, including body text, image, table, formula, header/footer, page number, invalid region, and formula label.
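
The texture features of step 3.2) rest on the gray-level co-occurrence matrix. The sketch below builds a normalized GLCM for a single pixel offset and derives three common texture measures (contrast, energy, and entropy). It is a pure-Python illustration under stated assumptions, not the toolbox the claim refers to: the function names are hypothetical and the image is assumed to be already quantized to integer levels 0 to levels - 1.

```python
import math

def glcm(img, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix of gray-level pairs at pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    total = 0
    for y in range(h):
        for x in range(w):
            yy, xx = y + dy, x + dx
            if 0 <= yy < h and 0 <= xx < w:
                counts[img[y][x]][img[yy][xx]] += 1
                total += 1
    # Assumes the image is large enough to contain at least one pixel pair.
    return [[c / total for c in row] for row in counts]

def contrast(p):
    """Weighted sum of squared gray-level differences."""
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))

def energy(p):
    """Sum of squared matrix entries (angular second moment)."""
    return sum(v * v for row in p for v in row)

def entropy(p):
    """Shannon entropy of the co-occurrence distribution."""
    return -sum(v * math.log(v) for row in p for v in row if v > 0)

patch = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
p = glcm(patch, levels=4)
features = (contrast(p), energy(p), entropy(p))
```

A feature vector of this kind, concatenated with the connected-component attributes of step 3.1), is what a random forest classifier would receive in step 3.4).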
4. The page layout analysis method for digital books according to claim 3, characterized in that in step 3.3), 29 feature values are selected for the classification analysis of the regions, specifically: area, maxal, minal, eccent, equivDia, solidity, extent, peri, autoc, contr, corrm, corrp, cprom, cshad, dissi, energ, entro, homom, homop, maxpr, sosvh, savgh, svarh, senth, dvarh, inf1h, inf2h, indnc, idmnc; the specific meaning of each feature is as follows:
CN201610584126.7A 2016-07-22 2016-07-22 A kind of printed page analysis method of digital book Active CN106096592B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610584126.7A CN106096592B (en) 2016-07-22 2016-07-22 A kind of printed page analysis method of digital book

Publications (2)

Publication Number Publication Date
CN106096592A CN106096592A (en) 2016-11-09
CN106096592B true CN106096592B (en) 2019-05-24

Family

ID=57450070

Country Status (1)

Country Link
CN (1) CN106096592B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant