CN106096592B - A layout analysis method for digital books - Google Patents
Publication number: CN106096592B
Application number: CN201610584126.7A
Authority: CN (China)
Prior art keywords: region, image, segmentation, Gabor, coordinate
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06V30/414: Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
- G06T5/30: Erosion or dilatation, e.g. thinning
- G06T5/50: Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
- G06V10/40: Extraction of image or video features
- G06T2207/10008: Still image; Photographic image from scanner, fax or copier
- G06T2207/20024: Filtering details
- G06T2207/20212: Image combination
- G06T2207/30176: Document
Abstract
The invention discloses a layout analysis method for digital books. Layout analysis based on region segmentation is essentially image segmentation and region classification applied to book pages stored in JPEG or TIF format. The method first extracts image edges by combining morphological operations with a Gabor linear filter and performs a basic merge of the over-segmented regions, which yields the segmentation of the book page into regions. It then fills and sorts the segmented image blocks to reconstruct the reading order. Finally, it extracts features from the image regions and trains a classifier to obtain the attribute of each region. This separates the regions of the book page and improves both the recognition accuracy of the OCR engine and the accuracy of book retrieval.
Description
Technical field
The present invention relates to layout analysis of pages that exist as images in a digital library, and in particular to a layout analysis technique based on region segmentation and classification.
Background art
With the research and development of computer and network technologies, digital libraries have gradually evolved from information-based processing and simple human-machine interfaces toward knowledge-based processing and broad machine understanding. This lets people use computers and networks to extend their intellectual capabilities, and it plays an extremely important role in every field that needs to exchange, disseminate, store, and use knowledge, including e-commerce, education, and telemedicine.
Because the books in the CADAL digital library exist as images, they must be processed by OCR to recognize the text and must undergo layout analysis before in-depth services can be provided. Professional OCR software currently available includes Hanwang OCR, ABBYY FineReader, SimpleOCR, TopOCR, and FreeOCR.
The professional OCR software used by the digital library is Hanwang OCR. It supports grayscale, black-and-white, and color image files in formats such as JPG, PDF, TIF, and BMP, recognizes Simplified Chinese, Traditional Chinese, and English, and converts the result into output forms such as TXT, RTF, HTM, and XLS. In actual scanning, however, the recognition result is strongly affected when the page contains embedded image blocks. In Fig. 1, (a) is the original image to be processed by OCR, and (b) is the text file obtained by running OCR directly, without separating text from images. The image region of the original page is not handled correctly and produces garbled characters after OCR, because the page file contains image blocks.
Therefore, to improve the results of OCR software and obtain more accurate text files, the image file needs further layout analysis that separates text from image blocks and obtains the coordinates of the image blocks, so that they can be filtered out during OCR processing. This is what the present design sets out to accomplish.
Layout analysis is usually combined with OCR: the image must undergo layout analysis before OCR recognition, and the accuracy of the layout analysis result directly affects how well the document is recognized and restored. The layout analysis problems found in OCR software during experiments (see Fig. 1(b)) are:
(1) Formula recognition fails: complete formulas cannot be extracted, or formulas are extracted as text blocks.
(2) Illustrations such as flowcharts cannot be recognized, or are recognized incompletely.
(3) Because of blurred scans, text blocks are misclassified as illustration regions.
(4) Explanatory text beside a figure is not separated correctly.
Summary of the invention
The purpose of the present invention is to provide a layout analysis method for digital books based on region segmentation and classification, thereby improving the results of OCR software and obtaining more accurate text files.
The technical solution adopted by the present invention to solve the technical problem is a layout analysis method for digital books, comprising the following steps:
1) Region segmentation of the book layout: region segmentation is performed on the digital book page. First, morphological operations (erosion, opening and closing, edge detection, dilation, and removal) are applied to the original image to obtain an over-segmented image, which contains many region fragments. Next, a Gabor linear filter re-extracts the edges of the original image, and the fragmented regions are preliminarily merged. Finally, the preliminarily merged regions are merged again using the position information of the fragmented regions;
2) Reading order reconstruction: the regions obtained in step 1) are filled, distance relations between a virtual point and the regions are added, and an optimization problem with constraints is designed to obtain the reading order;
3) Region type classification: features of the segmented regions are extracted and screened to form a reliable feature vector, the type attribute of each region is obtained, and the useful regions are stored;
4) The coordinates of each region are fed back to the OCR engine so that filtering can be applied during OCR processing, improving the software's results.
Further, the morphological operations, Gabor linear filtering, and fragmented-region merging in the region segmentation of step 1) are specifically:
1.1) Morphological method based on image edges: after the original image is converted to grayscale, erosion, dilation, opening, and closing operations are used to extract the image edges, and isolated image fragments are removed to obtain the over-segmented image;
1.2) Gabor linear filtering: the real part of the Gabor filter performs smoothing, and the imaginary part performs edge detection and initial merging:
Complex form of the Gabor function:
$$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\exp\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right)$$
Real part:
$$g_{re}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x'}{\lambda} + \psi\right)$$
Imaginary part:
$$g_{im}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\sin\left(2\pi\frac{x'}{\lambda} + \psi\right)$$
where:
$$x' = x\cos\theta + y\sin\theta$$
$$y' = -x\sin\theta + y\cos\theta$$
Meaning and choice of the parameters in the formulas:
x, y: the position of a pixel in the spatial domain;
Wavelength (λ): specified in pixels, with 2 ≤ λ ≤ input image size / 5;
Orientation (θ): specifies the direction of the parallel stripes of the Gabor function, with a value range of 0° to 360°;
Phase offset (ψ): value range -180° to 180°; -90° and 90° correspond to antisymmetric functions, 0° corresponds to a center-symmetric center-on function, and 180° corresponds to a center-off function;
Aspect ratio (γ): the spatial aspect ratio, which determines the ellipticity of the Gabor function; when γ = 1 the shape is circular, and when γ < 1 the shape is elongated along the direction of the parallel stripes;
Bandwidth (b): the half-response spatial frequency bandwidth of the Gabor filter; b is related to the ratio σ/λ, where σ is the standard deviation of the Gaussian factor of the Gabor function:
$$b = \log_2\frac{\frac{\sigma}{\lambda}\pi + \sqrt{\frac{\ln 2}{2}}}{\frac{\sigma}{\lambda}\pi - \sqrt{\frac{\ln 2}{2}}}, \qquad \frac{\sigma}{\lambda} = \frac{1}{\pi}\sqrt{\frac{\ln 2}{2}}\cdot\frac{2^b + 1}{2^b - 1}$$
1.3) Re-merging of fragmented regions:
1.3.1) Line merging
Steps 1.1) and 1.2) yield preliminarily merged segmented regions and the position coordinates (left, top, right, bottom) of each region. At this stage the text lines are incomplete: a single line of text is cut into several separate region blocks. Regions are therefore compared by their position coordinates and heights, with allowance for image scanning error. Let region x have coordinates (left_x, top_x, right_x, bottom_x) and region y have coordinates (left_y, top_y, right_y, bottom_y). Regions that simultaneously satisfy the following three coordinate conditions are merged into one text line:
By traversing the regions, the blocks belonging to the same text line are merged;
1.3.2) Overlapping region merging
The segmented regions include many coincident or duplicated regions, which are merged according to their coordinates. Regions that simultaneously satisfy the following two coordinate conditions are defined as overlapping and are merged:
Further, the reading order reconstruction specifically comprises the following sub-steps:
2.1) Line filling of the known regions. Suppose step 1) has produced N regions X = {x_1, x_2, x_3, ..., x_N}, where x_i denotes any one region and the regions are sorted in ascending order of their bottom value. Step 1) gives the position of each x_i. First, let height = top - bottom and make a preliminary judgment, based on a height threshold, of whether the region is an image region. Then, by examining the positional relationships between regions and the translation rules, the regions are expanded to obtain the expanded position (left, top, right, bottom) of each x_i. Text lines need to be expanded in the following two cases, where image region x has coordinates (left_x, top_x, right_x, bottom_x) and region y has coordinates (left_y, top_y, right_y, bottom_y):
2.1.1) A text region appears near an image region:
2.1.2) Expansion of adjacent text regions:
bottom_x < top_{x+1}
2.2) Neighbor block selection: the coordinates of the center point of x_i are computed from its position, and from the center points the distance dist_ij between two regions x_i and x_j is computed:
If the positional relationship between the two regions does not follow normal reading habits, that is, if either of the following two conditions is met, then dist_ij = 1.7976931348623157E308;
The error tolerance is chosen according to the scanning result;
2.3) The optimization problem is designed as follows:
The reading order is constructed by integer linear programming that minimizes the sum of the distances between all regions, expressed as:
This yields an independent circuit covering all regions;
2.4) Constraint design, where:
The constraints are designed as follows:
(1) Each region can be connected to only one other region;
(2) Bidirectional edges between two regions are not allowed;
(3) To respect the reading order, the step from one region to the next cannot go upward or to the left;
(4) A virtual point, a start region, and an end region are constructed artificially, giving the constraint:
u_i - u_j + n x_ij ≤ n - 1, for 1 ≤ i ≠ j ≤ n
Each feasible solution contains exactly one closed sequence of regions covering all regions. The distance from the virtual point to the start region and to the end region is defined as 0, and the distance from the virtual point to every other region is 1.7976931348623157E308. The start region is defined as the region with the smallest bottom value, and the end region as the region with the largest bottom value.
Further, step 3) specifically comprises the following sub-steps:
3.1) The connected components in the binary image are labeled. After the label matrix L of the binary image is obtained, a series of properties is measured for each labeled region in L;
3.2) Region texture features are computed with functions from a flow cytometry toolbox; these functions are built on the gray-level co-occurrence matrix (GLCM). The GLCM reflects combined information about the direction, adjacent interval, and amplitude of variation of the image gray levels.
3.3) Features are selected: feature values are chosen for classifying the regions;
3.4) A random forest algorithm assigns the region type: body text, image, table, formula, header or footer, page number, useless region, or formula label.
Further, in step 3.3), 29 feature values are selected for classifying the regions, namely: area, maxal, minal, eccent, equivDia, solidity, extent, peri, autoc, contr, corrm, corrp, cprom, cshad, dissi, energ, entro, homom, homop, maxpr, sosvh, savgh, svarh, senth, dvarh, inf1h, inf2h, indnc, idmnc. The meaning of each feature is as follows:
Compared with the prior art, the method of the present invention has the following advantages:
1. It processes the image by combining morphological operations with Gabor edge filtering: after the morphological method over-segments the page, Gabor edge filtering re-detects the edges of the over-segmented regions, which realizes the basic merge of the segmented regions;
2. Line merging of text regions and merging of overlapping regions improve the accuracy and completeness of the region segmentation;
3. Text lines are filled by examining the positional relationships between regions and the translation rules; a virtual point, a start region, and an end region are constructed; and an optimization problem with constraints is designed. Together these improve the correctness of the reconstructed reading order and the reader's reading experience;
4. For region classification, texture features based on the gray-level co-occurrence matrix, computed with functions from a flow cytometry toolbox, extend the features extracted from the connected regions of the binary image. A random forest algorithm then handles the classification problem. Experiments show that the method extracts image regions with high accuracy and distinguishes the region attributes well even for images with complex backgrounds or irregular layouts.
Brief description of the drawings
Fig. 1 shows the layout analysis result of OCR;
Fig. 2 shows the result of existing layout analysis;
Fig. 3 shows the modules of the method of the present invention;
Fig. 4 is a flowchart of the present invention;
Fig. 5 is a sample from the CADAL digital library;
Fig. 6 shows the completed region segmentation and reading order reconstruction;
Fig. 7 shows the region type classification results.
Detailed description of the embodiments
The invention is described in further detail below with reference to the drawings and specific embodiments.
The layout analysis method for digital books provided by the invention comprises the following steps:
1) Region segmentation of the book layout: region segmentation is performed on the digital book page. First, morphological operations (erosion, opening and closing, edge detection, dilation, and removal) are applied to the original image to obtain an over-segmented image, which contains many region fragments. Next, a Gabor linear filter re-extracts the edges of the original image, and the fragmented regions are preliminarily merged. Finally, the preliminarily merged regions are merged again using the position information of the fragmented regions;
2) Reading order reconstruction: the regions obtained in step 1) are filled, distance relations between a virtual point and the regions are added, and an optimization problem with constraints is designed to obtain the reading order;
3) Region type classification: features of the segmented regions are extracted and screened to form a reliable feature vector, the type attribute of each region is obtained, and the useful regions are stored;
4) The coordinates of each region are fed back to the OCR engine so that filtering can be applied during OCR processing, improving the software's results.
Further, the morphological operations, Gabor linear filtering, and fragmented-region merging in the region segmentation of step 1) are specifically:
1.1) Morphological method based on image edges: after the original image is converted to grayscale, erosion, dilation, opening, and closing operations are used to extract the image edges, and isolated image fragments are removed to obtain the over-segmented image. The specific implementation steps are as follows:
(1) Threshold the grayscale image to obtain a binary image.
(2) Erode the binary image: create a flat disk-shaped structuring element of radius 8 and erode the image repeatedly.
(3) Dilate the eroded image: to achieve the desired result, create an 8*8 square structuring element and dilate the image with it.
(4) Estimate the image background with the morphological opening imopen and closing imclose. Opening smooths the image contour and breaks narrow connections, which deletes small protrusions; a disk-shaped structuring element of radius 5 is built for the opening. Closing is roughly the inverse of opening: it bridges narrow gaps so that the parts form a whole; a 5*5 square structuring element is used for the closing.
(5) Mark the points in the image where the gray value changes, using the Sobel edge detection operator. The Sobel operator smooths noise and is effective at suppressing its influence.
The Sobel operator comprises a horizontal template and a vertical template; in practice, the following two templates are commonly used to detect image edges.
Template for detecting horizontal edges:
Template for detecting vertical edges:
Gradient magnitude, where G_x and G_y denote the responses of the two templates:
$$G = \sqrt{G_x^2 + G_y^2}$$
Gradient direction:
$$\theta = \arctan\left(\frac{G_y}{G_x}\right)$$
(6) Remove isolated image fragments. The function bwmorph applies a specified morphological operation to the image, and the operation can be applied n times.
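The edge-detection step (5) can be illustrated with a minimal pure-Python sketch. The two 3x3 templates below are the commonly used Sobel kernels and are an assumption, since the patent's own templates are not reproduced in this text; `sobel_at` and `patch` are hypothetical names introduced only for the illustration.

```python
import math

# Standard 3x3 Sobel templates (assumed; the patent's templates are not shown).
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to vertical edges
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to horizontal edges


def sobel_at(img, r, c):
    """Return (gradient magnitude, gradient direction in radians) at an interior pixel."""
    gx = gy = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            v = img[r + dr][c + dc]
            gx += SOBEL_X[dr + 1][dc + 1] * v
            gy += SOBEL_Y[dr + 1][dc + 1] * v
    # Magnitude sqrt(gx^2 + gy^2) and direction atan2(gy, gx).
    return math.hypot(gx, gy), math.atan2(gy, gx)


# A 3x3 patch with a vertical step edge: dark on the left, bright on the right.
patch = [[0, 0, 255],
         [0, 0, 255],
         [0, 0, 255]]
magnitude, direction = sobel_at(patch, 1, 1)
```

For this patch only the vertical-edge template responds, so the gradient points along the positive x axis.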
1.2) Gabor linear filtering: the Gabor filter is optimally localized in both space and frequency and has characteristics similar to human biological vision. It represents and describes well the local structure information corresponding to spatial frequency, spatial position, and orientation selectivity. The Gabor filter is also self-similar: a mother wavelet can generate, by dilation and rotation, the Gabor filters with the different parameters needed in experiments. In practice, Gabor filters extract relevant features at different scales and orientations in the frequency domain.
The Gabor function has two parts, real and imaginary. The real part performs smoothing, and the imaginary part performs edge detection and initial merging:
Complex form of the Gabor function:
$$g(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\exp\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right)$$
Real part:
$$g_{re}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\cos\left(2\pi\frac{x'}{\lambda} + \psi\right)$$
Imaginary part:
$$g_{im}(x, y; \lambda, \theta, \psi, \sigma, \gamma) = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\sin\left(2\pi\frac{x'}{\lambda} + \psi\right)$$
where:
$$x' = x\cos\theta + y\sin\theta$$
$$y' = -x\sin\theta + y\cos\theta$$
Meaning and choice of the parameters in the formulas:
x, y: the position of a pixel in the spatial domain;
Wavelength (λ): specified in pixels, with 2 ≤ λ ≤ input image size / 5;
Orientation (θ): specifies the direction of the parallel stripes of the Gabor function, with a value range of 0° to 360°;
Phase offset (ψ): value range -180° to 180°; -90° and 90° correspond to antisymmetric functions, 0° corresponds to a center-symmetric center-on function, and 180° corresponds to a center-off function;
Aspect ratio (γ): the spatial aspect ratio, which determines the ellipticity of the Gabor function; when γ = 1 the shape is circular, and when γ < 1 the shape is elongated along the direction of the parallel stripes;
Bandwidth (b): the half-response spatial frequency bandwidth of the Gabor filter; b is related to the ratio σ/λ, where σ is the standard deviation of the Gaussian factor of the Gabor function:
$$b = \log_2\frac{\frac{\sigma}{\lambda}\pi + \sqrt{\frac{\ln 2}{2}}}{\frac{\sigma}{\lambda}\pi - \sqrt{\frac{\ln 2}{2}}}, \qquad \frac{\sigma}{\lambda} = \frac{1}{\pi}\sqrt{\frac{\ln 2}{2}}\cdot\frac{2^b + 1}{2^b - 1}$$
The specific implementation steps are as follows:
(1) Convert the image im to type double;
(2) Construct the cosine (even) filter and the sine (odd) filter, adjust the filter parameters, and test them;
(3) Rotate the filters with the imrotate function;
evenFilter = imrotate(evenFilter, angle, 'bilinear');
oddFilter = imrotate(oddFilter, angle, 'bilinear');
(4) Apply the filters to the image;
Eim = filter2(evenFilter, im); % even filter result
Oim = filter2(oddFilter, im); % odd filter result
Aim = sqrt(Eim.^2 + Oim.^2); % amplitude
Return the amplitude image.
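As a rough illustration of how the even (cosine) and odd (sine) filters could be built from the parameters λ, θ, ψ, σ, and γ, the sketch below uses only the Python standard library. It assumes the standard Gabor function and the standard bandwidth relation between σ/λ and b; the function names `gabor_kernels` and `sigma_from_bandwidth` and the kernel half-width of 3σ are hypothetical choices, not taken from the patent.

```python
import math


def gabor_kernels(sigma, lam, theta, psi=0.0, gamma=1.0):
    """Return (even, odd) kernels: the real and imaginary parts of a Gabor function."""
    half = int(math.ceil(3 * sigma))  # assumed kernel half-width
    even, odd = [], []
    for y in range(-half, half + 1):
        row_even, row_odd = [], []
        for x in range(-half, half + 1):
            # Rotate the coordinates by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            phase = 2 * math.pi * xr / lam + psi
            row_even.append(envelope * math.cos(phase))  # smoothing part
            row_odd.append(envelope * math.sin(phase))   # edge-detecting part
        even.append(row_even)
        odd.append(row_odd)
    return even, odd


def sigma_from_bandwidth(lam, b):
    """Standard deviation sigma implied by wavelength lam and bandwidth b (octaves)."""
    return lam / math.pi * math.sqrt(math.log(2) / 2) * (2 ** b + 1) / (2 ** b - 1)


even, odd = gabor_kernels(sigma=2.0, lam=4.0, theta=0.0)
```

With ψ = 0 the even kernel is symmetric about its center and the odd kernel is antisymmetric, which is why the odd response reacts to edges while the even response smooths. The amplitude image combines the two responses as the square root of the sum of their squares.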
1.3) Re-merging of fragmented regions:
1.3.1) Line merging
Steps 1.1) and 1.2) yield preliminarily merged segmented regions and the position coordinates (left, top, right, bottom) of each region. At this stage the text lines are incomplete: a single line of text is cut into several separate region blocks. Regions are therefore compared by their position coordinates and heights, with allowance for image scanning error. Let region x have coordinates (left_x, top_x, right_x, bottom_x) and region y have coordinates (left_y, top_y, right_y, bottom_y). Regions that simultaneously satisfy the following three coordinate conditions are merged into one text line:
By traversing the regions, the blocks belonging to the same text line are merged;
1.3.2) Overlapping region merging
The segmented regions include many coincident or duplicated regions, which are merged according to their coordinates. Regions that simultaneously satisfy the following two coordinate conditions are defined as overlapping and are merged:
The specific steps are as follows:
Given: two rectangles A and B. Let the top-left corner of rectangle A be (A.left, A.top) and its bottom-right corner be (A.right, A.bottom); let the top-left corner of rectangle B be (B.left, B.top) and its bottom-right corner be (B.right, B.bottom).
Output: if rectangle A coincides with or intersects rectangle B, merge the two and obtain the coordinates of the merged rectangle C.
Any point (x, y) in rectangle A, including the points on its four sides, satisfies the following inequalities:
A.left≤x≤A.right ①
A.top≤y≤A.bottom ②
Similarly, any point in B satisfies
B.left≤x≤B.right ③
B.top≤y≤B.bottom ④
If A and B overlap, there must be a point that satisfies ①, ②, ③, and ④ simultaneously, so
max(A.left, B.left) ≤ min(A.right, B.right)
max(A.top, B.top) ≤ min(A.bottom, B.bottom)
After merging, more complete image blocks and text blocks are obtained.
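The overlap test above translates directly into code. The following is a minimal sketch, assuming image coordinates in which top ≤ bottom and left ≤ right; taking the merged rectangle C to be the bounding box of A and B is an assumption, since the text does not spell out how C is computed. `Rect`, `overlaps`, `merge`, and `merge_all` are hypothetical names.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True if some point satisfies inequalities 1-4 for both rectangles."""
    return (max(a.left, b.left) <= min(a.right, b.right)
            and max(a.top, b.top) <= min(a.bottom, b.bottom))


def merge(a: Rect, b: Rect) -> Rect:
    """Bounding box of a and b (assumed definition of the merged rectangle C)."""
    return Rect(min(a.left, b.left), min(a.top, b.top),
                max(a.right, b.right), max(a.bottom, b.bottom))


def merge_all(rects: List[Rect]) -> List[Rect]:
    """Merge overlapping rectangles repeatedly until no pair overlaps."""
    rects = list(rects)
    changed = True
    while changed:
        changed = False
        for i in range(len(rects)):
            for j in range(i + 1, len(rects)):
                if overlaps(rects[i], rects[j]):
                    rects[i] = merge(rects[i], rects[j])
                    del rects[j]
                    changed = True
                    break
            if changed:
                break
    return rects


regions = [Rect(0, 0, 10, 10), Rect(5, 5, 20, 20), Rect(30, 30, 40, 40)]
merged = merge_all(regions)
```

The loop restarts after every merge because a merged rectangle can newly overlap a region that neither of its parts touched.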
Further, the reading order reconstruction specifically comprises the following sub-steps:
2.1) Line filling of the known regions. Suppose step 1) has produced N regions X = {x_1, x_2, x_3, ..., x_N}, where x_i denotes any one region and the regions are sorted in ascending order of their bottom value. Step 1) gives the position of each x_i. First, let height = top - bottom and make a preliminary judgment, based on a height threshold, of whether the region is an image region. Then, by examining the positional relationships between regions and the translation rules, the regions are expanded to obtain the expanded position (left, top, right, bottom) of each x_i. Text lines need to be expanded in the following two cases, where image region x has coordinates (left_x, top_x, right_x, bottom_x) and region y has coordinates (left_y, top_y, right_y, bottom_y):
2.1.1) A text region appears near an image region:
2.1.2) Expansion of adjacent text regions:
bottom_x < top_{x+1}
2.2) Neighbor block selection: the coordinates of the center point of x_i are computed from its position, and from the center points the distance dist_ij between two regions x_i and x_j is computed:
If the positional relationship between the two regions does not follow normal reading habits, that is, if either of the following two conditions is met, then dist_ij = 1.7976931348623157E308;
The error tolerance is chosen according to the scanning result;
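The neighbor distance of step 2.2) might be sketched as follows. Several details are assumptions because the formulas are not reproduced in this text: the center is taken as the midpoint of the bounding box, the distance is Euclidean, and the two "violates reading habit" conditions are modeled as "the candidate successor lies above" or "lies to the left on the same row", each with a tolerance `err`. The sentinel equals the largest finite double, which is the value 1.7976931348623157E308 quoted above.

```python
import math
import sys

INF_DIST = sys.float_info.max  # 1.7976931348623157e308, the sentinel used in the text


def center(region):
    """Midpoint of a (left, top, right, bottom) box (assumed definition)."""
    left, top, right, bottom = region
    return (left + right) / 2, (top + bottom) / 2


def dist(ri, rj, err=5.0):
    """Distance from region i to a candidate successor j in reading order."""
    cxi, cyi = center(ri)
    cxj, cyj = center(rj)
    # Hypothetical reading-habit rules: the next region may not lie above the
    # current one, nor to its left on the same row, beyond the tolerance err.
    above = cyj < cyi - err
    left_on_same_row = abs(cyj - cyi) <= err and cxj < cxi - err
    if above or left_on_same_row:
        return INF_DIST
    return math.hypot(cxj - cxi, cyj - cyi)


line_1 = (0, 0, 100, 20)    # upper text line
line_2 = (0, 30, 100, 50)   # text line directly below it
forward = dist(line_1, line_2)
backward = dist(line_2, line_1)
```

Moving from the upper line to the lower line gets an ordinary distance, while the reverse move is assigned the sentinel so that the optimizer never selects it.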
2.3) The optimization problem is designed as follows:
The reading order is constructed by integer linear programming that minimizes the sum of the distances between all regions, expressed as:
This yields an independent circuit covering all regions;
2.4) Constraint design, where:
The constraints are designed as follows:
(1) Each region can be connected to only one other region;
(2) Bidirectional edges between two regions are not allowed;
(3) To respect the reading order, the step from one region to the next cannot go upward or to the left;
(4) A virtual point, a start region, and an end region are constructed artificially, giving the constraint:
u_i - u_j + n x_ij ≤ n - 1, for 1 ≤ i ≠ j ≤ n
Each feasible solution contains exactly one closed sequence of regions covering all regions. The distance from the virtual point to the start region and to the end region is defined as 0, and the distance from the virtual point to every other region is 1.7976931348623157E308. The start region is defined as the region with the smallest bottom value, and the end region as the region with the largest bottom value.
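For a handful of regions the effect of this optimization can be reproduced by brute force instead of an integer linear programming solver. The sketch below is such a substitute, not the patent's solver: it enumerates every order from the start region to the end region, keeps the cheapest one, and then checks that the result satisfies the constraint u_i - u_j + n x_ij ≤ n - 1, with u_i taken as the position of region i in the order and x_ij = 1 when j directly follows i. The distance matrix and the names `best_order` and `satisfies_order_constraint` are hypothetical.

```python
import sys
from itertools import permutations

BIG = sys.float_info.max  # sentinel distance for forbidden transitions


def best_order(dist, start, end):
    """Cheapest path visiting every region once, from start to end (brute force)."""
    middle = [k for k in range(len(dist)) if k not in (start, end)]
    best, best_cost = None, float("inf")
    for perm in permutations(middle):
        order = [start, *perm, end]
        cost = sum(dist[a][b] for a, b in zip(order, order[1:]))
        if cost < best_cost:
            best, best_cost = order, cost
    return best, best_cost


def satisfies_order_constraint(order):
    """Check u_i - u_j + n*x_ij <= n - 1 for all i != j along the given order."""
    n = len(order)
    u = {region: pos for pos, region in enumerate(order, start=1)}
    successor = dict(zip(order, order[1:]))
    for i in order:
        for j in order:
            if i != j:
                x_ij = 1 if successor.get(i) == j else 0
                if u[i] - u[j] + n * x_ij > n - 1:
                    return False
    return True


# Hypothetical distances between four regions sorted top to bottom;
# backward (upward) transitions carry the sentinel.
dist = [[0, 10, 25, 40],
        [BIG, 0, 12, 30],
        [BIG, BIG, 0, 11],
        [BIG, BIG, BIG, 0]]
order, cost = best_order(dist, start=0, end=3)
```

Brute force grows factorially with the number of regions, so it is only useful for checking small cases; the constraint check shows why a single ordered sequence is feasible while disconnected sub-circuits would not be.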
Step 3) specifically comprises the following sub-steps:
3.1) The connected components in the binary image are labeled. After the label matrix L of the binary image is obtained, a series of properties is measured for each labeled region in L;
3.2) Region texture features are computed with functions from a flow cytometry toolbox; these functions are built on the gray-level co-occurrence matrix (GLCM). The GLCM reflects combined information about the direction, adjacent interval, and amplitude of variation of the image gray levels.
3.3) Features are selected: 29 feature values are chosen for classifying the regions, namely: area, maxal, minal, eccent, equivDia, solidity, extent, peri, autoc, contr, corrm, corrp, cprom, cshad, dissi, energ, entro, homom, homop, maxpr, sosvh, savgh, svarh, senth, dvarh, inf1h, inf2h, indnc, idmnc. The meaning of each feature is as follows:
3.4) A random forest algorithm assigns the region type: body text, image, table, formula, header or footer, page number, useless region, or formula label.
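To make the texture features of step 3.2) concrete, here is a small pure-Python sketch of a gray-level co-occurrence matrix and three features commonly derived from it. Reading the names contr, energ, and entro in the feature list as contrast, energy, and entropy is an assumption, and the helper names and the tiny 4-level image are hypothetical.

```python
import math


def glcm(img, levels, dr=0, dc=1):
    """Normalized co-occurrence matrix for the pixel offset (dr, dc)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(img), len(img[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[img[r][c]][img[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def contrast(p):
    return sum((i - j) ** 2 * v for i, row in enumerate(p) for j, v in enumerate(row))


def energy(p):
    return sum(v * v for row in p for v in row)


def entropy(p):
    return -sum(v * math.log(v) for row in p for v in row if v > 0)


# 4x4 image with gray levels 0..3; offset (0, 1) pairs each pixel with its right neighbor.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
p = glcm(image, levels=4)
features = [contrast(p), energy(p), entropy(p)]
```

A feature vector of this kind, concatenated with the shape properties from step 3.1), is what a classifier such as a random forest would be trained on.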
Embodiment
The specific steps of this example are described in detail below with reference to the method of the invention. A page of a scanned e-book from the CADAL digital library, shown in Fig. 5, is used as an example to illustrate the whole flow of Fig. 3.
1) Read the original image, convert the color image to grayscale, and load the grayscale image.
2) Extract the image edges. Create an 8*8 square structuring element and dilate the image with it. Estimate the image background with the morphological opening imopen and closing imclose. Opening smooths the image contour and breaks narrow connections, which deletes small protrusions; a disk-shaped structuring element of radius 5 is built for the opening. Closing is roughly the inverse of opening: it bridges narrow gaps so that the parts form a whole; a 5*5 square structuring element is used for the closing. Mark the points in the image where the gray value changes, using the Sobel edge detection operator.
3) Apply linear Gabor filtering to perform a further edge detection on the over-segmented image.
4) Merge the segmented regions that coincide with or contain one another, using the coordinates of each segmented region.
5) Expand the text lines obtained in step 4) to get the coordinates (left, right, top, bottom) of each expanded region, compute (centerx, centery), and from them compute dist_ij.
Construct the virtual point, the start region, and the end region, and build for the above N+3 regions the constraint
u_i - u_j + n x_ij ≤ n - 1, for 1 ≤ i ≠ j ≤ n
and solve the optimization problem:
6) Extract the features of the segmented regions: based on the gray-level co-occurrence matrix, extract the feature values of the merged regions, including feature values based on the connected components of the binary image and texture features.
7) Classify the region attributes using the feature values obtained in step 6), and store the region positions.
Claims (4)
1. a kind of printed page analysis method of digital book, which comprises the following steps:
1) region segmentation of the books space of a whole page: region segmentation is carried out to digital book page, firstly, using morphology operations to original
Image does burn into opening and closing, edge detection, expansion removal processing, and segmentation picture is obtained, has more area in over-segmentation picture
Domain fragment carries out extracting again for original image edge followed by Gabor linear filter, is tentatively closed to fragmentation region
And finally the region after preliminary merge is remerged using the location information in fragmentation region;
2) Reading-order reconstruction: the regions obtained from the segmentation in step 1) are filled, a virtual point and the distance relationships between regions are added, and an optimization problem with constraints is designed to obtain the reading order. The reading-order reconstruction specifically comprises the following sub-steps:
2.1) Line filling of the known regions: assume that step 1) has produced N regions, $X = \{x_1, x_2, x_3, \ldots, x_N\}$, where $x_i$ denotes any one region and the $x_i$ are sorted in increasing order of the bottom value of the segmented regions. The position of each region $x_i$ is obtained from step 1). First, let height = top - bottom, and preliminarily judge from a threshold on height whether a region is an image region. Then, by judging the positional relationships between regions and the typesetting conventions, expand the regions to obtain the expanded position (left, top, right, bottom) of each $x_i$. Text-line expansion is required in the following two cases, where the image region x has coordinates ($left_x$, $top_x$, $right_x$, $bottom_x$) and region y has coordinates ($left_y$, $top_y$, $right_y$, $bottom_y$):
2.1.1) A text region appears near an image region:
2.1.2) Expansion of adjacent text regions:
bottomx<topx+1
2.2) Neighbor-block selection: the coordinates of the center point of each region $x_i$ are computed from its position, and from these the distance between two regions $x_i$ and $x_j$ is computed:
Here, if the positional relationship between the two regions does not conform to normal reading habits, that is, if it satisfies either of the following two conditions, then $\mathrm{dist}_{ij}$ is set to 1.7976931348623157E308;
where the error value depends on the scanning result;
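The distance rule of sub-step 2.2) can be sketched in Python as follows. This is only an illustration under stated assumptions: the distance is taken to be the Euclidean distance between center points, image coordinates are assumed to grow downward, and `violates_reading_habit` with its `error` tolerance is a hypothetical stand-in for the two conditions, which are not reproduced in the text above. The sentinel value 1.7976931348623157E308 is the largest finite double, available in Python as `sys.float_info.max`.

```python
import math
import sys

# Sentinel for "region j may not follow region i": the largest finite double.
FORBIDDEN = sys.float_info.max  # 1.7976931348623157e308


def center(region):
    """Return (center_x, center_y) of a (left, top, right, bottom) box."""
    left, top, right, bottom = region
    return ((left + right) / 2.0, (top + bottom) / 2.0)


def violates_reading_habit(a, b, error=5.0):
    """Hypothetical rule: b may not follow a if b lies above a,
    or sits on the same row but to the left of a (y grows downward)."""
    ax, ay = center(a)
    bx, by = center(b)
    above = by < ay - error
    same_row_left = abs(by - ay) <= error and bx < ax
    return above or same_row_left


def dist(a, b, error=5.0):
    """Assumed Euclidean center distance, or FORBIDDEN for invalid pairs."""
    if violates_reading_habit(a, b, error):
        return FORBIDDEN
    (ax, ay), (bx, by) = center(a), center(b)
    return math.hypot(bx - ax, by - ay)
```

For example, with `a = (0, 0, 100, 20)` above `b = (0, 30, 100, 50)`, `dist(a, b)` is an ordinary distance while `dist(b, a)` returns the sentinel, so an optimizer will never choose to read `a` after `b`.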
2.3) The optimization problem is designed as follows:
The reading order is constructed by integer linear programming so that the sum of the distances between all regions is minimized, expressed by the following formula:
This yields an independent circuit that covers all regions;
2.4) Constraint design, where:
The constraints are designed as follows:
(1) each region can be connected to only one other region;
(2) bidirectional edges between two regions are not allowed;
(3) in view of the reading order, the order between two regions must not run toward the upper left;
(4) a virtual point, a start region, and an end region are artificially constructed, giving the constraint:
$u_i - u_j + n x_{ij} \le n - 1$, for $1 \le i \ne j \le n$
Each feasible solution contains exactly one closed sequence covering all regions. The distance from the virtual point to the start region and to the end region is defined as 0, and the distance from the virtual point to every other region is 1.7976931348623157E308. The start region is defined as the region with the smallest bottom value, and the end region as the region with the largest bottom value;
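The constraint in (4) has the form of the Miller-Tucker-Zemlin subtour-elimination constraint used in travelling-salesman formulations. The sketch below illustrates the idea on small inputs; it deliberately swaps the integer linear program for brute-force enumeration, so it is not the solver the claim describes. Because the virtual point is joined to the start and end regions at zero cost, the cheapest circuit corresponds to the cheapest path from start to end that visits every region once. The function names and the choice $u_i$ = position in the path are illustrative assumptions.

```python
from itertools import permutations


def reading_order(dist, start, end):
    """Cheapest path start -> ... -> end visiting every region exactly once.

    dist[i][j] is the cost of reading region j right after region i.
    Assumes at least two regions and start != end. Brute force, so only
    practical for a handful of regions.
    """
    n = len(dist)
    middle = [i for i in range(n) if i not in (start, end)]
    best_cost, best_path = float("inf"), None
    for perm in permutations(middle):
        path = [start, *perm, end]
        cost = sum(dist[a][b] for a, b in zip(path, path[1:]))
        if cost < best_cost:
            best_cost, best_path = cost, path
    return best_path, best_cost


def satisfies_mtz(path):
    """Check u_i - u_j + n*x_ij <= n - 1 with u_i = position of i in path."""
    n = len(path)
    u = {region: pos for pos, region in enumerate(path, start=1)}
    successor = dict(zip(path, path[1:]))
    for i in path:
        for j in path:
            if i == j:
                continue
            x_ij = 1 if successor.get(i) == j else 0
            if u[i] - u[j] + n * x_ij > n - 1:
                return False
    return True
```

Any single path that covers all regions passes `satisfies_mtz`, whereas a solution split into several disjoint loops admits no consistent assignment of the $u_i$ values; that is how the constraint forces one sequence over all regions.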
3) Region type classification: extract and screen the features of the segmented regions to form a reliable feature vector, obtain the region type attribute, and store the useful regions;
4) Feed the coordinates of each region back to the OCR engine, so that filtering can be applied during OCR processing and analysis, improving the software's processing results.
2. The printed-page (layout) analysis method for digital books according to claim 1, characterized in that the morphological operations, Gabor linear filtering, and fragmented-region merging in the book layout region segmentation of step 1) are specifically:
1.1) An edge-based morphological method: after a grayscale transformation of the original image, the image edges are extracted using erosion, dilation, and opening and closing operations, and isolated image elements are removed to obtain the over-segmented image;
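The morphological operations named in 1.1) can be sketched on a binary image as follows. This is a minimal pure-Python illustration, assuming a 3x3 square structuring element, pixels outside the image treated as 0, and the morphological gradient (dilation minus erosion) as the edge operator; the claim itself does not fix any of these choices.

```python
def _window(img, r, c):
    """Values in the 3x3 window around (r, c); outside pixels count as 0."""
    h, w = len(img), len(img[0])
    return [img[r + dr][c + dc] if 0 <= r + dr < h and 0 <= c + dc < w else 0
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)]


def erode(img):
    """A pixel stays 1 only if its whole 3x3 window is 1."""
    return [[int(all(_window(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]


def dilate(img):
    """A pixel becomes 1 if any pixel in its 3x3 window is 1."""
    return [[int(any(_window(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]


def opening(img):
    """Erosion then dilation: removes isolated specks."""
    return dilate(erode(img))


def closing(img):
    """Dilation then erosion: fills small gaps."""
    return erode(dilate(img))


def edge(img):
    """Morphological gradient: dilation minus erosion."""
    d, e = dilate(img), erode(img)
    return [[d[r][c] - e[r][c] for c in range(len(img[0]))]
            for r in range(len(img))]
```

Applied to a page with a solid 3x3 block and one stray pixel, `opening` keeps the block and drops the stray pixel, which matches the "remove isolated image elements" step in spirit.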
1.2) Gabor linear filtering: the real part of the Gabor filter performs smoothing, and the imaginary part performs edge detection and initial merging:
Complex form of the Gabor function:
Real part:
Imaginary part:
where:
$x' = x\cos\theta + y\sin\theta$
$y' = -x\sin\theta + y\cos\theta$
Meaning and configuration of the parameters in the formula:
x, y: the position of the pixel in the spatial domain;
Wavelength λ: specified in pixels, with 2 ≤ λ ≤ input image size / 5;
Orientation θ: specifies the orientation of the parallel stripes of the Gabor function, with a value range of 0° to 360°;
Phase offset ψ: value range 0° to 180°; 0° corresponds to a center-symmetric center-on function, 180° to a center-symmetric center-off function, and -90° and 90° to antisymmetric functions;
Aspect ratio γ: the spatial aspect ratio, which determines the ellipticity of the Gabor function's shape. When γ = 1 the shape is circular; when γ < 1 the shape is elongated along the direction of the parallel stripes;
Bandwidth b: the half-response spatial-frequency bandwidth of the Gabor filter. The bandwidth b is related to the ratio σ/λ, where σ denotes the standard deviation of the Gaussian factor of the Gabor function:
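As an illustration of how these parameters interact, the sketch below samples a Gabor function in pure Python. It assumes the standard textbook complex form, $g = \exp\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\exp\left(i\left(2\pi\frac{x'}{\lambda} + \psi\right)\right)$, and the usual bandwidth relation $\frac{\sigma}{\lambda} = \frac{1}{\pi}\sqrt{\frac{\ln 2}{2}}\cdot\frac{2^b + 1}{2^b - 1}$; both are consistent with the parameter list above but are assumptions here rather than text taken from the claim. Angles are in radians, and the function names are illustrative.

```python
import math


def sigma_from_bandwidth(wavelength, bandwidth=1.0):
    """Gaussian standard deviation from the half-response bandwidth b (octaves)."""
    ratio = (math.sqrt(math.log(2) / 2.0) / math.pi
             * (2.0 ** bandwidth + 1.0) / (2.0 ** bandwidth - 1.0))
    return ratio * wavelength


def gabor(x, y, wavelength, theta, psi, gamma, sigma):
    """Return (real, imaginary) parts of the Gabor function at pixel (x, y)."""
    xr = x * math.cos(theta) + y * math.sin(theta)    # x'
    yr = -x * math.sin(theta) + y * math.cos(theta)   # y'
    envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr)
                        / (2.0 * sigma * sigma))
    phase = 2.0 * math.pi * xr / wavelength + psi
    return envelope * math.cos(phase), envelope * math.sin(phase)


def gabor_kernel(half_size, **params):
    """Sample on a (2*half_size + 1)^2 grid; returns (real, imag) 2-D lists."""
    coords = range(-half_size, half_size + 1)
    samples = [[gabor(x, y, **params) for x in coords] for y in coords]
    real = [[s[0] for s in row] for row in samples]
    imag = [[s[1] for s in row] for row in samples]
    return real, imag
```

With ψ = 0 the real kernel is symmetric about the center and acts as a smoother, while the imaginary kernel is antisymmetric and responds to edges, which is the division of labor described in 1.2).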
1.3) Re-merging of fragmented regions:
1.3.1) Row merging
Steps 1.1) and 1.2) yield the preliminarily merged segmented regions and the position coordinates (left, top, right, bottom) of each region. At this point the text lines are incomplete: a single line of text is cut into several separate region blocks. The position coordinates and heights of the regions are compared, while allowing for the error introduced by image scanning. Assuming that region x has coordinates ($left_x$, $top_x$, $right_x$, $bottom_x$) and region y has coordinates ($left_y$, $top_y$, $right_y$, $bottom_y$), regions that simultaneously satisfy the following three coordinate conditions are merged into one text line:
By traversing the regions, text on the same line is merged into rows;
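Row merging by traversal can be sketched as below. The three tests in `same_text_line` (aligned tops, aligned bottoms, small horizontal gap) and the `error` and `max_gap` tolerances are hypothetical placeholders chosen for illustration; they are not the three coordinate conditions of the claim, which are not reproduced in the text above. Boxes are (left, top, right, bottom) tuples with left < right and top < bottom assumed.

```python
def same_text_line(a, b, error=3, max_gap=20):
    """Hypothetical row test on two (left, top, right, bottom) boxes."""
    tops_align = abs(a[1] - b[1]) <= error
    bottoms_align = abs(a[3] - b[3]) <= error
    gap = max(a[0], b[0]) - min(a[2], b[2])  # negative when boxes overlap
    return tops_align and bottoms_align and gap <= max_gap


def union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_rows(regions, error=3, max_gap=20):
    """Traverse the regions, merging pairs until no pair passes the test."""
    regions = list(regions)
    i = 0
    while i < len(regions):
        for j in range(i + 1, len(regions)):
            if same_text_line(regions[i], regions[j], error, max_gap):
                regions[i] = union(regions[i], regions.pop(j))
                i = 0  # a grown box may now match earlier boxes; restart
                break
        else:
            i += 1
    return regions
```

The restart after each merge keeps the result independent of the input order, at the cost of extra passes; for the few hundred boxes on a typical page this is unlikely to matter.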
1.3.2) Overlapping-region merging
The segmented regions contain many overlapping or duplicated regions, which are merged according to their coordinate features. Regions that simultaneously satisfy the following two coordinate conditions are defined as overlapping regions and are merged:
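A sketch of overlap merging, under the assumption that the two coordinate conditions amount to "the horizontal extents intersect" and "the vertical extents intersect" (containment is then a special case). That reading is a guess for illustration, since the conditions themselves are not reproduced in the text above. Boxes are (left, top, right, bottom) with left < right and top < bottom assumed.

```python
def overlaps(a, b):
    """Assumed test: both the horizontal and the vertical extents intersect."""
    horizontal = a[0] <= b[2] and b[0] <= a[2]
    vertical = a[1] <= b[3] and b[1] <= a[3]
    return horizontal and vertical


def bounding_union(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_overlapping(regions):
    """Merge overlapping or nested boxes until all remaining boxes are disjoint."""
    regions = list(regions)
    i = 0
    while i < len(regions):
        for j in range(i + 1, len(regions)):
            if overlaps(regions[i], regions[j]):
                regions[i] = bounding_union(regions[i], regions.pop(j))
                i = 0  # the grown box may now touch earlier boxes; restart
                break
        else:
            i += 1
    return regions
```

For instance, two partially overlapping boxes plus a small box nested inside them collapse into one bounding box, while a distant box is left untouched.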
3. The printed-page (layout) analysis method for digital books according to claim 1, characterized in that step 3) specifically comprises the following sub-steps:
3.1) The connected components in the binary image are labeled; after the label matrix of the binary image is obtained, a series of attributes is measured for each labeled region in the label matrix L;
3.2) Region texture features are computed with functions from the flow cytometer toolbox; these functions are built on the gray-level co-occurrence matrix (GLCM). The gray-level co-occurrence matrix reflects comprehensive information about the image gray levels with respect to direction, adjacent interval, and amplitude of variation;
3.3) Feature selection: feature values are selected for the classification analysis of the regions;
3.4) A random forest algorithm is used to assign the region type, including body text, image, table, formula, header and footer, page number, invalid region, and formula label.
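The gray-level co-occurrence matrix of sub-step 3.2) can be sketched in pure Python as follows. The sketch assumes an image already quantized to integer levels `0..levels-1`, a single pixel offset `(dr, dc)`, and a non-symmetric, normalized matrix; it computes four common GLCM statistics under descriptive names. How these map onto the abbreviated feature names used in the claims is not stated in the text, so no such mapping is asserted here.

```python
import math


def glcm(img, levels, dr=0, dc=1):
    """Normalized co-occurrence matrix for pixel pairs at offset (dr, dc).

    Assumes the image is large enough that at least one pair exists.
    """
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(img), len(img[0])
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < h and 0 <= c2 < w:
                counts[img[r][c]][img[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Contrast, energy, entropy and homogeneity of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "contrast": sum(v * (i - j) ** 2 for i, j, v in cells),
        "energy": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log(v) for _, _, v in cells if v > 0),
        "homogeneity": sum(v / (1 + (i - j) ** 2) for i, j, v in cells),
    }
```

Feature dictionaries like this, one per region, are the kind of input a random forest classifier would consume in sub-step 3.4); in practice several offsets and directions are usually combined.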
4. The printed-page (layout) analysis method for digital books according to claim 3, characterized in that, in step 3.3), 29 feature values are selected for the classification analysis of the regions, namely: area, maxal, minal, eccent, equivDia, solidity, extent, peri, autoc, contr, corrm, corrp, cprom, cshad, dissi, energ, entro, homom, homop, maxpr, sosvh, savgh, svarh, senth, dvarh, inf1h, inf2h, indnc, idmnc; the specific meaning of each feature is as follows:
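For illustration only, the connected-component side of the feature list (for example area and extent) can be sketched as below. The sketch assumes 8-connectivity and takes extent to mean region area divided by bounding-box area, which is the common definition; the claim text above does not spell out either choice, and the function names are illustrative.

```python
from collections import deque


def label_components(img):
    """8-connected labeling of a binary image.

    Returns (label matrix, number of components); 0 marks background.
    """
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    current = 0
    for r in range(h):
        for c in range(w):
            if img[r][c] and not labels[r][c]:
                current += 1
                labels[r][c] = current
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not labels[ny][nx]):
                                labels[ny][nx] = current
                                queue.append((ny, nx))
    return labels, current


def region_attributes(labels, label):
    """Area, bounding box (left, top, right, bottom) and extent of one region."""
    pixels = [(r, c) for r, row in enumerate(labels)
              for c, v in enumerate(row) if v == label]
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    box_h = max(rows) - min(rows) + 1
    box_w = max(cols) - min(cols) + 1
    area = len(pixels)
    return {
        "area": area,
        "bbox": (min(cols), min(rows), max(cols), max(rows)),
        "extent": area / (box_h * box_w),
    }
```

A text block tends to have a moderate extent because of inter-character gaps, while a solid figure approaches 1, which is one reason such simple shape measures are useful alongside texture statistics.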
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610584126.7A CN106096592B (en) | 2016-07-22 | 2016-07-22 | A kind of printed page analysis method of digital book |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106096592A (en) | 2016-11-09 |
CN106096592B (en) | 2019-05-24 |
Family
ID=57450070
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610584126.7A Active CN106096592B (en) | 2016-07-22 | 2016-07-22 | A kind of printed page analysis method of digital book |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106096592B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107301418A (en) * | 2017-06-28 | 2017-10-27 | 江南大学 | Printed page analysis in optical character identification |
CN109389116B (en) * | 2017-08-14 | 2022-02-08 | 阿里巴巴(中国)有限公司 | Character detection method and device |
CN108021900B (en) * | 2017-12-18 | 2022-05-17 | 科大讯飞股份有限公司 | Layout column dividing method and device |
CN109308476B (en) * | 2018-09-06 | 2019-08-27 | 邬国锐 | Billing information processing method, system and computer readable storage medium |
CN109948123B (en) * | 2018-11-27 | 2023-06-02 | 创新先进技术有限公司 | Image merging method and device |
CN110097046A (en) * | 2019-03-11 | 2019-08-06 | 上海肇观电子科技有限公司 | A kind of character detecting method and device, equipment and computer readable storage medium |
CN109933756B (en) * | 2019-03-22 | 2022-04-15 | 腾讯科技(深圳)有限公司 | Image file transferring method, device and equipment based on OCR (optical character recognition), and readable storage medium |
CN110059596B (en) * | 2019-04-03 | 2020-07-07 | 北京字节跳动网络技术有限公司 | Image identification method, device, medium and electronic equipment |
CN110263792B (en) * | 2019-06-12 | 2021-10-22 | 广东小天才科技有限公司 | Image recognizing and reading and data processing method, intelligent pen, system and storage medium |
CN113033338B (en) * | 2021-03-09 | 2024-03-29 | 太极计算机股份有限公司 | Electronic header edition headline news position identification method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1604075A (en) * | 2004-11-22 | 2005-04-06 | 北京北大方正技术研究院有限公司 | Method for conducting words reading sequence recovery for newspaper pages |
CN1604074A (en) * | 2004-11-22 | 2005-04-06 | 北京北大方正技术研究院有限公司 | Method for determining words reading sequence for columned serial words pages with mutually exclusive pattern and characters |
EP1701292A3 (en) * | 2005-03-08 | 2009-09-16 | Ricoh Company, Ltd. | Document layout analysis with control of non-character area |
CN101794278A (en) * | 2009-09-21 | 2010-08-04 | 广东省标准化研究院 | Method and software for digitalizing full text of standard document |
CN105373790A (en) * | 2015-10-23 | 2016-03-02 | 北京汉王数字科技有限公司 | Layout analysis method and device |
CN105573974A (en) * | 2014-10-09 | 2016-05-11 | 北大方正集团有限公司 | Page layout method, device and system |
Non-Patent Citations (3)
Title |
---|
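Determining the reading order of text in layouts with mutually exclusive images and text; Jia Juan et al.; Journal of Chinese Information Processing; 2005-12-31; Vol. 19 (No. 5); full text |
Research on key technologies of an intelligent reading service robot system; Li Yan et al.; China Master's Theses Full-text Database, Information Science and Technology; 2009-04-15 (No. 04); full text |
Research and application of image-text segmentation methods in layout analysis; Liu Yanyan; China Master's Theses Full-text Database, Information Science and Technology; 2013-08-15 (No. 08); full text |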
Also Published As
Publication number | Publication date |
---|---|
CN106096592A (en) | 2016-11-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106096592B (en) | A kind of printed page analysis method of digital book | |
CN110516208B (en) | System and method for extracting PDF document form | |
Poco et al. | Extracting and retargeting color mappings from bitmap images of visualizations | |
US20200065601A1 (en) | Method and system for transforming handwritten text to digital ink | |
AU2006252025B2 (en) | Recognition of parameterised shapes from document images | |
US8000529B2 (en) | System and method for creating an editable template from a document image | |
AU2006252019B2 (en) | Method and Apparatus for Dynamic Connector Analysis | |
CN109325398A (en) | A kind of face character analysis method based on transfer learning | |
CN104899586B (en) | Method and device is identified to the word content for including in image | |
CN110443239A (en) | The recognition methods of character image and its device | |
CN104240256A (en) | Image salient detecting method based on layering sparse modeling | |
CN104573685A (en) | Natural scene text detecting method based on extraction of linear structures | |
JP2021193610A (en) | Information processing method, information processing device, electronic apparatus and storage medium | |
US7864985B1 (en) | Automatic operator-induced artifact detection in document images | |
CN112949570B (en) | Finger vein identification method based on residual attention mechanism | |
CN108021837A (en) | A kind of bar code detection method, bar code detecting device and electronic equipment | |
CN111553351A (en) | Semantic segmentation based text detection method for arbitrary scene shape | |
Zhou et al. | Identifying designs from incomplete, fragmented cultural heritage objects by curve-pattern matching | |
Oka et al. | Vectorization of contour lines from scanned topographic maps | |
WO2009067022A1 (en) | A method for resolving contradicting output data from an optical character recognition (ocr) system, wherein the output data comprises more than one recognition alternative for an image of a character | |
Hristov et al. | A software system for classification of archaeological artefacts represented by 2D plans | |
CN115620322B (en) | Method for identifying table structure of whole-line table based on key point detection | |
CN109325487B (en) | Full-category license plate recognition method based on target detection | |
CN116259062A (en) | CNN handwriting identification method based on multichannel and attention mechanism | |
Clément et al. | Fuzzy directional enlacement landscapes for the evaluation of complex spatial relations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||