CN102262618B - Method and device for identifying page information - Google Patents

Method and device for identifying page information

Info

Publication number
CN102262618B
Authority
CN
China
Prior art keywords
block
caption
text block
image
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201010193898.0A
Other languages
Chinese (zh)
Other versions
CN102262618A (en)
Inventor
高良才
汤帜
房婧
仇睿恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Peking University
Peking University Founder Research and Development Center
Original Assignee
BEIDA FANGZHENG TECHN INST Co Ltd BEIJING
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIDA FANGZHENG TECHN INST Co Ltd BEIJING, Peking University, Peking University Founder Group Co Ltd
Priority to CN201010193898.0A
Publication of CN102262618A
Application granted
Publication of CN102262618B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Character Input (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a method for identifying page information. The method comprises the following steps of: reading a page to be identified, separating a character text object from an image object, combining text blocks, and preserving the image object as an image block; identifying an image annotation text block from the combined text blocks; optimally matching the image block with the image annotation text block by using an optimization method to obtain the image block and the image annotation text block, which are associated with each other; removing the image annotation text block from the text blocks of the page, and determining a reading sequence of the rest of the text blocks and the image block; and putting the image annotation text block behind the associated image block in the reading sequence. Correspondingly, the invention provides a device for identifying the page information. The invention has the advantages that: all images and annotations on the page are taken into comprehensive consideration, and the global optimal matching of the images and the annotations is obtained by an optimal matching method; and an optimal association relation can be found globally without limitation of the number of the images and annotations and the type of spaces between the images and annotations, so the conventional effect of identifying the reading sequence of the page is improved.

Description

A method and device for identifying layout information
Technical field
The present invention relates to the field of digital document processing, and in particular to the identification of layout information in digital documents, including identifying captions, identifying the association between images and captions, and using the identified image-caption associations to improve reading order recognition.
Background
In recent years, digital document structure extraction has become a research hotspot in the field of digital document analysis and understanding. It covers two aspects: layout structure extraction and logical structure extraction. The main purpose of layout structure extraction is to divide the document layout into blocks, with the layout relationship between blocks generally represented as a tree; research in this direction is relatively mature. Existing logical structure extraction techniques are mainly limited to assigning logical meanings (such as chapter, title, paragraph, author and affiliation, footnote, figure or table, and page number) to the blocks obtained by layout analysis, thereby obtaining logical blocks.
However, relations between logical blocks, for example the association between images and captions and the recognition of the layout reading order, have received relatively little study. Yet these relations matter greatly for correctly identifying layout information. For example, identifying the association between images and captions not only improves reading order recognition but is also significant for research such as image retrieval.
Current research on associating images with captions mainly adopts the nearest-distance principle: relying on the fact that a caption is usually located directly above or below its image and centered, it selects the caption nearest to an image as that image's caption. See, for example, "Logical Structure Analysis of Book Document Images Using Contents Information", Proceedings of International Conference on Document Analysis and Recognition, 1997. The drawback of this approach is that when a page contains multiple images, and particularly as digital document layouts diversify and the spatial arrangement of images and captions becomes increasingly complex, choosing captions by the nearest-distance principle easily produces mismatches. In other words, relying only on the distance and arrangement of a single image and caption makes it difficult to correctly determine the associations among multiple images and captions in a complex layout.
Summary of the invention
To solve the above problems, the present invention provides a method and device for identifying layout information, including identifying captions, identifying the association between images and captions, and using the identified associations to improve reading order recognition. With this method, the logical element "caption" and the image-caption associations can be correctly identified in complex layouts, and the identified associations can be used to improve reading order recognition in complex layouts.
To achieve the above object, the present invention provides a method of identifying captions, comprising the following steps: reading the page to be identified, separating the character text objects and image objects in the page, merging the character text objects into text blocks, and keeping the image objects as image blocks; and identifying caption text blocks from the merged text blocks. The character text objects and image objects are separated according to a document layout structure analysis method and/or according to the data object types in the digital document. Caption text blocks are identified according to at least one of the following: the font attributes of the text block, the distance between the text block and an image block, the number of characters in the text block, and whether the text block matches the expression form of a caption.
The present invention provides a method of identifying the association between images and captions, comprising the following steps: identifying caption text blocks and image blocks using the above caption identification method; and optimally matching the image blocks with the caption text blocks using an optimization method, thereby obtaining associated image blocks and caption text blocks. Preferably, the optimization method minimizes the sum of distances between the matched image blocks and caption text blocks; more preferably, a bipartite graph optimal matching method is used to match the image blocks and caption text blocks.
The present invention provides a method of improving layout reading order recognition, comprising the following steps: identifying the matching between image blocks and caption text blocks using the above association identification method; removing the caption text blocks from the text blocks of the page and identifying the reading order of the remaining text blocks and the image blocks; and putting each caption text block back into the reading order immediately after its matched image block.
To implement the above methods, the present invention provides a layout information identification device comprising a reading unit, a caption identification unit, a matching unit, a reading order improvement unit, and an output unit. According to actual needs, the output unit can separately output the identified reading order, the associations between image blocks and caption text blocks, the identified caption text blocks, and the text blocks and image blocks arranged in the identified reading order. The operations of these units are the same as the corresponding steps of the above methods.
The present invention considers all images and captions on the page together and obtains a globally optimal matching of images to captions through an optimal matching method. It is not limited by the number of images and captions or by the spatial arrangement between them, and can find the optimal association from a global perspective. At the same time, the globally optimal image-caption matching can considerably improve existing layout reading order recognition.
Brief description of the drawings
Fig. 1 is a schematic diagram of the weighted bipartite graph constructed according to the present invention;
Fig. 2 is a schematic block diagram of the layout information identification device according to the present invention;
Fig. 3 is a schematic page in the first embodiment;
Fig. 4 is a flowchart of the identification method in the first embodiment;
Fig. 5 is a flowchart of the caption identification method in the first embodiment;
Fig. 6a and Fig. 6b are schematic diagrams of, respectively, the bipartite graph constructed in the first embodiment and the optimal matching result computed by the KM algorithm;
Fig. 7 is a flowchart of the KM algorithm in the first embodiment;
Fig. 8 shows the result of ordering the page of Fig. 3 for reading using an existing method based on XY-tree layout segmentation;
Fig. 9 shows the result of ordering the page of Fig. 3 for reading using the method of the present invention in the first embodiment;
Fig. 10 is a schematic page in the second embodiment.
Detailed description of the embodiments
The present invention is described below with reference to the drawings and embodiments.
The main subjects of the present invention are the identification of captions and the identification of the association between images and captions, with the aim of using this layout information to improve reading order recognition and other related applications such as logical structure extraction. The present invention mainly applies to digital documents that satisfy the following conditions: the document can be read page by page, and the character text objects and image objects of each page can be obtained together with their attributes such as font and position coordinates. Examples include ordinary PDF documents and digital documents in the CEBX format produced by Founder.
In the present invention, the caption identification method comprises the following steps:
(1) Read the page to be identified, separate the character text objects and image objects in the page, merge the character text objects into text blocks, and keep the image objects as image blocks. The character text objects and image objects can be separated according to a document layout structure analysis method and/or according to the data object types in the digital document.
(2) From the merged text blocks, identify the text blocks of caption type, that is, the caption text blocks. For example, caption text blocks can be identified according to at least one of the following: the font size of the main font in the text block, the distance between the text block and an image block, the number of characters in the text block, and whether the text block matches the expression form of a caption.
After the caption text blocks have been identified, the image blocks and caption text blocks are optimally matched using an optimization method, thereby obtaining associated image blocks and caption text blocks.
Specifically, in one embodiment, because an associated image block and caption text block are usually close (or closest) to each other, a caption that is sufficiently close (or closest) to an image can be taken as its associated caption, so that every image finds an associated caption (or every caption finds an associated image). In this case, the optimization method can be used to minimize the sum of distances between the matched image blocks and caption text blocks.
Here, a bipartite graph optimal matching method can be used to minimize the sum of distances between the matched image blocks and caption text blocks, as follows:
(1) Construct a weighted bipartite graph G = {X, Y, E}
As shown in Fig. 1, the set of image blocks and the set of caption text blocks serve as the two vertex subsets X and Y of the bipartite graph, written X = {X_1, X_2, ..., X_i, ..., X_n} and Y = {Y_1, Y_2, ..., Y_j, ..., Y_m}, where n is the number of image blocks on the page, i is the index of an image block, m is the number of caption text blocks on the page, and j is the index of a caption text block. E = {e_ij} is the set of edges connecting the vertex sets X and Y, where e_ij is the edge between image block X_i and caption text block Y_j. Its weight ω_ij is the Euclidean distance between the center of the bounding rectangle of image block X_i and the center of the bounding rectangle of caption text block Y_j.
(2) Use a bipartite graph optimal matching algorithm to obtain the optimal matching of image blocks and caption text blocks
In a concrete implementation, the weights ω_ij of the edges e_ij in the bipartite graph of Fig. 1 can be negated, and the KM (Kuhn-Munkres) maximum weight matching algorithm can be used to compute an optimal perfect matching. The images and caption text blocks in the resulting minimum weight matching are taken as the optimally matched image blocks and caption text blocks.
When the image blocks and caption text blocks on the page do not correspond one to one, the smaller subset is padded with virtual nodes so that the two subsets have the same size, and a large number is assigned as the weight of the virtual edges.
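The padding rule can be illustrated with a small sketch. It assumes a distance matrix with one row per image block and one column per caption text block, pads it to a square with a large weight (9999 here, a value chosen for illustration), and finds the minimum-sum matching by brute force over permutations, which is only practical for the handful of blocks on a typical page. The names `pad_square` and `min_sum_matching` are hypothetical and do not come from the patent.

```python
from itertools import permutations

BIG = 9999.0  # assumed "large number" used as the weight of virtual edges


def pad_square(weights, big=BIG):
    """Pad an n x m distance matrix with virtual rows/columns to t x t, t = max(n, m)."""
    n = len(weights)
    m = len(weights[0]) if weights else 0
    t = max(n, m)
    padded = [list(row) + [big] * (t - m) for row in weights]
    padded += [[big] * t for _ in range(t - n)]
    return padded


def min_sum_matching(weights, big=BIG):
    """Return {image index: caption index} minimizing the total distance.

    Brute force over all permutations; pairs that involve a virtual node
    are dropped, so the corresponding real block stays unmatched.
    """
    n = len(weights)
    m = len(weights[0]) if weights else 0
    square = pad_square(weights, big)
    t = len(square)
    best = min(
        permutations(range(t)),
        key=lambda perm: sum(square[i][perm[i]] for i in range(t)),
    )
    return {i: best[i] for i in range(n) if best[i] < m}


# Three images, two captions: images 1 and 2 are both nearest to caption 1,
# but the global optimum leaves image 1 without a caption.
distances = [[10, 50], [12, 11], [60, 9]]
print(min_sum_matching(distances))
```

A nearest-caption rule would assign caption 1 to both image 1 and image 2 in this example, whereas the global matching assigns each caption at most once.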
After the image-caption associations have been correctly identified, they are used to improve reading order recognition, as follows:
(1) Remove the caption text blocks from the text blocks of the page; an existing reading order method can then be used to determine the reading order of the remaining text blocks and the image blocks;
(2) Put each caption text block back into the reading order immediately after its matched image block, thereby obtaining the complete reading order.
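The two steps above can be sketched as follows. This is a minimal illustration, assuming blocks are identified by hashable ids, that `caption_of` maps each matched image block to its caption text block, and that `base_order` is a stand-in for whatever existing reading order method is used (for example an XY-cut implementation). All of these names are hypothetical.

```python
def reading_order_with_captions(blocks, caption_of, base_order):
    """Order blocks for reading while keeping each caption right after its image.

    blocks      -- all block ids on the page (text, image, and caption blocks)
    caption_of  -- dict mapping image block id -> matched caption block id
    base_order  -- callable that orders a list of block ids (existing method)
    """
    captions = set(caption_of.values())
    # Step (1): remove caption blocks and order the remaining blocks.
    remaining = [b for b in blocks if b not in captions]
    ordered = base_order(remaining)
    # Step (2): put each caption back immediately after its matched image.
    result = []
    for block in ordered:
        result.append(block)
        if block in caption_of:
            result.append(caption_of[block])
    return result


page = ["T1", "I1", "C1", "I2", "C2", "T2"]
matches = {"I1": "C1", "I2": "C2"}
# The identity ordering below is only a placeholder for a real ordering method.
print(reading_order_with_captions(page, matches, base_order=list))
```

Because captions are excluded before `base_order` runs, the ordering method never sees them and so cannot place another block between an image and its caption.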
This approach ensures that, during ordering, a logically closely related image and caption are not separated by other document objects. It also avoids the problem that merging images and captions too early easily causes layout blocks to overlap, which interferes with the ordering algorithm. The accuracy of reading order recognition is thereby considerably improved.
It should be noted that the captions identified by the method of the present invention can be used not only to identify image-caption associations but also in any other application that uses captions, such as image retrieval. The image-caption associations identified by the method can be used not only to improve reading order recognition but also in any other application that needs them, such as image retrieval. The improved reading order can be used in any application that needs a reading order, such as layout content reflow and information extraction. Therefore, according to practical needs, the identified reading order, the associations between image blocks and caption text blocks, the identified caption text blocks, and the text blocks and image blocks arranged in the identified reading order can each be output for use by any application that needs this information.
To implement the above methods, the present invention provides a layout information identification device. With reference to Fig. 2, the device can comprise a reading unit 1, a caption identification unit 2, a matching unit 3, a reading order improvement unit 4, and an output unit 5. The reading unit 1 reads the page to be identified, separates the character text objects and image objects in the page, merges the character text objects into text blocks, and keeps the image objects as image blocks. The caption identification unit 2 identifies caption text blocks from the merged text blocks. The matching unit 3 optimally matches the image blocks with the caption text blocks using an optimization method, thereby obtaining associated image blocks and caption text blocks. The reading order improvement unit 4 removes the caption text blocks from the text blocks of the page, determines the reading order of the remaining text blocks and the image blocks, and then puts each caption text block back into the reading order immediately after its matched image block. According to practical needs, the output unit 5 can separately output the identified reading order, the associations between image blocks and caption text blocks, the identified caption text blocks, and the text blocks and image blocks arranged in the identified reading order, for use by any application that needs this information. The operations of these units are the same as the corresponding steps of the above methods, so their detailed description is omitted.
The implementation of the present invention is described in detail below through specific embodiments.
(First embodiment)
This embodiment uses the e-book "21 century Basis of Computer Engineering study course" (Beijing University of Posts and Telecommunications Press), which has 317 pages. The page to be identified is shown in Fig. 3, and the association between images and captions is identified based on bipartite graph optimal matching.
With reference to Fig. 4, the identification method in this embodiment comprises the following steps:
Step S1: read the page and separate text objects and image objects
In this embodiment, the layout segmentation is shown by the rectangles in Fig. 3; there are four image blocks and five text blocks.
Step S2: identify caption text blocks
In this embodiment, confidence values are used to determine whether the current text block is a caption text block. With reference to Fig. 5, this step proceeds as follows:
Step S21: compute the font size confidence Q1
Q1 = font size of the main font of the current text block / font size of the main font of all character text on the page
The main font is computed, as in the prior art, as the font that occurs most frequently within a given range. In this embodiment, the font size of the main font of the caption text blocks is 9 and the font size of the main font of all characters on the page is 10.56, so Q1 = 9/10.56 = 0.85.
Step S22: compute the image distance confidence Q2
Q2 = whether the text block is close to an image block
In this embodiment, each of the four caption text blocks is close to an image block, so Q2 is 1.
Step S23: compute the character count confidence Q3
Q3 = number of characters in the current text block / average number of characters per text block on the page
In this embodiment, the four caption text blocks contain 10, 12, 11, and 11 characters respectively, and the average number of characters per text block on the page is 25, so Q3 is 0.4, 0.48, 0.44, and 0.44 respectively.
Step S24: compute the expression form confidence Q4
Q4 = whether the text block matches the regular expression for captions
In this embodiment, the regular expression is defined as ^(图[[:space:]]*[[:digit:]]+([.][[:digit:]]+|(-[[:digit:]]+))), which matches common forms such as "图1-1" and "图1.1" ("Fig. 1-1", "Fig. 1.1"). All four caption text blocks match this form, so Q4 is 1. Of course, this regular expression is only an exemplary way of expressing whether the current text block matches the expression form of a caption; any way of expressing this falls within the scope of protection of the present invention.
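As a rough illustration of the Q4 check, the sketch below translates the POSIX-style pattern into Python's `re` syntax, which has no `[[:space:]]` or `[[:digit:]]` classes, so `\s` and `\d` are used instead. It assumes that the caption prefix is the Chinese character 图 ("figure") and that the second alternative is a hyphenated number, as the example forms "Fig. 1-1" and "Fig. 1.1" suggest. The function name `expression_confidence` is hypothetical.

```python
import re

# Assumed Python equivalent of the caption pattern:
# 图, optional whitespace, digits, then ".digits" or "-digits".
CAPTION_RE = re.compile(r"^(图\s*\d+([.]\d+|(-\d+)))")


def expression_confidence(text: str) -> int:
    """Return Q4: 1 if the block text starts like a figure caption, else 0."""
    return 1 if CAPTION_RE.match(text) else 0


print(expression_confidence("图1-1 系统结构"))   # caption-like prefix
print(expression_confidence("如图1-1所示"))      # body text mentioning a figure
```

Anchoring the pattern at the start of the block is what separates a caption from body text that merely refers to a figure.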
Step S25: compute the overall confidence R as a weighted average
R = (u×Q1 + v×Q2 + w×Q3 + x×Q4) / (u + v + w + x)
Here u, v, w, and x are weighting coefficients and are natural numbers. In this embodiment u = 3, v = 2, w = 1, x = 1, and the computed overall confidences of the four caption text blocks are 0.85, 0.86, 0.85, and 0.85 respectively.
Step S26: judge whether the overall confidence R reaches the threshold r. If R ≥ r, the current text block is judged in step S27 to be a caption text block; if R < r, it is judged in step S28 not to be a caption text block. In this embodiment the threshold r is 0.7, that is, a text block whose overall confidence is at least 0.7 is judged to be a caption text block. All four caption text blocks in Fig. 3 are therefore correctly identified.
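Steps S21 to S26 can be sketched as a short function. This is an illustrative sketch, assuming the four inputs (font sizes, proximity flag, character counts, pattern flag) have already been extracted from the page. The default weights and threshold are the ones stated for this embodiment, and the function names are hypothetical.

```python
def overall_confidence(q1, q2, q3, q4, u=3, v=2, w=1, x=1):
    """Weighted average R of the four confidences (step S25)."""
    return (u * q1 + v * q2 + w * q3 + x * q4) / (u + v + w + x)


def is_caption(font_size, page_font_size, near_image, n_chars, avg_chars,
               matches_pattern, threshold=0.7):
    """Return (decision, R) for one text block (steps S21 to S26)."""
    q1 = font_size / page_font_size        # S21: font size confidence
    q2 = 1.0 if near_image else 0.0        # S22: image distance confidence
    q3 = n_chars / avg_chars               # S23: character count confidence
    q4 = 1.0 if matches_pattern else 0.0   # S24: expression form confidence
    r = overall_confidence(q1, q2, q3, q4)
    return r >= threshold, r


# First caption block of the embodiment: font size 9 against 10.56,
# near an image, 10 characters against an average of 25, pattern matched.
decision, r = is_caption(9, 10.56, True, 10, 25, True)
print(decision, round(r, 2))
```

With these weights, a body text block of typical size that is not near an image and does not match the pattern scores (3 + 0 + 1 + 0) / 7 ≈ 0.57, which falls below the 0.7 threshold.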
Step S3: construct the bipartite graph of image blocks and caption text blocks and compute the weights
In this embodiment, the weighted bipartite graph G = {X, Y, E} shown in Fig. 6a is constructed. The set of image blocks and the set of caption text blocks in Fig. 3 serve as the two subsets X and Y of the bipartite graph, that is, X = {X_1, X_2, X_3, X_4} and Y = {Y_1, Y_2, Y_3, Y_4}, and the Euclidean distance between the center of an image block's bounding rectangle and the center of a caption text block's bounding rectangle is the weight ω_ij of edge e_ij in the edge set E. This embodiment therefore needs to compute the sixteen weights ω_11, ω_12, ω_13, ω_14, ω_21, ω_22, ω_23, ω_24, ω_31, ω_32, ω_33, ω_34, ω_41, ω_42, ω_43, and ω_44. Because the images and captions correspond one to one, no nodes need to be padded.
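The weight computation can be sketched as below. The sketch assumes each block's bounding rectangle is given as a `(left, top, right, bottom)` tuple in page coordinates; this representation and the function names are assumptions for illustration and are not specified in the patent.

```python
import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed (left, top, right, bottom)


def center(box: Box) -> Tuple[float, float]:
    """Center point of a bounding rectangle."""
    left, top, right, bottom = box
    return ((left + right) / 2, (top + bottom) / 2)


def weight_matrix(image_boxes: List[Box], caption_boxes: List[Box]) -> List[List[float]]:
    """weights[i][j] = Euclidean distance between the centers of image i and caption j."""
    return [
        [math.dist(center(img), center(cap)) for cap in caption_boxes]
        for img in image_boxes
    ]


images = [(0, 0, 10, 10)]
captions = [(0, 12, 10, 16), (30, 40, 40, 50)]
print(weight_matrix(images, captions))
```

The resulting matrix has one row per image block and one column per caption text block, which is the input form expected by an assignment solver.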
Step S4: use the KM algorithm to find the associations between image blocks and caption text blocks
In this embodiment, the optimization target is to make the sum of the edge weights of all matched pairs as small as possible, so the minimum weight matching of the bipartite graph must be computed. In the actual implementation, the weights of all edges are negated and the KM maximum weight matching algorithm is applied; the resulting maximum weight matching is the minimum weight matching of images and captions.
With reference to Fig. 7, the KM algorithm is implemented as follows:
a) Give the initial labels
$$l(x_i) = \max_j \omega_{ij}, \quad l(y_j) = 0, \quad i, j = 1, 2, \ldots, t, \quad t = \max(n, m)$$
where, in this embodiment, n and m are both 4;
b) Obtain the edge set E_l = {(x_i, y_j) | l(x_i) + l(y_j) = ω_ij}, the graph G_l = (X, Y, E_l), and a matching M in G_l;
c) Judge whether M saturates all nodes of X. If it does, go to step d; otherwise go to step e;
d) M is an optimal matching of G; end the computation;
e) Find an M-unsaturated node x_0 in X and let A ← {x_0}, B ← ∅, where A and B are two sets;
f) Judge whether N_{G_l}(A) equals B. If N_{G_l}(A) = B, go to step k; otherwise go to step g. Here N_{G_l}(A) is the set of nodes in G_l adjacent to the nodes in A;
g) Find a node y ∈ N_{G_l}(A) − B;
h) Judge whether y is M-saturated. If it is, go to step i; otherwise go to step j;
i) Find the node z matched to y, let A ← A ∪ {z} and B ← B ∪ {y}, and go to step f;
j) There is an augmenting path P from x_0 to y; let M ← M ⊕ E(P) and go to step c;
k) Compute the value a as follows:
$$a = \min_{x_i \in A,\; y_j \notin N_{G_l}(A)} \{\, l(x_i) + l(y_j) - \omega_{ij} \,\}$$
Revise the labels, which in the standard KM algorithm is done as:
$$l'(v) = \begin{cases} l(v) - a, & v \in A \\ l(v) + a, & v \in B \\ l(v), & \text{otherwise} \end{cases}$$
l) Compute E_{l'} and G_{l'} from l';
m) Let l ← l', G_l ← G_{l'}, and go to step g.
The above KM algorithm yields the image-caption associations, that is, for each image block X_i it finds the matching caption text block Y_j. In this embodiment, as shown in Fig. 6a, the four images and four captions form a complete bipartite graph, and the matching result is shown by the lines in Fig. 6b. With the existing nearest-distance method, which relies only on the distance and arrangement of a single image and caption, associations are easily confused: for example, image block 3 and image block 4 are both close to caption text block 3, so the image-caption associations cannot be judged correctly. The present invention, by contrast, finds the globally optimal association, associating image block 3 with caption text block 3 and image block 4 with caption text block 4.
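For readers who want to experiment, the sketch below solves the same assignment problem with the widely used potentials formulation of the Hungarian (Kuhn-Munkres) method. It is not a step-by-step transcription of steps a) to m): it minimizes the total cost directly, so the distances do not need to be negated. It assumes a cost matrix with no more rows than columns (pad with virtual nodes first if needed), and the function name `min_cost_assignment` is hypothetical.

```python
def min_cost_assignment(cost):
    """Return a list where entry i is the column assigned to row i (minimum total cost).

    Hungarian method with row/column potentials, O(n^2 * m); requires
    len(cost) <= len(cost[0]).
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-based)
    v = [0.0] * (m + 1)   # column potentials (1-based)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # previous column on the alternating path to column j
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]  # reduced cost
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:       # reached a free column: augmenting path found
                break
        while j0 != 0:           # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


distances = [[4, 1, 3], [2, 0, 5], [3, 2, 2]]
print(min_cost_assignment(distances))
```

In this small example a nearest-caption rule would give column 1 to both row 0 and row 1, while the assignment solver resolves the conflict by minimizing the total distance over all pairs.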
Step S5: output all matched image blocks and caption text blocks on the page and use them to improve layout reading order recognition
This is implemented as follows:
a) While retaining the image-caption associations, remove the caption text blocks from the text blocks of the page;
b) Apply an existing method to identify the reading order of all the other layout blocks remaining after step a;
c) After the reading order has been identified, put each caption text block back into the reading order immediately after its matched image block to obtain the complete reading order.
Fig. 8 shows the result of ordering the page of Fig. 3 with an existing reading order recognition method based on XY-tree layout segmentation (see, for example, "Optimized XY-cut for Determining a Page Reading Order", Proceedings of the Eighth International Conference on Document Analysis and Recognition, 2005), and Fig. 9 shows the result of ordering the same page with the method of the present invention; the polylines indicate the reading order. As the two figures show, in Fig. 8 image block 1 is separated from its caption text block 1 and image block 2 from its caption text block 2, so the ordering of this part is not reasonable. In Fig. 9 the logically closely related image block 1 and caption text block 1, and image block 2 and caption text block 2, are not separated but are read in the order "image block 1 → caption text block 1 → image block 2 → caption text block 2". The accuracy of reading order recognition is therefore clearly improved.
(Second embodiment)
This embodiment takes page 165 of the e-book "21 century Basis of Computer Engineering study course" as an example to illustrate how the present invention handles the case where the number of image blocks differs from the number of caption text blocks. In general, when the two numbers differ, there are more image blocks than caption text blocks.
As shown in Fig. 10, on this page the last text block contains an inline image block 5, and image block 5 and image block 3 are both very close to caption text block 3. Relying only on the distance and arrangement of a single image and caption would easily confuse the associations. In this embodiment, the caption text block set Y in the bipartite graph is padded with a virtual node, and a large number (for example, 9999) is assigned as the weight of the virtual edges; the rest of the implementation is the same as in the first embodiment. In this way the matching is correctly identified: image blocks 1 to 4 are matched with caption text blocks 1 to 4 respectively, and image block 5 is left alone with no matching caption.
Likewise, the matching result is applied to improve the layout reading order; the resulting order, shown by the polyline in Fig. 10, agrees with human reading habits.
The present invention has been described in detail above with reference to the drawings and embodiments. It should be understood, however, that the present invention is not limited to the specific embodiments disclosed above, and any modifications and variations that those skilled in the art can readily conceive on this basis fall within the scope of protection of the present invention.

Claims (11)

1. A method for improving the recognition of layout reading order, comprising the following steps:
reading a page layout to be recognized, separating the text objects and image objects in the layout, merging the text objects into text blocks, and treating the image objects as image blocks;
identifying caption text blocks among the merged text blocks;
performing optimal matching between the image blocks and the caption text blocks using an optimization method, such that the sum of the distances between the matched image blocks and caption text blocks is minimized, thereby obtaining associated image blocks and caption text blocks;
removing the caption text blocks from the text blocks of the layout, and determining the reading order of the remaining text blocks and the image blocks;
reinserting each caption text block into the reading order after its associated image block;
wherein the caption text blocks are identified according to at least one of the following: the font attributes of the text block, the distance between the text block and an image block, the number of characters in the text block, and whether the text block conforms to the form of expression of a caption;
and wherein the caption text blocks are identified by the following steps:
calculating a font-size confidence Q1, where Q1 is the ratio of the font size of the dominant font of the current text block to the font size of the dominant font of all text in the layout;
calculating an image-distance confidence Q2, where Q2 indicates whether the text block is close to an image block;
calculating a character-count confidence Q3, where Q3 is the ratio of the number of characters in the current text block to the average number of characters per text block in the layout;
calculating a form-of-expression confidence Q4, where Q4 indicates whether the text block matches a regular expression for captions;
calculating an overall confidence R as a weighted average, R = (u × Q1 + v × Q2 + w × Q3 + x × Q4)/(u + v + w + x), where the weighting coefficients u, v, w and x are natural numbers;
judging whether the overall confidence R reaches a threshold r: if R ≥ r, the current text block is judged to be a caption text block; if R < r, the current text block is judged not to be a caption text block.
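The weighted combination and threshold test at the end of claim 1 can be sketched as follows. This is only an illustration under stated assumptions: the function names `overall_confidence` and `is_caption` are hypothetical, and the example confidences, weights and threshold are made-up values rather than values given in the patent.

```python
from typing import Sequence


def overall_confidence(q: Sequence[float], weights: Sequence[int]) -> float:
    """Weighted average R = (u*Q1 + v*Q2 + w*Q3 + x*Q4) / (u + v + w + x)."""
    if len(q) != len(weights):
        raise ValueError("one weight is needed per confidence value")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(wi * qi for wi, qi in zip(weights, q)) / total


def is_caption(q: Sequence[float], weights: Sequence[int], r: float) -> bool:
    """A block is judged to be a caption when R >= r, and not otherwise."""
    return overall_confidence(q, weights) >= r


# Illustrative values only: (Q1, Q2, Q3, Q4) with weights (u, v, w, x).
q_values = (0.8, 1.0, 0.5, 1.0)
weights = (1, 2, 1, 2)
print(overall_confidence(q_values, weights))
print(is_caption(q_values, weights, r=0.7))
```

Dividing by the sum of the weights keeps R on the same scale as the individual confidences, so a single threshold r can be compared against it whatever weights are chosen.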
2. The method according to claim 1, characterized in that the text objects and image objects are separated according to a document layout structure analysis method and/or according to the data object types in the digital document.
3. The method according to claim 1, characterized in that the optimization method is a bipartite graph optimal matching method comprising the following steps:
constructing a weighted bipartite graph, with the image blocks and the caption text blocks as the two vertex subsets of the bipartite graph, and with the Euclidean distance between the center point of the bounding rectangle of an image block and the center point of the bounding rectangle of a caption text block as the weight of the corresponding edge;
obtaining the optimally matched image blocks and caption text blocks using the bipartite graph optimal matching method.
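The edge weights of the bipartite graph in claim 3 can be sketched as a distance matrix. The sketch assumes that each bounding rectangle is given as a `(left, top, right, bottom)` tuple; that representation and the names `center` and `distance_matrix` are illustrative assumptions, not part of the patent.

```python
import math
from typing import List, Tuple

# Assumed representation of a bounding rectangle: (left, top, right, bottom).
Box = Tuple[float, float, float, float]


def center(box: Box) -> Tuple[float, float]:
    """Center point of a bounding rectangle."""
    left, top, right, bottom = box
    return ((left + right) / 2.0, (top + bottom) / 2.0)


def distance_matrix(image_boxes: List[Box], caption_boxes: List[Box]) -> List[List[float]]:
    """weights[i][j] = Euclidean distance between the centers of image i and caption j."""
    weights = []
    for image_box in image_boxes:
        ix, iy = center(image_box)
        row = []
        for caption_box in caption_boxes:
            cx, cy = center(caption_box)
            row.append(math.hypot(ix - cx, iy - cy))
        weights.append(row)
    return weights


images = [(0, 0, 10, 10)]
captions = [(0, 12, 10, 16), (20, 0, 30, 10)]
print(distance_matrix(images, captions))  # [[9.0, 20.0]]
```

Each row corresponds to one image block and each column to one caption text block, which is the form a matching algorithm typically expects as input.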
4. The method according to claim 3, characterized in that the weights of the edges in the bipartite graph are negated and the Kuhn-Munkres maximum-weight matching algorithm is used to compute an optimal perfect matching, so that the image blocks and caption text blocks of the minimum-weight matching are obtained as the optimally matched image blocks and caption text blocks.
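Negating the weights and maximizing is equivalent to minimizing the original distances, since the matching that maximizes the sum of -d also minimizes the sum of d. The sketch below therefore minimizes directly, using a standard O(n²m) potentials-based formulation of the Kuhn-Munkres (Hungarian) algorithm. It is a generic textbook version, not the patent's own implementation; the function name `min_distance_matching` is hypothetical, and the cost matrix is assumed to have no more rows than columns.

```python
from typing import List, Tuple


def min_distance_matching(cost: List[List[float]]) -> List[Tuple[int, int]]:
    """Return (row, column) pairs of a minimum-total-cost assignment.

    Requires len(cost) <= len(cost[0]); every row is matched to a distinct column.
    """
    n, m = len(cost), len(cost[0])
    if n > m:
        raise ValueError("expected no more rows than columns")
    inf = float("inf")
    u = [0.0] * (n + 1)   # row potentials (1-indexed)
    v = [0.0] * (m + 1)   # column potentials (1-indexed)
    p = [0] * (m + 1)     # p[j] = row currently matched to column j (0 = free)
    way = [0] * (m + 1)   # previous column on the alternating path to column j
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            # Shift the potentials so that one more edge becomes tight.
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: augmenting path found
                break
        while j0 != 0:       # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


distances = [[9, 2, 7], [6, 4, 3], [5, 8, 1]]
pairs = min_distance_matching(distances)
print(pairs, sum(distances[i][j] for i, j in pairs))
```

A maximum-weight implementation of Kuhn-Munkres, as named in the claim, would be called on the negated matrix and would return the same pairs.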
5. The method according to claim 3, characterized in that, when the image blocks and caption text blocks in the layout are not in one-to-one correspondence in number, the smaller subset is padded with dummy nodes, and a large number is assigned as the weight of each virtual edge.
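The padding in claim 5 can be sketched as follows. The value of `BIG` and the names `pad_square` and `match_with_padding` are illustrative assumptions. To keep the example short and self-contained, the matching itself is found by brute force over permutations, which stands in for the Kuhn-Munkres algorithm and is only practical for very small inputs.

```python
from itertools import permutations
from typing import List, Tuple

BIG = 1e9  # assumed "large number" for virtual edges; must exceed any real distance


def pad_square(dist: List[List[float]], big: float = BIG) -> List[List[float]]:
    """Pad the smaller side with dummy nodes so the matrix becomes square."""
    rows = len(dist)
    cols = len(dist[0]) if rows else 0
    n = max(rows, cols)
    padded = [list(row) + [big] * (n - cols) for row in dist]   # dummy captions
    padded += [[big] * n for _ in range(n - rows)]              # dummy images
    return padded


def match_with_padding(dist: List[List[float]], big: float = BIG) -> List[Tuple[int, int]]:
    """Perfect matching on the padded matrix, with pairs involving dummy nodes dropped."""
    rows = len(dist)
    cols = len(dist[0]) if rows else 0
    square = pad_square(dist, big)
    n = len(square)
    # Brute-force stand-in for Kuhn-Munkres: try every assignment of columns to rows.
    best = min(permutations(range(n)),
               key=lambda perm: sum(square[i][perm[i]] for i in range(n)))
    return [(i, j) for i, j in enumerate(best) if i < rows and j < cols]


# Three image blocks but only two caption text blocks.
dist = [[1, 10], [10, 2], [5, 5]]
print(match_with_padding(dist))  # [(0, 0), (1, 1)]
```

Because every perfect matching on the padded matrix must use the same number of virtual edges, their large weight does not change which real pairs are chosen; it only marks the blocks that end up without a partner.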
6. The method according to claim 1, characterized in that at least one of the following is output: the reading order; the association between image blocks and caption text blocks; the identified caption text blocks; and the text blocks and image blocks arranged according to the identified reading order.
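The reordering steps of claim 1, whose result is among the outputs listed in claim 6, can be sketched as follows: captions are taken out, the remaining blocks are ordered, and each caption is put back right after its associated image. The top-to-bottom ordering function is a placeholder assumption for whatever reading-order algorithm is actually used, and the block identifiers, the `(left, top)` positions and the function names are hypothetical. The sketch also assumes that every caption has been matched to an image.

```python
from typing import Callable, Dict, List, Tuple

Position = Tuple[float, float]  # assumed (left, top) of a block


def top_to_bottom(block_ids: List[str], positions: Dict[str, Position]) -> List[str]:
    """Placeholder ordering: sort by top coordinate, then by left coordinate."""
    return sorted(block_ids, key=lambda b: (positions[b][1], positions[b][0]))


def improved_reading_order(
    positions: Dict[str, Position],
    caption_of: Dict[str, str],  # image block id -> associated caption block id
    base_order: Callable[[List[str], Dict[str, Position]], List[str]] = top_to_bottom,
) -> List[str]:
    captions = set(caption_of.values())
    # Step 1: remove caption blocks and order the remaining text and image blocks.
    remaining = [b for b in positions if b not in captions]
    ordered = base_order(remaining, positions)
    # Step 2: reinsert each caption directly after its associated image block.
    result = []
    for block in ordered:
        result.append(block)
        if block in caption_of:
            result.append(caption_of[block])
    return result


positions = {"T1": (0, 0), "IMG1": (0, 100), "T2": (300, 200), "CAP1": (0, 300)}
print(top_to_bottom(list(positions), positions))            # ['T1', 'IMG1', 'T2', 'CAP1']
print(improved_reading_order(positions, {"IMG1": "CAP1"}))  # ['T1', 'IMG1', 'CAP1', 'T2']
```

In the example the plain ordering separates the image from its caption, while the improved order keeps them adjacent.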
7. A method for identifying captions, comprising the following steps:
reading a page layout to be recognized, separating the text objects and image objects in the layout, merging the text objects into text blocks, and treating the image objects as image blocks;
identifying caption text blocks among the merged text blocks using at least one of the following: the font attributes of the text block, the distance between the text block and an image block, the number of characters in the text block, and the form of expression of the text block;
wherein the caption text blocks are identified by the following steps:
calculating a font-size confidence Q1, where Q1 is the ratio of the font size of the dominant font of the current text block to the font size of the dominant font of all text in the layout;
calculating an image-distance confidence Q2, where Q2 indicates whether the text block is close to an image block;
calculating a character-count confidence Q3, where Q3 is the ratio of the number of characters in the current text block to the average number of characters per text block in the layout;
calculating a form-of-expression confidence Q4, where Q4 indicates whether the text block matches a regular expression for captions;
calculating an overall confidence R as a weighted average, R = (u × Q1 + v × Q2 + w × Q3 + x × Q4)/(u + v + w + x), where the weighting coefficients u, v, w and x are natural numbers;
judging whether the overall confidence R reaches a threshold r: if R ≥ r, the current text block is judged to be a caption text block; if R < r, the current text block is judged not to be a caption text block.
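The four individual confidences Q1 to Q4 can be sketched as below. Q1 and Q3 are computed as the plain ratios stated in the claim. For Q2 and Q4 the claim only says what they indicate, so the sketch makes explicit assumptions: Q2 is 1 or 0 depending on a hypothetical distance threshold, and Q4 is 1 or 0 depending on a made-up caption pattern. The regular expression, the threshold and all function names are illustrative, not taken from the patent.

```python
import re

# Assumed caption pattern: "Figure 3", "Fig. 2", "图10" and similar prefixes.
CAPTION_PATTERN = re.compile(r"^\s*(Figure|Fig\.|图)\s*\d+", re.IGNORECASE)


def font_size_confidence(block_font_size: float, page_font_size: float) -> float:
    """Q1: dominant font size of the block divided by that of the whole layout."""
    return block_font_size / page_font_size


def distance_confidence(distance_to_nearest_image: float, near_threshold: float) -> float:
    """Q2: 1.0 if the block lies within the assumed threshold of an image block, else 0.0."""
    return 1.0 if distance_to_nearest_image <= near_threshold else 0.0


def char_count_confidence(block_chars: int, average_chars: float) -> float:
    """Q3: character count of the block divided by the layout-wide average per block."""
    return block_chars / average_chars


def pattern_confidence(text: str) -> float:
    """Q4: 1.0 if the text starts like a caption according to the assumed pattern."""
    return 1.0 if CAPTION_PATTERN.match(text) else 0.0


print(font_size_confidence(9, 12))             # 0.75
print(distance_confidence(8, 20))              # 1.0
print(char_count_confidence(30, 120))          # 0.25
print(pattern_confidence("Figure 3. Layout"))  # 1.0
```

The claim does not state here how the raw ratios Q1 and Q3 are scaled before they are weighted, so an implementation may need to map them onto a range comparable with Q2 and Q4.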
8. A method for identifying the association between images and captions, comprising the following steps:
identifying caption text blocks and image blocks using the method for identifying captions according to claim 7;
performing optimal matching between the image blocks and the caption text blocks using an optimization method, such that the sum of the distances between the matched image blocks and caption text blocks is minimized, thereby obtaining associated image blocks and caption text blocks.
9. The method according to claim 8, characterized in that
the optimization method is a bipartite graph optimal matching method comprising the following steps:
constructing a weighted bipartite graph, with the image blocks and the caption text blocks as the two vertex subsets of the bipartite graph, and with the Euclidean distance between the center point of the bounding rectangle of an image block and the center point of the bounding rectangle of a caption text block as the weight of the corresponding edge;
obtaining the optimal matching of image blocks and caption text blocks using a bipartite graph optimal matching algorithm.
10. A layout information recognition device, comprising:
a reading unit for reading a page layout to be recognized, separating the text objects and image objects in the layout, merging the text objects into text blocks, and treating the image objects as image blocks;
a caption recognition unit for identifying caption text blocks among the merged text blocks;
a matching unit for performing optimal matching between the image blocks and the caption text blocks using an optimization method, such that the sum of the distances between the matched image blocks and caption text blocks is minimized, thereby obtaining associated image blocks and caption text blocks;
a reading order improvement unit for removing the caption text blocks from the text blocks of the layout, determining the reading order of the remaining text blocks and the image blocks, and then reinserting each caption text block into the reading order after its matched image block;
wherein the caption recognition unit identifies caption text blocks using at least one of the following: the font attributes of the text block, the distance between the text block and an image block, the number of characters in the text block, and whether the text block conforms to the form of expression of a caption;
and wherein the caption recognition unit identifies caption text blocks by performing the following steps:
calculating a font-size confidence Q1, where Q1 is the ratio of the font size of the dominant font of the current text block to the font size of the dominant font of all text in the layout;
calculating an image-distance confidence Q2, where Q2 indicates whether the text block is close to an image block;
calculating a character-count confidence Q3, where Q3 is the ratio of the number of characters in the current text block to the average number of characters per text block in the layout;
calculating a form-of-expression confidence Q4, where Q4 indicates whether the text block matches a regular expression for captions;
calculating an overall confidence R as a weighted average, R = (u × Q1 + v × Q2 + w × Q3 + x × Q4)/(u + v + w + x), where the weighting coefficients u, v, w and x are natural numbers;
judging whether the overall confidence R reaches a threshold r: if R ≥ r, the current text block is judged to be a caption text block; if R < r, the current text block is judged not to be a caption text block.
11. The device according to claim 10, characterized in that it further comprises an output unit which outputs at least one of the following: the reading order; the association between image blocks and caption text blocks; the identified caption text blocks; and the text blocks and image blocks arranged according to the identified reading order.
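One way to picture how the units of claims 10 and 11 fit together is a small class whose fields are pluggable callables, one per unit. This is only a structural sketch: the class name `LayoutRecognizer`, the field names, the use of string identifiers for blocks and the stub functions in the example are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LayoutRecognizer:
    # Reading unit: page -> (text block ids, image block ids)
    read: Callable[[object], Tuple[List[str], List[str]]]
    # Caption recognition unit: (text blocks, image blocks) -> caption block ids
    find_captions: Callable[[List[str], List[str]], List[str]]
    # Matching unit: (image blocks, caption blocks) -> {image id: caption id}
    match: Callable[[List[str], List[str]], Dict[str, str]]
    # Ordering used by the reading order improvement unit
    order: Callable[[List[str]], List[str]]

    def run(self, page: object) -> dict:
        text_blocks, image_blocks = self.read(page)
        captions = self.find_captions(text_blocks, image_blocks)
        caption_of = self.match(image_blocks, captions)
        remaining = [b for b in text_blocks if b not in captions] + image_blocks
        reading_order: List[str] = []
        for block in self.order(remaining):
            reading_order.append(block)
            if block in caption_of:          # caption follows its matched image
                reading_order.append(caption_of[block])
        # Output unit: the items listed in claim 11.
        return {"reading_order": reading_order,
                "associations": caption_of,
                "captions": captions}


# Stub units, for illustration only.
recognizer = LayoutRecognizer(
    read=lambda page: (["T1", "CAP1", "T2"], ["IMG1"]),
    find_captions=lambda texts, images: ["CAP1"],
    match=lambda images, captions: {"IMG1": "CAP1"},
    order=lambda blocks: sorted(blocks),
)
print(recognizer.run(page=None)["reading_order"])  # ['IMG1', 'CAP1', 'T1', 'T2']
```

Keeping each unit behind its own callable means the caption test, the matching algorithm and the ordering method can be replaced independently.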
CN201010193898.0A 2010-05-28 2010-05-28 Method and device for identifying page information Expired - Fee Related CN102262618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010193898.0A CN102262618B (en) 2010-05-28 2010-05-28 Method and device for identifying page information

Publications (2)

Publication Number Publication Date
CN102262618A CN102262618A (en) 2011-11-30
CN102262618B true CN102262618B (en) 2014-07-09

Family

ID=45009250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010193898.0A Expired - Fee Related CN102262618B (en) 2010-05-28 2010-05-28 Method and device for identifying page information

Country Status (1)

Country Link
CN (1) CN102262618B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102708093B (en) * 2012-04-12 2015-10-28 李敏 The method and system of reading are separated for realizing correlation on portable devices
CN104142961B (en) 2013-05-10 2017-08-25 北大方正集团有限公司 The logic processing device of composite diagram and logical process method in format document
CN103488619B (en) * 2013-07-05 2017-05-24 百度在线网络技术(北京)有限公司 Method and device for processing document file
CN104346615B (en) * 2013-08-08 2019-02-19 北大方正集团有限公司 The extraction element and extracting method of composite diagram in format document
US9542622B2 (en) * 2014-03-08 2017-01-10 Microsoft Technology Licensing, Llc Framework for data extraction by examples
CN104156345B (en) * 2014-08-04 2017-06-20 中南出版传媒集团股份有限公司 The method and apparatus of caption in identification portable document format file
CN104239282B (en) * 2014-09-09 2017-11-14 百度在线网络技术(北京)有限公司 The treating method and apparatus of e-book
CN104268127B (en) * 2014-09-22 2018-02-09 同方知网(北京)技术有限公司 A kind of method of electronics shelves layout files reading order analysis
CN104899551B (en) * 2015-04-30 2018-08-14 北京大学 A kind of form image sorting technique
CN106326193A (en) * 2015-06-18 2017-01-11 北京大学 Footnote identification method and footnote and footnote citation association method in fixed-layout document
CN105512100B (en) * 2015-12-01 2018-08-07 北京大学 A kind of printed page analysis method and device
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 The information extraction method and device of a kind of pdf document
CN106934383B (en) * 2017-03-23 2018-11-30 掌阅科技股份有限公司 The recognition methods of picture markup information, device and server in file
CN109086327B (en) * 2018-07-03 2022-05-17 中国科学院信息工程研究所 Method and device for rapidly generating webpage visual structure graph
CN110673846B (en) * 2019-09-04 2023-02-17 北京泰和纬度网络技术有限公司 Method and system for webpage blocking
CN111160144B (en) * 2019-12-16 2023-04-07 广东施富电气实业有限公司 Method and system for identifying components by combining electric drawing with pictures and texts and storage medium
CN111046096B (en) * 2019-12-16 2023-11-24 北京信息科技大学 Method and device for generating graphic structured information

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1604073A (en) * 2004-11-22 2005-04-06 北京北大方正技术研究院有限公司 Method for conducting title and text logic connection for newspaper pages

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chun Chen Lin et al. Logical structure analysis of book document images using contents information. Proceedings of the Fourth International Conference on Document Analysis and Recognition, 1997-08-20, pp. 1048-1054. *

Also Published As

Publication number Publication date
CN102262618A (en) 2011-11-30

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220913

Address after: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee after: Peking University

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: PEKING University FOUNDER R & D CENTER

Address before: 100871 No. 5, the Summer Palace Road, Beijing, Haidian District

Patentee before: Peking University

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: PEKING University FOUNDER R & D CENTER

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140709