CN101714149A - Method for automatically correlating pictures with descriptions obtained after inversely solving format files - Google Patents


Info

Publication number
CN101714149A
Authority
CN
China
Prior art keywords
picture block
literal piece
literal
picture
piece
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200810223698
Other languages
Chinese (zh)
Other versions
CN101714149B (en
Inventor
徐剑波
董宁
王辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Founder Apabi Technology Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Apabi Technology Co Ltd
Priority to CN 200810223698
Publication of CN101714149A
Application granted
Publication of CN101714149B
Legal status: Expired - Fee Related

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for automatically associating pictures with their descriptions (captions) after a layout file has been reverse-parsed, and belongs to the technical field of information processing. In the prior art, the pictures and descriptions obtained by reverse-parsing a layout file must be associated manually, or the associations between data items must be stored in the layout file itself, so the workload is heavy and efficiency is low. The method comprises: comparing every text block whose attribute is body text in the text block set with all picture blocks in the picture block set, screening out the best-positioned picture block, and taking the text block as a candidate description of that picture block; then determining the description of every picture block in the picture block set, where, if a picture block has several candidate descriptions, the most suitable one is selected as its description. The method automatically establishes the association between pictures and descriptions obtained by reverse-parsing layout files of any format, which reduces manual association work and improves efficiency.

Description

A method for automatically associating pictures and captions obtained by reverse-parsing a layout file
Technical field
The invention belongs to the technical field of information processing, and specifically relates to a method for automatically associating pictures and captions obtained by reverse-parsing a layout file.
Background art
Chinese patent application (application number: 200710179938.4; publication date: 2008.06.25) discloses "An indexing method for complex page layouts based on PDF". That method can extract a set of text blocks from a layout file. Each text block contains the corresponding text content, the font size, the font name and the region position of the block, and the typesetting type of the text is computed from the region position. The typesetting types generally include: vertical left-to-right, vertical right-to-left, vertical without direction, horizontal left-to-right, horizontal right-to-left, and so on. The method marks the attribute of each text block as title or body text according to its font size, and also assigns the block a sequence number, among other things. However, the method does not obtain picture blocks, and the association between a picture block and its textual description (that is, its caption) must be made manually, which is laborious and inefficient.
Chinese patent application (application number: 200610112710.9; publication date: 2007.02.14) discloses "A method for extracting newspaper data information". That method extracts data from a layout file according to the layout information structure of the file, and automatically extracts the associations between articles from the layout information and the article region information. Its drawback is that the layout file must store the associations between articles internally; if the layout file does not store them, the method no longer works.
Summary of the invention
In view of the defects in the prior art, the object of the invention is to provide a method for automatically associating pictures and captions obtained by reverse-parsing a layout file. The method can automatically associate a picture obtained by reverse-parsing any layout file with the caption of that picture, reducing manual work and improving efficiency.
To achieve this object, the technical solution adopted by the invention is a method for automatically associating pictures and captions obtained by reverse-parsing a layout file, comprising the following steps:
(1) from the text block set {S} obtained by reverse-parsing the layout file, take a text block whose attribute is body text and which has not been taken before;
(2) in the picture block set {P} obtained by reverse-parsing the layout file, search for picture blocks adjacent to the text block taken in step (1); if no picture block is adjacent to the text block, go to step (3); if exactly one picture block is adjacent to the text block, take the text block as a candidate caption of that picture block; if two or more picture blocks are adjacent to the text block, screen out the best-positioned picture block and take the text block as a candidate caption of that picture block;
(3) repeat the above steps until every text block in the text block set {S} has been taken once;
(4) determine the caption of each picture block in the picture block set {P}: if a picture block has only one candidate caption, that candidate is its caption; if a picture block has several candidate captions, screen out the most suitable candidate as its caption.
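Steps (1) to (4) can be sketched as a small driver loop. This is only an illustrative outline, not the patented implementation: the dictionary key `attribute`, the value `"body"`, and the three callbacks `is_adjacent`, `pick_best_picture` and `pick_best_caption` are hypothetical names introduced here to stand in for the adjacency test and the two screening procedures.

```python
from collections import defaultdict


def associate_captions(text_blocks, picture_blocks,
                       is_adjacent, pick_best_picture, pick_best_caption):
    """Collect candidate captions per picture, then choose one per picture.

    All three callbacks are hypothetical placeholders supplied by the caller.
    """
    candidates = defaultdict(list)  # picture index -> candidate text blocks

    # Steps (1) and (3): visit every body-text block exactly once.
    for text in text_blocks:
        if text.get("attribute") != "body":
            continue
        # Step (2): find the picture blocks adjacent to this text block.
        neighbours = [i for i, pic in enumerate(picture_blocks)
                      if is_adjacent(text, pic)]
        if not neighbours:
            continue
        if len(neighbours) == 1:
            best = neighbours[0]
        else:
            best = pick_best_picture(text, neighbours)
        candidates[best].append(text)

    # Step (4): one caption per picture block.
    result = {}
    for idx, cands in candidates.items():
        if len(cands) == 1:
            result[idx] = cands[0]
        else:
            result[idx] = pick_best_caption(picture_blocks[idx], cands)
    return result
```

Passing the three procedures in as callbacks keeps the outer loop independent of how adjacency and ranking are actually computed.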
In the method described above, the picture blocks adjacent to a text block are found in the picture block set {P} in step (2) as follows: determine whether the picture block and the text block are adjacent in the horizontal direction or in the vertical direction; if they are adjacent in either direction, the picture block and the text block are adjacent.
In the method described above, determining whether the picture block and the text block are adjacent in the horizontal direction or in the vertical direction comprises the following steps:
Let the top-left corner of the text block be $(X_1, Y_1)$ and its bottom-right corner be $(X_1', Y_1')$; let the top-left corner of the picture block be $(X_2, Y_2)$ and its bottom-right corner be $(X_2', Y_2')$. The width of the text block is $W = X_1' - X_1$ and the width of the picture block is $W' = X_2' - X_2$; the height of the text block is $H = Y_1' - Y_1$ and the height of the picture block is $H' = Y_2' - Y_2$. The average font size of all text blocks is AvgFontSize. The effective distance between a caption and a picture block is $\mathrm{DistThreshold} = C_1 \times \mathrm{AvgFontSize}$, where $C_1$ is the expansion ratio between text block and picture block, $1 < C_1 < 5$. Below, min denotes the smaller of two values and max the larger; $D$ is the expansion distance, $0 \le D \le 10$, in points;
① Compute the overlap degree of the picture block and the text block:
The overlap degree in the horizontal direction, OverlapX, is computed as
$$\mathrm{OverlapX} = \frac{\min(X_1', X_2') - \max(X_1, X_2)}{\max(X_1', X_2') - \min(X_1, X_2)},$$
The overlap degree in the vertical direction, OverlapY, is computed as
$$\mathrm{OverlapY} = \frac{\min(Y_1', Y_2') - \max(Y_1, Y_2)}{\max(Y_1', Y_2') - \min(Y_1, Y_2)};$$
② Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $X_1 \ge X_2 - D$ and $X_1' \le X_2' + D$ all hold. If they do, check whether OverlapY is greater than OverlapX: if it is, the picture block and the text block are adjacent in the horizontal direction; otherwise they are adjacent in the vertical direction. If the conditions do not all hold, compute the overlap distance of the picture block and the text block in the horizontal direction, $\mathrm{DistX} = \max(X_1, X_2) - \min(X_1', X_2')$;
③ Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $W < W'$ and $\mathrm{DistX} < \mathrm{DistThreshold}$ all hold. If they do, the picture block and the text block are adjacent in the horizontal direction; otherwise compute the maximum distance DistXMax between the picture block and the text block: if $X_1 < X_2$ then $\mathrm{DistXMax} = X_2 - X_1$, otherwise $\mathrm{DistXMax} = |X_2' - X_1'|$;
④ Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $W < W'$ and $\mathrm{DistXMax} < W'/2$ all hold. If they do, the picture block and the text block are adjacent in the horizontal direction; otherwise compute $\mathrm{DistY} = \max(Y_1, Y_2) - \min(Y_1', Y_2')$;
⑤ Determine whether all of the following hold: $X_1 \ge X_2 - D$, $X_1' \le X_2' + D$, $H < H'$, the typesetting type of the text block is horizontal left-to-right or horizontal right-to-left, and $\mathrm{DistY} < \mathrm{DistThreshold}$. If they do, the picture block and the text block are adjacent in the vertical direction; otherwise the picture block and the text block are adjacent neither in the horizontal direction nor in the vertical direction.
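As a small worked illustration of the overlap degree in step ①, the ratio can be computed per axis from the two intervals. The function name `overlap_ratio` and the sample coordinates are made up for this sketch, and returning 0 for a zero-length combined span is an added assumption that the text does not address.

```python
def overlap_ratio(a_start, a_end, b_start, b_end):
    """Overlap degree of two 1-D intervals:
    (min of ends - max of starts) / (max of ends - min of starts)."""
    span = max(a_end, b_end) - min(a_start, b_start)
    if span == 0:
        # Assumption: degenerate zero-size intervals get an overlap of 0.
        return 0.0
    return (min(a_end, b_end) - max(a_start, b_start)) / span


# Text block spans x = 10..60, picture block spans x = 0..100.
overlap_x = overlap_ratio(10, 60, 0, 100)   # (60 - 10) / (100 - 0)

# Disjoint intervals give a negative value, since the numerator is a gap.
gap = overlap_ratio(0, 10, 20, 30)
```

The same function serves for OverlapY by passing the Y coordinates instead.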
In the method described above, the value of $C_1$ is 1.2 and the value of $D$ is 3.
In the method described above, screening out the best-positioned picture block in step (2) comprises the following steps:
Let the text block taken in step (1) be T, the set of picture blocks adjacent to T be {TP}, and the best-positioned picture block be PZ;
① For every picture block in {TP}, compute the caption type PicType and the distance Dist between T and that picture block. The caption type of a picture block is the position of the text block relative to the picture block: the text block is above the picture block, to the left of the picture block, to the right of the picture block, or below the picture block;
② Take an arbitrary picture block P from {TP} and delete it from {TP}; let PZ = P;
③ Take an arbitrary picture block PN from {TP} and delete it from {TP}; screen out the better-positioned of PZ and PN, and if PN is better positioned, let PZ = PN;
The better-positioned of PZ and PN is determined as follows: let the caption type of PZ be PicTypeZ, the caption type of PN be PicTypeN, the distance between T and PZ be DistZ, and the distance between T and PN be DistN;
PN is better positioned than PZ if any one of the following conditions is satisfied:
Condition a. PicTypeN is the same as PicTypeZ, and DistN < DistZ,
Condition b. PicTypeN is "text block to the right of the picture block", PicTypeZ is "text block to the left of the picture block", and DistN < DistZ,
Condition c. the priority of PicTypeN is higher than that of PicTypeZ, and it is not the case that PicTypeN is "text block to the right of the picture block" while PicTypeZ is "text block to the left of the picture block"; here, a text block below the picture block has higher priority than a text block to the left or right of the picture block, and a text block to the left or right of the picture block has higher priority than a text block above the picture block;
④ Check whether {TP} is empty; if it is, PZ is the best-positioned picture block; otherwise go to step ③.
In the method described above, computing the caption type of a picture block and the distance between the text block and the picture block in step ① comprises the following steps:
If the text block and the picture block are adjacent in the horizontal direction:
A. Compute the abscissa of the text block centre, $\mathrm{CenterT} = (X_1 + X_1')/2$, and the abscissa of the picture block centre, $\mathrm{CenterPic} = (X_2 + X_2')/2$;
B. Check whether CenterT < CenterPic holds. If it does, PicType is "text block to the left of the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = X_2 - \mathrm{CenterT}$; if it does not, PicType is "text block to the right of the picture block" and the distance is $\mathrm{Dist} = \mathrm{CenterT} - X_2'$;
If the text block and the picture block are adjacent in the vertical direction:
A. Compute the ordinate of the text block centre, $\mathrm{CenterT} = (Y_1 + Y_1')/2$, and the ordinate of the picture block centre, $\mathrm{CenterPic} = (Y_2 + Y_2')/2$;
B. Check whether CenterT < CenterPic holds. If it does, PicType is "text block above the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = Y_2 - \mathrm{CenterT}$; if it does not, PicType is "text block below the picture block" and the distance is $\mathrm{Dist} = \mathrm{CenterT} - Y_2'$.
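The caption type and distance computation might look like the following sketch. Blocks are represented as plain `(x1, y1, x2, y2)` tuples (top-left and bottom-right corners), and the string labels `"left"`, `"right"`, `"above"` and `"below"` are illustrative choices rather than names from the source. The distance is measured from the text block centre CenterT, which is how the distance term is read here.

```python
def caption_type_and_distance(text, pic, direction):
    """Return (caption type, distance) for a text block relative to a picture.

    text, pic: (x1, y1, x2, y2) tuples, top-left and bottom-right corners.
    direction: "horizontal" or "vertical", the axis on which they are adjacent.
    """
    if direction == "horizontal":
        center_t = (text[0] + text[2]) / 2      # abscissa of text centre
        center_pic = (pic[0] + pic[2]) / 2      # abscissa of picture centre
        if center_t < center_pic:
            return "left", pic[0] - center_t    # Dist = X2 - CenterT
        return "right", center_t - pic[2]       # Dist = CenterT - X2'

    center_t = (text[1] + text[3]) / 2          # ordinate of text centre
    center_pic = (pic[1] + pic[3]) / 2          # ordinate of picture centre
    if center_t < center_pic:
        return "above", pic[1] - center_t       # Dist = Y2 - CenterT
    return "below", center_t - pic[3]           # Dist = CenterT - Y2'


picture = (100, 100, 200, 200)
left_text = (40, 120, 90, 140)      # centre x = 65, left of the picture
below_text = (100, 205, 200, 225)   # centre y = 215, below the picture
```

Because Y grows downward from the top-left corner in this coordinate convention, a smaller centre ordinate means the text block sits above the picture.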
In the method described above, when a picture block has several candidate captions in step (4), screening out the most suitable text block as the caption of the picture block comprises the following steps:
Let the candidate caption set of a picture block be {L};
① Merge the text blocks in {L} that have the same caption type into one text block; the overlap degree of the merged text block is the sum of the overlap degrees of the merged text blocks with the picture block, and its weight is the number of merged text blocks;
② From the merged {L}, pick the text block with the largest weight as the caption of the picture block; if several text blocks share the largest weight, compare their overlap degrees with the picture block and take the one with the largest overlap degree as the caption of the picture block.
By computing the positional relationships between the text blocks and picture blocks obtained by reverse-parsing a layout file, the method of the invention can automatically associate a picture with its caption without needing to understand the layout information structure of the layout file, which reduces the work of manual confirmation and association and improves efficiency.
Description of drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a schematic diagram of the positional relationships between the text blocks and the picture block in the embodiment;
Fig. 3 is a flow chart of screening out the best-positioned picture block in the embodiment when a text block has two or more adjacent picture blocks.
Embodiment
The invention is described below with reference to the embodiment and the accompanying drawings.
A caption in the invention is one or more text blocks that describe a picture block. Captions have different types: caption above the picture block, caption to the left of the picture block, caption to the right of the picture block, and caption below the picture block. The type of a caption is determined by the positional relationship between the text block and the centre point of the picture block: a caption above the picture block is located above the centre point of the picture block, a caption to the left of the picture block is located to the left of the centre point, a caption to the right of the picture block is located to the right of the centre point, and a caption below the picture block is located below the centre point; this includes the case where the caption lies inside the picture block. In Fig. 2, text blocks 1 and 2 are above the picture block, text blocks 3 and 4 are below it, text blocks 5, 6 and 7 are to its left, and text blocks 8 and 9 are to its right.
Fig. 1 shows the flow of the method of the invention for automatically associating pictures and captions obtained by reverse-parsing a layout file, which comprises the following steps.
Suppose that, after the layout file is reverse-parsed, the resulting text block set is {S} and the picture block set is {P}.
(1) From the text block set {S}, take a text block whose attribute is body text and which has not been taken before.
(2) In the picture block set {P} obtained by reverse-parsing the layout file, search for picture blocks adjacent to the text block taken in step (1). If no picture block is adjacent to the text block, go to step (3). If exactly one picture block is adjacent to the text block, take the text block as a candidate caption of that picture block. If two or more picture blocks are adjacent to the text block, screen out the best-positioned picture block and take the text block as a candidate caption of that picture block.
The picture blocks adjacent to a text block are found in the picture block set {P} as follows: determine whether the picture block and the text block are adjacent in the horizontal direction or in the vertical direction; if they are adjacent in either direction, the picture block and the text block are adjacent. In Fig. 2, text blocks 1, 2, 3 and 4 are adjacent to the picture block in the vertical direction, and text blocks 5, 6, 7, 8 and 9 are adjacent to it in the horizontal direction. The determination method is described below.
Let the top-left corner of the text block be $(X_1, Y_1)$ and its bottom-right corner be $(X_1', Y_1')$; let the top-left corner of the picture block be $(X_2, Y_2)$ and its bottom-right corner be $(X_2', Y_2')$. The width of the text block is $W = X_1' - X_1$ and the width of the picture block is $W' = X_2' - X_2$; the height of the text block is $H = Y_1' - Y_1$ and the height of the picture block is $H' = Y_2' - Y_2$. The average font size of all text blocks is AvgFontSize. The effective distance between a caption and a picture block is $\mathrm{DistThreshold} = C_1 \times \mathrm{AvgFontSize}$, where $C_1$ is the expansion ratio between text block and picture block, $1 < C_1 < 5$; in this embodiment $C_1 = 1.2$. Below, min denotes the smaller of two values and max the larger. $D$ is the expansion distance, that is, the distance by which a text block is allowed to extend beyond the width or height of the picture block, $0 \le D \le 10$, in points. In this embodiment $D$ is 3; the value of $D$ can be adjusted within this range.
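One way to hold these quantities in code is a pair of small record types plus a helper for DistThreshold. This is a minimal sketch under stated assumptions: the class names `Block` and `TextBlock`, the field names, and the default values are all hypothetical and chosen only to mirror the symbols defined above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x1: float  # top-left X
    y1: float  # top-left Y
    x2: float  # bottom-right X (X' in the text)
    y2: float  # bottom-right Y (Y' in the text)

    @property
    def width(self) -> float:
        return self.x2 - self.x1   # W = X' - X

    @property
    def height(self) -> float:
        return self.y2 - self.y1   # H = Y' - Y


@dataclass
class TextBlock(Block):
    font_size: float = 10.0            # in points
    attribute: str = "body"            # "body" or "title"
    typesetting: str = "horizontal-ltr"


def dist_threshold(text_blocks, c1=1.2):
    """DistThreshold = C1 * AvgFontSize over all text blocks (1 < C1 < 5)."""
    avg_font_size = sum(t.font_size for t in text_blocks) / len(text_blocks)
    return c1 * avg_font_size


blocks = [TextBlock(0, 0, 50, 12, font_size=9.0),
          TextBlock(0, 20, 50, 32, font_size=11.0)]
threshold = dist_threshold(blocks)   # 1.2 * average font size of 10 points
```

Tying the threshold to the average font size means it scales with the document: a layout set in larger type tolerates a proportionally larger gap between a picture and its caption.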
Determining whether the picture block and the text block are adjacent in the horizontal direction or in the vertical direction comprises the following steps.
① Compute the overlap degree of the picture block and the text block:
The overlap degree in the horizontal direction, OverlapX, is computed as
$$\mathrm{OverlapX} = \frac{\min(X_1', X_2') - \max(X_1, X_2)}{\max(X_1', X_2') - \min(X_1, X_2)};$$
The overlap degree in the vertical direction, OverlapY, is computed as
$$\mathrm{OverlapY} = \frac{\min(Y_1', Y_2') - \max(Y_1, Y_2)}{\max(Y_1', Y_2') - \min(Y_1, Y_2)}.$$
② Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $X_1 \ge X_2 - D$ and $X_1' \le X_2' + D$ all hold. If they do, check whether OverlapY is greater than OverlapX: if it is, the picture block and the text block are adjacent in the horizontal direction; otherwise they are adjacent in the vertical direction. If the conditions do not all hold, compute the overlap distance of the picture block and the text block in the horizontal direction, $\mathrm{DistX} = \max(X_1, X_2) - \min(X_1', X_2')$.
③ Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $W < W'$ and $\mathrm{DistX} < \mathrm{DistThreshold}$ all hold. If they do, the picture block and the text block are adjacent in the horizontal direction; otherwise compute the maximum distance DistXMax between the picture block and the text block: if $X_1 < X_2$ then $\mathrm{DistXMax} = X_2 - X_1$, otherwise $\mathrm{DistXMax} = |X_2' - X_1'|$;
④ Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $W < W'$ and $\mathrm{DistXMax} < W'/2$ all hold. If they do, the picture block and the text block are adjacent in the horizontal direction; otherwise compute $\mathrm{DistY} = \max(Y_1, Y_2) - \min(Y_1', Y_2')$;
⑤ Determine whether all of the following hold: $X_1 \ge X_2 - D$, $X_1' \le X_2' + D$, $H < H'$, the typesetting type of the text block is horizontal left-to-right or horizontal right-to-left, and $\mathrm{DistY} < \mathrm{DistThreshold}$. If they do, the picture block and the text block are adjacent in the vertical direction; otherwise the picture block and the text block are adjacent neither in the horizontal direction nor in the vertical direction.
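The five steps above can be traced in a single function. The sketch below follows the stated order of tests; the function names, the tuple representation `(x1, y1, x2, y2)`, the boolean `horizontal_text` flag (standing in for the typesetting-type check) and the sample coordinates are assumptions made for illustration.

```python
def overlap_ratio(a1, a2, b1, b2):
    span = max(a2, b2) - min(a1, b1)
    # Assumption: a zero-length combined span counts as no overlap.
    return (min(a2, b2) - max(a1, b1)) / span if span else 0.0


def adjacency(text, pic, threshold, d=3.0, horizontal_text=True):
    """Return "horizontal", "vertical" or None for a text/picture pair.

    text, pic: (x1, y1, x2, y2); threshold: DistThreshold; d: expansion D.
    """
    x1, y1, x1p, y1p = text
    x2, y2, x2p, y2p = pic
    w, wp = x1p - x1, x2p - x2          # W, W'
    h, hp = y1p - y1, y2p - y2          # H, H'
    within_y = y1 >= y2 - d and y1p <= y2p + d
    within_x = x1 >= x2 - d and x1p <= x2p + d

    # Step 2: text block lies (almost) inside the picture block.
    if within_y and within_x:
        ox = overlap_ratio(x1, x1p, x2, x2p)
        oy = overlap_ratio(y1, y1p, y2, y2p)
        return "horizontal" if oy > ox else "vertical"
    dist_x = max(x1, x2) - min(x1p, x2p)

    # Step 3: narrow text block close beside the picture.
    if within_y and w < wp and dist_x < threshold:
        return "horizontal"
    dist_x_max = x2 - x1 if x1 < x2 else abs(x2p - x1p)

    # Step 4: text block edge within half the picture width.
    if within_y and w < wp and dist_x_max < wp / 2:
        return "horizontal"
    dist_y = max(y1, y2) - min(y1p, y2p)

    # Step 5: horizontally typeset text block close above or below.
    if within_x and h < hp and horizontal_text and dist_y < threshold:
        return "vertical"
    return None


picture = (100, 100, 300, 250)
below = adjacency((100, 255, 300, 270), picture, threshold=12.0)
beside = adjacency((40, 120, 95, 140), picture, threshold=12.0)
far = adjacency((500, 500, 600, 520), picture, threshold=12.0)
```

With these sample coordinates the block just under the picture is classified as vertically adjacent, the block just to its left as horizontally adjacent, and the distant block as not adjacent.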
Fig. 3 shows the flow of screening out the best-positioned picture block when a text block has two or more adjacent picture blocks, which comprises the following steps. Let the text block taken in step (1) be T, the set of picture blocks adjacent to T be {TP}, and the best-positioned picture block be PZ.
① For every picture block in {TP}, compute the caption type PicType and the distance Dist between T and that picture block.
If the text block and the picture block are adjacent in the horizontal direction:
A. Compute the abscissa of the text block centre, $\mathrm{CenterT} = (X_1 + X_1')/2$, and the abscissa of the picture block centre, $\mathrm{CenterPic} = (X_2 + X_2')/2$;
B. Check whether CenterT < CenterPic holds. If it does, PicType is "text block to the left of the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = X_2 - \mathrm{CenterT}$; if it does not, PicType is "text block to the right of the picture block" and the distance is $\mathrm{Dist} = \mathrm{CenterT} - X_2'$.
If the text block and the picture block are adjacent in the vertical direction:
A. Compute the ordinate of the text block centre, $\mathrm{CenterT} = (Y_1 + Y_1')/2$, and the ordinate of the picture block centre, $\mathrm{CenterPic} = (Y_2 + Y_2')/2$;
B. Check whether CenterT < CenterPic holds. If it does, PicType is "text block above the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = Y_2 - \mathrm{CenterT}$; if it does not, PicType is "text block below the picture block" and the distance is $\mathrm{Dist} = \mathrm{CenterT} - Y_2'$.
② Take an arbitrary picture block P from {TP} and delete it from {TP}; let PZ = P.
③ Take an arbitrary picture block PN from {TP} and delete it from {TP}; screen out the better-positioned of PZ and PN, and if PN is better positioned, let PZ = PN.
The better-positioned of PZ and PN is determined as follows: let the caption type of PZ be PicTypeZ, the caption type of PN be PicTypeN, the distance between T and PZ be DistZ, and the distance between T and PN be DistN;
PN is better positioned than PZ if any one of the following conditions is satisfied:
Condition a. PicTypeN is the same as PicTypeZ, and DistN < DistZ;
Condition b. PicTypeN is "text block to the right of the picture block", PicTypeZ is "text block to the left of the picture block", and DistN < DistZ;
Condition c. the priority of PicTypeN is higher than that of PicTypeZ, and it is not the case that PicTypeN is "text block to the right of the picture block" while PicTypeZ is "text block to the left of the picture block"; here, a text block below the picture block has higher priority than a text block to the left or right of the picture block, and a text block to the left or right of the picture block has higher priority than a text block above the picture block.
④ Check whether {TP} is empty; if it is, PZ is the best-positioned picture block; otherwise go to step ③.
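The pairwise comparison and the loop of steps ② to ④ can be sketched as follows. The numeric priority table, the string labels and the tuple layout `(picture_id, caption_type, distance)` are illustrative assumptions; only the ordering (below, then left or right, then above) and conditions a to c come from the text.

```python
# Higher number = higher priority: below > left/right > above.
PRIORITY = {"below": 2, "left": 1, "right": 1, "above": 0}


def is_better(type_n, dist_n, type_z, dist_z):
    """True if picture PN is better positioned than the current best PZ."""
    if type_n == type_z and dist_n < dist_z:                        # condition a
        return True
    if type_n == "right" and type_z == "left" and dist_n < dist_z:  # condition b
        return True
    # Condition c. With left and right given equal priority here, the
    # right-versus-left exclusion never changes the result; it is kept
    # only to mirror the wording of the condition.
    return (PRIORITY[type_n] > PRIORITY[type_z]
            and not (type_n == "right" and type_z == "left"))


def best_picture(candidates):
    """candidates: non-empty list of (picture_id, caption_type, distance),
    each computed relative to the same text block T."""
    remaining = list(candidates)
    best = remaining.pop()              # step 2: PZ = P
    while remaining:                    # step 4: loop until {TP} is empty
        nxt = remaining.pop()           # step 3: take PN
        if is_better(nxt[1], nxt[2], best[1], best[2]):
            best = nxt                  # PZ = PN
    return best[0]


chosen = best_picture([("p1", "above", 2.0),
                       ("p2", "below", 8.0),
                       ("p3", "left", 1.0)])
```

In this sample the picture for which the text block lies below wins even though it is the farthest, because caption type priority is compared before distance across different types.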
(3) Repeat step (1) and step (2) until every text block in the text block set {S} has been taken once.
(4) Determine the caption of each picture block in the picture block set {P}. If a picture block has only one candidate caption, that candidate is its caption. If a picture block has several candidate captions, screen out the most suitable candidate as its caption, and add the other candidate captions of the picture block back to the text block set {S}.
When a picture block has several candidate captions, screening out the most suitable candidate comprises the following steps. Let the candidate caption set of a picture block be {L}.
① Merge the text blocks in {L} that have the same caption type into one text block; the overlap degree of the merged text block with the picture block is the sum of the overlap degrees of the merged text blocks with the picture block, and its weight is the number of merged text blocks. In Fig. 2, text blocks 1 and 2 are merged into one text block; the overlap degree of the merged text block with the picture block is the sum of the overlap degree of text block 1 with the picture block and the overlap degree of text block 2 with the picture block, and the weight of the merged text block is 2. Text blocks 3 and 4 are merged into one text block; its overlap degree is the sum of the overlap degrees of text blocks 3 and 4 with the picture block, and its weight is 2. Text blocks 5, 6 and 7 are merged into one text block; its overlap degree is the sum of the overlap degrees of text blocks 5, 6 and 7 with the picture block, and its weight is 3. Text blocks 8 and 9 are merged into one text block; its overlap degree is the sum of the overlap degrees of text blocks 8 and 9 with the picture block, and its weight is 2.
② From the merged {L}, pick the text block with the largest weight as the caption of the picture block; if several text blocks share the largest weight, compare their overlap degrees with the picture block and take the one with the largest overlap degree as the caption of the picture block. In Fig. 2, the text block merged from text blocks 5, 6 and 7 has the largest weight, so it is taken as the caption of the picture block.
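The merge-and-pick procedure can be sketched as a grouping step followed by a two-key maximum. The tuple layout `(text_id, caption_type, overlap)` and the overlap values below are invented for illustration; the text does not say which overlap degree (OverlapX or OverlapY) is summed, so the sketch simply takes one precomputed overlap value per text block as input.

```python
from collections import defaultdict


def choose_caption(candidates):
    """candidates: list of (text_id, caption_type, overlap_with_picture).

    Returns (caption_type, merged text ids) for the chosen caption.
    """
    groups = defaultdict(lambda: {"ids": [], "overlap": 0.0})
    for text_id, cap_type, overlap in candidates:
        groups[cap_type]["ids"].append(text_id)     # merge blocks of same type
        groups[cap_type]["overlap"] += overlap      # summed overlap degree

    # Largest weight (number of merged blocks) wins; ties fall back to
    # the largest summed overlap degree.
    best_type = max(groups,
                    key=lambda t: (len(groups[t]["ids"]), groups[t]["overlap"]))
    return best_type, groups[best_type]["ids"]


# Layout of Fig. 2 with hypothetical overlap values.
fig2 = [(1, "above", 0.3), (2, "above", 0.3),
        (3, "below", 0.4), (4, "below", 0.4),
        (5, "left", 0.1), (6, "left", 0.1), (7, "left", 0.1),
        (8, "right", 0.2), (9, "right", 0.2)]
caption = choose_caption(fig2)
```

For the Fig. 2 layout the left-hand group has weight 3 and is selected regardless of the overlap values, matching the outcome described for text blocks 5, 6 and 7.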
The method of the invention is not limited to the embodiment described above; other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of the technical innovation of the invention.

Claims (8)

1. instead separate the automatic correlation method that the picture that obtains behind the layout files and figure say for one kind, may further comprise the steps:
(1) { takes out the literal piece that the attribute different with getting the literal piece is text the S} from the anti-literal set of blocks that obtains behind the layout files of separating;
(2) the anti-picture block set that obtains after separating layout files search among the P} with step (1) in the literal piece neighbour's that takes out picture block, if neither one picture block and this article block neighbour, then go to step (3), if a picture block and this article block neighbour are only arranged, then the candidate of this literal piece as this picture block schemed, if two or more picture block and this article block neighbour are arranged, then filter out the best picture block in position, the candidate as this picture block schemes with this literal piece;
(3) repeat above step, { all the literal pieces among the S} are removed once up to the literal set of blocks;
(4) determine that { figure of each picture block says among the P} in the picture block set; If the candidate of a picture block schemes only there is one, then this candidate is schemed to say as the figure of this picture block; If the candidate of a picture block schemes then to filter out only candidate and scheme to say as the figure of this picture block for a plurality of.
2. a kind of anti-automatic correlation method of separating the picture that obtains behind the layout files and scheming to say as claimed in claim 1, it is characterized in that, { method of searching among the P} with literal piece neighbour's picture block is: judge that whether in the horizontal direction or the in the vertical direction neighbour picture block and literal piece in picture block set in the step (2), if in the horizontal direction or in the vertical direction neighbour, then picture block and literal piece neighbour.
3. a kind of anti-automatic correlation method that the picture that obtains behind the layout files and figure say of separating as claimed in claim 2 is characterized in that, described judge picture block and literal piece whether in the horizontal direction or in the vertical direction neighbour's method may further comprise the steps:
Suppose the upper-left corner of the text block is $(X_1, Y_1)$ and its lower-right corner is $(X_1', Y_1')$, and the upper-left corner of the picture block is $(X_2, Y_2)$ and its lower-right corner is $(X_2', Y_2')$. The width of the text block is $W = X_1' - X_1$ and the width of the picture block is $W' = X_2' - X_2$; the height of the text block is $H = Y_1' - Y_1$ and the height of the picture block is $H' = Y_2' - Y_2$. The mean font size of all text blocks is AvgFontSize. The effective distance between a caption and a picture block is $\mathrm{DistThreshold} = C_1 \times \mathrm{AvgFontSize}$, where $C_1$ is the spread ratio between the text block and the picture block, $1 < C_1 < 5$. In what follows, the min function takes the smaller of two values and the max function takes the larger; $D$ is the extension distance, $0 \le D \le 10$, measured in points;
1. calculate the degree of overlap between the picture block and the text block:
The formula for the degree of overlap OverlapX in the horizontal direction is
$\mathrm{OverlapX} = \dfrac{\min(X_1', X_2') - \max(X_1, X_2)}{\max(X_1', X_2') - \min(X_1, X_2)}$,
The formula for the degree of overlap OverlapY in the vertical direction is
$\mathrm{OverlapY} = \dfrac{\min(Y_1', Y_2') - \max(Y_1, Y_2)}{\max(Y_1', Y_2') - \min(Y_1, Y_2)}$;
2. judge whether $Y_1 \ge Y_2 - D$ and $Y_1' \le Y_2' + D$ and $X_1 \ge X_2 - D$ and $X_1' \le X_2' + D$ hold; if they hold, further judge whether OverlapY is greater than OverlapX: if it is greater, the picture block and the text block are adjacent in the horizontal direction, otherwise they are adjacent in the vertical direction; if they do not hold, calculate the overlap distance DistX of the picture block and the text block in the horizontal direction, $\mathrm{DistX} = \max(X_1, X_2) - \min(X_1', X_2')$;
3. judge whether $Y_1 \ge Y_2 - D$ and $Y_1' \le Y_2' + D$ and $W < W'$ and $\mathrm{DistX} < \mathrm{DistThreshold}$ hold; if they hold, the picture block and the text block are adjacent in the horizontal direction; otherwise calculate the maximum distance DistXMax between the picture block and the text block: if $X_1 < X_2$, then $\mathrm{DistXMax} = X_2 - X_1$, otherwise $\mathrm{DistXMax} = |X_2' - X_1'|$;
4. judge whether $Y_1 \ge Y_2 - D$ and $Y_1' \le Y_2' + D$ and $W < W'$ and $\mathrm{DistXMax} < W'/2$ hold; if they hold, the picture block and the text block are adjacent in the horizontal direction; otherwise calculate the overlap distance DistY of the picture block and the text block in the vertical direction, $\mathrm{DistY} = \max(Y_1, Y_2) - \min(Y_1', Y_2')$;
5. judge whether $X_1 \ge X_2 - D$ and $X_1' \le X_2' + D$ and $H < H'$ hold, the typesetting type of the text block is horizontal left-to-right or horizontal right-to-left, and $\mathrm{DistY} < \mathrm{DistThreshold}$; if all hold, the picture block and the text block are adjacent in the vertical direction; otherwise the picture block and the text block are adjacent neither in the horizontal direction nor in the vertical direction.
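The five judgments above can be read as a cascade of tests on two bounding boxes. The sketch below is one possible reading under stated assumptions, not the patented implementation: the `Box` type, the function names, the string return values and the `horizontal_text` flag (standing in for the typesetting-type check) are hypothetical, the quantity tested in judgment 4 is taken to be DistXMax, and the defaults `c1=1.2` and `d=3.0` are the values given in claims 4 and 5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    x1: float  # upper-left X
    y1: float  # upper-left Y
    x2: float  # lower-right X (X' in the claim)
    y2: float  # lower-right Y (Y' in the claim)


def overlap(a1: float, a2: float, b1: float, b2: float) -> float:
    """Degree of overlap of two 1-D spans: intersection length over union span.
    Assumes the two spans are not both degenerate at the same point."""
    return (min(a2, b2) - max(a1, b1)) / (max(a2, b2) - min(a1, b1))


def adjacency(text: Box, pic: Box, avg_font_size: float,
              c1: float = 1.2, d: float = 3.0,
              horizontal_text: bool = True) -> Optional[str]:
    """Return "horizontal", "vertical", or None if the blocks are not adjacent."""
    w, w_pic = text.x2 - text.x1, pic.x2 - pic.x1
    h, h_pic = text.y2 - text.y1, pic.y2 - pic.y1
    threshold = c1 * avg_font_size  # DistThreshold
    in_y = text.y1 >= pic.y1 - d and text.y2 <= pic.y2 + d
    in_x = text.x1 >= pic.x1 - d and text.x2 <= pic.x2 + d

    # Judgment 2: text block lies within the picture block extended by D.
    if in_y and in_x:
        ox = overlap(text.x1, text.x2, pic.x1, pic.x2)
        oy = overlap(text.y1, text.y2, pic.y1, pic.y2)
        return "horizontal" if oy > ox else "vertical"

    # Judgment 3: narrow text block close to the picture block horizontally.
    dist_x = max(text.x1, pic.x1) - min(text.x2, pic.x2)
    if in_y and w < w_pic and dist_x < threshold:
        return "horizontal"

    # Judgment 4: edge-to-edge offset under half the picture width.
    dist_x_max = pic.x1 - text.x1 if text.x1 < pic.x1 else abs(pic.x2 - text.x2)
    if in_y and w < w_pic and dist_x_max < w_pic / 2:
        return "horizontal"

    # Judgment 5: short, horizontally typeset text block close vertically.
    dist_y = max(text.y1, pic.y1) - min(text.y2, pic.y2)
    if in_x and h < h_pic and horizontal_text and dist_y < threshold:
        return "vertical"
    return None
```

For instance, with a picture block at (100, 100)-(300, 250) and an average font size of 10 points, a text block at (110, 255)-(290, 270) sits 5 points below the picture and is reported as vertically adjacent by judgment 5.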
4. The method according to claim 3 for automatically correlating pictures and captions obtained after inversely solving a format file, characterized in that the value of $C_1$ is 1.2.
5. The method according to claim 3 for automatically correlating pictures and captions obtained after inversely solving a format file, characterized in that the value of $D$ is 3.
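As a worked instance of the threshold defined in claim 3 with the values of claims 4 and 5, assume an average font size of 10 points (this figure is purely illustrative and does not come from the source):

```latex
\mathrm{DistThreshold} = C_1 \times \mathrm{AvgFontSize} = 1.2 \times 10 = 12\ \text{points},
\qquad D = 3\ \text{points}
```

Under that assumption, a text block is treated as close enough to a picture block when the relevant gap is under 12 points, and the containment tests tolerate an overshoot of up to 3 points on each side.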
6. The method according to any one of claims 3 to 5 for automatically correlating pictures and captions obtained after inversely solving a format file, characterized in that the method of filtering out the best-positioned picture block in step (2) comprises the following steps:
Suppose the text block taken in step (1) is T, the set of picture blocks adjacent to T is {TP}, and the best-positioned picture block is PZ;
1. calculate the caption type PicType of every picture block in {TP}, and the distance Dist between T and every picture block in {TP}; the caption type of a picture block is the position of the text block relative to the picture block, namely: the text block is above the picture block, the text block is to the left of the picture block, the text block is to the right of the picture block, or the text block is below the picture block;
2. take any picture block P out of {TP} and delete it from {TP}; let PZ = P;
3. take any picture block PN out of {TP} and delete PN from {TP}; filter out the better-positioned picture block of PZ and PN; if PN is better positioned, let PZ = PN;
The method of filtering out the better-positioned picture block of PZ and PN is: suppose the caption type of PZ is PicTypeZ, the caption type of PN is PicTypeN, the distance between T and PZ is DistZ, and the distance between T and PN is DistN;
PN is better positioned than PZ if one of the following conditions is satisfied:
Condition a. PicTypeN is the same as PicTypeZ, and DistN < DistZ;
Condition b. PicTypeN is "text block to the right of the picture block", PicTypeZ is "text block to the left of the picture block", and DistN < DistZ;
Condition c. the priority of PicTypeN is higher than that of PicTypeZ, and it is not the case that PicTypeN is "text block to the right of the picture block" while PicTypeZ is "text block to the left of the picture block"; here, "text block below the picture block" has higher priority than "text block to the left" and "text block to the right", and "text block to the left" or "to the right" has higher priority than "text block above the picture block";
4. judge whether {TP} is empty; if it is empty, PZ is the best-positioned picture block; otherwise, go to step 3.
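The selection loop and conditions a to c can be sketched as follows. This is an illustrative sketch rather than the patented implementation: the tuple layout `(picture_id, pic_type, dist)`, the constant names and the numeric priorities are assumptions, and giving "left" and "right" the same priority is one reading of the wording of condition c.

```python
from typing import List, Tuple

ABOVE, LEFT, RIGHT, BELOW = "above", "left", "right", "below"

# Position of the text block relative to the picture block; higher is better.
# Equal priority for LEFT and RIGHT is an assumption of this sketch.
PRIORITY = {BELOW: 2, LEFT: 1, RIGHT: 1, ABOVE: 0}

Candidate = Tuple[str, str, float]  # (picture_id, pic_type, dist)


def pn_is_better(type_n: str, dist_n: float, type_z: str, dist_z: float) -> bool:
    """True if PN beats the current best PZ under condition a, b or c."""
    cond_a = type_n == type_z and dist_n < dist_z
    cond_b = type_n == RIGHT and type_z == LEFT and dist_n < dist_z
    cond_c = (PRIORITY[type_n] > PRIORITY[type_z]
              and not (type_n == RIGHT and type_z == LEFT))
    return cond_a or cond_b or cond_c


def best_picture(candidates: List[Candidate]) -> str:
    """Steps 2-4: take blocks out of {TP} one by one, keeping the best as PZ."""
    remaining = list(candidates)  # work on a copy of {TP}
    pz = remaining.pop()          # step 2: any block becomes the initial PZ
    while remaining:              # step 4: stop once {TP} is empty
        pn = remaining.pop()      # step 3: take and delete another block
        if pn_is_better(pn[1], pn[2], pz[1], pz[2]):
            pz = pn
    return pz[0]
```

A picture block with the text below it wins over one with the text above it regardless of distance, while two blocks of the same caption type are separated by distance alone.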
7. The method according to claim 6 for automatically correlating pictures and captions obtained after inversely solving a format file, characterized in that the method in step 1. of calculating the caption type of a picture block and the distance between the text block and the picture block comprises the following steps:
If the text block and the picture block are adjacent in the horizontal direction:
A. calculate the horizontal coordinate of the center of the text block, $\mathrm{CenterT} = (X_1 + X_1')/2$; calculate the horizontal coordinate of the center of the picture block, $\mathrm{CenterPic} = (X_2 + X_2')/2$;
B. judge whether CenterT < CenterPic holds; if it holds, PicType is "text block to the left of the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = X_2 - \mathrm{CenterT}$; if it does not hold, PicType is "text block to the right of the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = \mathrm{CenterT} - X_2'$;
If the text block and the picture block are adjacent in the vertical direction:
A. calculate the vertical coordinate of the center of the text block, $\mathrm{CenterT} = (Y_1 + Y_1')/2$; calculate the vertical coordinate of the center of the picture block, $\mathrm{CenterPic} = (Y_2 + Y_2')/2$;
B. judge whether CenterT < CenterPic holds; if it holds, PicType is "text block above the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = Y_2 - \mathrm{CenterT}$; if it does not hold, PicType is "text block below the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = \mathrm{CenterT} - Y_2'$.
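The center comparison above amounts to a few lines of arithmetic. The sketch below is illustrative only: the function name, the tuple layout `(x1, y1, x2, y2)` and the string labels are assumptions, and the distance is measured from the center of the text block to the nearer edge of the picture block, which is how the "Center" term in the claim is read here.

```python
from typing import Tuple

BoxT = Tuple[float, float, float, float]  # (x1, y1, x2, y2), upper-left to lower-right


def caption_type_and_distance(text: BoxT, pic: BoxT, direction: str) -> Tuple[str, float]:
    """Return (PicType, Dist) for a text block adjacent to a picture block.

    direction is "horizontal" or "vertical", i.e. the direction in which the
    two blocks were found to be adjacent.
    """
    tx1, ty1, tx2, ty2 = text
    px1, py1, px2, py2 = pic
    if direction == "horizontal":
        center_t = (tx1 + tx2) / 2    # CenterT
        center_pic = (px1 + px2) / 2  # CenterPic
        if center_t < center_pic:
            return "left", px1 - center_t
        return "right", center_t - px2
    center_t = (ty1 + ty2) / 2
    center_pic = (py1 + py2) / 2
    if center_t < center_pic:
        return "above", py1 - center_t
    return "below", center_t - py2
```

With a picture block at (100, 100)-(300, 250), a text block at (110, 255)-(290, 270) has its center at Y = 262.5, so it is classified as "below" with a distance of 12.5.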
8. The method according to claim 7 for automatically correlating pictures and captions obtained after inversely solving a format file, characterized in that, when a picture block has several candidate captions in step (4), the method of filtering out the most suitable text block as the caption of the picture block comprises the following steps:
Suppose the candidate caption set of a picture block is {L};
1. merge the text blocks in {L} that have the same caption type into one text block; the degree of overlap between the merged text block and the picture block is the sum of the degrees of overlap between each merged text block and the picture block, and its weight is the number of text blocks merged;
2. from the merged {L}, pick the text block with the largest weight as the caption of the picture block; if several text blocks share the largest weight, compare their degrees of overlap with the picture block and take the text block with the largest degree of overlap as the caption of the picture block.
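The merge-then-rank procedure can be sketched by grouping candidates on caption type. This is a minimal sketch under stated assumptions, not the patented implementation: the function name and the tuple layout `(text_id, pic_type, overlap)` are hypothetical, and the overlap values are taken as already computed by whichever overlap measure applies.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Candidate = Tuple[str, str, float]  # (text_id, pic_type, overlap with the picture block)


def pick_caption(candidates: List[Candidate]) -> List[str]:
    """Merge candidates that share a caption type, then pick the merged group
    with the largest weight, breaking ties by the summed degree of overlap.

    Returns the ids of the text blocks that make up the chosen caption.
    """
    ids: Dict[str, List[str]] = defaultdict(list)
    total_overlap: Dict[str, float] = defaultdict(float)
    for text_id, pic_type, ov in candidates:
        ids[pic_type].append(text_id)  # weight = number of merged blocks
        total_overlap[pic_type] += ov  # merged overlap = sum of overlaps
    best_type = max(ids, key=lambda t: (len(ids[t]), total_overlap[t]))
    return ids[best_type]
```

Two text blocks below a picture therefore outrank a single block beside it, even if the single block overlaps the picture more strongly.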
CN 200810223698 2008-10-08 2008-10-08 Method for automatically correlating pictures with descriptions obtained after inversely solving format files Expired - Fee Related CN101714149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200810223698 CN101714149B (en) 2008-10-08 2008-10-08 Method for automatically correlating pictures with descriptions obtained after inversely solving format files

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200810223698 CN101714149B (en) 2008-10-08 2008-10-08 Method for automatically correlating pictures with descriptions obtained after inversely solving format files

Publications (2)

Publication Number Publication Date
CN101714149A true CN101714149A (en) 2010-05-26
CN101714149B CN101714149B (en) 2013-03-06

Family

ID=42417799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200810223698 Expired - Fee Related CN101714149B (en) 2008-10-08 2008-10-08 Method for automatically correlating pictures with descriptions obtained after inversely solving format files

Country Status (1)

Country Link
CN (1) CN101714149B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567291A (en) * 2010-12-31 2012-07-11 北大方正集团有限公司 Method and device for deleting lace characters in format document
CN103927533A (en) * 2014-04-11 2014-07-16 北京工业大学 Intelligent processing method for graphics and text information in early patent document scanning copy
CN108038426A (en) * 2017-11-29 2018-05-15 阿博茨德(北京)科技有限公司 The method and device of chart-information in a kind of extraction document
CN112819509A (en) * 2021-01-18 2021-05-18 上海携程商务有限公司 Method, system, electronic device and storage medium for automatically screening advertisement pictures

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1912874A (en) * 2006-08-30 2007-02-14 北京大学 Method for abstracting document data information appeared in newspaper

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567291A (en) * 2010-12-31 2012-07-11 北大方正集团有限公司 Method and device for deleting lace characters in format document
CN103927533A (en) * 2014-04-11 2014-07-16 北京工业大学 Intelligent processing method for graphics and text information in early patent document scanning copy
CN103927533B (en) * 2014-04-11 2017-03-01 北京工业大学 The intelligent processing method of graph text information in a kind of scanned document for earlier patents
CN108038426A (en) * 2017-11-29 2018-05-15 阿博茨德(北京)科技有限公司 The method and device of chart-information in a kind of extraction document
CN112819509A (en) * 2021-01-18 2021-05-18 上海携程商务有限公司 Method, system, electronic device and storage medium for automatically screening advertisement pictures
CN112819509B (en) * 2021-01-18 2024-03-26 上海携程商务有限公司 Method, system, electronic device and storage medium for automatically screening advertisement pictures

Also Published As

Publication number Publication date
CN101714149B (en) 2013-03-06

Similar Documents

Publication Publication Date Title
CN109635268B (en) Method for extracting form information in PDF file
CN104516891B (en) A kind of printed page analysis method and system
CN104298982B (en) A kind of character recognition method and device
US8285713B2 (en) Image search using face detection
CN101615252B (en) Method for extracting text information from adaptive images
EP2219153A3 (en) Resizing a digital document image via background content removal
CN103824373B (en) A kind of bill images amount of money sorting technique and system
CN103577818A (en) Method and device for recognizing image characters
CN101661559A (en) Digital image training and detecting methods
CN110163030B (en) PDF framed table extraction method based on image information
CN102882838A (en) Authentication method and system applying verification code mechanism
US20090276378A1 (en) System and Method for Identifying Document Structure and Associated Metainformation and Facilitating Appropriate Processing
CN101714149B (en) Method for automatically correlating pictures with descriptions obtained after inversely solving format files
US20150199821A1 (en) Segmentation of a multi-column document
JP2011018338A5 (en)
CN107391457B (en) Document segmentation method and device based on text line
KR101484419B1 (en) Apparatus and method for recognizing layout of electronic document
CN105608454A (en) Text structure part detection neural network based text detection method and system
CN105808691A (en) Gate vehicle retrieval method and system
CN106127042A (en) Webpage visual similarity recognition method
CN105825216A (en) Method of locating text in complex background image
CN101866418A (en) Method and equipment for determining file reading sequences
CN105930313A (en) Method and device for processing notification message
CN103678280A (en) Translation task fragmentization method
Cahn et al. Segmentation of cervical cell images.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220620

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: FOUNDER APABI TECHNOLOGY Ltd.

Address before: 100871, fangzheng building, 298 Fu Cheng Road, Beijing, Haidian District

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: FOUNDER APABI TECHNOLOGY Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130306