A method for automatically associating pictures and captions obtained by parsing a layout file
Technical field
The invention belongs to the technical field of information processing, and specifically relates to a method for automatically associating pictures and captions obtained by parsing a layout file.
Background technology
Chinese patent application No. 200710179938.4 (published 2008.06.25) discloses "A method for indexing complex layouts based on PDF". This method can extract a set of text blocks from a layout file. Each text block contains the corresponding text content, the font size, the font name, and the position of the region the block occupies, and the typesetting type of the text is calculated from that region position. The typesetting type is generally one of the following: vertical left-to-right, vertical right-to-left, vertical with no direction, horizontal left-to-right, horizontal right-to-left, and so on. The method also marks each text block as a title or as body text according to its font size, and assigns the block a sequence number, among other attributes. However, this method does not obtain picture blocks, so the association between a picture block and its textual description (that is, its caption) must be established manually, which involves a heavy workload and low efficiency.
Chinese patent application No. 200610112710.9 (published 2007.02.14) discloses "A method for extracting newspaper data information". This method extracts data from a layout file according to the layout information structure of the file, and automatically extracts the associations between articles from the layout information and the article region information of the layout file. Its drawback is that the layout file must itself store the associations within each article; if the layout file does not store them, the method no longer works.
Summary of the invention
In view of the defects in the prior art, the object of the invention is to provide a method for automatically associating pictures and captions obtained by parsing a layout file. The method automatically associates each picture obtained by parsing any layout file with the caption of that picture, reducing manual work and improving efficiency.
To achieve this object, the technical solution adopted by the invention is a method for automatically associating pictures and captions obtained by parsing a layout file, comprising the following steps:
(1) From the text block set {S} obtained by parsing the layout file, take a text block whose attribute is body text and which has not been taken before;
(2) In the picture block set {P} obtained by parsing the layout file, search for picture blocks adjacent to the text block taken in step (1). If no picture block is adjacent to the text block, go to step (3). If exactly one picture block is adjacent to the text block, make the text block a candidate caption of that picture block. If two or more picture blocks are adjacent to the text block, select the picture block with the best position and make the text block a candidate caption of that picture block;
(3) Repeat the above steps until every text block in the text block set {S} has been taken once;
(4) Determine the caption of each picture block in the picture block set {P}. If a picture block has only one candidate caption, that candidate is the caption of the picture block. If a picture block has several candidate captions, select the most suitable candidate as the caption of the picture block.
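Steps (1) to (4) amount to one pass over the text blocks followed by one pass over the picture blocks. The Python sketch below is illustrative only and is not taken from the source: the function name `associate` and the three callables `find_adjacent`, `pick_best_picture` and `pick_best_caption` are hypothetical placeholders for the adjacency test, the best-position selection and the candidate selection, and `text_blocks` is assumed to already contain only body-text blocks.

```python
def associate(text_blocks, picture_blocks,
              find_adjacent, pick_best_picture, pick_best_caption):
    """Return a dict mapping each captioned picture block to its caption.

    All three callables are hypothetical stand-ins for the sub-methods
    of the text; picture blocks must be hashable (e.g. identifiers).
    """
    candidates = {p: [] for p in picture_blocks}

    # Steps (1) and (3): visit every body-text block exactly once.
    for t in text_blocks:
        # Step (2): find the picture blocks adjacent to this text block.
        neighbours = find_adjacent(t, picture_blocks)
        if not neighbours:
            continue  # no adjacent picture block: move on to the next text block
        if len(neighbours) == 1:
            target = neighbours[0]
        else:
            target = pick_best_picture(t, neighbours)
        candidates[target].append(t)

    # Step (4): decide the caption of each picture block.
    captions = {}
    for p, cands in candidates.items():
        if len(cands) == 1:
            captions[p] = cands[0]
        elif cands:
            captions[p] = pick_best_caption(p, cands)
    return captions
```

Passing the sub-methods in as callables keeps the control flow of the four steps visible without committing to any particular geometry code.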
In the method described above, the picture blocks adjacent to a text block are searched for in the picture block set {P} in step (2) as follows: determine whether the picture block and the text block are adjacent in the horizontal direction or in the vertical direction; if they are adjacent in either direction, the picture block and the text block are adjacent.
In the method described above, determining whether the picture block and the text block are adjacent in the horizontal direction or in the vertical direction comprises the following steps:
Let the upper-left corner of the text block be $(X_1, Y_1)$ and its lower-right corner be $(X_1', Y_1')$; let the upper-left corner of the picture block be $(X_2, Y_2)$ and its lower-right corner be $(X_2', Y_2')$. The width of the text block is $W = X_1' - X_1$ and the width of the picture block is $W' = X_2' - X_2$; the height of the text block is $H = Y_1' - Y_1$ and the height of the picture block is $H' = Y_2' - Y_2$. The average font size of all text blocks is AvgFontSize. The effective distance between a caption and a picture block is $\mathrm{DistThreshold} = C_1 \times \mathrm{AvgFontSize}$, where $C_1$ is the spacing ratio between the text block and the picture block, $1 < C_1 < 5$. Below, min denotes the smaller of two values and max denotes the larger; $D$ is the extension distance, $0 \le D \le 10$, in points;
1. Calculate the degree of overlap between the picture block and the text block:
The degree of overlap in the horizontal direction, OverlapX, is calculated as
$\mathrm{OverlapX} = \dfrac{\min(X_1', X_2') - \max(X_1, X_2)}{\max(X_1', X_2') - \min(X_1, X_2)}$,
The degree of overlap in the vertical direction, OverlapY, is calculated as
$\mathrm{OverlapY} = \dfrac{\min(Y_1', Y_2') - \max(Y_1, Y_2)}{\max(Y_1', Y_2') - \min(Y_1, Y_2)}$;
2. Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $X_1 \ge X_2 - D$ and $X_1' \le X_2' + D$ all hold. If they do, further determine whether OverlapY is greater than OverlapX: if it is, the picture block and the text block are adjacent in the horizontal direction; otherwise they are adjacent in the vertical direction. If the conditions do not all hold, calculate the horizontal distance DistX between the picture block and the text block, $\mathrm{DistX} = \max(X_1, X_2) - \min(X_1', X_2')$;
3. Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $W < W'$ and $\mathrm{DistX} < \mathrm{DistThreshold}$ all hold. If they do, the picture block and the text block are adjacent in the horizontal direction. Otherwise, calculate the maximum distance DistXMax between the picture block and the text block: if $X_1 < X_2$, then $\mathrm{DistXMax} = X_2 - X_1$; otherwise $\mathrm{DistXMax} = |X_2' - X_1'|$;
4. Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $W < W'$ and $\mathrm{DistXMax} < W'/2$ all hold. If they do, the picture block and the text block are adjacent in the horizontal direction. Otherwise, calculate DistY, $\mathrm{DistY} = \max(Y_1, Y_2) - \min(Y_1', Y_2')$;
5. Determine whether all of the following hold: $X_1 \ge X_2 - D$, $X_1' \le X_2' + D$, $H < H'$, the typesetting type of the text block is horizontal left-to-right or horizontal right-to-left, and $\mathrm{DistY} < \mathrm{DistThreshold}$. If they do, the picture block and the text block are adjacent in the vertical direction; otherwise the picture block and the text block are adjacent neither in the horizontal direction nor in the vertical direction.
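The five-step test can be read as a cascade of checks that returns as soon as one of them succeeds. The following Python sketch is one possible reading under stated assumptions, not a definitive implementation: blocks are plain `(x1, y1, x2, y2)` tuples holding the upper-left and lower-right corners, the function name `adjacency` and the boolean `horizontal_text` (true when the typesetting type is horizontal left-to-right or right-to-left) are hypothetical, and boxes are assumed non-degenerate so the overlap denominators are non-zero.

```python
def adjacency(text, pic, horizontal_text, dist_threshold, d=3.0):
    """Return 'horizontal', 'vertical' or None for one text/picture pair."""
    tx1, ty1, tx2, ty2 = text   # (X1, Y1, X1', Y1')
    px1, py1, px2, py2 = pic    # (X2, Y2, X2', Y2')
    w, w_pic = tx2 - tx1, px2 - px1
    h, h_pic = ty2 - ty1, py2 - py1

    # Step 1: degree of overlap along each axis.
    overlap_x = (min(tx2, px2) - max(tx1, px1)) / (max(tx2, px2) - min(tx1, px1))
    overlap_y = (min(ty2, py2) - max(ty1, py1)) / (max(ty2, py2) - min(ty1, py1))

    # Text block inside the picture block's extent, widened by D.
    within_y = ty1 >= py1 - d and ty2 <= py2 + d
    within_x = tx1 >= px1 - d and tx2 <= px2 + d

    # Step 2: text block lies (almost) inside the picture block.
    if within_y and within_x:
        return "horizontal" if overlap_y > overlap_x else "vertical"
    dist_x = max(tx1, px1) - min(tx2, px2)

    # Step 3: narrow text block close to the picture block's side.
    if within_y and w < w_pic and dist_x < dist_threshold:
        return "horizontal"
    dist_x_max = px1 - tx1 if tx1 < px1 else abs(px2 - tx2)

    # Step 4: same vertical extent, bounded horizontal offset.
    if within_y and w < w_pic and dist_x_max < w_pic / 2:
        return "horizontal"
    dist_y = max(ty1, py1) - min(ty2, py2)

    # Step 5: horizontally typeset text just above or below the picture.
    if within_x and h < h_pic and horizontal_text and dist_y < dist_threshold:
        return "vertical"
    return None
```

For example, with a picture block at `(100, 100, 300, 250)` and a threshold of 14.4, a text block at `(100, 255, 300, 270)` falls through to step 5 and is reported as vertically adjacent, while a text block at `(40, 120, 95, 200)` is caught by step 3 as horizontally adjacent.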
In the method described above, the value of $C_1$ is 1.2 and the value of $D$ is 3.
In the method described above, selecting the picture block with the best position in step (2) comprises the following steps:
Let the text block taken in step (1) be T, the set of picture blocks adjacent to T be {TP}, and the picture block with the best position be PZ;
1. For every picture block in {TP}, calculate its caption type PicType and the distance Dist between T and that picture block. The caption type of a picture block is the position of the text block relative to the picture block: the text block is above the picture block, to the left of the picture block, to the right of the picture block, or below the picture block;
2. Take an arbitrary picture block P from {TP} and delete it from {TP}; let PZ = P;
3. Take an arbitrary picture block PN from {TP} and delete it from {TP}; select the better-positioned picture block of PZ and PN, and if PN is better positioned, let PZ = PN;
The better-positioned picture block of PZ and PN is selected as follows: let the caption type of PZ be PicTypeZ, the caption type of PN be PicTypeN, the distance between T and PZ be DistZ, and the distance between T and PN be DistN;
PN is better positioned than PZ if any one of the following conditions is satisfied:
Condition a. PicTypeN is the same as PicTypeZ, and DistN < DistZ,
Condition b. PicTypeN is "text block to the right of the picture block", PicTypeZ is "text block to the left of the picture block", and DistN < DistZ,
Condition c. The priority of PicTypeN is higher than that of PicTypeZ, and it is not the case that PicTypeN is "text block to the right of the picture block" while PicTypeZ is "text block to the left of the picture block". Here, "text block below the picture block" has a higher priority than "text block to the left of the picture block" and "text block to the right of the picture block", and "text block to the left of the picture block" or "text block to the right of the picture block" has a higher priority than "text block above the picture block";
4. Determine whether {TP} is empty. If it is empty, PZ is the picture block with the best position; otherwise go to step 3.
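The selection loop and conditions a to c can be sketched as below. This is a hedged illustration rather than the definitive procedure: the names `PRIORITY`, `is_better` and `best_picture` are hypothetical, caption types are encoded as the strings `"above"`, `"left"`, `"right"` and `"below"`, and left and right are assumed to share one priority level because the text only ranks them relative to above and below.

```python
# Assumed numeric encoding of the stated priority order:
# below > left/right > above.
PRIORITY = {"below": 2, "left": 1, "right": 1, "above": 0}

def is_better(cand, best):
    """True if candidate PN = (PicTypeN, DistN) beats PZ = (PicTypeZ, DistZ)."""
    type_n, dist_n = cand
    type_z, dist_z = best
    right_vs_left = type_n == "right" and type_z == "left"
    cond_a = type_n == type_z and dist_n < dist_z
    cond_b = right_vs_left and dist_n < dist_z
    cond_c = PRIORITY[type_n] > PRIORITY[type_z] and not right_vs_left
    return cond_a or cond_b or cond_c

def best_picture(candidates):
    """candidates: non-empty dict of picture id -> (PicType, Dist)."""
    remaining = list(candidates.items())
    pz, best = remaining.pop()          # step 2: take any block as PZ
    while remaining:                    # step 4: stop when {TP} is empty
        pn, cand = remaining.pop()      # step 3: take any block as PN
        if is_better(cand, best):
            pz, best = pn, cand
    return pz
```

Because the blocks are taken in arbitrary order and conditions b and c treat left and right asymmetrically, the result can depend on the order of traversal in some left-versus-right cases; the sketch makes no attempt to normalise that.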
In the method described above, calculating the caption type of a picture block and the distance between the text block and the picture block in step 1 comprises the following steps:
If the text block and the picture block are adjacent in the horizontal direction:
A. Calculate the abscissa of the center of the text block, $\mathrm{CenterT} = (X_1 + X_1')/2$, and the abscissa of the center of the picture block, $\mathrm{CenterPic} = (X_2 + X_2')/2$;
B. Determine whether $\mathrm{CenterT} < \mathrm{CenterPic}$ holds. If it does, PicType is "text block to the left of the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = X_2 - \mathrm{Center}$; if it does not, PicType is "text block to the right of the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = \mathrm{Center} - X_2'$;
If the text block and the picture block are adjacent in the vertical direction:
A. Calculate the ordinate of the center of the text block, $\mathrm{CenterT} = (Y_1 + Y_1')/2$, and the ordinate of the center of the picture block, $\mathrm{CenterPic} = (Y_2 + Y_2')/2$;
B. Determine whether $\mathrm{CenterT} < \mathrm{CenterPic}$ holds. If it does, PicType is "text block above the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = Y_2 - \mathrm{Center}$; if it does not, PicType is "text block below the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = \mathrm{Center} - Y_2'$.
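A short Python sketch of this calculation follows. It rests on two assumptions that are not confirmed by the text: the symbol Center in the distance formulas is read as the text block center CenterT, and blocks are `(x1, y1, x2, y2)` tuples with the Y axis growing downward (so a smaller ordinate means higher on the page). The function name `caption_type_and_distance` and the string labels are hypothetical.

```python
def caption_type_and_distance(text, pic, direction):
    """Return (PicType, Dist) for a text block adjacent to a picture block.

    direction is 'horizontal' or 'vertical', i.e. the axis along which
    the two blocks were found to be adjacent.
    """
    tx1, ty1, tx2, ty2 = text
    px1, py1, px2, py2 = pic

    if direction == "horizontal":
        center_t = (tx1 + tx2) / 2
        center_pic = (px1 + px2) / 2
        if center_t < center_pic:
            return "left", px1 - center_t    # Dist = X2 - Center (assumed CenterT)
        return "right", center_t - px2       # Dist = Center - X2'

    if direction == "vertical":
        center_t = (ty1 + ty2) / 2
        center_pic = (py1 + py2) / 2
        if center_t < center_pic:
            return "above", py1 - center_t   # Dist = Y2 - Center (assumed CenterT)
        return "below", center_t - py2       # Dist = Center - Y2'

    raise ValueError("direction must be 'horizontal' or 'vertical'")
```

Under this reading, Dist measures how far the text block's center lies from the nearest edge of the picture block, so a caption whose center sits inside the picture block gets a negative distance.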
In the method described above, when a picture block has several candidate captions in step (4), selecting the most suitable text block as the caption of the picture block comprises the following steps:
Let the set of candidate captions of a picture block be {L};
1. Merge the text blocks in {L} that have the same caption type into one text block. The degree of overlap of the merged text block is the sum of the degrees of overlap between the merged text blocks and the picture block, and its weight is the number of text blocks merged;
2. From the merged {L}, select the text block with the largest weight as the caption of the picture block. If several text blocks share the largest weight, compare their degrees of overlap with the picture block and take the one with the largest degree of overlap as the caption of the picture block.
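The merge-and-select rule can be sketched in a few lines of Python. The sketch is illustrative only: the function name `choose_caption` and the input shape (a list of `(block_id, pic_type, overlap)` tuples) are assumptions, and the per-block degree of overlap is taken as a given number because the text does not say which of OverlapX or OverlapY is used here.

```python
from collections import defaultdict

def choose_caption(candidates):
    """Pick the caption group for one picture block.

    candidates: list of (block_id, pic_type, overlap) tuples.
    Returns (pic_type, [block_ids]) for the winning merged block.
    """
    # Step 1: merge candidates that share a caption type.
    groups = defaultdict(lambda: {"ids": [], "overlap": 0.0})
    for block_id, pic_type, overlap in candidates:
        groups[pic_type]["ids"].append(block_id)
        groups[pic_type]["overlap"] += overlap   # summed degree of overlap

    # Step 2: largest weight (block count) wins; ties are broken by
    # the larger summed degree of overlap.
    best_type = max(
        groups,
        key=lambda t: (len(groups[t]["ids"]), groups[t]["overlap"]),
    )
    return best_type, groups[best_type]["ids"]
```

Comparing `(weight, overlap)` tuples expresses the two-level rule directly: Python compares the weight first and only falls back to the overlap when the weights are equal.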
By calculating the positional relationships between the text blocks and the picture blocks obtained by parsing a layout file, the method of the invention can automatically associate a picture with its caption without needing to understand the layout information structure of the layout file, which reduces the workload of manual confirmation and association and improves efficiency.
Description of drawings
Fig. 1 is a flowchart of the method of the invention;
Fig. 2 is a schematic diagram of the positional relationships between the text blocks and the picture block in the embodiment;
Fig. 3 is a flowchart of selecting the picture block with the best position in the embodiment when a text block has two or more adjacent picture blocks.
Embodiment
The invention is described below with reference to the embodiment and the drawings.
In the invention, a caption is one or more text blocks that describe a picture block. Captions have different types: caption above the picture block, caption to the left of the picture block, caption to the right of the picture block, and caption below the picture block. The caption type is determined by the position of the text block relative to the center point of the picture block: a caption above the picture block lies above the center point of the picture block, a caption to the left of the picture block lies to the left of the center point, a caption to the right of the picture block lies to the right of the center point, and a caption below the picture block lies below the center point, including the case where the caption lies inside the picture block. In Fig. 2, text blocks 1 and 2 are above the picture block, text blocks 3 and 4 are below the picture block, text blocks 5, 6 and 7 are to the left of the picture block, and text blocks 8 and 9 are to the right of the picture block.
Fig. 1 shows the flow of the method of the invention for automatically associating pictures and captions obtained by parsing a layout file, which comprises the following steps.
Let the text block set obtained by parsing the layout file be {S} and the picture block set be {P}.
(1) From the text block set {S}, take a text block whose attribute is body text and which has not been taken before.
(2) In the picture block set {P} obtained by parsing the layout file, search for picture blocks adjacent to the text block taken in step (1). If no picture block is adjacent to the text block, go to step (3). If exactly one picture block is adjacent to the text block, make the text block a candidate caption of that picture block. If two or more picture blocks are adjacent to the text block, select the picture block with the best position and make the text block a candidate caption of that picture block.
The picture blocks adjacent to a text block are searched for in the picture block set {P} as follows: determine whether the picture block and the text block are adjacent in the horizontal direction or in the vertical direction; if they are adjacent in either direction, the picture block and the text block are adjacent. In Fig. 2, text blocks 1, 2, 3 and 4 are adjacent to the picture block in the vertical direction, and text blocks 5, 6, 7, 8 and 9 are adjacent to it in the horizontal direction. The determination is made as described below.
Let the upper-left corner of the text block be $(X_1, Y_1)$ and its lower-right corner be $(X_1', Y_1')$; let the upper-left corner of the picture block be $(X_2, Y_2)$ and its lower-right corner be $(X_2', Y_2')$. The width of the text block is $W = X_1' - X_1$ and the width of the picture block is $W' = X_2' - X_2$; the height of the text block is $H = Y_1' - Y_1$ and the height of the picture block is $H' = Y_2' - Y_2$. The average font size of all text blocks is AvgFontSize. The effective distance between a caption and a picture block is $\mathrm{DistThreshold} = C_1 \times \mathrm{AvgFontSize}$, where $C_1$ is the spacing ratio between the text block and the picture block, $1 < C_1 < 5$; in this embodiment $C_1 = 1.2$. Below, min denotes the smaller of two values and max denotes the larger. $D$ is the extension distance, that is, the distance by which the text block is allowed to exceed the width or height of the picture block, $0 \le D \le 10$, in points. In this embodiment $D$ is 3; the value of $D$ can be adjusted within this range.
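The quantities defined above map naturally onto a small value type. The Python sketch below is only one way to hold them and is not prescribed by the source: the class name `Block`, its field names and the helper `dist_threshold` are hypothetical, and the font sizes are assumed to be given in points.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Block:
    """Axis-aligned region of a text block or picture block."""
    x1: float  # upper-left X  (X1 or X2 in the text)
    y1: float  # upper-left Y  (Y1 or Y2)
    x2: float  # lower-right X (X1' or X2')
    y2: float  # lower-right Y (Y1' or Y2')

    @property
    def width(self) -> float:
        return self.x2 - self.x1   # W or W'

    @property
    def height(self) -> float:
        return self.y2 - self.y1   # H or H'

def dist_threshold(font_sizes, c1=1.2):
    """DistThreshold = C1 * AvgFontSize, with C1 defaulting to 1.2."""
    avg_font_size = sum(font_sizes) / len(font_sizes)
    return c1 * avg_font_size
```

For instance, text blocks set in 10, 12 and 14 point type give an average font size of 12 and therefore a threshold of about 14.4 points with $C_1 = 1.2$.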
Determining whether the picture block and the text block are adjacent in the horizontal direction or in the vertical direction comprises the following steps.
1. Calculate the degree of overlap between the picture block and the text block:
The degree of overlap in the horizontal direction, OverlapX, is calculated as
$\mathrm{OverlapX} = \dfrac{\min(X_1', X_2') - \max(X_1, X_2)}{\max(X_1', X_2') - \min(X_1, X_2)}$;
The degree of overlap in the vertical direction, OverlapY, is calculated as
$\mathrm{OverlapY} = \dfrac{\min(Y_1', Y_2') - \max(Y_1, Y_2)}{\max(Y_1', Y_2') - \min(Y_1, Y_2)}$.
2. Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $X_1 \ge X_2 - D$ and $X_1' \le X_2' + D$ all hold. If they do, further determine whether OverlapY is greater than OverlapX: if it is, the picture block and the text block are adjacent in the horizontal direction; otherwise they are adjacent in the vertical direction. If the conditions do not all hold, calculate the horizontal distance DistX between the picture block and the text block, $\mathrm{DistX} = \max(X_1, X_2) - \min(X_1', X_2')$.
3. Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $W < W'$ and $\mathrm{DistX} < \mathrm{DistThreshold}$ all hold. If they do, the picture block and the text block are adjacent in the horizontal direction. Otherwise, calculate the maximum distance DistXMax between the picture block and the text block: if $X_1 < X_2$, then $\mathrm{DistXMax} = X_2 - X_1$; otherwise $\mathrm{DistXMax} = |X_2' - X_1'|$.
4. Determine whether $Y_1 \ge Y_2 - D$, $Y_1' \le Y_2' + D$, $W < W'$ and $\mathrm{DistXMax} < W'/2$ all hold. If they do, the picture block and the text block are adjacent in the horizontal direction. Otherwise, calculate DistY, $\mathrm{DistY} = \max(Y_1, Y_2) - \min(Y_1', Y_2')$.
5. Determine whether all of the following hold: $X_1 \ge X_2 - D$, $X_1' \le X_2' + D$, $H < H'$, the typesetting type of the text block is horizontal left-to-right or horizontal right-to-left, and $\mathrm{DistY} < \mathrm{DistThreshold}$. If they do, the picture block and the text block are adjacent in the vertical direction; otherwise the picture block and the text block are adjacent neither in the horizontal direction nor in the vertical direction.
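As a small worked example of the degree-of-overlap formula, the Python sketch below evaluates it along one axis. The helper name `overlap_1d` and the coordinates are hypothetical and chosen only for illustration: a picture block spanning X from 100 to 300 and Y from 100 to 250, with a text block of the same width placed just below it, spanning Y from 255 to 270.

```python
def overlap_1d(a1, a2, b1, b2):
    """Degree of overlap of intervals [a1, a2] and [b1, b2] along one axis.

    Intersection length divided by the length of the combined extent,
    exactly as in the OverlapX and OverlapY formulas.
    """
    return (min(a2, b2) - max(a1, b1)) / (max(a2, b2) - min(a1, b1))

# Horizontal axis: both blocks span 100..300, so the overlap is complete.
overlap_x = overlap_1d(100, 300, 100, 300)

# Vertical axis: text 255..270 lies below picture 100..250, leaving a
# 5-point gap, so the numerator (and the result) is negative.
overlap_y = overlap_1d(255, 270, 100, 250)
```

Here `overlap_x` is 1.0 and `overlap_y` is -5/170, roughly -0.029; a negative value simply indicates that the two extents do not intersect along that axis.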
Fig. 3 shows the flow of selecting the picture block with the best position when a text block has two or more adjacent picture blocks, which comprises the following steps. Let the text block taken in step (1) be T, the set of picture blocks adjacent to T be {TP}, and the picture block with the best position be PZ.
1. For every picture block in {TP}, calculate its caption type PicType and the distance Dist between T and that picture block.
If the text block and the picture block are adjacent in the horizontal direction:
A. Calculate the abscissa of the center of the text block, $\mathrm{CenterT} = (X_1 + X_1')/2$, and the abscissa of the center of the picture block, $\mathrm{CenterPic} = (X_2 + X_2')/2$;
B. Determine whether $\mathrm{CenterT} < \mathrm{CenterPic}$ holds. If it does, PicType is "text block to the left of the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = X_2 - \mathrm{Center}$; if it does not, PicType is "text block to the right of the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = \mathrm{Center} - X_2'$.
If the text block and the picture block are adjacent in the vertical direction:
A. Calculate the ordinate of the center of the text block, $\mathrm{CenterT} = (Y_1 + Y_1')/2$, and the ordinate of the center of the picture block, $\mathrm{CenterPic} = (Y_2 + Y_2')/2$;
B. Determine whether $\mathrm{CenterT} < \mathrm{CenterPic}$ holds. If it does, PicType is "text block above the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = Y_2 - \mathrm{Center}$; if it does not, PicType is "text block below the picture block" and the distance between the text block and the picture block is $\mathrm{Dist} = \mathrm{Center} - Y_2'$.
2. Take an arbitrary picture block P from {TP} and delete it from {TP}; let PZ = P.
3. Take an arbitrary picture block PN from {TP} and delete it from {TP}; select the better-positioned picture block of PZ and PN, and if PN is better positioned, let PZ = PN.
The better-positioned picture block of PZ and PN is selected as follows: let the caption type of PZ be PicTypeZ, the caption type of PN be PicTypeN, the distance between T and PZ be DistZ, and the distance between T and PN be DistN;
PN is better positioned than PZ if any one of the following conditions is satisfied:
Condition a. PicTypeN is the same as PicTypeZ, and DistN < DistZ;
Condition b. PicTypeN is "text block to the right of the picture block", PicTypeZ is "text block to the left of the picture block", and DistN < DistZ;
Condition c. The priority of PicTypeN is higher than that of PicTypeZ, and it is not the case that PicTypeN is "text block to the right of the picture block" while PicTypeZ is "text block to the left of the picture block". Here, "text block below the picture block" has a higher priority than "text block to the left of the picture block" and "text block to the right of the picture block", and "text block to the left of the picture block" or "text block to the right of the picture block" has a higher priority than "text block above the picture block".
4. Determine whether {TP} is empty. If it is empty, PZ is the picture block with the best position; otherwise go to step 3.
(3) Repeat step (1) and step (2) until every text block in the text block set {S} has been taken once.
(4) Determine the caption of each picture block in the picture block set {P}. If a picture block has only one candidate caption, that candidate is the caption of the picture block. If a picture block has several candidate captions, select the most suitable candidate as the caption of the picture block, and put the other candidate captions of this picture block back into the text block set {S}.
When a picture block has several candidate captions, selecting the most suitable candidate comprises the following steps. Let the set of candidate captions of a picture block be {L}.
1. Merge the text blocks in {L} that have the same caption type into one text block. The degree of overlap between the merged text block and the picture block is the sum of the degrees of overlap between the merged text blocks and the picture block, and its weight is the number of text blocks merged. In Fig. 2, text block 1 and text block 2 are merged into one text block; the degree of overlap between the merged text block and the picture block is the sum of the degree of overlap between text block 1 and the picture block and that between text block 2 and the picture block, and the weight of the merged text block is 2. Text block 3 and text block 4 are merged into one text block whose degree of overlap is the sum of the degrees of overlap of text blocks 3 and 4 with the picture block, and whose weight is 2. Text blocks 5, 6 and 7 are merged into one text block whose degree of overlap is the sum of the degrees of overlap of text blocks 5, 6 and 7 with the picture block, and whose weight is 3. Text block 8 and text block 9 are merged into one text block whose degree of overlap is the sum of the degrees of overlap of text blocks 8 and 9 with the picture block, and whose weight is 2.
2. From the merged {L}, select the text block with the largest weight as the caption of the picture block. If several text blocks share the largest weight, compare their degrees of overlap with the picture block and take the one with the largest degree of overlap as the caption of the picture block. In Fig. 2, the text block merged from text blocks 5, 6 and 7 has the largest weight, so it is taken as the caption of the picture block.
The method of the invention is not limited to the embodiment described above; other embodiments derived by those skilled in the art from the technical solution of the invention likewise fall within the scope of technical innovation of the invention.