CN101876967B - Method for generating PDF text paragraphs

Info

Publication number: CN101876967B
Application number: CN2010101363998A
Authority: CN
Inventors: 晏检平
Original assignee: Shenzhen Wondershare Software Co Ltd
Current assignee: Wondershare Technology Co ltd
Priority date: 2010-03-25
Filing date: 2010-03-25
Publication date: 2012-05-02
Anticipated expiration: 2030-03-25
Also published as: CN101876967A

Abstract

The invention relates to a method for generating PDF text paragraphs, comprising the following steps: A. identifying and extracting the text blocks of a PDF text; B. removing text blocks that are repeated in different layers, and determining text lines, which form a text line set; C. dividing the text line set in the horizontal direction to obtain one or more first texts, which form a first text set, then dividing each first text in the first text set in the vertical direction to obtain one or more second texts, and extracting the blank areas between the second texts to form a blank area set; D. merging adjacent first texts in the first text set to obtain a text typesetting row; and E. dividing the merged text typesetting row to form text typesetting columns and text paragraphs. The text structure produced by the method is easily converted to RTF, gives good results, and is highly editable; in addition, the method typesets automatically, so no manual intervention is required.

Description

Method for generating PDF text paragraphs

Technical field

The present invention relates to information technology and, more particularly, to a method for generating PDF text paragraphs.

Background art

The Portable Document Format (PDF) is a file format developed by Adobe Systems in 1993 for exchanging files. Its advantages are that it is cross-platform, preserves the original layout of a document, and is an open standard. A PDF file records the exact position of each text element but no relationship between the text elements, so the format is difficult to edit.

Owing to these characteristics, PDF has become the preferred file format for distributing electronic documents and formatted information on the Internet. At present, most technical papers published on the Internet are submitted as PDF. However, PDF focuses on describing the printed form of a document; it does not describe the data structure of the original document and is difficult to edit. To quote content from a third-party PDF text, the current practice is to copy the text out manually and then typeset and edit it by hand in another tool such as Word, which is time-consuming and laborious.

At present, typesetting software is generally used to export an XML file that contains the content of the PDF article. The output differs from one typesetting program to another, but most typesetting software does not export the position information of the text blocks, so the information about the PDF article is incomplete and often has to be supplemented manually, which is very inefficient. Because most typesetting software can generate PDF files and a large amount of historical data is in PDF, PDF-based parsing has a very wide range of applications. For example, the patent application with publication number CN1687926A discloses "a system and method for extracting PDF text information based on XML", which mainly converts the physical structure of a PDF text into a logical structure but does not organise the text into paragraphs or articles. As another example, the patent application with publication number CN1776673A discloses "a method for converting PDF text to an XML document", in which a third-party tool converts the PDF into a flat XML document and XSLT rules are then used to extract the information from the XML. Its precondition is that the PDF page itself is fairly simple and consistently structured, so that simple XPath rules suffice to extract the XML information; it is not applicable to complex multi-column pages. As another example, the patent application with publication number CN160403A discloses "a method for recovering the reading order of words on a newspaper page", which organises PDF text into articles but does not cover the rules for generating and merging text blocks or the overall flow for extracting information such as content and position. As yet another example, the patent application with publication number CN101206639 discloses "a method for indexing PDF data", which extracts text blocks and articles from PDFs with complex pages; however, its process merges the fonts and font sizes of text blocks, which distorts the converted result, and after the text blocks are merged, manual intervention is needed to identify the relationships between blocks. It provides no rules for identifying the logical relationships between blocks automatically, and the generated result cannot be written to rich text format.

Summary of the invention

The technical problem to be solved by the present invention is that, in the prior art, quoting a third-party PDF text for editing and typesetting is time-consuming and laborious. To address this defect, the invention provides a method for generating PDF text paragraphs with which editing and typesetting can be completed with less time and effort.

The technical solution adopted by the present invention to solve this technical problem is a method for generating PDF text paragraphs, comprising:

A. identifying and extracting the text blocks of a PDF text;

B. removing text blocks that are repeated in different layers, and determining text lines, the determined text lines forming a text line set;

C. dividing the text line set in the horizontal direction to obtain one or more first texts, which form a first text set; then dividing each first text in the first text set in the vertical direction to obtain one or more second texts, which form a second text set; and extracting the blank areas between the second texts in the second text set to form a blank area set;

D. merging adjacent first texts in the first text set to obtain a text typesetting row;

E. dividing the merged text typesetting row to form text typesetting columns and text paragraphs; wherein,

In step A, identifying the text blocks of the PDF text comprises:

A1. determining whether a character in the PDF text is an English character; if so, executing step A3; if not, executing step A2;

A2. treating the character as one text block;

A3. determining whether the spacing between two adjacent characters is less than the product of the font size and a first width coefficient, and whether the two adjacent characters have the same font, font size, and color; if so, the two adjacent characters belong to the same text block; if not, they do not belong to the same text block;

Step B comprises:

B1. obtaining two text blocks with the same index value or adjacent index values; if the two text blocks have identical content and the spacing between them is less than the product of the font size and a second width coefficient, deleting one of them and putting the remaining text block into the text block set;

B2. creating an empty text line and putting the text blocks of the text block set into it in order of array index value, so as to generate the text line set; within the same text line, two text blocks with the same index value or adjacent index values satisfy the following: the baseline distance between the two text blocks is less than the product of the font size and a first baseline distance coefficient, and the horizontal spacing between the two text blocks is less than the product of the font size and a second baseline distance coefficient;

Step C comprises:

C1. sorting the text lines in the text line set in ascending order of upper boundary value;

C2. comparing adjacent text lines one by one; if the projections of two adjacent text lines onto the Y axis intersect, putting them into the same first text;

C3. sorting the text lines in each first text in ascending order of left boundary value;

C4. comparing adjacent text lines one by one; if the projections of two adjacent text lines onto the X axis intersect, putting them into the same second text;

C5. extracting the blank areas between the second texts in the second text set to form the blank area set;

Step D comprises:

performing a first merge of adjacent first texts in the first text set, the condition for the first merge being that the two adjacent first texts have the same number of blank areas and that corresponding blank areas intersect in their projections onto the X axis;

after the first merge, performing a second merge to obtain the text typesetting row, the condition for the second merge being that the blank areas of two adjacent first texts intersect in their projections onto the X axis;

Step E comprises:

E1. sorting the text lines in the text typesetting row in ascending order of left boundary value;

E2. creating a new text typesetting column and taking text lines out of the text typesetting row one at a time, in order;

E3. determining whether the projection onto the X axis of the text line taken out intersects that of the newly created text typesetting column; if so, going to step E4; if not, going to step E2;

E4. putting the text line taken out into the newly created text typesetting column, in order;

E5. sorting the text lines in the text typesetting column in ascending order of upper boundary value;

E6. creating a new text paragraph and taking text lines out of the text typesetting column one at a time, in order;

E7. determining whether two adjacent text lines satisfy a preset paragraph condition; if so, going to step E8; if not, going to step E6;

E8. putting the two adjacent text lines into the same text paragraph, in order.

In the method for generating PDF text paragraphs of the present invention, in step A, text block sets are created for the four rotation angles 0, 90, 180, and 270 degrees, and an array index is built incrementally to extract the text blocks of the PDF text.

In the method for generating PDF text paragraphs of the present invention, each extracted text block includes the baseline, bounding rectangle, font, font size, color, and angle of the text block.

In the method for generating PDF text paragraphs of the present invention, the preset paragraph condition comprises the following conditions:

(a) the height difference between the text lines is less than the product of the font size and a height coefficient; and

(b) the vertical spacing between the text lines is less than the product of the average font height and a paragraph coefficient; and

(c) the difference between the widths of the text lines is less than the product of the font size and a width coefficient; or,

if the left boundary values of the two text lines are identical, the width of the preceding text line is greater than the width of the following text line; or,

if the right boundary values of the two text lines are identical, the width of the preceding text line is less than the width of the following text line.

Implementing the method for generating PDF text paragraphs of the present invention has the following beneficial effects: the text structure produced by the method is easily converted to RTF (Rich Text Format), gives good results, and is highly editable; in addition, the method typesets automatically without manual intervention, saving time and effort.

Description of drawings

The present invention is further described below with reference to the accompanying drawings and embodiments, in which:

Fig. 1 is a flowchart of a first embodiment of the method for generating PDF text paragraphs of the present invention;

Fig. 2 is a flowchart of a first embodiment of identifying the text blocks of the PDF text in step S100 of Fig. 1;

Fig. 3 is a flowchart of a first embodiment of step S200 in Fig. 1;

Fig. 4 is a flowchart of a first embodiment of step S300 in Fig. 1;

Fig. 5 is a flowchart of a first embodiment of step S500 in Fig. 1.

Embodiment

The present invention addresses the defect in the prior art that editing content quoted from a third-party PDF text is time-consuming and laborious. It provides a method for generating PDF text paragraphs that saves time and effort when editing.

Before the method is described in detail, several technical terms are introduced:

Text block: in an English-language environment, a text block is usually one English word; in a non-English environment, a text block is usually one character. Text blocks run in one of four directions: left to right, top to bottom, right to left, and bottom to top.

Text line: one or more text blocks on the same line form a text line.

Text paragraph: one or more adjacent text lines form a text paragraph; paragraphs are generally separated by a blank line.

Text typesetting column: one or more text paragraphs, from top to bottom, form a text typesetting column.

Text typesetting row: one or more text typesetting columns form a text typesetting row.

Page: one or more text typesetting rows combine to form a page.

An ordinary article usually consists of one text typesetting row containing one text typesetting column, which in turn contains multiple text paragraphs.

A two-column article usually consists of one text typesetting row containing two text typesetting columns, each of which contains multiple text paragraphs.

A relatively complex page usually consists of multiple text typesetting rows; each of these rows contains one or more text typesetting columns, and each column in turn contains one or more text paragraphs.
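The containment hierarchy defined by these terms can be pictured as nested records. The following Python sketch is only an illustration: the class names and the helper `two_column_page` are assumptions introduced here, and text blocks are reduced to plain strings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextLine:
    """Text blocks that sit on the same line (blocks kept as plain strings here)."""
    blocks: List[str] = field(default_factory=list)


@dataclass
class Paragraph:
    """One or more adjacent text lines."""
    lines: List[TextLine] = field(default_factory=list)


@dataclass
class TypesettingColumn:
    """Paragraphs stacked from top to bottom."""
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class TypesettingRow:
    """One or more columns placed side by side."""
    columns: List[TypesettingColumn] = field(default_factory=list)


@dataclass
class Page:
    """One or more typesetting rows."""
    rows: List[TypesettingRow] = field(default_factory=list)


def two_column_page(left: List[Paragraph], right: List[Paragraph]) -> Page:
    """Build the two-column case: a single row that holds two columns."""
    row = TypesettingRow(columns=[TypesettingColumn(left), TypesettingColumn(right)])
    return Page(rows=[row])
```

An ordinary single-column article would be the same structure with one column in its only row.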

As shown in Fig. 1, in the flowchart of the first embodiment of the method for generating PDF text paragraphs of the present invention, the method comprises the following steps:

Step S100. Identify and extract the text blocks of the PDF text. Specifically, in this step, text block sets are first created for the four rotation angles 0, 90, 180, and 270 degrees; the four sets correspond to four forms of PDF text. For example, PDF text with a rotation angle of 0 degrees, which is also the most common kind, is laid out from top to bottom and typeset from left to right; its text blocks run from left to right, that is, the font angle is 0 degrees. The text formats for the other rotation angles follow by analogy and are not described further here. Then, within each of the four text block sets, an array index is built incrementally to extract the text blocks of the PDF text. Each extracted text block includes its baseline, bounding rectangle, font, font size, color, and angle, where the baseline is used to locate the text block and the bounding rectangle is used to determine the coordinates of the text block on the X and Y axes. The array index is related to the baseline of the line on which the text block lies;

Step S200. Remove text blocks that are repeated in different layers, and determine the text lines; the determined text lines form the text line set. Because a page may be rendered in layers when the PDF text is converted, with the layers showing identical content, the text blocks repeated in different layers need to be deleted;

Step S300. Divide the text line set in the horizontal direction to obtain one or more first texts, which form a first text set; then divide each first text in the first text set in the vertical direction to obtain one or more second texts, which form a second text set; and extract the blank areas between the second texts in the second text set to form a blank area set;

Step S400. Merge adjacent first texts in the first text set to obtain a text typesetting row;

Step S500. Divide the merged text typesetting row to form text typesetting columns and text paragraphs.
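The text block record extracted in step S100 and the four rotation-angle sets can be sketched as follows. This is a minimal sketch under stated assumptions: the names `TextBlock`, `new_block_sets`, and `add_block` are hypothetical, and the array index is simply assigned in insertion order, whereas the text only says that the index is related to the baseline of the block's line.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

ROTATIONS = (0, 90, 180, 270)  # the four rotation angles named in step S100


@dataclass
class TextBlock:
    text: str
    baseline: float                          # used to locate the block
    rect: Tuple[float, float, float, float]  # bounding rectangle: (x_left, y_top, x_right, y_bottom)
    font: str
    font_size: float
    color: str
    angle: int                               # one of ROTATIONS
    index: int = -1                          # array index, assigned when the block is filed


def new_block_sets() -> Dict[int, List[TextBlock]]:
    """Create one empty text block set per rotation angle."""
    return {angle: [] for angle in ROTATIONS}


def add_block(sets: Dict[int, List[TextBlock]], block: TextBlock) -> int:
    """File the block under its angle and give it the next index (assumed: insertion order)."""
    bucket = sets[block.angle]
    block.index = len(bucket)
    bucket.append(block)
    return block.index
```

Keeping one set per angle means the later steps can treat each orientation independently and handle the 0-degree case as the default.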

Preferably, as shown in Fig. 2, identifying the text blocks of the PDF text in step S100 may comprise the following steps:

Step S110. Determine whether a character in the PDF text is an English character; if so, go to step S130; if not, go to step S120;

Step S120. Treat the character as one text block;

Step S130. Determine whether the spacing between two adjacent characters is less than the product of the font size and the first width coefficient, and whether the two adjacent characters have the same font, font size, and color; if so, go to step S140; if not, go to step S150. In this step, let the two adjacent characters be a first character and a second character, with the upper-left and lower-right corners of the first character at (x1, y1) and (x1', y1') and those of the second character at (x2, y2) and (x2', y2'). The spacing between the two adjacent characters can then be expressed as fabs(max(x1, x2) - min(x1', x2')), where max(x1, x2) is the larger of x1 and x2, min(x1', x2') is the smaller of x1' and x2', and fabs() takes the absolute value;

Step S140. The two adjacent characters belong to the same text block;

Step S150. The two adjacent characters do not belong to the same text block.
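Steps S110 to S150 can be sketched in Python as below. The sketch makes several assumptions that are not in the text: an "English character" is approximated as an ASCII letter, the first width coefficient defaults to 0.3 (the text gives no value), and the names `Char`, `char_gap`, `same_block`, and `group_chars` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    ch: str
    x_left: float    # x of the upper-left corner (x1 in the text)
    x_right: float   # x of the lower-right corner (x1' in the text)
    font: str
    font_size: float
    color: str


def char_gap(a: Char, b: Char) -> float:
    """Spacing from step S130: fabs(max(x1, x2) - min(x1', x2'))."""
    return abs(max(a.x_left, b.x_left) - min(a.x_right, b.x_right))


def is_english(c: Char) -> bool:
    # Assumption: "English character" is approximated as an ASCII letter.
    return c.ch.isascii() and c.ch.isalpha()


def same_block(a: Char, b: Char, width_coeff: float = 0.3) -> bool:
    """True when two adjacent characters belong to the same text block (S130/S140)."""
    return (is_english(a) and is_english(b)
            and char_gap(a, b) < a.font_size * width_coeff
            and (a.font, a.font_size, a.color) == (b.font, b.font_size, b.color))


def group_chars(chars: List[Char], width_coeff: float = 0.3) -> List[str]:
    """Walk the characters in order and build text blocks as strings."""
    blocks: List[str] = []
    prev = None
    for c in chars:
        if prev is not None and same_block(prev, c, width_coeff):
            blocks[-1] += c.ch       # S140: extend the current block
        else:
            blocks.append(c.ch)      # S120 or S150: start a new block
        prev = c
    return blocks
```

With these rules, closely spaced Latin letters in one style collapse into a word, while each non-English character becomes its own block.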

Preferably, as shown in Fig. 3, step S200 may specifically comprise the following steps:

Step S210. Obtain two text blocks with the same index value or adjacent index values; if the two text blocks have identical content and the spacing between them is less than the product of the font size and the second width coefficient, delete one of them and put the remaining text block into the text block set. In this step, let the two text blocks be a first text block and a second text block, with the upper-left and lower-right corners of the first text block at (X1, Y1) and (X1', Y1') and those of the second text block at (X2, Y2) and (X2', Y2'). The condition that the spacing between the two text blocks is less than the product of the font size and the second width coefficient can then be expressed as:

fabs(X1 - X2) < (font size × width coefficient 2-1), and

fabs(X1' - X2') < (font size × width coefficient 2-2), and

fabs(Y1 - Y2) < (font size × width coefficient 2-3), and

fabs(Y1' - Y2') < (font size × width coefficient 2-4);

In the above expressions, for example but without limitation, width coefficients 2-1, 2-2, 2-3, and 2-4 may all be 0.2;

Step S220. Create an empty text line and put the text blocks of the text block set into it in order of array index value, so as to generate the text line set. Within the same text line, two text blocks with the same index value or adjacent index values satisfy the following: the baseline distance between the two text blocks is less than the product of the font size and the first baseline distance coefficient, and the horizontal spacing between the two text blocks, fabs(max(X1, X2) - min(X1', X2')), is less than the product of the font size and the second baseline distance coefficient. For example, but without limitation, the first baseline distance coefficient may be 0.5 and the second baseline distance coefficient 0.6.
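Steps S210 and S220 can be sketched together as follows. The coefficient values 0.2, 0.5, and 0.6 are the example values given in the text; everything else is an assumption made for illustration, including the names `Block`, `is_duplicate`, `drop_duplicates`, `same_line`, and `build_lines`, the use of a single coefficient in place of the four separate width coefficients, and the choice to compare each block only with the last block of the current line.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    x1: float         # upper-left X
    y1: float         # upper-left Y
    x2: float         # lower-right X (X1' in the text)
    y2: float         # lower-right Y (Y1' in the text)
    baseline: float
    font_size: float
    index: int


def is_duplicate(a: Block, b: Block, coeff: float = 0.2) -> bool:
    """S210: same content, same or adjacent index, all corner offsets below font size x coeff."""
    limit = a.font_size * coeff
    return (a.text == b.text
            and abs(a.index - b.index) <= 1
            and abs(a.x1 - b.x1) < limit and abs(a.x2 - b.x2) < limit
            and abs(a.y1 - b.y1) < limit and abs(a.y2 - b.y2) < limit)


def drop_duplicates(blocks: List[Block]) -> List[Block]:
    """Keep the first copy of each block that is repeated across layers."""
    kept: List[Block] = []
    for b in blocks:
        if not any(is_duplicate(k, b) for k in kept):
            kept.append(b)
    return kept


def same_line(a: Block, b: Block, baseline_coeff: float = 0.5, gap_coeff: float = 0.6) -> bool:
    """S220: baseline distance and horizontal spacing both below their thresholds."""
    gap = abs(max(a.x1, b.x1) - min(a.x2, b.x2))
    return (abs(a.index - b.index) <= 1
            and abs(a.baseline - b.baseline) < a.font_size * baseline_coeff
            and gap < a.font_size * gap_coeff)


def build_lines(blocks: List[Block]) -> List[List[Block]]:
    """Place blocks into text lines in order of array index value."""
    lines: List[List[Block]] = []
    for b in sorted(blocks, key=lambda blk: blk.index):
        if lines and same_line(lines[-1][-1], b):
            lines[-1].append(b)
        else:
            lines.append([b])    # start a new text line with this block
    return lines
```

Running `drop_duplicates` before `build_lines` mirrors the order of the two steps: duplicates from overlapping layers are removed first so that they cannot end up side by side in a text line.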

Preferably, as shown in Fig. 4, step S300 may specifically comprise the following steps:

Step S310. Sort the text lines in the text line set in ascending order of upper boundary value;

Step S320. Compare adjacent text lines one by one; if the projections of two adjacent text lines onto the Y axis intersect, put them into the same first text;

Step S330. Sort the text lines in each first text in ascending order of left boundary value;

Step S340. Compare adjacent text lines one by one; if the projections of two adjacent text lines onto the X axis intersect, put them into the same second text;

Step S350. Extract the blank areas between the second texts in the second text set to form the blank area set. For example, but without limitation, suppose the rotation angle of the PDF text is 0 degrees, so that it is typeset from top to bottom and from left to right. If the second text set contains two second texts whose upper-left corners are at (XX1, YY1) and (XX2, YY2) and whose lower-right corners are at (XX1', YY1') and (XX2', YY2'), the blank areas extracted into the blank area set are:

First blank area: (0, XX1);

Second blank area: (XX1', XX2);

Last blank area: (XX2', pagewidth).
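The projection tests of steps S310 to S340 and the blank area extraction of step S350 can be sketched as below. Text lines are reduced to bounding rectangles, and the names `overlaps`, `group_by_projection`, and `blank_areas` are assumptions introduced for this illustration; treating touching intervals as non-intersecting is also an assumption.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_left, y_top, x_right, y_bottom)


def overlaps(a_lo: float, a_hi: float, b_lo: float, b_hi: float) -> bool:
    """True when two 1-D projections intersect (touching ends do not count)."""
    return a_lo < b_hi and b_lo < a_hi


def group_by_projection(rects: List[Rect], axis: str) -> List[List[Rect]]:
    """axis='y': S310-S320 (first texts); axis='x': S330-S340 (second texts)."""
    lo, hi = (1, 3) if axis == "y" else (0, 2)
    groups: List[List[Rect]] = []
    for r in sorted(rects, key=lambda rc: rc[lo]):   # ascending upper or left boundary
        last = groups[-1][-1] if groups else None
        if last is not None and overlaps(last[lo], last[hi], r[lo], r[hi]):
            groups[-1].append(r)                     # adjacent lines intersect
        else:
            groups.append([r])
    return groups


def blank_areas(second_texts: List[List[Rect]], page_width: float) -> List[Tuple[float, float]]:
    """S350: horizontal gaps before, between, and after the second texts."""
    spans = sorted((min(r[0] for r in g), max(r[2] for r in g)) for g in second_texts)
    areas = []
    cursor = 0.0
    for left, right in spans:
        areas.append((cursor, left))
        cursor = right
    areas.append((cursor, page_width))
    return areas
```

For two second texts this yields exactly the three intervals listed above: (0, XX1), (XX1', XX2), and (XX2', pagewidth).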

Preferably, step S400 may comprise: performing a first merge of adjacent first texts in the first text set, the condition for the first merge being that the two adjacent first texts have the same number of blank areas and that corresponding blank areas intersect in their projections onto the X axis.

Preferably, step S400 may also comprise: after the first merge, performing a second merge to obtain the text typesetting row, the condition for the second merge being that the blank areas of two adjacent first texts intersect in their projections onto the X axis.
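The two merge conditions of step S400 can be sketched as predicates over the blank areas of adjacent first texts. This is an illustration under assumptions rather than the method as claimed: each first text is represented only by its list of blank-area X intervals, the second condition is read literally as "at least one pair of blank areas intersects", and the names `first_pass_ok`, `second_pass_ok`, and `merge_adjacent` are hypothetical.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # X projection of one blank area: (left, right)


def overlaps(a: Interval, b: Interval) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def first_pass_ok(a: List[Interval], b: List[Interval]) -> bool:
    """First merge: same number of blank areas and corresponding areas intersect."""
    return len(a) == len(b) and all(overlaps(p, q) for p, q in zip(a, b))


def second_pass_ok(a: List[Interval], b: List[Interval]) -> bool:
    """Second merge (assumed literal reading): some pair of blank areas intersects."""
    return any(overlaps(p, q) for p in a for q in b)


def merge_adjacent(rows: List[List[Interval]],
                   can_merge: Callable[[List[Interval], List[Interval]], bool]) -> List[List[int]]:
    """Group indices of adjacent first texts (top to bottom) that satisfy can_merge."""
    groups: List[List[int]] = []
    for i, blanks in enumerate(rows):
        if groups and can_merge(rows[groups[-1][-1]], blanks):
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups
```

Under this reading, the first condition keeps a full-width title apart from a two-column body because their blank-area counts differ, while the looser second condition can join them.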

Preferably, as shown in Fig. 5, step S500 may specifically comprise the following steps:

Step S510. Sort the text lines in the text typesetting row in ascending order of left boundary value;

Step S520. Create a new text typesetting column and take text lines out of the text typesetting row one at a time, in order;

Step S530. Determine whether the projection onto the X axis of the text line taken out intersects that of the newly created text typesetting column; if so, go to step S540; if not, go to step S520;

Step S540. Put the text line taken out into the newly created text typesetting column, in order;

Step S550. Sort the text lines in the text typesetting column in ascending order of upper boundary value;

Step S560. Create a new text paragraph and take text lines out of the text typesetting column one at a time, in order;

Step S570. Determine whether two adjacent text lines satisfy the preset paragraph condition; if so, go to step S580; if not, go to step S560;

Step S580. Put the two adjacent text lines into the same text paragraph, in order.
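The column-forming part of this procedure, steps S510 to S550, can be sketched as follows. Text lines are again reduced to bounding rectangles; the name `split_into_columns` and the choice to measure a column's X extent as the union of its lines are assumptions made for this sketch.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_left, y_top, x_right, y_bottom)


def split_into_columns(lines: List[Rect]) -> List[List[Rect]]:
    """Divide the text lines of one typesetting row into typesetting columns."""
    columns: List[List[Rect]] = []
    for line in sorted(lines, key=lambda r: r[0]):            # S510: ascending left boundary
        if columns:
            left = min(r[0] for r in columns[-1])
            right = max(r[2] for r in columns[-1])
            if line[0] < right and left < line[2]:            # S530: X projections intersect
                columns[-1].append(line)                      # S540: same column
                continue
        columns.append([line])                                # S520: open a new column
    # S550: within each column, order the lines by upper boundary
    return [sorted(col, key=lambda r: r[1]) for col in columns]
```

Each returned column is then ready for the paragraph pass of steps S560 to S580, which walks its lines from top to bottom.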

Preferably, the preset paragraph condition may comprise the following conditions:

(a) the height difference between the text lines, fabs(YY2 - YY1'), is less than the product of the font size and the height coefficient; and

(b) the vertical spacing between the text lines, fabs(YY2' - YY1'), is less than the product of the average font height and the paragraph coefficient; and

(c) the difference between the widths of the text lines, fabs((XX1' - XX1) - (XX2' - XX2)), is less than the product of the font size and the width coefficient; or,

if the left boundary values of the two text lines are identical, the width of the preceding text line, (XX1' - XX1), is greater than the width of the following text line, (XX2' - XX2); or,

if the right boundary values of the two text lines are identical, the width of the preceding text line, (XX1' - XX1), is less than the width of the following text line, (XX2' - XX2).

In the above expressions, for example but without limitation, the height coefficient is 0.2, the paragraph coefficient is 1.0, and the width coefficient is 4.
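The paragraph condition can be sketched as a single predicate. The coefficient defaults are the example values from the text. How the "and" and "or" clauses group is not spelled out, so this sketch assumes the reading (a) and (b) and ((c) or the left-boundary case or the right-boundary case); the function name `same_paragraph` and its parameters are likewise assumptions.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (XX, YY, XX', YY'): upper-left and lower-right corners


def same_paragraph(first: Rect, second: Rect, font_size: float, avg_font_height: float,
                   height_coeff: float = 0.2, paragraph_coeff: float = 1.0,
                   width_coeff: float = 4.0) -> bool:
    """Decide whether two adjacent text lines belong to the same paragraph."""
    xx1, yy1, xx1p, yy1p = first
    xx2, yy2, xx2p, yy2p = second
    w1, w2 = xx1p - xx1, xx2p - xx2

    cond_a = abs(yy2 - yy1p) < font_size * height_coeff             # (a)
    cond_b = abs(yy2p - yy1p) < avg_font_height * paragraph_coeff   # (b)
    cond_c = abs(w1 - w2) < font_size * width_coeff                 # (c)
    cond_left = xx1 == xx2 and w1 > w2      # same left edge, following line shorter
    cond_right = xx1p == xx2p and w1 < w2   # same right edge, preceding line shorter

    # Assumed grouping of the clauses; the text does not state it explicitly.
    return cond_a and cond_b and (cond_c or cond_left or cond_right)
```

The left-boundary case covers a short last line of a paragraph, and the right-boundary case covers an indented first line followed by a full-width line.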

The above are merely preferred embodiments of the present invention and do not limit the present invention. For those skilled in the art, the present invention may be subject to various changes and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A method for generating PDF text paragraphs, characterized by comprising:

A. identifying and extracting the text blocks of a PDF text;

In step A, identifying the text blocks of the PDF text comprises:

A2. treating the character as one text block;

Step B comprises:

Step C comprises:

Step D comprises:

Step E comprises:

E3. determining whether the projection onto the X axis of the text line taken out intersects that of the newly created text typesetting column; if so, going to step E4; if not, going to step E2;

2. The method for generating PDF text paragraphs according to claim 1, characterized in that, in step A, text block sets are created for the four rotation angles 0, 90, 180, and 270 degrees, and an array index is built incrementally to extract the text blocks of the PDF text.

3. The method for generating PDF text paragraphs according to claim 2, characterized in that each extracted text block includes the baseline, bounding rectangle, font, font size, color, and angle of the text block.

4. The method for generating PDF text paragraphs according to claim 1, characterized in that the preset paragraph condition comprises the following conditions: