CN101876967B - Method for generating PDF text paragraphs - Google Patents


Publication number
CN101876967B
Authority
CN
China
Prior art keywords
text
line
literal
adjacent
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN2010101363998A
Other languages
Chinese (zh)
Other versions
CN101876967A (en)
Inventor
晏检平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wondershare Technology Co ltd
Original Assignee
Shenzhen Wondershare Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Wondershare Software Co Ltd
Priority to CN2010101363998A
Publication of CN101876967A
Application granted
Publication of CN101876967B
Expired - Fee Related
Anticipated expiration


Abstract

The invention relates to a method for generating PDF text paragraphs, comprising the following steps: A. identifying and extracting text blocks of a PDF text; B. removing the repeated text blocks in different layers, and determining text lines which form a text line set; C. horizontally dividing the text line set to obtain one or more first texts; and then vertically dividing each first text in a first text set to respectively obtain one or more second texts, and extracting a blank area among one or more second texts to form a blank area set; D. merging two adjacent first texts in the first text set to obtain a text type-setting line; and E. dividing the merged text type-setting line so as to form a text type-setting column and the text paragraphs. By implementing the technical proposal of the invention, the text structure processed by the method easily generates an RTF format, thus achieving good effect and high editable degree; and in addition, the method adopts automatic typesetting, thus manual intervention is not required.

Description

A method for generating PDF text paragraphs
Technical field
The present invention relates to information technology and, more particularly, to a method for generating PDF text paragraphs.
Background art
Portable Document Format (PDF) is a file format for document exchange developed by Adobe Systems in 1993. Its advantages are that it is cross-platform, preserves the original layout of a document, and is an open standard. A PDF file records the exact position of each text element but no relationship between the text elements, so the format is difficult to edit.
With these characteristics, PDF has become a preferred file format for distributing electronic documents and formatted information on the Internet. Currently, most technical papers published on the Internet are submitted as PDF. However, PDF focuses on describing the print format of a document; it does not describe the data structure of the original document and is difficult to edit. When content from a third-party PDF text needs to be quoted, the usual practice is to copy the text out manually and then paste it into a tool such as Word to typeset and edit it by hand, which is time-consuming and laborious.
At present, an XML file is generally exported using the capability of the typesetting software itself. This XML file contains the content information of the PDF article, and the content output by different typesetting software may differ. However, most typesetting software does not export the position information of the text blocks, so the information about the PDF article is incomplete and often has to be supplemented manually, which is very inefficient. Because most typesetting software can generate PDF files and a large amount of historical data is in PDF, PDF-based parsing has a very wide range of applications.
For example, the patent application with publication number CN1687926A discloses "A system and method for extracting PDF text information based on XML", which mainly converts the physical structure of a PDF text into a logical structure but does not assemble the text into paragraphs and articles. As another example, the patent application with publication number CN1776673A discloses "A method for converting PDF text to an XML document", which converts the PDF into a flat XML document with a third-party tool and then extracts the information in the XML through XSLT binding rules. Its precondition is that the PDF page itself is fairly simple and consistently structured, so that simple XPath rules suffice to extract the XML information; it is not suitable for complex multi-column pages. As another example, the patent application with publication number CN160403A discloses "A method for recovering the reading order of text on a newspaper page", which assembles the PDF text into articles but does not cover the rules and flow for generating and merging text blocks or for extracting information such as the overall content and position. As another example, the patent application with publication number CN101206639 discloses "A method for indexing PDF data", which extracts text blocks and articles from PDFs with complex pages. However, its process merges the font and font size of the text blocks, which distorts the converted result; after the text blocks are merged, manual intervention is needed to identify the relationships between blocks, because the logical relationships between blocks cannot be identified automatically; and the result generated by its block-forming rules cannot be written to Rich Text Format.
Summary of the invention
The technical problem to be solved by the present invention is the defect in the prior art that quoting a third-party PDF text for editing and typesetting is time-consuming and laborious. To address it, the invention provides a method for generating PDF text paragraphs that allows editing and typesetting to be completed with little time and effort.
The technical solution adopted by the present invention to solve this technical problem is a method for generating PDF text paragraphs, comprising:
A. identifying and extracting the text blocks of a PDF text;
B. removing the text blocks that are repeated in different layers and determining text lines, the determined text lines forming a text line set;
C. dividing the text line set in the horizontal direction to obtain one or more first texts, which form a first text set; then dividing each first text in the first text set in the vertical direction to obtain one or more second texts, which form a second text set; and extracting the blank areas between the one or more second texts in the second text set to form a blank area set;
D. merging two adjacent first texts in the first text set to obtain a text typesetting row;
E. dividing the merged text typesetting row to form text typesetting columns and text paragraphs; wherein,
In step A, identifying the text blocks of the PDF text comprises:
A1. determining whether a character in the PDF text is an English character; if so, performing step A3; if not, performing step A2;
A2. treating the character as one text block;
A3. determining whether the spacing between two adjacent characters is less than the product of the font size and a first width coefficient, and whether the two adjacent characters have the same font, font size, and color; if so, the two adjacent characters belong to the same text block; if not, they do not belong to the same text block;
Step B comprises:
B1. obtaining two text blocks with the same index value or adjacent index values; if the two text blocks have identical content and the spacing between them is less than the product of the font size and a second width coefficient, deleting one of them and putting the remaining text block into the text block set;
B2. creating an empty text line and putting the text blocks in the text block set into it in order of array index value, to generate the text line set; within the same text line, two text blocks with the same index value or adjacent index values satisfy: the baseline distance between the two text blocks is less than the product of the font size and a first baseline difference coefficient, and the horizontal spacing between the two text blocks is less than the product of the font size and a second baseline difference coefficient;
Step C comprises:
C1. sorting the text lines in the text line set in ascending order of their upper boundary values;
C2. comparing adjacent text lines pair by pair; if the projections of two adjacent text lines on the Y axis intersect, putting them into the same first text;
C3. sorting the text lines in each first text in ascending order of their left boundary values;
C4. comparing adjacent text lines pair by pair; if the projections of two adjacent text lines on the X axis intersect, putting them into the same second text;
C5. extracting the blank areas between the one or more second texts in the second text set to form the blank area set;
Step D comprises:
performing a first merge of two adjacent first texts in the first text set, the condition for the first merge being that the two adjacent first texts have the same number of blank areas and that their corresponding blank areas intersect in projection on the X axis;
after the first merge, performing a second merge to obtain the text typesetting row, the condition for the second merge being that the blank areas of two adjacent first texts intersect in projection on the X axis;
Step E comprises:
E1. sorting the text lines in the text typesetting row in ascending order of their left boundary values;
E2. creating a new text typesetting column and taking text lines one by one from the text typesetting row;
E3. determining whether the text line taken out and the new text typesetting column intersect in projection on the X axis; if so, going to step E4; if not, going to step E2;
E4. putting the text line taken out into the new text typesetting column in order;
E5. sorting the text lines in the text typesetting column in ascending order of their upper boundary values;
E6. creating a new text paragraph and taking text lines one by one from the text typesetting column;
E7. determining whether two adjacent text lines satisfy a preset paragraph condition; if so, going to step E8; if not, going to step E6;
E8. putting the two adjacent text lines into the same text paragraph in order.
In the method for generating PDF text paragraphs of the present invention, in step A, text block sets are created for the four rotation angles 0, 90, 180, and 270 degrees, and an array index is built incrementally to extract the text blocks of the PDF text.
In the method for generating PDF text paragraphs of the present invention, each extracted text block includes its baseline, bounding rectangle, font, font size, color, and angle.
In the method for generating PDF text paragraphs of the present invention, the preset paragraph condition comprises the following conditions:
(a) the height difference between the text lines is less than the product of the font size and a height coefficient; and
(b) the vertical spacing between the text lines is less than the product of the average font height and a paragraph coefficient; and
(c) the width difference between the text lines is less than the product of the font size and a width coefficient; or,
if the left boundary values of the two text lines are the same, the width of the preceding text line is greater than the width of the following text line; or,
if the right boundary values of the two text lines are the same, the width of the preceding text line is less than the width of the following text line.
Implementing the method for generating PDF text paragraphs of the present invention has the following beneficial effects: the text structure produced by the method is easily written out in RTF (Rich Text Format), the result is good, and the degree of editability is high; in addition, the method typesets automatically and requires no manual intervention, saving time and effort.
Description of drawings
The present invention is further described below with reference to the accompanying drawings and embodiments, in which:
Fig. 1 is a flowchart of Embodiment 1 of the method for generating PDF text paragraphs of the present invention;
Fig. 2 is a flowchart of Embodiment 1 of identifying the text blocks of the PDF text in step S100 of Fig. 1;
Fig. 3 is a flowchart of Embodiment 1 of step S200 in Fig. 1;
Fig. 4 is a flowchart of Embodiment 1 of step S300 in Fig. 1;
Fig. 5 is a flowchart of Embodiment 1 of step S500 in Fig. 1.
Detailed description of the embodiments
The present invention addresses the defect in the prior art that editing content quoted from a third-party PDF text is time-consuming and laborious, and provides a method for generating PDF text paragraphs that saves time and effort when editing.
Before the method is described in detail, several technical terms that it uses are introduced:
Text block: in an English environment, a text block is usually one English word; in a non-English environment, a text block is usually one character. Text blocks have four directions: left to right, top to bottom, right to left, and bottom to top.
Text line: one or more text blocks located on the same line form a text line.
Text paragraph: one or more adjacent text lines form a text paragraph; paragraphs are generally separated by a blank line.
Text typesetting column: one or more text paragraphs arranged from top to bottom form a text typesetting column.
Text typesetting row: one or more text typesetting columns form a text typesetting row.
Page: one or more text typesetting rows are combined into a page.
An ordinary article usually consists of one text typesetting row that contains one text typesetting column, which in turn contains multiple text paragraphs.
A two-column article usually consists of one text typesetting row that contains two text typesetting columns, each of which contains multiple text paragraphs.
A relatively complex page usually consists of multiple text typesetting rows; each text typesetting row contains one or more text typesetting columns, and each text typesetting column in turn contains one or more text paragraphs.
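The containment hierarchy defined by these terms can be pictured as nested records. The following Python sketch is only an illustration: every class and field name is an assumption chosen to mirror the terms above, and none of them comes from the patent.

```python
from dataclasses import dataclass, field
from typing import List

# All class and field names are hypothetical; they only mirror the terms above.

@dataclass
class TextLine:
    blocks: List[str] = field(default_factory=list)   # text blocks on one line

@dataclass
class TextParagraph:
    lines: List[TextLine] = field(default_factory=list)

@dataclass
class TypesettingColumn:
    paragraphs: List[TextParagraph] = field(default_factory=list)

@dataclass
class TypesettingRow:
    columns: List[TypesettingColumn] = field(default_factory=list)

@dataclass
class Page:
    rows: List[TypesettingRow] = field(default_factory=list)

def two_column_page(left: List[TextParagraph], right: List[TextParagraph]) -> Page:
    """Build the two-column article case: one row holding two columns."""
    row = TypesettingRow(columns=[TypesettingColumn(left), TypesettingColumn(right)])
    return Page(rows=[row])
```

An ordinary single-column article would be the same structure with one column in the single row.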
As shown in Fig. 1, in the flowchart of Embodiment 1 of the method for generating PDF text paragraphs of the present invention, the method comprises the following steps:
Step S100. Identify and extract the text blocks of the PDF text. Specifically, in this step, text block sets are first created for the four rotation angles 0, 90, 180, and 270 degrees; these four sets correspond to four forms of PDF text. For example, a PDF text with a rotation angle of 0 degrees, which is also the most common form, reads from top to bottom and is typeset from left to right; its text blocks run from left to right, that is, the font angle is 0 degrees. Texts with the other rotation angles can be understood by analogy and are not described further here. Then, within each of the four text block sets, an array index is built incrementally and used to extract the text blocks of the PDF text. Each extracted text block includes its baseline, bounding rectangle, font, font size, color, and angle. The baseline is used to locate the text block, the bounding rectangle is used to determine the coordinates of the text block on the X and Y axes, and the array index is related to the baseline of the line in which the text block lies;
Step S200. Remove the text blocks that are repeated in different layers and determine the text lines, the determined text lines forming a text line set. Because a PDF text may, during conversion, display the same page in multiple layers with identical content, the text blocks repeated in different layers need to be deleted;
Step S300. Divide the text line set in the horizontal direction to obtain one or more first texts, which form a first text set; then divide each first text in the first text set in the vertical direction to obtain one or more second texts, which form a second text set; and extract the blank areas between the one or more second texts in the second text set to form a blank area set;
Step S400. Merge two adjacent first texts in the first text set to obtain a text typesetting row;
Step S500. Divide the merged text typesetting row to form text typesetting columns and text paragraphs.
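The record extracted in step S100 and the four rotation-specific sets might be represented as in the sketch below. The field names, the string types for font and color, and the use of list position as a stand-in for the array index are all assumptions made for illustration; the source only lists which attributes a block carries.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class TextBlock:
    text: str
    baseline: float                           # used to locate the block
    bbox: Tuple[float, float, float, float]   # bounding rectangle: left, top, right, bottom
    font: str
    font_size: float
    color: str
    angle: int                                # rotation in degrees: 0, 90, 180 or 270

ROTATIONS = (0, 90, 180, 270)

def bucket_by_rotation(blocks: List[TextBlock]) -> Dict[int, List[TextBlock]]:
    """Put each block into the set for its rotation angle, keeping input order.

    The position of a block inside its list plays the role of the array index
    here (an assumption). An angle outside ROTATIONS raises KeyError.
    """
    sets: Dict[int, List[TextBlock]] = {angle: [] for angle in ROTATIONS}
    for block in blocks:
        sets[block.angle].append(block)
    return sets
```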
Preferably, as shown in Fig. 2, identifying the text blocks of the PDF text in step S100 may comprise the following steps:
Step S110. Determine whether a character in the PDF text is an English character; if so, go to step S130; if not, go to step S120;
Step S120. Treat the character as one text block;
Step S130. Determine whether the spacing between two adjacent characters is less than the product of the font size and the first width coefficient, and whether the two adjacent characters have the same font, font size, and color; if so, go to step S140; if not, go to step S150. In this step, let the two adjacent characters be a first character and a second character, let the upper-left and lower-right corners of the first character be (x1, y1) and (x1', y1'), and let those of the second character be (x2, y2) and (x2', y2'). The spacing between the two adjacent characters can then be expressed as fabs(max(x1, x2) - min(x1', x2')), where max(x1, x2) is the larger of x1 and x2, min(x1', x2') is the smaller of x1' and x2', and fabs() takes the absolute value;
Step S140. The two adjacent characters belong to the same text block;
Step S150. The two adjacent characters do not belong to the same text block.
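The character-grouping rule of steps S110 to S150 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the `Char` record, the ASCII-letter test for "English character", and the value 0.3 for the first width coefficient are all hypothetical, since the source gives no number for that coefficient and does not define the English test.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Char:
    ch: str
    x_left: float    # x of the upper-left corner (x1 or x2 in step S130)
    x_right: float   # x of the lower-right corner (x1' or x2')
    font: str
    size: float
    color: str

FIRST_WIDTH_COEFF = 0.3  # assumed value; not given in the source

def is_english(ch: str) -> bool:
    """Assumed test for an English character: ASCII letters only."""
    return ch.isascii() and ch.isalpha()

def gap(a: Char, b: Char) -> float:
    # fabs(max(x1, x2) - min(x1', x2')) from step S130
    return abs(max(a.x_left, b.x_left) - min(a.x_right, b.x_right))

def same_block(a: Char, b: Char) -> bool:
    return (gap(a, b) < a.size * FIRST_WIDTH_COEFF
            and (a.font, a.size, a.color) == (b.font, b.size, b.color))

def group_chars(chars: List[Char]) -> List[str]:
    """Group characters, given in reading order, into text blocks."""
    blocks: List[List[Char]] = []
    for c in chars:
        prev = blocks[-1][-1] if blocks else None
        if (prev is not None and is_english(c.ch) and is_english(prev.ch)
                and same_block(prev, c)):
            blocks[-1].append(c)    # S140: same text block
        else:
            blocks.append([c])      # S120 or S150: start a new text block
    return ["".join(c.ch for c in b) for b in blocks]
```

With these assumptions, closely spaced Latin letters collapse into words while each non-English character stays a block of its own, matching the definition of a text block given earlier.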
Preferably, as shown in Fig. 3, step S200 may specifically comprise the following steps:
Step S210. Obtain two text blocks with the same index value or adjacent index values. If the two text blocks have identical content and the spacing between them is less than the product of the font size and the second width coefficient, delete one of them and put the remaining text block into the text block set. In this step, let the two text blocks be a first text block and a second text block, let the upper-left and lower-right corners of the first text block be (X1, Y1) and (X1', Y1'), and let those of the second text block be (X2, Y2) and (X2', Y2'). The condition that the spacing between the two text blocks is less than the product of the font size and the second width coefficient can then be expressed as:
fabs(X1 - X2) < (font size * width coefficient 2-1), and
fabs(X1' - X2') < (font size * width coefficient 2-2), and
fabs(Y1 - Y2) < (font size * width coefficient 2-3), and
fabs(Y1' - Y2') < (font size * width coefficient 2-4);
In the above expressions, by way of example and without limitation, width coefficients 2-1, 2-2, 2-3, and 2-4 may all be 0.2;
Step S220. Create an empty text line and put the text blocks in the text block set into it in order of array index value, to generate the text line set. Within the same text line, two text blocks with the same index value or adjacent index values satisfy: the baseline distance between the two text blocks is less than the product of the font size and the first baseline difference coefficient, and the horizontal spacing fabs(max(X1, X2) - min(X1', X2')) between the two text blocks is less than the product of the font size and the second baseline difference coefficient. By way of example and without limitation, the first baseline difference coefficient may be 0.5 and the second baseline difference coefficient may be 0.6.
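The duplicate test of step S210 and the same-line test of step S220 can be sketched as two predicates. The `Block` record is hypothetical, the coefficient values are the example values quoted in the text, and comparing each block only with the most recently kept block is an assumed reading of "same index value or adjacent index values".

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Block:
    text: str
    x1: float        # upper-left x
    y1: float        # upper-left y
    x2: float        # lower-right x
    y2: float        # lower-right y
    baseline: float
    size: float      # font size

DUP_COEFF = 0.2        # example value for the four second width coefficients
BASELINE_COEFF = 0.5   # example value for the first baseline difference coefficient
HSPACE_COEFF = 0.6     # example value for the second baseline difference coefficient

def is_duplicate(a: Block, b: Block) -> bool:
    """S210: same content and all four corner coordinates nearly equal."""
    limit = a.size * DUP_COEFF
    return (a.text == b.text
            and abs(a.x1 - b.x1) < limit and abs(a.x2 - b.x2) < limit
            and abs(a.y1 - b.y1) < limit and abs(a.y2 - b.y2) < limit)

def drop_duplicates(blocks: List[Block]) -> List[Block]:
    """Keep one copy of each block repeated across layers.

    Assumes the input is already ordered by array index.
    """
    kept: List[Block] = []
    for b in blocks:
        if kept and is_duplicate(kept[-1], b):
            continue
        kept.append(b)
    return kept

def same_line(a: Block, b: Block) -> bool:
    """S220: baselines close together and horizontal gap small."""
    h_gap = abs(max(a.x1, b.x1) - min(a.x2, b.x2))
    return (abs(a.baseline - b.baseline) < a.size * BASELINE_COEFF
            and h_gap < a.size * HSPACE_COEFF)
```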
Preferably, as shown in Fig. 4, step S300 may specifically comprise the following steps:
Step S310. Sort the text lines in the text line set in ascending order of their upper boundary values;
Step S320. Compare adjacent text lines pair by pair; if the projections of two adjacent text lines on the Y axis intersect, put them into the same first text;
Step S330. Sort the text lines in each first text in ascending order of their left boundary values;
Step S340. Compare adjacent text lines pair by pair; if the projections of two adjacent text lines on the X axis intersect, put them into the same second text;
Step S350. Extract the blank areas between the one or more second texts in the second text set to form the blank area set. By way of example and without limitation, suppose the rotation angle of the PDF text is 0 degrees, so that it is typeset from top to bottom and from left to right, and suppose that in the second text set the upper-left corners of two second texts are (XX1, YY1) and (XX2, YY2) and their lower-right corners are (XX1', YY1') and (XX2', YY2'). The blank areas extracted into the blank area set are then:
First blank area (0, XX1);
Second blank area (XX1', XX2);
Last blank area (XX2', pagewidth).
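Steps S310 to S350 amount to two passes of projection-overlap grouping followed by gap extraction. The sketch below is one possible reading: text lines are reduced to bounding rectangles with y increasing downward, and the function names are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # left, top, right, bottom; y grows downward

def overlaps(a_lo: float, a_hi: float, b_lo: float, b_hi: float) -> bool:
    """True if the two 1-D projections intersect."""
    return a_lo < b_hi and b_lo < a_hi

def first_texts(lines: List[Rect]) -> List[List[Rect]]:
    """S310-S320: sort by top edge, chain neighbours whose Y projections intersect."""
    groups: List[List[Rect]] = []
    for ln in sorted(lines, key=lambda r: r[1]):
        if groups and overlaps(groups[-1][-1][1], groups[-1][-1][3], ln[1], ln[3]):
            groups[-1].append(ln)
        else:
            groups.append([ln])
    return groups

def second_texts(first: List[Rect]) -> List[List[Rect]]:
    """S330-S340: sort by left edge, chain neighbours whose X projections intersect."""
    groups: List[List[Rect]] = []
    for ln in sorted(first, key=lambda r: r[0]):
        if groups and overlaps(groups[-1][-1][0], groups[-1][-1][2], ln[0], ln[2]):
            groups[-1].append(ln)
        else:
            groups.append([ln])
    return groups

def blank_areas(seconds: List[List[Rect]], page_width: float) -> List[Tuple[float, float]]:
    """S350: gaps before, between and after the second texts, as (x_start, x_end)."""
    spans = [(min(r[0] for r in g), max(r[2] for r in g)) for g in seconds]
    edges = [0.0] + [x for span in spans for x in span] + [page_width]
    return [(edges[i], edges[i + 1]) for i in range(0, len(edges), 2)]
```

For two second texts spanning (XX1, XX1') and (XX2, XX2'), `blank_areas` yields the three intervals (0, XX1), (XX1', XX2), and (XX2', pagewidth) listed above.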
Preferably, step S400 may comprise: performing a first merge of two adjacent first texts in the first text set, the condition for the first merge being that the two adjacent first texts have the same number of blank areas and that their corresponding blank areas intersect in projection on the X axis.
Preferably, step S400 may further comprise: after the first merge, performing a second merge to obtain the text typesetting row, the condition for the second merge being that the blank areas of two adjacent first texts intersect in projection on the X axis.
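The first merge condition of step S400 can be sketched as a predicate plus a generic adjacent-merge loop. Several points are assumptions here: the dictionary layout of a first text, the choice to keep the earlier text's blank areas after a merge, and the helper names. The second merge is not implemented because the source does not spell out its condition precisely enough; it would reuse the same loop with a looser predicate.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[float, float]   # a blank area as (x_start, x_end)
FirstText = Dict[str, list]  # hypothetical record: {"lines": [...], "blanks": [Span, ...]}

def spans_intersect(a: Span, b: Span) -> bool:
    """True if two blank areas intersect in projection on the X axis."""
    return a[0] < b[1] and b[0] < a[1]

def first_merge_ok(a: FirstText, b: FirstText) -> bool:
    """Same number of blank areas, and each corresponding pair intersects."""
    return (len(a["blanks"]) == len(b["blanks"])
            and all(spans_intersect(p, q) for p, q in zip(a["blanks"], b["blanks"])))

def merge_adjacent(texts: List[FirstText],
                   can_merge: Callable[[FirstText, FirstText], bool]) -> List[FirstText]:
    """Fold each first text into its predecessor whenever can_merge allows it."""
    merged: List[FirstText] = []
    for t in texts:
        if merged and can_merge(merged[-1], t):
            prev = merged[-1]
            # Simplification (assumption): the merged text keeps the earlier blank areas.
            merged[-1] = {"lines": prev["lines"] + t["lines"], "blanks": prev["blanks"]}
        else:
            merged.append(t)
    return merged
```

Under this reading, two consecutive bands of a two-column layout merge because their margins and gutter line up, while a full-width title above them stays separate because it has a different number of blank areas.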
Preferably, as shown in Fig. 5, step S500 may specifically comprise the following steps:
Step S510. Sort the text lines in the text typesetting row in ascending order of their left boundary values;
Step S520. Create a new text typesetting column and take text lines one by one from the text typesetting row;
Step S530. Determine whether the text line taken out and the new text typesetting column intersect in projection on the X axis; if so, go to step S540; if not, go to step S520;
Step S540. Put the text line taken out into the new text typesetting column in order;
Step S550. Sort the text lines in the text typesetting column in ascending order of their upper boundary values;
Step S560. Create a new text paragraph and take text lines one by one from the text typesetting column;
Step S570. Determine whether two adjacent text lines satisfy the preset paragraph condition; if so, go to step S580; if not, go to step S560;
Step S580. Put the two adjacent text lines into the same text paragraph in order.
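The flow of steps S510 to S580 can be sketched as two small grouping loops. Text lines are again reduced to rectangles, the paragraph test is passed in as a callable, and tracking the X extent of the column being built is an assumed way of testing whether a line "intersects the new column".

```python
from typing import Callable, List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # left, top, right, bottom

def split_columns(row_lines: List[Rect]) -> List[List[Rect]]:
    """S510-S550: split the lines of one typesetting row into columns."""
    columns: List[List[Rect]] = []
    extent: Optional[Tuple[float, float]] = None   # X extent of the current column
    for ln in sorted(row_lines, key=lambda r: r[0]):                          # S510
        if extent is not None and ln[0] < extent[1] and extent[0] < ln[2]:   # S530
            columns[-1].append(ln)                                            # S540
            extent = (min(extent[0], ln[0]), max(extent[1], ln[2]))
        else:
            columns.append([ln])                                              # S520
            extent = (ln[0], ln[2])
    return [sorted(col, key=lambda r: r[1]) for col in columns]               # S550

def split_paragraphs(column: List[Rect],
                     same_paragraph: Callable[[Rect, Rect], bool]) -> List[List[Rect]]:
    """S560-S580: chain adjacent lines that satisfy the paragraph condition."""
    paragraphs: List[List[Rect]] = []
    for ln in column:
        if paragraphs and same_paragraph(paragraphs[-1][-1], ln):   # S570
            paragraphs[-1].append(ln)                               # S580
        else:
            paragraphs.append([ln])                                 # S560
    return paragraphs
```

Any predicate over two adjacent lines can be supplied as `same_paragraph`; the preset paragraph condition of the method is one such predicate.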
Preferably, the preset paragraph condition may comprise the following conditions:
(a) the height difference between the text lines, fabs(YY2 - YY1'), is less than the product of the font size and a height coefficient; and
(b) the vertical spacing between the text lines, fabs(YY2' - YY1'), is less than the product of the average font height and a paragraph coefficient; and
(c) the width difference between the text lines, fabs((XX1' - XX1) - (XX2' - XX2)), is less than the product of the font size and a width coefficient; or,
if the left boundary values of the two text lines are the same, the width of the preceding text line (XX1' - XX1) is greater than the width of the following text line (XX2' - XX2); or,
if the right boundary values of the two text lines are the same, the width of the preceding text line (XX1' - XX1) is less than the width of the following text line (XX2' - XX2).
In the above expressions, by way of example and without limitation, the height coefficient is 0.2, the paragraph coefficient is 1.0, and the width coefficient is 4.
The above are merely preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may be subject to various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of the claims of the present invention.
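The preset paragraph condition can be written as a single predicate. The sketch below uses the example coefficients from the text; the grouping "(a) and (b) and any one of the three width-related alternatives" is an assumed reading of the and/or structure, and the exact-equality comparison of boundary values is a simplification that a real implementation might replace with a tolerance.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (XX, YY, XX', YY'): upper-left and lower-right

HEIGHT_COEFF = 0.2      # example value from the text
PARAGRAPH_COEFF = 1.0   # example value from the text
WIDTH_COEFF = 4.0       # example value from the text

def same_paragraph(prev: Rect, nxt: Rect,
                   font_size: float, avg_font_height: float) -> bool:
    """Return True if two adjacent text lines belong to the same paragraph."""
    xx1, yy1, xx1p, yy1p = prev
    xx2, yy2, xx2p, yy2p = nxt
    w1, w2 = xx1p - xx1, xx2p - xx2
    cond_a = abs(yy2 - yy1p) < font_size * HEIGHT_COEFF            # (a)
    cond_b = abs(yy2p - yy1p) < avg_font_height * PARAGRAPH_COEFF  # (b)
    cond_c = (abs(w1 - w2) < font_size * WIDTH_COEFF               # (c)
              or (xx1 == xx2 and w1 > w2)      # same left edge, shorter following line
              or (xx1p == xx2p and w1 < w2))   # same right edge, longer following line
    # Assumed grouping: (a) and (b) and any one alternative of (c).
    return cond_a and cond_b and cond_c
```

The two boundary-based alternatives plausibly cover a short last line after a full line and an indented first line before a full line, although the source does not state that motivation.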

Claims (4)

1. A method for generating PDF text paragraphs, characterized by comprising:
A. identifying and extracting the text blocks of a PDF text;
B. removing the text blocks that are repeated in different layers and determining text lines, the determined text lines forming a text line set;
C. dividing the text line set in the horizontal direction to obtain one or more first texts, which form a first text set; then dividing each first text in the first text set in the vertical direction to obtain one or more second texts, which form a second text set; and extracting the blank areas between the one or more second texts in the second text set to form a blank area set;
D. merging two adjacent first texts in the first text set to obtain a text typesetting row;
E. dividing the merged text typesetting row to form text typesetting columns and text paragraphs; wherein,
In step A, identifying the text blocks of the PDF text comprises:
A1. determining whether a character in the PDF text is an English character; if so, performing step A3; if not, performing step A2;
A2. treating the character as one text block;
A3. determining whether the spacing between two adjacent characters is less than the product of the font size and a first width coefficient, and whether the two adjacent characters have the same font, font size, and color; if so, the two adjacent characters belong to the same text block; if not, they do not belong to the same text block;
Step B comprises:
B1. obtaining two text blocks with the same index value or adjacent index values; if the two text blocks have identical content and the spacing between them is less than the product of the font size and a second width coefficient, deleting one of them and putting the remaining text block into the text block set;
B2. creating an empty text line and putting the text blocks in the text block set into it in order of array index value, to generate the text line set; within the same text line, two text blocks with the same index value or adjacent index values satisfy: the baseline distance between the two text blocks is less than the product of the font size and a first baseline difference coefficient, and the horizontal spacing between the two text blocks is less than the product of the font size and a second baseline difference coefficient;
Step C comprises:
C1. sorting the text lines in the text line set in ascending order of their upper boundary values;
C2. comparing adjacent text lines pair by pair; if the projections of two adjacent text lines on the Y axis intersect, putting them into the same first text;
C3. sorting the text lines in each first text in ascending order of their left boundary values;
C4. comparing adjacent text lines pair by pair; if the projections of two adjacent text lines on the X axis intersect, putting them into the same second text;
C5. extracting the blank areas between the one or more second texts in the second text set to form the blank area set;
Step D comprises:
performing a first merge of two adjacent first texts in the first text set, the condition for the first merge being that the two adjacent first texts have the same number of blank areas and that their corresponding blank areas intersect in projection on the X axis;
after the first merge, performing a second merge to obtain the text typesetting row, the condition for the second merge being that the blank areas of two adjacent first texts intersect in projection on the X axis;
said step E comprises:
E1. sorting the text lines in the text type-setting line in ascending order of left boundary value;
E2. creating a new text type-setting column, and taking text lines from the text type-setting line one at a time, in order;
E3. determining whether the text line taken out and the newly created text type-setting column intersect in their projections onto the X axis; if so, proceeding to step E4; if not, returning to step E2;
E4. placing the text line taken out into the newly created text type-setting column, in order;
E5. sorting the text lines in the text type-setting column in ascending order of upper boundary value;
E6. creating a new text paragraph, and taking text lines from the text type-setting column one at a time, in order;
E7. determining whether two adjacent text lines satisfy a preset paragraph condition; if so, proceeding to step E8; if not, returning to step E6;
E8. placing the two adjacent text lines into the same text paragraph, in order.
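A minimal sketch of steps E1 to E8, under stated assumptions: the `Line` record is hypothetical, the column extent is tracked as a running X interval, and the paragraph test is passed in as a caller-supplied predicate. The simple gap threshold used in the example is a placeholder, not the preset paragraph condition of the claims.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Bounding box of one text line (hypothetical record, Y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def split_columns(lines):
    """E1-E4: sort by left boundary and grow a column while X projections intersect."""
    columns = []
    extent = None  # (min left, max right) of the column being built
    for line in sorted(lines, key=lambda ln: ln.left):
        if columns and line.left < extent[1] and extent[0] < line.right:
            columns[-1].append(line)
            extent = (min(extent[0], line.left), max(extent[1], line.right))
        else:
            # No intersection with the current column: start a new one (back to E2).
            columns.append([line])
            extent = (line.left, line.right)
    return columns


def split_paragraphs(column, same_paragraph):
    """E5-E8: sort by upper boundary and chain adjacent lines into paragraphs."""
    paragraphs = []
    for line in sorted(column, key=lambda ln: ln.top):
        if paragraphs and same_paragraph(paragraphs[-1][-1], line):
            paragraphs[-1].append(line)
        else:
            # Condition not met: start a new paragraph (back to E6).
            paragraphs.append([line])
    return paragraphs


lines = [Line(0, 0, 40, 10), Line(0, 12, 40, 22), Line(0, 40, 40, 50),
         Line(60, 0, 100, 10)]
columns = split_columns(lines)
# Placeholder predicate: lines less than 5 units apart belong together.
paragraphs = split_paragraphs(columns[0], lambda a, b: b.top - a.bottom < 5)
```

Here the three left-hand lines form one column and the right-hand line a second; within the first column, the large vertical gap before the third line starts a new paragraph.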
2. The method for generating PDF text paragraphs according to claim 1, characterized in that, in step A, text block sets are established for the four rotation angles of 0, 90, 180 and 270 degrees, and an array index is built incrementally to extract the text blocks of the PDF text.
3. The method for generating PDF text paragraphs according to claim 2, characterized in that each extracted text block comprises the baseline, bounding rectangle, font, font size, color and angle of the text block.
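For illustration, the text block attributes listed in claim 3 and the four per-angle sets of claim 2 could be modelled roughly as below. The field names, the extra `text` field, and the snapping of arbitrary angles to the nearest multiple of 90 degrees are assumptions; the incremental array index mentioned in claim 2 is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """One extracted text block (field names are hypothetical)."""
    text: str          # assumed: the characters of the block
    baseline: float    # position of the baseline
    bbox: tuple        # bounding rectangle as (left, top, right, bottom)
    font: str
    font_size: float
    color: str
    angle: float       # rotation in degrees


def bucket_by_angle(blocks):
    """Distribute blocks into the four sets for 0, 90, 180 and 270 degrees."""
    buckets = {0: [], 90: [], 180: [], 270: []}
    for block in blocks:
        # Assumption: snap the angle to the nearest multiple of 90 degrees.
        snapped = int(round(block.angle / 90.0)) * 90 % 360
        buckets[snapped].append(block)
    return buckets


blocks = [
    TextBlock("Title", 10.0, (0, 0, 50, 12), "Helvetica", 12.0, "#000000", 0),
    TextBlock("Margin note", 5.0, (90, 0, 100, 60), "Helvetica", 8.0, "#000000", 90),
]
buckets = bucket_by_angle(blocks)
```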
4. The method for generating PDF text paragraphs according to claim 1, characterized in that the preset paragraph condition comprises the following conditions:
(a) the height difference between the text lines is less than the product of the font size and a height coefficient; and
(b) the vertical spacing between the text lines is less than the product of the average font height and a paragraph coefficient; and
(c) the width difference between the text lines is less than the product of the font size and a width coefficient; or,
if the left boundary values of the two text lines are identical, the width of the preceding text line is greater than the width of the following text line; or,
if the right boundary values of the two text lines are identical, the width of the preceding text line is less than the width of the following text line.
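One possible reading of the preset paragraph condition is sketched below: (a) and (b) must both hold, together with at least one of the three width alternatives. This grouping, the coefficient values, the use of the preceding line's font size, and the averaging of the two line heights are all assumptions made for illustration; the claim gives no numeric values.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Bounding box and font size of one text line (hypothetical record)."""
    left: float
    top: float
    right: float
    bottom: float
    font_size: float

    @property
    def width(self):
        return self.right - self.left

    @property
    def height(self):
        return self.bottom - self.top


def same_paragraph(prev, cur, height_coef=0.5, paragraph_coef=1.0, width_coef=2.0):
    """Return True if two adjacent lines satisfy the assumed paragraph condition."""
    size = prev.font_size                          # assumed: preceding line's size
    avg_height = (prev.height + cur.height) / 2    # assumed: mean of both heights
    # (a) similar line heights
    similar_height = abs(prev.height - cur.height) < size * height_coef
    # (b) small vertical spacing between the lines
    close = (cur.top - prev.bottom) < avg_height * paragraph_coef
    # (c) similar widths, or one of the two alignment alternatives
    similar_width = abs(prev.width - cur.width) < size * width_coef
    left_aligned = prev.left == cur.left and prev.width > cur.width
    right_aligned = prev.right == cur.right and prev.width < cur.width
    return similar_height and close and (similar_width or left_aligned or right_aligned)


body = Line(0, 0, 100, 10, 10)
last = Line(0, 12, 60, 22, 10)    # shorter final line, same left edge
far = Line(0, 60, 100, 70, 10)    # separated by a large vertical gap
```

With these values, `same_paragraph(body, last)` holds through the left-boundary alternative (same left edge, preceding line wider), while `same_paragraph(last, far)` fails condition (b).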
CN2010101363998A 2010-03-25 2010-03-25 Method for generating PDF text paragraphs Expired - Fee Related CN101876967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010101363998A CN101876967B (en) 2010-03-25 2010-03-25 Method for generating PDF text paragraphs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010101363998A CN101876967B (en) 2010-03-25 2010-03-25 Method for generating PDF text paragraphs

Publications (2)

Publication Number Publication Date
CN101876967A CN101876967A (en) 2010-11-03
CN101876967B true CN101876967B (en) 2012-05-02

Family

ID=43019525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010101363998A Expired - Fee Related CN101876967B (en) 2010-03-25 2010-03-25 Method for generating PDF text paragraphs

Country Status (1)

Country Link
CN (1) CN101876967B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479215B (en) * 2010-11-30 2013-10-30 汉王科技股份有限公司 Automatic file exporting method and electronic reading device
CN102546577A (en) * 2010-12-27 2012-07-04 北京大学 Compression and decompression method and system for format data
CN102890826B (en) * 2011-08-12 2015-09-09 北京多看科技有限公司 A kind of method of scanned version document re-ranking version
CN102306294A (en) * 2011-08-23 2012-01-04 深圳市万兴软件有限公司 Method and system for extracting image from portable document format (PDF) file page
CN102306143A (en) * 2011-09-22 2012-01-04 汉王科技股份有限公司 Method and system for generating and editing PDF (portable document format) document
CN102722475A (en) * 2012-05-09 2012-10-10 深圳市万兴软件有限公司 Method for converting form in portable document format (PDF) document into Excel form
US20140280186A1 (en) * 2013-03-15 2014-09-18 International Business Machines Corporation Crowdsourcing and consolidating user notes taken in a virtual meeting
CN104063364A (en) * 2013-03-19 2014-09-24 福建福昕软件开发股份有限公司北京分公司 PDF document recognition method
CN104516868B (en) * 2013-09-30 2018-03-06 北大方正集团有限公司 The streaming restoring method and system in a kind of space of a whole page space
CN105354174B (en) * 2014-08-22 2018-04-10 北大方正集团有限公司 For exporting the composition method and device of epub formatted files
CN104199805B (en) * 2014-09-11 2017-10-20 清华大学 Text joining method and device
CN104850316B (en) * 2015-04-29 2019-02-12 小米科技有限责任公司 E-book font method of adjustment and device
CN105373526B (en) * 2015-10-23 2019-02-15 北大方正集团有限公司 A kind of white space processing method and system in electronic document
CN107391457B (en) * 2017-07-26 2020-10-27 成都科来软件有限公司 Document segmentation method and device based on text line
CN107783956B (en) * 2017-11-23 2019-03-15 掌阅科技股份有限公司 Composition method, electronic equipment and the computer storage medium of text information
CN109815453A (en) * 2018-12-25 2019-05-28 东软集团股份有限公司 Document method of partition, device, storage medium and electronic equipment
CN109948518B (en) * 2019-03-18 2023-06-09 武汉汉王大数据技术有限公司 Neural network-based PDF document content text paragraph aggregation method
CN110222324B (en) * 2019-05-21 2022-11-08 上海阿几网络技术有限公司 Automatic layout device based on character paragraph structure and word size change rate
CN112307713A (en) * 2020-10-27 2021-02-02 广州朗国电子科技有限公司 Automatic text typesetting method and system based on Android system
CN117217172A (en) * 2023-11-09 2023-12-12 金蝶征信有限公司 Table information acquisition method, apparatus, computer device, and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0702322B1 (en) * 1994-09-12 2002-02-13 Adobe Systems Inc. Method and apparatus for identifying words described in a portable electronic document
CN1278260C (en) * 2004-02-06 2006-10-04 珠海金山软件股份有限公司 Typesetting method
CN101206639B (en) * 2007-12-20 2012-05-23 北大方正集团有限公司 Method for indexing complex impression based on PDF

Also Published As

Publication number Publication date
CN101876967A (en) 2010-11-03

Similar Documents

Publication Publication Date Title
CN101876967B (en) Method for generating PDF text paragraphs
CN110163030B (en) PDF framed table extraction method based on image information
CN100578432C (en) Method for directly writing handwriting information
CN101206639B (en) Method for indexing complex impression based on PDF
US20070126793A1 (en) Digital content creation system, digital content creation method, and program product
CN101101588B (en) Document editing device, program, and storage medium
CN1312611C (en) Placement system, programm and method
JP5189497B2 (en) Form creation system, network system using the same, and form creation method.
EP2002352B1 (en) Applying effects to a merged text path
CN1936882A (en) Paging form data-processing method and system
US8799761B2 (en) Method and system for repurposing a spreadsheet to save paper and ink
CN105139334A (en) Multiline text watermark production device
AU660313B2 (en) Method and apparatus for automated page layout of text and graphic elements
JP5950700B2 (en) Image processing apparatus, image processing method, and program
CN104424174B (en) Document processing system and document processing method
CN103488619B (en) Method and device for processing document file
CN103970890B (en) Real-time webpage data generation method and device
CN113962193A (en) Table typesetting method and device, electronic equipment and storage medium
JP6152633B2 (en) Display control apparatus and program
CN112307725A (en) Method for adding table information on two-dimensional drawing interface
CN111126007A (en) HTML (Hypertext markup language) -based medical record document paging algorithm
CN104112287A (en) Method and device for segmenting characters in picture
JP6540546B2 (en) Information processing apparatus and program
JP2000207393A (en) Character arrangement outputting device
CN101571882A (en) System and method for generating minimum outline of characters

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C56 Change in the name or address of the patentee

Owner name: SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY CO., L

Free format text: FORMER NAME: SHENZHEN WONDERSHARE SOFTWARE CO., LTD.

CP03 Change of name, title or address

Address after: 518057 Guangdong city of Shenzhen province Nanshan District Gao Xin Road, room 9 building on the north side of block A901 No. 006 TCL Industry Research Institute building A A Building 8 floor

Patentee after: SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY Co.,Ltd.

Address before: Room 9, block A901 building on the north side of a building 518057 North TCL A of Guangdong Province, Shenzhen city Nanshan District South Road West ten high new technology

Patentee before: WONDERSHARE SOFTWARE Co.,Ltd.

CP03 Change of name, title or address

Address after: 850000 Tibet autonomous region, Lhasa City, New District, west of the East Ring Road, 1-4 road to the north, south of 1-3 Road, Liu Dong building, east of the 8 unit 6, floor 2, No.

Patentee after: WONDERSHARE TECHNOLOGY CO.,LTD.

Address before: 518057 Guangdong city of Shenzhen province Nanshan District Gao Xin Road, room 9 building on the north side of block A901 No. 006 TCL Industry Research Institute building A A Building 8 floor

Patentee before: SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20120502