CN101876967B - Method for generating PDF text paragraphs - Google Patents

Info

Publication number
CN101876967B
Authority
CN
China
Prior art keywords
text
line
set
lines
adjacent
Prior art date
Application number
CN 201010136399
Other languages
Chinese (zh)
Other versions
CN101876967A (en)
Inventor
晏检平
Original Assignee
深圳市万兴软件有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市万兴软件有限公司
Priority to CN 201010136399
Publication of CN101876967A
Application granted
Publication of CN101876967B

Abstract

The invention relates to a method for generating PDF text paragraphs, comprising the following steps: A. identifying and extracting the text blocks of a PDF text; B. removing text blocks that are duplicated across different layers, and determining text lines, which form a text line set; C. dividing the text line set horizontally to obtain one or more first texts, then dividing each first text in the first text set vertically to obtain one or more second texts, and extracting the blank areas between the second texts to form a blank area set; D. merging two adjacent first texts in the first text set to obtain a text layout row; and E. dividing the merged text layout row to form text layout columns and text paragraphs. The text structure produced by the method is easy to convert to RTF format, with good results and a high degree of editability; in addition, the method performs typesetting automatically, so no manual intervention is required.

Description

A method for generating PDF text paragraphs

Technical Field

[0001] The present invention relates to information technology, and more particularly to a method for generating PDF text paragraphs.

Background

[0002] The Portable Document Format (PDF) is a file format developed by Adobe Systems in 1993 for document exchange. Its advantages are that it is cross-platform, preserves the original layout of a document, and is an open standard. A PDF file records the exact position of each text element but no relationships between text elements, so the format is difficult to edit.

[0003] Because of these properties, PDF has become an ideal file format for distributing electronic documents and disseminating formatted information on the Internet. At present, most scientific papers published on the Internet are submitted in PDF format. However, PDF focuses on describing the print layout of a document; it does not describe the data structure of the original document and is not easy to edit. To reuse content from a third party's PDF text, the prevailing approach is to copy the text out manually and paste it into other word-processing software for manual typesetting and editing, which is time-consuming and laborious.

[0004] At present, XML files are generally exported using the capabilities of the typesetting software itself. Such an XML file contains the content of the PDF article, and the output may differ between typesetting programs, but most typesetting software does not export the position information of text blocks. The information about the PDF article is therefore incomplete and often has to be supplemented manually, which is very inefficient. Because most typesetting software can generate PDF files and a large amount of historical data is PDF-based, PDF-based parsing has a wide range of applications. For example, patent application publication No. CN1687926A discloses an "XML-based system and method for extracting PDF text information", which mainly converts the physical structure of PDF text into a logical structure but does not assemble the text into paragraphs or articles. As another example, patent application publication No. CN1776673A discloses a "method for converting PDF text to an XML document", which uses third-party tools to convert the PDF into a flat XML document and then extracts information from the XML using XSLT combined with rules; it presupposes that the PDF pages are relatively simple and structurally consistent, so that simple XPath rules suffice to extract the XML information, and it is not suitable for complex multi-column pages. As a further example, patent application publication No. CN160403A discloses "a method for recovering the reading order of text on newspaper pages", which assembles PDF text into articles but does not address the rules for generating and merging text blocks or the overall process of extracting content, position, and other information. As yet another example, patent application publication No. CN101206639 discloses "a PDF data indexing method", which extracts text blocks and articles from PDFs with complex pages. However, it merges the fonts and font sizes of text blocks during the process, which distorts the converted result; after merging text blocks it needs manual intervention to identify the relationships between blocks and cannot recognize their logical relationships automatically; and the results produced by its block-forming rules cannot be written to RTF format.

Summary of the Invention

[0005] The technical problem to be solved by the present invention is that, in the prior art, editing and typesetting content quoted from a third party's PDF text is time-consuming and laborious. To address this defect, the invention provides a method for generating PDF text paragraphs that allows editing and typesetting to be completed with less time and effort.

[0006] The technical solution adopted by the present invention to solve this technical problem is a method for generating PDF text paragraphs, comprising:

[0007] A. identifying and extracting the text blocks of the PDF text;

[0008] B. removing text blocks that are duplicated across different layers, and determining text lines, the determined text lines forming a text line set;

[0009] C. dividing the text line set in the horizontal direction to obtain one or more first texts, the one or more first texts forming a first text set; then dividing each first text in the first text set in the vertical direction to obtain one or more second texts, the one or more second texts forming a second text set; and extracting the blank areas between the one or more second texts in the second text set to form a blank area set;

[0010] D. merging two adjacent first texts in the first text set to obtain a text layout row;

[0011] E. dividing the merged text layout row to form text layout columns and text paragraphs; wherein:

[0012] In step A, identifying the text blocks of the PDF text comprises:

[0013] A1. determining whether a character in the PDF text is an English character; if so, performing step A3; if not, performing step A2;

[0014] A2. the character is itself a text block;

[0015] A3. determining whether the spacing between two adjacent characters is less than the product of the font size and a first spacing coefficient, and whether the two adjacent characters have the same font, font size, and color; if so, the two adjacent characters belong to the same text block; if not, they do not belong to the same text block;

[0016] Step B comprises:

[0017] B1. obtaining two text blocks with the same or adjacent index values; if the two text blocks have the same content and the distance between them is less than the product of the font size and a second spacing coefficient, deleting one of them and placing the remaining text block into a text block set;

[0018] B2. creating an empty text line and placing the text blocks of the text block set into it in order of array index value, so as to generate the text line set; within one text line, two text blocks with the same or adjacent index values satisfy: the distance between their baselines is less than the product of the font size and a first baseline gap coefficient, and their horizontal spacing is less than the product of the font size and a second baseline gap coefficient;

[0019] Step C comprises:

[0020] C1. sorting the text lines in the text line set in ascending order of their top boundary values;

[0021] C2. comparing adjacent text lines pairwise; if the projections of two adjacent text lines in the Y-axis direction intersect, placing them in the same first text;

[0022] C3. sorting the text lines in each first text in ascending order of their left boundary values;

[0023] C4. comparing adjacent text lines pairwise; if the projections of two adjacent text lines in the X-axis direction intersect, placing them in the same second text;

[0024] C5. extracting the blank areas between the one or more second texts in the second text set to form the blank area set;

[0025] Step D comprises:

[0026] merging two adjacent first texts in the first text set a first time, the condition for the first merge being that the two adjacent first texts have the same number of blank areas and that the projections of the corresponding blank areas in the X-axis direction intersect;

[0027] after the first merge, performing a second merge to obtain the text layout row, the condition for the second merge being that the projections of the blank areas of two adjacent first texts in the X-axis direction intersect;

[0028] Step E comprises:

[0029] E1. sorting the text lines in the text layout row in ascending order of their left boundary values;

[0030] E2. creating a new text layout column and taking text lines from the text layout row one at a time, in order;

[0031] E3. determining whether the projections of the retrieved text line and the newly created text layout column in the X-axis direction intersect; if so, going to step E4; if not, going to step E2;

[0032] E4. placing the retrieved text line, in order, into the newly created text layout column;

[0033] E5. sorting the text lines in the text layout column in ascending order of their top boundary values;

[0034] E6. creating a new text paragraph and taking text lines from the text layout column one at a time, in order;

[0035] E7. determining whether two adjacent text lines satisfy the preset paragraph conditions; if so, going to step E8; if not, going to step E6;

[0036] E8. placing the two adjacent text lines, in order, into the same text paragraph.

[0037] In the method for generating PDF text paragraphs of the present invention, in step A, text block sets are created for the four directions with rotation angles of 0, 90, 180, and 270 degrees, and an array index is built in increments to extract the text blocks of the PDF text.

[0038] In the method for generating PDF text paragraphs of the present invention, each extracted text block includes its baseline, bounding rectangle, font, font size, color, and angle.

[0039] In the method for generating PDF text paragraphs of the present invention, the preset paragraph conditions include the following:

[0040] (a) the height difference between the text lines is less than the product of the font size and a height coefficient; and

[0041] (b) the vertical spacing between the text lines is less than the product of the average font height and a paragraph coefficient; and

[0042] (c) the difference between the widths of the text lines is less than the product of the font size and a width coefficient; or

[0043] if the two text lines have the same left boundary value, the width of the preceding text line is greater than the width of the following text line; or

[0044] if the two text lines have the same right boundary value, the width of the preceding text line is less than the width of the following text line.

[0045] Implementing the method for generating PDF text paragraphs of the present invention has the following beneficial effects: the text structure produced by the method is easy to convert to RTF (Rich Text Format), with good results and a high degree of editability; in addition, the method performs typesetting automatically, requires no manual intervention, and saves time and effort in operation.

Brief Description of the Drawings

[0046] The present invention is further described below with reference to the accompanying drawings and embodiments, in which:

[0047] FIG. 1 is a flowchart of Embodiment 1 of the method for generating PDF text paragraphs of the present invention;

[0048] FIG. 2 is a flowchart of Embodiment 1 of identifying the text blocks of the PDF text in step S100 of FIG. 1;

[0049] FIG. 3 is a flowchart of Embodiment 1 of step S200 in FIG. 1;

[0050] FIG. 4 is a flowchart of Embodiment 1 of step S300 in FIG. 1;

[0051] FIG. 5 is a flowchart of Embodiment 1 of step S500 in FIG. 1.

Detailed Description

[0052] In view of the defect in the prior art that editing content quoted from a third party's PDF text is time-consuming and laborious, the present invention provides a method for generating PDF text paragraphs that saves time and effort when editing.

[0053] Before the method is described in detail, several technical terms used below are introduced:

[0054] Text block: in an English context, a text block is usually one English word; in a non-English context, a text block is usually one character. Text blocks have four possible directions: left to right, top to bottom, right to left, and bottom to top.

[0055] Text line: one or more text blocks on the same line form a text line.

[0056] Text paragraph: one or more adjacent text lines form a text paragraph; paragraphs are generally separated by blank lines.

[0057] Text layout column: one or more text paragraphs arranged from top to bottom form a text layout column.

[0058] Text layout row: one or more text layout columns form a text layout row.

[0059] Page: one or more text layout rows are combined into a page.

[0060] An ordinary article usually consists of one text layout row containing one text layout column, which in turn contains multiple text paragraphs.

[0061] A two-column article usually consists of one text layout row containing two text layout columns, which in turn contain multiple text paragraphs.

[0062] A relatively complex page usually consists of multiple text layout rows; any one of these text layout rows contains one or more text layout columns, and each text layout column in turn contains one or more text paragraphs.
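
The containment hierarchy defined by these terms can be sketched as nested data structures. The following Python sketch is illustrative only: the class and field names are hypothetical, and the text block attributes (baseline, bounding rectangle, font, font size, color, angle) are the ones the description lists for an extracted text block.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextBlock:
    text: str
    left: float      # bounding rectangle
    top: float
    right: float
    bottom: float
    baseline: float
    font: str = ""
    size: float = 0.0
    color: str = ""
    angle: int = 0   # 0, 90, 180 or 270 degrees


@dataclass
class TextLine:
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class Paragraph:
    lines: List[TextLine] = field(default_factory=list)


@dataclass
class LayoutColumn:
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class LayoutRow:
    columns: List[LayoutColumn] = field(default_factory=list)


@dataclass
class Page:
    rows: List[LayoutRow] = field(default_factory=list)


def two_column_page() -> Page:
    """Skeleton of a two-column article: one layout row holding two columns."""
    return Page(rows=[LayoutRow(columns=[LayoutColumn(), LayoutColumn()])])
```

A single-column article would use the same shape with one `LayoutColumn` in its only `LayoutRow`.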

[0063] FIG. 1 shows a flowchart of Embodiment 1 of the method for generating PDF text paragraphs of the present invention. The method comprises the following steps:

[0064] Step S100. Identify and extract the text blocks of the PDF text. Specifically, in this step, text block sets may first be created for the four directions with rotation angles of 0, 90, 180, and 270 degrees. These four text block sets correspond to four PDF text formats. For example, PDF text with a rotation angle of 0 degrees, which is also the most common, is laid out from top to bottom and from left to right, and its text blocks run from left to right, that is, the font angle is 0 degrees. The text formats for the other rotation angles follow by analogy and are not described further here. Then, within each of the four text block sets, an array index is built in increments to extract the text blocks of the PDF text. Each extracted text block includes its baseline, bounding rectangle, font, font size, color, and angle, where the baseline is used to position the text block, the bounding rectangle is used to determine the coordinates of the text block on the X and Y axes, and the array index is related to the baseline of the line on which the text block lies;

[0065] Step S200. Remove text blocks that are duplicated across different layers, and determine text lines, the determined text lines forming a text line set. During conversion, the same page of a PDF text may be rendered in several layers that show identical content, so the duplicated text blocks in different layers need to be deleted;

[0066] Step S300. Divide the text line set in the horizontal direction to obtain one or more first texts, the one or more first texts forming a first text set; then divide each first text in the first text set in the vertical direction to obtain one or more second texts, the one or more second texts forming a second text set; and extract the blank areas between the one or more second texts in the second text set to form a blank area set;

[0067] Step S400. Merge two adjacent first texts in the first text set to obtain a text layout row;

[0068] Step S500. Divide the merged text layout row to form text layout columns and text paragraphs.

[0069] Preferably, as shown in FIG. 2, identifying the text blocks of the PDF text in step S100 may comprise the following steps:

[0070] Step S110. Determine whether a character in the PDF text is an English character; if so, go to step S130; if not, go to step S120;

[0071] Step S120. The character is itself a text block;

[0072] Step S130. Determine whether the spacing between two adjacent characters is less than the product of the font size and the first spacing coefficient, and whether the two adjacent characters have the same font, font size, and color; if so, go to step S140; if not, go to step S150. In this step, let the two adjacent characters be a first character and a second character, where the top-left and bottom-right points of the first character have coordinates (x1, y1) and (x1', y1'), and the top-left and bottom-right points of the second character have coordinates (x2, y2) and (x2', y2'). The spacing between the two adjacent characters can then be expressed as fabs(max(x1, x2) - min(x1', x2')), where max(x1, x2) denotes the larger of x1 and x2, min(x1', x2') denotes the smaller of x1' and x2', and fabs() denotes the absolute value;

[0073] Step S140. The two adjacent characters belong to the same text block;

[0074] Step S150. The two adjacent characters do not belong to the same text block.
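
Steps S110 to S150 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Char` record and function names are hypothetical, "English character" is approximated as an ASCII letter, and the default value 0.3 for the first spacing coefficient is a placeholder, since the description does not give a value for it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    ch: str
    left: float   # x of the top-left point
    right: float  # x' of the bottom-right point
    font: str
    size: float
    color: str


def is_english(ch: str) -> bool:
    # Assumption: an "English character" is approximated as an ASCII letter.
    return ch.isascii() and ch.isalpha()


def char_gap(a: Char, b: Char) -> float:
    # Spacing from step S130: fabs(max(x1, x2) - min(x1', x2')).
    return abs(max(a.left, b.left) - min(a.right, b.right))


def group_chars(chars: List[Char], k1: float = 0.3) -> List[str]:
    """Group characters into text blocks; k1 is an assumed placeholder value."""
    blocks: List[List[Char]] = []
    for c in chars:
        prev = blocks[-1][-1] if blocks else None
        if (
            prev is not None
            and is_english(c.ch)
            and is_english(prev.ch)
            and char_gap(prev, c) < c.size * k1
            and (prev.font, prev.size, prev.color) == (c.font, c.size, c.color)
        ):
            blocks[-1].append(c)   # S140: same text block
        else:
            blocks.append([c])     # S120 / S150: start a new text block
    return ["".join(c.ch for c in b) for b in blocks]
```

With this sketch, closely spaced Latin letters in one font collapse into a word-sized block, while each non-English character stays a block of its own.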

[0075] Preferably, as shown in FIG. 3, step S200 may specifically comprise the following steps:

[0076] Step S210. Obtain two text blocks with the same or adjacent index values; if the two text blocks have the same content and the distance between them is less than the product of the font size and the second spacing coefficient, delete one of them and place the remaining text block into the text block set. In this step, let the two text blocks be a first text block and a second text block, where the top-left and bottom-right points of the first text block have coordinates (X1, Y1) and (X1', Y1'), and the top-left and bottom-right points of the second text block have coordinates (X2, Y2) and (X2', Y2'). The condition that the distance between the two text blocks is less than the product of the font size and the second spacing coefficient can then be expressed as:

[0077] fabs(X1 - X2) < (font size × spacing coefficient 2-1), and

[0078] fabs(X1' - X2') < (font size × spacing coefficient 2-2), and

[0079] fabs(Y1 - Y2) < (font size × spacing coefficient 2-3), and

[0080] fabs(Y1' - Y2') < (font size × spacing coefficient 2-4);

[0081] In the above expressions, for example and without limitation, spacing coefficients 2-1, 2-2, 2-3, and 2-4 may all be 0.2;
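
The duplicate test of step S210 can be sketched as below. The `Block` record and function names are hypothetical. As a simplification, each block is compared only with the most recently kept block, which stands in for "the same or an adjacent index value", and a single coefficient `k` (0.2, the example value given above) is used for all four inequalities.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    left: float    # X
    top: float     # Y
    right: float   # X'
    bottom: float  # Y'
    size: float    # font size


def is_duplicate(a: Block, b: Block, k: float = 0.2) -> bool:
    """True if b repeats a: same content and all four corner offsets are small."""
    limit = a.size * k
    return (
        a.text == b.text
        and abs(a.left - b.left) < limit
        and abs(a.right - b.right) < limit
        and abs(a.top - b.top) < limit
        and abs(a.bottom - b.bottom) < limit
    )


def drop_duplicates(blocks: List[Block], k: float = 0.2) -> List[Block]:
    kept: List[Block] = []
    for b in blocks:
        # Simplification: compare only with the previously kept block.
        if kept and is_duplicate(kept[-1], b, k):
            continue
        kept.append(b)
    return kept
```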

[0082] Step S220. Create an empty text line and place the text blocks of the text block set into it in order of array index value, so as to generate the text line set. Within one text line, two text blocks with the same or adjacent index values satisfy: the distance between their baselines is less than the product of the font size and the first baseline gap coefficient, and their horizontal spacing fabs(max(X1, X2) - min(X1', X2')) is less than the product of the font size and the second baseline gap coefficient. For example and without limitation, the first baseline gap coefficient may be 0.5 and the second baseline gap coefficient may be 0.6.
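
A minimal sketch of step S220 follows, assuming the blocks are already ordered by array index so that neighbours in the input list are the "adjacent index" blocks. The `Block` record and `build_lines` function are hypothetical names; the coefficients 0.5 and 0.6 are the example values from the description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    text: str
    left: float      # X
    right: float     # X'
    baseline: float
    size: float      # font size


def build_lines(blocks: List[Block], k_base: float = 0.5,
                k_gap: float = 0.6) -> List[List[Block]]:
    """Group index-ordered blocks into text lines."""
    lines: List[List[Block]] = []
    for b in blocks:
        if lines:
            prev = lines[-1][-1]
            same_baseline = abs(b.baseline - prev.baseline) < b.size * k_base
            # Horizontal spacing: fabs(max(X1, X2) - min(X1', X2')).
            gap = abs(max(prev.left, b.left) - min(prev.right, b.right))
            if same_baseline and gap < b.size * k_gap:
                lines[-1].append(b)
                continue
        lines.append([b])  # start a new, initially empty text line
    return lines
```

Because of the horizontal spacing test, blocks that share a baseline but sit far apart (for example in different columns) end up in separate text lines, which the later division steps then group.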

[0083] Preferably, as shown in FIG. 4, step S300 may specifically comprise the following steps:

[0084] Step S310. Sort the text lines in the text line set in ascending order of their top boundary values;

[0085] Step S320. Compare adjacent text lines pairwise; if the projections of two adjacent text lines in the Y-axis direction intersect, place them in the same first text;

[0086] Step S330. Sort the text lines in each first text in ascending order of their left boundary values;

[0087] Step S340. Compare adjacent text lines pairwise; if the projections of two adjacent text lines in the X-axis direction intersect, place them in the same second text;
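
Steps S310 to S340 apply the same "sort, then group adjacent items whose projections intersect" pattern twice, once on the Y axis and once on the X axis. The sketch below illustrates this with text lines represented as bounding rectangles; the function names are hypothetical, and treating intervals that merely touch as non-intersecting is an assumption.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


def overlaps(a_lo: float, a_hi: float, b_lo: float, b_hi: float) -> bool:
    # Assumption: intervals that only touch at an endpoint do not intersect.
    return a_lo < b_hi and b_lo < a_hi


def split(lines: List[Rect], lo: int, hi: int) -> List[List[Rect]]:
    """Sort by the lower bound, then group adjacent lines whose projections intersect."""
    groups: List[List[Rect]] = []
    for r in sorted(lines, key=lambda r: r[lo]):
        if groups and overlaps(groups[-1][-1][lo], groups[-1][-1][hi], r[lo], r[hi]):
            groups[-1].append(r)
        else:
            groups.append([r])
    return groups


def divide(lines: List[Rect]) -> List[List[List[Rect]]]:
    """Return first texts, each as a list of second texts."""
    first_texts = split(lines, lo=1, hi=3)                 # S310-S320: Y projections
    return [split(ft, lo=0, hi=2) for ft in first_texts]   # S330-S340: X projections
```

For a page with a two-column band followed by a full-width line, `divide` yields two first texts: the first holds two second texts (one per column), the second holds one.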

[0088] Step S350. Extract the blank areas between the one or more second texts in the second text set to form the blank area set. For example and without limitation, suppose the rotation angle of the PDF text is 0 degrees, so that the layout runs from top to bottom and from left to right, and suppose that in the second text set the top-left points of two second texts have coordinates (XX1, YY1) and (XX2, YY2) and their bottom-right points have coordinates (XX1', YY1') and (XX2', YY2'). The blank areas in the extracted blank area set are then:

[0089] first blank area: (0, XX1);

[0090] second blank area: (XX1', XX2);

[0091] last blank area: (XX2', page width).
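
The blank areas listed above are the horizontal gaps left over once the X extents of the second texts are removed from the page width. A short sketch, with a hypothetical function name and each second text reduced to its (left, right) span:

```python
from typing import List, Tuple

Span = Tuple[float, float]  # (left, right) on the X axis


def blank_areas(spans: List[Span], page_width: float) -> List[Span]:
    """Return the gaps (0, XX1), (XX1', XX2), ..., (XXn', page width)."""
    gaps: List[Span] = []
    cursor = 0.0
    for left, right in sorted(spans):
        gaps.append((cursor, left))
        cursor = right
    gaps.append((cursor, page_width))
    return gaps
```

For two second texts spanning (50, 250) and (300, 500) on a page 550 units wide, this gives the three blank areas (0, 50), (250, 300), and (500, 550).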

[0092] Preferably, step S400 may comprise merging two adjacent first texts in the first text set a first time, the condition for the first merge being that the two adjacent first texts have the same number of blank areas and that the projections of the corresponding blank areas in the X-axis direction intersect.

[0093] Preferably, step S400 may further comprise: after the first merge, performing a second merge to obtain the text layout row, the condition for the second merge being that the projections of the blank areas of two adjacent first texts in the X-axis direction intersect.
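
The first merge can be sketched as follows, with each first text represented only by its list of blank areas. The function names are hypothetical. Only the first merge is shown: the description does not spell out which blank areas must intersect in the second merge, so that looser predicate is left out rather than guessed.

```python
from typing import List, Tuple

Gap = Tuple[float, float]  # blank area as (left, right) on the X axis


def intersects(a: Gap, b: Gap) -> bool:
    return a[0] < b[1] and b[0] < a[1]


def same_layout(g1: List[Gap], g2: List[Gap]) -> bool:
    """First-merge condition: equal count, and corresponding gaps intersect."""
    return len(g1) == len(g2) and all(intersects(a, b) for a, b in zip(g1, g2))


def first_merge(bands: List[List[Gap]]) -> List[List[int]]:
    """Return groups of first-text indices that merge together."""
    rows: List[List[int]] = []
    for i, gaps in enumerate(bands):
        if rows and same_layout(bands[rows[-1][-1]], gaps):
            rows[-1].append(i)
        else:
            rows.append([i])
    return rows
```

Intuitively, consecutive horizontal bands whose column gutters line up are treated as belonging to the same multi-column region.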

[0094] Preferably, as shown in FIG. 5, step S500 may specifically comprise the following steps:

[0095] Step S510. Sort the text lines in the text layout row in ascending order of their left boundary values;

[0096] Step S520. Create a new text layout column and take text lines from the text layout row one at a time, in order;

[0097] Step S530. Determine whether the projections of the retrieved text line and the newly created text layout column in the X-axis direction intersect; if so, go to step S540; if not, go to step S520;

[0098] Step S540. Place the retrieved text line, in order, into the newly created text layout column;

[0099] Step S550. Sort the text lines in the text layout column in ascending order of their top boundary values;

[0100] Step S560. Create a new text paragraph and take text lines from the text layout column one at a time, in order;

[0101] Step S570. Determine whether two adjacent text lines satisfy the preset paragraph conditions; if so, go to step S580; if not, go to step S560;

[0102] Step S580. Place the two adjacent text lines, in order, into the same text paragraph.
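
The two loops of steps S510 to S580 can be sketched as below, with text lines represented as bounding rectangles. The function names are hypothetical, the column's X extent is tracked as a running (left, right) pair, and the paragraph test is passed in as a predicate so that the sketch does not fix the preset paragraph conditions.

```python
from typing import Callable, List, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)


def build_columns(lines: List[Rect]) -> List[List[Rect]]:
    columns: List[List[Rect]] = []
    extent = (0.0, 0.0)  # X extent of the column being filled
    for ln in sorted(lines, key=lambda r: r[0]):                  # S510
        if columns and ln[0] < extent[1] and extent[0] < ln[2]:   # S530
            columns[-1].append(ln)                                # S540
            extent = (min(extent[0], ln[0]), max(extent[1], ln[2]))
        else:
            columns.append([ln])                                  # S520
            extent = (ln[0], ln[2])
    return [sorted(c, key=lambda r: r[1]) for c in columns]       # S550


def build_paragraphs(column: List[Rect],
                     same_paragraph: Callable[[Rect, Rect], bool]) -> List[List[Rect]]:
    paragraphs: List[List[Rect]] = []
    for ln in column:
        if paragraphs and same_paragraph(paragraphs[-1][-1], ln):  # S570
            paragraphs[-1].append(ln)                              # S580
        else:
            paragraphs.append([ln])                                # S560
    return paragraphs
```

As a usage example, `build_paragraphs(col, lambda a, b: b[1] - a[3] < 5)` splits a column wherever the vertical gap between consecutive lines reaches 5 units; that simple gap rule is only a stand-in for the preset paragraph conditions.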

[0103] Preferably, the preset paragraph condition may include the following conditions:

[0104] (a) the height difference between the text lines, fabs(YY2 - YY1'), is less than the product of the font size and the height coefficient; and

[0105] (b) the vertical spacing between the text lines, fabs(YY2' - YY1'), is less than the product of the average font height and the paragraph coefficient; and

[0106] (c) the difference between the widths of the text lines, fabs((XX1' - XX1) - (XX2' - XX2)), is less than the product of the font size and the width coefficient; or

[0107] if the two text lines have the same left boundary value, the width of the preceding text line (XX1' - XX1) is greater than the width of the following text line (XX2' - XX2); or

[0108] if the two text lines have the same right boundary value, the width of the preceding text line (XX1' - XX1) is less than the width of the following text line (XX2' - XX2).

[0109] In the above expressions, for example and without limitation, the height coefficient is 0.2, the paragraph coefficient is 1.0, and the width coefficient is 4.
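
As a rough illustration, the paragraph condition can be written as a single predicate using the example coefficients. Several points here are assumptions rather than statements of the patent: the `LineBox` class and its field names are hypothetical, XX and XX' are taken to be the left and right boundary values, the font size and average font height are supplied by the caller, and the conditions are combined as (a) and (b) and ((c) or the left-edge rule or the right-edge rule).

```python
from dataclasses import dataclass

HEIGHT_COEF = 0.2     # example height coefficient from the text
PARAGRAPH_COEF = 1.0  # example paragraph coefficient from the text
WIDTH_COEF = 4.0      # example width coefficient from the text


@dataclass
class LineBox:
    # Hypothetical record mirroring the symbols XX, XX', YY, YY' of one text line.
    xx: float    # XX  (assumed left boundary value)
    xx_p: float  # XX' (assumed right boundary value)
    yy: float    # YY
    yy_p: float  # YY'


def same_paragraph(l1: LineBox, l2: LineBox,
                   font_size: float, avg_font_height: float) -> bool:
    """Evaluate conditions (a), (b) and (c)/edge rules for two adjacent lines."""
    # (a) fabs(YY2 - YY1') < font size * height coefficient
    cond_a = abs(l2.yy - l1.yy_p) < font_size * HEIGHT_COEF
    # (b) fabs(YY2' - YY1') < average font height * paragraph coefficient
    cond_b = abs(l2.yy_p - l1.yy_p) < avg_font_height * PARAGRAPH_COEF
    w1 = l1.xx_p - l1.xx
    w2 = l2.xx_p - l2.xx
    # (c) fabs((XX1' - XX1) - (XX2' - XX2)) < font size * width coefficient
    cond_c = abs(w1 - w2) < font_size * WIDTH_COEF
    # Same left boundary: the preceding line must be wider.
    left_rule = l1.xx == l2.xx and w1 > w2
    # Same right boundary: the preceding line must be narrower.
    right_rule = l1.xx_p == l2.xx_p and w1 < w2
    return cond_a and cond_b and (cond_c or left_rule or right_rule)
```

Exact equality is used for the boundary comparisons to stay close to the wording; with real PDF coordinates a small tolerance would probably be needed.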

[0110] The foregoing describes only preferred embodiments of the present invention and is not intended to limit it; those skilled in the art may make various modifications and variations to the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the scope of its claims.

Claims (4)

1. A method for generating PDF text paragraphs, characterized by comprising:
A. identifying and extracting the text blocks of a PDF text;
B. removing text blocks that are repeated in different layers, and determining text lines, the determined text lines forming a text line set;
C. dividing the text line set in the horizontal direction to obtain one or more first texts, the one or more first texts forming a first text set; then dividing each first text in the first text set in the vertical direction to obtain one or more second texts respectively, the one or more second texts forming a second text set; and extracting the blank areas between the one or more second texts in the second text set to form a blank area set;
D. merging two adjacent first texts in the first text set to obtain text layout rows;
E. dividing the merged text layout rows to form text layout columns and text paragraphs;
wherein, in step A, identifying the text blocks of the PDF text comprises:
A1. determining whether a character in the PDF text is an English character; if so, performing step A3; if not, performing step A2;
A2. the character is a text block;
A3. determining whether the spacing between two adjacent characters is less than the product of the font size and a first spacing coefficient, and whether the two adjacent characters have the same font, font size and color; if so, the two adjacent characters belong to the same text block; if not, the two adjacent characters do not belong to the same text block;
step B comprises:
B1. obtaining two text blocks with the same or adjacent index values; if the two text blocks have the same content and the spacing between them is less than the product of the font size and a second spacing coefficient, deleting one of the text blocks and placing the remaining text block into the text block set;
B2. creating an empty text line, and placing the text blocks in the text block set into the empty text line in order of array index value, so as to generate the text line set; wherein, within the same text line, two text blocks with the same or adjacent index values satisfy: the baseline distance between the two text blocks is less than the product of the font size and a first baseline gap coefficient, and the horizontal spacing between the two text blocks is less than the product of the font size and a second baseline gap coefficient;
step C comprises:
C1. arranging the text lines in the text line set in ascending order of their upper boundary values;
C2. comparing adjacent text lines pair by pair; if the projections of two adjacent text lines in the Y-axis direction intersect, placing them in the same first text;
C3. arranging the text lines in each first text in ascending order of their left boundary values;
C4. comparing adjacent text lines pair by pair; if the projections of two adjacent text lines in the X-axis direction intersect, placing them in the same second text;
C5. extracting the blank areas between the one or more second texts in the second text set to form the blank area set;
step D comprises: performing a first merge of two adjacent first texts in the first text set, the condition for the first merge being that the two adjacent first texts have the same number of blank areas and the projections of the corresponding blank areas in the X-axis direction intersect; and, after the first merge, performing a second merge to obtain the text layout rows, the condition for the second merge being that the projections of the blank areas of the two adjacent first texts in the X-axis direction intersect;
step E comprises:
E1. arranging the text lines in the text layout row in ascending order of their left boundary values;
E2. creating a new text layout column, and taking text lines out of the text layout row one after another, in order;
E3. determining whether the projections of the extracted text line and the newly created text layout column in the X-axis direction intersect; if so, going to step E4; if not, going to step E2;
E4. placing the extracted text line, in order, into the newly created text layout column;
E5. arranging the text lines in the text layout column in ascending order of their upper boundary values;
E6. creating a new text paragraph, and taking text lines out of the text layout column one after another, in order;
E7. determining whether two adjacent text lines satisfy a preset paragraph condition; if so, going to step E8; if not, going to step E6;
E8. placing the two adjacent text lines, in order, into the same text paragraph.
2. The method for generating PDF text paragraphs according to claim 1, characterized in that, in step A, text block sets are established for the four directions with rotation angles of 0, 90, 180 and 270 degrees respectively, and array indexes are established incrementally to extract the text blocks of the PDF text.
3. The method for generating PDF text paragraphs according to claim 2, characterized in that each extracted text block includes the baseline, bounding rectangle, font, font size, color and angle of the text block.
4. The method for generating PDF text paragraphs according to claim 1, characterized in that the preset paragraph condition includes the following conditions: (a) the height difference between the text lines is less than the product of the font size and the height coefficient; and (b) the vertical spacing between the text lines is less than the product of the average font height and the paragraph coefficient; and (c) the difference between the widths of the text lines is less than the product of the font size and the width coefficient; or, if the two text lines have the same left boundary value, the width of the preceding text line is greater than the width of the following text line; or, if the two text lines have the same right boundary value, the width of the preceding text line is less than the width of the following text line.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010136399 CN101876967B (en) 2010-03-25 2010-03-25 Method for generating PDF text paragraphs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010136399 CN101876967B (en) 2010-03-25 2010-03-25 Method for generating PDF text paragraphs

Publications (2)

Publication Number Publication Date
CN101876967A (en) 2010-11-03
CN101876967B (en) 2012-05-02

Family

ID=43019525

Family Applications (1)

Application Number Priority Date Filing Date Title
CN 201010136399 CN101876967B (en) 2010-03-25 2010-03-25 Method for generating PDF text paragraphs

Country Status (1)

Country Link
CN (1) CN101876967B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102479215B (en) * 2010-11-30 2013-10-30 汉王科技股份有限公司 Automatic file exporting method and electronic reading device
CN102546577A (en) * 2010-12-27 2012-07-04 北京大学 Compression and decompression method and system for format data
CN102890826B (en) * 2011-08-12 2015-09-09 北京多看科技有限公司 A kind of method of scanned version document re-ranking version
CN102306294A (en) * 2011-08-23 2012-01-04 深圳市万兴软件有限公司 Method and system for extracting image from portable document format (PDF) file page
CN102306143A (en) * 2011-09-22 2012-01-04 汉王科技股份有限公司 Method and system for generating and editing PDF (portable document format) document
CN102722475A (en) * 2012-05-09 2012-10-10 深圳市万兴软件有限公司 Method for converting form in portable document format (PDF) document into Excel form
CN104063364A (en) * 2013-03-19 2014-09-24 福建福昕软件开发股份有限公司北京分公司 PDF document recognition method
CN104516868B (en) * 2013-09-30 2018-03-06 北大方正集团有限公司 The streaming restoring method and system in a kind of space of a whole page space
CN105354174B (en) * 2014-08-22 2018-04-10 北大方正集团有限公司 For exporting the composition method and device of epub formatted files
CN104199805B (en) * 2014-09-11 2017-10-20 清华大学 Text joining method and device
CN104850316B (en) * 2015-04-29 2019-02-12 小米科技有限责任公司 E-book font method of adjustment and device
CN105373526B (en) * 2015-10-23 2019-02-15 北大方正集团有限公司 A kind of white space processing method and system in electronic document
CN107783956B (en) * 2017-11-23 2019-03-15 掌阅科技股份有限公司 Composition method, electronic equipment and the computer storage medium of text information

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0702322B1 (en) * 1994-09-12 2002-02-13 Adobe Systems Inc. Method and apparatus for identifying words described in a portable electronic document
CN1278260C (en) * 2004-02-06 2006-10-04 珠海金山软件股份有限公司 Typesetting method
CN101206639B (en) * 2007-12-20 2012-05-23 北京大学 Method for indexing complex impression based on PDF

Also Published As

Publication number Publication date
CN101876967A (en) 2010-11-03


Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model
C56 Change in the name or address of the patentee

Owner name: SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY CO., L

Free format text: FORMER NAME: SHENZHEN WONDERSHARE SOFTWARE CO., LTD.

CP03