CN101853246B - Method and device for converting document format - Google Patents

Method and device for converting document format

Info

Publication number
CN101853246B
Authority
CN
China
Prior art keywords
information
text
graphic
graphics
document
Prior art date
Application number
CN 201010206401
Other languages
Chinese (zh)
Other versions
CN101853246A (en)
Inventor
晏检平
李譞
Original Assignee
深圳市万兴软件有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市万兴软件有限公司
Priority to CN 201010206401
Publication of CN101853246A
Application granted
Publication of CN101853246B

Abstract

The invention belongs to the field of document applications and discloses a method and a device for converting a document format. The method comprises the following steps: acquiring text information and graphic information from an original document; performing text effect recognition on the acquired text information and graphic information to identify the correspondence between the text information and the graphic information; storing the identified correspondence; and generating a document in the format specified by the user according to the stored correspondence. When an original document such as a PDF document is converted into a document of another format, the method and the device preserve the fidelity of the original content, improve the editability of the converted document, and solve the problem of disordered pages after conversion.

Description

A document format conversion method and apparatus

Technical Field

[0001] The present invention belongs to the field of document applications, and in particular relates to a method and an apparatus for converting a document format.

Background Art

[0002] As computers become ever more widespread, paperless offices are increasingly common, and users encounter documents of many kinds in large numbers.

[0003] Take the Portable Document Format (PDF) and office documents as an example: converting a document in PDF format into a document in an office format presents considerable difficulties.

[0004] In a PDF document, the text effects actually seen, such as underline, strikethrough and character shading, are all formed by overlaying graphics on text. Therefore, when a PDF file is converted into an office-format document, if only the raw data content is extracted from the PDF document, text with text effects turns into scattered text mixed with graphics. To restore the text effects, the user has to delete the superfluous graphics manually and set the text effects again.

[0005] This conversion approach not only loses the text effects of the original PDF but also leaves the page disordered after conversion, which makes the converted document very inconvenient to edit.

[0006] How to preserve the fidelity of the original content and improve the editability of the converted document when a document such as a PDF document is converted into another format is one of the research directions in the field of document conversion.

Summary of the Invention

[0007] An object of the present invention is to provide a method for converting a document format, so that when a document such as a PDF document is converted into a document of another format, the fidelity of the original content is preserved and the editability of the converted document is improved.

[0008] An embodiment of the present invention is implemented as a method for converting a document format, the method comprising the following steps:

[0009] acquiring text information and graphic information from an original document;

[0010] performing text effect recognition on the acquired text information and graphic information to identify the correspondence between the text information and the graphic information;

[0011] storing the identified correspondence between the text information and the graphic information;

[0012] generating a document in the format specified by the user according to the stored correspondence between the text information and the graphic information.

[0013] Another object of an embodiment of the present invention is to provide an apparatus for converting a document format, the apparatus comprising:

[0014] an information acquisition module, configured to acquire text information and graphic information from an original document;

[0015] a text effect recognition module, configured to perform text effect recognition on the acquired text information and graphic information and to identify the correspondence between the text information and the graphic information;

[0016] a storage module, configured to store the identified correspondence between the text information and the graphic information;

[0017] a document format conversion module, configured to generate a document in the format specified by the user according to the stored correspondence between the text information and the graphic information.

[0018] In the embodiments of the present invention, text information and graphic information are acquired from a PDF document, text effect recognition is performed on them, the relationship between the text information and the graphic information is identified and stored, and the PDF document is converted into a document of another format according to the stored relationship. As a result, when a document such as a PDF document is converted into another format, the fidelity of the original content is preserved, the editability of the converted document is improved, and the problem of disordered pages after conversion is solved.

Brief Description of the Drawings

[0019] FIG. 1 is a flowchart of the document format conversion method provided by an embodiment of the present invention;

[0020] FIG. 2 is a flowchart of converting a rectangle into a line segment, provided by an embodiment of the present invention;

[0021] FIG. 3 is a schematic diagram of the features of an underline among the effect graphics, provided by an embodiment of the present invention;

[0022] FIG. 4 is a schematic diagram of the features of a strikethrough among the effect graphics, provided by an embodiment of the present invention;

[0023] FIG. 5 is a schematic diagram of the features of shading and highlight among the effect graphics, provided by an embodiment of the present invention;

[0024] FIG. 6 is a flowchart of recognizing and converting enclosed characters, provided by an embodiment of the present invention;

[0025] FIG. 7 is a flowchart of processing effect graphics other than enclosed characters, provided by an embodiment of the present invention;

[0026] FIG. 8 is a flowchart of finding the set of text blocks that can combine with a graphic to form effect text, provided by an embodiment of the present invention;

[0027] FIG. 9 is a structural diagram of the document format conversion apparatus provided by an embodiment of the present invention.

Detailed Description

[0028] To make the objects, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

[0029] FIG. 1 shows the flow of the document format conversion method provided by an embodiment of the present invention.

[0030] In step S101, text information and graphic information are acquired from the original document.

[0031] For ease of description, the embodiments use a PDF document as the original document. Other document formats can of course be converted as well; they are not listed one by one here.

[0032] In step S102, text effect recognition is performed on the acquired text information and graphic information to identify the correspondence between the text information and the graphic information.

[0033] In a specific implementation, the correspondence covers the position and size relationships of the text information and the graphic information;

[0034] and the graphic information contains basic information such as the attributes and features of the graphics.

[0035] In step S103, the identified correspondence between the text information and the graphic information is stored.

[0036] The present invention saves the recognition result in an independent intermediate data structure that marks the special effects the text carries.

[0037] In step S104, a document in the format specified by the user is generated according to the stored correspondence between the text information and the graphic information.

[0038] The graphic information in step S101 includes feature information of effect graphics. The effect graphics are graphics such as underlines, strikethroughs, shading and highlights, and enclosed characters.

[0039] When the document in the user-specified format is generated according to the stored correspondence, the effect graphics that meet the conditions are looked up according to their feature information and are deleted from the graphic information.

[0040] Because everything displayed on a PDF page is expressed by a series of control words in the page content stream, acquiring the text information and graphic information in step S101 proceeds as follows. The document drawing instructions stored in the document are first read in and received; they comprise text drawing instructions and graphic drawing instructions. The text information is then extracted from the received text drawing instructions, and the graphic information is extracted from the received graphic drawing instructions.
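
The split between text drawing instructions and graphic drawing instructions can be sketched as follows. This is only an illustration under stated assumptions, not the patent's implementation: it assumes the content stream has already been tokenized into (operands, operator) pairs, and the function name `split_instructions` and the two operator subsets are choices made for this example. `Tj`, `TJ`, `re`, `m`, `l`, `S` and `f` are standard PDF content stream operators; a real extractor would also have to track the text and graphics state to recover positions.

```python
# Operators that show text vs. operators that build or paint paths.
# Both sets are deliberately incomplete; they only illustrate the split.
TEXT_OPS = {"Tj", "TJ", "'", '"'}
GRAPHIC_OPS = {"m", "l", "re", "h", "S", "f"}


def split_instructions(instructions):
    """Partition (operands, operator) pairs into text and graphic lists."""
    text, graphics = [], []
    for operands, op in instructions:
        if op in TEXT_OPS:
            text.append((operands, op))
        elif op in GRAPHIC_OPS:
            graphics.append((operands, op))
        # Anything else (state changes, colors, ...) is ignored in this sketch.
    return text, graphics


page = [
    (["Hello"], "Tj"),           # show a text string
    ([72, 700, 40, 0.5], "re"),  # thin rectangle, e.g. an underline
    ([], "f"),                   # fill the current path
]
text_ops, graphic_ops = split_instructions(page)
```

Keeping the two lists separate mirrors the later steps, which store text and graphics in separate sets before matching them.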

[0041] As a preferred embodiment of the present invention, storing the identified correspondence between the text information and the graphic information in step S103 further comprises:

[0042] saving the acquired text information into a text block set and the acquired graphic information into a graphic set. Both the extracted text information and the extracted graphic information keep basic information such as position and the size of the bounding rectangle; the graphic information additionally keeps basic graphic information such as the attributes of the edges that make up the graphic and its fill color.
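
One possible shape for the text block set and the graphic set is sketched below. The class and field names (`Rect`, `TextBlock`, `Graphic`, `PageModel`, `effects`) are hypothetical; the source only states which kinds of information are kept (position, bounding rectangle size, edge attributes, fill color), not how they are laid out.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Point = Tuple[float, float]


@dataclass
class Rect:
    """Bounding rectangle; position and size follow from the four values."""
    x0: float
    top: float
    x1: float
    bottom: float


@dataclass
class TextBlock:
    text: str
    bbox: Rect
    # Filled in later by effect recognition, e.g. {"underline"}.
    effects: Set[str] = field(default_factory=set)


@dataclass
class Graphic:
    bbox: Rect
    edges: List[Tuple[Point, Point]]  # the edges that make up the outline
    fill_color: Optional[Tuple[float, float, float]] = None  # None = unfilled


@dataclass
class PageModel:
    """Intermediate structure: one text block set and one graphic set."""
    text_blocks: List[TextBlock] = field(default_factory=list)
    graphics: List[Graphic] = field(default_factory=list)
```

Storing effects on the text block rather than on the graphic means an output writer only has to read text blocks once the effect graphics have been deleted.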

[0043] In a specific implementation, PDF can represent a line segment in several ways. Besides drawing a line segment in the usual sense, another way is to draw a long, thin rectangle of very small width. The latter displays the same as the former. To simplify the recognition logic, the embodiment converts every extracted thin rectangle into a line segment; the conversion steps are shown in FIG. 2:

[0044] Step S21: determine whether the acquired graphic is a quadrilateral; if so, go to step S22; otherwise terminate;

[0045] Step S22: determine whether the acquired graphic is a rectangle; if so, go to step S23; otherwise terminate;

[0046] Step S23: determine whether the width of one of its sides is smaller than the critical width at which a line segment and a rectangle can be told apart when the PDF is displayed normally; if so, go to step S24; otherwise terminate;

[0047] The critical width is an empirical value determined from the properties of a large number of actual PDF documents.

[0048] Step S24: extract the region information of the rectangle, take the midpoints of its two narrow sides as the two endpoints of a line segment, convert the rectangle into that line segment, and replace the original rectangle with it.
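
Steps S21 to S24 can be sketched as a single function. The default `critical_width` of 1.0 is an assumption for illustration only, since the source describes the threshold as an empirical value without giving a number; the rectangle test via diagonals is likewise one of several possible choices.

```python
import math


def _mid(p, q):
    """Midpoint of two (x, y) points."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def thin_rect_to_segment(points, critical_width=1.0, eps=1e-6):
    """Return the centre line of a thin rectangle, or None if it is not one.

    points: the outline's vertices in drawing order, as (x, y) tuples.
    critical_width: assumed empirical threshold (not given in the source).
    """
    if len(points) != 4:                       # S21: quadrilateral?
        return None
    a, b, c, d = points
    # S22: a quadrilateral is a rectangle when its diagonals bisect each
    # other and have equal length.
    if math.dist(_mid(a, c), _mid(b, d)) > eps:
        return None
    if abs(math.dist(a, c) - math.dist(b, d)) > eps:
        return None
    ab, bc = math.dist(a, b), math.dist(b, c)
    if min(ab, bc) >= critical_width:          # S23: thin enough?
        return None
    # S24: join the midpoints of the two narrow sides.
    if ab <= bc:
        return _mid(a, b), _mid(c, d)
    return _mid(b, c), _mid(d, a)


segment = thin_rect_to_segment([(0, 0), (100, 0), (100, 0.5), (0, 0.5)])
```

Returning `None` for every rejected shape lets the caller keep the original graphic untouched, which matches the "otherwise terminate" branches of the flow.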

[0049] The following describes in detail how step S102 performs text effect recognition on the acquired text information and graphic information, identifies the correspondence between them, and identifies the effect graphics.

[0050] Text effect recognition requires the features of each kind of text effect graphic to be established first. This calls for sample analysis of the text information and graphic information in a variety of PDF documents to derive the correspondence between graphic information and the corresponding text information, or the general features of the effect graphics. The embodiment is described in detail using four examples: A, underline; B, strikethrough; C, shading and highlight; and D, enclosed characters.

[0051] A. Underline. Referring to FIG. 3, an underline a is a line segment below the text and parallel to the text direction.

[0052] Analysis of the positional relationship between text and underline segments in a large number of PDF documents shows that an underline segment generally lies between the lower 1/4 position inside the text object's bounding rectangle and the 1/3 position below the outside of the rectangle. These fractions can be adjusted to the circumstances and are not limited to the values listed above. All fractions used in the embodiment are calculated with the height of the bounding rectangle as unit 1.

[0053] In addition, text in a PDF is not split into natural words or characters: a text object may be just a few letters or a single Chinese character, and complete words and sentences appear to the reader through the combination of several text objects. An underline segment can therefore only be required to overlap the text object in the x direction. The features of an underline (for horizontal text only) are thus:

[0054] A1. The underline is a line segment along the horizontal (x) direction;

[0055] A2. The underline intersects one of the text row blocks into which the page is divided;

[0056] A3. The region it occupies in the y direction falls within the range from 3/4 to 4/3 of the text block's y-direction region, and it overlaps the text block in the x direction.
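
Features A1 and A3 can be checked for a single segment and a single text block as sketched below; the row-level prefilter A2 is left out for brevity. The sketch assumes coordinates in which y grows downward (top < bottom), so that the fractions read from the top edge of the bounding rectangle. Native PDF coordinates grow upward and would need flipping first. The names `Box` and `is_underline` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    top: float
    x1: float
    bottom: float  # top < bottom: y grows downward in this sketch


def is_underline(x0, y0, x1, y1, block, lo=3 / 4, hi=4 / 3, eps=1e-6):
    """Check features A1 and A3 for one segment against one text block."""
    if abs(y0 - y1) > eps:                      # A1: horizontal segment
        return False
    left, right = min(x0, x1), max(x0, x1)
    if min(right, block.x1) <= max(left, block.x0):
        return False                            # no overlap in x
    height = block.bottom - block.top
    if height <= 0:
        return False
    rel = (y0 - block.top) / height             # 0 = top edge, 1 = bottom edge
    return lo <= rel <= hi                      # A3: band from 3/4 to 4/3


word = Box(x0=0, top=100, x1=50, bottom=120)
under = is_underline(0, 118, 50, 118, word)     # line near the bottom edge
through = is_underline(0, 110, 50, 110, word)   # line through the middle
```

The `lo` and `hi` parameters are exposed because the source notes that the fractions may be tuned.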

[0057] B. Strikethrough. Referring to FIG. 4, a strikethrough b is a line segment that passes through the text and is parallel to the text direction.

[0058] Analysis of the positional relationship between text and strikethrough segments in a large number of PDF documents shows that most strikethrough segments lie between the upper 1/4 position and the lower 1/4 position of the text object's bounding rectangle. These fractions can be adjusted to the circumstances and are not limited to the values listed above. Because of the uncertainty of text objects in PDF, the features in the x direction are similar to those of the underline. The features of a strikethrough (for horizontal text only) identified by the invention are:

[0059] B1. The strikethrough is a line segment along the horizontal (x) direction;

[0060] B2. The strikethrough intersects one of the text row blocks into which the page is divided;

[0061] B3. The region it occupies in the y direction falls within the range from 1/4 to 3/4 of the text block's y-direction region, and it overlaps the text block in the x direction.
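
Because the underline band (3/4 to 4/3) and the strikethrough band (1/4 to 3/4) meet at 3/4, a horizontal line can be classified by its relative height alone once the x overlap has been confirmed. The sketch below assumes y grows downward and that the x-overlap check has already been done; assigning the shared 3/4 boundary to the underline is a choice made here, not something the source specifies.

```python
def classify_horizontal_line(line_y, block_top, block_bottom):
    """Return "strikethrough", "underline" or None for a horizontal line.

    Coordinates are assumed to grow downward (block_top < block_bottom).
    """
    height = block_bottom - block_top
    if height <= 0:
        return None
    rel = (line_y - block_top) / height  # 0 = top edge, 1 = bottom edge
    if 1 / 4 <= rel < 3 / 4:             # feature B3
        return "strikethrough"
    if 3 / 4 <= rel <= 4 / 3:            # feature A3
        return "underline"
    return None


kind = classify_horizontal_line(110, block_top=100, block_bottom=120)
```

A line above the upper quarter of the text block, or far below it, matches neither band and is left as an ordinary graphic.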

[0062] C. Shading and highlight. Referring to FIG. 5, shading and highlight both appear in PDF as an effect graphic beneath the text, whose region covers most of the text's region.

[0063] Analysis of shading and highlight in a large number of PDF documents shows that the two have exactly the same combination of PDF elements, with the effect graphic covering the text almost completely. Sample analysis shows that the upper edge of the effect graphic generally lies no more than 1/4 above the text's bounding rectangle and no lower than the upper 1/4 inside the rectangle, while its lower edge extends past the lower 1/10 inside the rectangle but no more than 1/4 below the outside of the rectangle. These fractions can be adjusted to the circumstances and are not limited to the values listed above.

[0064] Likewise, because of the uncertainty of text objects in PDF, the x-direction features of shading and highlight resemble those of underline and strikethrough: only an overlap with the text object in the x direction is required. The features of highlight and shading (for horizontal text only) identified by the invention are thus:

[0065] C1. The highlight or shading is a rectangle and has a fill color;

[0066] C2. The highlight or shading intersects one of the text row blocks into which the page is divided;

[0067] C3. The top of the region it occupies in the y direction neither rises more than 1/4 of the text block's y-direction extent above the text block nor falls below the 1/4 position of the text block's y-direction region; its bottom passes the 9/10 position of the text block's y-direction region but does not pass the 5/4 position; and it overlaps the text block in the x direction.
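
Features C1 and C3 can be sketched for one filled rectangle and one text block as follows. As with the other sketches, y is assumed to grow downward, the row-level check C2 is omitted, and the names `Box` and `is_highlight` are hypothetical. The bounds are treated as inclusive here, which is a simplification.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    top: float
    x1: float
    bottom: float  # top < bottom: y grows downward in this sketch


def is_highlight(rect, has_fill, block):
    """Check features C1 and C3 for a rectangle against one text block."""
    if not has_fill:                                  # C1: must be filled
        return False
    if min(rect.x1, block.x1) <= max(rect.x0, block.x0):
        return False                                  # no overlap in x
    height = block.bottom - block.top
    if height <= 0:
        return False
    top_rel = (rect.top - block.top) / height         # 0 = block top
    bottom_rel = (rect.bottom - block.top) / height   # 1 = block bottom
    # C3: top edge within [-1/4, 1/4], bottom edge within [9/10, 5/4].
    return -1 / 4 <= top_rel <= 1 / 4 and 9 / 10 <= bottom_rel <= 5 / 4


word = Box(0, 100, 50, 120)
marked = is_highlight(Box(0, 99, 50, 121), has_fill=True, block=word)
```

The two-sided bounds on both edges are what separate a highlight from a thin underline rectangle or from a large background panel.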

[0068] D. Enclosed characters. Enclosed characters are a rather special type of text effect in PDF.

[0069] Analysis of PDF documents shows that an enclosed character is produced by superimposing two text objects. One of them is the ring character, generally one of the characters "〇、口、Δ、◊". The other is a text object of at most two characters, and the regions of the two text objects largely intersect. The features of an enclosed character identified by the invention are:

[0070] D1. The ring is a text block of exactly one character, which must be one of "〇、口、Δ、◊";

[0071] D2. The ring intersects some text block other than itself, and that text block has at most two characters.

[0072] Specific implementations of course cover a number of other graphics; A, B, C and D above are only examples. Once the basic features of the graphic information have been summarized, they can be used during recognition to screen the graphics in several graded passes, which improves screening efficiency. The feature checks are also relatively independent of each other and can be freely separated or combined.

[0073] In a specific implementation, enclosed characters are recognized and converted first; the process is shown in FIG. 6:

[0074] Step S61: search the text block set for a text block that matches the enclosed-character effect graphic feature (D1); if one is found, go to step S62; if no such text block is found, end recognition;

[0075] Step S62: using the attributes of the effect graphic found, search the text block set for a text block that has at most two characters and intersects the effect graphic; if one is found, go to step S63; if no such text block is found, return to step S61;

[0076] Step S63: depending on which of "〇、口、Δ、◊" the effect graphic's character is, set the attribute of the intersecting text block to one of: character in a circle, character in a square, character in a triangle, or character in a diamond;

[0077] Step S64: delete the effect graphic's text block.
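
Steps S61 to S64 can be sketched as a pass over the text block set. The `TextBlock` layout, the effect names such as `"enclosed-circle"` and the function names are hypothetical, and overlap is tested on bounding boxes only; the source does not say how intersection is measured.

```python
from dataclasses import dataclass, field

RING_KINDS = {"〇": "circle", "口": "square", "Δ": "triangle", "◊": "diamond"}


@dataclass
class TextBlock:
    text: str
    box: tuple  # (x0, top, x1, bottom)
    effects: set = field(default_factory=set)


def boxes_intersect(a, b):
    """True when two (x0, top, x1, bottom) boxes overlap in both axes."""
    return (min(a[2], b[2]) > max(a[0], b[0])
            and min(a[3], b[3]) > max(a[1], b[1]))


def recognize_enclosed(blocks):
    """Fold ring blocks into the effects of the block they enclose."""
    result = list(blocks)
    for ring in blocks:
        kind = RING_KINDS.get(ring.text)                # S61, feature D1
        if kind is None:
            continue
        partner = next(
            (b for b in result
             if b is not ring and len(b.text) <= 2
             and boxes_intersect(ring.box, b.box)),
            None,
        )                                               # S62, feature D2
        if partner is None:
            continue                                    # back to S61
        partner.effects.add("enclosed-" + kind)         # S63
        result = [b for b in result if b is not ring]   # S64
    return result


blocks = [
    TextBlock("〇", (0, 0, 10, 10)),
    TextBlock("1", (2, 2, 8, 8)),
    TextBlock("hello", (20, 0, 60, 10)),
]
remaining = recognize_enclosed(blocks)
```

Blocks are compared by identity (`is not`) rather than equality so that two text blocks with identical content and position are never confused with each other.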

[0078] After enclosed characters have been recognized, the other effect graphics are recognized by traversing the graphics in the graphic set and applying the following steps to each graphic (see FIG. 7):

[0079] Step S71: search for the set of text blocks that can combine with the graphic to form effect text; if such a set is found, the graphic is an effect graphic, so go to step S72; if none is found, end;

[0080] Step S72: compute the size of the region covered by the text block set found; if its width differs too much from the x-direction width of the effect graphic's region, return to step S71 and continue searching for text blocks; otherwise go to step S73;

[0081] Step S73: set the text effect attribute corresponding to the effect graphic on every text block in the set;

[0082] Step S74: delete the effect graphic.
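
The loop over the graphic set (steps S71 to S74) can be sketched as below. The source does not say how large a width difference counts as too large, so the `tolerance` ratio is an assumption, as are the dictionary layout and the `candidates_for` callback, which stands in for the search of step S71.

```python
def apply_text_effects(graphics, candidates_for, tolerance=0.2):
    """Apply effects to text blocks and return the graphics that remain.

    candidates_for(graphic) is a hypothetical callback yielding
    (effect_name, text_blocks) candidates for that graphic (step S71).
    """
    kept = []
    for g in graphics:
        applied = False
        for effect, blocks in candidates_for(g):              # S71
            if not blocks:
                continue
            span = max(b["x1"] for b in blocks) - min(b["x0"] for b in blocks)
            width = g["x1"] - g["x0"]
            if abs(span - width) > tolerance * max(span, width):
                continue                                      # S72: keep looking
            for b in blocks:                                  # S73
                b.setdefault("effects", set()).add(effect)
            applied = True
            break
        if not applied:
            kept.append(g)                                    # not an effect graphic
        # S74: a matched graphic is dropped by not being appended to kept.
    return kept


words = [{"x0": 0, "x1": 48}, {"x0": 50, "x1": 100}]
line = {"x0": 0, "x1": 100}
left_over = apply_text_effects([line], lambda g: [("underline", words)])
```

Returning the surviving graphics instead of mutating the input list keeps deletion (S74) explicit and easy to test.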

[0083] The detailed procedure of step S71 is shown in FIG. 8:

[0084] Step S711: determine whether the graphic matches the first feature of an effect graphic (A1, B1 or C1 above); if it matches, go to step S712; if it does not, the search ends with an empty result;

[0085] Step S712: traverse every row block in the text set and determine whether the relationship between the graphic and the row block matches the second feature of an effect graphic (A2, B2 or C2 above); if it matches, go to step S713; if no row block matches, the search ends with an empty result;

[0086] Step S713: for every text block in the row block found, determine whether the graphic and the text block match the third feature of an effect graphic (A3, B3 or C3 above); if one matches, go to step S714; if none matches, continue with step S712;

[0087] Step S714: return the set of matching text blocks as the search result.

[0088] Each time a check matches, the corresponding text block is recorded in a set of matching text blocks.
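
Steps S711 to S714 form a three-stage filter that runs the cheapest test first. The sketch below passes the three feature tests in as functions, so the same routine serves underline, strikethrough and highlight; the function name, the row layout (`row["blocks"]`) and the predicate signatures are hypothetical.

```python
def find_effect_blocks(graphic, rows, feature1, feature2, feature3):
    """Return the text blocks that pair with graphic, or [] if none do.

    feature1(graphic)         -> bool  # shape test, e.g. A1, B1 or C1
    feature2(graphic, row)    -> bool  # row-level test, e.g. A2, B2 or C2
    feature3(graphic, block)  -> bool  # block-level test, e.g. A3, B3 or C3
    """
    if not feature1(graphic):                         # S711
        return []
    for row in rows:                                  # S712
        if not feature2(graphic, row):
            continue
        hits = [b for b in row["blocks"] if feature3(graphic, b)]  # S713
        if hits:
            return hits                               # S714
    return []


rows = [
    {"id": 1, "blocks": ["a", "b"]},
    {"id": 2, "blocks": ["c", "d"]},
]
found = find_effect_blocks(
    "g", rows,
    feature1=lambda g: True,
    feature2=lambda g, row: row["id"] == 2,
    feature3=lambda g, block: block == "d",
)
```

Ordering the tests from shape to row to block means most graphics are rejected before any per-block geometry is computed, which is the graded screening the description refers to.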

[0089] Once the graphics have been traversed and recognized, text effect recognition for the PDF is complete. By reading the intermediate structure, the text effects supported by the other document formats can be set when those formats are generated. With PDF document elements processed by the embodiment of the present invention, the text in the generated document carries its effects, and both fidelity and readability are greatly improved.
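
How an output writer might consume the recorded effects is sketched below, using HTML as a stand-in target. The source targets office formats and does not describe any writer, so the effect names, the tag mapping and `render_block` are all assumptions made for illustration.

```python
from html import escape

# Assumed mapping from recorded effect names to HTML elements.
TAGS = {"underline": "u", "strikethrough": "s", "highlight": "mark"}


def render_block(text, effects):
    """Wrap escaped text in one element per supported effect."""
    out = escape(text)
    for effect in sorted(effects):       # sorted for a stable nesting order
        tag = TAGS.get(effect)
        if tag:                          # unsupported effects are skipped
            out = f"<{tag}>{out}</{tag}>"
    return out


fragment = render_block("a<b", {"underline"})
```

Skipping effects the target format cannot express matches the description's point that only the effects supported by the output format are set.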

[0090] 而且,本发明实施例可以消除简单的PDF文档元素提取所得的文本与特效图形混合在一起,页面元素混乱的问题,处理过程可以方便的增加新的特效识别。 [0090] Moreover, embodiments of the present invention can eliminate the simple PDF document element extraction pattern resulting mixed text and effects, confusing problem page elements, the process can easily add new effects identified.

[0091] 而且,识别及设置各种文本特效的操作是可分离的,可以自由使用其中的某一个识别功能或者设置功能。 [0091] Moreover, various text effect identification and set operations are separable, using a free or a recognition function in which the function is provided.

[0092] 本发明还提供一种文档格式的转换装置,请参阅图9。 [0092] The present invention further provides an apparatus for document format conversion, see Figure 9.

[0093] 其中,信息获取模块91,用于获取原文档中的文本信息和图形信息; [0093] wherein, the information obtaining module 91, configured to obtain text information and graphic information in the original document;

[0094] 文本特效识别模块92,用于将获取的原文档中的文本信息和图形信息进行文本特效识别,识别所述文本信息与所述图形信息之间的对应关系; [0094] Text and graphical information of the original document text effect identification module 92, configured to obtain special effects in text recognition, identifying the correspondence between the graphic information and text information;

[0095] 存储模块93,用于将识别出的所述文本信息与所述图形信息之间的对应关系进行存储;其中,图形信息包括有所述特效图形的特征信息; [0095] The storage module 93, configured to identify the correspondence between the information and the text information storing pattern; wherein said graphical information comprises feature information pattern of effects;

[0096] 文档格式转换模块94,用于根据所述存储的文本信息与所述图形信息之间的对应关系生成用户指定的文档格式。 [0096] Document format conversion module 94, for generating a document format specified by the user according to a correspondence relationship between the text information stored with the graphic information.

[0097] 其中,所述图形信息包含的图形的属性以及特征,所述文本信息与所述图形信息之间的对应关系包括位置以及大小关系。 [0097] wherein the pattern feature and attribute information included in the pattern, the correspondence between the graphic information and text information includes the position and the magnitude relation. 所述文档格式转换模块94包括: The document format conversion module 94 comprises:

[0098] 线段转换模块941,用于判断获取的图形是否为四边形,判断获取的图形是否为矩形,判断是否具有某一边的宽度是否小于PDF在正常显示时能够区分线段和矩形的临界宽度,以及,将该矩形转化为相应的线段,并用转化后的线段替换掉原来的矩形。 [0098] line conversion module 941, configured to judge whether the acquired quadrangular pattern, determines whether the acquired pattern is rectangular, it is determined whether width is less than one side of the line and can distinguish between PDF rectangular display in the normal critical width, and , the rectangular converted to the corresponding segments, and replace the original segment rectangles after conversion.

[0099] 特效图形查找模块942,用于根据所述特效图形的特征信息查找符合条件的特效图形。 [0099] effect pattern lookup module 942, according to the characteristic pattern of special effect information to find qualified pattern effects.

[0100] 特效图形删除模块943,用于删除所述图形信息中的特效图形。 [0100] Graphics effect deleting module 943, graphics effects for deleting the graphic information.

[0101] 具体的各模块的工作流程在上文已有详细的描述,此处不再赘述。 [0101] Workflow of each specific module has been described in detail above, it will not be repeated here.

[0102] 本发明实施例通过获取PDF文档中的文本信息以及图形信息,并对PDF文档中的文本信息和图形信息进行文本特效识别,识别文本信息和图形信息之间的关系并存储, 根据存储的文本信息和图形信息之间的关系将PDF文档转换为其他格式的文档,使得诸如PDF文档在转换为其他格式的文档时,能够保持原文档内容的还原度,增加文档转换后可编辑性,解决了转换后页面混乱的问题。 [0102] Example embodiments of the present invention, graphic information and text information and PDF documents the relationship between the effects text recognition, the recognized text information and graphic information and text information by obtaining graphic information and stored in the PDF document, according to the storage the relationship between text and graphical information of the PDF documents into other document formats such as PDF documents when converting to other formats of documents, to maintain the reduction of the content of the original document, after document conversion can increase the editorial, after the conversion page to solve the problem of chaotic.

[0103] It should be understood that those of ordinary skill in the art may make improvements or variations based on the above description, and all such improvements and variations shall fall within the scope of protection of the appended claims of the present invention.

Claims (7)

1. A method for converting a document format, characterized in that the method comprises the following steps: acquiring text information and graphic information from an original document; performing text effect recognition on the acquired text information and graphic information to identify the correspondence between the text information and the graphic information; storing the identified correspondence between the text information and the graphic information; and generating the document format specified by the user according to the stored correspondence between the text information and the graphic information; wherein, when acquiring the text information and graphic information from the original document, the document drawing instructions stored in the document are first read in and received, the document drawing instructions comprising text drawing instructions and graphic drawing instructions; then the corresponding text information is extracted from the received text drawing instructions, and the corresponding graphic information is extracted from the received graphic drawing instructions; and, when performing text effect recognition, the features of the various text effect graphics are determined, and sample analysis is performed on the text information and graphic information in various PDF documents to derive the correspondence between graphic information and its corresponding text information, or the general features of special-effect graphics; when storing the identified correspondence between the text information and the graphic information, the acquired text information is saved to a text block set and the acquired graphic information is saved to a graphic set, wherein the extracted text information and graphic information both store basic information including position and bounding rectangle size, and the graphic information additionally stores basic graphic information including the attributes of the edges composing the graphic and its fill color; the graphic information contains special-effect graphics, and, when generating the document format specified by the user according to the stored correspondence between the text information and the graphic information, the method further comprises the step of deleting the special-effect graphics from the graphic information; before the step of deleting the special-effect graphics from the graphic information, the method further comprises: determining whether an acquired graphic is a quadrilateral; if so, determining whether the acquired graphic is a rectangle; if so, determining whether one of its sides has a width smaller than the critical width at which a line segment and a rectangle can be distinguished when the original document is displayed normally; and if so, extracting the region information of the rectangle, converting the rectangle into the corresponding line segment, and replacing the original rectangle with the converted line segment; the special-effect graphics being underlines, strikethroughs, shading and brightness, and circled characters.
2. The method for converting a document format according to claim 1, characterized in that the graphic information includes the attributes and features of the graphics, and the correspondence between the text information and the graphic information includes the position and size relationships between text and graphics.
3. The method for converting a document format according to claim 1, characterized in that, before the step of deleting the special-effect graphics from the graphic information, the method further comprises: storing feature information of the special-effect graphics; and, when generating the document format specified by the user according to the stored correspondence between the text information and the graphic information, finding the special-effect graphics that match the feature information of the special-effect graphics.
4. A device for converting a document format, characterized in that the device comprises: an information acquisition module, configured to acquire text information and graphic information from an original document, wherein, when acquiring the text information and graphic information from the original document, the document drawing instructions stored in the document are first read in and received, the document drawing instructions comprising text drawing instructions and graphic drawing instructions, and then the corresponding text information is extracted from the received text drawing instructions and the corresponding graphic information is extracted from the received graphic drawing instructions; a text effect recognition module, configured to perform text effect recognition on the acquired text information and graphic information and to identify the correspondence between the text information and the graphic information, wherein, when performing text effect recognition, the features of the various text effect graphics are determined, and sample analysis is performed on the text information and graphic information in various PDF documents to derive the correspondence between graphic information and its corresponding text information, or the general features of special-effect graphics; a storage module, configured to store the identified correspondence between the text information and the graphic information, wherein, when storing the identified correspondence, the acquired text information is saved to a text block set and the acquired graphic information is saved to a graphic set, the extracted text information and graphic information both store basic information including position and bounding rectangle size, and the graphic information additionally stores basic graphic information including the attributes of the edges composing the graphic and its fill color; and a document format conversion module, configured to generate the document format specified by the user according to the stored correspondence between the text information and the graphic information; the document format conversion module further comprising: a line segment conversion module, configured to determine whether an acquired graphic is a quadrilateral, whether the acquired graphic is a rectangle, and whether one of its sides has a width smaller than the critical width at which a line segment and a rectangle can be distinguished when the original document is displayed normally, and to convert the rectangle into the corresponding line segment and replace the original rectangle with the converted line segment; the special-effect graphics being underlines, strikethroughs, shading and brightness, and circled characters.
5. The device for converting a document format according to claim 4, characterized in that the graphic information includes the attributes and features of the graphics, and the correspondence between the text information and the graphic information includes the position and size relationships between text and graphics.
6. The device for converting a document format according to claim 4, characterized in that the document format conversion module specifically comprises: a special-effect graphic deletion module, configured to delete the special-effect graphics from the graphic information.
7. The device for converting a document format according to claim 4, characterized in that the graphic information includes feature information of special-effect graphics, and the document format conversion module further comprises: a special-effect graphic lookup module, configured to find the special-effect graphics that match the feature information of the special-effect graphics.
CN 201010206401 2010-06-14 2010-06-14 Method and device for converting document format CN101853246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010206401 CN101853246B (en) 2010-06-14 2010-06-14 Method and device for converting document format

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010206401 CN101853246B (en) 2010-06-14 2010-06-14 Method and device for converting document format

Publications (2)

Publication Number Publication Date
CN101853246A CN101853246A (en) 2010-10-06
CN101853246B true CN101853246B (en) 2012-05-23

Family

ID=42804744

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010206401 CN101853246B (en) 2010-06-14 2010-06-14 Method and device for converting document format

Country Status (1)

Country Link
CN (1) CN101853246B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332002B (en) * 2011-07-28 2013-11-13 深圳市万兴软件有限公司 Method and system for converting file from portable document format (PDF) to electronic publication (EPUB) format
CN102521215B (en) * 2011-11-28 2017-03-22 上海量明科技发展有限公司 Method and system for marking off document
CN103186513B (en) * 2011-12-31 2016-04-27 北大方正集团有限公司 A kind of method of document format conversion and device
DE102012008512A1 (en) 2012-05-02 2013-11-07 Eyec Gmbh Apparatus and method for comparing two graphics and text elements containing files
CN104111913B (en) * 2013-04-16 2017-10-03 北大方正集团有限公司 A kind of processing method and processing device of streaming document
CN105988979B (en) * 2015-02-16 2018-11-16 北京邮电大学 Table extracting method and device based on pdf document
CN105335339A (en) * 2015-10-19 2016-02-17 江苏沃叶软件有限公司 Pdf document conversion method
CN105302782B (en) * 2015-11-23 2019-04-26 魅族科技(中国)有限公司 A kind of information conversion method and device
CN107589926A (en) * 2016-07-08 2018-01-16 珠海金山办公软件有限公司 A kind of lantern slide diagram matching process and device
CN108304361B (en) * 2018-02-12 2019-09-24 掌阅科技股份有限公司 The display methods of the hand-written notes of e-book calculates equipment and computer storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7460710B2 (en) * 2006-03-29 2008-12-02 Amazon Technologies, Inc. Converting digital images containing text to token-based files for rendering
US8155444B2 (en) * 2007-01-15 2012-04-10 Microsoft Corporation Image text to character information conversion
CN101441713B (en) * 2007-11-19 2010-12-08 汉王科技股份有限公司 Optical character recognition method and apparatus of PDF document

Also Published As

Publication number Publication date
CN101853246A (en) 2010-10-06

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model
C56 Change in the name or address of the patentee

Owner name: SHENZHEN WONDERSHARE INFORMATION TECHNOLOGY CO., L

Free format text: FORMER NAME: SHENZHEN WONDERSHARE SOFTWARE CO., LTD.

CP03