WO2024087566A1 - Document conversion method and apparatus, computer-readable storage medium, and computer device - Google Patents

Document conversion method and apparatus, computer-readable storage medium, and computer device

Info

Publication number
WO2024087566A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
row
document
preset
columns
Prior art date
Application number
PCT/CN2023/091535
Other languages
English (en)
French (fr)
Inventor
李乐乐
刘海林
Original Assignee
深圳市网旭科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市网旭科技有限公司
Publication of WO2024087566A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/166Editing, e.g. inserting or deleting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/16File or folder operations, e.g. details of user interfaces specifically adapted to file systems

Definitions

  • the present application relates to the technical field of document conversion, for example, to a document conversion method and apparatus, a computer-readable storage medium, and a computer device.
  • PDF (Portable Document Format) documents and Word documents are, respectively, commonly used non-editable and editable documents. Because non-editable documents cannot be edited, they often need to be converted into editable documents during use. For example, most PDF documents are non-editable; although some software provides editing functions for PDF documents, it is often less convenient than editing Word documents. Therefore, when users want to reuse content from a PDF document and re-edit it into a new document, they usually need to convert the PDF document into a Word document.
  • the present application provides a document conversion method and apparatus, a computer-readable storage medium, and a computer device, which can improve the restoration degree of document conversion.
  • an embodiment of the present application provides a document conversion method for converting a non-editable first document into an editable second document. The document conversion method includes: parsing the first document page by page to obtain all elements of each page of content of the first document, each element having a position and content; mapping all elements of each page of content to a respective preset page, so that each preset page contains all elements of the corresponding page in the first document; constructing, according to the positions and contents of all elements in each preset page, at least one of the following: at least one text block and at least one shape block; determining, according to preset layout rules, the sections and columns of at least one of each text block and each shape block in each preset page, to obtain the layout of all elements of each page of content in the corresponding preset page; and generating the second document according to each preset page with all elements laid out, wherein the element layout of each page of the second document is the same as the element layout of the corresponding preset page.
  • an embodiment of the present application provides a computer device, the computer device comprising a memory and a processor.
  • the memory is configured to store program instructions.
  • the processor is configured to execute the program instructions to implement the above document conversion method.
  • an embodiment of the present application provides a document conversion device, wherein the document conversion device is configured to convert a non-editable first document into an editable second document, and the document conversion device includes a parsing module, a mapping module, a construction module, a layout module, and a generation module.
  • the parsing module is configured to parse the first document page by page to obtain all elements of each page of the first document, each element having a position and content.
  • the mapping module is configured to map all elements of each page of content to a respective preset page, so that each preset page contains all elements of the corresponding page in the first document.
  • the construction module is configured to construct at least one of the following items according to the position and content of all elements in each preset page: at least one text block and at least one shape block.
  • the layout module is configured to determine the sections and columns of at least one of each text block and each shape block in each preset page according to a preset layout rule, and obtain the layout of all elements of each page of the content in the corresponding preset page.
  • the generation module is configured to generate the second document according to each preset page where all elements are laid out; the element layout of each page of the second document is the same as the element layout of the corresponding preset page.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium is used to store computer program instructions, and the computer program instructions are executed by a processor to implement the above-mentioned document conversion method.
  • FIG. 1 is a schematic flow chart of a document conversion method provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of elements in a first document provided in an embodiment of the present application.
  • FIG. 3 is a schematic diagram of mapping a page in a first document to a preset page according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of creating a text block or a shape block in a preset page provided by an embodiment of the present application.
  • FIG. 5 is a flow chart of the sub-steps of step S107 of the document conversion method provided in the first embodiment of the present application.
  • FIG. 6 is a flow chart of the sub-steps of step S107 of the document conversion method provided in the second embodiment of the present application.
  • FIG. 7 is a flow chart of the sub-steps of step S107 of the document conversion method provided in the third embodiment of the present application.
  • FIG. 8 is a flow chart of the sub-steps of step S107 of the document conversion method provided in the fourth embodiment of the present application.
  • FIG. 9 is a flow chart of the sub-steps of step S105 of the document conversion method provided in the first embodiment of the present application.
  • FIG. 10 is a flow chart of the sub-steps of step S105 of the document conversion method provided in the second embodiment of the present application.
  • FIG. 11 is a schematic diagram of a document conversion device provided in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a document conversion process provided in an embodiment of the present application.
  • FIG. 13a is a schematic diagram of a double-column page of a first document provided in an embodiment of the present application.
  • FIG. 13b is a schematic diagram of a single-column page of a first document provided in an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a computer device provided in an embodiment of the present application.
  • the present application provides a document conversion method that converts a non-editable document into an editable document and lays out the content of each page of the non-editable document according to the layout method of the editable document, so that the content layout of the document is the same before and after the conversion.
  • FIG. 1 shows a document conversion method provided by an embodiment of the present application, for converting a non-editable first document into an editable second document.
  • the document conversion method includes the following steps.
  • Step S101 parse the first document page by page to obtain all elements of each page of the first document, each element having a position and content, and the content of each element is text content or format content.
  • an element whose content is text content is a text element
  • an element whose content is format content is a format element.
  • Text content includes text, pictures, graphics, etc.
  • Format content includes formats used to represent text content, such as stroke, underline, table border, fill, text highlight, cell background color, etc.
  • element C1 whose content is text and element C2 whose content is picture are text elements.
  • Element S1 whose content is background color, element S2 whose content is underline, and element S3 whose content is rectangular box are format elements.
  • the text content and the format content representing the text content are represented by different elements respectively.
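  • The split between text elements and format elements can be pictured with a small data model. This is only an illustrative sketch: the names `Kind`, `Element`, and `split_by_kind`, and the sample coordinates, are assumptions made here and do not come from the application.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    TEXT = "text"      # text, pictures, graphics
    FORMAT = "format"  # underline, background color, borders, ...


@dataclass
class Element:
    name: str
    kind: Kind
    position: tuple  # (x, y) on the page; the coordinate convention is assumed
    content: str


def split_by_kind(elements):
    """Separate text elements from format elements."""
    text = [e for e in elements if e.kind is Kind.TEXT]
    fmt = [e for e in elements if e.kind is Kind.FORMAT]
    return text, fmt


# Hypothetical page holding the elements named in the description.
page = [
    Element("C1", Kind.TEXT, (10, 10), "text"),
    Element("C2", Kind.TEXT, (10, 40), "picture"),
    Element("S1", Kind.FORMAT, (10, 10), "background color"),
    Element("S2", Kind.FORMAT, (10, 22), "underline"),
    Element("S3", Kind.FORMAT, (8, 8), "rectangular box"),
]
```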
  • the content of each page of the first document can be obtained by a deep learning model or by a specific parsing method.
  • the first document can be a PDF document, and the content of each page of the first document is each PDF page in the PDF document.
  • PDF documents generally include PDF documents in text format and PDF documents in scanned format.
  • PDF documents in text format can be parsed by the PDFium tool (a PDF rendering engine).
  • the deep learning model can be implemented using an existing model.
  • the first document can also be a picture format document, such as a JPG format document, a PNG format document, and the like.
  • Step S103 mapping all elements of each page content to each preset page, so that each preset page includes all elements of the corresponding page in the first document.
  • each preset page is used to lay out all elements of each page content.
  • the preset page has the same layout method as the second document, and the elements in the preset page can be laid out using the layout method. For example, all elements of each page content of the first document can be laid out in the corresponding preset page to obtain the layout attributes of each element, and the layout attributes obtained after the layout can be supported by the second document. That is, the layout attributes obtained after each element is laid out in the preset page are still effective in the second document.
  • the document conversion method provided by the present application further creates a blank preset page for each page of the first document, and lays out all elements of each page of the first document in the corresponding preset page so that each element has layout attributes.
  • the size of the preset page matches the size of each page of the PDF document.
  • the document conversion method provided by the present application first parses the size of each page in the first document, such as the height and width of the page; then the height and width of the preset page are set according to the height and width of the obtained page, so that the size of each page of the first document matches that of each preset page.
  • FIG. 3 shows a schematic diagram of all elements a1~an in a page 11 of the first document F being mapped one by one to the preset page X according to their positions.
  • in page 11, the positions of the elements a1~an are the coordinates (X01, Y01)...(X0n, Y0n).
  • in the preset page X, the positions of the elements a1~an are the coordinates (X11, Y11)...(X1n, Y1n).
  • each of the elements a1~an has a unique corresponding coordinate in the preset page.
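  • A minimal sketch of step S103 follows. It assumes that, because the preset page is given the same width and height as the source page, each element keeps the same coordinates after mapping; the application only states that each element has a unique corresponding coordinate. The names `Element`, `PresetPage`, and `map_to_preset_page` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    x: float
    y: float
    content: str


@dataclass
class PresetPage:
    width: float
    height: float
    elements: list = field(default_factory=list)


def map_to_preset_page(page_width, page_height, elements):
    """Create a blank preset page sized like the source page and copy
    every element onto it at its corresponding coordinate."""
    preset = PresetPage(width=page_width, height=page_height)
    for e in elements:
        # Assumption: identical page sizes allow an identity coordinate mapping.
        preset.elements.append(Element(e.x, e.y, e.content))
    return preset
```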
  • the second document is an editable document
  • the editable document usually has layout attributes, so that the user can present a better layout when editing the content, so as to achieve neat and standardized content typesetting.
  • the elements of the non-editable first document do not have layout attributes. That is to say, before all elements of each page of the first document are mapped to the preset page, each element does not contain layout attributes.
  • the preset page provides the same layout attributes as those of the second document. After all elements of each page of the first document are laid out in the corresponding preset page, each element has the layout attributes.
  • the second document may be a Word document
  • the layout attributes of the Word document include rows, sections, and columns.
  • the preset page also provides corresponding layout attributes such as rows, sections, and columns. In other words, elements without layout attributes are laid out in the preset page to obtain the layout attributes such as rows, sections, and columns of the elements.
  • Step S105 construct at least one text block and/or at least one shape block according to the positions and contents of all elements in each preset page.
  • the positional relationship between multiple elements is combined with the content to infer whether the multiple elements can be combined together, so that multiple text elements or multiple shape elements are integrated together to form an element block.
  • the element block is a text block or a shape block.
  • the element block in which multiple text elements are integrated together is the text block.
  • Multiple shape elements are combined together to form a shape block.
  • whether to group the elements together to form an element block is mainly determined based on the positional relationship between the elements.
  • if element A and element B overlap in position, the content of element A is text, and the content of element B is a picture, then element A and element B can be combined to form text block C; that is, text block C is equivalent to a picture with text.
  • if element D1, element D2, and element D3 are all text elements located in multiple consecutive lines, have the same font, and have basically the same text length, then element D1, element D2, and element D3 can be combined to form text block D0; that is, text block D0 is equivalent to a text paragraph.
  • each text block includes position coordinates, area information, rows, and elements of each row.
  • the position coordinates of the text block are represented by the position coordinates of the upper left corner of the text block.
  • the area information represents the size information covered by the text block in the preset page, such as the height and width of the text block.
  • the shape block also includes position coordinates, area information, rows, and elements of each row.
  • the rows of the text block in the preset page are represented by the upper and lower side lines in the horizontal direction, and the left and right side lines in the vertical direction.
  • before each element block is created, the row of each element is determined first; that is, the elements in each page are divided into rows.
  • elements are assigned to rows according to their positions. In implementation, elements whose vertical positions differ only slightly are placed in the same row.
  • the position coordinates and covered area of each text block are determined according to the rows of its elements.
  • the upper edge line of the starting row and the lower edge line of the ending row in the text block are used as the upper edge line and lower edge line of the text block, and the left edge line with the smallest horizontal coordinate (i.e. the leftmost left edge line) and the right edge line with the largest horizontal coordinate (i.e. the rightmost right edge line) in the text block are used as the left edge line and right edge line of the text block, thereby determining the area information.
  • all elements are located in the area covered by a text block or a shape block. Constructing text blocks and shape blocks prevents elements from wrapping onto the wrong line, overflowing, or having their font size changed.
  • the construction of the text block is described below by taking the element block D0 shown in FIG4 as an example.
  • the elements D1, D2, and D3 in the element block D0 are located in different rows, wherein the row where the element D1 is located is the starting row of the element block D0, and the upper edge line of the row where the element D1 is located is used as the upper edge line L1 of the element block D0.
  • the row where the element D3 is located is the ending row of the element block D0, and the lower edge line of the row where the element D3 is located is used as the lower edge line L2 of the element block D0.
  • the left edge line of the row where the element D2 is located is the leftmost line in the element block D0, therefore, the left edge line of the row where the element D2 is located is used as the left edge line of the element block D0.
  • the right edge line of the row where the element D1 is located is the rightmost line in the element block D0, therefore, the right edge line of the row where the element D1 is located is used as the right edge line of the element block D0.
  • the upper, lower, left, and right edges L1-L4 of the element block D0 determine the area covered by the element block D0 and obtain the area information of the element block D0.
  • the coordinates of the upper left corner of the element block D0 can be determined according to the area covered by the element block D0.
  • the row of each element block is first determined based on the positions of all elements; then the text block area is determined based on the row, so that the position coordinates of the text block, that is, the position coordinates of the upper left corner of the text block, and the area information of the text block, that is, the width and height, can be determined.
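  • The row division and the computation of a block's area can be sketched as follows. This is an illustration under stated assumptions, not the application's implementation: the y coordinate is assumed to grow downward, the `tolerance` used to decide that two elements share a row is a hypothetical parameter, and `Box`, `group_into_rows`, and `block_area` are names invented here.

```python
from dataclasses import dataclass


@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float


def group_into_rows(boxes, tolerance=2.0):
    """Place boxes whose top coordinates differ by at most `tolerance`
    (compared with the first box of the current row) in the same row."""
    rows = []
    for b in sorted(boxes, key=lambda b: (b.top, b.left)):
        if rows and abs(rows[-1][0].top - b.top) <= tolerance:
            rows[-1].append(b)
        else:
            rows.append([b])
    return rows


def block_area(rows):
    """Return (x, y, width, height) of the block spanning `rows`:
    top of the starting row, bottom of the ending row, leftmost left
    edge and rightmost right edge over all rows."""
    top = min(b.top for b in rows[0])
    bottom = max(b.bottom for b in rows[-1])
    left = min(b.left for row in rows for b in row)
    right = max(b.right for row in rows for b in row)
    return (left, top, right - left, bottom - top)
```

  • The first two values returned by `block_area` are the upper-left corner of the block, and the last two are its width and height.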
  • if elements E1 to E10 are multiple groups of intersecting border lines, elements E1 to E10 are combined to form a shape block E0, which is equivalent to a table.
  • a shape block also includes position coordinates, area information, rows, and elements of each row. That is, the method for constructing element blocks can set corresponding rules according to the different content forms in the PDF, so that element blocks are constructed according to those rules.
  • Step S107 determining the sections and columns of each text block and/or each shape block in each preset page according to the preset layout rule, and obtaining the layout of all elements of each page content in the corresponding preset page.
  • the sections and columns of each text block and/or each shape block in each preset page are determined according to the preset layout rules. For example, the sections of each text block and/or each shape block in each preset page are determined according to the preset layout rules, and then the columns of each text block and/or each shape block are determined.
  • the sections of each text block are determined according to a preset layout rule, and then the columns of each text block are determined. Then, the sections and columns of the corresponding shape blocks are determined according to the sections and columns of each text block.
  • the layout of the shape blocks is determined according to the layout of the text blocks, and the corresponding text blocks and shape blocks are placed in the same section and column. How to determine the sections and columns will be described below.
  • Step S109 Generate a second document according to each preset page with all elements laid out, and the second document is an editable document.
  • the element layout of each page of the second document is the same as the element layout of the corresponding preset page.
  • page F of the first document maps the elements in page F of the first document to the preset page X and lays them out in the preset page, and then generates the second document W.
  • the preset page X is created using an Extensible Markup Language (XML) file.
  • the document conversion method of this embodiment can convert a non-editable first document into an editable second document, and during the conversion process, the above-mentioned document conversion method can also add section and column layouts to all elements of the content of each page of the first document, which greatly improves the situation where all elements in the first document have position deviations due to the lack of layout when they are converted to the second document, thereby improving the degree of restoration when converting non-editable documents into editable documents.
  • Step S107 may include steps S500-S508.
  • steps S500-S508 implement how to divide rows into sections.
  • Step S500 calculating the gaps between all text blocks in each row, row by row.
  • Step S502 determining the number of columns in each row based on the gaps between all text blocks, wherein when the gap between two text blocks is greater than a first preset value, it is determined that the two text blocks are located in two different columns; when the gap between two text blocks is less than or equal to the first preset value, it is determined that the two text blocks are located in the same column.
  • Step S504 detecting the number of columns in each row, row by row.
  • Step S506 when the number of columns in a row is different from the number of columns in the row before the row, the row and the row before the row are divided into different sections.
  • Step S508 when the number of columns in a row is the same as the number of columns in the row before the row, the row and the row before the row are grouped into the same section.
  • the number of columns of each row of elements in a section should be the same, and the layout needs to be consistent. Therefore, in this embodiment, the number of columns of the upper and lower rows is used to determine whether the upper and lower rows are divided into the same section or different sections.
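  • Steps S500-S508 can be sketched as below. The sketch assumes each row is given as a list of `(left, right)` horizontal extents of its text blocks, sorted from left to right; `first_preset` stands for the first preset value, whose actual magnitude the application does not specify. The function names are hypothetical.

```python
def count_columns(row_blocks, first_preset):
    """S500-S502: count columns in one row from the gaps between
    neighbouring text blocks."""
    columns = 1
    for (_, prev_right), (next_left, _) in zip(row_blocks, row_blocks[1:]):
        if next_left - prev_right > first_preset:
            columns += 1  # gap larger than the threshold: a new column starts
    return columns


def split_into_sections(rows, first_preset):
    """S504-S508: start a new section whenever the column count of a row
    differs from that of the row before it."""
    sections = []
    previous = None
    for row in rows:
        n = count_columns(row, first_preset)
        if n == previous:
            sections[-1].append(row)
        else:
            sections.append([row])
        previous = n
    return sections
```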
  • Step S107 includes steps S600-S604.
  • the columns of each row are single-column or double-column. As shown in FIG. 13a and FIG. 13b, the layout of the first document may be a double-column page, as shown in page F1, or a single-column page, as shown in page F2.
  • Steps S600-S604 implement a method of dividing each row of elements into columns.
  • Step S600 calculating the gaps between all text blocks in each row, row by row.
  • Step S602 determining the number of columns in each row based on the gaps between all text blocks, wherein when the gap between two text blocks is greater than a first preset value, it is determined that the two text blocks are located in two different columns; when the gap between two text blocks is less than or equal to the first preset value, it is determined that the two text blocks are located in the same column.
  • Step S604 if the number of columns in a row is greater than two, the row is set to a single column.
  • in PDF files there are generally at most two columns, so a row with more than two detected columns is not actually divided into columns; the gaps are caused by the arrangement of its text blocks. Such rows are therefore determined to be single columns, which quickly resolves the columns of rows with more than two detected columns.
  • Step S107 may also include steps S700-S704.
  • Step S700 If the number of columns in a row is equal to two, and the width of a column in the row is less than the second preset value, the row is set to a single column.
  • the columns in the row may also be determined through the following steps.
  • Step S702 if the number of columns in a row is equal to two, detect the number of columns in the previous section of the row, the column dividing line of the previous section, and the column dividing line of the section where the row is located.
  • Step S704 if the number of columns in the previous section of the row is also equal to two, and the column dividing line of the previous section does not overlap with the column dividing line of the section where the row is located, set the row to a single column.
  • in steps S702-S704, consecutive double-column rows should normally share the same column dividing line, and the rows in one section are either all single-column or all double-column. Therefore, if the number of columns in a row is equal to two but the column dividing line of the previous section does not overlap with the column dividing line of the section where the row is located, the row is not actually divided into columns and is regarded as a single column. This quickly resolves such rows.
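  • The checks for a row detected as two columns (step S700 and steps S702-S704) might be combined as in the sketch below. Applying both checks in one function, treating "does not overlap" as the two dividing lines differing by more than a small `tolerance` in x, and all parameter names are assumptions made for illustration.

```python
def resolve_two_column_row(column_widths, divider_x, prev_section_columns,
                           prev_divider_x, second_preset, tolerance=1.0):
    """Return the final column count (1 or 2) for a row detected as two columns."""
    # S700: a very narrow column suggests the gap is not a real column break.
    if min(column_widths) < second_preset:
        return 1
    # S702-S704: consecutive two-column sections should share a dividing line.
    if prev_section_columns == 2 and abs(divider_x - prev_divider_x) > tolerance:
        return 1
    return 2
```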
  • the present application can divide a row into columns when the number of columns in a row is equal to two.
  • Step S107 may also include steps S800-S808.
  • Step S800 If the number of columns in a row is equal to one and the number of columns in the previous section of the row is two, determine whether the text block in the row is completely located in the left column of the previous section of the row. In a section, from left to right, the column on the left is the left column, and the column on the right is the right column.
  • Step S802 When the text block of the row is completely located in the left column of the previous section of the row, the row is set to double columns.
  • the columns of the line can also be determined by the following steps.
  • Step S804 if the number of columns in a row is equal to one and the number of columns in the previous section of the row is two, detect the height of the previous section of the row.
  • Step S806 determining whether the height of the previous section of the row is less than a third preset value.
  • Step S808 When the height of the previous section of the row is less than a third preset value, the row is set to a single column, and the columns of the previous section of the row are adjusted to a single column.
  • the present application can determine the columns of a row when the number of columns in the row is equal to one.
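  • Steps S800-S808 can be sketched in the same style. The application presents S800-S802 and S804-S808 as alternative ways to determine the columns; checking them one after the other, as done here, is an assumption, as are the function and parameter names. `third_preset` stands for the third preset value.

```python
def resolve_one_column_row(block_left, block_right, prev_left_column,
                           prev_section_height, third_preset):
    """For a row detected as one column that follows a two-column section.

    Returns (row_columns, previous_section_columns)."""
    left_start, left_end = prev_left_column
    # S800-S802: the text block lies entirely inside the previous left column,
    # so the row is treated as belonging to a double-column layout.
    if left_start <= block_left and block_right <= left_end:
        return 2, 2
    # S804-S808: a very short previous section is folded into a single column.
    if prev_section_height < third_preset:
        return 1, 1
    return 1, 2
```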
  • Step S105 includes steps S900-S908.
  • steps S900-S908 how to construct a shape block is implemented.
  • what is constructed is an explicit table shape block, that is, the area of the table with border lines displayed in the PDF document.
  • Step S900 Detect whether there is one or more groups of intersecting border lines in each preset page, each border line corresponding to an element.
  • a set of intersecting border lines includes at least two intersecting border lines.
  • Step S902 If there are multiple groups of intersecting border lines, the areas corresponding to the multiple groups of intersecting border lines are determined as potential explicit table areas, and area information of a shape block is obtained.
  • Step S904 determining the table structure of the potential explicit table area according to the one or more groups of intersecting border lines to obtain one or more cells.
  • imaginary horizontal and vertical lines can be used to calculate whether the horizontal and vertical lines intersect with the vertical border lines or the horizontal border lines. If there are no intersection points, it means that there are merged cells in the horizontal or vertical direction, thereby obtaining the table structure of the explicit table area.
  • Step S906 confirming the area corresponding to each of the cells as the area information of each of the cells.
  • Step S908 obtaining a corresponding explicit table shape block according to the area information of the shape block, the area information of all cells, and the elements corresponding to all border lines.
  • the text block in the corresponding area is set to a table format, thereby improving the convenience of editing in the editable second document and meeting the layout requirements.
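  • A simplified sketch of steps S900-S906 is given below. It assumes axis-aligned border lines, a single table per page, and no merged cells, so it omits the imaginary-line test that the application uses to detect merged cells. The tuple layouts and function names are hypothetical.

```python
def intersects(h, v):
    """h = (y, x_start, x_end) is a horizontal border line;
    v = (x, y_start, y_end) is a vertical border line."""
    hy, hx0, hx1 = h
    vx, vy0, vy1 = v
    return hx0 <= vx <= hx1 and vy0 <= hy <= vy1


def table_cells(h_lines, v_lines):
    """Return (area, cells), each as (left, top, right, bottom),
    or (None, []) when no border lines intersect."""
    # S900: keep only the lines that take part in an intersection.
    hs = [h for h in h_lines if any(intersects(h, v) for v in v_lines)]
    vs = [v for v in v_lines if any(intersects(h, v) for h in h_lines)]
    if not hs or not vs:
        return None, []
    ys = sorted({h[0] for h in hs})
    xs = sorted({v[0] for v in vs})
    # S902: the region spanned by the intersecting lines is the table area.
    area = (xs[0], ys[0], xs[-1], ys[-1])
    # S904-S906: neighbouring grid lines bound one cell each.
    cells = [(x0, y0, x1, y1)
             for y0, y1 in zip(ys, ys[1:])
             for x0, x1 in zip(xs, xs[1:])]
    return area, cells
```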
  • Step S105 includes steps S1001-S1009.
  • what is constructed is an implicit table shape block, that is, an area of the PDF document in which no border lines are displayed but a table is needed to lay out the corresponding text blocks.
  • although the content of some areas is not a table, it can be moved as a whole when editing, such as text blocks that have no borders but are arranged in a table layout.
  • editing these text blocks requires a table to be set up to meet layout needs.
  • Step S1001 determining a potential implicit table area according to the positional relationship between all text blocks.
  • Step S1003 determining the imaginary border lines of the potential implicit table area, each imaginary border line being represented by an imaginary element, which includes position and format content.
  • Step S1005 determining the table structure of the potential implicit table area according to the imaginary border line to obtain one or more cells.
  • the implementation method is the same as step S904.
  • Step S1007 confirming the area corresponding to each of the cells as the area information of each of the cells.
  • Step S1009 obtaining a corresponding implicit table shape block according to the area information of the shape block, the area information of all cells, and the imaginary elements corresponding to all imaginary border lines.
  • the implicit table area is identified and imaginary border lines are added to set the text block content in the area into a table format, thereby improving the convenience of editing in the editable second document and meeting layout requirements.
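  • One possible way to place imaginary border lines (step S1003) is sketched below. The application does not say where the lines are drawn; this sketch assumes a non-empty list of text blocks already known to form a grid, puts the outer lines on the extents of the blocks, and puts each inner line midway between neighbouring columns or rows. `imaginary_borders` is a hypothetical name.

```python
def imaginary_borders(blocks):
    """blocks: non-empty list of (left, top, right, bottom) text blocks
    arranged as a grid. Returns (xs, ys): x positions of vertical lines
    and y positions of horizontal lines."""
    def boundaries(intervals):
        # Merge overlapping extents into column (or row) bands.
        merged = []
        for start, end in sorted(intervals):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        # Outer lines on the extents, inner lines midway between bands.
        lines = [merged[0][0]]
        lines += [(a[1] + b[0]) / 2 for a, b in zip(merged, merged[1:])]
        lines.append(merged[-1][1])
        return lines

    xs = boundaries([(left, right) for left, _, right, _ in blocks])
    ys = boundaries([(top, bottom) for _, top, _, bottom in blocks])
    return xs, ys
```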
  • FIG. 11 is a functional module diagram of a document conversion apparatus 100.
  • the document conversion apparatus 100 is configured to convert a non-editable first document into an editable second document.
  • the document conversion apparatus 100 includes a parsing module 101, a mapping module 103, a construction module 105, a layout module 107, and a generation module 109.
  • the parsing module 101 is configured to parse the first document page by page to obtain all elements of each page of the first document, each element having a position and content.
  • for the implementation of the parsing module 101, refer to the description of step S101 above.
  • the mapping module 103 is configured to map all elements of each page to a corresponding preset page, so that each preset page contains all elements of the corresponding page in the first document.
  • for the mapping module 103, refer to the description of step S103 above.
  • the construction module 105 is configured to construct at least one of the following items according to the positions and contents of all elements in each preset page: at least one text block and at least one shape block.
  • for the construction module 105, refer to the description of step S105 above and its sub-steps.
  • the layout module 107 is configured to determine the sections and columns of at least one of each text block and each shape block in each preset page according to a preset layout rule, and obtain the layout of all elements of each page in the corresponding preset page.
  • for the layout module 107, refer to the description of step S107 above and its sub-steps.
  • the generation module 109 is configured to generate a second document from each preset page in which all elements have been laid out, the second document being an editable document; the element layout of each page of the second document is the same as the element layout of the corresponding preset page.
  • for the generation module 109, refer to the description of step S109 above.
  • FIG. 14 is a schematic diagram of the internal structure of a computer device provided in an embodiment of the present application.
  • the computer device 10 includes a memory 11 and a processor 12.
  • the memory 11 is configured to store program instructions
  • the processor 12 is configured to execute program instructions to implement the above document conversion method.
  • the processor 12 can be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or other data processing chip, which is configured to run program instructions stored in the memory 11.
  • the memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, an SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc.
  • the memory 11 may be an internal storage unit of a computer device, such as a hard disk of a computer device.
  • the memory 11 may also be an external storage device of a computer device, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. equipped on the computer device.
  • the memory 11 may also include both an internal storage unit of a computer device and an external storage device.
  • the memory 11 may not only be configured to store application software and various types of data installed in the computer device, such as codes for implementing a document conversion method, etc., but may also be configured to temporarily store data that has been output or is to be output.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Document Processing Apparatus (AREA)

Abstract

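The present application provides a document conversion method, including: parsing a first document page by page to obtain all elements of each page of the first document, each element having a position and content; mapping all elements of each page to a corresponding preset page; constructing, according to the positions and contents of all elements in each preset page, at least one of the following: at least one text block and at least one shape block; determining, according to a preset layout rule, the sections and columns of at least one of each text block and each shape block in each preset page; and generating a second document from each preset page in which all elements have been laid out. The present application further provides an apparatus applying the document conversion method, a computer-readable storage medium, and a computer device.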

Description

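Document conversion method and apparatus, computer-readable storage medium, and computer device
The present disclosure claims priority to Chinese patent application No. 202211332538.3, filed with the China Patent Office on October 28, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of document conversion, for example, to a document conversion method and apparatus, a computer-readable storage medium, and a computer device.
Background
Documents come in many forms, some non-editable and some editable. For example, Portable Document Format (PDF) documents and Word documents are commonly used non-editable and editable documents, respectively. Because non-editable documents cannot be edited, it is often necessary to convert them into editable documents. For example, most PDF documents are non-editable; some software can edit PDF documents, but usually less conveniently than Word documents. Therefore, when users want to re-edit some of the content of a PDF document to produce new document content, they usually need to convert the PDF document into a Word document.
Methods for converting PDF documents into Word documents already exist. However, a PDF document has no flow layout based on rows, sections, and columns as a Word document does, so conversion methods in the related art generally just typeset the elements in the Word document according to their positions on each PDF page. This sometimes causes elements to overlap or to land on the wrong row; that is, the positions of elements in the converted Word document often deviate from their positions in the PDF document.
Summary
The present application provides a document conversion method and apparatus, a computer-readable storage medium, and a computer device, which can improve the fidelity of document conversion.
In a first aspect, an embodiment of the present application provides a document conversion method for converting a non-editable first document into an editable second document. The method includes: parsing the first document page by page to obtain all elements of each page of the first document, each element having a position and content; mapping all elements of each page to a corresponding preset page, so that each preset page contains all elements of the corresponding page of the first document; constructing, according to the positions and contents of all elements in each preset page, at least one of the following: at least one text block and at least one shape block; determining, according to a preset layout rule, the sections and columns of at least one of each text block and each shape block in each preset page, to obtain the layout of all elements of each page in the corresponding preset page; and generating the second document from each preset page in which all elements have been laid out, the element layout of each page of the second document being the same as that of the corresponding preset page.
In a second aspect, an embodiment of the present application provides a computer device including a memory and a processor. The memory is configured to store program instructions. The processor is configured to execute the program instructions to implement the above document conversion method.
In a third aspect, an embodiment of the present application provides a document conversion apparatus configured to convert a non-editable first document into an editable second document. The apparatus includes a parsing module, a mapping module, a construction module, a layout module, and a generation module. The parsing module is configured to parse the first document page by page to obtain all elements of each page of the first document, each element having a position and content. The mapping module is configured to map all elements of each page to a corresponding preset page, so that each preset page contains all elements of the corresponding page of the first document. The construction module is configured to construct, according to the positions and contents of all elements in each preset page, at least one of the following: at least one text block and at least one shape block. The layout module is configured to determine, according to a preset layout rule, the sections and columns of at least one of each text block and each shape block in each preset page, to obtain the layout of all elements of each page in the corresponding preset page. The generation module is configured to generate the second document from each preset page in which all elements have been laid out; the element layout of each page of the second document is the same as that of the corresponding preset page.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing computer program instructions, the computer program instructions being executed by a processor to implement the above document conversion method.
Brief Description of the Drawings
The drawings needed for describing the embodiments or the related art are introduced below. The drawings described below show only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from the structures shown in these drawings without creative effort.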
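FIG. 1 is a schematic flowchart of the document conversion method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of elements in the first document provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of mapping one page of the first document to a preset page, provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of creating text blocks or shape blocks in a preset page, provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of the sub-steps of step S107 of the document conversion method provided by the first embodiment of the present application;
FIG. 6 is a schematic flowchart of the sub-steps of step S107 of the document conversion method provided by the second embodiment of the present application;
FIG. 7 is a schematic flowchart of the sub-steps of step S107 of the document conversion method provided by the third embodiment of the present application;
FIG. 8 is a schematic flowchart of the sub-steps of step S107 of the document conversion method provided by the fourth embodiment of the present application;
FIG. 9 is a schematic flowchart of the sub-steps of step S105 of the document conversion method provided by the first embodiment of the present application;
FIG. 10 is a schematic flowchart of the sub-steps of step S105 of the document conversion method provided by the second embodiment of the present application;
FIG. 11 is a schematic diagram of the document conversion apparatus provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of the document conversion process provided by an embodiment of the present application;
FIG. 13a is a schematic diagram of a two-column page of the first document provided by an embodiment of the present application;
FIG. 13b is a schematic diagram of a single-column page of the first document provided by an embodiment of the present application;
FIG. 14 is a schematic diagram of the computer device provided by an embodiment of the present application.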
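Detailed Description
The present application is described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The terms "first", "second", "third", "fourth", and the like (if any) in the specification, claims, and drawings of the present application are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate; in other words, the described embodiments can be implemented in orders other than those illustrated or described here. In addition, the terms "include" and "have" and any variants thereof are non-exclusive: for example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the listed steps or units, but may include other steps or units that are not listed or that are inherent to the process, method, product, or device.
It should be noted that descriptions involving "first", "second", and the like in the present application are for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly specifying the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature.
The present application provides a document conversion method that converts a non-editable document into an editable document and lays out the content of each page of the non-editable document using the layout mechanism of the editable document, so that the content layout is the same before and after conversion.
Referring to FIG. 1, a document conversion method provided by an embodiment of the present application is used to convert a non-editable first document into an editable second document. The method includes the following steps.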
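Step S101: parse the first document page by page to obtain all elements of each page of the first document, each element having a position and content. The content of each element is either text content or format content.
In this embodiment, an element whose content is text content is a text element, and an element whose content is format content is a format element. Text content includes text, pictures, graphics, and the like. Format content includes formatting that applies to text content, such as strokes, underlines, table borders, fills, text highlighting, and cell background colors. As shown in FIG. 2, element C1 (whose content is text) and element C2 (whose content is a picture) are text elements. Element S1 (a background color), element S2 (an underline), and element S3 (a rectangular frame) are format elements.
It will be understood that text content and the format content that applies to it are represented by different elements. The content of each page of the first document can be obtained through a deep learning model or through a specific method. In an implementation, the first document may be a PDF document, and each page of the first document is a PDF page. PDF documents are generally either text-format or scanned. A text-format PDF document can be parsed with the PDFium tool (a PDF rendering engine), whereas a scanned PDF document usually has to be parsed with deep learning techniques. The deep learning model may be an existing model. It will be understood that, in some feasible embodiments, the first document may also be an image-format document, such as a JPG or PNG document.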
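Step S103: map all elements of each page to a corresponding preset page, so that each preset page contains all elements of the corresponding page of the first document.
In this embodiment, each preset page is used to lay out all elements of one page. The preset page uses the same layout method as the second document, and elements in the preset page can be laid out with that method. For example, all elements of each page of the first document can be laid out in the corresponding preset page to obtain a layout attribute for each element, and the resulting layout attributes are supported by the second document. That is, the layout attributes that each element obtains in the preset page remain effective in the second document.
In this embodiment, the document conversion method also creates a blank preset page for each page of the first document and lays out all elements of that page in the preset page so that every element carries layout attributes. The size of the preset page matches the size of the corresponding PDF page. For example, the method first parses the size of each page of the first document, such as the page height and width, and then sets the height and width of the preset page accordingly, so that each page of the first document and its preset page match in size.
In this embodiment, all elements of each page of the first document are mapped one-to-one into the preset page according to their positions. FIG. 3 shows all elements a1 to an of a page 11 of the first document F being mapped one-to-one into the preset page X according to their positions. The positions of elements a1 to an are the coordinates (X01, Y01) … (X0n, Y0n); after mapping to the preset page X, their positions are the coordinates (X11, Y11) … (X1n, Y1n). In other words, each of the elements a1 to an has a unique corresponding coordinate in the preset page. It will be understood that the second document is an editable document, and editable documents usually carry layout attributes so that content is presented in a good layout during editing, keeping the typesetting neat and regular. The elements of the non-editable first document have no layout attributes; that is, before being mapped to the preset page, no element of a page of the first document contains a layout attribute. In this embodiment, the preset page provides the same layout attributes as the second document, and once all elements of a page of the first document have been laid out in the corresponding preset page, each element carries layout attributes.
In an implementation, the second document may be a Word document, whose layout attributes include rows, sections, and columns. The preset page likewise provides the corresponding row, section, and column layout attributes. That is, elements without layout attributes are laid out in the preset page and thereby obtain row, section, and column layout attributes.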
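Step S105: construct at least one text block and/or at least one shape block according to the positions and contents of all elements in each preset page.
In this embodiment, the positional relationships among multiple elements, combined with their contents, are used to infer whether the elements can be grouped, so that multiple text elements or multiple shape elements are merged into an element block. That is, an element block is either a text block or a shape block. An element block formed by merging multiple text elements is a text block; multiple shape elements combined form a shape block.
Referring to FIG. 4, in this embodiment the positional relationship between elements is the main criterion for deciding whether to combine them into an element block.
For example, if element A and element B overlap in position, the content of element A is text, and the content of element B is a picture, then element A and element B can be combined to form text block C; that is, text block C corresponds to a picture with text.
As another example, elements D1, D2, and D3 are all text elements located on consecutive rows. If they have the same font and roughly the same text length, they can be combined to form text block D0, which corresponds to a text paragraph.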
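In this embodiment, each text block includes position coordinates, area information, rows, and the elements of each row. The position coordinates of a text block are those of its upper-left corner. The area information describes the size of the region of the preset page covered by the text block, such as its height and width. A shape block likewise includes position coordinates, area information, rows, and the elements of each row. In this embodiment, a row of a text block in the preset page is represented by its upper and lower edges in the horizontal direction and its left and right edges in the vertical direction.
In this embodiment, before each element block is created, the row of each element is determined first; that is, the elements of each page are divided into rows. In an implementation, elements are assigned to rows according to their positions: elements whose vertical positions differ only slightly are assigned to the same row.
Once the rows have been determined, the position coordinates and covered area of each text block are determined from the rows. The upper edge of the first row and the lower edge of the last row of the text block serve as the upper and lower edges of the text block; the left edge with the smallest horizontal coordinate (the leftmost left edge) and the right edge with the largest horizontal coordinate (the rightmost right edge) in the text block serve as its left and right edges, which determines the area information. All elements thus lie within the region covered by a text block or a shape block. Constructing text blocks or shape blocks helps prevent elements from landing on the wrong row, overflowing, or having their font size changed.
The construction of a text block is described below using element block D0 in FIG. 4 as an example. Elements D1, D2, and D3 of element block D0 are on different rows. The row of element D1 is the first row of element block D0, so the upper edge of that row is taken as the upper edge L1 of element block D0. The row of element D3 is the last row, so its lower edge is taken as the lower edge L2 of element block D0. The left edge of the row of element D2 is the leftmost line in element block D0 and is therefore taken as the left edge of element block D0. The right edge of the row of element D1 is the rightmost line and is therefore taken as the right edge of element block D0. The upper, lower, left, and right edges L1 to L4 of element block D0 define the region it covers, which gives its area information. The coordinates of the upper-left corner of element block D0 can be determined from that region.
It will be understood that, when a text block is constructed, the rows of each element block are first determined from the positions of all elements; the text block region is then determined from the rows, which yields the position coordinates of the text block (the coordinates of its upper-left corner) and its area information (its width and height).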
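The row grouping and bounding-edge rules above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the `Element` class, the function names, the top-left coordinate origin, and the tolerance `tol` (standing in for "vertical positions differ only slightly") are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Element:
    # Hypothetical element record; y grows downward from the page top.
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def group_rows(elements, tol=2.0):
    """Put elements whose top edges differ by at most `tol` into one row."""
    rows = []
    for el in sorted(elements, key=lambda e: (e.y, e.x)):
        if rows and abs(el.y - rows[-1][0].y) <= tol:
            rows[-1].append(el)
        else:
            rows.append([el])
    return rows


def block_bounds(rows):
    """Return (x, y, width, height) of the block covering all rows."""
    top = min(e.y for e in rows[0])              # upper edge of the first row
    bottom = max(e.y + e.h for e in rows[-1])    # lower edge of the last row
    left = min(e.x for r in rows for e in r)     # leftmost left edge
    right = max(e.x + e.w for r in rows for e in r)  # rightmost right edge
    return (left, top, right - left, bottom - top)


# Three stacked lines, loosely modelled on D1, D2, D3 in FIG. 4.
d1 = Element(12, 10, 90, 10)
d2 = Element(10, 22, 88, 10)
d3 = Element(12, 34, 60, 10)
print(block_bounds(group_rows([d3, d1, d2])))
```

As in the D0 example, the left edge comes from the second line and the right edge from the first line; the returned tuple carries both the upper-left coordinates and the width and height.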
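Referring again to FIG. 4, as another example, elements E1 to E10 are multiple groups of mutually intersecting border lines, so they are combined to form shape block E0, which corresponds to a table. As a further example, if multiple elements represent the formatting of one text block, they are combined into an element block, namely a shape block. A shape block likewise includes position coordinates, area information, rows, and the elements of each row. It will be understood that rules for constructing element blocks can be set according to the different forms of content in a PDF, and element blocks are then constructed according to those rules. It will also be understood that, when multiple elements are combined into element blocks before layout, only the layout of the element blocks has to be computed rather than the layout of every element, which greatly reduces the amount of computation. The creation of a shape block is described below using a table as an example; see the description of constructing shape blocks.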
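Step S107: determine, according to a preset layout rule, the sections and columns of each text block and/or each shape block in each preset page, to obtain the layout of all elements of each page in the corresponding preset page.
In this embodiment, determining the sections and columns of each text block and/or each shape block in each preset page according to the preset layout rule may, for example, consist of first determining the section of each text block and/or shape block in each preset page according to the preset layout rule and then determining the column of each text block and/or shape block.
Optionally, the section of each text block is determined according to the preset layout rule, and then the column of each text block. The section and column of the corresponding shape block are then determined from those of each text block. The layout of a shape block is determined by the layout of the text block: a text block and its corresponding shape block are placed in the same section and column. How sections and columns are determined is described below.
Step S109: generate the second document, which is an editable document, from each preset page in which all elements have been laid out. The element layout of each page of the second document is the same as that of the corresponding preset page.
It will be understood that, once the layout attributes of all elements of each page have been determined in the corresponding preset page, the layout of all elements of each preset page is determined; converting each laid-out preset page yields the second document, which is then saved. As shown in FIG. 12, the elements of page F of the first document are mapped into preset page X and laid out there, and the second document W is then generated. Taking the case where the second document W is a Word document as an example, it will be understood that in this method the preset page X is created as an Extensible Markup Language (XML) file.
The document conversion method of this embodiment converts a non-editable first document into an editable second document and, during conversion, adds section and column layout to all elements of each page of the first document. This greatly reduces the position deviations that arise when elements of the first document are transferred to the second document without layout, and thus improves the fidelity of converting a non-editable document into an editable one.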
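Referring to FIG. 5, a flowchart of the sub-steps of step S107 provided by the first embodiment of the present application is shown. Step S107 may include steps S500 to S508, which implement the division of elements into sections.
Step S500: compute, row by row, the gaps between all text blocks in each row.
Step S502: determine the number of columns of each row from the gaps between all text blocks, where two text blocks are determined to be in two different columns when the gap between them is greater than a first preset value, and in the same column when the gap is less than or equal to the first preset value.
Step S504: check the number of columns of each row, row by row.
Step S506: when the number of columns of a row differs from that of the preceding row, place the row and the preceding row in different sections.
Step S508: when the number of columns of a row is the same as that of the preceding row, place the row and the preceding row in the same section.
It will be understood that every row in a section should have the same number of columns, that is, a consistent layout. In this embodiment, therefore, the column counts of adjacent rows determine whether the rows belong to the same section or to different sections.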
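Steps S500 to S508 can be sketched as two small Python functions. This is a hedged illustration only: the function names, the representation of a text block as a `(left, right)` horizontal span, and the parameter `gap_threshold` (playing the role of the first preset value) are assumptions, and a row with no text blocks is arbitrarily counted as one column.

```python
def count_columns(row_blocks, gap_threshold):
    """S500/S502: count columns in one row from the gaps between text blocks.

    row_blocks is a list of (left, right) horizontal spans of the text
    blocks in the row; a gap larger than gap_threshold starts a new column.
    """
    spans = sorted(row_blocks)
    columns = 1
    for (_, prev_right), (left, _) in zip(spans, spans[1:]):
        if left - prev_right > gap_threshold:
            columns += 1
    return columns


def split_sections(column_counts):
    """S504-S508: group consecutive row indices that share a column count."""
    sections = []
    for i, n in enumerate(column_counts):
        if sections and column_counts[sections[-1][-1]] == n:
            sections[-1].append(i)   # same count as the previous row
        else:
            sections.append([i])     # count changed: start a new section
    return sections


counts = [count_columns(r, gap_threshold=20) for r in [
    [(0, 400)],               # one wide block
    [(0, 190), (220, 400)],   # gap of 30, wider than the threshold
    [(0, 190), (220, 400)],
    [(0, 400)],
]]
print(counts, split_sections(counts))
```

Each inner list returned by `split_sections` holds the row indices of one section, so a change in column count between adjacent rows always produces a section boundary.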
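Referring to FIG. 6, a flowchart of the sub-steps of step S107 provided by the second embodiment of the present application is shown. Step S107 includes steps S600 to S604. In this embodiment, each row is either single-column or double-column. As shown in FIGS. 13a and 13b, page F1 is a page of the first document whose layout corresponds to two columns, and page F2 is a page whose layout corresponds to a single column. Steps S600 to S604 implement one method of assigning columns to the elements of each row.
Step S600: compute, row by row, the gaps between all text blocks in each row.
Step S602: determine the number of columns of each row from the gaps between all text blocks, where two text blocks are determined to be in two different columns when the gap between them is greater than a first preset value, and in the same column when the gap is less than or equal to the first preset value.
Step S604: if the number of columns of a row is greater than two, set the row to single-column.
In the above embodiment, a row with more than two columns is set to single-column. A PDF file generally has at most two columns, so more than two apparent columns indicates that the text blocks of the row are not actually arranged in columns and that the gaps result from how the text blocks are arranged. Such a row is therefore treated as single-column, which quickly settles the columns of rows whose column count exceeds two.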
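Referring to FIG. 7, a flowchart of the sub-steps of step S107 provided by the third embodiment of the present application is shown. Step S107 may further include steps S700 to S704.
Step S700: if the number of columns of a row equals two and one of the columns of the row has a column width smaller than a second preset value, set the row to single-column.
It will be understood that a row with two columns of which one is narrow indicates that the text blocks of the row are not actually arranged in columns, so the row is treated as single-column. This quickly settles rows whose column count equals two but whose column width is small.
In this embodiment, if the number of columns of a row equals two but the column width is not smaller than the preset value, the columns of the row can also be determined through the following steps.
Step S702: if the number of columns of a row equals two, detect the number of columns of the section preceding the row, the column dividing line of the preceding section, and the column dividing line of the section containing the row.
Step S704: if the number of columns of the preceding section also equals two and the column dividing line of the preceding section does not coincide with that of the section containing the row, set the row to single-column.
It will be understood that, in steps S702 to S704, consecutive double-column rows should normally share the same dividing line, and the rows of one section are either all single-column or all double-column. Therefore, if a row has two columns but the dividing line of the preceding section does not coincide with that of the section containing the row, the row is not actually arranged in columns and is treated as single-column. This quickly settles rows whose column count equals two but whose dividing line does not coincide with that of the preceding section.
Through the above embodiment, the present application can determine the columns of a row whose column count equals two.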
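The rules of steps S604, S700, and S702 to S704 can be combined into one decision function, sketched below in Python. The function name, its parameters, and the tolerance `tol` used to decide whether two dividing lines coincide are hypothetical. The source presents S700 and S702 to S704 as alternatives; chaining them in this order is an assumption of the sketch.

```python
def resolve_multi_column_row(count, col_widths, divider_x,
                             prev_count, prev_divider_x,
                             min_col_width, tol=1.0):
    """Return the final column count for a row given its raw column count.

    count          raw column count from the gap test
    col_widths     widths of the row's columns
    divider_x      x position of this row's column dividing line (or None)
    prev_count     column count of the preceding section
    prev_divider_x dividing line of the preceding section (or None)
    min_col_width  stands in for the second preset value
    """
    if count > 2:
        return 1  # S604: more than two apparent columns means no real columns
    if count == 2:
        if min(col_widths) < min_col_width:
            return 1  # S700: one column is too narrow
        if (prev_count == 2 and prev_divider_x is not None
                and abs(divider_x - prev_divider_x) > tol):
            return 1  # S704: dividing lines do not coincide
    return count


# A two-column row whose divider matches the preceding section stays double.
print(resolve_multi_column_row(2, [200, 200], 300, 2, 300, min_col_width=50))
```

A raw count of one passes through unchanged; that case is handled by the separate rules of steps S800 to S808.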
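Referring to FIG. 8, a flowchart of the sub-steps of step S107 provided by the fourth embodiment of the present application is shown. Step S107 may further include steps S800 to S808.
Step S800: if the number of columns of a row equals one and the number of columns of the preceding section is two, determine whether the text blocks of the row lie entirely within the left column of the preceding section. Within a section, columns are arranged from left to right; the column on the left is the left column and the column on the right is the right column.
Step S802: when the text blocks of the row lie entirely within the left column of the preceding section, set the row to double-column.
In this embodiment, if the number of columns of a row equals one and that of the preceding section is two, but the text blocks of the row do not lie entirely within the left column of the preceding section, the columns of the row can also be determined through the following steps.
Step S804: if the number of columns of a row equals one and the number of columns of the preceding section is two, detect the height of the preceding section.
Step S806: determine whether the height of the preceding section is less than a third preset value.
Step S808: when the height of the preceding section is less than the third preset value, set the row to single-column and adjust the columns of the preceding section to single-column.
Through this embodiment, the present application can determine the columns of a row whose column count equals one.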
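A possible reading of steps S800 to S808 is sketched below in Python. The dictionary layout of `prev_section`, the function name, and `min_section_height` (standing in for the third preset value) are assumptions. The source does not say what happens when neither condition holds; keeping the row single-column and the preceding section double-column in that case is a choice made only for this sketch.

```python
def resolve_single_column_row(row_left, row_right, prev_section,
                              min_section_height):
    """Return (columns for this row, columns for the preceding section).

    prev_section is a dict with keys:
      'columns'  column count of the preceding section
      'left_col' (left, right) span of its left column
      'height'   height of the preceding section
    """
    if prev_section["columns"] != 2:
        return 1, prev_section["columns"]
    col_left, col_right = prev_section["left_col"]
    if col_left <= row_left and row_right <= col_right:
        return 2, 2  # S802: row sits entirely inside the left column
    if prev_section["height"] < min_section_height:
        return 1, 1  # S808: short preceding section is demoted to single
    return 1, 2      # assumption: otherwise leave both as they are


prev = {"columns": 2, "left_col": (50, 280), "height": 120}
print(resolve_single_column_row(60, 200, prev, min_section_height=40))
```

Returning both values makes the side effect of step S808 explicit: the decision about one row can change the column setting of the section before it.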
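It will be understood that, in the present application, both section division and the single/double-column decision proceed row by row and are based on the number of columns of the elements in each row.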
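Referring to FIG. 9, a flowchart of the sub-steps of step S105 provided by the first embodiment of the present application is shown. Step S105 includes steps S900 to S908, which implement the construction of a shape block. This embodiment constructs a visible table shape block, that is, the region of a table whose border lines are displayed in the PDF document.
Step S900: detect whether one or more groups of intersecting border lines exist in each preset page, each border line corresponding to one element.
For example, a group of intersecting border lines includes at least two intersecting border lines.
Step S902: if multiple groups of intersecting border lines exist, determine the region covered by those groups as a potential explicit table region, obtaining the area information of a shape block.
Step S904: determine the table structure of the potential explicit table region from the one or more groups of intersecting border lines, obtaining one or more cells.
For example, imaginary horizontal and vertical lines can be used to compute whether they intersect the vertical or horizontal border lines. If there is no intersection point, a merged cell exists in the horizontal or vertical direction; the table structure of the explicit table region is obtained in this way.
Step S906: take the region corresponding to each cell as the area information of that cell.
Step S908: obtain the corresponding visible table shape block from the area information of the shape block, the area information of all cells, and the elements corresponding to all border lines.
In this embodiment, by identifying the visible table region, the text blocks in that region are set in table format, which makes editing in the editable second document more convenient and meets layout requirements.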
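The detection of a potential explicit table region can be illustrated with the Python sketch below. It is deliberately simplified and rests on several assumptions not stated in the source: border lines are axis-aligned, a horizontal line is a tuple `(x1, x2, y)` and a vertical line a tuple `(x, y1, y2)`, all intersecting lines on the page belong to one table, and merged cells are ignored when the grid is built.

```python
def intersects(h, v, tol=0.5):
    """True if horizontal line h = (x1, x2, y) crosses vertical v = (x, y1, y2)."""
    x1, x2, y = h
    x, y1, y2 = v
    return x1 - tol <= x <= x2 + tol and y1 - tol <= y <= y2 + tol


def table_region(h_lines, v_lines):
    """S900/S902: bounding box (left, top, right, bottom) of crossing lines."""
    hs = [h for h in h_lines if any(intersects(h, v) for v in v_lines)]
    vs = [v for v in v_lines if any(intersects(h, v) for h in h_lines)]
    if not hs or not vs:
        return None  # no intersecting border lines, so no explicit table
    return (min(h[0] for h in hs), min(v[1] for v in vs),
            max(h[1] for h in hs), max(v[2] for v in vs))


def grid_cells(h_lines, v_lines):
    """S904/S906: cell boxes of a regular grid (merged cells not handled)."""
    ys = sorted({h[2] for h in h_lines})
    xs = sorted({v[0] for v in v_lines})
    return [(x0, y0, x1, y1)
            for y0, y1 in zip(ys, ys[1:])
            for x0, x1 in zip(xs, xs[1:])]


# A 2 x 2 table: three horizontal and three vertical border lines.
h_lines = [(0, 100, 0), (0, 100, 20), (0, 100, 40)]
v_lines = [(0, 0, 40), (50, 0, 40), (100, 0, 40)]
print(table_region(h_lines, v_lines), len(grid_cells(h_lines, v_lines)))
```

Handling merged cells as described in the text would require checking, for each grid position, whether a border segment actually covers the shared edge; that check is omitted here.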
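Referring to FIG. 10, a flowchart of the sub-steps of step S105 provided by the second embodiment of the present application is shown. Step S105 includes steps S1001 to S1009. This embodiment constructs an invisible table shape block, that is, a region of the PDF document in which no border lines are displayed but which needs a table to lay out the corresponding text blocks. For example, the content of some regions is not a table but can be moved as a whole during editing, such as text blocks without borders that are arranged like a table; in an editable document, a table has to be set up for these text blocks to meet layout needs.
Step S1001: determine a potential implicit table region from the positional relationships among all text blocks.
Step S1003: determine the imaginary border lines of the potential implicit table region, each imaginary border line being represented by an imaginary element that includes a position and format content.
Step S1005: determine the table structure of the potential implicit table region from the imaginary border lines, obtaining one or more cells. The implementation is the same as in step S904.
Step S1007: take the region corresponding to each cell as the area information of that cell.
Step S1009: obtain the corresponding invisible table shape block from the area information of the shape block, the area information of all cells, and the imaginary elements corresponding to all imaginary border lines.
In this embodiment, by identifying the invisible table region and adding imaginary border lines, the content of the text blocks in the region is set in table format, which makes editing in the editable second document more convenient and meets layout requirements.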
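The source does not specify how the imaginary border lines of step S1003 are positioned. One plausible approach, shown here purely as an assumption, is to merge the horizontal and vertical extents of the text blocks into bands and place a line midway between adjacent bands, plus lines on the outer edges. The function name and the `(left, top, right, bottom)` box format are hypothetical, and the input is assumed to be a non-empty list of blocks already judged to form a grid.

```python
def imaginary_borders(blocks):
    """Return (xs, ys): positions of vertical and horizontal imaginary lines.

    blocks is a non-empty list of (left, top, right, bottom) text-block boxes.
    """
    def cuts(intervals):
        intervals = sorted(set(intervals))
        merged = [list(intervals[0])]
        for lo, hi in intervals[1:]:
            if lo <= merged[-1][1]:
                # Overlapping extents belong to the same column or row band.
                merged[-1][1] = max(merged[-1][1], hi)
            else:
                merged.append([lo, hi])
        # One line midway between adjacent bands, plus the two outer edges.
        inner = [(a[1] + b[0]) / 2 for a, b in zip(merged, merged[1:])]
        return [merged[0][0]] + inner + [merged[-1][1]]

    xs = cuts([(b[0], b[2]) for b in blocks])
    ys = cuts([(b[1], b[3]) for b in blocks])
    return xs, ys


# Four borderless text blocks arranged as a 2 x 2 grid.
blocks = [(0, 0, 40, 10), (60, 0, 100, 10),
          (0, 20, 40, 30), (60, 20, 100, 30)]
print(imaginary_borders(blocks))
```

The resulting line positions can then be treated like real border lines when the cell structure is derived, in the same way as for a visible table.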
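Referring to FIG. 11, a functional module diagram of the document conversion apparatus 100 is shown. The document conversion apparatus 100 is configured to convert a non-editable first document into an editable second document. The document conversion apparatus 100 includes a parsing module 101, a mapping module 103, a construction module 105, a layout module 107, and a generation module 109.
The parsing module 101 is configured to parse the first document page by page to obtain all elements of each page of the first document, each element having a position and content. For the implementation of the parsing module 101, refer to the description of step S101 above.
The mapping module 103 is configured to map all elements of each page to a corresponding preset page, so that each preset page contains all elements of the corresponding page of the first document. For the mapping module 103, refer to the description of step S103 above.
The construction module 105 is configured to construct, according to the positions and contents of all elements in each preset page, at least one of the following: at least one text block and at least one shape block. For the construction module 105, refer to the description of step S105 above and its sub-steps.
The layout module 107 is configured to determine, according to a preset layout rule, the sections and columns of at least one of each text block and each shape block in each preset page, to obtain the layout of all elements of each page in the corresponding preset page. For the layout module 107, refer to the description of step S107 above and its sub-steps.
The generation module 109 is configured to generate the second document, which is an editable document, from each preset page in which all elements have been laid out; the element layout of each page of the second document is the same as that of the corresponding preset page. For the generation module 109, refer to the description of step S109 above.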
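The five modules form a linear pipeline, which the following Python sketch makes explicit. The class name `DocumentConverter`, its attribute names, and the idea of passing each stage in as a callable are illustrative assumptions; the stand-in lambdas in the usage lines do no real PDF or Word processing.

```python
from typing import Any, Callable, List


class DocumentConverter:
    """Chains five stages in the order S101, S103, S105, S107, S109."""

    def __init__(self,
                 parse: Callable[[Any], List[Any]],
                 map_page: Callable[[Any], Any],
                 build_blocks: Callable[[Any], Any],
                 layout: Callable[[Any], Any],
                 generate: Callable[[List[Any]], Any]):
        self.parse = parse                # parsing module 101
        self.map_page = map_page          # mapping module 103
        self.build_blocks = build_blocks  # construction module 105
        self.layout = layout              # layout module 107
        self.generate = generate          # generation module 109

    def convert(self, first_document: Any) -> Any:
        pages = self.parse(first_document)
        preset_pages = [self.map_page(p) for p in pages]
        blocks = [self.build_blocks(p) for p in preset_pages]
        laid_out = [self.layout(b) for b in blocks]
        return self.generate(laid_out)


# Toy stages that only demonstrate the data flow between modules.
converter = DocumentConverter(
    parse=lambda doc: doc.split("|"),
    map_page=str.strip,
    build_blocks=lambda page: [page],
    layout=lambda blocks: {"section": 0, "blocks": blocks},
    generate=lambda pages: pages,
)
print(converter.convert("page one | page two"))
```

Keeping the stages as separate callables mirrors the module split in FIG. 11 and lets each stage be replaced or tested on its own.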
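Referring to FIG. 14, a schematic diagram of the internal structure of the computer device provided by an embodiment of the present application is shown. The computer device 10 includes a memory 11 and a processor 12. The memory 11 is configured to store program instructions, and the processor 12 is configured to execute the program instructions to implement the above document conversion method.
In some embodiments, the processor 12 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip, configured to run the program instructions stored in the memory 11.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the memory 11 may be an internal storage unit of the computer device, such as a hard disk of the computer device. In other embodiments, the memory 11 may be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card equipped on the computer device. The memory 11 may also include both an internal storage unit and an external storage device of the computer device. The memory 11 may be configured not only to store application software installed on the computer device and various types of data, such as code implementing the document conversion method, but also to temporarily store data that has been output or is to be output.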

Claims (12)

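  1. A document conversion method for converting a non-editable first document into an editable second document, the document conversion method comprising:
    parsing the first document page by page to obtain all elements of each page of the first document, each element having a position and content;
    mapping all elements of each page to a corresponding preset page, so that each preset page contains all elements of the corresponding page of the first document;
    constructing, according to the positions and contents of all elements in each preset page, at least one of the following: at least one text block and at least one shape block;
    determining, according to a preset layout rule, the sections and columns of at least one of each text block and each shape block in each preset page, to obtain the layout of all elements of each page in the corresponding preset page; and
    generating the second document from each preset page in which all elements have been laid out, the element layout of each page of the second document being the same as that of the corresponding preset page.
  2. The document conversion method according to claim 1, wherein determining, according to the preset layout rule, the sections and columns of at least one of each text block and each shape block in each preset page, to obtain the layout of all elements of each page in the corresponding preset page, comprises:
    determining, according to the preset layout rule, the sections of at least one of each text block and each shape block in each preset page; and
    determining, according to the preset layout rule, the columns of at least one of each text block and each shape block in each preset page.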
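  3. The document conversion method according to claim 2, wherein the text block comprises one or more rows, and each row is single-column or double-column; and determining, according to the preset layout rule, the columns of at least one of each text block and each shape block in each preset page comprises:
    computing, row by row, the gaps between all text blocks in each row;
    determining the number of columns of each row from the gaps between all text blocks, wherein two text blocks are determined to be in two different columns when the gap between them is greater than a first preset value, and in the same column when the gap between them is less than or equal to the first preset value; and
    in response to the number of columns of a row being greater than two, setting the row to single-column.
  4. The document conversion method according to claim 3, wherein determining, according to the preset layout rule, the columns of at least one of each text block and each shape block in each preset page further comprises:
    in response to the number of columns of a row being equal to two and one of the columns of the row having a column width smaller than a second preset value, setting the row to single-column, the column width being the width of all text blocks in the same column; or
    in response to the number of columns of a row being equal to two, detecting the number of columns of the section preceding the row, the column dividing line of the preceding section, and the column dividing line of the section containing the row; and
    in response to the number of columns of the preceding section being equal to two and the column dividing line of the preceding section not coinciding with the column dividing line of the section containing the row, setting the row to single-column.
  5. The document conversion method according to claim 4, wherein determining, according to the preset layout rule, the columns of at least one of each text block and each shape block in each preset page further comprises:
    in response to the number of columns of a row being equal to one and the number of columns of the section preceding the row being two, determining whether the text blocks of the row lie entirely within the left column of the preceding section, wherein, within a section, columns are arranged from left to right, the column on the left being the left column and the column on the right being the right column;
    when the text blocks of the row lie entirely within the left column of the preceding section, setting the row to double-column; or
    in response to the number of columns of a row being equal to one and the number of columns of the section preceding the row being two, detecting the height of the preceding section;
    determining whether the height of the preceding section is less than a third preset value; and
    in response to determining that the height of the preceding section is less than the third preset value, adjusting the columns of the preceding section to single-column.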
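  6. The document conversion method according to claim 1, wherein the shape block comprises a table shape block.
  7. The document conversion method according to claim 6, wherein constructing at least one shape block according to the positions and contents of all elements in each preset page comprises:
    detecting whether one or more groups of intersecting border lines exist in each preset page, each border line corresponding to one element, and a group of intersecting border lines comprising at least two intersecting border lines;
    in response to multiple groups of intersecting border lines existing, determining the region covered by the multiple groups of intersecting border lines as a potential explicit table region, to obtain the area information of a shape block;
    determining the table structure of the potential explicit table region from the one or more groups of intersecting border lines, to obtain one or more cells;
    taking the region corresponding to each cell as the area information of that cell; and
    obtaining the corresponding visible table shape block from the area information of the shape block, the area information of all cells, and the elements corresponding to all border lines.
  8. The document conversion method according to claim 7, wherein constructing at least one shape block according to the positions and contents of all elements in each preset page further comprises:
    determining a potential implicit table region from the positional relationships among all text blocks;
    determining the imaginary border lines of the potential implicit table region, each imaginary border line being represented by an imaginary element that comprises a position and format content;
    determining the table structure of the potential implicit table region from the imaginary border lines, to obtain one or more cells;
    taking the region corresponding to each cell as the area information of that cell; and
    obtaining the corresponding invisible table shape block from the area information of the shape block, the area information of all cells, and the imaginary elements corresponding to all imaginary border lines.
  9. The document conversion method according to claim 3, wherein determining, according to the preset layout rule, the sections of at least one of each text block and each shape block in each preset page comprises:
    checking the number of columns of each row, row by row;
    in response to the number of columns of a row differing from that of the preceding row, placing the row and the preceding row in different sections; and
    in response to the number of columns of a row being the same as that of the preceding row, placing the row and the preceding row in the same section.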
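  10. A computer device for implementing document conversion, comprising:
    a memory configured to store program instructions; and
    a processor configured to execute the program instructions to implement the document conversion method according to any one of claims 1 to 9.
  11. A document conversion apparatus configured to convert a non-editable first document into an editable second document, the document conversion apparatus comprising:
    a parsing module configured to parse the first document page by page to obtain all elements of each page of the first document, each element having a position and content;
    a mapping module configured to map all elements of each page to a corresponding preset page, so that each preset page contains all elements of the corresponding page of the first document;
    a construction module configured to construct, according to the positions and contents of all elements in each preset page, at least one of the following: at least one text block and at least one shape block;
    a layout module configured to determine, according to a preset layout rule, the sections and columns of at least one of each text block and each shape block in each preset page, to obtain the layout of all elements of each page in the corresponding preset page; and
    a generation module configured to generate the second document from each preset page in which all elements have been laid out, the element layout of each page of the second document being the same as that of the corresponding preset page.
  12. A computer-readable storage medium for storing computer program instructions, the computer program instructions being executed by a processor to implement the document conversion method according to any one of claims 1 to 9.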
PCT/CN2023/091535 2022-10-28 2023-04-28 Document conversion method and apparatus, computer-readable storage medium, and computer device WO2024087566A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211332538.3A CN115510821A (zh) 2022-10-28 2022-10-28 Document conversion method and apparatus, computer-readable storage medium, and computer device
CN202211332538.3 2022-10-28

Publications (1)

Publication Number Publication Date
WO2024087566A1 true WO2024087566A1 (zh) 2024-05-02

Family

ID=84511518

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/091535 WO2024087566A1 (zh) 2022-10-28 2023-04-28 Document conversion method and apparatus, computer-readable storage medium, and computer device

Country Status (2)

Country Link
CN (1) CN115510821A (zh)
WO (1) WO2024087566A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115510821A (zh) * 2022-10-28 2022-12-23 深圳市网旭科技有限公司 Document conversion method and apparatus, computer-readable storage medium, and computer device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
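CN109582934A (zh) * 2018-12-04 2019-04-05 万兴科技股份有限公司 Method and apparatus for converting fixed-layout documents
CN113361257A (zh) * 2021-06-29 2021-09-07 深圳壹账通智能科技有限公司 PDF document parsing method, system, electronic device, and storage medium
CN115114481A (zh) * 2022-06-09 2022-09-27 抖音视界有限公司 Document format conversion method, apparatus, storage medium, and device
CN115510821A (zh) * 2022-10-28 2022-12-23 深圳市网旭科技有限公司 Document conversion method and apparatus, computer-readable storage medium, and computer device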

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
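CN109582934A (zh) * 2018-12-04 2019-04-05 万兴科技股份有限公司 Method and apparatus for converting fixed-layout documents
CN113361257A (zh) * 2021-06-29 2021-09-07 深圳壹账通智能科技有限公司 PDF document parsing method, system, electronic device, and storage medium
CN115114481A (zh) * 2022-06-09 2022-09-27 抖音视界有限公司 Document format conversion method, apparatus, storage medium, and device
CN115510821A (zh) * 2022-10-28 2022-12-23 深圳市网旭科技有限公司 Document conversion method and apparatus, computer-readable storage medium, and computer device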

Also Published As

Publication number Publication date
CN115510821A (zh) 2022-12-23

Similar Documents

Publication Publication Date Title
US8321783B2 (en) Visualizing content positioning within a document using layers
CN101308488B (zh) Method and apparatus for processing document flow information based on fixed-layout files
JP4332477B2 (ja) Layout adjustment method, apparatus, and program
US7337393B2 (en) Methods and systems for providing an editable visual formatting model
JP5113909B2 (ja) Placement of graphics objects on a page with control based on relative position
US9043698B2 (en) Method for users to create and edit web page layouts
US20060224952A1 (en) Adaptive layout templates for generating electronic documents with variable content
US7188311B2 (en) Document processing method and apparatus, and print control method and apparatus
US20030070146A1 (en) Information processing apparatus and method
WO2024087566A1 (zh) Document conversion method and apparatus, computer-readable storage medium, and computer device
EP2544099A1 (en) Method for creating an enrichment file associated with a page of an electronic document
US20100131566A1 (en) Information processing method, information processing apparatus, and storage medium
JP5380040B2 (ja) Document processing apparatus
US20230153516A1 (en) Systems and methods for generating webpage data for rendering a design
JP5612557B2 (ja) Method, computer-readable medium, and system for determining table cell heights
US11714953B2 (en) Facilitating dynamic document layout by determining reading order using document content stream cues
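CN112417826B (zh) PDF online editing method and apparatus, electronic device, and readable storage medium
JP2009282969A (ja) System and program for electronically editing and changing the content of documents published in books, and book creation system
CN116702705A (zh) Sign-off technique and apparatus for readable files presenting mixed pages and data charts
JP2637679B2 (ja) Method for automatically changing text characteristics by rearranging word images
CN110457659B (zh) Clause document generation method and terminal device
JP2004326567A (ja) Table content creation support system, method, and program
JP3497974B2 (ja) System and method for higher-dimensional display of text document data, and recording medium recording the method
CN116451677A (zh) PDF document parsing method, apparatus, device, and storage medium
CN113505566A (zh) Method and apparatus for processing a fixed-layout document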