CN116882365A - Method and system for converting HTML (hypertext markup language) file into Word file - Google Patents

Publication number
CN116882365A
Authority
CN
China
Prior art keywords
width
sub
list
height
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310756107.8A
Other languages
Chinese (zh)
Inventor
宋雨伦
李大中
谭晟中
杨瞩远
黄娟娟
胡刘杰
刘正勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China United Network Communications Group Co Ltd
Unicom Digital Technology Co Ltd
Original Assignee
China United Network Communications Group Co Ltd
Unicom Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China United Network Communications Group Co Ltd, Unicom Digital Technology Co Ltd filed Critical China United Network Communications Group Co Ltd
Priority to CN202310756107.8A priority Critical patent/CN116882365A/en
Publication of CN116882365A publication Critical patent/CN116882365A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G06F40/154 Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a method, a system, a device and a storage medium for converting an HTML file into a Word file. The HTML file is parsed to obtain a plurality of elements and an element object corresponding to each element, wherein each element object comprises an attribute list, a sub-element list and a width-height information list. Because the nested structure and the CSS attributes of the HTML file are represented by the sub-element list and the attribute list, elements in the HTML file are not missed. The sub-element list of the element object is traversed; the height and width of each sub-element are calculated by a preset recursive algorithm according to the mapping relations among the attribute list, the sub-element list and the width-height information list; the heights and widths of the sub-elements are merged into the target height and width of the element; and a Word file is generated based on the elements and their target heights and widths. Because the height and width of each element object are obtained by calculating and merging those of its sub-elements, the method has a low error rate and is suitable for converting complex HTML files.

Description

Method and system for converting HTML (hypertext markup language) file into Word file
Technical Field
The present application relates to communications technologies, and in particular, to a method, a system, an apparatus, and a storage medium for converting an HTML file into a Word file.
Background
A file in Hypertext Markup Language (HTML) format is a file viewed in a World Wide Web browser. HTML is a markup language that uses markup instructions to display text, graphics, animation, sound, forms, links, images and the like, and it has played an important role in the rapid development of Web browsers. However, as web applications, and e-commerce applications in particular, have become widespread, the shortcomings of HTML have become apparent: HTML does not allow application developers to define custom tags for specific application environments and can only be used for displaying information. HTML can set how text and pictures are displayed but has no semantic structure; that is, HTML presents data by layout rather than by semantics. As network applications develop, different industries have different requirements for information, and these different types of information are not necessarily displayed as web pages, so converting HTML files into Word files is a necessary step in many kinds of work.
At present, the mainstream schemes for converting HTML files into Word files rely on open-source or commercial third-party libraries such as Aspose and OpenXML. These tools can convert simple HTML files into Word files while retaining the original styles and formats.
However, these tools handle complex HTML files poorly. Their algorithms and data structures are relatively simple and each HTML tag must be processed independently, so conversion is slow. In addition, Word and HTML differ considerably, particularly with respect to the nested structures and custom CSS styles found in complex HTML files, so existing tools have a high conversion error rate and poor flexibility.
Disclosure of Invention
The application provides a method, a system, equipment and a storage medium for converting an HTML file into a Word file, which are used for solving the problems of high conversion error rate and poor flexibility in the prior art.
In a first aspect, the present application provides a method for converting an HTML file into a Word file, including:
parsing the HTML file to obtain a plurality of elements and an element object corresponding to each element, wherein each element object comprises an attribute list, a sub-element list and a width-height information list, and mapping relations exist among the attribute list, the sub-element list and the width-height information list;
traversing the sub-element list of the element object, calculating the height and width of each sub-element by a preset recursive algorithm according to the mapping relations among the attribute list, the sub-element list and the width-height information list, and merging the heights and widths of the sub-elements into the target height and width of the element;
and generating a Word file based on the elements and the target heights and widths.
In a possible implementation manner, parsing the HTML file to obtain a plurality of elements and an element object corresponding to each element includes:
parsing the HTML file to obtain an element set, wherein the element set comprises a plurality of elements and the element tag corresponding to each element;
and performing secondary parsing on the elements based on the element tags to obtain the element object corresponding to each element.
In a possible implementation manner, the element tags indicate a table element, the table element includes a plurality of cells, and performing secondary parsing on the elements based on the element tags to obtain the element object corresponding to each element includes:
taking the cells as sub-elements of the table element to generate the sub-element list;
acquiring the cascading style sheet (CSS) attributes of the cells and of the table element to generate the attribute list;
and acquiring the height and width information of the cells and of the table element to generate the width-height information list.
In one possible implementation manner, calculating the height and width of each sub-element by a preset recursive algorithm according to the mapping relations among the attribute list, the sub-element list and the width-height information list includes:
acquiring the CSS attributes corresponding to the sub-element according to the mapping relation between the attribute list and the sub-element list, and calculating the frame height and width of the sub-element according to the CSS attributes;
acquiring the standard height and width of the sub-element according to the mapping relation between the sub-element list and the width-height information list;
and adding the frame height and width to the standard height and width to obtain the height and width of the sub-element.
In one possible implementation manner, the sub-element list includes the sub-elements and a sub-element arrangement rule, and merging the heights and widths of the sub-elements into the target height and width of the element includes:
accumulating, based on the sub-element arrangement rule, the height values of the sub-elements in each column to obtain a height value set, and accumulating the width values of the sub-elements in each row to obtain a width value set;
selecting the maximum height value from the height value set and the maximum width value from the width value set as the expected height value and the expected width value of the element object, respectively;
and adjusting the expected height value and the expected width value based on a preset limit value to obtain the target height and width of the element.
In one possible implementation manner, adjusting the expected height value and the expected width value based on a preset limit value to obtain the target height and width of the element includes:
judging whether the expected width value exceeds the preset limit value;
if so, reducing the expected width value and the expected height value in equal proportion until the expected width value equals the preset limit value, and taking the reduced expected width value and expected height value as the target height and width of the element;
and if not, taking the expected height value and the expected width value as the target height and width of the element.
In a possible implementation manner, before traversing the sub-element list of the element object, the method further includes:
judging whether the sub-element list is empty;
if so, acquiring the frame height and width and the standard height and width of the element object based on the mapping relation between the attribute list and the width-height information list, and adding the frame height and width to the standard height and width to obtain the target height and width of the element.
In a second aspect, the present application provides an apparatus for converting an HTML file into a Word file, including:
a parsing module, configured to parse the HTML file to obtain a plurality of elements and an element object corresponding to each element, wherein each element object comprises an attribute list, a sub-element list and a width-height information list, and mapping relations exist among the attribute list, the sub-element list and the width-height information list;
a height-width calculation module, configured to traverse the sub-element list of the element object, calculate the height and width of each sub-element by a preset recursive algorithm according to the mapping relations among the attribute list, the sub-element list and the width-height information list, and merge the heights and widths of the sub-elements into the target height and width of the element;
and a file generation module, configured to generate a Word file based on the elements and the target heights and widths.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and computer-executable instructions stored in the memory and executable on the processor, the processor implementing the method according to any implementation of the first aspect when executing the computer-executable instructions.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method according to any implementation of the first aspect.
According to the method, system, device and storage medium for converting an HTML file into a Word file provided by the application, the HTML file is parsed to obtain a plurality of elements and an element object corresponding to each element. Each element object comprises an attribute list, a sub-element list and a width-height information list, and mapping relations exist among the three lists. Because the nested structure and the CSS attributes of the HTML file are represented by the sub-element list and the attribute list, elements in the HTML file are not missed. The sub-element list of the element object is traversed; the height and width of each sub-element are calculated by a preset recursive algorithm according to the mapping relations among the attribute list, the sub-element list and the width-height information list; the heights and widths of the sub-elements are merged into the target height and width of the element; and a Word file is generated based on the elements and their target heights and widths. Because the height and width of each element object are obtained by calculating and merging those of its sub-elements, the method has a low error rate and is suitable for converting complex HTML files.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a flowchart of a method for converting an HTML file into a Word file according to an embodiment of the present application.
Fig. 2 is a flowchart of a method for calculating the height and width of each sub-element by using a preset recursive algorithm according to an embodiment of the present application.
Fig. 3 is a flowchart of a method for merging the heights and widths of sub-elements into the target height and width of the element object according to an embodiment of the present application.
Fig. 4 is a schematic diagram of an apparatus for converting an HTML file into a Word file according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of an electronic device based on a device for converting an HTML file into a Word file according to an embodiment of the present application.
Specific embodiments of the present application have been shown by way of the above drawings and will be described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but rather to illustrate the inventive concepts to those skilled in the art by reference to the specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
First, the terms involved in the present application will be explained:
In the related art, there are two main methods for converting an HTML file into a Word file. The first uses the import function of Word: select "Open" in the Word menu bar, select "All Files" or "Web Page" as the file type, and open the HTML file to be converted. Word converts the HTML file into a Word file and retains most of the original formatting and layout, but if the HTML file contains online pictures or links, the links or pictures may fail. The second uses conversion tools such as Aspose and OpenXML. These tools can process simple HTML files, but for HTML files with special formats, such as elements with various CSS attributes or complex nested structures, they cannot recognize or accurately parse the content, so conversion quality and speed are limited.
Aiming at the technical problems, the embodiment of the application aims to provide a method, a system, equipment and a storage medium for converting an HTML file into a Word file.
The following describes the technical scheme of the present application and how the technical scheme of the present application solves the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for converting an HTML file into a Word file according to an embodiment of the present application. As shown in fig. 1, the method of the present embodiment includes:
S101: parsing the HTML file to obtain a plurality of elements and an element object corresponding to each element, wherein each element object comprises an attribute list, a sub-element list and a width-height information list, and mapping relations exist among the attribute list, the sub-element list and the width-height information list.
The execution main body of the embodiment of the application can be a server or an HTML file conversion system in the server, wherein the HTML file conversion system can be realized by software.
It is understood that an element consists of all the code in the HTML language from a start tag to an end tag, and the content of an element is the content between the start tag and the end tag. Most elements may have attributes and may be nested.
In this step, element parsing is performed on the HTML file. It should be noted that, because HTML is error-tolerant by nature, HTML files cannot be parsed by a conventional top-down or bottom-up parser; instead, tokenization and tree-construction algorithms are required. Parsing the HTML file with the tokenization and tree-construction algorithms yields the element content, element attributes, nested structure information, and width and height information.
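The parsing step can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the application: it uses the tolerant tokenizer in Python's standard html.parser module, and the names ElementObject, TreeBuilder, parse_html and VOID_TAGS are hypothetical. Each element object carries an attribute list, a sub-element list and a width-height entry, and stray or missing end tags are tolerated rather than rejected.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

# Void elements have no end tag, so they must not stay on the open-element stack.
# (Partial list, sufficient for this sketch.)
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


@dataclass
class ElementObject:  # hypothetical name
    tag: str
    attributes: dict = field(default_factory=dict)  # attribute list
    children: list = field(default_factory=list)    # sub-element list
    size: dict = field(default_factory=dict)        # width-height information
    text: str = ""


class TreeBuilder(HTMLParser):
    """Builds a tree of ElementObject from the token stream."""

    def __init__(self):
        super().__init__()
        self.root = ElementObject("#document")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        node = ElementObject(tag, attributes)
        for key in ("width", "height"):
            if key in attributes:
                node.size[key] = attributes[key]
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Tolerant recovery: close up to the nearest matching open element
        # and silently ignore end tags that match nothing.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data


def parse_html(source: str) -> ElementObject:
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

A full HTML parser applies many more recovery rules than this sketch; it is meant only to show how a token stream becomes nested element objects.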
S102: traversing the sub-element list of the element object, calculating the height and width of each sub-element by a preset recursive algorithm according to the mapping relations among the attribute list, the sub-element list and the width-height information list, and merging the heights and widths of the sub-elements into the target height and width of the element.
In this embodiment, the nested structure in the HTML file is parsed to obtain the sub-element list of the element object. That is, the hierarchy of the elements in the HTML file is divided according to the containment or sibling relationships between the element tags: an element that contains another element is the parent element of the contained element, and conversely the contained element is a child element of the containing element. Therefore, the height and width of each sub-element are calculated first, and then the heights and widths of the sub-elements are merged according to their arrangement to obtain the target height and width of the element.
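The order of computation described here, children first and parent afterwards, can be sketched as a short recursion. The sketch rests on a simplifying assumption that is not taken from the application: sub-elements are stacked from top to bottom, so widths are merged by maximum and heights by sum. The names Node and target_size are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:  # hypothetical element object reduced to what the recursion needs
    own_width: float = 0.0
    own_height: float = 0.0
    children: List["Node"] = field(default_factory=list)


def target_size(node: Node) -> Tuple[float, float]:
    """Return (width, height) of a node, computing its sub-elements first."""
    if not node.children:
        # Base case: an empty sub-element list, so the element's own size is used.
        return node.own_width, node.own_height
    sizes = [target_size(child) for child in node.children]
    # Assumed arrangement: children stacked top to bottom.
    width = max(w for w, _ in sizes)
    height = sum(h for _, h in sizes)
    return width, height
```

For example, a parent holding a 30 by 10 leaf and a nested element with two leaves of 50 by 5 and 20 by 5 resolves the nested element to 50 by 10 first, and the parent to 50 by 20.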
S103: generating a Word file based on the elements and the target heights and widths.
It will be appreciated that converting an HTML file into a Word file starts from presenting the content of the elements in the document in a prescribed format.
In this step, each element is arranged and displayed according to its target height and width, which ensures that the content and layout of the generated Word file correspond to the original HTML file and avoids omissions or misplaced document content caused by incorrect heights and widths.
According to the method for converting an HTML file into a Word file provided by this embodiment, the HTML file is parsed to obtain a plurality of elements and an element object corresponding to each element, wherein each element object comprises an attribute list, a sub-element list and a width-height information list, and mapping relations exist among the three lists; the sub-element list of the element object is traversed, the height and width of each sub-element are calculated by a preset recursive algorithm according to the mapping relations, and the heights and widths of the sub-elements are merged into the target height and width of the element object; and a Word document is generated based on the elements and the target heights and widths. Because the nested structure and the CSS attributes of the HTML file are represented by the sub-element list and the attribute list, and the height and width of each element object are obtained by calculating and merging those of its sub-elements, the method has a low error rate and is suitable for converting complex HTML files.
The technical scheme of the method for converting the HTML file into the Word file is described in detail below.
In a possible implementation manner, the method for converting an HTML file into a Word file provided in this embodiment first parses the HTML to obtain element tags, and then performs secondary parsing on the elements according to the element tags to obtain the element object corresponding to each element.
Specifically, parsing the HTML file to obtain a plurality of elements and an element object corresponding to each element includes: parsing the HTML file to obtain an element set, wherein the element set comprises a plurality of elements and the element tag corresponding to each element; and performing secondary parsing on the elements based on the element tags to obtain the element object corresponding to each element.
It will be appreciated that HTML includes a series of element tags that unify document formats on the network and connect distributed network resources into a logical whole. For example, common element tags include <p> for a paragraph, <em> for emphasized (typically italic) text, and <img> for an image. Each element has an associated tag; parsing the HTML file yields the elements and the element tag corresponding to each element.
In this embodiment, secondary parsing of an element means obtaining the attribute information, sub-element information and height and width information of the element according to its element tag. Illustratively, for an element whose element tag is <object>, the tag indicates that the element has an embedded object, so sub-element information of the element can be obtained. From another perspective, tags are related by containment or as siblings. If one tag contains another, one of the two elements is the parent element and the other is the child element. For example, the html element defines the entire HTML document and has a start tag <html> and an end tag </html>. The content of the html element is another element, the body element, whose tag is <body>. The body element defines the body of the HTML document; html is the parent element of body, and body is a child element of html.
In this embodiment, CSS is the style used for rendering HTML elements; for example, a <style> element in the <head> region of the HTML document contains CSS.
In this embodiment, an element tag may also include height and width information; for example, the image tag <img> may include the height value and width value of the image element.
In this embodiment, the HTML is first parsed to obtain the element tags, and the elements are then secondarily parsed according to the element tags to obtain the element object corresponding to each element. In this way the nested structure in the HTML file can be parsed accurately and information such as the attributes and sub-elements of each element can be obtained. The approach is widely applicable and effectively avoids incorrect parsing of width and height information caused by tag nesting.
In one possible implementation, the HTML file may contain a table element. For a table element, each cell is a sub-element with its own attribute information and height and width information. The element tags indicate a table element, and the table element includes a plurality of cells.
Specifically, performing secondary parsing on the elements based on the element tags to obtain the element object corresponding to each element includes: taking the cells as sub-elements of the table element to generate the sub-element list; acquiring the CSS attributes of the cells and of the table element to generate the attribute list; and acquiring the height and width information of the cells and of the table element to generate the width-height information list.
It will be appreciated that the tags of a table element include <table> for the table itself, <tr> for a table row, <th> for a header cell, and <td> for a data cell holding the table content. It should be noted that other attributes may be added to these tags to further modify the content.
In this step, each cell is taken as a sub-element; that is, the sub-element list of table cells can be generated from the <tr> rows. A corresponding attribute list is generated from the CSS attributes of each cell, and a corresponding width-height information list is generated from the height and width information of each cell.
According to this embodiment, the nested structure of table elements in the HTML file can be parsed accurately, and information such as the attributes and the height and width of each sub-element in the table element is obtained, which reduces the conversion error rate.
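The three lists for a table element can be sketched with the standard html.parser module. This is a minimal sketch under assumptions: the class name TableCollector is hypothetical, nested tables are not handled, and the mapping relation among the lists is modeled simply as a shared index, which is one plausible reading rather than the structure defined by the application.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):  # hypothetical name
    """Collects the cells of one non-nested table into three parallel lists."""

    def __init__(self):
        super().__init__()
        self.children = []     # sub-element list: (row, column, tag)
        self.attributes = []   # attribute list: raw style string per cell
        self.sizes = []        # width-height list: (width, height) per cell
        self.table_style = ""  # CSS of the table element itself
        self.table_size = (None, None)
        self._row = -1
        self._col = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.table_style = attrs.get("style", "")
            self.table_size = (attrs.get("width"), attrs.get("height"))
        elif tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            # Index i in all three lists refers to the same cell.
            self.children.append((self._row, self._col, tag))
            self.attributes.append(attrs.get("style", ""))
            self.sizes.append((attrs.get("width"), attrs.get("height")))
            self._col += 1
```

After feeding a table to the collector, children[i], attributes[i] and sizes[i] all describe the same cell, so the style and size of any sub-element can be looked up from its position in the sub-element list.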
In a possible implementation manner, the method for converting an HTML file into a Word file provided in this embodiment recursively calculates the heights and widths of the sub-elements in a nested tag structure. Fig. 2 is a flowchart of a method for calculating the height and width of each sub-element by using a preset recursive algorithm according to this embodiment. As shown in Fig. 2, calculating the height and width of each sub-element by a preset recursive algorithm according to the mapping relations among the attribute list, the sub-element list and the width-height information list includes:
S201: acquiring the CSS attributes corresponding to the sub-element according to the mapping relation between the attribute list and the sub-element list, and calculating the frame height and width of the sub-element according to the CSS attributes.
It can be understood that the tag of a sub-element may include CSS attributes such as a left margin and a right margin, and the frame height and width of the sub-element can be calculated from the CSS attributes corresponding to the sub-element.
S202: acquiring the standard height and width of the sub-element according to the mapping relation between the sub-element list and the width-height information list.
It will be appreciated that in HTML, the size of an element can be set with the height and width properties: it suffices to set the styles "height: length value" and "width: length value" on the element. The unit of the length value can be px, cm and the like, or "%", in which case the value is a percentage relative to the containing block-level object.
In this step, the tag of the sub-element may specify, for example, height = 2 cm and width = 3 cm, and this height and width information is placed into the width-height information list. Therefore, the standard height and width of the sub-element can be obtained from the width-height information list according to the mapping relation between the sub-element list and the width-height information list.
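Because length values may arrive in different units, a converter to one common unit is useful before any accumulation. The following sketch converts to points; it is an illustration, not part of the application. The name to_points is hypothetical, a unitless value is assumed to mean px, and the CSS convention of 96 px per inch (so 1 px = 0.75 pt) is used. A percentage is resolved only when the size of the containing block is supplied.

```python
import re
from typing import Optional

# Points per unit, using 72 pt per inch and the CSS convention of 96 px per inch.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "px": 0.75,
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
}
_LENGTH = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(px|pt|cm|mm|in|%)?\s*$", re.IGNORECASE)


def to_points(value: str, parent_points: Optional[float] = None) -> Optional[float]:
    """Convert a length such as '2cm', '96px' or '50%' to points.

    Returns None when the value cannot be resolved (for example 'auto',
    or a percentage without a known containing-block size).
    """
    match = _LENGTH.match(value)
    if match is None:
        return None
    number = float(match.group(1))
    unit = (match.group(2) or "px").lower()  # assumption: unitless means px
    if unit == "%":
        return None if parent_points is None else parent_points * number / 100.0
    return number * POINTS_PER_UNIT[unit]
```

With this helper, the height = 2 cm and width = 3 cm of the example above become roughly 56.7 pt and 85.0 pt, which can then be added to frame sizes expressed in the same unit.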
S203: adding the frame height and width to the standard height and width to obtain the height and width of the sub-element.
In this step, for each sub-element, the height value of its frame is added to its standard height value, and the width value of its frame is added to its standard width value, thereby obtaining the height and width of the sub-element.
In this embodiment, the height and width of each sub-element are calculated by a preset recursive algorithm according to the mapping relations among the attribute list, the sub-element list and the width-height information list, so that the nested structure in the HTML markup can be matched precisely. This improves conversion accuracy and efficiency, and the approach is general enough to process more complex HTML documents.
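Steps S201 to S203 can be sketched for a single sub-element as follows. The sketch assumes that all CSS values have already been converted to one numeric unit, and that the frame consists of margins and padding; the text above names only margins as examples, so the exact set of properties is an assumption. The names SubElement, frame_size and sub_element_size are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Assumed frame properties; only margins are named in the description.
HORIZONTAL = ("margin-left", "margin-right", "padding-left", "padding-right")
VERTICAL = ("margin-top", "margin-bottom", "padding-top", "padding-bottom")


@dataclass
class SubElement:  # hypothetical name
    name: str
    css: Dict[str, float] = field(default_factory=dict)  # entry from the attribute list
    width: float = 0.0   # standard width from the width-height information list
    height: float = 0.0  # standard height from the width-height information list


def frame_size(css: Dict[str, float]) -> Tuple[float, float]:
    """S201: frame (width, height) derived from the CSS attributes."""
    frame_width = sum(css.get(key, 0.0) for key in HORIZONTAL)
    frame_height = sum(css.get(key, 0.0) for key in VERTICAL)
    return frame_width, frame_height


def sub_element_size(element: SubElement) -> Tuple[float, float]:
    """S202 and S203: add the frame size to the standard size."""
    frame_width, frame_height = frame_size(element.css)
    return frame_width + element.width, frame_height + element.height
```

For a cell with a standard size of 30 by 20, left and right margins of 2 and 3, and a top padding of 1, the resulting size is 35 by 21.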
In a possible implementation manner, the sub-element list includes the sub-elements and a sub-element arrangement rule, and the method for converting an HTML file into a Word file provided in this embodiment merges the heights and widths of the sub-elements according to the sub-element arrangement rule to obtain the target height and width of the element. Fig. 3 is a flowchart of a method for merging the heights and widths of sub-elements into the target height and width of the element object according to this embodiment. As shown in Fig. 3, merging the heights and widths of the sub-elements into the target height and width of the element object based on the sub-element arrangement rule includes:
S301: accumulating, based on the sub-element arrangement rule, the height values of the sub-elements in each column to obtain a height value set, and accumulating the width values of the sub-elements in each row to obtain a width value set.
It will be appreciated that the sub-elements of each element are arranged in order, and that this order or arrangement can be determined from the tags of the sub-elements.
By way of example and not limitation, in a table element a first number of sub-elements are arranged horizontally and a second number of sub-elements are arranged vertically; the number of sub-elements in each row or column is not necessarily equal, nor are their height and width values. Thus, the height values of the sub-elements in each column are accumulated, and the width values of the sub-elements in each row are accumulated.
S302: Select the maximum height value from the height value set and the maximum width value from the width value set as the expected height value and the expected width value of the element object, respectively.
It will be appreciated that, in order to display all rows and all columns of sub-elements, the maximum values must be used as the expected values of the element. Otherwise, not enough space is reserved for the element, and the page layout changes when the file format is converted.
In this step, the largest accumulated height value over the columns of sub-elements is taken as the expected height value of the element, and the largest accumulated width value over the rows of sub-elements is taken as the expected width value of the element.
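Steps S301 and S302 can be illustrated with a minimal sketch. It assumes that the sub-element arrangement rule has already been resolved into a list of rows, each row being a list of (width, height) pairs; the function name `expected_size` and this input layout are assumptions made for illustration only. Rows may contain different numbers of sub-elements, so missing cells are treated as having zero size.

```python
from itertools import zip_longest
from typing import List, Sequence, Tuple

Size = Tuple[float, float]  # (width, height)


def expected_size(rows: Sequence[Sequence[Size]]) -> Size:
    """Return (expected width, expected height) for a grid of sub-elements."""
    # S301: accumulate the widths within each row -> width value set.
    width_set: List[float] = [sum(w for w, _ in row) for row in rows]
    # S301: accumulate the heights within each column -> height value set.
    # zip_longest pads short rows so that ragged grids are accepted.
    columns = zip_longest(*rows, fillvalue=(0, 0))
    height_set: List[float] = [sum(h for _, h in col) for col in columns]
    # S302: the maxima become the expected width and height.
    return (max(width_set, default=0), max(height_set, default=0))


if __name__ == "__main__":
    grid = [
        [(100, 20), (50, 30)],  # first row: two cells
        [(80, 25)],             # second row: one cell
    ]
    print(expected_size(grid))
```

In the usage example the row widths are 150 and 80 and the column heights are 45 and 30, so the expected size is 150 wide and 45 high.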
S303: Determine whether the expected width value exceeds a preset limit value.
It will be appreciated that the height and width attributes need not match the actual size of the element; the two values may be larger or smaller than the actual size, and the browser automatically scales the element to fit the reserved space. For example, when the element is an image, this makes it easy to create a thumbnail of a large image or to enlarge a very small image.
However, if the HTML file is converted into a Word file at the original size, the width of the element may not fit the Word page; and if the original width and height are not adjusted proportionally, the element may appear distorted.
Therefore, in this embodiment, the expected size of the element needs to be checked. Because the width is smaller than the height in the default Word page format, the expected width value is used as the criterion. In practical applications, the expected width value and/or the expected height value may be used as the criterion as needed; the application does not limit this.
S304: If so, reduce the expected width value and the expected height value in equal proportion until the expected width value equals the preset limit value, and take the reduced expected width value and expected height value as the target height and width of the element object.
In this step, the expected width value exceeds the preset limit value, so it must be reduced; the expected height value must be reduced in the same proportion so that the element is not distorted.
It will be appreciated that reducing the expected width and height values of an element in proportion also reduces the heights and widths of its sub-elements in the same proportion.
S305: If not, take the expected height value and the expected width value as the target height and width of the element object.
In this step, the expected width value does not exceed the preset limit value, which means that the expected width of the element meets the requirements of Word, and no reduction is needed.
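Steps S303 to S305 amount to a proportional fit against the limit value, which might be sketched as follows. The function name `fit_to_limit` and the example limit of 600 are illustrative assumptions; the source does not state a concrete limit value or unit.

```python
from typing import Tuple


def fit_to_limit(expected_width: float,
                 expected_height: float,
                 limit: float) -> Tuple[float, float]:
    """Return the target (width, height) for the given expected size."""
    # S303 / S305: within the limit, the expected values are used unchanged.
    if expected_width <= limit:
        return (expected_width, expected_height)
    # S304: shrink both values by the same ratio so that the width
    # lands exactly on the limit and the proportions are preserved.
    ratio = limit / expected_width
    return (limit, expected_height * ratio)


if __name__ == "__main__":
    print(fit_to_limit(1200, 600, limit=600))  # too wide: scaled down
    print(fit_to_limit(400, 900, limit=600))   # fits: unchanged
```

Only the width is tested here, matching the embodiment; a variant that also checks the height would apply the smaller of the two ratios.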
In this embodiment, for an element that contains sub-elements, the heights and widths of the sub-elements are merged into the target height and width of the element object based on the sub-element arrangement rule. This effectively avoids errors in parsing width and height information caused by tag nesting, is compatible with a variety of complex tags and CSS attributes, and has broad applicability.
In a possible implementation, considering that some elements have no sub-elements, before traversing the sub-element list of the element object the method further includes: determining whether the sub-element list is empty; and if so, obtaining the frame height and width and the standard height and width of the element based on the mapping relationship between the attribute list and the width-height information list, and accumulating the frame height and width with the standard height and width to obtain the target height and width of the element.
It will be appreciated that if an element has no sub-elements, its height and width need only be calculated from its own height and width and its margin values.
In this step, the standard height and width of the element may be obtained from the content of its tag. For example, <img height="value"/> and <img width="value"/> give the height value and the width value of an image, that is, the standard height and width of the element. If the element has CSS attributes, the margin, padding, border width, distance from the element above, and so on are calculated from those attributes to obtain the frame height and width of the element.
It will be appreciated that accumulating the frame height and width with the standard height and width means accumulating each frame value with the corresponding standard value according to the type of the CSS attribute, for example accumulating the distance from the element above with the height value.
In this embodiment, for an element without sub-elements, the height and width of the element itself are calculated, the margins are calculated from the CSS attributes, and the sum of the two is used as the target height and width, so that the document layout is better suited and formatting disorder is avoided.
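For an element without sub-elements, the accumulation of frame and standard values might look like the sketch below. It assumes that the tag attributes and CSS declarations have already been parsed into dictionaries, that only pixel or unitless lengths are supported, and that the frame consists of margin, padding and border width on each side; the helper names `px` and `leaf_size` are hypothetical.

```python
from typing import Mapping, Tuple


def px(value: object) -> float:
    """Convert '12px' or '12' to 12.0; unsupported values count as 0."""
    text = str(value).strip().lower()
    if text.endswith("px"):
        text = text[:-2]
    try:
        return float(text)
    except ValueError:
        return 0.0


def frame_extent(css: Mapping[str, str], sides: Tuple[str, str]) -> float:
    """Sum margin, padding and border width over the two given sides."""
    total = 0.0
    for side in sides:
        total += px(css.get(f"margin-{side}", 0))
        total += px(css.get(f"padding-{side}", 0))
        total += px(css.get(f"border-{side}-width", 0))
    return total


def leaf_size(attrs: Mapping[str, str], css: Mapping[str, str]) -> Tuple[float, float]:
    """Return the target (width, height) of an element with no sub-elements."""
    # Standard size: taken from the tag's own width/height attributes.
    standard_w = px(attrs.get("width", 0))
    standard_h = px(attrs.get("height", 0))
    # Frame size: horizontal values go with the width, vertical with the height.
    frame_w = frame_extent(css, ("left", "right"))
    frame_h = frame_extent(css, ("top", "bottom"))
    return (standard_w + frame_w, standard_h + frame_h)


if __name__ == "__main__":
    attrs = {"width": "200", "height": "100"}
    css = {"margin-top": "10px", "margin-bottom": "10px",
           "padding-left": "5px", "padding-right": "5px",
           "border-left-width": "1px", "border-right-width": "1px"}
    print(leaf_size(attrs, css))
```

Pairing horizontal CSS values with the width and vertical ones with the height mirrors the per-type accumulation described above.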
Fig. 4 is a schematic diagram of an apparatus for converting an HTML file into a Word file according to an embodiment of the present application. As shown in Fig. 4, the apparatus for converting an HTML file into a Word file includes:
the parsing module 41, configured to parse the HTML file to obtain a plurality of elements and element objects corresponding to the elements, where each element object includes an attribute list, a sub-element list and a width-height information list, and mapping relationships exist among the attribute list, the sub-element list and the width-height information list;
the height-width calculation module 42, configured to traverse the sub-element list of the element object, calculate the height and width of each sub-element with a preset recursive algorithm according to the mapping relationships among the attribute list, the sub-element list and the width-height information list, and merge the heights and widths of the sub-elements into the target height and width of the element object;
the file generation module 43, configured to generate a Word file based on the elements and the target height and width.
In one possible design, the parsing module 41 is specifically configured to:
parse the HTML file to obtain an element set, where the element set includes a plurality of elements and the element tags corresponding to the elements;
perform secondary parsing on the elements based on the element tags to obtain the element objects corresponding to the elements.
In one possible design, the element tag includes a table element, where the table element includes a plurality of cells, and the parsing module 41 is further specifically configured to:
take the cells as sub-elements of the table element to generate the sub-element list;
acquire the cascading style sheet attributes of the cells and of the table element to generate the attribute list;
acquire the height and width information of the cells and of the table element to generate the width-height information list.
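The secondary parsing of a table element into the three lists might be sketched with Python's standard `html.parser` module as follows. The class name `TableObjectBuilder`, the cell keys such as `r0c1`, and the use of a shared key as the mapping relationship between the lists are illustrative assumptions; nested tables and row or column spans are not handled.

```python
from html.parser import HTMLParser


class TableObjectBuilder(HTMLParser):
    """Collects a table's cells into three lists linked by a shared key."""

    def __init__(self):
        super().__init__()
        self.sub_elements = []  # sub-element list: cell keys in document order
        self.attributes = {}    # attribute list: key -> raw CSS style string
        self.sizes = {}         # width-height info list: key -> (width, height)
        self._row = -1
        self._col = 0

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "table":
            # The table element's own CSS and size are recorded too.
            self._record("table", attr_map)
        elif tag == "tr":
            self._row += 1
            self._col = 0
        elif tag in ("td", "th"):
            key = f"r{self._row}c{self._col}"
            self._col += 1
            self.sub_elements.append(key)  # each cell is a sub-element
            self._record(key, attr_map)

    def _record(self, key, attr_map):
        self.attributes[key] = attr_map.get("style", "")
        self.sizes[key] = (attr_map.get("width"), attr_map.get("height"))


if __name__ == "__main__":
    builder = TableObjectBuilder()
    builder.feed(
        '<table style="margin:4px" width="300">'
        '<tr><td width="100" height="20">a</td><td style="padding:2px">b</td></tr>'
        '<tr><td>c</td></tr></table>'
    )
    print(builder.sub_elements)
    print(builder.attributes)
    print(builder.sizes)
```

Keying all three containers by the same identifier is one simple way to realise the mapping relationship, since the CSS attributes and the size of any sub-element can then be looked up directly from its entry in the sub-element list.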
In one possible design, the height-width calculation module 42 is specifically configured to:
acquire the cascading style sheet attribute corresponding to the sub-element according to the mapping relationship between the attribute list and the sub-element list, and calculate the frame height and width of the sub-element from the cascading style sheet attribute;
obtain the standard height and width of the sub-element according to the mapping relationship between the sub-element list and the width-height information list;
accumulate the frame height and width with the standard height and width to obtain the height and width of the sub-element.
In one possible design, the height-width calculation module 42 is specifically configured to:
accumulate the height values of the sub-elements in each column based on the sub-element arrangement rule to obtain a height value set, and accumulate the width values of the sub-elements in each row to obtain a width value set;
select the maximum height value from the height value set and the maximum width value from the width value set as the expected height value and the expected width value of the element object, respectively;
adjust the expected height value and the expected width value based on a preset limit value to obtain the target height and width of the element object.
In one possible design, the height-width calculation module 42 is further specifically configured to:
determine whether the expected width value exceeds the preset limit value;
if so, reduce the expected width value and the expected height value in equal proportion until the expected width value equals the preset limit value, and take the reduced expected width value and expected height value as the target height and width of the element object;
if not, take the expected height value and the expected width value as the target height and width of the element object.
In one possible design, the height-width calculation module 42 is further specifically configured to:
determine whether the sub-element list is empty;
if so, obtain the frame height and width and the standard height and width of the element object based on the mapping relationship between the attribute list and the width-height information list, and accumulate the frame height and width with the standard height and width to obtain the target height and width of the element object.
Fig. 5 is a schematic structural diagram of an electronic device based on a device for converting an HTML file into a Word file according to an embodiment of the present application. As shown in fig. 5, the electronic device of this embodiment includes: at least one processor 50 (only one shown in fig. 5), a memory 51, and a computer program stored in the memory 51 and executable on the at least one processor 50, the processor 50 implementing the steps in any of the various method embodiments described above when executing the computer program.
The electronic device may include, but is not limited to, a processor 50 and a memory 51. It will be appreciated by those skilled in the art that Fig. 5 is merely an example of an electronic device and is not limiting; the electronic device may include more or fewer components than shown, may combine certain components, or may use different components, and may for example also include input-output devices, network access devices, and the like.
The processor 50 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
For the specific implementation process of the processor 50, reference may be made to the method embodiments above; the implementation principle and technical effects are similar and are not repeated here.
The memory 51 may in some embodiments be an internal storage unit of the electronic device, such as a memory of the electronic device. The memory 51 may also be an external storage device of the electronic device in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like. Further, the memory 51 may also include both an internal storage unit and an external storage device of the electronic device. The memory 51 is used to store an operating system, application programs, boot loader (BootLoader), data, and other programs, etc., such as program codes of computer programs, etc. The memory 51 may also be used to temporarily store data that has been output or is to be output.
The embodiments of the present application also provide a computer readable storage medium storing a computer program, which when executed by a processor implements steps of the above-described respective method embodiments.
The computer readable storage medium described above may be implemented by any type of volatile or non-volatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk, or optical disk. A readable storage medium can be any available medium that can be accessed by a general purpose or special purpose computer.
An exemplary readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Alternatively, the readable storage medium may be integral to the processor. The processor and the readable storage medium may reside in an application-specific integrated circuit (ASIC). The processor and the readable storage medium may also reside as discrete components in the electronic device described above.
Those of ordinary skill in the art will appreciate that: all or part of the steps for implementing the method embodiments described above may be performed by hardware associated with program instructions. The foregoing program may be stored in a computer readable storage medium. The program, when executed, performs steps including the method embodiments described above; and the aforementioned storage medium includes: various media that can store program code, such as ROM, RAM, magnetic or optical disks.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method for converting an HTML file into a Word file, comprising:
parsing the HTML file to obtain a plurality of elements and element objects corresponding to the elements, wherein each element object comprises an attribute list, a sub-element list and a width-height information list, and mapping relationships exist among the attribute list, the sub-element list and the width-height information list;
traversing the sub-element list of the element object, calculating the height and width of each sub-element with a preset recursive algorithm according to the mapping relationships among the attribute list, the sub-element list and the width-height information list, and merging the heights and widths of the sub-elements into the target height and width of the element;
generating a Word file based on the elements and the target height and width.
2. The method of claim 1, wherein parsing the HTML file to obtain a plurality of elements and element objects corresponding to the elements comprises:
parsing the HTML file to obtain an element set, wherein the element set comprises a plurality of elements and the element tags corresponding to the elements;
performing secondary parsing on the elements based on the element tags to obtain the element objects corresponding to the elements.
3. The method of claim 2, wherein the element tag includes a table element, the table element includes a plurality of cells, and performing secondary parsing on the elements based on the element tags to obtain the element objects corresponding to the elements comprises:
taking the cells as sub-elements of the table element to generate the sub-element list;
acquiring the cascading style sheet attributes of the cells and of the table element to generate the attribute list;
acquiring the height and width information of the cells and of the table element to generate the width-height information list.
4. The method according to claim 1, wherein calculating the height and width of each sub-element with a preset recursive algorithm according to the mapping relationships among the attribute list, the sub-element list and the width-height information list comprises:
acquiring the cascading style sheet attribute corresponding to the sub-element according to the mapping relationship between the attribute list and the sub-element list, and calculating the frame height and width of the sub-element from the cascading style sheet attribute;
obtaining the standard height and width of the sub-element according to the mapping relationship between the sub-element list and the width-height information list;
accumulating the frame height and width with the standard height and width to obtain the height and width of the sub-element.
5. The method according to claim 1, wherein the sub-element list includes the sub-elements and a sub-element arrangement rule, and merging the heights and widths of the sub-elements into the target height and width of the element comprises:
accumulating the height values of the sub-elements in each column based on the sub-element arrangement rule to obtain a height value set, and accumulating the width values of the sub-elements in each row to obtain a width value set;
selecting the maximum height value from the height value set and the maximum width value from the width value set as the expected height value and the expected width value of the element object, respectively;
adjusting the expected height value and the expected width value based on a preset limit value to obtain the target height and width of the element.
6. The method of claim 5, wherein adjusting the expected height value and the expected width value based on a preset limit value to obtain the target height and width of the element comprises:
determining whether the expected width value exceeds the preset limit value;
if so, reducing the expected width value and the expected height value in equal proportion until the expected width value equals the preset limit value, and taking the reduced expected width value and expected height value as the target height and width of the element;
if not, taking the expected height value and the expected width value as the target height and width of the element.
7. The method of claim 1, wherein before traversing the sub-element list of the element object, the method further comprises:
determining whether the sub-element list is empty;
if so, obtaining the frame height and width and the standard height and width of the element object based on the mapping relationship between the attribute list and the width-height information list, and accumulating the frame height and width with the standard height and width to obtain the target height and width of the element.
8. An apparatus for converting an HTML file into a Word file, comprising:
a parsing module, configured to parse the HTML file to obtain a plurality of elements and element objects corresponding to the elements, wherein each element object comprises an attribute list, a sub-element list and a width-height information list, and mapping relationships exist among the attribute list, the sub-element list and the width-height information list;
a height-width calculation module, configured to traverse the sub-element list of the element object, calculate the height and width of each sub-element with a preset recursive algorithm according to the mapping relationships among the attribute list, the sub-element list and the width-height information list, and merge the heights and widths of the sub-elements into the target height and width of the element;
a file generation module, configured to generate a Word file based on the elements and the target height and width.
9. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the method of any one of claims 1 to 7.
10. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310756107.8A CN116882365A (en) 2023-06-25 2023-06-25 Method and system for converting HTML (hypertext markup language) file into Word file

Publications (1)

Publication Number Publication Date
CN116882365A true CN116882365A (en) 2023-10-13

Family

ID=88253914

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310756107.8A Pending CN116882365A (en) 2023-06-25 2023-06-25 Method and system for converting HTML (hypertext markup language) file into Word file

Country Status (1)

Country Link
CN (1) CN116882365A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination