CN111104557A

CN111104557A - Heterogeneous document processing system and method based on standard document markup language specification

Info

Publication number: CN111104557A
Application number: CN201911155894.0A
Authority: CN
Inventors: 黄琴; 王龙娟
Original assignee: Individual
Current assignee: Individual
Priority date: 2019-11-22
Filing date: 2019-11-22
Publication date: 2020-05-05

Abstract

The invention discloses a heterogeneous document processing system and method based on a standard document markup language specification. The processing method comprises the following steps: S1, converting the input heterogeneous document into a standard document markup language file; S2, parsing the standard document markup language file into standard content objects; S3, performing layout and view display using the standard content objects; S4, editing the standard content objects according to user operations; and S5, saving the edited standard content objects. The invention uses the standard document markup language to define an intuitive, highly readable way of organizing document content and defining its format, and converts heterogeneous documents into uniform standard content objects, so that an editor only needs to handle standard content objects in order to edit heterogeneous documents.

Description

Heterogeneous document processing system and method based on standard document markup language specification

Technical Field

The invention belongs to the technical field of computer processing, and particularly relates to a heterogeneous document processing system and method based on standard document markup language specifications.

Background

Formatted document content increasingly needs to be exchanged between different platforms and industries, yet parsing, displaying and editing such content remains cumbersome. The native document data formats of Word and WPS are not public, so these documents can only be edited and displayed with a specific document editor, and the underlying text content cannot be edited directly in order to modify the document. Readable and writable text formats whose format text is public, such as Markdown and the custom formats defined by other text editing vendors (for example, various electronic medical record manufacturers), also suffer from non-uniform underlying data formats, poor readability, and the lack of a universal editor. There is therefore a need for a standard specification for organizing document content and defining its format, together with a processing system and method, that lets users read and edit formatted document content intuitively and lets heterogeneous documents be edited by any editor that supports the standard document markup language specification.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a heterogeneous document processing system and method based on a standard document markup language specification, which use the standard document markup language to define an intuitive and highly readable way of organizing document content and defining its format.

The purpose of the invention is realized by the following technical scheme: a heterogeneous document processing system based on a standard document markup language specification, comprising:

the content analysis and conversion module is used for converting the input heterogeneous document into a standard document markup language file;

the analysis module is used for converting the standard document markup language file into a standard content object;

the event module is used for adding, deleting and modifying the standard content object according to the input event of the system;

the generating module is used for generating the standard content object processed by the event module into a corresponding standard document markup language file;

the layout module is used for sequentially laying out the standard content objects according to the display area rule and the paper size;

the drawing module, which delivers the document content to different devices for display or printing according to the layout information of the document;

and the content construction conversion module is used for converting the standard document markup language file into a heterogeneous document required by a user.

Further, the standard document markup language format is defined as:

a document element is represented using the Doc tag; the start tag <Doc> and the end tag </Doc> identify the start and end of a document;

a header element is represented using the Header tag; the start tag <Header> and the end tag </Header> identify the start and end of a header, and one document may contain a plurality of different headers;

a body (main text area) element is represented using the Main tag; the start tag <Main> and the end tag </Main> identify the start and end of the body;

a footer element is represented using the Footer tag; the start tag <Footer> and the end tag </Footer> identify the start and end of a footer;

a content style is represented using the <Format/> self-closing tag;

a paragraph element is represented using the Paragraph tag; the start tag <Paragraph> and the end tag </Paragraph> identify the start and end of a paragraph; a paragraph is a first-level child element of a header, body, footer or cell element;

a table element is represented using the Table tag; the start tag <Table> and the end tag </Table> identify the start and end of a table;

a row element is represented using the Row tag; the start tag <Row> and the end tag </Row> identify the start and end of a row;

a cell element is represented using the Cell tag; the start tag <Cell> and the end tag </Cell> identify the start and end of a cell;

a formula element is represented using the Formula tag; the start tag <Formula> and the end tag </Formula> identify the start and end of a formula;

using formula tags to represent sub-formula elements, the start tag < formula and end tag </formula and identify the starting and ending range of the formula; the sub-formula is subordinate to the formula element, and the sub-formula represents the minimum unit of the formula branch;

a picture is represented using the <Image/> self-closing tag;

a line is represented using the <Line/> self-closing tag;

an additional format element is represented using the Attach tag; the start tag <Attach> and the end tag </Attach> identify the start and end of the additional format element;

a list is represented using the List tag; the start tag <List> and the end tag </List> identify the start and end of a list; a list comprises list items, and the start tag <Item> and the end tag </Item> identify the start and end of a list item;

a selection box is represented using the CheckBox tag; the start tag <CheckBox> and the end tag </CheckBox> identify the start and end of a selection box; if the CheckBox has a group-id attribute it represents a radio box, otherwise it represents a check box;

an element is represented using the Element tag; the start tag <Element> and the end tag </Element> identify the start and end of an element;

an expression is represented using the <Expression/> self-closing tag;

an annotation is represented using the <Antotate/> self-closing tag;

editor information is represented using the <Editor/> self-closing tag;

a line feed is represented using the characters "\r\n", "\r" or "\n", or the <LF/> self-closing tag;

a tab is represented using the character "\t" or the <Tab/> self-closing tag;

a space is represented using the space character " " or the <Space/> self-closing tag.
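As a rough illustration of how these tags nest, the following Python sketch embeds a small hypothetical document and walks it with the standard-library XML parser. Only the tag names come from the definitions above; the text content, the single Doc root and the overall arrangement are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical SDML sample: tag names follow the definitions above,
# while the text content and arrangement are invented for illustration.
SAMPLE_SDML = """
<Doc>
  <Header><Paragraph>Report title</Paragraph></Header>
  <Main>
    <Paragraph>First line<LF/>Second line<Tab/>indented</Paragraph>
    <Paragraph>
      <Table>
        <Row><Cell><Paragraph>A1</Paragraph></Cell></Row>
      </Table>
    </Paragraph>
  </Main>
  <Footer><Paragraph>Page footer</Paragraph></Footer>
</Doc>
"""


def outline(elem, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE_SDML.strip())
print("\n".join(outline(root)))
```

The table is placed inside a Paragraph because paragraphs are described as the first-level children of headers, bodies, footers and cells.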

Further, the standard document markup language file has the following structural feature points:

a. a standard document markup language file may have a plurality of document nodes; a document node may have a plurality of header and footer nodes, but only one body node, or a plurality of page nodes used in place of the body node;

b. the container of a paragraph is called an editing panel; headers, bodies, footers and cells are all editing panels. Within an editing panel, the first occurrence of an element of any type should carry all of its attributes, while subsequent elements of the same type only need to specify the attributes that differ from the previous element of the same type; this improves the readability of the standard document markup language and saves storage space;

c. the first-level child elements of an editing panel can only be paragraph nodes; any element node other than document, header, body, page, footer and paragraph nodes may be contained in a paragraph node as required;

d. in body mode, a document node has only one body node and does not include page information;

e. in page mode, a plurality of page nodes are used in place of the body node to provide page information;

f. an element may use a bit-control attribute to pack a plurality of single-bit parameters in order to save storage space; each single-bit parameter may also be expressed as an independent attribute to improve readability.
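Points b and c lend themselves to a short sketch. The Python below is a minimal illustration, not the patented implementation: `check_panels` reports first-level panel children that are not paragraphs, and `effective_attributes` fills in attributes that an element omits because they match the previous element of the same type. Treating a nested Cell as its own scope, and the sample attribute values, are assumptions.

```python
import xml.etree.ElementTree as ET

PANEL_TAGS = {"Header", "Main", "Footer", "Cell"}  # editing panels (point b)


def check_panels(root):
    """Point c: first-level children of an editing panel must be Paragraphs.
    Returns a list of (panel_tag, offending_child_tag) pairs."""
    errors = []
    for panel in root.iter():
        if panel.tag in PANEL_TAGS:
            errors += [(panel.tag, c.tag) for c in panel if c.tag != "Paragraph"]
    return errors


def effective_attributes(panel):
    """Point b: an element inherits every attribute it omits from the
    previous element of the same tag inside the same editing panel.
    Returns (tag, attrs) pairs in document order, fully filled in."""
    last_seen = {}  # tag -> resolved attributes of the previous element
    resolved = []

    def walk(node):
        for child in node:
            attrs = {**last_seen.get(child.tag, {}), **child.attrib}
            last_seen[child.tag] = attrs
            resolved.append((child.tag, attrs))
            if child.tag not in PANEL_TAGS:  # assumption: a Cell is its own scope
                walk(child)

    walk(panel)
    return resolved


main = ET.fromstring(
    "<Main>"
    '<Paragraph space-before="6" space-after="6">one</Paragraph>'
    '<Paragraph space-after="12">two</Paragraph>'
    "<Paragraph>three</Paragraph>"
    "</Main>"
)
for tag, attrs in effective_attributes(main):
    print(tag, attrs)
```

Here the second paragraph only overrides space-after, and the third paragraph repeats nothing yet resolves to the same attributes as the second.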

Further, when the content analysis conversion module converts a heterogeneous document, it first determines the type of the heterogeneous document as follows:

if the input heterogeneous document itself conforms to the standard document markup language specification, no conversion is performed;

if the input heterogeneous document is in a text format that can be read and parsed directly, a standard document markup language file is generated by mapping from the original document format;

if the input heterogeneous document is in a format whose document text cannot be read and parsed directly, an official type library or method for that format is called to parse the document, and the document content and structural format information so obtained are then converted into a standard document markup language file.
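The three-way judgment can be sketched as a small dispatcher. The heuristics below are assumptions chosen for illustration: a document is taken to conform to the specification when it parses as XML with a Doc root, and to be a directly readable text format when it decodes as UTF-8. The `text_mapper` and `binary_parser` callables are hypothetical stand-ins for the mapping conversion and the official type library.

```python
import xml.etree.ElementTree as ET


def classify(data: bytes) -> str:
    """Return 'sdml', 'text' or 'binary' for a raw input document."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary"      # not directly readable: needs the official library
    try:
        if ET.fromstring(text).tag == "Doc":
            return "sdml"    # already conforms: no conversion needed
    except ET.ParseError:
        pass
    return "text"            # readable text format: mapping conversion


def to_sdml(data: bytes, text_mapper, binary_parser) -> str:
    """Dispatch to the matching conversion path and return SDML text."""
    kind = classify(data)
    if kind == "sdml":
        return data.decode("utf-8")
    if kind == "text":
        return text_mapper(data.decode("utf-8"))
    return binary_parser(data)


print(classify(b"<Doc><Main/></Doc>"), classify(b"# Title\n"), classify(b"\xff\xfe\x00"))
```

A real converter would likely rely on file extensions or format signatures as well, since some binary files happen to decode as UTF-8.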

The invention also discloses a heterogeneous document processing method based on the standard document markup language specification, which comprises the following steps:

S1, judging whether conversion is needed according to the format type of the input heterogeneous document; if so, converting the input heterogeneous document into a standard document markup language file; otherwise, proceeding directly to step S2;

S2, parsing the standard document markup language file into standard content objects;

S3, performing layout and view display using the standard content objects;

S4, editing the standard content objects according to user operations;

S5, saving the edited standard content objects, comprising the following sub-steps:

S51, saving the edited standard content objects as a standard document markup language file;

S52, where the document needs to be saved as a heterogeneous document in another format, generating a file in the specific format from the standard document markup language file through the corresponding content construction converter;

and S53, storing the converted file in the specific format.

Further, the step S1 includes the following sub-steps:

S11, loading the heterogeneous document;

S12, determining the type of the heterogeneous document: if the input heterogeneous document itself conforms to the standard document markup language specification, no conversion is performed; if the input heterogeneous document is in a text format that can be read and parsed directly, a standard document markup language file is generated by mapping from the original document format; if the input heterogeneous document is in a format whose document text cannot be read and parsed directly, an official type library or method for that format is called to parse the document, and the document content and structural format information so obtained are then converted into a standard document markup language file.

Further, the step S2 includes the following sub-steps:

S21, reading the XML-based standard document markup language file or character string, and calling an XML parser to parse it quickly into an XML DOM object;

S22, traversing the hierarchy of XML objects and converting it iteratively into standard content objects; content or style information that cannot be expressed through the XML object hierarchy is attached by using ID identifiers to associate other standard objects as the content or style information of a standard object;

S23, obtaining the document content, structure and style information from the standard content objects parsed in step S22, according to the hierarchy of the XML objects and the subordination relationships between the standard objects.
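Steps S21 and S22 can be sketched with the standard-library parser. The `SCO` class below is a hypothetical in-memory shape for a standard content object, and the `format-id` attribute used to reference a `<Format id="...">` element is an assumption, since the text only states that objects are associated through ID identifiers. Tail text after child elements is ignored for brevity.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SCO:
    """Hypothetical in-memory 'standard content object'."""
    tag: str
    attrs: dict
    text: str = ""
    children: list = field(default_factory=list)
    style: Optional["SCO"] = None  # filled in through an ID reference


def build(elem) -> SCO:
    """S22, first half: mirror the XML hierarchy as SCO nodes."""
    node = SCO(elem.tag, dict(elem.attrib), (elem.text or "").strip())
    node.children = [build(child) for child in elem]
    return node


def link_styles(root: SCO) -> SCO:
    """S22, second half: attach Format objects referenced by ID.
    'format-id' is an assumed attribute name for this sketch."""
    formats = {}

    def collect(node):
        if node.tag == "Format" and "id" in node.attrs:
            formats[node.attrs["id"]] = node
        for child in node.children:
            collect(child)

    def attach(node):
        ref = node.attrs.get("format-id")
        if ref is not None:
            node.style = formats.get(ref)
        for child in node.children:
            attach(child)

    collect(root)
    attach(root)
    return root


def parse_sdml(source: str) -> SCO:
    return link_styles(build(ET.fromstring(source)))  # S21, then S22


doc = parse_sdml(
    "<Doc><Main><Paragraph>"
    '<Format id="f1" fontname="SimSun" size="12"/>'
    '<Element format-id="f1">Hello</Element>'
    "</Paragraph></Main></Doc>"
)
element = doc.children[0].children[0].children[1]
print(element.text, element.style.attrs["fontname"])
```

After parsing, step S23 amounts to reading content from `text`, structure from `children`, and style from `style`.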

The beneficial effects of the invention are as follows. The Standard Document Markup Language (SDML) is used to define an intuitive, highly readable way of organizing document content and defining its format: the organization of document content is defined by XML DOM elements, and the characteristics of document elements are defined by XML DOM attributes. Heterogeneous documents are ultimately converted into uniform Standard Content Objects (SCO), and the editor only needs to handle Standard Content Objects (SCO) in order to edit heterogeneous documents. To support editing of a new heterogeneous document format, only the corresponding content analysis converter and content construction converter need to be developed, as modules or plug-ins, and no other functional module needs to be modified. This greatly improves the reusability of application-layer code, saves development cost, and broadens the applicability of the editor.
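The plug-in idea can be illustrated with a pair of registries keyed by format name. This is a minimal sketch under stated assumptions: the registry layout, the function names and the "plain" text format are all hypothetical, and the converters exchange SDML as strings.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict
from xml.sax.saxutils import escape

# Registries keyed by format name; every name here is illustrative.
PARSERS: Dict[str, Callable[[str], str]] = {}   # heterogeneous -> SDML
BUILDERS: Dict[str, Callable[[str], str]] = {}  # SDML -> heterogeneous


def register(fmt, parser, builder):
    """Add support for one format without touching any other module."""
    PARSERS[fmt] = parser
    BUILDERS[fmt] = builder


def load(fmt, content):
    return PARSERS[fmt](content)


def save(fmt, sdml):
    return BUILDERS[fmt](sdml)


def plain_to_sdml(text):
    """Content analysis converter: one Paragraph per input line."""
    paragraphs = "".join(
        f"<Paragraph>{escape(line)}</Paragraph>" for line in text.splitlines()
    )
    return f"<Doc><Main>{paragraphs}</Main></Doc>"


def sdml_to_plain(sdml):
    """Content construction converter: one output line per Paragraph."""
    root = ET.fromstring(sdml)
    return "\n".join(p.text or "" for p in root.iter("Paragraph"))


register("plain", plain_to_sdml, sdml_to_plain)

sdml = load("plain", "a < b\nsecond")
print(sdml)
print(save("plain", sdml))
```

Supporting another format would mean one more `register` call, while parsing, layout, drawing and editing keep working on the same SDML and standard content objects.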

Drawings

FIG. 1 is a flow chart of a heterogeneous document processing method based on standard document markup language specifications.

Detailed Description

The technical scheme of the invention is further explained by combining the attached drawings.

The invention relates to a heterogeneous document processing system based on standard document markup language specification, comprising:

the analysis module, used for converting the standard document markup language file into standard content objects. The Standard Document Markup Language (SDML) is essentially an XML-based text representation for organizing document content and defining its format. It can be edited directly by hand to modify content and format quickly, and software can readily parse it through an XML parser into in-memory objects usable by a program. The format of these in-memory objects is standardized so that programs can process them conveniently; they are called Standard Content Objects (SCO for short);

the event module, used for adding, deleting and modifying standard content objects according to the input events of the system; by adding, deleting and changing (restyling) Standard Content Objects (SCO) in response to input events, the event module realizes the editing function of the document;

the layout module, used for laying out the standard content objects in sequence according to the display area rules and the paper size once the Standard Content Objects (SCO) have been parsed; it handles content positioning, line breaking, pagination, normalization of punctuation positions and the like, and determines how content is arranged in sequence within the display area;

the drawing module, which delivers the document content to different devices for display or printing according to the layout information of the document, for example screen rendering, picture rendering and print rendering;

the content construction conversion module, used for converting the standard document markup language file into the heterogeneous document required by the user, where a heterogeneous document may be in any format capable of representing a document, such as Markdown, WPS, Word, and the various custom formats used for electronic medical records.

The standard document markup language format of the present invention is defined as:

(1) Document format specification: a document element is represented using the Doc (document) tag; the start tag <Doc> and the end tag </Doc> identify the start and end of a document. It is represented as follows:

the document field specification is as follows:

document semantic description: width and height are the width and height of the layout paper of the current document. top-margin, bottom-margin, left-margin and right-margin are the top, bottom, left and right page margins respectively. The top, left and right margins together determine the content area of the header (Header). The top, bottom, left and right margins together determine the content area of the body (Main). The bottom, left and right margins together determine the content area of the footer (Footer). unit is the unit used for the numerical parameters of the document and can be centimeters, points, pixels, etc.
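One plausible reading of these margin rules is sketched below in Python. It assumes that the header occupies the top margin band and the footer the bottom margin band, each bounded by the left and right margins, with the origin at the top-left corner of the paper; the text does not spell out this geometry, so treat it as an assumption. The attribute names are those listed above.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    x: float
    y: float
    width: float
    height: float


def content_areas(doc: dict) -> dict:
    """Derive Header, Main and Footer rectangles from Doc attributes."""
    w, h = float(doc["width"]), float(doc["height"])
    top, bottom = float(doc["top-margin"]), float(doc["bottom-margin"])
    left, right = float(doc["left-margin"]), float(doc["right-margin"])
    inner = w - left - right  # horizontal extent shared by all three areas
    return {
        "Header": Rect(left, 0.0, inner, top),              # top, left, right
        "Main": Rect(left, top, inner, h - top - bottom),   # all four margins
        "Footer": Rect(left, h - bottom, inner, bottom),    # bottom, left, right
    }


# Illustrative values in centimeters (unit="cm").
areas = content_areas({
    "width": "21", "height": "30",
    "top-margin": "3", "bottom-margin": "2",
    "left-margin": "2.5", "right-margin": "2.5",
})
for name, rect in areas.items():
    print(name, rect)
```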

The document has the following bit-control attributes:

(2) Header format specification: a header element is represented using the Header tag; the start tag <Header> and the end tag </Header> identify the start and end of a header. A document may contain a number of different headers.

The header is a container of paragraph elements, represented as follows:

the header field specification is shown in the following table:

header semantic description: top-margin, bottom-margin, left-margin and right-margin are the top, bottom, left and right inner margins respectively; together they determine the layout space available to content inside the header (Header). The starting page number from which the header applies (application-page) is used to determine which header a page uses. A document may have multiple headers, and this attribute distinguishes the page number ranges to which the different headers apply.
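Choosing the header for a given page can be sketched as a lookup over start pages. The Python below assumes that each header applies from its application-page up to the page before the next header's application-page; that range interpretation, and the representation of headers as (start page, header) pairs, are assumptions for the example.

```python
from bisect import bisect_right


def header_for_page(headers, page):
    """headers: list of (application_page, header) pairs.
    Returns the header with the largest start page not exceeding `page`,
    or None when no header has started yet."""
    ordered = sorted(headers, key=lambda pair: pair[0])
    starts = [start for start, _ in ordered]
    index = bisect_right(starts, page)
    return ordered[index - 1][1] if index else None


headers = [(1, "cover"), (3, "body"), (10, "appendix")]
for page in (1, 2, 3, 9, 12):
    print(page, header_for_page(headers, page))
```

The same lookup would serve footers, which carry the same application-page attribute.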

(3) Body format specification: a body element is represented using the Main tag; the start tag <Main> and the end tag </Main> identify the start and end of the body. All body content can be placed in a single body area, or page elements can be used so that the body content is laid out across different pages to achieve paginated display.

The body is a container of paragraph elements, represented as follows:

the body field specification is shown in the following table:

body semantic description: top-margin, bottom-margin, left-margin and right-margin are the top, bottom, left and right inner margins respectively; together they determine the layout space available to content inside the body area.

(4) Footer format specification: a footer element is represented using the Footer tag; the start tag <Footer> and the end tag </Footer> identify the start and end of a footer. A document may contain a number of different footers.

The footer is a container of paragraph elements, represented as follows:

the footer field specification is shown in the following table:

footer semantic description: top-margin, bottom-margin, left-margin and right-margin are the top, bottom, left and right inner margins respectively; together they determine the layout space available to content inside the footer (Footer). The starting page number from which the footer applies (application-page) is used to determine which footer a page uses. A document may have multiple footers, and this attribute distinguishes the page number ranges to which the different footers apply. page-num-style identifies the page number style that the current footer uses when displaying the page number.

Further styles can be defined as needed, with the editor determining how each page number style is drawn. The following are four examples of common styles:

(5) Content style format specification: a content style is represented using the <Format/> self-closing tag, as follows:

the content style field specification is shown in the following table:

content style semantic description: the content style number (id) allows paragraphs and other content to reference this format information. fontname identifies the font used when displaying the content. size indicates the font size used when displaying the content. color denotes the font color used when displaying the content. scale denotes the scaling applied to each unit of content. spacing represents the spacing between content units. offset-x represents the horizontal offset of the content relative to its original starting point in the layout; special content such as a sub-formula may use this attribute to adjust the content horizontally. offset-y represents the vertical offset of the content relative to its line in the original layout; special content such as a sub-formula may use this attribute to adjust the content vertically. In revision mode, add indicates that the content was added by a reviser and records the revision sequence number. In revision mode, del indicates that the content was deleted by a reviser and records the revision sequence number.

The specific semantics of the bit-control attribute of the content style are shown in the following table:

(6) Paragraph format specification: a paragraph element is represented using the Paragraph tag; the start tag <Paragraph> and the end tag </Paragraph> identify the start and end of a paragraph.

A paragraph is a first-level child element of a header, body (page), footer or cell element. Paragraph nodes are represented as follows:

the paragraph field is specified as shown in the following table:

paragraph semantic description: the specific meanings of the alignment attribute values are shown in the following table:

left-index and right-index are the left and right indent distances respectively, and determine how far the content inside the paragraph is offset inward when layout begins. space-before and space-after are the spacing before and after the paragraph respectively, and determine the gap between paragraphs. The special indentation style (specification-format) takes the values shown in the following table; each style, combined with the special indentation value, determines the indentation of each line.

specific-value represents an indentation value determined by a special indentation pattern.

The line spacing rule (linespace-rule) contains the rules shown in the table below, with different rules determining the line spacing effect in combination with the line spacing values.

linespace-value represents the actual value of the line spacing determined by the line spacing rule.

The larger the level value (level), the lower the paragraph sits in the hierarchy. The content of subsequent paragraphs logically belongs to the preceding paragraph of higher hierarchy, until a paragraph of the same level, a paragraph of a higher level, or a paragraph without a level value is encountered. The default paragraph level value is 0.

The specific semantics of the bit-control attribute of a paragraph are shown in the following table:

(7) Table format specification: a table element is represented using the Table tag; the start tag <Table> and the end tag </Table> identify the start and end of a table;

a 3 row 3 column table as shown below:

the corresponding "standard document markup language" format is as follows:

a 3-row, 3-column table in which the cell in the first row and first column is merged with the two cells below it, and the cell in the second row and second column is merged with the one cell to its right, is shown below:

the corresponding "standard document markup language" format is expressed as follows:

the table fields specify the following table:

table semantic description: id denotes the identifier of the table. name defines the entity name of the current table object. rows defines the number of rows in the table; the number of rows can also be determined from the number of row elements. cols defines the number of columns in the table; the number of columns can also be determined from the number of cells combined with their colspan attributes.
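The fallback rules for rows and cols can be sketched as follows. The Python below prefers the explicit attributes and otherwise counts Row elements and sums the colspan values of the first row; using the first row for the column count is an assumption that holds when no cell from a row above extends into it.

```python
import xml.etree.ElementTree as ET


def table_shape(table):
    """Return (rows, cols) for a Table element."""
    rows = table.findall("Row")
    n_rows = int(table.get("rows", len(rows)))
    if "cols" in table.attrib:
        n_cols = int(table.attrib["cols"])
    else:
        first_row = rows[0].findall("Cell") if rows else []
        n_cols = sum(int(cell.get("colspan", 1)) for cell in first_row)
    return n_rows, n_cols


table = ET.fromstring(
    "<Table>"
    '<Row><Cell/><Cell colspan="2"/></Row>'
    "<Row><Cell/><Cell/><Cell/></Row>"
    "</Table>"
)
print(table_shape(table))
```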

The table has the following bit control attributes:

(8) Row format specification: a row element is represented using the Row tag; the start tag <Row> and the end tag </Row> identify the start and end of a row.

The row field specification is shown in the following table:

row semantic description: height represents the height of the current row. split-id indicates that, when a table is split across two pages in view mode, this table row has been split.

(9) Cell format specification: a cell element is represented using the Cell tag; the start tag <Cell> and the end tag </Cell> identify the start and end of a cell.

The cell field specification is shown in the following table:

cell semantic description: width represents the width of the cell, that is, its actual width once the column merge parameter is taken into account. height represents the height of the cell, that is, its actual height once the row merge parameter is taken into account. rowspan is the row merge parameter; it defaults to 1, meaning that no other cells are merged, and a value N greater than 1 indicates that the cell spans N rows by merging downward. colspan is the column merge parameter; it defaults to 1, meaning that no other cells are merged, and a value N greater than 1 indicates that the cell spans N columns by merging to the right. border-width, border-color and border-style are the border width, border color and border style respectively, and determine how the cell border is displayed.
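The effect of the merge parameters on the table grid can be sketched in Python. The sketch reads rowspan as the number of rows a cell covers and colspan as the number of columns, which is the reading consistent with the cols rule given for tables, and it assumes that covered positions are simply omitted from the markup, as in HTML tables. Both points are assumptions. The sample reproduces the merged 3-row, 3-column table described in the table format specification.

```python
import xml.etree.ElementTree as ET


def expand_grid(table):
    """Map every (row, column) position to the Cell element covering it."""
    grid = {}
    for r, row in enumerate(table.findall("Row")):
        c = 0
        for cell in row.findall("Cell"):
            while (r, c) in grid:  # skip positions covered from a row above
                c += 1
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = cell
            c += colspan
    return grid


def cell_xml(label, extra=""):
    return f"<Cell{extra}><Paragraph>{label}</Paragraph></Cell>"


table = ET.fromstring(
    "<Table>"
    "<Row>" + cell_xml("A", ' rowspan="3"') + cell_xml("B") + cell_xml("C") + "</Row>"
    "<Row>" + cell_xml("D", ' colspan="2"') + "</Row>"
    "<Row>" + cell_xml("E") + cell_xml("F") + "</Row>"
    "</Table>"
)
grid = expand_grid(table)
layout = [[grid[(r, c)].findtext("Paragraph") for c in range(3)] for r in range(3)]
for line in layout:
    print(" ".join(line))
```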

The cells have the bit-control attributes shown in the following table:

(10) Formula format specification: a formula element is represented using the Formula tag; the start tag <Formula> and the end tag </Formula> identify the start and end of a formula.

An example formula:

is represented as follows:

the formula field specification is shown in the following table:

formula semantic description: id represents the identifier of the formula. class represents the formula type; the system can determine how the formula is displayed from the content styles inside the formula element, and can also use the formula type to add richer styles that improve the display. name identifies the entity name of the current formula. width and height are the designed layout width and height of the formula respectively. scale indicates the ratio by which the formula instance is resized.

(11) Subformula format specification: using formula tags to represent the sub-formula elements, the start tag < formula and end tag </formula arc > identify the start and end ranges of the sub-formula; the sub-formula is subordinate to the formula element, and the sub-formula represents the minimum unit of the formula branch.

Example subformula:

is represented as follows:

the sub-formulas are dependent on formula elements and are used only to represent the minimum unit of a formula line, and no additional attribute field definition is required.

(12) Picture format specification: a picture is represented using the <Image/> self-closing tag, as follows:

the picture field specification is shown in the following table:

picture semantic description: id is the identifier of the picture. name identifies the entity name of the current picture. width and height identify the display width and height of the current picture. src represents the picture data source, which may be the unique identifier of the picture in a storage system, the storage path of the picture on a file server, or a data source in Data URL format. The system locates the picture content by combining the picture data source with the system configuration.

(13) Line format specification: a line is represented using the <Line/> self-closing tag, as follows:

the line field specification is shown in the following table:

line semantic description: class denotes the kind of line; for example, class = split line denotes a dividing line. type represents the style type of the line; for example, type = solid represents a solid line. start-x and start-y are the coordinates of the starting point relative to the layout of the parent node. end-x and end-y are the coordinates of the end point relative to the layout of the parent node.

(14) Additional format specification: an additional format element is represented using the Attach tag; the start tag <Attach> and the end tag </Attach> identify the start and end of the additional format element. An additional object does not take part in the layout process; once layout is complete, it is drawn relative to the layout parameters of its parent element. The example shown below achieves a watermark effect in the body page.

The additional field specification is shown in the following table:

additional semantic description: class denotes the additional type. offset-x and offset-y are the horizontal and vertical offsets respectively, relative to the layout parameters of the parent element.

(15) List format specification: a list is represented using the List tag; the start tag <List> and the end tag </List> identify the start and end of a list. A list contains list items, and the start tag <Item> and the end tag </Item> identify the start and end of a list item, as follows:

the list field specification is shown in the following table:

list semantic description: id identifies the List or Item. The style attribute of the List identifies the style of the list. text represents the label text of a list item. value is a weight value that specifies the numerical value or score assigned when the list item is selected. tip represents the hint information. group-id represents the group id; items with the same group-id value belong to the same group and cannot be selected at the same time.

(16) Check box format specification: the CheckBox tag represents a check box; the start tag <CheckBox> and the end tag </CheckBox> mark the beginning and end of the check box, as follows:

The check box fields are specified in the following table:

Check box semantic description: id is the check box id. style is the check box style. text is the label text of the check box. value is a weight value that specifies the numerical value or score assigned when the check box is selected. tip is the hint information. group-id is the group id; boxes with the same group-id value belong to the same group. If a CheckBox has a group-id attribute, it is presented as a radio button; otherwise it is presented as a check box.
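The group-id rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the sample markup and its values are invented, the function names `control_kind` and `select` are hypothetical, and mutual exclusion inside a group is assumed from ordinary radio-button behavior.

```python
import xml.etree.ElementTree as ET

# Hypothetical SDML fragment; attribute names follow the field description,
# the values are invented for illustration.
SAMPLE = """
<Paragraph>
  <CheckBox id="c1" text="Male" group-id="g1"></CheckBox>
  <CheckBox id="c2" text="Female" group-id="g1"></CheckBox>
  <CheckBox id="c3" text="Smoker"></CheckBox>
</Paragraph>
"""

def control_kind(box):
    """A CheckBox with a group-id acts as a radio button, otherwise as a check box."""
    return "radio" if box.get("group-id") is not None else "checkbox"

def select(boxes, selected, box_id):
    """Return the new set of selected ids after the user selects box_id.

    Selecting a grouped box deselects every other box in the same group
    (assumed radio-button semantics).
    """
    by_id = {b.get("id"): b for b in boxes}
    group = by_id[box_id].get("group-id")
    result = set(selected)
    if group is not None:
        result -= {b.get("id") for b in boxes if b.get("group-id") == group}
    result.add(box_id)
    return result

boxes = ET.fromstring(SAMPLE).findall("CheckBox")
kinds = {b.get("id"): control_kind(b) for b in boxes}

state = select(boxes, set(), "c1")
state = select(boxes, state, "c3")
state = select(boxes, state, "c2")  # replaces c1 in group g1, keeps c3
```

Keeping the selection state outside the markup avoids inventing an attribute that the field table does not define.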

(17) Element format specification: the Element tag represents an element; the start tag <Element> and the end tag </Element> mark the beginning and end of the element.

An element is a composite unit of content. Whether an element may contain other elements is determined by its configuration. As follows:

The element fields are specified in the following table:

Element semantic description: id identifies the element. class is the element class. name is the entity name of the current element. note is the background text. tip is the hint information. default-text is the initial default value. value is a weight value, used as the element's numerical value when elements are combined in a calculation. before-tag is the tag content added before the element content. after-tag is the tag content added after the element content. start-border is the identifier of the element's start boundary. end-border is the identifier of the element's end boundary. input-mode is the input type, such as direct input, selection input, or a specific time format. width is the configured width of the element. include-cfg indicates which content types the element may contain.

The significance of the bit-control attribute of an element is shown in the following table:

(18) Expression format specification: an expression is represented by the <Expression/> self-closing tag, as follows:

The expression fields are specified in the following table:

Expression semantic description: event is the triggering event. action is the action to execute; it is an action with a specific meaning. source is the data source. field is the field. write-back is the mode in which content updates are written back to the data source. mode is the execution mode: "always execute", "execute once only", or "do not execute".

(19) Annotation format specification: an annotation is represented by the <Annotate/> self-closing tag, as follows:

The annotation fields are specified in the following table:

Annotation semantic description: the annotation element does not store the annotation content itself; instead, the session id is used to look up the corresponding annotation interaction information in the system. Annotations may be nested within one another. The start of an annotation is marked by <Annotate id="xx" type="start"/>, and the end of an annotation is marked by <Annotate id="xx" type="end"/>.
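Because annotations are delimited by paired start and end markers rather than by enclosing tags, a reader has to pair the markers by id. The Python sketch below shows one way to do that for markers that are direct children of a single paragraph. The sample fragment and the function name `annotation_ranges` are assumptions, and text nested inside other child elements is deliberately ignored to keep the example short.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: annotation "a2" is nested inside annotation "a1".
SAMPLE = (
    '<Paragraph>'
    '<Annotate id="a1" type="start"/>outer '
    '<Annotate id="a2" type="start"/>inner'
    '<Annotate id="a2" type="end"/> tail'
    '<Annotate id="a1" type="end"/>'
    '</Paragraph>'
)

def annotation_ranges(paragraph):
    """Map each annotation id to the text between its start and end markers."""
    open_ids = []  # ids of annotations that are currently open
    ranges = {}    # annotation id -> text collected so far
    for node in paragraph:
        if node.tag == "Annotate":
            ann_id = node.get("id")
            if node.get("type") == "start":
                open_ids.append(ann_id)
                ranges[ann_id] = ""
            elif node.get("type") == "end":
                open_ids.remove(ann_id)  # ValueError if the id was never opened
        # ElementTree stores the text that follows a child in its .tail
        for ann_id in open_ids:
            ranges[ann_id] += node.tail or ""
    return ranges

ranges = annotation_ranges(ET.fromstring(SAMPLE))
```

Tracking the open ids in a list lets nested and overlapping annotations collect the same stretch of text independently.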

(20) Revision information format specification: the <Editor/> self-closing tag represents reviser (Editor) information, as follows:

The revision information fields are specified in the following table:

Revision information semantic description: serial-id is the revision serial number; serial numbers increase sequentially from 0, and a larger serial number indicates a later revision. id identifies the reviser. name is the reviser's name. time is the revision time.

(21) Page format specification: a page is a virtual display concept inside the body. The page size is determined by the paper width and paper height in the document attributes, and the placement of the header and footer within the page is determined by the document's margin parameters. In normal mode, page information does not need to be saved in the standard document markup language file. If a third-party application needs to extract page information, pages may be defined in the following format when saving to a standard document markup language file:

Page field specification: a page is a virtual view concept, and its parameters can be derived from the related attributes of the document (Doc) and the body (Main), as shown in the following table:

Page semantic description: width and height are the paper width and paper height, which give the paper size used to lay out the current document. top-margin, bottom-margin, left-margin and right-margin are the top, bottom, left and right margins reserved inside the page; they determine the layout space available to content in the text area of the page. page-num is the page number of the current page.
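The relationship between paper size, margins and the text area can be made concrete with a short Python sketch. The dictionary representation, the function name `content_area`, and the sample values (A4 paper in millimeters) are assumptions made for illustration; the specification itself leaves the unit open.

```python
def content_area(page):
    """Return (width, height) of the text area left after subtracting the margins."""
    width = page["width"] - page["left-margin"] - page["right-margin"]
    height = page["height"] - page["top-margin"] - page["bottom-margin"]
    if width <= 0 or height <= 0:
        raise ValueError("margins leave no room for content")
    return width, height

# Illustrative values only: A4 paper measured in millimeters.
page = {
    "width": 210, "height": 297,
    "top-margin": 25, "bottom-margin": 25,
    "left-margin": 30, "right-margin": 30,
    "page-num": 1,
}
area = content_area(page)
```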

(22) Special function character format specification: the parser of the standard document markup language supports the following representations of whitespace characters:

In the text of element content, the characters "\r\n", "\r" and "\n" represent a line feed, the character "\t" represents a tab, and a space character represents a space. Some XML libraries may not parse these whitespace characters as expected; for example, several consecutive space characters may be collapsed into a single space. The standard document markup language therefore also defines the following special function character formats to represent such content.

A space is represented by the <Space count="1"/> self-closing tag, where the count attribute is the number of space characters. A line feed is represented by the <LF/> self-closing tag. A tab is represented by the <Tab count="8"/> self-closing tag, where the count attribute is the number of tab units.
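A reader of such content has to expand the Space, LF and Tab tags back into characters. The Python sketch below is one possible way to do so. The function name `flatten`, the default count of 1, and the choice to render one tab unit as a single space are assumptions, not part of the specification.

```python
import xml.etree.ElementTree as ET

TAB_UNIT = " "  # assumption: one tab unit is rendered as one space

def flatten(element):
    """Concatenate an element's text, expanding Space, LF and Tab tags."""
    parts = [element.text or ""]
    for child in element:
        count = int(child.get("count", "1"))  # default of 1 is an assumption
        if child.tag == "Space":
            parts.append(" " * count)
        elif child.tag == "LF":
            parts.append("\n")
        elif child.tag == "Tab":
            parts.append(TAB_UNIT * count)
        parts.append(child.tail or "")  # text that follows the tag
    return "".join(parts)

sample = ET.fromstring(
    '<Paragraph>a<Space count="3"/>b<LF/>c<Tab count="8"/>d</Paragraph>'
)
text = flatten(sample)
```

Because the whitespace is carried in attributes rather than in character data, it survives XML tools that collapse runs of spaces.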

(23) Supplementary attribute usage rules

Bit-control attributes: a bit parameter uses a binary value of one or more bits to represent the integer value of the corresponding attribute. For example, a 1-bit value can represent the two states 0 and 1, a 2-bit value can represent the four states 00, 01, 10 and 11, and wider parameters extend in the same way according to their binary meaning.

The attribute name of a bit-control attribute is not limited to cfg; any attribute that follows the bit-control concept is a bit-control attribute. For example, include-cfg may be used as the name of a bit-control attribute.

In the format definition of the present invention, a plurality of bit parameters can be organized using a bit-control attribute to save memory space. Independent attribute representations may also be used with single bit parameters to improve readability.

An example of the use of bit-controlled attribute splitting is shown below:

an example of the use of bit-controlled attribute consolidation is shown below:

The cfg attribute value is calculated as follows: combining the bit parameters from the low bit to the high bit gives the binary value 100010, which corresponds to the decimal value 34.
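The calculation can be checked with a few lines of Python. The helper names `pack_bits` and `unpack_bits` and the six sample parameters are hypothetical; the only fact taken from the text is that the binary value 100010 equals decimal 34.

```python
def pack_bits(flags):
    """Combine single-bit parameters into one integer; flags[0] is the lowest bit."""
    value = 0
    for position, flag in enumerate(flags):
        if flag:
            value |= 1 << position
    return value

def unpack_bits(value, width):
    """Split an integer back into `width` single-bit parameters, lowest bit first."""
    return [(value >> position) & 1 for position in range(width)]

# Six hypothetical single-bit parameters, listed from bit 0 (lowest) upward.
flags = [0, 1, 0, 0, 0, 1]
cfg = pack_bits(flags)          # bits 1 and 5 set: 2 + 32
binary = format(cfg, "06b")     # the same value written in binary
restored = unpack_bits(cfg, 6)  # back to the independent parameters
```

Packing trades readability for compactness, which is why the specification also allows each single-bit parameter to be written as its own attribute.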

The combined bit-control value is not limited to decimal notation; it may also be written in other number bases.

Units of numerical attributes: an attribute may use any unit, for example metric or imperial units. Example: a width of 21 may mean 21 centimeters or 21 points. The unit may be determined by the parser of the content or by a unit attribute.

Default attribute values: like an ordinary tag, a self-closing tag may leave any attribute at its default; only the attributes needed to affect the content have to be written. For example, to make the word "five" in the following text bold, only the text of the designated region needs to be delimited:

After the word "five" is made bold, the content style is expressed as:

The "heterogeneous document" referred to in the invention may be in any other document format. The module or plug-in responsible for converting the various heterogeneous documents into standard document markup language files is called the content parsing conversion module. When converting a heterogeneous document, the content parsing conversion module first determines the type of the heterogeneous document:

If the input heterogeneous document can be read and parsed directly as text, the standard document markup language file is generated by mapping from the original document format. Common examples include Markdown (a markup language that can be written in an ordinary text editor), custom XML-based document formats, other text with a regular content format, and the document formats defined by other enterprise platforms (such as the various electronic medical record vendors). These formats are directly readable and parsable (directly visible text), so the content parsing conversion module can generate a Standard Document Markup Language (SDML) file by mapping from the original document format.

If the input heterogeneous document is not in a text format that can be read and parsed directly, the official library or methods for that document type are called to parse it; the document content and structural format information are obtained and then converted into a standard document markup language file. For example, for documents such as Word and WPS, which have no textual document format, the structure and content of the document are obtained through the interfaces provided by Word and WPS and then converted into a Standard Document Markup Language (SDML) file.
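The type decision described above amounts to a small dispatch. The Python sketch below assumes that the document type is inferred from the file extension; the extension sets, the ".sdml" suffix and the function name `conversion_route` are all hypothetical, since the text does not say how the type is detected.

```python
import os

# Assumption: the document type is inferred from the file extension.
# The extension sets and the ".sdml" suffix are hypothetical.
TEXT_FORMATS = {".md", ".markdown", ".xml", ".txt"}
OPAQUE_FORMATS = {".doc", ".docx", ".wps"}

def conversion_route(path):
    """Pick how a document reaches SDML: 'none', 'mapping' or 'vendor-api'."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".sdml":
        return "none"        # already SDML, no conversion needed
    if ext in TEXT_FORMATS:
        return "mapping"     # readable text: map the original format to SDML
    if ext in OPAQUE_FORMATS:
        return "vendor-api"  # not readable as text: use the format's official library
    raise ValueError(f"unsupported document type: {ext!r}")

routes = [conversion_route(p) for p in ("notes.md", "report.DOCX", "record.sdml")]
```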

As shown in FIG. 1, the present invention discloses a heterogeneous document processing method based on standard document markup language specification, comprising the following steps:

S1, determining from the format type of the input heterogeneous document whether conversion is needed; if so, converting the input heterogeneous document into a standard document markup language file; otherwise, proceeding to step S2; this step comprises the following sub-steps:

S11, loading the heterogeneous document;

S2, parsing the standard document markup language file into a standard content object; this step comprises the following sub-steps:

S3, carrying out layout and view display by using standard content objects;

S4, editing the standard content object according to the user operation;

S52, when the document needs to be saved as a heterogeneous document in another format, generating a file in the specific format from the standard document markup language file through the corresponding content construction converter; this step is the reverse of the conversion in step S1: the SDML file is converted into the corresponding file format by the content construction converter for Markdown, Word, WPS, or another specific document format;

and S53, storing the converted specific format file.

The following example illustrates the heterogeneous document parsing and conversion process, using an original document in Markdown format.

Paragraph processing: a Markdown paragraph has no special format; a line break within a paragraph is written as two or more spaces followed by a carriage return. In this system, the Paragraph element represents a paragraph, and the characters "\r\n", "\r" or "\n", or the tag <LF/>, can be used as line-break marks.
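The paragraph rule can be sketched in Python as follows. This is a minimal illustration, assuming that a Markdown hard break is two or more spaces followed by a line ending and that it maps to an LF tag inside a Paragraph element; the function name `paragraph_to_sdml` is hypothetical, and single line endings without trailing spaces are left in the text unchanged.

```python
import re
import xml.etree.ElementTree as ET

# Two or more spaces followed by any of the three line-ending forms.
HARD_BREAK = re.compile(r" {2,}(?:\r\n|\r|\n)")

def paragraph_to_sdml(markdown_paragraph):
    """Build a Paragraph element, turning Markdown hard breaks into LF tags."""
    paragraph = ET.Element("Paragraph")
    lines = HARD_BREAK.split(markdown_paragraph)
    paragraph.text = lines[0]
    for line in lines[1:]:
        lf = ET.SubElement(paragraph, "LF")
        lf.tail = line  # text after the break follows the LF tag
    return paragraph

element = paragraph_to_sdml("first line  \nsecond line   \r\nthird line")
xml_text = ET.tostring(element, encoding="unicode")
```

Emitting the LF tag rather than a raw newline keeps the break explicit even if a downstream XML tool normalizes whitespace.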

Content style conversion: Markdown supports font styles such as italic and bold.

The format correspondence is shown in the following table:

Table processing: the Table tag represents a table element; the start tag <Table> and the end tag </Table> mark the beginning and end of the table.

The table conversion correspondence is shown in the following table:

The main elements of the Standard Document Markup Language (SDML) specification are exemplified as follows:

It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, and that the scope of the invention is not limited to these specifically recited embodiments and examples. Those skilled in the art can make various other specific changes and combinations based on the teachings of the present invention without departing from its spirit, and these changes and combinations remain within the scope of the invention.

Claims

1. A heterogeneous document processing system based on a standard document markup language specification, comprising:

2. The standard document markup language specification based heterogeneous document processing system according to claim 1, wherein the standard document markup language format is defined as:

Header tags are used to represent Header elements, a start tag <Header> and an end tag </Header> identify the start and end range of a header, and a document contains a plurality of different headers;

Footer tags are used to represent Footer elements, a start tag <Footer> and an end tag </Footer> identify the start and end range of a footer, and a document contains a plurality of different footers;

a content style is represented using a <Format/> self-closing tag;

a picture is represented using an <Image/> self-closing tag;

a line is represented using a <Line/> self-closing tag;

an expression is represented using an <Expression/> self-closing tag;

an annotation is represented using an <Annotate/> self-closing tag;

editor information is represented using an <Editor/> self-closing tag.

3. The standard document markup language specification based heterogeneous document processing system according to claim 1, wherein the standard document markup language file has the following feature points in structure:

b. the container of a paragraph is called an editing panel; headers, the body text, footers and cells are all editing panels; when an element of any type appears in an editing panel for the first time, its attributes should include all of its attributes, or all attributes that differ from the defaults, and each subsequent element of the same type only needs to specify the attributes that differ from the previous element of the same type;

f. The element organizes multiple bit parameters using the bit-control attribute or represents a single bit parameter using an independent attribute.

4. The system of claim 1, wherein the content parsing and converting module determines the type of the heterogeneous document when converting the heterogeneous document:

5. The heterogeneous document processing method based on the standard document markup language specification is characterized by comprising the following steps of:

S3, carrying out layout and view display by using standard content objects;

S4, editing the standard content object according to the user operation;

and S53, storing the converted specific format file.

6. The method of processing a heterogeneous document according to claim 5, wherein the step S1 includes the sub-steps of:

S11, loading the heterogeneous document;

7. The method of processing a heterogeneous document according to claim 5, wherein the step S2 includes the sub-steps of:

S21, reading an XML-based standard document markup language file or character string, and calling an XML parser to quickly parse the file or character string into an XML DOM object;