CN112364604A - XML document digitization method and system

Info

Publication number: CN112364604A
Application number: CN202011156122.1A
Authority: CN
Inventors: 吴海涛; 郭丽红; 杨洁
Original assignee: Nanjing Institute of Technology
Current assignee: Nanjing Institute of Technology
Priority date: 2020-10-26
Filing date: 2020-10-26
Publication date: 2021-02-12

Abstract

The invention discloses a method for digitizing XML documents, suitable for comparing the similarity between XML documents, comprising the following steps: S1, extracting a trunk structure tree; S2, filling pseudo nodes and unifying the tree structures; S3, extracting full paths and generating tuple strings. By extracting the trunk structure tree, unifying the structure tree form, and converting to tuple strings, and by combining the structural and semantic features of the XML document, the invention digitizes XML documents. The processing is efficient and fast, and the digitized result is highly sensitive for similarity detection. The method can digitally represent a large number of XML documents in a complex network environment, which both simplifies the XML documents and facilitates subsequent document classification and application processing.

Description

XML document digitization method and system

Technical Field

The invention relates to the technical field of XML document digitization processing, in particular to a method and a system for digitizing an XML document.

Background

With the rapid development of networks, a large amount of semi-structured data stored in XML form has been generated on the Internet, and the data accumulated in different fields holds great potential value. XML documents, as a representative form of semi-structured data, are used by more and more enterprises and public institutions because of features such as platform independence, convenient data processing, and flexible Web application. Faced with this huge volume of XML data, the digital representation of a document is the basis for data analysis, classification, and other data processing, and the quality of that representation directly affects all subsequent operations. For example, patent CN108984713A discloses an XML file processing method and apparatus that splits an XML file according to its structure tree and stores it in a plurality of database tables, which solves the problem that a single oversized table makes queries and other operations slow.

However, the number of XML documents on the Internet has been growing exponentially year by year, which places a burden on the classification of semi-structured data. Faced with massive numbers of XML documents, a fast and efficient digitization method is therefore needed to facilitate the subsequent processing of XML information. A digital representation of XML documents can greatly improve classification speed and provides a foundation for their further application and processing. Current XML document digitization methods, however, have the following problems: first, the digitized result is crude and cannot accurately reflect the information in the XML document; second, the representation method is complex and the conversion efficiency is low; third, they emphasize the structural features of XML and ignore the semantic features.

Disclosure of Invention

To address the defects of the prior art, the invention provides a method and a system for digitizing XML documents. The digitization is achieved in three steps, namely extracting a trunk structure tree, unifying the structure tree form, and converting to tuple strings, and it combines the structural and semantic features of the XML document. The processing is efficient and fast, and the digitized result is highly sensitive for similarity detection. The invention can digitally represent a large number of XML documents in a complex network environment, which both simplifies the XML documents and facilitates subsequent document classification and application processing.

To achieve this purpose, the invention adopts the following technical solution:

A method for digitizing XML documents, suitable for similarity comparison between XML documents, comprising the following steps:

S1, extracting a trunk structure tree:

preprocessing the imported XML document, finding its trunk structure tree, and removing redundant nodes, so that each path appears only once in the trunk structure tree;

S2, filling pseudo nodes and unifying tree structures:

filling pseudo nodes into the trunk structure trees of the XML documents extracted in the preprocessing stage, so that the trunk structure trees corresponding to the XML documents to be classified have the same number of layers and the same tree depth, and every node in the same layer of a tree has the same number of children;

S3, extracting full paths and generating tuple strings:

for each trunk structure tree filled with pseudo nodes, extracting all distinct full paths contained in the XML document, and joining the element names along each full path, in order from the root node to the leaf node, into a tuple string, so that each XML document corresponds to one tuple string set, which completes the structural conversion of the XML document.

To further optimize the technical solution, the following specific measures are also adopted:

Further, the digitization method also comprises the following step:

S4, application verification of the digitized result:

the similarity between any two XML documents is compared using the following formula:

In the formula, P(T₁)∪P(T₂) is the OR (union) operation and represents the total number of distinct tuple strings in the tuple string sets corresponding to the two document trees T₁ and T₂; P(T₁)∩P(T₂) is the AND (intersection) operation and represents the number of tuple strings that the two tuple string sets have in common. The smaller the calculated Δ(T₁,T₂), the more similar the two XML document trees are.
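For orientation only, one expression consistent with the description above is a Jaccard-style distance over the two tuple string sets. This particular form is an assumption, not a quotation of the invention's formula; it is chosen because it uses exactly the union and intersection counts described and decreases as the two sets share more tuple strings.

```latex
\Delta(T_1, T_2) \;=\; 1 - \frac{\lvert P(T_1) \cap P(T_2) \rvert}{\lvert P(T_1) \cup P(T_2) \rvert}
```

Under this assumed form, two documents with identical tuple string sets give Δ = 0, and two documents with no tuple string in common give Δ = 1.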

Further, in step S3, the node name of each part in the tuple string is used to reflect the semantics of the corresponding part of the XML document.

Further, in step S2, filling pseudo nodes into the trunk structure tree of the XML document extracted in the preprocessing stage comprises the following step:

finding the maximum number of children among the nodes in each layer of the trunk structure tree, taking that maximum as the number of children of every node in the layer, and padding nodes that have fewer children with pseudo nodes.

Further, in step S3, extracting all the distinct full paths contained in each XML document comprises the following step:

traversing all trunk structure paths from the root node to the leaf nodes, from top to bottom and from left to right, to form the full path set.
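The traversal order described here can be sketched as a depth-first walk that visits children left to right and emits one path per leaf. The tree encoding used below, a `(name, children)` tuple, and the function name `full_paths` are assumptions made for illustration; the source does not prescribe a data structure.

```python
def full_paths(node, prefix=()):
    """Return all root-to-leaf paths of `node`, a (name, children) tuple.

    Paths are produced top to bottom and, among siblings, left to right.
    """
    name, children = node
    path = prefix + (name,)
    if not children:
        return [path]  # reached a leaf: the path is complete
    paths = []
    for child in children:  # children are visited in document order
        paths.extend(full_paths(child, path))
    return paths


# Hypothetical trunk structure tree used only for illustration.
tree = ("a", [("b", [("d", []), ("e", [])]), ("c", [])])
paths = full_paths(tree)
```

For the illustrative tree, `paths` holds three tuples, one for each leaf, starting with the leftmost branch.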

Further, in step S3, the tuple string set is the set of strings formed by joining all node names from the root to a leaf node along each full path, with the parts of each tuple string separated by commas; within a tuple string set, identical tuple strings are kept only once.

Based on the foregoing digitization method, the invention also provides a system for digitizing XML documents, which comprises:

(1) a module for extracting a trunk structure tree:

(2) a module for filling pseudo nodes and unifying tree structures:

filling pseudo nodes into the trunk structure trees of the XML documents extracted in the preprocessing stage, so that the trunk structure trees corresponding to the XML documents to be classified have the same number of layers and the same tree depth, and every node in the same layer of a tree has the same number of children;

(3) a module for extracting full paths and generating tuple strings:

The beneficial effects of the invention are as follows:

The invention combines the structural and semantic features of the XML document, with structural features as the primary features and semantic features as auxiliary, to digitize the XML document. The processing is efficient and fast, and the digitized result is highly sensitive for similarity detection. A large number of XML documents in a complex network environment can be represented digitally and accurately, which both simplifies the XML documents and facilitates subsequent document classification and application processing.

Drawings

FIG. 1 is a flowchart of the XML document digitization method of the present invention.

FIG. 2 is a diagram of an example XML document.

FIG. 3 is a schematic diagram of the trunk structure tree corresponding to FIG. 2 and of the same tree with pseudo nodes added.

FIG. 4 is a schematic diagram of the full paths extracted from the trunk structure tree of FIG. 3 and the corresponding tuple string set.

FIG. 5 is a flowchart of the similarity comparison between two XML documents.

Detailed Description

The present invention will now be described in further detail with reference to the accompanying drawings.

It should be noted that terms such as "upper", "lower", "left", "right", "front", and "back" are used in the present invention only for clarity of description and are not intended to limit its implementable scope; changes or adjustments to the relative relationships they describe, without substantive change to the technical content, are also to be regarded as within the implementable scope of the invention.

With reference to FIG. 1, the present invention provides a method for digitizing XML documents, suitable for similarity comparison between XML documents, comprising the following steps:

S1, extracting a trunk structure tree:

preprocessing the imported XML document, finding its trunk structure tree, and removing redundant nodes, so that each path appears only once in the trunk structure tree.

S2, filling pseudo nodes and unifying tree structures:

S3, extracting full paths and generating tuple strings:

Preferably, in step S3, the node name of each part of a tuple string reflects the semantics of the corresponding part of the XML document. This digital representation is particularly suitable for similarity detection between XML documents and facilitates document classification. In a classification scenario, the processing pays more attention to the structure of the XML document and less to its content. For example, for a document that stores medical data in XML format, the structure shows that the document stores medical data, while the content of the leaf nodes holds the specific information of a patient. When classifying such documents, what matters is the structure, that is, what kind of information the document stores, and the specific content is ignored. Structure is therefore the core of this kind of problem, and the node name only needs to reflect the meaning or type of the content of that node. In other scenarios, where content matters as well as structure, the node names need further, more elaborate processing, for example subdividing the nodes according to their semantic content.

FIG. 2 is an example of an XML document, in which the circular nodes are structure nodes that reflect the framework of the information stored in the document, and the rectangular boxes hold the content of the leaf nodes. Different scenarios call for different information from an XML document and therefore a different focus. If a user wants to know the framework of the information stored in the document, only the circular structure nodes need to be understood, namely the positional relationships among the structure nodes and the node names that reflect their semantics. If the user wants to know the specific content stored in the XML document, the leaf node content, which is where the information is actually stored, must be considered together with the document structure. For convenience of explanation, the present invention describes the digitization process only for the classification application; on this premise, the focus is on the structure of the document, that is, on the circular structure nodes.

Because semi-structured XML documents contain a large amount of redundant structural information, in which repeated parts share similar semantics and similar structure, the redundant information must first be removed and a trunk structure tree extracted, which simplifies the structure of the XML document. FIG. 3(a) is the trunk structure tree corresponding to the XML document in FIG. 2. Extracting the trunk structure tree removes the many redundant structure nodes and all leaf node content in the XML document, and ensures that each path appears only once in the trunk structure tree.
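A minimal sketch of trunk structure tree extraction follows, using Python's standard `xml.etree.ElementTree` parser. It assumes that "redundant" nodes are sibling elements sharing the same tag, which are merged so that each root-to-node path occurs once, and that attributes are ignored along with text content. The function names, the `(name, children)` encoding, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def merge_same_tag(elements):
    """Merge same-tag sibling elements into one trunk node (name, children)."""
    groups = {}  # child tag -> child elements, in first-seen order
    for element in elements:
        for child in element:
            groups.setdefault(child.tag, []).append(child)
    # Text and attributes are never read, so leaf content is dropped.
    return (elements[0].tag, [merge_same_tag(g) for g in groups.values()])


def extract_trunk(xml_text):
    """Parse XML text and return its trunk structure tree."""
    return merge_same_tag([ET.fromstring(xml_text)])


# Hypothetical sample document used only for illustration.
sample = """
<library>
  <book><title>A</title><author>X</author></book>
  <book><title>B</title><year>2020</year></book>
  <journal><title>C</title></journal>
</library>
"""
trunk = extract_trunk(sample)
```

In the sample, the two `book` elements collapse into a single `book` node whose children are the union of the children seen under either one, so the path `library/book/title` appears once even though it occurs twice in the document.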

To classify XML documents, pseudo nodes are added to the extracted trunk structure tree. Their purpose is to resolve the differing numbers of nodes among the trunk structure trees of the XML documents being classified, standardizing the structure so that all the documents share a unified structural framework. The unified, standardized frameworks can then be compared for similarity, each tuple string set corresponds one-to-one with a document, and the digital conversion of the XML documents is thereby achieved. Pseudo nodes are added as follows: find the maximum number of children among the nodes in each layer of the trunk structure tree, take that maximum as the number of children of every node in the layer, and pad any node that has fewer children. FIG. 3(b) is the expanded tree obtained by adding pseudo nodes to the trunk structure tree of FIG. 3(a).
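The padding rule can be sketched layer by layer for a single tree. The pseudo node label `"#"`, the `[name, children]` list encoding, and the function name are assumptions for illustration. The sketch also assumes that a leaf lying above the deepest layer receives pseudo children, so that every branch reaches the same depth.

```python
PSEUDO = "#"  # hypothetical pseudo-node label; the source does not specify one


def pad_tree(tree):
    """Return a padded copy of `tree`, where a node is [name, [children]]."""
    def copy(node):
        return [node[0], [copy(child) for child in node[1]]]

    root = copy(tree)  # leave the caller's tree untouched
    layer = [root]
    while True:
        # The widest node of the current layer sets the child count for it.
        width = max(len(node[1]) for node in layer)
        if width == 0:
            break  # no node in this layer has children: deepest layer reached
        for node in layer:
            missing = width - len(node[1])
            node[1].extend([PSEUDO, []] for _ in range(missing))
        layer = [child for node in layer for child in node[1]]
    return root


# Hypothetical trunk structure tree: "b" has two children, "c" has none.
tree = ["a", [["b", [["d", []], ["e", []]]], ["c", []]]]
padded = pad_tree(tree)
```

In the example, node `c` gains two pseudo children so that it matches its sibling `b`. Unifying several documents to one common shape would additionally need the per-layer maxima taken across all trees, which this sketch does not do.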

FIG. 4 shows the full paths extracted from FIG. 3(b) and the corresponding tuple string set. In the present invention, the full path and the tuple string set are defined as follows:

Full path: in an XML tree, the full path of the tree is defined as the set of all paths from the root to the leaf nodes. The full path of a node v_i is defined as the ordered set of labels from the root to v_i, denoted:

where v_i ∈ V_DT, i ∈ [1, …, m]; v₀ is the root node, v_i is a leaf node, and v₁, v₂, … are the intermediate nodes from the root to v_i. The full path represents the structure and hierarchy information of the XML document.

Tuple string set: the set of strings obtained by joining all node names from the root to a leaf node along each full path, with the parts of each tuple string separated by commas. Within a tuple string set, duplicate tuple strings are kept only once.
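The definition above maps directly onto a few lines of code: join the names of each path with commas and drop repeats. The function name and the sample paths are hypothetical, as is the `"#"` label standing in for a pseudo node; keeping first-seen order is a convenience of this sketch, not a requirement stated in the source.

```python
def tuple_string_set(paths):
    """Join each full path into a comma-separated tuple string, without repeats.

    `paths` is an iterable of name sequences, ordered from root to leaf.
    """
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return list(dict.fromkeys(",".join(path) for path in paths))


# Hypothetical full paths; the last two are identical after padding.
paths = [("a", "b", "d"), ("a", "b", "e"), ("a", "c", "#"), ("a", "c", "#")]
strings = tuple_string_set(paths)
```

The two identical padded paths under `c` collapse into a single tuple string, so `strings` has three entries.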

FIG. 4(a) shows the full paths corresponding to FIG. 3, and FIG. 4(b) shows the corresponding tuple strings. This completes the digital representation of the XML document: a large, complex XML document can now be replaced by a tuple string set. Subsequent study of the document's structure can start from the tuple string set, which can stand in for the XML document in various classification processes. The set represents the full path information, that is, the structure information, while the node names in each tuple string reflect the semantics of the XML document. For example, the following formula can be used directly to compare the similarity between any two XML documents:

In the formula, P(T₁)∪P(T₂) is the OR (union) operation and represents the total number of distinct tuple strings in the tuple string sets corresponding to the two document trees T₁ and T₂; P(T₁)∩P(T₂) is the AND (intersection) operation and represents the number of tuple strings that the two tuple string sets have in common. After two XML documents have each been converted into a tuple string set through trunk structure tree extraction, pseudo node filling, and conversion, the more tuple strings the two documents share, the larger the intersection and the more similar the two documents are; that is, the smaller the calculated Δ(T₁,T₂), the more similar the two XML document trees are. FIG. 5 is a flowchart of the similarity comparison between two XML documents.
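A hedged sketch of the comparison step follows. It assumes a Jaccard-style distance, one minus the intersection count divided by the union count, which matches the quantities and the "smaller means more similar" behavior described above but is not confirmed by the text as the exact formula. The function name `delta`, the sample sets, and the convention that two empty sets have distance 0 are likewise assumptions.

```python
def delta(strings_1, strings_2):
    """Assumed Jaccard-style distance between two tuple string sets."""
    set_1, set_2 = set(strings_1), set(strings_2)
    union = set_1 | set_2         # all distinct tuple strings of both documents
    if not union:
        return 0.0                # assumption: two empty sets count as identical
    common = set_1 & set_2        # tuple strings shared by both documents
    return 1 - len(common) / len(union)


# Hypothetical tuple string sets for two documents.
doc_1 = ["a,b,d", "a,b,e", "a,c,#"]
doc_2 = ["a,b,d", "a,c,#", "a,f,g"]
distance = delta(doc_1, doc_2)
```

Here the two documents share two of four distinct tuple strings, so the assumed distance is 0.5; identical sets give 0 and disjoint sets give 1.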

(1) a module for extracting a trunk structure tree:

(2) a module for filling pseudo nodes and unifying tree structures:

The above is only a preferred embodiment of the present invention, and the protection scope of the invention is not limited to this embodiment; all technical solutions that fall under the idea of the invention belong to its protection scope. It should be noted that improvements and refinements made by those skilled in the art without departing from the principle of the invention are also to be regarded as within its protection scope.

Claims

1. A method for digitizing XML documents, suitable for similarity comparison between XML documents, characterized by comprising the following steps:

S1, extracting a trunk structure tree:

S2, filling pseudo nodes and unifying tree structures:

filling pseudo nodes into the trunk structure trees of the XML documents extracted in the preprocessing stage, so that the trunk structure trees corresponding to the XML documents to be classified have the same number of layers and the same tree depth, and every node in a tree has the same number of children;

S3, extracting full paths and generating tuple strings:

2. The method for digitizing an XML document according to claim 1, characterized by further comprising the following step:

S4, application verification of the digitized result:

In the formula, P(T₁)∪P(T₂) is the OR (union) operation and represents the total number of distinct tuple strings in the tuple string sets corresponding to the two document trees T₁ and T₂; P(T₁)∩P(T₂) is the AND (intersection) operation and represents the number of tuple strings that the two tuple string sets have in common. The smaller the calculated Δ(T₁,T₂), the more similar the two XML document trees are.

3. The method for digitizing an XML document according to claim 1, wherein in step S3, the node name of each part of the tuple string reflects the semantics of the corresponding part of the XML document.

4. The method for digitizing an XML document according to claim 1, wherein in step S2, filling pseudo nodes into the trunk structure tree of the XML document extracted in the preprocessing stage comprises the following step:

5. The method for digitizing an XML document according to claim 1, wherein in step S3, extracting all the distinct full paths contained in each XML document comprises the following step:

6. The method for digitizing an XML document according to claim 1, wherein in step S3, the tuple string set is the set of strings formed by joining all node names from the root to a leaf node along each full path, with the parts of each tuple string separated by commas; within a tuple string set, identical tuple strings are kept only once.

7. A system for digitizing an XML document based on the digitization method according to any one of claims 1 to 6, characterized in that the system comprises:

a module for extracting a trunk structure tree:

a module for filling pseudo nodes and unifying tree structures:

a module for extracting full paths and generating tuple strings: