CN112364604A - XML document digitization method and system - Google Patents

Publication number
CN112364604A
Authority
CN
China
Prior art keywords
xml document
tree
xml
nodes
trunk structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011156122.1A
Other languages
Chinese (zh)
Inventor
吴海涛
郭丽红
杨洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Institute of Technology
Original Assignee
Nanjing Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Institute of Technology
Priority to CN202011156122.1A
Publication of CN112364604A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/12: Use of codes for handling textual entities
    • G06F40/151: Transformation
    • G06F40/154: Tree transformation for tree-structured or markup documents, e.g. XSLT, XSL-FO or stylesheets

Abstract

The invention discloses a method for digitizing XML documents, suitable for comparing the similarity between XML documents, comprising the following steps: S1, extracting a trunk structure tree; S2, filling pseudo nodes to unify the tree structures; S3, extracting the full paths and generating tuple strings. Through these three steps (trunk structure tree extraction, tree structure unification, and tuple string conversion), the invention digitizes an XML document while combining its structural and semantic features. The process is efficient and fast, and the digitized result is highly sensitive in similarity detection. The method can digitally represent large numbers of XML documents in a complex network environment, which both simplifies the documents and facilitates subsequent document classification and application processing.

Description

XML document digitization method and system
Technical Field
The invention relates to the technical field of XML document digitization processing, in particular to a method and a system for digitizing an XML document.
Background
With the rapid development of the network, a large amount of semi-structured data stored in XML form has been generated on the Internet, and the data accumulated in different fields has great potential value. XML documents, as a representative form of semi-structured data, are used by more and more enterprises and public institutions because of features such as platform independence, convenient data processing, and flexible Web application. When facing huge volumes of XML data, the digital representation of a document is the basis for data analysis, classification, and other data processing, and the quality of that representation directly affects all subsequent operations. For example, the invention with patent number CN108984713A discloses an XML file processing method and apparatus that splits an XML file according to its structure tree and stores it in a plurality of database tables, which solves the problem that a single oversized table makes queries and other operations slow.
However, the number of XML documents on the Internet grows exponentially year by year, which burdens the classification of semi-structured data. Faced with massive numbers of XML documents, a fast and efficient digitization method is needed to facilitate the subsequent processing of XML information. A digital representation of an XML document can greatly improve classification speed and support further application and processing of the document. Current XML document digitization methods have the following problems: first, the digitized result is simple and coarse and cannot accurately reflect the information in the XML document; second, the representation method is complex and the conversion efficiency is low; third, the structural features of XML are emphasized while the semantic features are ignored.
Disclosure of Invention
To address the defects in the prior art, the invention provides a method and a system for digitizing XML documents. Through three steps (trunk structure tree extraction, tree structure unification, and tuple string conversion), the invention digitizes an XML document while combining its structural and semantic features. The process is efficient and fast, and the digitized result is highly sensitive in similarity detection. The method can digitally represent large numbers of XML documents in a complex network environment, which both simplifies the documents and facilitates subsequent document classification and application processing.
In order to achieve the purpose, the invention adopts the following technical scheme:
A method for digitizing XML documents, suitable for similarity comparison between XML documents, comprises the following steps:
S1, extracting a trunk structure tree:
preprocessing the imported XML document, finding its trunk structure tree, and removing redundant nodes, so that each path appears only once in the trunk structure tree;
S2, filling pseudo nodes and unifying the tree structures:
filling pseudo nodes into the trunk structure tree of each XML document extracted in the preprocessing stage, so that the trunk structure trees of all XML documents to be classified have the same number of layers and the same tree depth, and every node in the same layer of a tree has the same number of children;
S3, extracting the full paths and generating the tuple strings:
for each trunk structure tree filled with pseudo nodes, extracting all distinct full paths contained in the XML document and joining the element names along each path, in order from the root node to the leaf node, into a tuple string, so that each XML document corresponds to one tuple string set, which completes the structural transformation of the XML document.
In order to optimize the technical scheme, the specific measures adopted further comprise:
Further, the digitization method also comprises the following step:
S4, application verification of the digitized result:
the similarity between any two XML documents is compared using the following formula:
Figure BDA0002742828420000021
where P(T1) ∪ P(T2) is the OR (union) operation and represents the total number of distinct tuple strings in the tuple string sets of the two document trees T1 and T2; P(T1) ∩ P(T2) is the AND (intersection) operation and represents the number of tuple strings common to both sets. The smaller the calculated Δ(T1, T2), the more similar the two XML document trees are.
Further, in step S3, the node name of each part in the tuple string is used to reflect the semantics of the corresponding part of the XML document.
Further, in step S2, filling pseudo nodes into the trunk structure tree of the XML document extracted in the preprocessing stage comprises the following step:
finding the maximum number of children among the nodes in each layer of the trunk structure tree, taking that maximum as the number of children of every node in the layer, and filling nodes that have fewer children with pseudo nodes.
Further, in step S3, extracting all distinct full paths contained in each XML document comprises the following step:
traversing all trunk structure paths from the root node to the leaf nodes, from top to bottom and from left to right, to form the full path set.
Further, in step S3, the tuple string set is the set of strings obtained by joining all node names from the root to a leaf node along each full path, with the parts of each tuple string separated by commas; identical tuple strings are kept only once in the set.
Based on the foregoing digitization method, the present invention also provides a digitization system for XML documents, where the digitization system includes:
(1) a module for extracting the trunk structure tree:
preprocessing the imported XML document, finding its trunk structure tree, and removing redundant nodes, so that each path appears only once in the trunk structure tree;
(2) a module for filling pseudo nodes and unifying the tree structures:
filling pseudo nodes into the trunk structure tree of each XML document extracted in the preprocessing stage, so that the trunk structure trees of all XML documents to be classified have the same number of layers and the same tree depth, and every node in the same layer of a tree has the same number of children;
(3) a module for extracting the full paths and generating the tuple strings:
for each trunk structure tree filled with pseudo nodes, extracting all distinct full paths contained in the XML document and joining the element names along each path, in order from the root node to the leaf node, into a tuple string, so that each XML document corresponds to one tuple string set, which completes the structural transformation of the XML document.
The invention has the beneficial effects that:
The invention combines the structural and semantic features of XML documents, with structural features as primary and semantic features as secondary, to digitize XML documents. The process is efficient and fast, and the digitized result is highly sensitive in similarity detection. Large numbers of XML documents in a complex network environment can be represented digitally and accurately, which simplifies the documents and facilitates subsequent document classification and application processing.
Drawings
FIG. 1 is a flowchart of the XML document digitization method of the present invention.
FIG. 2 is a diagram of an example XML document.
FIG. 3 is a schematic diagram of the trunk structure tree corresponding to FIG. 2 and of the same tree with pseudo nodes added.
FIG. 4 is a schematic diagram of the full path extraction for the trunk structure tree of FIG. 3 and the corresponding tuple string set.
FIG. 5 is a flowchart of the similarity comparison of two XML documents.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
It should be noted that terms such as "upper", "lower", "left", "right", "front", and "back" are used in the present invention only for clarity of description and are not intended to limit the scope of the invention; changes or adjustments to the relative relationships they describe, without substantive change to the technical content, remain within the scope of the invention.
With reference to fig. 1, the present invention provides a method for digitizing XML documents, which is suitable for comparing the similarity between the XML documents, and the method includes the following steps:
S1, extracting a trunk structure tree:
preprocessing the imported XML document, finding its trunk structure tree, and removing redundant nodes, so that each path appears only once in the trunk structure tree.
S2, filling pseudo nodes and unifying the tree structures:
filling pseudo nodes into the trunk structure tree of each XML document extracted in the preprocessing stage, so that the trunk structure trees of all XML documents to be classified have the same number of layers and the same tree depth, and every node in the same layer of a tree has the same number of children.
S3, extracting the full paths and generating the tuple strings:
for each trunk structure tree filled with pseudo nodes, extracting all distinct full paths contained in the XML document and joining the element names along each path, in order from the root node to the leaf node, into a tuple string, so that each XML document corresponds to one tuple string set, which completes the structural transformation of the XML document.
Preferably, in step S3, the node name of each part of a tuple string reflects the semantics of the corresponding part of the XML document. This digital representation is particularly suited to similarity detection between XML documents and thus facilitates document classification. In a classification scenario, processing focuses on the structure of an XML document rather than its content. For example, for a document that stores medical data in XML format, the structure shows that the document stores medical data, while the leaf node contents hold the specific information of a patient. When classifying XML documents, what matters is the structure, that is, what kind of information the document stores; the specific content is ignored. Structure is therefore the core of the problem, and a node name only needs to reflect the meaning or type of the node's content. In other scenarios where content matters as much as structure, node names require more elaborate processing, for example further subdividing nodes according to their semantic content.
FIG. 2 is an example of an XML document, in which the circular nodes are the structure nodes of the XML document and reflect the framework of the information it stores, while the rectangular boxes hold the content of the leaf nodes. Different scenarios call for different information from an XML document and therefore a different focus. A user who wants to know the framework of the information stored in the document only needs the circular structure nodes, that is, the positional relationships among the structure nodes and the node names that reflect their semantics. A user who wants to know the specific content stored in the XML document must also consider the leaf node contents, which are where the information is actually stored, together with the document structure. For convenience of explanation, the present invention describes the digitization of XML documents only for classification applications; on this premise, the focus is on the structure of the document, that is, on all of the circular structure nodes.
Because semi-structured XML documents contain a large amount of redundant structural information, with many parts that are similar in both semantics and structure, the redundant information must first be removed and the trunk structure tree extracted, which simplifies the structure of the XML document. FIG. 3(a) is the trunk structure tree corresponding to the XML document in FIG. 2. Extracting the trunk structure tree removes the many redundant structure nodes and all leaf node contents in the XML document, and ensures that each path appears only once in the trunk structure tree.
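The trunk extraction step can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes that "redundant nodes" means repeated sibling elements with the same tag, and it represents the trunk as a nested dictionary keyed by element name. The function names `merge_into_trunk` and `trunk_from_xml` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def merge_into_trunk(element, trunk):
    """Merge the children of `element` into the nested dict `trunk`.

    Children with the same tag share one dict entry, so every
    root-to-node path is stored only once. Text content is ignored.
    """
    for child in element:
        # setdefault returns the existing subtree for a repeated tag,
        # or creates an empty one the first time the tag is seen.
        merge_into_trunk(child, trunk.setdefault(child.tag, {}))
    return trunk


def trunk_from_xml(xml_text):
    """Parse an XML string and return its trunk as {root_tag: subtree}."""
    root = ET.fromstring(xml_text)
    return {root.tag: merge_into_trunk(root, {})}


# Hypothetical sample: two <book> elements with overlapping structure.
sample = (
    "<lib>"
    "<book><title>A</title><author>X</author></book>"
    "<book><title>B</title><year>2020</year></book>"
    "</lib>"
)
trunk = trunk_from_xml(sample)
```

In this sketch the two `book` elements collapse into a single `book` node whose children are the union `title`, `author`, `year`, and the leaf text is dropped. Python dictionaries keep insertion order, so the left-to-right order of first appearance is preserved.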
To classify XML documents, pseudo nodes must be added to the extracted trunk structure trees. Pseudo nodes solve the problem that the trunk structure trees of the XML documents being classified contain different numbers of nodes: they standardize the structure so that all of the documents share a unified structural framework. The unified, standardized frameworks can then be compared for similarity, and each resulting tuple string set corresponds one-to-one to a document, which realizes the digital conversion of the XML documents. Pseudo nodes are added as follows: find the maximum number of children among the nodes in each layer of the trunk structure tree, take that maximum as the number of children of every node in the layer, and fill any node that has fewer children with pseudo nodes. FIG. 3(b) is the extended tree obtained by adding pseudo nodes to the trunk structure tree of FIG. 3(a).
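The per-layer filling rule can be sketched as below for a single tree. The `Node` class, the `fill_pseudo_nodes` function, and the `"*"` label for pseudo nodes are illustrative assumptions; the text does not specify how pseudo nodes are named or how the maximum is shared across several documents.

```python
from dataclasses import dataclass, field

PSEUDO = "*"  # assumed label for a pseudo node; not specified in the text


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def fill_pseudo_nodes(root):
    """Pad every layer so that all nodes in it have the same child count."""
    layer = [root]
    while layer:
        # Maximum number of children among the nodes of this layer.
        width = max(len(node.children) for node in layer)
        if width == 0:
            break  # every node in this layer is a leaf; the tree is uniform
        for node in layer:
            missing = width - len(node.children)
            node.children += [Node(PSEUDO) for _ in range(missing)]
        # Move down one layer, left to right, including the new pseudo nodes.
        layer = [child for node in layer for child in node.children]
    return root


# Hypothetical trunk: a -> b(d, e), c ; node c has no children.
b = Node("b", [Node("d"), Node("e")])
c = Node("c")
tree = fill_pseudo_nodes(Node("a", [b, c]))
```

Here `c` receives two pseudo children so that both nodes in its layer have two children, and all leaves end up at the same depth.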
FIG. 4 shows the full path extraction performed on FIG. 3(b) and the corresponding tuple string set. In the present invention, the full path and the tuple string set are defined as follows:
Full path: the full path of an XML tree is defined as the set of all paths from the root to the leaf nodes. The full path of a node v_i is defined as the ordered set of labels from the root to v_i, denoted:
Figure BDA0002742828420000041
where v_i ∈ V_DT, i ∈ [1, …, m]; v_0 is the root node, v_i is a leaf node, and v_1, v_2, … are the intermediate nodes from the root to v_i.
Figure BDA0002742828420000042
represents the structure and hierarchy information of an XML document.
Tuple string set: all node names from the root to a leaf node are joined along each full path, with the parts of each tuple string separated by commas. Identical tuple strings are kept only once in the set.
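The two definitions above can be illustrated with a short sketch that walks a tree from top to bottom and left to right, joins the node names of each root-to-leaf path with commas, and keeps each resulting string once. The `Node` class, the function name `tuple_string_set`, and the `"*"` pseudo node label are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)


def tuple_string_set(root):
    """Return the distinct comma-joined root-to-leaf paths, in visit order."""
    result = []
    seen = set()

    def walk(node, prefix):
        path = prefix + [node.name]
        if not node.children:
            # Leaf reached: the path is a full path; emit its tuple string once.
            text = ",".join(path)
            if text not in seen:
                seen.add(text)
                result.append(text)
            return
        for child in node.children:  # left-to-right traversal
            walk(child, path)

    walk(root, [])
    return result


# Hypothetical filled tree: a -> b(d, e), c(*, *)
filled = Node("a", [
    Node("b", [Node("d"), Node("e")]),
    Node("c", [Node("*"), Node("*")]),
])
strings = tuple_string_set(filled)
```

The two identical paths through the pseudo children of `c` collapse into a single tuple string, as required by the definition of the set.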
FIG. 4(a) shows the full paths corresponding to FIG. 3, and FIG. 4(b) shows the corresponding tuple strings. This completes the digital representation of the XML document: a large, complex XML document can now be replaced by a tuple string set. Subsequent study of the document's structure can start from the tuple string set, which can stand in for the XML document in various classification processes. The set represents the full path information, that is, the structural information, while the node names in each tuple string reflect the semantics of the XML document. For example, the following formula can be used directly to compare the similarity between any two XML documents:
Figure BDA0002742828420000051
where P(T1) ∪ P(T2) is the OR (union) operation and represents the total number of distinct tuple strings in the tuple string sets of the two document trees T1 and T2, and P(T1) ∩ P(T2) is the AND (intersection) operation and represents the number of tuple strings common to both sets. After two XML documents have each been converted into a tuple string set through trunk structure tree extraction and pseudo node filling, the more tuple strings the two documents share, the larger the intersection and the more similar the documents; that is, the smaller the calculated Δ(T1, T2), the more similar the two XML document trees are. FIG. 5 is a flowchart of the similarity comparison of two XML documents.
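The formula itself is an image that is not reproduced in this text, so its exact form cannot be confirmed here. The sketch below assumes a Jaccard-type distance, Δ = 1 − |P(T1) ∩ P(T2)| / |P(T1) ∪ P(T2)|, which uses only the two quantities described above and decreases as the documents share more tuple strings. The function name `path_set_distance` and the sample sets are hypothetical.

```python
def path_set_distance(strings_1, strings_2):
    """Assumed Jaccard-type distance between two tuple string sets.

    Returns 0.0 for identical sets and 1.0 for sets with nothing in common.
    """
    set_1, set_2 = set(strings_1), set(strings_2)
    union = set_1 | set_2          # all distinct tuple strings (OR)
    common = set_1 & set_2         # tuple strings present in both (AND)
    if not union:
        return 0.0                 # two empty sets are treated as identical
    return 1.0 - len(common) / len(union)


# Hypothetical tuple string sets of two documents.
doc_1 = ["a,b,d", "a,b,e", "a,c,*"]
doc_2 = ["a,b,d", "a,c,*", "a,c,f"]
delta = path_set_distance(doc_1, doc_2)
```

With these sample sets, two tuple strings are shared out of four distinct ones, so the assumed distance is 0.5.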
Based on the foregoing digitization method, the present invention also provides a digitization system for XML documents, where the digitization system includes:
(1) a module for extracting the trunk structure tree:
preprocessing the imported XML document, finding its trunk structure tree, and removing redundant nodes, so that each path appears only once in the trunk structure tree;
(2) a module for filling pseudo nodes and unifying the tree structures:
filling pseudo nodes into the trunk structure tree of each XML document extracted in the preprocessing stage, so that the trunk structure trees of all XML documents to be classified have the same number of layers and the same tree depth, and every node in the same layer of a tree has the same number of children;
(3) a module for extracting the full paths and generating the tuple strings:
for each trunk structure tree filled with pseudo nodes, extracting all distinct full paths contained in the XML document and joining the element names along each path, in order from the root node to the leaf node, into a tuple string, so that each XML document corresponds to one tuple string set, which completes the structural transformation of the XML document.
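One way the three modules might be chained is sketched below. The class `XmlDigitizer`, its method names, the `[name, children]` list representation, and the `"*"` pseudo node label are all hypothetical choices for illustration; the text does not prescribe a software design.

```python
import xml.etree.ElementTree as ET

PSEUDO = "*"  # assumed label for pseudo nodes


class XmlDigitizer:
    """Hypothetical wrapper with one method per module."""

    def extract_trunk(self, xml_text):
        # Module (1): merge same-tag siblings so each path exists once.
        def merge(element, trunk):
            for child in element:
                merge(child, trunk.setdefault(child.tag, {}))
            return trunk

        def to_node(name, subtree):
            return [name, [to_node(k, v) for k, v in subtree.items()]]

        root = ET.fromstring(xml_text)
        return to_node(root.tag, merge(root, {}))

    def fill_pseudo(self, tree):
        # Module (2): pad each layer to its maximum child count.
        layer = [tree]
        while layer:
            width = max(len(children) for _, children in layer)
            if width == 0:
                break
            for node in layer:
                node[1] += [[PSEUDO, []] for _ in range(width - len(node[1]))]
            layer = [child for node in layer for child in node[1]]
        return tree

    def tuple_strings(self, tree):
        # Module (3): distinct comma-joined root-to-leaf paths.
        found = []

        def walk(node, prefix):
            path = prefix + [node[0]]
            if not node[1]:
                text = ",".join(path)
                if text not in found:
                    found.append(text)
            for child in node[1]:
                walk(child, path)

        walk(tree, [])
        return found

    def digitize(self, xml_text):
        return self.tuple_strings(self.fill_pseudo(self.extract_trunk(xml_text)))


# Hypothetical input document.
result = XmlDigitizer().digitize(
    "<r><a><x>1</x></a><a><x>2</x><y>3</y></a><b>t</b></r>"
)
```

Keeping the modules as separate methods mirrors the module list above and lets each stage be inspected on its own, for example to examine the filled tree before the tuple strings are generated.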
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims (7)

1. A method for digitizing XML documents, suitable for similarity comparison between XML documents, characterized by comprising the following steps:
S1, extracting a trunk structure tree:
preprocessing the imported XML document, finding its trunk structure tree, and removing redundant nodes, so that each path appears only once in the trunk structure tree;
S2, filling pseudo nodes and unifying the tree structures:
filling pseudo nodes into the trunk structure tree of each XML document extracted in the preprocessing stage, so that the trunk structure trees of all XML documents to be classified have the same number of layers and the same tree depth, and the number of children of each node in the tree is the same;
S3, extracting the full paths and generating the tuple strings:
for each trunk structure tree filled with pseudo nodes, extracting all distinct full paths contained in the XML document and joining the element names along each path, in order from the root node to the leaf node, into a tuple string, so that each XML document corresponds to one tuple string set, which completes the structural transformation of the XML document.
2. The method for digitizing an XML document according to claim 1, characterized in that it further comprises the following step:
S4, application verification of the digitized result:
the similarity between any two XML documents is compared using the following formula:
Figure FDA0002742828410000011
where P(T1) ∪ P(T2) is the OR (union) operation and represents the total number of distinct tuple strings in the tuple string sets of the two document trees T1 and T2; P(T1) ∩ P(T2) is the AND (intersection) operation and represents the number of tuple strings common to both sets; the smaller the calculated Δ(T1, T2), the more similar the two XML document trees are.
3. The method for digitizing an XML document according to claim 1, wherein in step S3, the node name of each part of the tuple string reflects the semantics of the corresponding part of the XML document.
4. The method for digitizing an XML document according to claim 1, wherein in step S2, filling pseudo nodes into the trunk structure tree of the XML document extracted in the preprocessing stage comprises the following step:
finding the maximum number of children among the nodes in each layer of the trunk structure tree, taking that maximum as the number of children of every node in the layer, and filling nodes that have fewer children with pseudo nodes.
5. The method for digitizing an XML document according to claim 1, wherein in step S3, extracting all distinct full paths contained in each XML document comprises the following step:
traversing all trunk structure paths from the root node to the leaf nodes, from top to bottom and from left to right, to form the full path set.
6. The method for digitizing an XML document according to claim 1, wherein in step S3, the tuple string set is the set of strings obtained by joining all node names from the root to a leaf node along each full path, with the parts of each tuple string separated by commas; identical tuple strings are kept only once in the set.
7. A system for digitizing an XML document based on the method of digitizing as claimed in any of claims 1 to 6, characterized in that the system comprises:
a module for extracting the trunk structure tree:
preprocessing the imported XML document, finding its trunk structure tree, and removing redundant nodes, so that each path appears only once in the trunk structure tree;
a module for filling pseudo nodes and unifying the tree structures:
filling pseudo nodes into the trunk structure tree of each XML document extracted in the preprocessing stage, so that the trunk structure trees of all XML documents to be classified have the same number of layers and the same tree depth, and the number of children of each node in the tree is the same;
a module for extracting the full paths and generating the tuple strings:
for each trunk structure tree filled with pseudo nodes, extracting all distinct full paths contained in the XML document and joining the element names along each path, in order from the root node to the leaf node, into a tuple string, so that each XML document corresponds to one tuple string set, which completes the structural transformation of the XML document.
CN202011156122.1A 2020-10-26 2020-10-26 XML document digitization method and system Pending CN112364604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011156122.1A CN112364604A (en) 2020-10-26 2020-10-26 XML document digitization method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011156122.1A CN112364604A (en) 2020-10-26 2020-10-26 XML document digitization method and system

Publications (1)

Publication Number Publication Date
CN112364604A true CN112364604A (en) 2021-02-12

Family

ID=74512183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011156122.1A Pending CN112364604A (en) 2020-10-26 2020-10-26 XML document digitization method and system

Country Status (1)

Country Link
CN (1) CN112364604A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101876995A (en) * 2009-12-18 2010-11-03 南开大学 Method for calculating similarity of XML documents
CN103123646A (en) * 2012-12-11 2013-05-29 北京航空航天大学 Conversion method for automatically converting XML document into OML document and device
CN103377175A (en) * 2012-04-26 2013-10-30 Sap股份公司 Structured document converting based on partition
CN111512315A (en) * 2017-12-01 2020-08-07 国际商业机器公司 Block-wise extraction of document metadata

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
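吴海涛 (Wu Haitao) et al.: "XML similarity detection algorithm based on matrix storage", Application Research of Computers (计算机应用研究), vol. 35, no. 7, pages 2026 - 2028 *
吴海涛 (Wu Haitao): "An improved XML keyword query algorithm", Journal of Nanjing Institute of Technology (Natural Science Edition) (南京工程学院学报(自然科学版)), vol. 9, no. 2, pages 34 - 37 *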

Similar Documents

Publication Publication Date Title
CN107766371B (en) Text information classification method and device
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN107391677B (en) Method and device for generating Chinese general knowledge graph with entity relation attributes
CN106156365A (en) A kind of generation method and device of knowledge mapping
CN111753099A (en) Method and system for enhancing file entity association degree based on knowledge graph
CN110162591B (en) Entity alignment method and system for digital education resources
CN108897778B (en) Image annotation method based on multi-source big data analysis
CN103870506B (en) Webpage information extraction method and system
CN111967761B (en) Knowledge graph-based monitoring and early warning method and device and electronic equipment
CN108922633A (en) A kind of disease name standard convention method and canonical system
CN112650848A (en) Urban railway public opinion information analysis method based on text semantic related passenger evaluation
CN104899340B (en) A kind of IETM technical information fragment retrieval device and its search method based on fragment of most compacting
CN106502991B (en) Publication treating method and apparatus
CN103871402B (en) Language model training system, speech recognition system and correlation method
CN107967290A (en) A kind of knowledge mapping network establishing method and system, medium based on magnanimity scientific research data
CN104615734B (en) A kind of community management service big data processing system and its processing method
CN109033132A (en) The method and device of text and the main body degree of correlation are calculated using knowledge mapping
CN111061828B (en) Digital library knowledge retrieval method and device
CN111966940B (en) Target data positioning method and device based on user request sequence
CN117095419A (en) PDF document data processing and information extracting device and method
CN106933844B (en) Construction method of reachability query index facing large-scale RDF data
CN112148938A (en) Cross-domain heterogeneous data retrieval system and retrieval method
CN112364604A (en) XML document digitization method and system
CN114238735B (en) Intelligent internet data acquisition method
CN112395292B (en) Data feature extraction and matching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination