CN112559767A

CN112559767A - Method for automatically constructing RDF data based on XML data

Info

Publication number: CN112559767A
Application number: CN202011445817.1A
Authority: CN
Inventors: 刘玉春; 马宗民
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2020-12-09
Filing date: 2020-12-09
Publication date: 2021-03-26

Abstract

The invention discloses a method for automatically constructing RDF data from XML data. First, semantics are extracted from the different types of XML data. For XML data with no format restriction, elements with the same tag name are aggregated by traversal, the aggregation classes are organized to obtain the abstract model corresponding to each class, and an RDF Schema is constructed according to the corresponding mapping rules. For XML data constrained by an XML Schema, the relevant classes and attributes are obtained by parsing the XML Schema, and an RDF Schema ontology is constructed from them. Next, repeated data entities among the elements of the XML are screened: the elements are traversed, distinct elements are given unique codes according to the equivalent-element conditions, and repeated elements are given the same code. Finally, mapping rules are constructed for the different aggregation classes and the RDF triples corresponding to the elements are built. The method thereby converts XML data into RDF data and has high generality.

Description

Method for automatically constructing RDF data based on XML data

Technical Field

The invention relates to the technical field of knowledge graphs, in particular to a method for automatically constructing RDF data based on XML data.

Background

The development of World Wide Web technology has changed the course of human society. The Web now reaches almost every aspect of human life, and each advance in Web technology drives social progress. Semantic Web technology is one direction of this change and has advanced greatly since its inception. It describes data on the Web using a representation that machines can understand more easily, so that computers can process data more intelligently. RDF (Resource Description Framework) is a data model for describing relationships between objects (resources). Used as a data description model, it gives data semantics, and semantic data supports logical reasoning in the Semantic Web, which makes Web applications more intelligent. RDF consists of a series of statements, namely "object-attribute-value" triples. RDF is domain independent: a user can use RDF Schema to define the terms used in a particular domain. RDF Schema serves as a vocabulary description language for describing classes and attributes, and it can also describe the hierarchical semantics of those classes and attributes.

XML is a document markup language that describes the relationships between data through user-defined tags and the nesting of those tags. As a standard format suited to describing semi-structured data on the network, XML has become a main medium for data representation and data exchange in the information field and is applied in many domains. Through tag nesting and user-defined tags, XML supports the construction of data at the syntactic level, but the semantics hidden in the data can only be understood through manual analysis, so the goal envisaged by the Semantic Web, data processing by intelligent agents, cannot be achieved. Data described in XML therefore needs to be converted: the data and the semantics between related data are described with the RDF data model, so that the converted data meets the data construction standard required by the Semantic Web.

The invention starts from the structure and content of XML and extracts the semantics implicit in the data. To unify the structure of XML data within a specific domain, a DTD (Document Type Definition) or an XML Schema Definition is generally used to specify the elements and attributes used in an XML document and how the data is organized. Most XML documents, however, have neither a DTD nor an XML Schema. The invention addresses these different types of XML documents and provides a general conversion method. Because DTD is gradually being replaced by XML Schema, DTD is not discussed.

For XML documents governed by an XML Schema, the invention obtains the structural information of the document, mainly the nesting relationships between elements and attributes, by parsing the XML Schema. For XML documents with no structural specification, the nesting relationships are obtained by parsing the XML document directly and classifying and aggregating its elements and attributes. Whichever type of XML is involved, the nesting relationships obtained are classified and defined, and mapping rules are constructed that convert them into the corresponding RDF domain vocabulary (classes and attributes). The result is a conceptual model of the domain, that is, an ontology; the invention uses RDF Schema as the description language of the ontology. The ontology is the basis of logical reasoning, and building an ontology that conforms to the semantics contained in the source data is the foundation for constructing RDF from XML. Many existing conversion methods consider only XML with a structural description (i.e. an XML Schema). Some of them also produce unreasonable semantics when constructing the domain vocabulary (RDF Schema), artificially adding semantics that help the conversion but are not present in the source data. The invention constructs an ontology vocabulary that conforms to the semantics of the source data (the XML documents), based on the structure of the XML and the content of its elements and attributes.

The data in the XML instance is then converted into RDF instance data on the basis of the constructed ontology, i.e. the RDF Schema. RDF describes the relationships between entities and attribute values: the entities are instances of the classes contained in the RDF Schema, and the relationships are the attributes specified in the RDF Schema. XML is semi-structured data in which element and attribute tags appear repeatedly, especially in large documents. Analysis of XML data documents shows that, among element and attribute tags with the same name, some have embedded content that recurs in the document while others have content that differs. In other data models (such as relational databases and RDF), recurring data is represented and stored only once, and a reference mechanism is used if the data is needed several times in the same document; XML, which represents data through nesting, has no such reference mechanism. Therefore, if data with the same tag is not identified and screened when XML is converted into RDF, the resulting RDF data may contain redundancy and even contradictions, which affects further processing of the RDF data (querying and ontology-based reasoning). To address this, the invention provides an identification and screening mechanism in the RDF construction process, which ensures the validity and completeness of the constructed data.

Disclosure of Invention

Purpose of the invention: the invention provides a method for automatically constructing RDF data from XML data. The method handles various types of XML data and converts them into RDF data, so it has high generality. By identifying repeated elements in the XML, it eliminates redundancy and contradictions in the constructed RDF data and produces data better suited to knowledge engineering.

The technical scheme is as follows: in order to achieve the purpose, the invention adopts the technical scheme that:

A method for automatically constructing RDF data based on XML data comprises the following steps:

Step S1: parse the tree model of the XML data document; cluster the elements by the name of their start tag, determine the sub-models corresponding to elements of the same class, and integrate the sub-models of all elements of the same class to obtain the abstract model corresponding to that cluster; construct the vocabulary RDF Schema from the abstract models obtained. Specifically:

The elements $e \in E$ of the XML data document are classified into several aggregation classes $E_{n_1}, E_{n_2}, \ldots, E_{n_i}, \ldots$ $(n_i \in N_e)$. Each element $e_i$ is assigned to its aggregation class $E_{n_i}$ according to the name of its start tag:

$f_1 : (e \in E) \to E_{n_1}, \ldots, E_{n_i}, \ldots, \quad (n_i \in N_e)$
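The classification by start-tag name can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample document, its tag names and the helper `aggregate_by_tag` are hypothetical and do not come from the patent, and the standard-library `xml.etree.ElementTree` parser stands in for whatever tree model an implementation would use.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document; the tag names are illustrative only.
SAMPLE = """<library>
  <book id="b1"><title>RDF Primer</title><author>Ann</author></book>
  <book id="b2"><title>XML Basics</title></book>
</library>"""


def aggregate_by_tag(root):
    """Group every element of the tree by the name of its start tag (the role of f_1)."""
    classes = defaultdict(list)
    for elem in root.iter():  # visits the root and all descendants in document order
        classes[elem.tag].append(elem)
    return dict(classes)


classes = aggregate_by_tag(ET.fromstring(SAMPLE))
for name, members in classes.items():
    print(name, len(members))
```

Each key of `classes` plays the role of a tag name $n_i$, and the list stored under it plays the role of the aggregation class $E_{n_i}$.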

Step S2: the sub-models $mod_{e_j} \in MOD_{n_i}$ of all elements in $E_{n_i}$ are abstracted and integrated to obtain the abstract model corresponding to the aggregation class, $smod_{n_i} = (n_i, \{CN_{n_i}, AN_{n_i}, v_{n_i}\})$, where $n_i$ is the tag name of all elements in aggregation class $E_{n_i}$, $CN_{n_i}$ is the set of names of the sub-elements, $AN_{n_i}$ is the set of names of the attributes contained in the elements' start tags, and $v_{n_i}$ is a logical variable whose presence indicates that the inline content of the elements contains a text value. Specifically:

(1) When $CN_{n_i} = \varnothing$, $AN_{n_i} = \varnothing$ and $v_{n_i}$ is present, the abstract model corresponding to aggregation class $E_{n_i}$ is $smod_{n_i} = (n_i, v_{n_i})$. In this case the sub-model $mod_{e_j}$ corresponding to element $e_j$ is a simple sub-model, i.e. $mod_{e_j} \in S$, where $S$ denotes a set of sub-models. The RDF triples are then constructed as follows:

$f_{pi} : n_i \to pi_{n_i}, \quad (n_i \in N_e,\ pi_{n_i} \in PI)$

$\mathrm{Type}(?pi_{n_i}, \mathrm{Property})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_{n_i}, ?DateTypeIRI)$

where $PI$ denotes the set of attributes in the RDF vocabulary, $DateTypeIRI$ denotes a built-in data type, and $pi_{n_i}$ denotes the attribute to which $n_i$ is mapped;

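(2) When $v_{n_i}$ is absent, the abstract model corresponding to aggregation class $E_{n_i}$ is $smod_{n_i} = (n_i, \{CN_{n_i}, AN_{n_i}\})$. The RDF triples are then constructed as follows:

$f_{ci} : n_i \to ci_{n_i}, \quad (n_i \in N_e,\ ci_{n_i} \in CI)$

$\mathrm{Type}(?ci_{n_i}, \mathrm{Class})$

$f_{pi} : \{cn_1, cn_2, \ldots, cn_j, \ldots\} \to \{pi_1, pi_2, \ldots, pi_j, \ldots\} \quad (cn_j \in CN_{n_i},\ pi_j \in PI,\ j = 1, 2, \ldots, n)$

$\mathrm{Type}(?pi_j, \mathrm{Property}) \quad (j = 1, 2, \ldots, n)$

$\mathrm{PropVal}(\mathrm{domain}, ?pi_j, ?ci_{n_i})$

$f_{ci} : cn_j \to ci_{cn_j} \quad (cn_j \in CN_{n_i},\ ci_{cn_j} \in CI)$

$\mathrm{PropVal}(\mathrm{range}, ?pi_j, ?ci_{cn_j})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_j, ?DateTypeIRI)$

$f_{pi} : \{an_1, an_2, \ldots, an_k, \ldots\} \to \{pi_1, pi_2, \ldots, pi_k, \ldots\} \quad (an_k \in AN_{n_i},\ pi_k \in PI,\ k = 1, 2, \ldots, n)$

$\mathrm{Type}(?pi_k, \mathrm{Property}) \quad (k = 1, 2, \ldots, n)$

$\mathrm{PropVal}(\mathrm{domain}, ?pi_k, ?ci_{n_i})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_k, ?DateTypeIRI)$

where $CI$ denotes the set of classes in the RDF vocabulary. Given $cn_j \in CN_{n_i}$, an attribute $pi_j$ can be constructed from $cn_j$.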

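(3) Otherwise, the abstract model corresponding to aggregation class $E_{n_i}$ is $smod_{n_i} = (n_i, \{CN_{n_i}, AN_{n_i}, v_{n_i}\})$. The class and attribute rules in the RDF Schema are then constructed from $n_i$, $CN_{n_i}$ and $AN_{n_i}$ respectively as follows:

$f_{ci} : n_i \to ci_{n_i}, \quad (n_i \in N_e,\ ci_{n_i} \in CI)$

$\mathrm{Type}(?ci_{n_i}, \mathrm{Class})$

$f_{pi} : \{cn_1, cn_2, \ldots, cn_m, \ldots\} \to \{pi_1, pi_2, \ldots, pi_m, \ldots\} \quad (cn_m \in CN_{n_i},\ pi_m \in PI,\ m = 1, 2, \ldots, n)$

$\mathrm{Type}(?pi_m, \mathrm{Property}) \quad (m = 1, 2, \ldots, n)$

$\mathrm{PropVal}(\mathrm{domain}, ?pi_m, ?ci_{n_i})$

$f_{ci} : cn_m \to ci_{cn_m} \quad (cn_m \in CN_{n_i},\ ci_{cn_m} \in CI)$

$\mathrm{PropVal}(\mathrm{range}, ?pi_m, ?ci_{cn_m})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_m, ?DateTypeIRI)$

$f_{pi} : \{an_1, an_2, \ldots, an_p, \ldots\} \to \{pi_1, pi_2, \ldots, pi_p, \ldots\} \quad (an_p \in AN_{n_i},\ pi_p \in PI,\ p = 1, 2, \ldots, n)$

$\mathrm{Type}(?pi_p, \mathrm{Property}) \quad (p = 1, 2, \ldots, n)$

$\mathrm{PropVal}(\mathrm{domain}, ?pi_p, ?ci_{n_i})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_p, ?DateTypeIRI)$

$\mathrm{Type}(\mathrm{value}, \mathrm{Property})$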
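The three cases above can be illustrated with a short Python sketch that derives $(CN_{n_i}, AN_{n_i}, v_{n_i})$ for each tag name and emits schema statements as plain tuples. Several things here are assumptions rather than part of the patent: the sample document, the `ex:` prefix, the convention of capitalizing a tag for its class name, the use of `xsd:string` for every built-in data type, and the rule that a child's range is a data type only when the child's own class is simple.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document; tag and attribute names are illustrative only.
SAMPLE = """<library>
  <book id="b1"><title>RDF Primer</title><note lang="en">classic</note></book>
  <book id="b2"><title>XML Basics</title></book>
</library>"""


def abstract_models(root):
    """Return {n_i: (CN, AN, v)} merged over all elements that share tag n_i."""
    models = defaultdict(lambda: [set(), set(), False])
    for elem in root.iter():
        cn, an, _ = models[elem.tag]
        cn.update(child.tag for child in elem)   # names of sub-elements
        an.update(elem.attrib)                   # names of start-tag attributes
        if elem.text and elem.text.strip():      # v: inline text value present
            models[elem.tag][2] = True
    return {n: (cn, an, v) for n, (cn, an, v) in models.items()}


def cls(tag):
    return "ex:" + tag.capitalize()   # assumed naming convention for classes


def prop(name):
    return "ex:" + name               # assumed naming convention for properties


def schema_triples(models):
    triples = set()
    for n, (cn, an, v) in models.items():
        if v and not cn and not an:
            # Case (1): simple sub-model, mapped to a property with a data-type range.
            triples.add((prop(n), "rdf:type", "rdf:Property"))
            triples.add((prop(n), "rdfs:range", "xsd:string"))
            continue
        # Cases (2) and (3): the tag name is mapped to a class.
        triples.add((cls(n), "rdf:type", "rdfs:Class"))
        for child in cn:
            child_cn, child_an, child_v = models[child]
            simple = child_v and not child_cn and not child_an
            triples.add((prop(child), "rdf:type", "rdf:Property"))
            triples.add((prop(child), "rdfs:domain", cls(n)))
            triples.add((prop(child), "rdfs:range", "xsd:string" if simple else cls(child)))
        for attr in an:
            triples.add((prop(attr), "rdf:type", "rdf:Property"))
            triples.add((prop(attr), "rdfs:domain", cls(n)))
            triples.add((prop(attr), "rdfs:range", "xsd:string"))
        if v:
            # Case (3) only: the text value is carried by the value property.
            triples.add(("rdf:value", "rdf:type", "rdf:Property"))
    return triples


models = abstract_models(ET.fromstring(SAMPLE))
triples = schema_triples(models)
for t in sorted(triples):
    print(t)
```

In this sample `title` falls under case (1), `library` and `book` under case (2), and `note` under case (3). The sketch does not handle a name that is used both as a sub-element and as an attribute; a fuller implementation would need distinct property names for the two.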

Step S3: according to the mapping rules corresponding to the abstract models in step S2, the vocabulary RDF Schema of the current domain is constructed as follows:

$f_{rdfs} : \{E_{n_1}, \ldots, E_{n_i}, \ldots\} \to \text{RDF Schema} \quad (n_i \in N_e)$

where the set of all aggregation classes of the XML document is $XSD = \{E_{n_1}, \ldots, E_{n_i}, \ldots\}$ $(n_i \in N_e)$;

Step S4: identify the repeated elements in the XML document data. Specifically:

All elements E and attributes A of the XML document are traversed and each is given a unique ID. The IDs of the elements and attributes in the current XML document are then adjusted so that equivalent elements and equivalent attributes share the same ID: the XML tree model is traversed again in post-order, working from the bottom of the tree towards the root, the equivalent elements and equivalent attributes in the document are identified, and their IDs are unified. The specific conditions are as follows:

$e_m \in CE_{e_i},\ e_n \in CE_{e_j},\ a_p \in EA_{e_i},\ a_q \in EA_{e_j}$

$CL_{e_i} = \{ID_{e_1}, ID_{e_2}, \ldots, ID_{e_m}, \ldots, ID_{a_1}, ID_{a_2}, \ldots, ID_{a_p}, \ldots\}$

$CL_{e_j} = \{ID_{e_1}, ID_{e_2}, \ldots, ID_{e_n}, \ldots, ID_{a_1}, ID_{a_2}, \ldots, ID_{a_q}, \ldots\}$

$e_m \in CE_{e_i},\ e_n \in CE_{e_j},\ a_p \in EA_{e_i},\ a_q \in EA_{e_j},\ e_i \to v_i,\ e_j \to v_j$

$CL_{e_i} = \{ID_{e_1}, ID_{e_2}, \ldots, ID_{e_m}, \ldots, ID_{a_1}, ID_{a_2}, \ldots, ID_{a_p}, \ldots\}$

$CL_{e_j} = \{ID_{e_1}, ID_{e_2}, \ldots, ID_{e_n}, \ldots, ID_{a_1}, ID_{a_2}, \ldots, ID_{a_q}, \ldots\}$
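A bottom-up ID assignment of this kind can be sketched as follows. The exact equivalence conditions are not fully recoverable from the text above, so this sketch makes its own assumption: two elements are treated as equivalent when they have the same tag name, the same stripped text value, the same attribute name/value pairs and the same multiset of child IDs. Attributes are folded into the element signature rather than given IDs of their own. The sample document and the helper `assign_ids` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same customer appears under two different orders.
SAMPLE = """<orders>
  <order><customer><name>Ann</name></customer><item>pen</item></order>
  <order><customer><name>Ann</name></customer><item>ink</item></order>
</orders>"""


def assign_ids(root):
    """Post-order pass in which equivalent elements receive the same integer ID."""
    ids = {}    # element object -> ID
    seen = {}   # signature -> ID of the first element with that signature

    def visit(elem):
        # Children are processed before their parent (post-order traversal).
        child_ids = sorted(visit(child) for child in elem)
        text = (elem.text or "").strip()
        signature = (elem.tag, text, tuple(sorted(elem.attrib.items())), tuple(child_ids))
        # Reuse the ID of an equivalent element, otherwise allocate the next one.
        ids[elem] = seen.setdefault(signature, len(seen) + 1)
        return ids[elem]

    visit(root)
    return ids


root = ET.fromstring(SAMPLE)
ids = assign_ids(root)
for elem in root.iter():
    print(elem.tag, ids[elem])
```

Because a parent's signature contains the already unified IDs of its children, equivalence propagates from the leaves towards the root in a single pass. The recursion depth equals the depth of the document, so a very deeply nested document would call for an explicit stack instead.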

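Step S5: after the elements of the current XML document have been clustered, the XML document is mapped to a sequence of RDF triples on the basis of step S2. The XML tree model whose element and attribute IDs were adjusted in step S4 is traversed, the set of IDs of the elements already mapped is stored as OID, and the RDF triple sequence is constructed as follows:

$f_r : (n_i, ID_{e_i}) \to r_{e_i}, \quad (n_i \in N_e,\ r_{e_i} \in R)$

$t_1 = t_v = (r_{e_i}, \mathrm{rdf{:}value}, v_i)$

$\mathrm{PropVal}(\mathrm{value}, ?r_{e_i}, ?v_i)$

$\{t_1, t_2, \ldots\} = \{t_{e_m} \mid m = 1, 2, \ldots\} \cup \{t_{a_n} \mid n = 1, 2, \ldots\} \quad (pi_{e_m} \to t_{e_m},\ pi_{a_n} \to t_{a_n})$

$f_r : (n_i, ID_{e_i}) \to r_{e_i}, \quad (n_i \in N_e,\ r_{e_i} \in R)$

$f_r : (n_m, ID_{e_m}) \to r_{e_m}, \quad (n_m \in N_e,\ r_{e_m} \in R)$

$t_{a_n} = (r_{e_i}, pi_{a_n}, v_n)$

$\mathrm{PropVal}(?pi_{a_n}, ?r_{e_i}, ?v_n)$

$f_r : (n_i, ID_{e_i}) \to r_{e_i}, \quad (n_i \in N_e,\ r_{e_i} \in R)$

$t_v = (r_{e_i}, \mathrm{rdf{:}value}, v_i)$

$t_v \in \{t_1, t_2, \ldots\}$

$\mathrm{PropVal}(\mathrm{value}, ?r_{e_i}, ?v_i)$.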
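The triple construction with an OID set can be sketched in Python as below. The sketch is a simplification under stated assumptions: resources are named `ex:<tag>_<ID>`, properties are named `ex:<name>`, an element counts as simple when it has text but no children and no attributes (decided per element here, not per aggregation class), and the ID assignment uses the signature-based equivalence described in the comments. None of these names or conventions is taken from the patent.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the same customer appears under two different orders.
SAMPLE = """<orders>
  <order><customer><name>Ann</name></customer><item>pen</item></order>
  <order><customer><name>Ann</name></customer><item>ink</item></order>
</orders>"""


def assign_ids(root):
    """Post-order ID assignment; equal (tag, text, attributes, child IDs) share an ID."""
    ids, seen = {}, {}

    def visit(elem):
        child_ids = tuple(sorted(visit(child) for child in elem))
        text = (elem.text or "").strip()
        signature = (elem.tag, text, tuple(sorted(elem.attrib.items())), child_ids)
        ids[elem] = seen.setdefault(signature, len(seen) + 1)
        return ids[elem]

    visit(root)
    return ids


def build_triples(root, ids):
    triples = []
    oid = set()   # IDs of elements that have already been mapped

    def resource(elem):
        return f"ex:{elem.tag}_{ids[elem]}"   # role of f_r: (n_i, ID) -> r

    def is_simple(elem):
        return len(elem) == 0 and not elem.attrib and bool((elem.text or "").strip())

    def visit(elem):
        if ids[elem] in oid:   # an equivalent element was already converted
            return
        oid.add(ids[elem])
        subject = resource(elem)
        for name, value in elem.attrib.items():
            triples.append((subject, f"ex:{name}", value))       # attribute triples
        text = (elem.text or "").strip()
        if text:
            triples.append((subject, "rdf:value", text))         # text-value triple
        for child in elem:
            if is_simple(child):
                triples.append((subject, f"ex:{child.tag}", child.text.strip()))
            else:
                triples.append((subject, f"ex:{child.tag}", resource(child)))
                visit(child)

    visit(root)
    return triples


root = ET.fromstring(SAMPLE)
triples = build_triples(root, assign_ids(root))
for t in triples:
    print(t)
```

Both orders link to the same customer resource, and the customer's own triples are emitted only once, which is the redundancy that the OID check is meant to avoid.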

Beneficial effects: the invention has the following advantages:

(1) The invention works well for the conversion and construction of large-scale XML data, unlike existing schemes, which target only small-scale XML data.

(2) The invention optimizes the mapping rules, solves the problem of semantic accuracy in existing methods, provides a corresponding conversion scheme for each type of XML data, and thus offers a more unified scheme for constructing RDF data from XML.

(3) By identifying repeated data entities in the XML data, the invention greatly reduces the redundancy of the RDF data obtained by the subsequent conversion. This differs from existing methods, which generally handle redundancy in the RDF data itself.

Drawings

FIG. 1 is a schematic diagram of an RDF triple data model provided by the present invention;

FIG. 2 is a schematic diagram of XML document tree model parsing provided by the present invention;

FIG. 3 is a schematic diagram of an XML document data repetitive element identification process provided by the present invention.

Detailed Description

The present invention will be further described with reference to the accompanying drawings.

The method for automatically constructing RDF data based on XML data comprises the following three parts:

(1) Extracting semantics with different methods for different types of XML data

As shown in FIG. 2, for XML data with no format restriction, elements with the same tag name are aggregated by traversal, the aggregation classes are then organized to obtain the abstract models corresponding to the different aggregation classes, and an RDF Schema is constructed according to the corresponding mapping rules. For XML data constrained by an XML Schema, the relevant classes and attributes are obtained by parsing the XML Schema, and an ontology RDF Schema is constructed from the classes and attributes obtained.

(2) Screening the XML for duplicate data entities

As shown in FIG. 3, the repeated elements in the XML are traversed, distinct elements are given unique codes according to the equivalent-element conditions, and repeated elements are given the same code. During the later conversion, the code of each element is checked to decide whether the element still needs to be converted, which removes redundancy and contradictions from the RDF data.

(3) Building RDF triples

Based on the analysis of the XML tree model, the elements of the tree are aggregated into aggregation classes, and the aggregation classes are divided among three abstract models. Mapping rules are established for the different aggregation classes, and the RDF triples for each element in the XML are built according to the mapping rules of the abstract model to which the element's aggregation class belongs, which completes the conversion from XML to RDF.
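For part (1), the case of XML data constrained by an XML Schema can be illustrated with a small Python sketch that reads candidate classes and attributes directly from the schema document. This is only a rough sketch under stated assumptions: the sample schema is hypothetical, an element declaration with an inline complex type is treated as a class, and only its direct child elements and attributes are collected. Named types, `ref` attributes, type inheritance and nested model groups are not resolved, so a real schema would need considerably more handling.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"   # XML Schema namespace in ElementTree notation

# Hypothetical schema used only for illustration.
XSD_SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def classes_and_properties(schema_root):
    """Collect (class names, (class, property) pairs) from inline complex types."""
    classes, properties = set(), set()
    for elem in schema_root.iter(XS + "element"):
        ctype = elem.find(XS + "complexType")
        if ctype is None:
            continue                      # simple-typed elements are not classes
        name = elem.get("name")
        classes.add(name)
        for group in ("sequence", "all", "choice"):
            # Only direct children of the model group are considered.
            for child in ctype.findall(f"{XS}{group}/{XS}element"):
                properties.add((name, child.get("name")))
        for attr in ctype.findall(XS + "attribute"):
            properties.add((name, attr.get("name")))
    return classes, properties


classes, properties = classes_and_properties(ET.fromstring(XSD_SAMPLE))
print(classes)
print(sorted(properties))
```

The resulting class and property names would then be passed through the same kind of mapping rules that are applied to the abstract models obtained from schema-free documents.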

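The specific steps are steps S1 to S5 as set out in the Disclosure of Invention, with the same classification function and mapping rules. The elements of the XML data document are classified by

$f_1 : (e \in E) \to E_{n_1}, \ldots, E_{n_i}, \ldots, \quad (n_i \in N_e)$

and a simple sub-model is mapped by

$f_{pi} : n_i \to pi_{n_i}, \quad (n_i \in N_e,\ pi_{n_i} \in PI)$

$\mathrm{Type}(?pi_{n_i}, \mathrm{Property})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_{n_i}, ?DateTypeIRI)$

where $PI$ denotes the set of attributes in the RDF vocabulary, $DateTypeIRI$ denotes a built-in data type, $pi_{n_i}$ denotes the attribute to which $n_i$ is mapped, and $\mathrm{Type}(?pi_{n_i}, \mathrm{Property})$ and $\mathrm{PropVal}(\mathrm{range}, ?pi_{n_i}, ?DateTypeIRI)$ are axiomatic expressions of RDF triples.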

The above describes only the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and adaptations without departing from the principles of the invention, and these are intended to fall within the scope of the invention.

Claims

1. A method for automatically constructing RDF data based on XML data is characterized by comprising the following steps:

The elements $e \in E$ of the XML data document are classified and aggregated into several classes $E_{n_1}, E_{n_2}, \ldots, E_{n_i}, \ldots$ $(n_i \in N_e)$, each element $e_i$ being assigned to its aggregation class $E_{n_i}$ according to the name of its start tag:

$f_1 : (e \in E) \to E_{n_1}, \ldots, E_{n_i}, \ldots, \quad (n_i \in N_e)$

Step S2: the sub-models $mod_{e_j} \in MOD_{n_i}$ of all elements are abstracted and integrated to obtain the abstract model corresponding to the aggregation class, $smod_{n_i} = (n_i, \{CN_{n_i}, AN_{n_i}, v_{n_i}\})$, where $n_i$ is the tag name of all elements in aggregation class $E_{n_i}$, $CN_{n_i}$ is the set of names of the sub-elements, $AN_{n_i}$ is the set of names of the attributes contained in the elements' start tags, and $v_{n_i}$ is a logical variable whose presence indicates that the inline content of the elements contains a text value. Specifically:

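$f_{pi} : n_i \to pi_{n_i}, \quad (n_i \in N_e,\ pi_{n_i} \in PI)$

$\mathrm{Type}(?pi_{n_i}, \mathrm{Property})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_{n_i}, ?DateTypeIRI)$

$f_{ci} : n_i \to ci_{n_i}, \quad (n_i \in N_e,\ ci_{n_i} \in CI)$

$\mathrm{Type}(?ci_{n_i}, \mathrm{Class})$

$\mathrm{Type}(?pi_j, \mathrm{Property}) \quad (j = 1, 2, \ldots, n)$

$\mathrm{PropVal}(\mathrm{domain}, ?pi_j, ?ci_{n_i})$

$f_{ci} : cn_j \to ci_{cn_j} \quad (cn_j \in CN_{n_i},\ ci_{cn_j} \in CI)$

$\mathrm{PropVal}(\mathrm{range}, ?pi_j, ?ci_{cn_j})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_j, ?DateTypeIRI)$

$f_{pi} : \{an_1, an_2, \ldots, an_k, \ldots\} \to \{pi_1, pi_2, \ldots, pi_k, \ldots\} \quad (an_k \in AN_{n_i},\ pi_k \in PI,\ k = 1, 2, \ldots, n)$

$\mathrm{Type}(?pi_k, \mathrm{Property}) \quad (k = 1, 2, \ldots, n)$

$\mathrm{PropVal}(\mathrm{domain}, ?pi_k, ?ci_{n_i})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_k, ?DateTypeIRI)$

$f_{ci} : n_i \to ci_{n_i}, \quad (n_i \in N_e,\ ci_{n_i} \in CI)$

$\mathrm{Type}(?ci_{n_i}, \mathrm{Class})$

$\mathrm{Type}(?pi_m, \mathrm{Property}) \quad (m = 1, 2, \ldots, n)$

$\mathrm{PropVal}(\mathrm{domain}, ?pi_m, ?ci_{n_i})$

$f_{ci} : cn_m \to ci_{cn_m} \quad (cn_m \in CN_{n_i},\ ci_{cn_m} \in CI)$

$\mathrm{PropVal}(\mathrm{range}, ?pi_m, ?ci_{cn_m})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_m, ?DateTypeIRI)$

$\mathrm{Type}(?pi_p, \mathrm{Property}) \quad (p = 1, 2, \ldots, n)$

$\mathrm{PropVal}(\mathrm{domain}, ?pi_p, ?ci_{n_i})$

$\mathrm{PropVal}(\mathrm{range}, ?pi_p, ?DateTypeIRI)$

$\mathrm{Type}(\mathrm{value}, \mathrm{Property})$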

$f_{rdfs} : \{E_{n_1}, \ldots, E_{n_i}, \ldots\} \to \text{RDF Schema} \quad (n_i \in N_e)$

where the set of all aggregation classes of the XML document is $XSD = \{E_{n_1}, \ldots, E_{n_i}, \ldots\}$ $(n_i \in N_e)$;

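$e_m \in CE_{e_i},\ e_n \in CE_{e_j},\ a_p \in EA_{e_i},\ a_q \in EA_{e_j}$

$CL_{e_i} = \{ID_{e_1}, ID_{e_2}, \ldots, ID_{e_m}, \ldots, ID_{a_1}, ID_{a_2}, \ldots, ID_{a_p}, \ldots\}$

$CL_{e_j} = \{ID_{e_1}, ID_{e_2}, \ldots, ID_{e_n}, \ldots, ID_{a_1}, ID_{a_2}, \ldots, ID_{a_q}, \ldots\}$

$e_m \in CE_{e_i},\ e_n \in CE_{e_j},\ a_p \in EA_{e_i},\ a_q \in EA_{e_j},\ e_i \to v_i,\ e_j \to v_j$

$CL_{e_i} = \{ID_{e_1}, ID_{e_2}, \ldots, ID_{e_m}, \ldots, ID_{a_1}, ID_{a_2}, \ldots, ID_{a_p}, \ldots\}$

$CL_{e_j} = \{ID_{e_1}, ID_{e_2}, \ldots, ID_{e_n}, \ldots, ID_{a_1}, ID_{a_2}, \ldots, ID_{a_q}, \ldots\}$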

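$f_r : (n_i, ID_{e_i}) \to r_{e_i}, \quad (n_i \in N_e,\ r_{e_i} \in R)$

$t_1 = t_v = (r_{e_i}, \mathrm{rdf{:}value}, v_i)$

$\mathrm{PropVal}(\mathrm{value}, ?r_{e_i}, ?v_i)$

$f_r : (n_i, ID_{e_i}) \to r_{e_i}, \quad (n_i \in N_e,\ r_{e_i} \in R)$

$f_r : (n_m, ID_{e_m}) \to r_{e_m}, \quad (n_m \in N_e,\ r_{e_m} \in R)$

$t_{a_n} = (r_{e_i}, pi_{a_n}, v_n)$

$\mathrm{PropVal}(?pi_{a_n}, ?r_{e_i}, ?v_n)$

$f_r : (n_i, ID_{e_i}) \to r_{e_i}, \quad (n_i \in N_e,\ r_{e_i} \in R)$

$t_v = (r_{e_i}, \mathrm{rdf{:}value}, v_i)$

$t_v \in \{t_1, t_2, \ldots\}$

$\mathrm{PropVal}(\mathrm{value}, ?r_{e_i}, ?v_i)$.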