CN103294791A - Extensible markup language pattern matching method

Info

Publication number: CN103294791A
Application number: CN201310192029XA
Authority: CN
Inventors: 霍红卫; 郭海涛; 高培; 张懿璞; 于强; 孙春晓; 郭鸿志
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2013-05-13
Filing date: 2013-05-13
Publication date: 2013-09-11

Abstract

The invention discloses an Extensible Markup Language (XML) schema matching method that addresses the shortcomings of the prior art in schema representation, discovery of complex matches, and matching efficiency. The steps are: inputting the XML schemas; constructing schema trees; constructing sequence structures; matching all elements on name, data type and cardinality constraint to obtain the linguistic similarity of every element pair; for complex elements, matching on child, leaf, sibling and ancestor similarity to obtain the structural and overall similarity of complex element pairs, then filtering to find the matching complex element pairs; for each matched complex element pair, finding the atom set of each element, applying the non-complex element structure matching method to the elements of the atom sets to compute the structural and overall similarity of atomic elements, then filtering to find the matching element pairs; and outputting all matching element pairs. The method is fully automatic and improves matching efficiency while preserving matching quality.

Description

An Extensible Markup Language Pattern Matching Method

Technical field

The invention belongs to the field of communication technology and further relates to an Extensible Markup Language (eXtensible Markup Language, XML) schema matching method in the field of data processing. Based on the name and structure information of the schemas, the invention automatically matches two input XML schema documents and finds the mappings between all similar elements of the two documents; these mappings are used to determine the similarity between different XML data.

Background art

With the development of the Internet, XML emerged and became the standard for data representation, data analysis and data exchange on the network. Because XML describes data flexibly and the number and size of XML documents keep growing, efficiently managing large-scale XML data and integrating large numbers of XML data sources has become very important. XML schema matching, which identifies correspondences between the elements of XML schemas, has therefore become an active research topic.

XML schema matching takes two XML schemas as input and uses various similarity measures to produce a mapping between them. It plays an important role in data-sharing applications: in data integration, it identifies and characterizes the inter-schema relationships among multiple schemas; in data warehousing, it maps a data source to the warehouse schema; in e-commerce, it maps messages between different XML formats; in the Semantic Web, it establishes semantic correspondences between the ontology concepts of different websites; in data migration, it migrates legacy data from multiple sources into a new one; in data transformation, it maps a source object to a target object; and in XML data clustering, it determines the semantic similarity between different XML data.

Early schema matching was usually done by hand, and specifying matches manually is time-consuming, error-prone and expensive. Many automatic matching algorithms and systems have since been proposed, such as LSD (Learning Source Descriptions), Cupid, COMA (COmbination of MAtching algorithms), Similarity Flooding, AgreementMaker, ASMOV (Automated Semantic Matching of Ontologies with Verification) and OII Harmony. Although these algorithms and systems achieve semi-automatic or fully automatic matching with fairly high quality, automatic XML schema matching still has several shortcomings. First, most matching algorithms find only simple (1:1) matches, and few methods discover complex matches. Second, most matching algorithms focus on the overall similarity between schemas and ignore the similarity between individual elements, although element-level similarity is what supports semi-automatic, labor-intensive activities such as XML schema integration. Finally, and most importantly, most matching systems attend only to matching quality and ignore matching efficiency, so they perform very poorly on large-scale data. For example, using an external dictionary (such as WordNet) for semantic matching of element names improves the accuracy of name matching, but the frequent dictionary lookups greatly increase matching time.

The patent application "XML document structure and semantic similarity calculation method based on extended adjacency matrix" filed by Nankai University (application number 201010118060.5, publication number CN101799825A) discloses a method for computing the structural and semantic similarity of XML documents based on extended adjacency matrices. Its steps are: first, input the XML documents and encode the XML document trees; second, for the two encoded documents, generate a schema document node list and a data source document node list; third, from the two node lists, generate a schema extended adjacency matrix and a data source extended adjacency matrix; fourth, use the cosine formula to compute the distance between the two adjacency matrices and obtain the similarity of the two XML documents. The shortcomings of this application are as follows. First, the method measures similarity only at the document level and does not descend to the finer granularity of document elements, so it cannot be used in data processing applications that rely on mappings between XML schema elements. Second, the method uses only limited information (node labels, node level, node encoding and parent node) as the basis for measuring node similarity, which may introduce large errors into the computed similarity.

The patent application "TOOLS AND METHODS FOR SEMI-AUTOMATIC SCHEMA MATCHING" filed by MITRE CORP [US] (application number US20060491167, publication number US2008021912A1) discloses a semi-automatic XML schema matching tool and method. Its steps are: first, input the source and target XML schemas to be matched; second, display the source and target schemas graphically; third, ask the user whether to specify some matches manually; if so, let the user specify them on the displayed schema graphs; otherwise, perform the fourth to seventh steps; fourth, apply linguistic preprocessing to the source and target schemas and have a set of match voters score them; fifth, have a vote merger combine all the scores from the fourth step into a match matrix; sixth, use structural information to adjust the scores further; seventh, display the match results graphically; eighth, repeat the third to seventh steps until all match scores have been computed. The shortcoming of this application is that, although manual intervention lets a semi-automatic method improve matching quality to some extent, the approach suits only small-scale data processing. For larger XML schema documents, specifying matches by hand is tedious, time-consuming and error-prone, which may limit the size of the XML schema data the method can handle.

The patent application "METHODS AND SYSTEMS FOR MODEL MATCHING" filed by MICROSOFT CORP [US] (application number US20010028912, publication number US2003120651A1) discloses a method and system for model or schema matching. Its steps are: first, input the source and target XML schemas to be matched; second, parse the two input schemas into Document Object Models; third, convert the generated Document Object Models into a generic object model; fourth, perform root attribute matching and structure matching; fifth, return the match results. The shortcomings of this application are as follows. First, structure matching relies mainly on the similarity of leaf nodes and does not fully consider other structural information about a node, such as its children and siblings, which may reduce the quality of structure matching. Second, to improve the structure matching result, the subtrees must be traversed repeatedly and the node similarities updated over multiple passes; this improves accuracy to some extent, but on large XML schemas it may incur a large system overhead and thus reduce matching efficiency.

Summary of the invention

The purpose of the invention is to overcome the above shortcomings of the prior art by providing an XML schema matching method that uses consolidated Prüfer sequences as the intermediate representation of XML schemas and makes full use of linguistic and structural information to find the mappings between all similar elements of two documents. The method is fully automatic and improves matching efficiency while preserving matching quality, addressing the problems that existing schema matching encounters in schema representation, discovery of complex matches, and matching efficiency.

To achieve the above purpose, the invention comprises the following steps:

(1) Input the two XML schema documents to be matched.

(2) Build the schema trees:

Parse the two XML schema documents to be matched with a Document Object Model (DOM) parser to generate a schema tree for each document.

(3) Construct the sequence structures:

Construct Prüfer sequences for the two schema trees, obtaining two consolidated Prüfer sequences, each consisting of a number Prüfer sequence and a label Prüfer sequence.

(4) Linguistic matching:

4a) Select an arbitrary element s and an arbitrary element t from the label Prüfer sequences of the two consolidated Prüfer sequences, respectively;

4b) Use the name similarity calculation method to obtain the name similarity of elements s and t;

4c) Use the data type similarity calculation method to obtain the data type similarity of elements s and t;

4d) Use the cardinality constraint similarity calculation method to obtain the cardinality constraint similarity of elements s and t;

4e) Take the weighted average of the name similarity, data type similarity and cardinality constraint similarity of elements s and t as their linguistic similarity;

4f) Repeat steps 4a) to 4e) until the linguistic similarity between every pair of elements from the two label Prüfer sequences has been obtained.

(5) Complex element structure matching:

5a) Sort all nodes of the number Prüfer sequence in each of the two consolidated Prüfer sequences in ascending order of their postorder numbers in the schema tree;

5b) Select an arbitrary element i and an arbitrary element j from the two sorted number Prüfer sequences, respectively;

5c) Use the child similarity calculation method to obtain the child similarity of elements i and j;

5d) Use the leaf similarity calculation method to obtain the leaf similarity of elements i and j;

5e) Use the sibling similarity calculation method to obtain the sibling similarity of elements i and j;

5f) Use the ancestor similarity calculation method to obtain the ancestor similarity of elements i and j;

5g) Take the weighted average of the child, leaf, sibling and ancestor similarities of elements i and j as their structural similarity;

5h) Take the weighted average of the structural similarity of elements i and j and their linguistic similarity obtained in step (4) as their overall similarity;

5i) Repeat steps 5c) to 5h) until the overall similarity between every pair of elements from the two sorted number Prüfer sequences has been obtained;

5j) Filter the overall similarities of all these element pairs with a threshold to obtain all matching complex node pairs, which form the set of matched complex node pairs.

(6) Non-complex element structure matching:

6a) Take any element pair from the matched element pairs obtained by complex element structure matching, and denote its two elements as e and f;

6b) Search the consolidated Prüfer sequences containing e and f, respectively, find all atoms of e and of f, and form the atom sets of e and f;

6c) Take any element c from the atom set of e and use the non-complex element structure matching method to obtain the structural similarity between c and every element in the atom set of f;

6d) Check whether any elements remain in the atom set of e; if so, perform step 6a); otherwise, the overall similarity between every pair of elements from the atom sets of e and f has been obtained, and step 6e) is performed;

6e) Repeat steps 6a), 6b), 6c) and 6d) until the overall similarity has been obtained between every pair of elements of the atom sets that correspond to all matched element pairs from complex element structure matching;

6f) Filter the overall similarities of all the resulting element pairs with a threshold to obtain the matching non-complex node pairs, which form the set of matched non-complex node pairs.

(7) Output the match results:

Output the union of the set of matched complex node pairs obtained in step (5) and the set of matched non-complex node pairs obtained in step (6).

Compared with the prior art, the invention has the following advantages:

First, the invention performs XML schema matching at the level of XML document elements. This overcomes the limitation of prior art that measures similarity only at the document level without descending to the finer granularity of document elements, so the invention is better suited to semi-automatic, labor-intensive tasks such as XML schema integration.

Second, the invention makes full use of name, data type and cardinality constraint information in linguistic matching, and of structural information about a node's children, siblings, leaves and ancestors in structure matching. This overcomes the limitation of prior art that computes XML schema similarity from only a small amount of linguistic and structural information about each node, so the invention improves the quality of XML schema matching.

Third, the invention represents XML schemas with an efficient sequence structure, the consolidated Prüfer sequence, and in name matching, the most critical part of linguistic matching, it combines several string matching algorithms following the decision tree principle to improve matching efficiency. In addition, structure matching is applied only to the atomic elements of matched complex element pairs rather than to all atomic elements; this makes complex matches easy to discover while keeping matching efficient. This overcomes the limitation of prior art that attends only to matching quality and ignores matching efficiency. The invention balances matching quality and performance, which makes it suitable for a wider range of applications.

Brief description of the drawings

Fig. 1 is a flowchart of the invention.

Detailed description

The invention is described in further detail below with reference to Fig. 1.

Step 1: input the two XML schema documents to be matched.

The two XML schema documents to be matched are input from a terminal.

Step 2: build the schema trees.

Parse each of the two XML schema documents to be matched with a Document Object Model (DOM) parser to generate a schema tree for each document.

A schema tree can be represented as a triple T = {N_T, E_T, Lab_NT}, where N_T = {n_1, n_2, ..., n_n} is the node set, in which each node uniquely represents one object of the schema; E_T = {(n_i, n_j) | n_i, n_j ∈ N_T} is the edge set, in which n_i is the parent of n_j, so each edge represents a parent-child relationship between two nodes; and Lab_NT is the set of node labels, a label being a string that describes the properties of a node. The nodes of a schema tree are of two kinds, atomic nodes and complex nodes. Atomic nodes are leaf nodes with no outgoing edges and represent the simple elements and attributes of the schema; complex nodes are the internal nodes of the schema tree and represent complex elements.
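As a rough illustration of DOM parsing followed by schema tree construction, the sketch below parses a small, made-up XSD with Python's `xml.dom.minidom` and keeps only named `element` and `attribute` declarations as tree nodes. The `SchemaNode` class, the `build_tree` helper and the sample schema are hypothetical. The sketch ignores `ref`, named types and shared global declarations, so it only covers schemas that are already tree-shaped.

```python
from dataclasses import dataclass, field
from xml.dom import minidom


@dataclass
class SchemaNode:
    label: str                                    # node label (element/attribute name)
    children: list = field(default_factory=list)  # edges to child nodes

    @property
    def is_atomic(self):
        """Atomic nodes are leaves; complex nodes have at least one child."""
        return not self.children


XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Staff">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="Name" type="xs:string"/>
        <xs:element name="Phone" type="xs:string"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
</xs:schema>"""


def build_tree(dom_node):
    """Return the schema nodes declared directly or indirectly below dom_node."""
    nodes = []
    for child in dom_node.childNodes:
        if child.nodeType != minidom.Node.ELEMENT_NODE:
            continue  # skip whitespace text and comments
        kind = child.tagName.split(":")[-1]
        if kind in ("element", "attribute") and child.hasAttribute("name"):
            node = SchemaNode(child.getAttribute("name"))
            node.children = build_tree(child)
            nodes.append(node)
        else:
            # Wrappers such as complexType or sequence add no tree node.
            nodes.extend(build_tree(child))
    return nodes


root = build_tree(minidom.parseString(XSD).documentElement)[0]
```

Here `root` is the complex node `Staff`, and its children `Name`, `Phone` and `id` are atomic nodes.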

The structure of an XML schema can be very complex, mainly in two respects: the number of occurrences of elements and attributes is often governed by a complex regular expression, so cardinality constraints are hard to determine; and shared globally declared elements, attributes and complex types introduce cycles into the schema. The XML schemas should therefore be simplified before the schema trees are built. The simplification steps are, in order: creating copies to resolve sharing in the schema, limiting the recursion depth to resolve infinite recursion, and applying rules to simplify the cardinality constraints of elements and attributes. A schema tree is a rooted, directed, labeled tree that reflects the hierarchy of an XML schema document.

Step 3: construct the sequence structures.

The Prüfer sequence generation method is applied to each of the two schema trees to generate the corresponding Consolidated Prüfer Sequence (CPS). A consolidated Prüfer sequence consists of a Number Prüfer Sequence (NPS) and a Label Prüfer Sequence (LPS), that is, CPS = {NPS, LPS}. The two sequences capture entirely different information about the schema tree: the number Prüfer sequence represents its structural information and the label Prüfer sequence represents its semantic information, so the consolidated Prüfer sequence uniquely represents a schema tree. The consolidated Prüfer sequence has distinctive advantages in the form of a node property and relationship properties. The node property is: let n_i be a node of the schema tree with postorder number k; then n_i is an atomic node if and only if k does not occur in NPS, and n_i is a complex node if and only if k occurs in NPS. The relationship properties comprise a parent-child property and a sibling property. The parent-child property is: let NPS_i be the element at index i of NPS and LPS_i the element at index i of LPS; then the node NPS_i is the parent of the node LPS_i, and LPS_i is a direct child of NPS_i. The sibling property is: let NPS_i and NPS_j be the elements at indices i and j of NPS, and LPS_i and LPS_j the elements at indices i and j of LPS; then LPS_i and LPS_j are siblings if and only if NPS_i = NPS_j.
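The node and relationship properties can be checked on a small hand-made example. The sketch below assumes a hypothetical tree Staff(Name, Phone, Address(City, Zip)) with postorder numbers Name=1, Phone=2, City=3, Zip=4, Address=5, Staff=6, and assumes that the sequences list one entry per non-root node. The function names are illustrative, and indices are 0-based as usual in Python.

```python
# Hypothetical CPS for the tree Staff(Name, Phone, Address(City, Zip)).
# NPS[i] is the postorder number of the parent of the node labelled LPS[i].
NPS = [6, 6, 5, 5, 6]
LPS = ["Name", "Phone", "City", "Zip", "Address"]


def is_complex(postorder_number, nps):
    """Node property: a node is complex iff its postorder number occurs in NPS."""
    return postorder_number in nps


def is_atomic(postorder_number, nps):
    """Node property: a node is atomic iff its postorder number is absent from NPS."""
    return postorder_number not in nps


def parent_number(index, nps):
    """Parent-child property: NPS[index] is the parent of the node LPS[index]."""
    return nps[index]


def are_siblings(i, j, nps):
    """Sibling property: two distinct entries are siblings iff their NPS values agree."""
    return i != j and nps[i] == nps[j]


print(is_atomic(1, NPS), is_complex(5, NPS))           # Name is atomic, Address is complex
print(LPS[2], "has parent number", parent_number(2, NPS))
print(are_siblings(0, 4, NPS), are_siblings(1, 2, NPS))
```

Each check is a lookup in a flat list, so no tree traversal is needed once the sequences exist.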

The consolidated Prüfer sequence of each schema tree to be matched is constructed with the Prüfer sequence construction method as follows:

Step 1: search the schema tree of each of the two XML schema documents to be matched and find the leaf node with the smallest postorder traversal number;

Step 2: store the leaf node found in the label Prüfer sequence, and store its parent node in the number Prüfer sequence;

Step 3: delete the leaf node found from the schema tree;

Step 4: check whether the schema tree is empty; if it is not, return to step 1; otherwise, the construction of the Prüfer sequence is complete.
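A minimal Python sketch of this construction for a single schema tree follows. It assumes that nodes are numbered by postorder traversal before deletion starts and that the loop stops once only the root remains, since the root has no parent to record; both points are interpretations rather than statements from the text. The `Node` class and `build_cps` function are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def postorder(root):
    """Return (node, parent) pairs in postorder; the root's parent is None."""
    out = []

    def visit(node, parent):
        for child in node.children:
            visit(child, node)
        out.append((node, parent))

    visit(root, None)
    return out


def build_cps(root):
    """Build (NPS, LPS) by repeatedly deleting the smallest-numbered leaf."""
    ordered = postorder(root)
    number = {id(node): k for k, (node, _) in enumerate(ordered, start=1)}
    label = {number[id(node)]: node.label for node, _ in ordered}
    parent = {number[id(node)]: (number[id(p)] if p is not None else None)
              for node, p in ordered}
    open_children = {number[id(node)]: len(node.children) for node, _ in ordered}
    alive = set(label)

    nps, lps = [], []
    while len(alive) > 1:                                         # root is left over
        leaf = min(k for k in alive if open_children[k] == 0)    # step 1
        lps.append(label[leaf])                                   # step 2: leaf label
        nps.append(parent[leaf])                                  # step 2: parent number
        alive.remove(leaf)                                        # step 3: delete leaf
        open_children[parent[leaf]] -= 1
    return nps, lps


tree = Node("Staff", [Node("Name"), Node("Phone"),
                      Node("Address", [Node("City"), Node("Zip")])])
nps, lps = build_cps(tree)
```

For this example tree the nodes are deleted in postorder, which gives `nps == [6, 6, 5, 5, 6]` and `lps == ["Name", "Phone", "City", "Zip", "Address"]`.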

Step 4: linguistic matching.

The linguistic matching method is based on the similarity of three things: node names, node data types and node cardinality constraints. It is carried out as follows:

4a) Select an arbitrary element s and an arbitrary element t from the label Prüfer sequences of the two consolidated Prüfer sequences, respectively.

4b) Use the name similarity calculation method to obtain the name similarity of elements s and t.

When data instances are not considered, the node name is an important piece of information for matching. Node names can be semantically similar, such as People and Staff, or structurally similar, such as Staff and TechnicalStaff. Structural similarity of names can be computed with string matching methods, which give the similarity of two name strings. Semantic similarity requires an external dictionary, and frequent dictionary lookups increase matching time; therefore, for the sake of matching efficiency and full automation, name matching here covers only the structural similarity of names. Name matching is carried out as follows:

4b1) Split the names of elements s and t according to the name tokenization rule to obtain token set 1 and token set 2.

In XML schemas, some node names are long, some nodes carry the same information but are distinguished by different numeric suffixes, and some names contain special symbols. So that string matching algorithms work well on node names, each name is first normalized into a set of substrings, the token set; each substring is called a token. The tokenization rule is: split the node name into tokens, using special characters such as "_", spaces, digits and uppercase letters as separators, and delete the non-letter tokens (for example, number tokens and special-symbol tokens) from the token set.
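The tokenization rule can be sketched with one regular expression that extracts runs of letters, so that digits and symbols act as separators and are dropped at the same time. The `tokenize` function is a hypothetical name. Treating a run of capitals as one token, lowercasing the result and handling ASCII letters only are assumptions that the rule above does not specify.

```python
import re

# An acronym (capitals not followed by a lowercase letter) or a word that
# optionally starts with one capital letter.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def tokenize(name):
    """Split a node name into lowercase letter-only tokens."""
    return [token.lower() for token in _TOKEN.findall(name)]


print(tokenize("TechnicalStaff"))  # ['technical', 'staff']
print(tokenize("staff_name2"))     # ['staff', 'name']
print(tokenize("order ID 3"))      # ['order', 'id']
```

Because only letter runs are extracted, no separate pass is needed to delete number or symbol tokens.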

4b2) Take any element a from token set 1 and use the decision-tree method to obtain the string similarity values between element a and all elements of token set 2. This step is implemented as follows:

Step 1: Take any element b from token set 2.

Step 2: Compare the string values of element a and element b. If they are identical, the string similarity value of element a and element b is 1; otherwise, compute the edit-distance similarity value of element a and element b.

Step 3: Determine whether the edit-distance similarity value is greater than or equal to the threshold 0.58. If so, the string similarity value of element a and element b is the computed edit-distance similarity value. Otherwise, compute one string similarity value of element a and element b with the Jaro-Winkler algorithm and another with the 3-gram algorithm, and take the weighted average of the two as the string similarity value of element a and element b.

Step 4: Determine whether any elements remain in token set 2. If so, return to Step 1; otherwise, the string similarity values between element a and all elements of token set 2 have been obtained, and step 4b3) is executed.
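
The decision tree of Steps 2 and 3 might look like the sketch below. Only the threshold 0.58 and the order of the tests come from the text. Everything else is an assumption made for illustration: the edit-distance similarity is taken as 1 minus the Levenshtein distance divided by the longer length, the 3-gram measure is taken as the Dice coefficient over character trigrams, the Jaro-Winkler prefix scale is the customary 0.1, and the two fallback scores are weighted equally because the weights are not given here. All function names are hypothetical.

```python
def edit_similarity(a: str, b: str) -> float:
    """Levenshtein distance scaled to [0, 1] (the scaling is an assumption)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - prev[-1] / longest


def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Standard Jaro-Winkler similarity with prefix scale p."""
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    used = [False] * len(b)
    matched_a = []
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not used[j] and b[j] == ch:
                used[j] = True
                matched_a.append(ch)
                break
    if not matched_a:
        return 0.0
    matched_b = [b[j] for j in range(len(b)) if used[j]]
    m = len(matched_a)
    # Half the number of matched characters that appear in a different order.
    t = sum(x != y for x, y in zip(matched_a, matched_b)) / 2
    jaro = (m / len(a) + m / len(b) + (m - t) / m) / 3
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return jaro + prefix * p * (1 - jaro)


def _trigrams(s: str) -> set[str]:
    return {s[i:i + 3] for i in range(len(s) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over character 3-grams (choice of measure is an assumption)."""
    ga, gb = _trigrams(a), _trigrams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def string_similarity(a: str, b: str, threshold: float = 0.58, w_jw: float = 0.5) -> float:
    """Decision tree: exact match, then edit distance, then the weighted fallback."""
    if a == b:                                   # Step 2: identical strings
        return 1.0
    ed = edit_similarity(a, b)
    if ed >= threshold:                          # Step 3: edit distance is decisive
        return ed
    # Step 3, otherwise branch: weighted average (w_jw = 0.5 is an assumed weight).
    return w_jw * jaro_winkler(a, b) + (1 - w_jw) * trigram_similarity(a, b)
```

The cheap exact comparison and the edit distance settle most pairs; the two more expensive measures run only for pairs whose edit-distance similarity falls below the threshold.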

4b3) Find the maximum of the similarity values obtained in step 4b2), and take this maximum as the string similarity value between element a and token set 2.

4b4) Determine whether any elements remain in token set 1. If so, return to step 4b2); otherwise, the string similarity values between all elements of token set 1 and token set 2 have been obtained, and step 4b5) is executed.

4b5) Add up the string similarity values between all elements of token set 1 and token set 2 to obtain a cumulative sum.

4b6) Take any element b from token set 2 and use the decision-tree method to compute the string similarity values between element b and all elements of token set 1.

4b7) Find the maximum of the similarity values obtained in step 4b6), and take this maximum as the string similarity value between element b and token set 1.

4b8) Determine whether any elements remain in token set 2. If so, return to step 4b6); otherwise, the string similarity values between all elements of token set 2 and token set 1 have been obtained, and step 4b9) is executed.

4b9) Add up the string similarity values between all elements of token set 2 and token set 1 to obtain a cumulative sum.

4b10) Add the cumulative sums obtained in steps 4b5) and 4b9) to obtain the total.

4b11) Divide the total by the total number of elements in the two token sets; the quotient is the linguistic similarity value of element s and element t.
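
Steps 4b2) to 4b11) amount to a symmetric best-match average over the two token sets, which can be sketched as below. The function name `name_similarity` and the `token_sim` parameter (any token-level string similarity, such as the decision-tree method) are hypothetical names; returning 0 for an empty token set is an assumption, since the text does not cover that case.

```python
from typing import Callable, Sequence


def name_similarity(
    tokens1: Sequence[str],
    tokens2: Sequence[str],
    token_sim: Callable[[str, str], float],
) -> float:
    """Score produced by step 4b) for two token sets."""
    if not tokens1 or not tokens2:
        return 0.0  # assumption: no tokens means no evidence of similarity
    # 4b2) to 4b5): best match in set 2 for every token of set 1, summed.
    sum1 = sum(max(token_sim(a, b) for b in tokens2) for a in tokens1)
    # 4b6) to 4b9): best match in set 1 for every token of set 2, summed.
    sum2 = sum(max(token_sim(b, a) for a in tokens1) for b in tokens2)
    # 4b10) and 4b11): total divided by the combined number of tokens.
    return (sum1 + sum2) / (len(tokens1) + len(tokens2))
```

With an exact-match placeholder for `token_sim`, the token sets `["ship", "to"]` and `["ship", "address"]` score (1 + 1) / 4 = 0.5. Scoring in both directions keeps the result symmetric, so a short name that is fully contained in a long one does not score 1.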

4c) Use the data type similarity calculation method to obtain the data type similarity value of element s and element t.

Although node names are important information for linguistic matching, the mappings obtained from name matching alone still contain many false matches. To improve matching quality, the data type serves as a further piece of schema information for linguistic matching. XML schema data types are either built-in or user-defined. Built-in types include string, int, bool, and so on; user-defined types are either complex types or simple types, and a simple type ultimately reduces to a built-in type. Two built-in-type nodes are more similar to each other than a built-in-type node is to a complex-type node, and the type similarity value of two complex-type nodes is determined by the structure of the nodes. The data type matching method is implemented as follows:

Determine whether the data types of element s and element t are both built-in types. If so, look up the type similarity table to find the data type similarity value of element s and element t. Otherwise, determine whether the data types of element s and element t are both complex types: if so, their data type similarity value is 1; otherwise, it is 0.
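
The three-way decision can be sketched as follows. The type similarity table is not reproduced in this section, so the entries in `TYPE_SIMILARITY` are invented placeholders, and `BUILTIN_TYPES` is only an illustrative subset. The sketch also assumes that simple types have already been resolved to their built-in base type, that every remaining non-built-in name is a complex type, and that two identical built-in types score 1.

```python
# Illustrative subset of built-in types (assumption).
BUILTIN_TYPES = {"string", "int", "boolean", "decimal", "date"}

# Placeholder values only: the real type similarity table is not given here.
TYPE_SIMILARITY = {
    frozenset({"int", "decimal"}): 0.8,
    frozenset({"string", "date"}): 0.2,
}


def data_type_similarity(type_s: str, type_t: str) -> float:
    """Data type similarity of two elements, given their resolved type names."""
    s_builtin = type_s in BUILTIN_TYPES
    t_builtin = type_t in BUILTIN_TYPES
    if s_builtin and t_builtin:
        if type_s == type_t:
            return 1.0  # assumption: identical built-in types are fully similar
        # Table lookup; pairs missing from the placeholder table default to 0.
        return TYPE_SIMILARITY.get(frozenset({type_s, type_t}), 0.0)
    if not s_builtin and not t_builtin:
        return 1.0      # both complex: structure matching refines this later
    return 0.0          # one built-in type, one complex type
```

Storing each table entry under a `frozenset` makes the lookup independent of argument order.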

4d) Use the cardinality constraint similarity calculation method to obtain the cardinality constraint similarity value of element s and element t.

The cardinality constraint of a node is another important piece of information available to linguistic matching. XML schema uses "minOccurs" and "maxOccurs" to define the number of occurrences of an element or attribute in a schema. A Document Type Definition (DTD) expresses node cardinality with four basic cardinality constraint values: "*", "?", "+", and "none". In an XML Schema Definition (XSD), these four values correspond to the following: "none" means minOccurs=1 and maxOccurs=1; "?" means minOccurs=0 and maxOccurs=1; "*" means minOccurs=0 and maxOccurs=unbounded; "+" means minOccurs=1 and maxOccurs=unbounded. If the cardinality constraints of both nodes can be expressed as one of these four basic values, their cardinality constraint similarity value is simply looked up in the cardinality constraint similarity table. Otherwise, the following steps compute the cardinality constraint similarity value of element s and element t:

4d1) Compute the minimum cardinality constraint similarity value of element s and element t using the following formula:

$u = 1 - \frac{|x - y|}{x + y}$

Here, u is the minimum cardinality constraint similarity value of element s and element t, x is the minimum number of occurrences allowed for element s, and y is the minimum number of occurrences allowed for element t.

4d2) Compute the maximum cardinality constraint similarity value of element s and element t using the following formula:

$v = 1 - \frac{|m - n|}{m + n}$

Here, v is the maximum cardinality constraint similarity value of element s and element t, m is the maximum number of occurrences allowed for element s, and n is the maximum number of occurrences allowed for element t.

4d3) Compute the average of the minimum cardinality constraint similarity value and the maximum cardinality constraint similarity value of element s and element t; this average is the cardinality constraint similarity value of element s and element t.
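
Steps 4d1) to 4d3) can be sketched as below for constraints that are not among the four basic values. The mapping `DTD_TO_XSD` restates the correspondence given in the text; the similarity table used for the basic values is not reproduced in this section and is therefore left out. Two behaviors are assumptions that the text does not address: equal bounds (including 0 and 0, or unbounded and unbounded) score 1, and a bounded value compared with an unbounded one scores 0. All names are hypothetical.

```python
import math

UNBOUNDED = math.inf  # stand-in for maxOccurs="unbounded"

# The four basic DTD cardinalities as (minOccurs, maxOccurs) pairs.
DTD_TO_XSD = {
    "none": (1, 1),
    "?": (0, 1),
    "*": (0, UNBOUNDED),
    "+": (1, UNBOUNDED),
}


def is_basic(min_occurs: float, max_occurs: float) -> bool:
    """True if the constraint is one of the four basic values (table lookup case)."""
    return (min_occurs, max_occurs) in DTD_TO_XSD.values()


def _bound_similarity(x: float, y: float) -> float:
    if x == y:
        return 1.0  # assumption: also avoids 0/0 and inf/inf
    if math.isinf(x) or math.isinf(y):
        return 0.0  # assumption: bounded versus unbounded
    return 1.0 - abs(x - y) / (x + y)


def cardinality_similarity(min_s, max_s, min_t, max_t) -> float:
    u = _bound_similarity(min_s, min_t)  # 4d1) minimum bounds
    v = _bound_similarity(max_s, max_t)  # 4d2) maximum bounds
    return (u + v) / 2                   # 4d3) average of the two
```

As a worked case, the constraints (2, 5) and (3, 5) give u = 1 - 1/5 = 0.8 and v = 1, so the similarity is 0.9.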

4e) Take the weighted average of the name similarity value, the data type similarity value, and the cardinality constraint similarity value of element s and element t as the linguistic similarity value of element s and element t.

Step 5: Complex element structure matching.

The structure matching method for complex nodes is based on four structural aspects of a node: its children, leaves, siblings, and ancestors. This step is implemented as follows:

5a) Sort all nodes of the numbered Prüfer sequence in each of the two enhanced Prüfer sequences in ascending order of the nodes' postorder numbers in the schema tree.

5b) Select an arbitrary element i and an arbitrary element j from the two sorted numbered Prüfer sequences, respectively.

5c) Use the child similarity calculation method to obtain the child similarity value of element i and element j.

As the most important component of an element's structural similarity value, the child similarity value directly reflects the basic structure of the element. This step is implemented as follows:

5c1) Using the parent-child relationships encoded in the enhanced Prüfer sequence, search the enhanced Prüfer sequence corresponding to element i, find all children of element i, and form the child set of element i; search the enhanced Prüfer sequence corresponding to element j, find all children of element j, and form the child set of element j.

5c2) Take any element p from the child set of element i and obtain the overall similarity values between element p and all elements in the child set of element j. Here, the overall similarity value of two elements is the weighted average of their linguistic similarity value and their structural similarity value. Because structural similarity values are computed in ascending order of the nodes' postorder numbers in the schema tree, the structural similarity values between element p and the elements in the child set of element j have already been computed at this point, and their linguistic similarity values were computed in step 4.

5c3) Find the maximum of the similarity values obtained in step 5c2), and take this maximum as the similarity value between element p and the child set of element j.

5c4) Determine whether any elements remain in the child set of element i. If so, return to step 5c2); otherwise, the overall similarity values between all elements in the child set of element i and the child set of element j have been obtained, and step 5c5) is executed.

5c5) Add up the similarity values between all elements in the child set of element i and the child set of element j to obtain a cumulative sum.

5c6) Divide the cumulative sum obtained in step 5c5) by the larger of the numbers of elements in the two child sets; the quotient is the child similarity value of element i and element j.
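
Steps 5c2) to 5c6) can be sketched as below. The function name `child_similarity` is hypothetical, and `overall_sim` stands for a lookup of overall similarity values that were computed earlier for the child pairs; returning 0 when either child set is empty is an assumption.

```python
from typing import Callable, Hashable, Sequence


def child_similarity(
    children_i: Sequence[Hashable],
    children_j: Sequence[Hashable],
    overall_sim: Callable[[Hashable, Hashable], float],
) -> float:
    """Child similarity of two elements from their child sets."""
    if not children_i or not children_j:
        return 0.0  # assumption: an element without children has no child evidence
    # 5c2) to 5c5): best overall match in j's child set for each child of i, summed.
    total = sum(max(overall_sim(p, q) for q in children_j) for p in children_i)
    # 5c6): normalise by the size of the larger child set.
    return total / max(len(children_i), len(children_j))
```

Dividing by the larger set size penalises elements whose numbers of children differ: two children that each find a good match among three candidates can reach at most 2/3.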

5d) Use the leaf similarity calculation method to obtain the leaf similarity value of element i and element j.

This step is implemented as follows:

5d1) Using the parent-child and sibling relationships encoded in the enhanced Prüfer sequence, search the enhanced Prüfer sequence corresponding to element i, find all leaves of element i, and form the leaf set of element i; search the enhanced Prüfer sequence corresponding to element j, find all leaves of element j, and form the leaf set of element j.

5d2) Build a numeric vector for element i and for element j: each component of an element's vector is the difference between the element's postorder number in the schema tree and the postorder number of one leaf node in the element's leaf set.

5d3) Using the cosine formula, compute the similarity of the numeric vectors of element i and element j; this value is the leaf similarity value of element i and element j.
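
Steps 5d2) and 5d3) can be sketched as below, reading the cosine formula as the cosine of the angle between the two vectors. The text does not say how vectors of different lengths are compared, so this sketch pads the shorter one with zeros, and it returns 0 when either vector is empty or all zeros; both are assumptions. The function names are hypothetical.

```python
import math
from typing import Sequence


def leaf_vector(node_postorder: int, leaf_postorders: Sequence[int]) -> list[int]:
    """5d2): one component per leaf, the element's postorder number minus the leaf's."""
    return [node_postorder - leaf for leaf in leaf_postorders]


def leaf_similarity(vec_i: Sequence[float], vec_j: Sequence[float]) -> float:
    """5d3): cosine of the angle between the two numeric vectors."""
    n = max(len(vec_i), len(vec_j))
    # Assumption: pad the shorter vector with zeros so both have n components.
    a = list(vec_i) + [0] * (n - len(vec_i))
    b = list(vec_j) + [0] * (n - len(vec_j))
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

For an element with postorder number 7 whose leaves have postorder numbers 1, 2 and 5, `leaf_vector` returns `[6, 5, 2]`. Because the components are offsets from the element itself, two subtrees with the same shape produce the same vector wherever they sit in their schema trees.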

5e) Use the sibling similarity calculation method to obtain the sibling similarity value of element i and element j.

5e1) Using the sibling relationships encoded in the enhanced Prüfer sequence, search the enhanced Prüfer sequence corresponding to element i, find all siblings of element i, and form the sibling set of element i; search the enhanced Prüfer sequence corresponding to element j, find all siblings of element j, and form the sibling set of element j.

5e2) Take any element q from the sibling set of element i and obtain the linguistic similarity values between element q and all elements in the sibling set of element j.

5e3) Find the maximum of the similarity values obtained in step 5e2), and take this maximum as the similarity value between element q and the sibling set of element j.

5e4) Determine whether any elements remain in the sibling set of element i. If so, return to step 5e2); otherwise, the linguistic similarity values between all elements in the sibling set of element i and the sibling set of element j have been obtained, and step 5e5) is executed.

5e5) Add up the similarity values between all elements in the sibling set of element i and the sibling set of element j to obtain a cumulative sum.

5e6) Divide the cumulative sum obtained in step 5e5) by the larger of the numbers of elements in the two sibling sets; the quotient is the sibling similarity value of element i and element j.

5f) Use the ancestor similarity calculation method to obtain the ancestor similarity value of element i and element j.

5f1) Using the parent-child relationships encoded in the enhanced Prüfer sequence, search the enhanced Prüfer sequence corresponding to element i, find all ancestors of element i, and connect them in the order in which they were found to form the ancestor path of element i; search the enhanced Prüfer sequence corresponding to element j, find all ancestors of element j, and connect them in the order in which they were found to form the ancestor path of element j.

5f2) Treat each ancestor path as a sequence of strings in which every node name is a single unit. Use the linguistic matching method to compute the linguistic similarity values of the nodes and, based on the linguistic similarity values of all nodes in the ancestor paths, compute the edit distance between the ancestor path of element i and the ancestor path of element j. When computing the edit distance, only whether two node names are linguistically similar is considered; they are not required to be identical. For example, suppose the ancestor paths of two nodes are PO/Orders/shipTo and PO/POrders/buyer. PO and PO are identical, Orders and POrders are linguistically similar, and shipTo and buyer are not linguistically similar, so the edit distance between the two ancestor paths is 1.

5f3) Divide the edit distance between the ancestor path of element i and the ancestor path of element j obtained in step 5f2) by the larger of the ancestor path length of element i (the number of ancestor nodes it contains) and the ancestor path length of element j, to obtain a quotient.

5f4) Subtract the quotient obtained in step 5f3) from 1; the result is the ancestor similarity value of element i and element j.
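
Steps 5f2) to 5f4) can be sketched as an edit distance over whole node names, where a substitution is free when the two names are linguistically similar. The `similar` predicate is a hypothetical parameter: the text does not state how a linguistic similarity value is turned into a yes-or-no decision, so in practice it would wrap the linguistic similarity value and some threshold. Returning 1 for two empty paths is also an assumption.

```python
from typing import Callable, Sequence


def path_edit_distance(
    path_i: Sequence[str],
    path_j: Sequence[str],
    similar: Callable[[str, str], bool],
) -> int:
    """5f2): Levenshtein distance where each node name is one symbol."""
    prev = list(range(len(path_j) + 1))
    for r, a in enumerate(path_i, 1):
        cur = [r]
        for c, b in enumerate(path_j, 1):
            cost = 0 if similar(a, b) else 1  # similar names count as equal
            cur.append(min(prev[c] + 1, cur[c - 1] + 1, prev[c - 1] + cost))
        prev = cur
    return prev[-1]


def ancestor_similarity(
    path_i: Sequence[str],
    path_j: Sequence[str],
    similar: Callable[[str, str], bool],
) -> float:
    longest = max(len(path_i), len(path_j))
    if longest == 0:
        return 1.0  # assumption: two elements without ancestors
    # 5f3) and 5f4): one minus the distance normalised by the longer path.
    return 1.0 - path_edit_distance(path_i, path_j, similar) / longest
```

With a toy predicate that calls two names similar when one contains the other (ignoring case), the paths PO/Orders/shipTo and PO/POrders/buyer are at distance 1, as in the example above, which gives an ancestor similarity of 1 - 1/3, roughly 0.67.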

5g) Take the weighted average of the child similarity value, the leaf similarity value, the sibling similarity value, and the ancestor similarity value of element i and element j as the structural similarity value of element i and element j.

5h) Take the weighted average of the structural similarity value and the linguistic similarity value of element i and element j as the overall similarity value of element i and element j.

5i) Repeat steps 5c) to 5h) until the overall similarity values between all pairs of elements in the two sorted numbered Prüfer sequences are obtained.
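
The two weighted averages of steps 5g) and 5h) can be sketched as follows. The weights are not given in this section, so the equal weights used as defaults are assumptions, and the function names are hypothetical.

```python
from typing import Sequence


def weighted_average(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean; the weights need not sum to 1."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def overall_similarity(
    child: float,
    leaf: float,
    sibling: float,
    ancestor: float,
    linguistic: float,
    struct_weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),  # assumed weights
    w_struct: float = 0.5,                                       # assumed weight
) -> float:
    # 5g): structural similarity from the four structural components.
    structural = weighted_average((child, leaf, sibling, ancestor), struct_weights)
    # 5h): overall similarity from the structural and linguistic similarity.
    return weighted_average((structural, linguistic), (w_struct, 1 - w_struct))
```

With these default weights, component values of 1.0, 0.5, 0.5 and 0.0 give a structural similarity of 0.5, and a linguistic similarity of 0.8 then gives an overall similarity of 0.65.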

Step 6: Non-complex element structure matching.

For each matching element pair obtained by complex node structure matching, compute the structural similarity values between all nodes in the atom sets corresponding to the pair. Besides improving matching efficiency, this matching method can also identify complex matches. This step is implemented as follows:

6a) Take any element pair from the matching element pairs obtained by complex node structure matching, and denote the elements of the pair as element e and element f, respectively.

6b) Search the enhanced Prüfer sequences containing element e and element f, respectively, find all atoms of element e and of element f, and form the atom sets of element e and element f.

6c) Take any element c from the atom set of element e and obtain the structural similarity values between element c and all elements in the atom set of element f. The specific steps are as follows:

6c1) Take any element d from the atom set of element f.

6c2) Use the sibling similarity calculation method described in 5e) to obtain the sibling similarity value of element c and element d.

6c3) Use the ancestor similarity calculation method described in 5f) to obtain the ancestor similarity value of element c and element d.

6c4) Take the weighted average of the sibling similarity value and the ancestor similarity value of element c and element d as the structural similarity value of element c and element d.

6c5) Add the structural similarity value and the linguistic similarity value of element c and element d; the sum is the overall similarity value of element c and element d.

6c6) Determine whether any elements remain in the atom set of element f. If so, return to step 6c1); otherwise, the overall similarity values between element c and all elements in the atom set of element f have been obtained, and step 6d) is executed.

6d) Determine whether any elements remain in the atom set of element e. If so, execute step 6a); otherwise, the overall similarity values between all pairs of elements in the atom sets of element e and element f have been obtained, and step 6e) is executed.

6e) Repeat steps 6a), 6b), 6c), and 6d) until the overall similarity values between all pairs of elements in the atom sets corresponding to all matching element pairs obtained by complex node structure matching are obtained.
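
The loop structure of step 6 can be sketched as below. All names are hypothetical: `atoms_of` returns the atom set of a complex element, and the three similarity callables stand for the sibling, ancestor, and linguistic similarity calculations. The equal weighting in `w_sibling` is an assumption, since the weights of step 6c4) are not given here.

```python
from typing import Callable, Dict, Hashable, Iterable, Sequence, Tuple

Sim = Callable[[Hashable, Hashable], float]


def match_atoms(
    matched_pairs: Iterable[Tuple[Hashable, Hashable]],
    atoms_of: Callable[[Hashable], Sequence[Hashable]],
    sibling_sim: Sim,
    ancestor_sim: Sim,
    linguistic_sim: Sim,
    w_sibling: float = 0.5,  # assumed weight for step 6c4)
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Overall similarity for atom pairs under each matched complex pair."""
    scores: Dict[Tuple[Hashable, Hashable], float] = {}
    for e, f in matched_pairs:            # 6a): one matched complex pair
        for c in atoms_of(e):             # 6b), 6c): atoms of element e
            for d in atoms_of(f):         # 6c1): atoms of element f
                # 6c2) to 6c4): structural similarity of the two atoms.
                structural = (w_sibling * sibling_sim(c, d)
                              + (1 - w_sibling) * ancestor_sim(c, d))
                # 6c5): the overall value is the sum, as stated in the text.
                scores[(c, d)] = structural + linguistic_sim(c, d)
    return scores
```

Only atoms that sit under an already matched complex pair are compared, which is where the efficiency gain comes from. Because step 6c5) adds the two values rather than averaging them, the overall value in this sketch can exceed 1.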

(7) Output the matching results:

Claims

1. An Extensible Markup Language schema matching method, comprising the following steps:

(1) input the two Extensible Markup Language schema documents to be matched;

(2) construct the schema trees:

Parse the two Extensible Markup Language schema documents to be matched with the Document Object Model to generate the schema trees of the two Extensible Markup Language schema documents to be matched;

(3) construct the sequence structure:

Perform Prüfer sequence construction on each of the two schema trees to obtain two enhanced Prüfer sequences, each consisting of a numbered Prüfer sequence and a label Prüfer sequence;

(4) linguistic matching:

4a) select an arbitrary element s and an arbitrary element t from the label Prüfer sequences of the two enhanced Prüfer sequences, respectively;

4b) use the name similarity calculation method to obtain the name similarity value of element s and element t;

4c) use the data type similarity calculation method to obtain the data type similarity value of element s and element t;

4d) use the cardinality constraint similarity calculation method to obtain the cardinality constraint similarity value of element s and element t;

4e) take the weighted average of the name similarity value, the data type similarity value, and the cardinality constraint similarity value of element s and element t as the linguistic similarity value of element s and element t;

4f) repeat steps 4a) to 4e) until the linguistic similarity values between all pairs of elements in the two label Prüfer sequences are obtained;

(5) complex element structure matching:

5a) sort all nodes of the numbered Prüfer sequence in each of the two enhanced Prüfer sequences in ascending order of the nodes' postorder numbers in the schema tree;

5b) select an arbitrary element i and an arbitrary element j from the two sorted numbered Prüfer sequences, respectively;

5c) use the child similarity calculation method to obtain the child similarity value of element i and element j;

5d) use the leaf similarity calculation method to obtain the leaf similarity value of element i and element j;

5e) use the sibling similarity calculation method to obtain the sibling similarity value of element i and element j;

5f) use the ancestor similarity calculation method to obtain the ancestor similarity value of element i and element j;

5g) take the weighted average of the child similarity value, the leaf similarity value, the sibling similarity value, and the ancestor similarity value of element i and element j as the structural similarity value of element i and element j;

5h) take the weighted average of the structural similarity value of element i and element j and the linguistic similarity value obtained in step (4) as the overall similarity value of element i and element j;

5i) repeat steps 5c) to 5h) until the overall similarity values between all pairs of elements in the two sorted numbered Prüfer sequences are obtained;

5j) apply a threshold filter to the overall similarity values between all pairs of elements in the two sorted numbered Prüfer sequences to obtain all matching complex node pairs, which form the set of matching complex node pairs;

(6) non-complex element structure matching:

6a) take any element pair from the matching element pairs obtained by complex node structure matching, and denote the elements of the pair as element e and element f, respectively;

6b) search the enhanced Prüfer sequences containing element e and element f, respectively, find all atoms of element e and of element f, and form the atom sets of element e and element f;

6c) take any element c from the atom set of element e and use the non-complex element structure matching method to obtain the structural similarity values between element c and all elements in the atom set of element f;

6d) determine whether any elements remain in the atom set of element e; if so, execute step 6a); otherwise, the overall similarity values between all pairs of elements in the atom sets of element e and element f have been obtained, and step 6e) is executed;

6e) repeat steps 6a), 6b), 6c), and 6d) until the overall similarity values between all pairs of elements in the atom sets corresponding to all matching element pairs obtained by complex node structure matching are obtained;

6f) apply a threshold filter to the overall similarity values of all element pairs obtained, to obtain the matching non-complex node pairs, which form the set of matching non-complex node pairs;

(7) output the matching result:

Output the union of the set of matching complex node pairs obtained in step (5) and the set of matching non-complex node pairs obtained in step (6).

2. The Extensible Markup Language schema matching method according to claim 1, characterized in that the Prüfer sequence construction described in step (3) comprises the following steps:

Step 1: search the schema trees of the two Extensible Markup Language schema documents to be matched, and find the leaf node with the smallest postorder traversal number;

Step 2: store the leaf node with the smallest postorder traversal number that was found in the label Prüfer sequence, and at the same time store the parent node of this leaf node in the numbered Prüfer sequence;

Step 3: delete the leaf node with the smallest postorder traversal number that was found from the schema trees of the two Extensible Markup Language schema documents to be matched;

Step 4: determine whether the schema trees of the two Extensible Markup Language schema documents to be matched are empty; if not, return to step 1; otherwise, the construction of the Prüfer sequences is complete.

3. The Extensible Markup Language schema matching method according to claim 1, characterized in that the name similarity calculation method described in step 4b) comprises the following steps:

Step 1: split the names of element s and element t according to the name tokenization rule to obtain token set 1 and token set 2;

Step 2: take any element a from token set 1 and use the decision-tree method to obtain the string similarity values between element a and all elements of token set 2;

Step 3: find the maximum of the similarity values obtained in step 2, and take this maximum as the string similarity value between element a and token set 2;

Step 4: determine whether any elements remain in token set 1; if so, return to step 2; otherwise, the string similarity values between all elements of token set 1 and token set 2 have been obtained, and step 5 is executed;

Step 5: add up the string similarity values between all elements of token set 1 and token set 2 to obtain a cumulative sum;

Step 6: take any element b from token set 2 and use the decision-tree method to compute the string similarity values between element b and all elements of token set 1;

Step 7: find the maximum of the similarity values obtained in step 6, and take this maximum as the string similarity value between element b and token set 1;

Step 8: determine whether any elements remain in token set 2; if so, return to step 6; otherwise, the string similarity values between all elements of token set 2 and token set 1 have been obtained, and step 9 is executed;

Step 9: add up the string similarity values between all elements of token set 2 and token set 1 to obtain a cumulative sum;

Step 10: add the cumulative sums obtained in step 5 and step 9 to obtain the total;

Step 11: divide the total by the total number of elements in the two token sets; the quotient is the linguistic similarity value of element s and element t.

4. The Extensible Markup Language schema matching method according to claim 1, characterized in that the data type similarity calculation method described in step 4c) is as follows: determine whether the data types of element s and element t are both built-in types; if so, find the data type similarity value of element s and element t in the type similarity table; otherwise, determine whether the data types of element s and element t are both complex types; if so, the data type similarity value of element s and element t is 1; otherwise, the data type similarity value of element s and element t is 0.

5. The Extensible Markup Language schema matching method according to claim 1, characterized in that the cardinality constraint similarity calculation method described in step 4d) comprises the following steps:

Step 1: according to the values of the cardinality constraints of element s and element t, determine whether the cardinality constraints of element s and element t are both basic cardinality constraint values; if so, look up the cardinality constraint similarity table to obtain the cardinality constraint similarity value of element s and element t; otherwise, execute steps 2 to 4 to compute the cardinality constraint similarity value of element s and element t;

Step 2: compute the minimum cardinality constraint similarity value of element s and element t using the following formula:

$u = 1 - \frac{|x - y|}{x + y}$

where u is the minimum cardinality constraint similarity value of element s and element t, x is the minimum number of occurrences allowed for element s, and y is the minimum number of occurrences allowed for element t;

Step 3: compute the maximum cardinality constraint similarity value of element s and element t using the following formula:

$v = 1 - \frac{|m - n|}{m + n}$

where v is the maximum cardinality constraint similarity value of element s and element t, m is the maximum number of occurrences allowed for element s, and n is the maximum number of occurrences allowed for element t;

Step 4: compute the average of the minimum cardinality constraint similarity value and the maximum cardinality constraint similarity value of element s and element t; this average is the cardinality constraint similarity value of element s and element t.

6. a kind of expandable mark language mode matching process according to claim 1 is characterized in that step 5c) described in the performing step of child's similar value computing method as follows:

Step 1: using the parent-child relationship information contained in the enhanced Prüfer sequence, search the enhanced Prüfer sequence corresponding to element i to find all children of element i, which form the child set of element i; search the enhanced Prüfer sequence corresponding to element j to find all children of element j, which form the child set of element j;

Step 2: take any element p from the child set of element i, and obtain the overall similarity values between element p and all elements in the child set of element j;

Step 3: find the maximum among the similarity values obtained in step 2, and take this maximum as the similarity value between element p and the child set of element j;

Step 4: determine whether the child set of element i still contains unprocessed elements; if so, return to step 2; otherwise, the similarity values between all elements in the child set of element i and the child set of element j have been obtained, and step 5 is carried out;

Step 5: add up the similarity values between all elements in the child set of element i and the child set of element j to obtain a sum;

Step 6: divide the sum obtained in step 5 by the larger of the numbers of elements contained in the two child sets; the quotient is the child similarity value of element i and element j.
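The six steps above amount to a best-match average over two sets. The following Python sketch illustrates this under stated assumptions: the function name `set_similarity`, the toy element names, and the similarity table are hypothetical, and returning 0.0 for an empty set is an assumption that the claim does not cover.

```python
from typing import Callable, Sequence


def set_similarity(set_i: Sequence[str], set_j: Sequence[str],
                   sim: Callable[[str, str], float]) -> float:
    """Sum each element's best match in set_j, then divide by the larger set size."""
    if not set_i or not set_j:
        return 0.0  # Assumption: an empty child set yields no similarity.
    total = 0.0
    for p in set_i:                                  # steps 2 and 4: visit every child of i
        total += max(sim(p, q) for q in set_j)       # step 3: best match for p, step 5: sum
    return total / max(len(set_i), len(set_j))       # step 6: normalize


# Hypothetical precomputed overall similarity values between children.
table = {("title", "name"): 0.75, ("title", "price"): 0.0,
         ("author", "name"): 0.25, ("author", "price"): 0.125}
result = set_similarity(["title", "author"], ["name", "price"],
                        lambda p, q: table[(p, q)])
print(result)  # (0.75 + 0.25) / 2 = 0.5
```

Dividing by the larger set size penalizes pairs whose child sets differ greatly in size, even when every child of element i finds a good match.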

7. The Extensible Markup Language pattern matching method according to claim 1, characterized in that the leaf similarity value calculation method described in step 5d) is implemented in the following steps:

Step 1: using the parent-child and sibling relationship information contained in the enhanced Prüfer sequence, search the enhanced Prüfer sequence corresponding to element i to find all leaves of element i, which form the leaf set of element i; search the enhanced Prüfer sequence corresponding to element j to find all leaves of element j, which form the leaf set of element j;

Step 2: for each element, take the difference between the post-order number of the element in the pattern tree and the post-order number of each leaf node in its leaf set as a component of the element's numeric vector, and construct the numeric vectors of element i and element j respectively;

Step 3: using the cosine measure, calculate the similarity value of the numeric vectors of element i and element j; this similarity value is the leaf similarity value of element i and element j.
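Steps 2 and 3 can be illustrated with a short Python sketch. The function names and post-order numbers are hypothetical. The claim does not say how vectors of different lengths are compared, so padding the shorter vector with zeros is an assumption made here, as is returning 0.0 for a zero vector.

```python
import math
from typing import List, Sequence


def leaf_vector(node_postorder: int, leaf_postorders: Sequence[int]) -> List[int]:
    """Step 2: one component per leaf, the element's post-order number minus the leaf's."""
    return [node_postorder - leaf for leaf in leaf_postorders]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Step 3: cosine of the angle between the two numeric vectors."""
    size = max(len(a), len(b))
    # Assumption (not in the claim): pad the shorter vector with zeros.
    a = list(a) + [0] * (size - len(a))
    b = list(b) + [0] * (size - len(b))
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Assumption: a zero vector has no defined direction.
    return dot / (norm_a * norm_b)


vec_i = leaf_vector(5, [1, 2, 4])   # [4, 3, 1]
vec_j = leaf_vector(7, [3, 4, 6])   # [4, 3, 1]
print(cosine_similarity(vec_i, vec_j))  # approximately 1.0: same leaf layout
```

Because the components are offsets relative to the element itself, two subtrees with the same leaf layout produce the same vector even when they sit at different positions in their pattern trees.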

8. The Extensible Markup Language pattern matching method according to claim 1, characterized in that the sibling similarity value calculation method described in step 5e) is implemented in the following steps:

Step 1: using the sibling relationship information contained in the enhanced Prüfer sequence, search the enhanced Prüfer sequence corresponding to element i to find all siblings of element i, which form the sibling set of element i; search the enhanced Prüfer sequence corresponding to element j to find all siblings of element j, which form the sibling set of element j;

Step 2: take any element q from the sibling set of element i, and obtain the language similarity values between element q and all elements in the sibling set of element j;

Step 3: find the maximum among the similarity values obtained in step 2, and take this maximum as the similarity value between element q and the sibling set of element j;

Step 4: determine whether the sibling set of element i still contains unprocessed elements; if so, return to step 2; otherwise, the language similarity values between all elements in the sibling set of element i and the sibling set of element j have been obtained, and step 5 is carried out;

Step 5: add up the similarity values between all elements in the sibling set of element i and the sibling set of element j to obtain a sum;

Step 6: divide the sum obtained in step 5 by the larger of the numbers of elements contained in the two sibling sets; the quotient is the sibling similarity value of element i and element j.
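Steps 1 to 6 can be condensed into a single expression. The symbols below do not appear in the original claim and are introduced only for illustration:

SibSim(i, j) = \frac{\sum_{q \in B_i} \max_{r \in B_j} lsim(q, r)}{\max(|B_i|, |B_j|)}

Wherein, B_i and B_j denote the sibling sets of element i and element j obtained in step 1, lsim(q, r) denotes the language similarity value of elements q and r, the inner maximum corresponds to step 3, the sum corresponds to step 5, and the denominator corresponds to step 6. The expression assumes that both sibling sets are non-empty.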

9. The Extensible Markup Language pattern matching method according to claim 1, characterized in that the ancestor similarity value calculation method described in step 5f) is implemented in the following steps:

Step 1: using the parent-child relationship information contained in the enhanced Prüfer sequence, search the enhanced Prüfer sequence corresponding to element i to find all ancestors of element i, and connect all ancestors of element i in the order in which they are found to form the ancestor path of element i; search the enhanced Prüfer sequence corresponding to element j to find all ancestors of element j, and connect all ancestors of element j in the order in which they are found to form the ancestor path of element j;

Step 2: treat each ancestor path as a sequence of strings in which each node name is regarded as a single unit, and, using the language similarity values obtained by the language matching method, calculate the edit distance between the ancestor path of element i and the ancestor path of element j;

Step 3: divide the edit distance obtained in step 2 by the larger of the lengths (the number of ancestor nodes contained) of the ancestor path of element i and the ancestor path of element j to obtain a quotient;

Step 4: subtract the quotient obtained in step 3 from 1; the result is the ancestor similarity value of element i and element j.
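The following Python sketch illustrates steps 2 to 4 on ancestor paths that have already been extracted. The claim does not specify how the language similarity values enter the edit distance; the sketch assumes a substitution cost of 1 minus the language similarity and a cost of 1 for insertion and deletion. The function names, the `lsim` callback, and the treatment of two empty paths are likewise assumptions.

```python
from typing import Callable, Sequence


def path_edit_distance(path_i: Sequence[str], path_j: Sequence[str],
                       lsim: Callable[[str, str], float]) -> float:
    """Edit distance over node names, each name treated as one unit (step 2)."""
    rows, cols = len(path_i) + 1, len(path_j) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for a in range(rows):
        dist[a][0] = float(a)            # delete a leading nodes
    for b in range(cols):
        dist[0][b] = float(b)            # insert b leading nodes
    for a in range(1, rows):
        for b in range(1, cols):
            # Assumption: substituting similar names is cheap, dissimilar names costs up to 1.
            substitute = dist[a - 1][b - 1] + (1.0 - lsim(path_i[a - 1], path_j[b - 1]))
            dist[a][b] = min(dist[a - 1][b] + 1.0, dist[a][b - 1] + 1.0, substitute)
    return dist[-1][-1]


def ancestor_similarity(path_i: Sequence[str], path_j: Sequence[str],
                        lsim: Callable[[str, str], float]) -> float:
    """Steps 3 and 4: 1 - distance / length of the longer ancestor path."""
    longest = max(len(path_i), len(path_j))
    if longest == 0:
        return 1.0  # Assumption: two root elements have identical (empty) paths.
    return 1.0 - path_edit_distance(path_i, path_j, lsim) / longest


exact = lambda a, b: 1.0 if a == b else 0.0   # stand-in for the language matcher
print(ancestor_similarity(["library", "book"], ["library", "journal"], exact))  # 0.5
```

With the exact-match stand-in, the two paths differ in one of two positions, so the distance is 1 and the similarity is 1 - 1/2 = 0.5.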

10. The Extensible Markup Language pattern matching method according to claim 1, characterized in that the non-complex element structure matching described in step 6c) is implemented in the following steps:

Step 1: take an element d from the atomic set of element t;

Step 2: using the sibling similarity value calculation method described in claim 8, obtain the sibling similarity value of element c and element d;

Step 3: using the ancestor similarity value calculation method described in claim 9, obtain the ancestor similarity value of element c and element d;

Step 4: take the weighted mean of the sibling similarity value and the ancestor similarity value of element c and element d as the structural similarity value of element c and element d;

Step 5: add the structural similarity value and the language similarity value of element c and element d, and take the resulting sum as the overall similarity value of element c and element d.
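Steps 4 and 5 combine values that have already been computed. The Python sketch below follows the wording of the claim; the function names and the default weight of 0.5 are assumptions, since the claim does not state the weights of the weighted mean.

```python
def structural_similarity(sibling_sim: float, ancestor_sim: float,
                          w_sibling: float = 0.5) -> float:
    """Step 4: weighted mean of sibling and ancestor similarity.

    w_sibling is a hypothetical weight in [0, 1]; the ancestor weight is 1 - w_sibling.
    """
    return w_sibling * sibling_sim + (1.0 - w_sibling) * ancestor_sim


def overall_similarity(structural_sim: float, language_sim: float) -> float:
    """Step 5: sum of the structural and language similarity values, as the claim words it."""
    return structural_sim + language_sim


s = structural_similarity(0.5, 1.0)   # 0.5 * 0.5 + 0.5 * 1.0 = 0.75
print(overall_similarity(s, 0.25))    # 0.75 + 0.25 = 1.0
```

Because step 5 is a plain sum, the overall value can exceed 1 when both inputs are high, so any filtering threshold applied afterwards has to be chosen on the same scale.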