CN107844482A - Multi-data-source schema matching method based on a global ontology - Google Patents

Publication number
CN107844482A
Authority
CN
China
Prior art keywords
similarity
relation
pattern
data source
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610826714.7A
Other languages
Chinese (zh)
Inventor
杨卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University
Priority to CN201610826714.7A
Publication of CN107844482A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G06F16/835 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the field of data source schema matching and relates to a multi-data-source schema matching method based on a global ontology. The method comprises: converting the multiple schemas to be matched into a unified data model, the schema ontology; matching each converted schema ontology against the global ontology using several schema matching algorithms, and combining the results of those algorithms to obtain the similarity relations between the elements of each schema ontology and the global ontology, represented as similarity matrices; and finally applying an aggregation strategy that, based on the transitivity of similarity relations, aggregates the similarity relations obtained above to produce the matching results between every pair of data source schemas. The invention addresses the need for pairwise schema matching among multiple data source schemas within an enterprise or between enterprises, can significantly improve the quality and efficiency of multi-data-source schema matching, and has good extensibility.

Description

Multi-data-source schema matching method based on a global ontology
Technical field
The invention belongs to the technical field of data source schema matching, and in particular relates to a multi-data-source schema matching method based on a global ontology. The method is suited to situations in which multiple data source schemas within an enterprise or between enterprises must be matched pairwise.
Background art
With the rapid development of Internet information technology, enterprises have built their own data management systems. The data involved are represented in different ways and stored in different forms in different storage systems, and the schemas of these data sources differ greatly in syntax and structure, which hinders the exchange and sharing of data. As cooperation between autonomous domains or enterprises becomes closer, information fusion and sharing become more diverse, and the parts of an enterprise become more tightly connected. To improve the utilization of data resources and make full use of existing ones, the heterogeneity among data resources urgently needs to be resolved. The rapid development of network and information technology has driven explosive growth in data volume, and data representation and storage forms have also diversified, for example relational databases, XML and ontologies. This heterogeneity and distribution of data, together with the ambiguity of concepts and attributes, makes cooperation, information exchange and sharing among an enterprise's business systems more difficult. In practice, such heterogeneous, distributed data often have to be integrated according to some requirement so that they can be shared, exchanged and analyzed. The problem common to these applications is how to find semantic correspondences between the elements of two heterogeneous schemas, that is, schema matching. Schema matching currently plays an important role in many application fields, such as data integration, data exchange, e-commerce, data warehousing and the Semantic Web, and is a basic task in their research and development. In-depth study of the schema matching problem therefore has real practical significance.
In the prior art, schema matching was at first used mainly for schema integration. Schema integration constructs a global schema over a set of individual data source schemas. Because different schemas are developed independently and their developers have their own understanding and ideas, each schema has its own structure and terminology, and the differences are even more pronounced between schemas from different domains. The first step of any schema integration effort is therefore to establish correspondences between each source schema and the global schema, which is exactly schema matching.
In the 1990s, as schema integration became a popular topic, the concept of the data warehouse emerged. A data warehouse is a decision-support database that sits on top of a series of data sources. For many enterprises in particular, as data sources multiply and data volumes grow, data warehouses make convenient, fast and accurate decision analysis possible. Extracting data from the data sources into the warehouse requires schema conversion between each data source and the warehouse, that is, matching operations between elements. Some studies use schema matching operations to establish an initial mapping between each data source schema and the data warehouse schema, and then complete data extraction and conversion according to the mapping and the specific element semantics.
In recent decades, the emergence of e-commerce has promoted the development of schema matching. In e-commerce, the message formats used by the two trading parties often differ; the differences show in the syntax and structure of the formats and in the different interactive systems that send the messages. For the two parties to interact smoothly, message formats must be converted, which involves differences in the element names, data types, value constraints and structure of the message schemas. This problem of converting between different message formats is, to some extent, also schema matching.
Practice shows that different schemas have their own independence and differences. Even different data schemas within the same enterprise differ, and the differences are more pronounced across domains: the schemas usually have different terminology, structure and semantics. There has been some research on schema matching, but fully automated schema matching has not been achieved. The matching process still requires substantial manual involvement, which consumes a great deal of labor and time and is error-prone. Current schema matching research cannot guarantee finding all correspondences between the source schema and the target schema, nor that the correspondences found are correct. Maximizing the automation of the matching process and the accuracy of its results has become the goal of most research.
Research on schema matching began in the 1980s and at first concentrated largely on data integration. As research on schema integration deepened, the data warehouse field also began to use schema matching. The rapid development of e-commerce and network technology has brought schema matching research increasing attention.
Schema matching research has developed slowly in China and is concentrated mainly abroad. Most studies apply only to a specific domain, and cross-domain applicability is poor. References [8] and [9] give several generic schema matching methods. With the widespread use of relational databases, XML, ontologies and the like, schema matching among them has become a research focus.
Early schema matching research targeted database schemas, and the matching methods used mainly included element name matching and comparison of value domains and data types. From the early 1990s, the conceptual framework and technical methods of automatic schema matching began to be established. For the most part, research on problems in other fields drove the development of schema matching; only some algorithms were developed, and mature matching systems were lacking.
By the late 1990s, technologies such as machine learning, ontology reasoning and graph theory promoted the rapid development of schema matching research. Representative prototype systems such as LSD, Cupid and Clio in particular brought schema matching more attention. Schema matching was no longer applied only to data integration; some more mature matching systems also appeared, and the number of applicable fields grew markedly.
At the beginning of the 21st century, schema matching research entered a phase of refinement. Many prototype systems combining multiple matching methods appeared, greatly improving matching precision and automation and broadening the application of schema matching to research areas such as XML document conversion and XML Schema clustering.
Many schema matching prototype systems now exist. The matching methods and techniques they use vary, and each has its own strengths, but they also have shortcomings. Their applicable domains are limited, and the matching process requires complete schema information and data information. The matching process also still requires substantial manual involvement and is time-consuming, laborious and error-prone. Current schema matching research cannot guarantee finding all mappings between the source schema and the target schema, nor that all mappings found are correct. A highly automated, widely applicable schema matching method is therefore needed.
As a research hotspot, schema matching has produced a series of matching methods and systems, for example: SemInt and LSD, based on machine learning; Cupid, which computes matches at the element level and structure level of the schema; SF, which uses a directed labeled graph structure and a structural matching algorithm; COMA/COMA++, which combine multiple types of matchers through a combination strategy; and iMAP, which can effectively find both 1:1 matches and complex matches. These systems and methods mine match information between schemas from different angles and with different information, but the matching process still requires substantial manual involvement and is time-consuming and laborious. When multiple data source schemas all have to be matched against one another pairwise, the workload is even greater and the process more time-consuming and inefficient.
At present, most schema matching approaches still have the following problems. In practical applications, complete schema information for the data cannot be obtained; the information expressed by a schema cannot be fully extracted during matching; and a data source's schema cannot fully express the true semantics of that data source. Representing the mappings between elements only by the similarity of element names in the data schemas is therefore inaccurate. The structural information of the schema and the information in the data instances it describes should usually also be considered, and matching relations should be extracted and verified from as many aspects as possible. Few current systems use both schema information and instance information. In some applications, manual assistance and domain knowledge are essential to improving schema matching quality. Existing schema matching methods and systems cannot guarantee finding all correct schema correspondences, still less that the correspondences found are correct. Because schema structure and information are complex, most schema matching is subjective; it can reduce the user's workload only to some extent, and the matching results still require further verification by the user.
Based on the state of the prior art, the inventors of the present application propose a multi-data-source schema matching method based on a global ontology. The method is suited to situations in which multiple data source schemas within an enterprise or between enterprises must be matched pairwise.
Prior art related to the present invention includes:
[1] Fausto G, Pavel S. Semantic Matching [J]. In the Knowledge Review journal, 2004, 18(3): 265-280.
[2] Bernstein P A, Madhavan J, Rahm E. Generic schema matching, ten years later [C]. Proceedings of the VLDB Endowment, 2011, 4(11): 695-701.
[3] Batini C, Lenzerini M, Navathe S B. A comparative analysis of methodologies for database schema integration [C]. ACM Comput Surv, 1986, 18(4): 323-364.
[4] Sheth A P, Larson J A. Federated database systems for managing distributed, heterogeneous, and autonomous databases [C]. ACM Comput Surv, 1990, 22(3): 183-236.
[5] Parent C, Spaccapietra S. Issues and approaches of database integration. CACM, 1998, 41(5): 166-178.
[6] Bernstein P A, Rahm E. Data warehouse scenarios for model management [C]. In: Proc 19th Int Conf on Entity-Relationship Modeling, Lecture Notes in Computer Science, vol. 1920. Springer, Berlin Heidelberg New York, 2000, 1-15.
[7] Milo T, Zohar S. Using Schema Matching to Simplify Heterogeneous Data Translation [C]. VLDB Conference, 1998, 122-133.
[8] Li Y, Liu D, Zhang W. A Generic Algorithm for Heterogeneous Schema Matching [J]. International Journal of Information Technology, 2003, 9(1): 10-15.
[9] Mitra P, Wiederhold G, Kersten M. A Graph-Oriented Model for Articulation of Ontology Interdependencies [J]. Lecture Notes in Computer Science, 2000, 1777: 86-100.
[10] Su H, Harumi K, Elke A. Rundensteiner. Automating the Transformation of XML Documents [A]. In: Proc. 3rd In Workshop on Web Information and Data Management (WIDM), 2001, 68-75.
[11] Li W S, Clifton C. SEMINT: a tool for identifying attribute correspondences in heterogeneous databases using neural networks [C]. Data and Knowledge Engineering, 2000, 33(1): 49-84.
[12] Doan A, Domingos P, Halevy A Y. Reconciling schemas of disparate data sources: a machine-learning approach [C]. In: Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, 2002, 509-520.
[13] Madhavan J, Bernstein P A, Rahm E. Generic schema matching with Cupid [C]. In: Proceedings of the 27th International Conference on Very Large Data Bases. San Francisco, USA: Morgan Kaufmann Publishers, 2001, 49-58.
[14] Melnik S, Garcia-Molina H, Rahm E. Similarity flooding: a versatile graph matching algorithm and its application to schema matching [C]. In: Proceedings of the 18th International Conference on Data Engineering. San Jose, California: IEEE, 2002, 117-128.
[15] Do H H, Rahm E. COMA: A system for flexible combination of schema matching approaches [C]. In: Proceedings of the 28th International Conference on Very Large Data Bases. Hong Kong, China: VLDB, 2002, 610-621.
[16] Aumueller D, Do H H, Massmann S, et al. Schema and ontology matching with COMA++ [C]. SIGMOD Conference, 2005, 906-908.
[17] Dhamankar R, Lee Y, Doan A H. iMAP: Discovering Complex Semantic Matches between Database Schemas [C]. SIGMOD Conference, 2004: 383-394.
[18] Gennari J, Musen M A, Fergerson R W, et al. The Evolution of Protégé: An Environment for Knowledge-Based Systems Development. Stanford University, 2002.
[19] Gruber T R. Toward principles for the design of ontologies used for knowledge sharing [J]. International Journal of Human-Computer Studies, 1995.
[20] Mike U, Michael G. Ontologies: Principles, methods and applications [J]. Knowledge Engineering Review, 1996, 11(2): 93-155.
[21] Cruz F I, Xiao H, Hsu F. An Ontology-based Framework for Semantic Interoperability between XML Sources [C]. Eighth International Database Engineering & Applications Symposium, 2004, 217-226.
[22] Jeong B, Lee D, Cho H, et al. A novel method for measuring semantic similarity for XML Schema matching [J]. Expert Systems with Applications, 2008, 34(3): 1651-1658.
[23] Wu Z, Palmer M. Verb Semantics and Lexical Selection [C]. Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, 1994: 133-138.
[24] Lee J S, Lee K H. Computing Simple and Complex Matchings Between XML Schemas for Transforming XML Documents [J]. Information and Software Technology, 2006, 48: 937-946.
[25] Euzenat J, Le Bach T, Barrasa J, et al. State of the Art on Ontology Alignment [C]. Knowledge Web project deliverable D,2.
Summary of the invention
The object of the invention is to address the state of the prior art by providing a multi-data-source schema matching method based on a global ontology. The method is suited to situations in which multiple data source schemas within an enterprise or between enterprises must be matched pairwise.
The invention starts from the observation that the business systems within an enterprise are increasingly interconnected and that data fusion and sharing are increasingly diverse, so that the heterogeneous data source schemas of the business systems often have to be matched against one another pairwise, which involves a heavy workload, takes a long time and is hard to maintain. The keyword search method for XML data streams is implemented using DTD. The invention can handle schema matching among multiple heterogeneous schemas of different types and markedly improves the quality and efficiency of multi-data-source schema matching. It also has good extensibility and maintainability: when a new data source schema is added to the matching, or an existing schema changes, that schema only needs to be matched against the global ontology once. Experiments show that the method achieves high matching quality while substantially reducing workload and time.
The architecture of the proposed multi-data-source schema matching method based on a global ontology is shown in Fig. 1. The specific steps are:
(1) If no global ontology or domain ontology exists for the domain of the data sources to be matched, build the corresponding domain ontology with an ontology construction tool such as Stanford University's Protégé;
(2) Use the schema converter to convert the heterogeneous data source schemas to be matched into schema ontologies; if the schema of a data source does not exist, first extract it from the data source;
(3) Use the name normalization processor to process the element names in the schema ontologies and the global ontology;
(4) Use each schema matcher in the matcher library to match each schema ontology against the global ontology. Each matcher outputs one similarity matrix, so multiple matchers produce multiple similarity matrices; the matcher combination strategy assigns different weights to the different matchers and yields the similarity matrix between each schema ontology and the global ontology;
(5) Use the similarity matrix aggregator to aggregate pairwise the similarity matrices obtained in step (4), which yields the similarity matrix between any two data source schemas;
(6) Drawing on background knowledge and domain knowledge, manually inspect, optimize and verify the matching results, and then adjust the weights of the different matchers in the matcher combination strategy according to user feedback.
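Steps (4) and (5) can be illustrated with a minimal Python sketch. The text above does not give the combination or aggregation formulas, so the sketch assumes a weighted average for the matcher combination and a max-product composition through the global ontology as one way to exploit transitivity. The function names `combine_matchers` and `aggregate`, the weights and the sample matrices are all hypothetical.

```python
from typing import List

Matrix = List[List[float]]

def combine_matchers(matrices: List[Matrix], weights: List[float]) -> Matrix:
    """Weighted average of same-shaped similarity matrices, one per matcher."""
    if not matrices or len(matrices) != len(weights):
        raise ValueError("need one weight per matcher matrix")
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
         for j in range(cols)]
        for i in range(rows)
    ]

def aggregate(sim_a: Matrix, sim_b: Matrix) -> Matrix:
    """Similarity between elements of schema A and schema B via the global ontology.

    sim_a[i][k]: similarity of A's element i to global concept k.
    sim_b[j][k]: similarity of B's element j to global concept k.
    Assumption: transitivity is approximated by max-product composition.
    """
    return [
        [max(a * b for a, b in zip(row_a, row_b)) for row_b in sim_b]
        for row_a in sim_a
    ]

# Hypothetical data: schema A has 2 elements, the global ontology 2 concepts.
name_sim = [[1.0, 0.0], [0.2, 0.6]]      # output of an assumed name matcher
struct_sim = [[0.5, 0.5], [0.0, 1.0]]    # output of an assumed structure matcher
sim_a = combine_matchers([name_sim, struct_sim], [3.0, 1.0])

# Schema B has 1 element, already matched against the global ontology.
sim_b = [[0.0, 1.0]]
sim_ab = aggregate(sim_a, sim_b)         # 2 x 1 matrix: A elements vs. B elements
```

Under this design each schema is matched against the global ontology once, and any pair of schemas is then related by composing the two stored matrices, which is what keeps the cost of adding a new schema low.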
1. In the present invention, ontologies can be applied to different domains, and the way they are applied also differs. Ontology construction generally has to follow certain principles so that ontology knowledge can be shared and can interoperate between different applications. The invention adopts the five principles proposed by Gruber in 1995:
(1) Clarity, objectivity and formalization: an ontology should effectively convey the intended meaning of the terms it defines. Although an ontology may be defined in response to social or technical needs, its definitions should be independent of the social or computational context, that is, formalized. Where a definition can be stated as logical axioms, it should be as complete as possible, asserting necessary and sufficient conditions. Once the ontology's definitions are complete, they should be documented in natural language;
(2) Completeness: the definitions of the ontology should be complete, that is, able to fully express the meaning of all defined terms or concepts;
(3) Consistency: an ontology should be consistent, that is, conclusions inferred from the ontology should agree with the ontology's own definitions. At a minimum the defining axioms must be logically consistent, and the informally defined parts, such as documentation written in natural language, must be consistent as well. If a conclusion inferred from the axioms of the ontology conflicts with a definition or an example, the ontology is inconsistent;
(4) Maximum monotonic extensibility: when an ontology has been built but is still incomplete and further terms have to be added, the content already defined should as far as possible remain unchanged;
(5) Minimal commitment: an ontology should make as few commitments as possible, giving only as many constraints as are needed to meet the specific sharing requirement, so that other parties sharing it can instantiate and specialize it.
2. In the present invention, data source schemas are converted into schema ontologies.
Matching heterogeneous schemas first requires constructing a unified mathematical model for the schemas. To address mutual matching among multiple data source schemas, the invention uses a method based on a global ontology, so the ontology serves as the unified mathematical model of every source schema. The first step of the method is to construct the corresponding schema ontology for each schema.
Converting different types of schema into schema ontologies requires different construction methods. The following describes in detail how schema ontologies are formed from XML Schema and from relational schemas; the approach applies equally to other types of schema. OWL is the ontology description language standard for the Web recommended by the W3C; the invention uses it as the ontology description language and uses Protégé to create and maintain ontologies.
2.1 Building the schema ontology of a relational schema
Relational models are usually constructed from ER diagrams. A relational model typically contains multiple relation tables, and different tables play different roles in the ER diagram: some describe entities, while others represent relationships between entities. The key to building a schema ontology from a relational schema lies in analyzing the information the relational schema contains, so the schema ontology of a relational schema can be built according to the following rules.
The invention describes a relational schema formally as in Definition 1 below:
Definition 1. A relational schema $S$ can be represented by a two-tuple $\langle R_S, \Sigma_S \rangle$, where $R_S$ is the set of all relations contained in $S$, and $\Sigma_S$ is the set of constraints on and between the relations of $S$, such as entity integrity constraints and referential integrity constraints. Any relation $R \in R_S$ is abbreviated $R(A_1:T_1, A_2:T_2, \ldots, A_n:T_n)$, where $R$ is the relation name (that is, the table name), $A_i$ ($1 \le i \le n$) is an attribute of $R$, and $T_i$ is the data type of that attribute.
The conversion of a relational schema into its schema ontology can be expressed as a function $f: \langle R_S, \Sigma_S \rangle \to O_S$, where $\langle R_S, \Sigma_S \rangle$ is the relational schema $S$ to be converted and $O_S$ is the schema ontology after conversion.
The conversion rules between relational schema $S$ and schema ontology $O_S$ are shown in Table 1. A relation $R$ in $S$ is defined as a class (owl:Class) in $O_S$. A non-foreign-key attribute $A_i$ of $R$ is defined as an owl:DatatypeProperty in $O_S$, whose range is defined according to the data type $T_i$ of $A_i$. A foreign-key attribute $A_j$ of $R$ is defined as an owl:ObjectProperty in $O_S$, and the rdfs:domain and rdfs:range of that owl:ObjectProperty express the reference relation between $R$ and the relation table it is associated with.
If a relation table $R$ contains only primary-key attributes and the primary key consists of exactly two foreign keys, $R$ serves only to describe a many-to-many relationship between two other relation tables. In that case no class needs to be defined for $R$ in $O_S$; it suffices to create two owl:ObjectProperty in $O_S$ for the two foreign-key attributes and to declare them inverse to each other with owl:inverseOf.
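The conversion rules above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Table` structure, the property naming scheme `Table.column`, the tuple layout `(kind, name, domain, range)` and the choice of domain and range for the link-table properties are hypothetical, and SQL type names are kept as ranges instead of being mapped to XSD datatypes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Table:
    name: str
    columns: Dict[str, str]            # attribute name -> SQL data type
    primary_key: List[str]
    foreign_keys: Dict[str, str] = field(default_factory=dict)  # attribute -> referenced table

Axiom = Tuple[str, ...]

def is_pure_link_table(t: Table) -> bool:
    """True when the table holds only a primary key made of exactly two foreign keys."""
    return (set(t.columns) == set(t.primary_key) == set(t.foreign_keys)
            and len(t.foreign_keys) == 2)

def to_schema_ontology(tables: List[Table]) -> List[Axiom]:
    axioms: List[Axiom] = []
    for t in tables:
        if is_pure_link_table(t):
            # Many-to-many link table: no class, two mutually inverse object properties.
            (fk1, ref1), (fk2, ref2) = t.foreign_keys.items()
            p1, p2 = f"{t.name}.{fk1}", f"{t.name}.{fk2}"
            axioms += [("owl:ObjectProperty", p1, ref1, ref2),
                       ("owl:ObjectProperty", p2, ref2, ref1),
                       ("owl:inverseOf", p1, p2)]
            continue
        axioms.append(("owl:Class", t.name))
        for col, sql_type in t.columns.items():
            if col in t.foreign_keys:
                # Foreign key: object property from this class to the referenced class.
                axioms.append(("owl:ObjectProperty", f"{t.name}.{col}",
                               t.name, t.foreign_keys[col]))
            else:
                # Other attribute: datatype property whose range follows the column type.
                axioms.append(("owl:DatatypeProperty", f"{t.name}.{col}",
                               t.name, sql_type))
    return axioms

# Hypothetical schema: one entity table and one many-to-many link table.
student = Table("Student", {"id": "INTEGER", "name": "VARCHAR"}, ["id"])
enrollment = Table("Enrollment",
                   {"student_id": "INTEGER", "course_id": "INTEGER"},
                   ["student_id", "course_id"],
                   {"student_id": "Student", "course_id": "Course"})
axioms = to_schema_ontology([student, enrollment])
```

A real converter would serialize these tuples as OWL (for example RDF/XML) so that the result can be opened in Protégé; the tuple form is used here only to keep the sketch self-contained.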
More specifically, the invention represents an ontology as a five-tuple $O = (C, I, R, F, A)$ [25], where $C$ is the set of concepts in the ontology, $I$ the instances corresponding to the concepts, $R$ the set of relations between concepts, $F$ the set of functions over the concepts, and $A$ the set of axioms of the ontology. Specifically:
(1) Concept
A concept in an ontology, also called a class (Class), can refer to anything, such as an animal, a behavior or a function; a class corresponds to a level in a taxonomic hierarchy;
(2) Relation
A relation is an interaction between concepts in the ontology. Formally, a relation among $n$ concepts is a subset of the $n$-dimensional Cartesian product $C_1 \times C_2 \times \cdots \times C_n$, and a relation may exist between any two concepts. The relations include inheritance, inclusion and synonymy;
(3) Function
A function is a special kind of relation in which one element is uniquely determined by the other $n-1$ elements. Formally, $F: C_1 \times C_2 \times \cdots \times C_{n-1} \to C_n$. For example, if childOf is a function, childOf($c_i$, $c_j$) states that $c_j$ is the child of $c_i$;
(4) Axiom
An axiom is an assertion that always holds, for example that one concept always lies within the scope of another concept;
(5) Instance
An instance is a concrete entity or instance object of some concept in the ontology.
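As a rough illustration, the five-tuple can be held in a small Python data structure. This is a sketch under assumptions, not the representation used by the invention: relations are restricted to named binary triples, axioms are modeled as predicates over the ontology, and the class name `Ontology`, the method `is_consistent` and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

@dataclass
class Ontology:
    concepts: Set[str] = field(default_factory=set)                    # C
    instances: Dict[str, Set[str]] = field(default_factory=dict)       # I: concept -> instances
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)  # R: (name, c1, c2)
    functions: Dict[str, Callable] = field(default_factory=dict)       # F
    axioms: List[Callable] = field(default_factory=list)               # A: predicates on the ontology

    def is_consistent(self) -> bool:
        """True when every axiom holds for the current content."""
        return all(axiom(self) for axiom in self.axioms)

def endpoints_declared(o: Ontology) -> bool:
    """Sample axiom: every relation connects two declared concepts."""
    return all(c1 in o.concepts and c2 in o.concepts for _, c1, c2 in o.relations)

onto = Ontology(
    concepts={"Animal", "Dog"},
    instances={"Dog": {"rex"}},
    relations={("subClassOf", "Dog", "Animal")},   # an inheritance relation
    axioms=[endpoints_declared],
)
```

Calling `onto.is_consistent()` checks the sample axiom against the stored relations; adding a relation that mentions an undeclared concept makes the check fail.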
2.2 Building the schema ontology of an XML Schema
An XML Schema is itself a well-formed XML document. It defines the legal structure of a class of XML documents, including description rules such as the text structure, data types and other constraints of the XML document. The basic components of an XML Schema are elements (Element) and attributes (Attribute). An attribute can only have a simple data type, whereas an element has one of two types: simple type (SimpleType) or complex type (ComplexType). A simple type contains only content, while a complex type can contain other attributes or elements. The conversion rules between an XML Schema and its schema ontology are shown in Table 2: a complex type in the XML Schema is defined as a class (owl:Class) in the ontology, and simple types and attributes in the XML Schema are defined as owl:DatatypeProperty in the ontology, whose range is defined according to the data type of the corresponding simple type or attribute.
3. In the present invention, name normalization is performed as follows:
Different schemas are developed independently, and developers usually name elements according to their own habits, which often differ from the standard spelling of element names in the domain. In practice, element names frequently contain special characters, prepositions and other common words, plural forms and abbreviations. Element names are therefore normalized using natural language techniques before schema matching is carried out. The present invention performs normalization in the following 3 steps:
(1) Compound splitting: a tokenizer splits the element name string into a set of independent words according to hyphens, spaces, punctuation marks, letter case, digits and the like, e.g. First_Name → (First, Name);
(2) Word reduction: words in upper case, plural form, abbreviated form and the like are reduced to their base form, e.g. id → identifier, NAME → name;
(3) Stop-word removal: prepositions, conjunctions and similar words are deleted from the word set, e.g. (longitude, in, city) → (longitude, city);
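The three normalization steps above can be sketched as follows. This is a rough illustration under stated assumptions: the `ABBREVIATIONS` and `STOP_WORDS` tables and the naive plural rule are hypothetical stand-ins for the natural language resources a real name normalization processor would use.

```python
import re
from typing import List

# Hypothetical lookup tables; a real system would use a domain-specific dictionary.
ABBREVIATIONS = {"id": "identifier", "num": "number", "tel": "telephone"}
STOP_WORDS = {"in", "of", "the", "and", "or", "for", "to", "by"}

def split_compound(name: str) -> List[str]:
    """Step 1: split on separators, case changes and letter/digit boundaries."""
    spaced = re.sub(
        r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])",
        " ",
        name,
    )
    return [token for token in re.split(r"[\s_\-.,;:/]+", spaced) if token]

def reduce_word(token: str) -> str:
    """Step 2: lower-case, expand abbreviations, strip a simple plural ending."""
    word = ABBREVIATIONS.get(token.lower(), token.lower())
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith(("ss", "us", "is")) and len(word) > 3:
        return word[:-1]
    return word

def normalize(name: str) -> List[str]:
    """Steps 1 to 3: split, reduce, then drop stop words."""
    words = [reduce_word(token) for token in split_compound(name)]
    return [word for word in words if word not in STOP_WORDS]
```

With these tables, `normalize("First_Name")` gives `["first", "name"]` and `normalize("longitude in city")` gives `["longitude", "city"]`. The plural rule is deliberately crude; a lemmatizer would be the more robust choice.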
After normalization, an element name becomes a word set, and the name similarity between two elements becomes the similarity between two word sets. Different words in a word set usually differ in importance within the whole schema. The TF/IDF (term frequency-inverse document frequency) algorithm from the field of information retrieval can be used to measure the weight of each word; according to the importance of the words, the similarity of each word pair is given a different weight, and the similarity between the two word sets is obtained by weighted summation.
4. In the present invention, the similarity between a schema ontology and the global ontology is calculated as follows:
Schema matching is carried out between ontologies to establish the semantic associations between them and to obtain the mapping relations between their elements; the key is to calculate the similarity between elements of different ontologies. Following the classification of schema matching methods proposed by Rahm and Bernstein, the present invention mainly uses element-level and structure-level schema matching methods. Element-level matching mainly considers the syntactic similarity and semantic similarity of element names and the similarity of element data types. Structure-level matching mainly considers the contextual relations between an element and other related elements. Element-level matchers are mostly simple matchers; structure-level matching relies on the element-level matching results, can further improve matching accuracy, and is a hybrid matcher. The pseudo-code of the corresponding matching algorithm is shown in Table 3 below;
4.1 Element name similarity
When the similarity of schema elements is calculated, the element name plays an important role among the input schema information, and its similarity carries a larger weight. Because designers understand domain knowledge differently and schema elements can be named flexibly, the names given to the same concept are not fully consistent. Two kinds of similarity are mainly calculated for element names: syntactic similarity and semantic similarity. The syntactic similarity of names is calculated with string matching techniques such as edit distance, N-gram, prefix and suffix; each technique is applied to the element names and the maximum of the results is taken as the syntactic similarity. The semantic similarity is the similarity of element names calculated with the aid of an ontology (such as WordNet or a domain ontology). The overall similarity of element names is calculated from the two; the usual approaches are to take the larger of the two values or to weight them with different weights determined by experimental analysis;
4.1.2 Syntactic similarity of element names
When the similarity of schema elements is calculated, the element name plays an important role among the input schema information, and its similarity carries a larger weight. Two kinds of similarity are mainly calculated for element names: syntactic similarity and semantic similarity. The syntactic similarity of element names is usually calculated with string-based matching techniques; the semantic similarity is usually calculated with the aid of an ontology or a dictionary (such as WordNet). The overall similarity of element names is calculated from the two, either by taking the larger of the two values or by weighting them with different weights determined by experimental analysis;
4.1.3 Syntactic similarity
There are several syntactic similarity algorithms based on element name strings, such as edit distance, N-gram, prefix and suffix. Each of them can be applied, and the maximum of the results taken as the syntactic similarity of the two strings. The present invention uses a similarity algorithm based on edit distance. The algorithm takes two name strings as input, calculates their edit distance and normalizes it to the range [0, 1]. The calculation formula is as follows:
Sim_gram(str1, str2) = 1 - editDistance(str1, str2) / max(|str1|, |str2|) (1)
where str1 and str2 are the two strings, editDistance(str1, str2) is the edit distance between the two strings, |str1| and |str2| are the lengths of str1 and str2, and max(|str1|, |str2|) is the larger of the two lengths,
For example, for the two pairs of name strings telephone and phone, and number and streetNum, the syntactic similarities calculated according to formula (1) are respectively:
Sim_gram(telephone, phone) = 0.56;
Sim_gram(number, streetNum) = 0.11;
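The two example values can be reproduced with a short sketch of edit-distance similarity. It assumes the standard Levenshtein distance (unit cost for insertion, deletion and substitution) and a case-sensitive comparison; the function names are illustrative only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def sim_gram(a: str, b: str) -> float:
    """Edit distance normalized by the longer string length, in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty names are treated as identical
    return 1.0 - edit_distance(a, b) / longest

print(round(sim_gram("telephone", "phone"), 2))   # 0.56
print(round(sim_gram("number", "streetNum"), 2))  # 0.11
```

The distance between telephone and phone is 4 (delete "tele"), giving 1 - 4/9; the distance between number and streetNum is 8, giving 1 - 8/9.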
4.2 Semantic similarity of element names
Syntactic similarity algorithms can calculate the similarity of string pairs such as id and identifier, or telephone and phone, but they are not suitable for pairs such as location and address, or title and headline, whose semantic similarity must be calculated with the help of a dictionary or similar resource;
WordNet organizes all words in one or more tree-like hierarchies. The main relations between words are synonymy, part-whole relations and hypernym-hyponym relations. The semantic similarity of two words can be calculated from their nearest common ancestor node, their depths, the concept path length and so on, according to formula (2),
where p is the nearest common ancestor node of w1 and w2, and depth(w) is the depth of word w in WordNet, i.e. the path length from the root node to w;
If Synset(w1) = {w1i | i = 1, 2, …, m} and Synset(w2) = {w2j | j = 1, 2, …, n} are the synonym sets of w1 and w2 in WordNet, then the semantic similarity between w1 and w2 is defined as:
Sim_sema(w1, w2) = max_{1≤i≤m, 1≤j≤n} Sim(w1i, w2j) (3)
where w1i ∈ Synset(w1), w2j ∈ Synset(w2), and Sim(w1i, w2j) is calculated according to formula (2);
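The following sketch illustrates formulas (2) and (3) on a toy hierarchy. It assumes that formula (2) takes the Wu-Palmer form 2·depth(p) / (depth(w1) + depth(w2)), which uses only the quantities p and depth(w) defined above; that form, the `PARENT` hierarchy and the `SYNSETS` table are assumptions for illustration and do not reproduce WordNet itself.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: each concept maps to its parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "communication": "entity",
    "device": "entity",
    "telephone": "device",
    "radio": "device",
    "message": "communication",
}

# Hypothetical synonym table: each word maps to the concepts it may denote.
SYNSETS: Dict[str, List[str]] = {
    "telephone": ["telephone"],
    "phone": ["telephone"],
    "wireless": ["radio"],
    "note": ["message"],
}

def path_to_root(node: str) -> List[str]:
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path  # node first, root last

def depth(node: str) -> int:
    """Path length from the root to the node (the root has depth 0)."""
    return len(path_to_root(node)) - 1

def nearest_common_ancestor(a: str, b: str) -> str:
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes are not in the same hierarchy")

def sim_concept(a: str, b: str) -> float:
    """Assumed form of formula (2): Wu-Palmer similarity."""
    total = depth(a) + depth(b)
    if total == 0:
        return 1.0  # both nodes are the root
    return 2.0 * depth(nearest_common_ancestor(a, b)) / total

def sim_sema(w1: str, w2: str) -> float:
    """Formula (3): maximum over all pairs drawn from the two synonym sets."""
    return max(sim_concept(c1, c2) for c1 in SYNSETS[w1] for c2 in SYNSETS[w2])
```

In this toy setup `sim_sema("telephone", "phone")` is 1.0 because both words map to the same concept, which mirrors the value reported for that pair in the text.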
After the syntactic similarity and the semantic similarity of the element names have been calculated, the element name similarity Sim_name is the maximum of Sim_gram and Sim_sema;
For example, the semantic similarities of telephone and phone, and of number and streetNum, calculated according to formula (3) are respectively: Sim_sema(telephone, phone) = 1.0;
Sim_sema(number, streetNum) = 0.66;
Considering both the syntactic and the semantic similarity of the name strings and taking the maximum, their name similarities are respectively:
Sim_name(telephone, phone) = 1.0;
Sim_name(number, streetNum) = 0.66;
Formulas (2) and (3) above calculate the semantic similarity of two words, whereas normalizing an element name usually yields a set of words, so the semantic similarity between two word sets is needed. The semantic similarity between each pair of words in the two word sets is obtained first. These similarities can then be averaged directly, or each word pair can be given a different weight according to its importance and the weighted sum taken, which yields the similarity value between the two word sets. The specific algorithm is shown in Table 4 below:
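The word-set similarity described above can be sketched as a weighted average over all word pairs. The weighting scheme (product of the two word weights, defaulting to 1.0) and the function name are assumptions; in practice the weights could come from TF/IDF and `pair_sim` from formula (3).

```python
from typing import Callable, Dict, Iterable, Optional

def word_set_similarity(
    words_a: Iterable[str],
    words_b: Iterable[str],
    pair_sim: Callable[[str, str], float],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Weighted average of pair_sim over every pair of words from the two sets."""
    weights = weights or {}
    words_b = list(words_b)
    total = 0.0
    norm = 0.0
    for a in words_a:
        for b in words_b:
            # Pair weight: product of the word weights, 1.0 when unspecified.
            w = weights.get(a, 1.0) * weights.get(b, 1.0)
            total += w * pair_sim(a, b)
            norm += w
    return total / norm if norm else 0.0
```

With uniform weights this reduces to the plain average of all pairwise similarities; lowering the weight of an unimportant word reduces its influence on the result.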
After the syntactic similarity and the semantic similarity of the element names have been calculated, the name similarity Sim_name of the elements is equal to the maximum of Sim_gram and Sim_sema;
4.3 Data type similarity
An element has a name and a data type. Differences in data type or in unit of measurement can also prevent a complete match, so considering name similarity alone is not comprehensive enough. Properties in an ontology are divided into object properties and datatype properties: an object property describes the association between two peer classes, while a datatype property associates a class with a value of some data type. To further improve matching accuracy, the data types of properties must also be taken into account. Following the method of the cited reference, conversions between data types are classified by the information lost in the conversion: equal conversion, lossless conversion, lossy conversion and no conversion possible. The present invention predefines a static table of similarities for conversions between data types; Table 5 shows the similarity matrix for some of the data types,
The similarity Sim_type of two data types is the corresponding value in the matrix, as shown in formula (4), where type1 and type2 are the data types of properties e1 and e2 respectively,
Sim_type(type1, type2) = Matrix[type1][type2] (4)
The element-level similarity BasicSim of two elements in the ontologies can be defined as:
BasicSim(e1, e2) = α * Sim_name(e1, e2) + (1 - α) * Sim_type(e1, e2) (5)
where α is a weight that typically takes the value 0.7 and can also be adjusted according to experimental results,
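Formulas (4) and (5) amount to a table lookup followed by a weighted sum, as the sketch below shows. The values are those of Table 5; treating the row as type1 and the column as type2, and the function names, are assumptions.

```python
TYPES = ["string", "boolean", "int", "double", "time", "date"]

# Values copied from Table 5; row = type1, column = type2 (assumed orientation).
TYPE_MATRIX = {
    "string":  [1.0, 0.0, 0.35, 0.35, 0.35, 0.35],
    "boolean": [0.0, 1.0, 0.35, 0.0, 0.0, 0.0],
    "int":     [0.7, 0.35, 1.0, 0.7, 0.0, 0.0],
    "double":  [0.35, 0.0, 0.7, 1.0, 0.0, 0.0],
    "time":    [0.35, 0.0, 0.0, 0.0, 1.0, 0.7],
    "date":    [0.35, 0.0, 0.0, 0.0, 0.7, 1.0],
}

def sim_type(type1: str, type2: str) -> float:
    """Formula (4): look up the predefined data type similarity."""
    return TYPE_MATRIX[type1][TYPES.index(type2)]

def basic_sim(sim_name: float, type1: str, type2: str, alpha: float = 0.7) -> float:
    """Formula (5): weighted sum of name similarity and data type similarity."""
    return alpha * sim_name + (1.0 - alpha) * sim_type(type1, type2)
```

For instance, a name similarity of 0.66 between an int element and a double element gives 0.7 * 0.66 + 0.3 * 0.7 = 0.672. Note that the matrix is not symmetric: int to string is 0.7, whereas string to int is 0.35.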
Considering data type similarity does not find more element matching relations, but it improves matching accuracy. After candidate matching relations have been found from element name similarity, the data type similarity of the two matched elements in each candidate is used to check whether the two elements really match: if the data types are identical, the similarity between the elements increases; if the conversion is lossy, the similarity between the elements decreases. Therefore, the present invention takes element data type similarity into account in order to improve the precision of the matching result;
4.4 Structural similarity
To improve the accuracy of element similarity, the structural similarity of an element pair must also be calculated from the context information of the element nodes. The structural information of an element is mainly reflected in its ancestor nodes, child nodes and leaf nodes;
In the present invention, the concepts, properties and relations in an ontology form a graph structure, so graph-based algorithms can be used. Taking into account the structural information between concepts in the ontology, the structural similarity of a concept pair is calculated mainly from the relations between superclasses and subclasses, in order to further improve the accuracy of the similarity obtained in the preceding steps. The technical solution of the present invention is based on the following idea: for a pair of concepts, if their sets of parent concepts are similar, the two concepts are likely to be similar; likewise, if their sets of child concepts are similar, the two concepts are also likely to be similar. For the ontology graph structure, the present invention considers the structural similarity of two concepts from three aspects: ancestor nodes, child nodes and leaf nodes, with corresponding similarity values Sim_ancestor, Sim_child and Sim_leaf
For any two concepts c1 and c2, their structural similarity is:
Sim_str(c1, c2) = α Sim_ancestor(c1, c2) + β Sim_child(c1, c2) + γ Sim_leaf(c1, c2)
where 0 ≤ α, β, γ ≤ 1 and α + β + γ = 1; the three weights can be distributed evenly or given different values as needed;
The similarity Sim(e1, e2) of any two nodes in the two ontologies to be matched is the weighted sum of the basic similarity and the structural similarity of the two elements; the weight β can be set to 0.5 and can also be adjusted through experimental verification,
Sim(e1, e2) = β * Sim_name(e1, e2) + (1 - β) * Sim_str(e1, e2), 0 ≤ β ≤ 1 (6);
4.4.1 Ancestor node similarity
The ancestor information of a node is reflected in its ancestor nodes; matching accuracy can be improved by comparing the similarity of the ancestor node sets of the two nodes to be matched,
(1) first obtain all ancestor nodes of the two concepts c1 and c2, namely Ancestors(c1) and Ancestors(c2);
(2) calculate the basic similarity between every pair of nodes drawn from the two node sets, obtaining the similarity matrix of the two sets;
(3) set a threshold th_accept, select the node pair with the largest value from the similarity matrix, and then remove the row and column of that node pair from the similarity matrix;
(4) select from the matrix all node pairs whose value exceeds the threshold by iterating step (3);
(5) add up all selected values and normalize the sum by the number of selected node pairs; the resulting value is the ancestor node similarity Sim_ancestor of c1 and c2;
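Steps (2) to (5) describe a greedy selection over a similarity matrix, which can be sketched as follows. The function name, the default threshold of 0.5 and the choice to return 0.0 when no pair exceeds the threshold are assumptions; `basic_sim` stands for any pairwise similarity such as BasicSim.

```python
from typing import Callable, Dict, List, Sequence, Tuple

def node_set_similarity(
    nodes_a: Sequence[str],
    nodes_b: Sequence[str],
    basic_sim: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical value for th_accept
) -> float:
    # Step (2): similarity matrix over all pairs, stored as a dictionary.
    matrix: Dict[Tuple[str, str], float] = {
        (a, b): basic_sim(a, b) for a in nodes_a for b in nodes_b
    }
    selected: List[float] = []
    # Steps (3) and (4): repeatedly take the best remaining pair above the threshold.
    while matrix:
        (a, b), value = max(matrix.items(), key=lambda item: item[1])
        if value <= threshold:
            break
        selected.append(value)
        # Remove the row of a and the column of b.
        matrix = {(x, y): v for (x, y), v in matrix.items() if x != a and y != b}
    # Step (5): normalize the sum by the number of selected pairs.
    return sum(selected) / len(selected) if selected else 0.0
```

The same routine can be applied to the ancestor, child or leaf node sets of two concepts, since only the input sets differ.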
4.4.2 Child node similarity
Child node similarity mainly reflects the similarity of the immediate context. It is obtained by calculating the similarity between the nodes in the direct child node sets of the two nodes, which include the property nodes and subclass nodes of a concept in the ontology. Let the direct child node set of node c1 be Children(c1) and that of node c2 be Children(c2). The basic similarity BasicSim of every pair of nodes drawn from the two child node sets is calculated to obtain a similarity matrix; the calculation then proceeds in the same way as for ancestor node similarity, and the resulting value is the child node similarity Sim_child(c1, c2) of nodes c1 and c2;
4.4.3 Leaf node similarity
The information about a concept in an ontology is usually contained in its own property nodes or in the property nodes of its descendant nodes, so the similarity of the leaf nodes of two nodes can be compared to improve matching accuracy. For two nodes c1 and c2 to be matched, their leaf node sets Leaves(c1) and Leaves(c2) are obtained first; the basic similarity between every pair of nodes drawn from the two sets is then calculated to obtain a similarity matrix; and the leaf node similarity Sim_leaf(c1, c2) of c1 and c2 is calculated with the same scheme as for ancestor node similarity. The structural similarity algorithm is shown in Table 6;
4.5 Similarity matrix combination strategy
To improve the quality of schema matching, several schema matchers are used when a schema ontology is matched against the global ontology. Each matcher generates a similarity matrix Mi between the elements of the two ontologies, in which every value lies in the range [0, 1]; together the matchers produce a similarity cube. The final similarity matrix between the two ontologies is obtained by a weighted sum of these similarity matrices, M = a1M1 + a2M2 + … + akMk, where 0 ≤ a1, a2, …, ak ≤ 1 and a1 + a2 + … + ak = 1. Each weight can be adjusted according to the importance of the corresponding matcher and to user feedback after repeated experiments. When the similarity between two elements exceeds a threshold th_accept, a mapping relation is considered to exist between them. Manual participation is also possible in this process: a system administrator or domain expert uses expert knowledge to adjust, check and correct the candidate matches, setting the similarity of two elements confirmed to have a mapping relation directly to 1 and the similarity of two elements confirmed to have no mapping relation to 0;
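The combination strategy can be sketched with plain nested lists. The function names and the representation of expert feedback as lists of (row, column) index pairs are assumptions made for the illustration.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]

def combine_matrices(matrices: Sequence[Matrix], weights: Sequence[float]) -> Matrix:
    """Weighted sum M = a1*M1 + ... + ak*Mk of same-shaped similarity matrices."""
    if len(matrices) != len(weights):
        raise ValueError("one weight per matcher is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(cols)]
        for i in range(rows)
    ]

def candidate_mappings(matrix: Matrix, threshold: float) -> List[Tuple[int, int, float]]:
    """Element pairs whose combined similarity exceeds the acceptance threshold."""
    return [
        (i, j, value)
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if value > threshold
    ]

def apply_expert_feedback(
    matrix: Matrix,
    confirmed: Sequence[Tuple[int, int]],
    rejected: Sequence[Tuple[int, int]],
) -> Matrix:
    """Return a copy with confirmed pairs set to 1 and rejected pairs set to 0."""
    result = [row[:] for row in matrix]
    for i, j in confirmed:
        result[i][j] = 1.0
    for i, j in rejected:
        result[i][j] = 0.0
    return result
```

Keeping feedback as a separate step leaves the automatically computed matrix available for later re-weighting of the matchers.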
The steps above describe how the mapping relations between a schema ontology and the global ontology are calculated; the mapping relations between each schema ontology and the global ontology are represented by a similarity matrix;
5. In the present invention, the similarity between schema ontologies is calculated; this section further describes how to calculate the similarity matrix between two schema ontologies, i.e. the mapping relations between two data source schemas;
After two schemas have been matched, the mapping relation between any two elements can be expressed as a triple Triple(e, e') = (e, e', Sim(e, e')), where e and e' are elements of the two matched schemas and Sim(e, e') ∈ [0, 1] is the similarity between e and e'. The set of mapping relations obtained by matching two ontologies is thus a set of triples, which can be represented by a similarity matrix;
Suppose the global ontology is O(o1, …, ok) and the two schema ontologies are S(s1, …, sm) and T(t1, …, tn). Schema ontologies S and T are each matched against the global ontology O, yielding the corresponding similarity matrices Ms and Mt. In general schema matching, the similarity relation between elements is transitive: if elements a and b are similar and b and c are similar, then a and c are very likely to be similar. Suppose Triple(si, ov) ∈ Ms and Triple(tj, ov) ∈ Mt, where si and tj are elements of S and T respectively, ov is an element of O, 1 ≤ i ≤ m, 1 ≤ j ≤ n and 1 ≤ v ≤ k,
In this way several similarity values may be calculated between si and tj, and the maximum of them is taken as the similarity between si and tj. For si ∈ S and tj ∈ T, the calculation formula is as follows:
Sim(si, tj) = max { Sim(si, ou) * Sim(ou, ov) * Sim(ov, tj) | ou, ov ∈ O } (7)
In this way the similarity matrix between schemas S and T is obtained. The pseudo-code of the similarity calculation method between schema ontologies is shown in Table 7.
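The transitive step can be sketched as a max-product composition of the two matrices. This illustration covers only the case described in the text where si and tj are both mapped to the same global ontology element ov; the function name and the row-per-schema-element layout are assumptions.

```python
from typing import List

Matrix = List[List[float]]

def compose_via_global(ms: Matrix, mt: Matrix) -> Matrix:
    """Similarity between schemas S and T derived through the global ontology O.

    ms[i][v] holds Sim(s_i, o_v) and mt[j][v] holds Sim(t_j, o_v); the result
    holds, for each pair (s_i, t_j), the best product over the shared o_v.
    """
    return [
        [max(a * b for a, b in zip(row_s, row_t)) for row_t in mt]
        for row_s in ms
    ]

# Two elements in S, three in T, two in the global ontology (made-up values).
ms = [[0.9, 0.1],
      [0.2, 0.8]]
mt = [[1.0, 0.0],
      [0.5, 0.5],
      [0.0, 1.0]]
st = compose_via_global(ms, mt)  # 2 x 3 matrix between S and T
```

Because each schema is matched only against the global ontology, adding a new schema requires one new matrix rather than a fresh match against every existing schema.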
The advantages of the present invention are as follows. Building on existing matching methods, it proposes a multi-data-source schema matching method based on a global ontology for the case where the schemas of multiple data sources within an enterprise must be matched against one another pairwise. The method first constructs a unified mathematical model, the schema ontology, for heterogeneous schemas of different types. Using rule-based matching, it calculates the similarity relations between the elements of each schema ontology to be matched and the global ontology from three aspects: element name, data type and structure. It then uses the transitivity of the similarity relation between elements to calculate the similarity relations between the schema ontologies and thereby discovers the mapping relations between schema elements. The method can handle schema matching among multiple heterogeneous schemas of different types, and it has good extensibility and maintainability. During schema matching the method uses the WordNet lexical ontology to calculate the semantic similarity of element names and uses a domain ontology or global ontology as an intermediary, which noticeably improves the quality and efficiency of multi-data-source schema matching. When a new data source schema is added or an existing schema changes, the schema only needs to be matched once against the global ontology.
Table 1. Correspondence between relational schema and schema ontology
Table 2. Correspondence between XML Schema and schema ontology
Table 3. Similarity calculation algorithm for a schema ontology and the global ontology
Table 4. Pseudo-code of the element name normalization algorithm
Table 5. Data type similarity table
DataType   string   boolean   int    double   time   date
string     1.0      0.0       0.35   0.35     0.35   0.35
boolean    0.0      1.0       0.35   0.0      0.0    0.0
int        0.7      0.35      1.0    0.7      0.0    0.0
double     0.35     0.0       0.7    1.0      0.0    0.0
time       0.35     0.0       0.0    0.0      1.0    0.7
date       0.35     0.0       0.0    0.0      0.7    1.0
Table 6. Pseudo-code of the structural similarity algorithm
Table 7. Pseudo-code of the similarity calculation algorithm between schema ontologies
For ease of understanding, the present invention is described in detail below with reference to specific drawings and embodiments. It should be emphasized that the specific examples and drawings are for illustration only. A person of ordinary skill in the art can obviously make various modifications and variations to the present invention within its scope on the basis of the description herein, and such modifications and variations also fall within the scope of the present invention. In addition, the present invention cites published documents in order to describe the invention more clearly; their entire contents are incorporated herein by reference, as if their full text had been repeated herein.
Brief description of the drawings
Fig. 1 Flow chart of the multi-data-source schema matching method based on a global ontology.
Fig. 2 Experimental results of the three schemes on the two Book schemas.
Fig. 3 Experimental results of the three schemes on the two Auto schemas.
Fig. 4 Results of the schema matching method based on the global ontology.
Fig. 5 Results of the general schema matching method.
Specific embodiment
Embodiment 1
To demonstrate the effect of the present invention, a corresponding prototype system was implemented and a series of experiments was carried out. The system was developed in Eclipse with version 1.5 of the Java Virtual Machine. The experiments were run on an HP4000 notebook with a 1500 MHz processor, 256 MB of memory and the Windows XP Professional operating system.
To verify the matching quality of the method of the invention, this embodiment uses three evaluation indexes that are common in the field of schema matching: precision, recall and overall. Suppose that the number of all correct matches that actually exist between the two schemas to be matched, as found manually by a domain expert, is R, and that the number of all results returned by this matching method is P, of which T are correct matches and F are incorrect matches, i.e. P = T + F. The 3 evaluation indexes are then defined as follows:
(1) Precision: the proportion of correct matching results among the matching results returned by the matching algorithm;
(2) Recall: the proportion of the actual correct matching results that are returned by the matching algorithm;
(3) Overall: an estimate of the post-matching effort, which combines precision and recall.
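The three indexes can be computed from the returned and the expert-provided match sets as sketched below. Precision = T/P and Recall = T/R follow directly from the definitions above; for Overall the sketch assumes the definition commonly used in the schema matching literature, Overall = Recall * (2 - 1/Precision), which equals (T - F)/R. The function name and the representation of matches as sets of pairs are illustrative.

```python
from typing import Dict, Set, Tuple

Match = Tuple[str, str]

def evaluate(returned: Set[Match], actual: Set[Match]) -> Dict[str, float]:
    """Precision, recall and overall for a set of returned matches."""
    t = len(returned & actual)  # correct matches returned (T)
    p = len(returned)           # all matches returned (P)
    r = len(actual)             # all real matches found by the expert (R)
    f = p - t                   # incorrect matches returned (F)
    return {
        "precision": t / p if p else 0.0,
        "recall": t / r if r else 0.0,
        # Assumed definition: recall * (2 - 1/precision) == (T - F) / R.
        "overall": (t - f) / r if r else 0.0,
    }
```

Overall can be negative when more than half of the returned matches are wrong, which reflects that correcting the result would cost more effort than matching by hand.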
The experimental data in this embodiment are the data schemas provided at http://metaquerier.cs.uiuc.edu/repository/. These schemas come from practical applications and are commonly used as a shared test data set in the field of schema matching; domain ontologies are also readily available for them as a reference. This embodiment uses the Book and Auto data schemas to verify the validity of the method. Because these schema data are neither standard XML Schemas nor relational schemas, and the data volume is small, the element names were extended manually, the schema data were rewritten in XML format, and they were also set up as relational schemas. Table 8 lists the data characteristics of the two pairs of schemas to be matched, such as the schema types involved, the number of attributes and the number of attributes that can be matched.
The experimental comparison was carried out in two respects;
The first respect tests each of the matching algorithms used herein in order to verify its validity. Three schemes were used: (1) name similarity only (N_Sim); (2) name similarity combined with data type similarity (NT_Sim); (3) a hybrid matching algorithm combining name similarity, data type similarity and structural similarity (NTS_Sim). Precision, recall and overall were recorded and analyzed for each of the three schemes. The results of matching the data schemas of the two domains with the three schemes are shown in Fig. 2 and Fig. 3 respectively. It can be seen that the name-based similarity matching algorithm already finds most of the correct matches. After data type similarity is combined with name similarity, precision improves further but recall does not change, because data type similarity only corrects some incorrect matches and does not find new ones. After name similarity, data type similarity and structural similarity are combined, more mapping relations are found, and precision, recall and overall all improve;
The second respect compares, on the same data, this method with the general approach of matching schemas directly without a global ontology, in order to verify the validity of this method. With the matching method based on the global ontology, the two schemas to be matched are first converted into schema ontologies and matched using the global ontology as the intermediary, after which the similarity relations are aggregated; the results are shown in Fig. 4. With the general schema matching method, the global ontology is not used as an intermediary and the data source schemas are matched directly with each other; the results are shown in Fig. 5. The results show little difference in the quality of the matches produced by the two methods, which is sufficient to show that this method achieves high precision, recall and overall, and high matching quality. Compared with the traditional approach of matching multiple data source schemas against one another pairwise, this method therefore retains high matching quality while considerably reducing the workload and saving time.
Table 8. Characteristics of the experimental data

Claims (8)

1. A multi-data-source schema matching method based on a global ontology, characterized in that the similarity relations between every two of a plurality of heterogeneous data source schemas are calculated using the transitivity of the similarity relation between elements, the method comprising the steps of:
(1) first obtaining or constructing the global ontology of the domain of the plurality of data sources; when no global ontology or domain ontology exists in the domain of the data sources whose schemas are to be matched, building the corresponding domain ontology with an ontology editing tool;
(2) converting the heterogeneous schemas of the plurality of data sources to be matched into schema ontologies using a schema converter; if the schema of a data source does not exist, the schema of that data source must first be extracted;
(3) processing the element names in the schema ontologies and the global ontology using a name normalization processor;
(4) matching each schema ontology against the global ontology using each schema matcher in a matcher library, each matcher outputting one similarity matrix, so that the multiple matchers together generate a similarity cube; assigning different weights to different matchers using a matcher combination strategy, and obtaining the similarity matrix between each schema ontology and the global ontology;
(5) using a similarity matrix aggregator to aggregate the similarity matrices of step (4) pairwise, obtaining the similarity matrix between any two data source schemas;
(6) combining background knowledge and domain knowledge to manually inspect, optimize and verify the matching results, and then adjusting the weights of the different matchers in the matcher combination strategy according to user feedback.
2. The method according to claim 1, characterized in that the ontology editing tool is Protégé from Stanford University.
3. The method according to claim 1, characterized in that, for the domain ontology, an ontology is defined as follows:
Definition 1. An ontology is represented by a five-tuple O = (C, I, R, F, A) [25], where C is the set of concepts in the ontology, I is the set of instances corresponding to the concepts, R is the set of relations between concepts, F is the set of functions that can act on the concepts, and A is the set of axioms of the ontology; wherein,
(1) Concept
A concept in an ontology, also called a class, can denote anything, such as an animal, a behavior or a function; the classes form a classification hierarchy;
(2) Relation
A relation is an interaction between concepts in the ontology. A relation among n concepts is a subset of the n-dimensional Cartesian product of those concepts, and a relation may exist between any two concepts; the relations include inheritance, containment and synonymy;
(3) Function
A function is a special kind of relation in which one element is uniquely determined by the other n-1 elements; its formal definition is F: C1 × C2 × … × Cn-1 → Cn. For example, childOf is a function, and childOf(ci, cj) states that cj is a child of ci;
(4) Axiom
An axiom is an assertion that is always true, for example that one concept falls within the scope of another concept;
(5) Instance
An instance is a specific entity or instance object of a concept in the ontology.
4. according to the method for claim 1, it is characterised in that be converted to each data source schema for needing to be matched Corresponding pattern body, is defined as follows:
Different types of patten transformation is that pattern body needs different construction methods, wherein, from OWL as ontology describing language Speech, OWL are the standard of ontology description language in the web that W3C recommends, and create and safeguard body with ontology edit tool;It is right XML Schema, relation schema forming types body are respectively;
(1) Building the schema ontology of a relational schema
Definition 1. A relational schema S can be represented by a two-tuple <RS, ΣS>, where RS denotes the set of all relations contained in schema S, and ΣS denotes the set of constraints within and between the relations in S, such as entity integrity constraints and referential integrity constraints. Any relation R ∈ RS is abbreviated as R(A1:T1, A2:T2, ..., An:Tn), where R is the relation name (i.e. the table name), Ai (1≤i≤n) is an attribute of relation R, and Ti is the data type of that attribute;
(a) Each database corresponds to one schema ontology, and the predefined dataSource connection name is used as the file name of the schema ontology;
(b) If a relation table describes an entity and has no foreign-key relationship with other tables: the table name is defined as a class (or concept) in the schema ontology, and each ordinary field in the table is defined as an attribute of that class, with the field name as the attribute name and the attribute's value domain defined according to the field's data type;
(c) If a relation table describes an entity and has a foreign-key relationship with other tables: the foreign key is defined as an object attribute of the class, i.e. OWL:ObjectProperty, and its domain and range values are defined to represent the relationship between the two classes;
(d) If a relation table describes a relationship between entities and contains only foreign-key attributes, for example its primary key consists of just two foreign keys: classes need to be defined only for the two tables that it associates, no new class is created for the table itself, and two object attributes are created to represent the many-to-many relationship between the two tables;
(e) If a relation table describes a relationship between entities and, in addition to its primary-key attributes, also contains other non-primary-key attributes: a class must also be created for the table, and it is defined according to rule (c);
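Rules (b) to (e) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Table` structure, the `XSD_TYPES` mapping, the generated property names, and the dictionary output are hypothetical and do not come from the patent, and a real implementation would emit OWL through an ontology tool rather than plain dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical SQL-to-XSD type mapping; unknown types fall back to xsd:string.
XSD_TYPES = {"INT": "xsd:integer", "VARCHAR": "xsd:string", "DATE": "xsd:date"}


@dataclass
class Table:
    """One relation R(A1:T1, ..., An:Tn) plus its key constraints."""
    name: str
    columns: Dict[str, str]             # field name -> SQL data type
    primary_key: List[str]
    foreign_keys: Dict[str, str] = field(default_factory=dict)  # field -> referenced table


def is_pure_link_table(t: Table) -> bool:
    """Rule (d): the table holds only two foreign keys, which form its primary key."""
    fks = set(t.foreign_keys)
    return len(fks) == 2 and set(t.columns) == fks and set(t.primary_key) == fks


def build_schema_ontology(tables: List[Table]) -> dict:
    onto = {"classes": [], "datatype_properties": [], "object_properties": []}
    for t in tables:
        if is_pure_link_table(t):
            # Rule (d): no new class; two object properties express the
            # many-to-many relationship between the two referenced tables.
            a, b = t.foreign_keys.values()
            onto["object_properties"].append(
                {"name": f"{t.name}_{a}_{b}", "domain": a, "range": b})
            onto["object_properties"].append(
                {"name": f"{t.name}_{b}_{a}", "domain": b, "range": a})
            continue
        # Rules (b), (c), (e): the table becomes a class.
        onto["classes"].append(t.name)
        for col, sql_type in t.columns.items():
            if col in t.foreign_keys:
                # Rule (c): a foreign key becomes an object property whose
                # domain is this class and whose range is the referenced class.
                onto["object_properties"].append(
                    {"name": col, "domain": t.name, "range": t.foreign_keys[col]})
            else:
                # Rule (b): an ordinary field becomes a datatype property.
                onto["datatype_properties"].append(
                    {"name": col, "domain": t.name,
                     "range": XSD_TYPES.get(sql_type.upper(), "xsd:string")})
    return onto


tables = [
    Table("student", {"id": "INT", "name": "VARCHAR"}, ["id"]),
    Table("course", {"id": "INT", "title": "VARCHAR"}, ["id"]),
    Table("enrolment", {"student_id": "INT", "course_id": "INT"},
          ["student_id", "course_id"],
          {"student_id": "student", "course_id": "course"}),
]
ontology = build_schema_ontology(tables)
```

In this example the `enrolment` table yields no class of its own, only the two object properties linking `student` and `course`.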
(2) Building the schema ontology of an XML Schema
(a) Complex-type elements in the XML Schema are defined as classes in the schema ontology, and simple-type elements and attributes are defined as attributes in the schema ontology;
(b) The constraint relationships in the XML Schema are defined as follows: the relationship between an element and its child element is defined as a subclass relation in OWL form (OWL:SubClassOf) in the schema ontology, and the relationship between an element and its attribute is defined as a data attribute relation in OWL form (OWL:DatatypeProperty);
An XML Schema can be converted according to the rules above. After the schema ontology has been built, semantic corrections can be made manually, based on human understanding of the concepts.
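The two XML Schema rules can be sketched with Python's standard `xml.etree.ElementTree` module. The sketch makes simplifying assumptions that are not in the patent: it only handles anonymous `xs:complexType` definitions nested inside `xs:element` with an `xs:sequence` content model, it does not resolve named types, and the function name `convert_xsd`, the sample schema, and the tuple-based output are hypothetical.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical sample schema used only for illustration.
SAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Book">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="Author">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="name" type="xs:string"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="isbn" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def convert_xsd(xsd_text):
    """Map an XML Schema to classes, subclass links and datatype properties."""
    root = ET.fromstring(xsd_text)
    onto = {"classes": [], "subclass_of": [], "datatype_properties": []}

    def visit(element, parent_class):
        name = element.get("name")
        complex_type = element.find(XS + "complexType")
        if complex_type is None:
            # Rule (a): a simple-type element becomes a property of the
            # enclosing class (top-level simple elements are skipped here).
            if parent_class is not None:
                onto["datatype_properties"].append(
                    (parent_class, name, element.get("type")))
            return
        # Rule (a): a complex-type element becomes a class.
        onto["classes"].append(name)
        if parent_class is not None:
            # Rule (b): element / child-element nesting becomes a subclass link.
            onto["subclass_of"].append((name, parent_class))
        for attr in complex_type.findall(XS + "attribute"):
            # Rule (b): element / attribute becomes a datatype property.
            onto["datatype_properties"].append(
                (name, attr.get("name"), attr.get("type")))
        for child in complex_type.findall(f"{XS}sequence/{XS}element"):
            visit(child, name)

    for top in root.findall(XS + "element"):
        visit(top, None)
    return onto


xsd_ontology = convert_xsd(SAMPLE_XSD)
```

For the sample schema, `Book` and `Author` become classes, `Author` is recorded as a subclass of `Book` as rule (b) prescribes, and `isbn`, `title`, and `name` become datatype properties. Whether nesting really means subclassing is exactly the kind of point the manual semantic correction step is meant to review.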
5. The method according to claim 1, characterized in that the method for normalizing element names is defined as follows:
(1) Compound phrase splitting: a tokenizer splits the element name string into a set of independent words according to connectors, spaces, punctuation marks, letter case, or digits, e.g. First_Name → (First, Name);
(2) Word reduction: words in capitalized, plural, or abbreviated form are reduced to their base form, e.g. id → identifier, NAME → name;
(3) Prepositions and conjunctions are deleted from the word set, e.g. (longitude, in, city) → (longitude, city);
After normalization, each element name yields a word set, and the name similarity between two elements is then converted into the similarity between their two word sets.
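The three normalization steps can be sketched as follows. The abbreviation table, the stop-word list, and the plural rule are hypothetical placeholders chosen for illustration; the patent does not specify them, and a production system would more likely use a dictionary or lemmatizer.

```python
import re

# Hypothetical lookup tables; the patent does not list their contents.
ABBREVIATIONS = {"id": "identifier", "num": "number", "addr": "address"}
STOP_WORDS = {"in", "of", "on", "at", "for", "to", "by", "and", "or", "the"}


def split_compound(name):
    """Step 1: split on connectors, spaces, punctuation, case changes and digits."""
    tokens = []
    for part in re.split(r"[^A-Za-z0-9]+", name):
        # Acronym runs (XML), capitalised or lower-case words, digit runs.
        tokens.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return tokens


def reduce_word(word):
    """Step 2: lower-case, expand abbreviations, crudely strip plurals."""
    word = word.lower()
    word = ABBREVIATIONS.get(word, word)
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith(("ss", "us", "is")) and len(word) > 3:
        return word[:-1]
    return word


def normalize(name):
    """Steps 1 to 3: return the normalized word list for an element name."""
    words = [reduce_word(token) for token in split_compound(name)]
    return [w for w in words if w not in STOP_WORDS]
```

With these tables, `normalize("longitude_in_city")` gives `["longitude", "city"]` and `normalize("NAME")` gives `["name"]`, matching the examples in the claim.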
6. The method according to claim 1, characterized in that the similarity matrix is defined as follows:
The similarity matrix Mm×n is a two-dimensional mathematical matrix of size m × n, where m and n are the numbers of elements in the two source schemas being matched. Each value in the matrix lies in [0,1]; the larger the value, the more similar the two elements, and a value of 0 means the two elements are not similar at all.
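As an illustration, an m × n similarity matrix can be filled by applying one matcher to every pair of elements. The sketch below assumes the elements are already given as word sets and uses the Jaccard coefficient as the matcher; Jaccard is an assumption made here for illustration and is not named in the patent.

```python
def jaccard(words_a, words_b):
    """Word-set similarity in [0, 1]; an illustrative choice of matcher."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity_matrix(source, target, matcher=jaccard):
    """Return an m x n list of lists, one row per source element."""
    return [[matcher(s, t) for t in target] for s in source]


source_elements = [["first", "name"], ["city"]]           # m = 2
target_elements = [["name"], ["city"], ["longitude"]]     # n = 3
M = similarity_matrix(source_elements, target_elements)
```

Here `M[1][1]` is 1.0 because both word sets are `{"city"}`, and `M[0][2]` is 0.0 because the two sets share no word.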
7. The method according to claim 1, characterized in that the strategy for combining the similarity matrices produced by the multiple matchers is defined as follows:
To improve the quality of schema matching, multiple matchers are used when a schema ontology is matched against the global ontology. Each matcher generates a similarity matrix Mi between the elements of the two ontologies, with every value in the range [0,1], so the matchers together produce a similarity cube. A weighted sum of these similarity matrices gives the final similarity matrix between the two ontologies: M = a1M1 + a2M2 + … + akMk, where 0 ≤ a1, a2, …, ak ≤ 1 and a1 + a2 + … + ak = 1. Each weight can be adjusted according to the importance of the corresponding matcher and to user feedback from repeated experiments. When the similarity between two elements is greater than a threshold th_accept, the two elements are considered to have a mapping relation. Manual participation is also possible during this process: a system administrator or domain expert uses expert knowledge to adjust the candidate matching pairs and to check and correct the resulting correspondences; the similarity of two elements confirmed to have a mapping relation is set directly to 1, and the similarity of two elements confirmed to have no mapping relation is set to 0.
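The combination strategy amounts to a weighted sum, a threshold test, and an optional manual override. The following is a minimal sketch with matrices as plain lists of lists; the function names and the strict "greater than" comparison against the threshold are choices made here for illustration.

```python
import math


def combine(matrices, weights):
    """Weighted sum M = a1*M1 + ... + ak*Mk with weights in [0, 1] summing to 1."""
    if len(matrices) != len(weights):
        raise ValueError("one weight per matrix is required")
    if any(not 0.0 <= a <= 1.0 for a in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(a * m[i][j] for a, m in zip(weights, matrices))
             for j in range(cols)] for i in range(rows)]


def candidate_pairs(matrix, th_accept):
    """Index pairs whose combined similarity exceeds the acceptance threshold."""
    return [(i, j) for i, row in enumerate(matrix)
            for j, value in enumerate(row) if value > th_accept]


def apply_expert_feedback(matrix, confirmed=(), rejected=()):
    """Set confirmed mappings to 1 and rejected mappings to 0 (returns a copy)."""
    result = [row[:] for row in matrix]
    for i, j in confirmed:
        result[i][j] = 1.0
    for i, j in rejected:
        result[i][j] = 0.0
    return result


M1 = [[0.8, 0.2], [0.1, 0.6]]   # e.g. output of a name matcher
M2 = [[0.4, 0.0], [0.3, 1.0]]   # e.g. output of a structure matcher
combined = combine([M1, M2], [0.5, 0.5])
pairs = candidate_pairs(combined, th_accept=0.5)
reviewed = apply_expert_feedback(combined, confirmed=[(0, 0)], rejected=[(1, 0)])
```

Returning a copy from `apply_expert_feedback` keeps the automatically computed matrix available for later weight tuning.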
8. The method according to claim 1, characterized in that the aggregation of similarity relations is computed as defined below:
Definition 1: the triple Triple(e, e') = (e, e', Sim(e, e')) represents the mapping relation between any two elements after two schemas have been matched, where e and e' are elements of the two matched schemas and Sim(e, e') ∈ [0,1] is the similarity between e and e'. The set of mapping relations obtained by matching two ontologies is a set of triples and can be represented by a similarity matrix;
Suppose the global ontology is O(o1,…,ok) and the two schema ontologies are S(s1,…,sm) and T(t1,…,tn). Schema ontologies S and T are each matched against the global ontology O, giving the similarity matrices Ms and Mt respectively. In general schema matching, the similarity relation between elements is transitive: if elements a and b are similar and b and c are similar, then a and c are probably similar as well. Suppose Triple(si, ov) ∈ Ms and Triple(tj, ov) ∈ Mt, where si and tj are elements of S and T respectively, ov is an element of O, and 1≤i≤m, 1≤j≤n, 1≤v≤k. Several similarity values between si and tj can then be computed, and the largest of them is taken as the similarity between si and tj. For si ∈ S and tj ∈ T, the formula is:
Sim(si, tj) = max{ Sim(si, ou) * Sim(ou, ov) * Sim(tj, ov) | ou, ov ∈ O }.
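The aggregation step can be sketched as below. The sketch assumes the simplest reading of the transitivity argument, in which si and tj are compared through one shared global element ov (equivalently, two global elements are treated as fully similar only when they are the same element), so each candidate value is the product Sim(si, ov) * Sim(tj, ov). The function name and the list-of-lists layout are hypothetical.

```python
def aggregate(ms, mt):
    """Derive the S-to-T similarity matrix from the two matrices against O.

    ms is m x k (rows: elements of S, columns: elements of O);
    mt is n x k (rows: elements of T, columns: elements of O).
    """
    k = len(ms[0])
    if any(len(row) != k for row in ms + mt):
        raise ValueError("both matrices must have one column per global element")
    # For every pair (si, tj), take the best value over all shared elements ov.
    return [[max(s_row[v] * t_row[v] for v in range(k)) for t_row in mt]
            for s_row in ms]


Ms = [[0.9, 0.1],
      [0.2, 0.8]]          # S has 2 elements, O has 2 elements
Mt = [[0.5, 0.4],
      [0.0, 1.0],
      [0.7, 0.3]]          # T has 3 elements
Mst = aggregate(Ms, Mt)    # 2 x 3 matrix of similarities between S and T
```

Because the data sources are never compared with each other directly, adding a new source only requires matching it once against the global ontology.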

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610826714.7A CN107844482A (en) 2016-09-17 2016-09-17 Multi-data source method for mode matching based on global body

Publications (1)

Publication Number Publication Date
CN107844482A 2018-03-27

Family

ID=61656498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610826714.7A Pending CN107844482A (en) 2016-09-17 2016-09-17 Multi-data source method for mode matching based on global body

Country Status (1)

Country Link
CN (1) CN107844482A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102402507A (en) * 2010-09-07 2012-04-04 重庆邮电大学 Heterogeneous data integration system for service-oriented architecture (SOA) multi-message mechanism
CN102945230A (en) * 2012-10-17 2013-02-27 刘运通 Natural language knowledge acquisition method based on semantic matching driving
CN104346438A (en) * 2014-09-14 2015-02-11 北京航空航天大学 Data management service system based on large data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
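石浩宏 (Shi Haohong) et al.: "Research on a multi-data-source schema matching method based on a global ontology" (基于全局本体的多数据源模式匹配方法的研究), 《小型微型计算机系统》 (Journal of Chinese Computer Systems) *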

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330007A (en) * 2017-06-12 2017-11-07 南京邮电大学 A kind of Method for Ontology Learning based on multi-data source
CN108536796A (en) * 2018-04-02 2018-09-14 北京大学 A kind of isomery Ontology Matching method and system based on figure
CN108920716A (en) * 2018-07-27 2018-11-30 中国电子科技集团公司第二十八研究所 The data retrieval and visualization system and method for knowledge based map
CN108920716B (en) * 2018-07-27 2022-11-25 中国电子科技集团公司第二十八研究所 Data retrieval and visualization system and method based on knowledge graph
CN109348456A (en) * 2018-10-17 2019-02-15 安徽大学 Relation excavation method based on short-distance wireless communication data
CN109348456B (en) * 2018-10-17 2021-07-27 安徽大学 Relation mining method based on short-distance wireless communication data
CN109408578B (en) * 2018-10-30 2020-07-31 环境保护部华南环境科学研究所 Monitoring data fusion method for heterogeneous environment
CN109408578A (en) * 2018-10-30 2019-03-01 环境保护部华南环境科学研究所 One kind being directed to isomerous environment monitoring data fusion method
CN109492114A (en) * 2018-11-16 2019-03-19 南京茂毓通软件科技有限公司 A kind of entity information recognition methods
CN109710653A (en) * 2018-12-29 2019-05-03 北京航天数据股份有限公司 A kind of test data source configuration method and device
CN109902828A (en) * 2019-03-18 2019-06-18 中科院合肥技术创新工程院 Emergency event Emergency decision knowledge data model building method based on multi-level knowledge units
CN109993152A (en) * 2019-04-15 2019-07-09 武汉轻工大学 Mode conversion method, equipment, storage medium and the device of coordinate curve integral
CN110795607A (en) * 2019-10-29 2020-02-14 中国人民解放军32181部队 Equipment guarantee data matching method and system based on multi-stage similarity calculation
CN111274400A (en) * 2020-01-20 2020-06-12 医惠科技有限公司 Construction method, device, equipment and storage medium of medical term system
CN111274400B (en) * 2020-01-20 2021-02-12 医惠科技有限公司 Construction method, device, equipment and storage medium of medical term system
CN111597788A (en) * 2020-05-18 2020-08-28 腾讯科技(深圳)有限公司 Attribute fusion method, device and equipment based on entity alignment and storage medium
CN111597788B (en) * 2020-05-18 2023-11-14 腾讯科技(深圳)有限公司 Attribute fusion method, device, equipment and storage medium based on entity alignment
CN112633013A (en) * 2021-01-06 2021-04-09 福建工程学院 Global ontology element matching method based on heterogeneous characteristics
CN112965968A (en) * 2021-03-04 2021-06-15 湖南大学 Attention mechanism-based heterogeneous data pattern matching method
CN112965968B (en) * 2021-03-04 2023-10-24 湖南大学 Heterogeneous data pattern matching method based on attention mechanism
CN113360518A (en) * 2021-06-07 2021-09-07 哈尔滨工业大学 Hierarchical ontology construction method based on multi-source heterogeneous data
CN113360518B (en) * 2021-06-07 2023-03-21 哈尔滨工业大学 Hierarchical ontology construction method based on multi-source heterogeneous data
CN113392228A (en) * 2021-08-03 2021-09-14 广域铭岛数字科技有限公司 Abnormity prediction and tracing method, system, equipment and medium based on automobile production
CN113392228B (en) * 2021-08-03 2023-07-21 广域铭岛数字科技有限公司 Anomaly prediction and tracing method, system, equipment and medium based on automobile production
CN114625875A (en) * 2022-03-09 2022-06-14 平安科技(深圳)有限公司 Pattern matching method, device, storage medium and equipment for multi-data source information
CN114625875B (en) * 2022-03-09 2024-03-29 平安科技(深圳)有限公司 Pattern matching method, device, storage medium and equipment for multiple data source information
CN115757655A (en) * 2022-11-14 2023-03-07 中国兵器工业计算机应用技术研究所 Data blood relationship analysis system and method based on metadata management

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180327