WO2006026636A2

WO2006026636A2 - Metadata management

Info

Publication number: WO2006026636A2
Application number: PCT/US2005/030897
Authority: WO
Inventors: M'hamed Bouziane; Brian Dellert; David M. Kantor; Boris Krylov; Occhio G. Orsini; Cassio Dos Santos; Charles K. Shank; Mark R. Tucker; Hong Zhang
Original assignee: Ascential Software Corporation
Priority date: 2004-08-31
Filing date: 2005-08-31
Publication date: 2006-03-09
Also published as: EP1805645A2; WO2006026636A3; JP2008511928A; EP1805645A4; CN101040280A

Abstract

Metadata is managed (5202) and used in connection with data integration in an enterprise computing environment. An integrated, platform-independent approach to metadata management (5202) allows enterprise-wide access to data integration services (104) and underlying data, and facilitates reuse and redesign of tools and jobs in the data integration environment. Tools (5204) are provided for managing metadata, including maintaining versioned metadata models (5212) that may be branched and merged during a design cycle, and dynamically implemented across the enterprise. The platform-independent approach facilitates varied uses including implementations in heterogeneous hardware and software computing environments.

Description

IN THE UNITED STATES PATENT AND TRADEMARK OFFICE

A PCT APPLICATION FOR

METADATA MANAGEMENT

RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application Number 60/606,301, filed August 31, 2004 and entitled "Metadata Management".

BACKGROUND

1. Field.

This invention relates to the field of information technology, and more particularly to the field of data integration systems.

2. Description of the Related Art.

The advent of computer applications made many business processes much faster and more efficient; however, the proliferation of different computer applications that use different data structures, communication protocols, languages and platforms has led to great complexity in the information technology infrastructure of the typical business enterprise. Different business processes within the typical enterprise may use completely different computer applications, each computer application being developed and optimized for the particular business process, rather than for the enterprise as a whole. For example, a business may have a particular computer application for tracking accounts payable and a completely different one for keeping track of customer contacts. In fact, even the same business process may use more than one computer application, such as when an enterprise keeps a centralized customer contact database, but employees keep their own contact information, such as in a personal information manager.

While specialized computer applications offer the advantages of custom-tailored solutions, the proliferation leads to inefficiencies, such as repetitive entry and handling of the same data many times throughout the enterprise, or the failure of the enterprise to capitalize on data that is associated with one process when the enterprise executes another process that could benefit from that data. For example, if the accounts payable process is separated from the supply chain and ordering process, the enterprise may accept and fill orders from a customer whose credit history would have caused the enterprise to decline the order. Many other examples can be provided where an enterprise would benefit from consistent access to all of its data across varied computer applications.

A number of companies have recognized and addressed the need for sharing of data across different applications in the business enterprise. Thus, enterprise application integration, or EAI, has emerged as a message-based strategy for addressing data from disparate sources. As computer applications increase in complexity and number, EAI efforts encounter many challenges, ranging from the need to handle different protocols, to the need to address ever-increasing volumes of data and numbers of transactions, to an ever-increasing appetite for faster integration of data. Various approaches to EAI have been taken, including least-common-denominator approaches, atomic approaches, and bridge-type approaches. However, EAI is based upon communication between individual applications. As a significant disadvantage, the complexity of EAI solutions grows geometrically in response to linear additions of platforms and applications.

While data integration systems provide useful tools for addressing the needs of an enterprise, such systems are typically deployed as custom solutions. They have a lengthy development cycle, and may require sophisticated technical training to accommodate changes in business structure and information requirements. There remains a need for data integration system tools that permit use, reuse, and modification of functionality in a changing business environment. More particularly, there remains a need for more flexible metadata management tools for use with data integration in an enterprise computing environment.

SUMMARY

Provided herein are methods and systems for managing and using metadata in connection with data integration in an enterprise computing environment. An integrated, platform-independent approach to metadata management may allow enterprise-wide access to data integration services and underlying data, and facilitate reuse and redesign of tools and jobs in the data integration environment. Tools are provided for managing metadata, including maintaining versioned metadata models that may be branched and merged during a design cycle, and dynamically implemented across the enterprise. The platform-independent approach may facilitate varied uses including implementations in heterogeneous hardware and software computing environments.

In one aspect, a method described herein includes: expressing a query in terms native to a first model; translating the query into terms native to a second model using mapping information that describes one or more relationships between the first model and the second model; and translating the query into a native data source format. In another aspect, a system includes means for expressing a query in terms native to a first model; a mapping model that translates the query into terms native to a second model using mapping information that describes one or more relationships between the first model and the second model; and means for translating the query into a native data source format in which the query is executed against the data source.

The mapping information may be queried. The mapping information may be available during the translating steps. The first model may be a view. The second model may be a hub. The data source may be a database. The database may store metadata for one or more data sources. The database may store a persistent model representing enterprise metadata. The database may be a relational database and/or a file. The method may be performed in an enterprise computing system, or the system may be within an enterprise computing system. The method may be performed in a data integration system, or the system may be within a data integration system. The terms native to the first model may include a syntax native to an external client. The first model may be a view for a user interface. The method may further include displaying a result of the query in the user interface, or the system may include a user interface for displaying the result of the query. The first model may be a view for a service. The service may include a data integration system service. The service may include a remote tool and/or a real time integration service. At least one of the first model and the second model may be a metadata model stored in a repository. The method may further include translating a result of the query into the first model with a translation tool, or the system may include a corresponding translation tool. The translation tool may be stored in a repository.
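By way of non-limiting illustration, the two translation steps described above (view terms to hub terms, then hub terms to a native data source format) might be sketched as follows. This is a minimal sketch only: the `Query` class, the `VIEW_TO_HUB` and `HUB_TO_SQL` dictionaries, and all table and column names are hypothetical and do not appear in this disclosure, and a relational database with SQL as its native format is assumed.

```python
from dataclasses import dataclass

# Hypothetical mapping information: view-model terms -> hub-model terms.
VIEW_TO_HUB = {"Customer": "Party", "name": "displayName", "region": "geoCode"}
# Hypothetical mapping information: hub-model terms -> relational names.
HUB_TO_SQL = {"Party": "party_tbl", "displayName": "display_name", "geoCode": "geo_code"}


@dataclass(frozen=True)
class Query:
    entity: str
    fields: tuple
    filter_field: str
    filter_value: str


def translate(query, mapping):
    """Rewrite every model term in the query using the supplied mapping."""
    return Query(
        entity=mapping[query.entity],
        fields=tuple(mapping[f] for f in query.fields),
        filter_field=mapping[query.filter_field],
        filter_value=query.filter_value,
    )


def to_native_sql(hub_query):
    """Translate a hub-model query into a parameterized SQL statement."""
    q = translate(hub_query, HUB_TO_SQL)
    columns = ", ".join(q.fields)
    sql = f"SELECT {columns} FROM {q.entity} WHERE {q.filter_field} = ?"
    return sql, (q.filter_value,)


# A query expressed in terms native to the first model (a view).
view_query = Query("Customer", ("name", "region"), "region", "EMEA")
# Step 1: translate into terms native to the second model (the hub).
hub_query = translate(view_query, VIEW_TO_HUB)
# Step 2: translate into the native data source format.
sql, params = to_native_sql(hub_query)
```

Because the mapping information is ordinary data in this sketch, it can itself be inspected or queried while translation is in progress, consistent with the options listed above.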

In another aspect, a method as described herein may include: registering a metadata model with a repository; associating a first storage mechanism with one or more design properties of the metadata model; and associating a second storage mechanism with one or more operational properties of the metadata model, wherein the second storage mechanism stores a time stamp for at least one of the one or more operational properties of the metadata model.

In the method, the first storage mechanism may be a versioned storage mechanism that stores one or more versions of at least one of the one or more design properties of the metadata model. The method may further include annotating the one or more design properties and the one or more operational properties of the metadata model to associate them with either the first storage mechanism or second storage mechanism. The method may further include providing a package structure to allocate the one or more design properties and the one or more operational properties of the metadata model between the first storage mechanism and the second storage mechanism. The method may further include providing a manifest associated with the metadata model to allocate the one or more design properties and the one or more operational properties of the metadata model between the first storage mechanism and the second storage mechanism. The method may further include registering the operational properties as a first model and registering the design properties as a second model. The metadata model may be queried across the one or more operational properties and the one or more design properties. The method may further include registering one or more mappings with the metadata model, the one or more mappings describing a relationship of the metadata model to one or more other metadata models.

In another aspect, a system may include: a repository including a registered metadata model; a first storage mechanism within the repository, the first storage mechanism associated with one or more design properties of the metadata model; and a second storage mechanism within the repository, the second storage mechanism associated with one or more operational properties of the metadata model, and the second storage mechanism adapted to store a time stamp for at least one of the one or more operational properties of the metadata model. The first storage mechanism may be a versioned storage mechanism that stores one or more versions of at least one of the one or more design properties of the metadata model. The system may include annotations to associate the one or more design properties of the metadata model and the one or more operational properties of the metadata model with either the first storage mechanism or second storage mechanism. The system may include a package structure to allocate the one or more design properties and the one or more operational properties of the metadata model between the first storage mechanism and the second storage mechanism. The system may include a manifest associated with the metadata model to allocate the one or more design properties and the one or more operational properties of the metadata model between the first storage mechanism and the second storage mechanism. The operational properties may be registered as a first model and the design properties may be registered as a second model. The metadata model may be queried across the one or more operational properties and the one or more design properties. The system may further include one or more mappings registered with the metadata model, the one or more mappings describing a relationship of the metadata model to one or more other metadata models.
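By way of non-limiting illustration, the allocation of design properties to a versioned storage mechanism and operational properties to a time-stamping storage mechanism might be sketched as follows. The class names (`VersionedStore`, `TimestampedStore`, `Repository`), the annotation values `"design"` and `"operational"`, and the sample model `"Job"` are all hypothetical; annotation by a simple dictionary is only one of the allocation options (annotations, package structure, manifest) mentioned above.

```python
from datetime import datetime, timezone


class VersionedStore:
    """First storage mechanism: keeps every version of a design property."""

    def __init__(self):
        self._history = {}

    def put(self, key, value):
        self._history.setdefault(key, []).append(value)
        return len(self._history[key])  # 1-based version number

    def get(self, key, version=None):
        versions = self._history[key]
        return versions[-1] if version is None else versions[version - 1]


class TimestampedStore:
    """Second storage mechanism: keeps the latest value plus a time stamp."""

    def __init__(self):
        self._latest = {}

    def put(self, key, value):
        self._latest[key] = (value, datetime.now(timezone.utc))

    def get(self, key):
        return self._latest[key]


class Repository:
    def __init__(self):
        self.stores = {"design": VersionedStore(), "operational": TimestampedStore()}
        self._annotations = {}

    def register_model(self, model, annotations):
        """annotations maps each property name to 'design' or 'operational'."""
        for prop, kind in annotations.items():
            if kind not in self.stores:
                raise ValueError(f"unknown storage kind: {kind}")
            self._annotations[(model, prop)] = kind

    def _store_for(self, model, prop):
        return self.stores[self._annotations[(model, prop)]]

    def put(self, model, prop, value):
        return self._store_for(model, prop).put((model, prop), value)

    def get(self, model, prop):
        return self._store_for(model, prop).get((model, prop))


repo = Repository()
repo.register_model("Job", {"schema": "design", "last_run_rows": "operational"})
repo.put("Job", "schema", "id, name")
repo.put("Job", "schema", "id, name, region")
repo.put("Job", "last_run_rows", 1200)
```

In this sketch the earlier design value remains retrievable by version number, while the operational value is overwritten and carries only the time of its last update.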

In another aspect, a method for persisting a model includes: registering a first model; identifying a second model and a mapping of at least one property of the first model to the second model; and persisting the mapping of the at least one property of the first model to the second model.

The method may include identifying at least one other property of the first model not mapped to the second model; and persisting the at least one other property of the first model. The first model may include a plurality of classes. The second model may include a plurality of classes. The method may include providing a storage mechanism for persisting the mapping of the at least one property of the first model to the second model that is a reflective storage mechanism. The method may further include defining a schema for representing metadata models in a relational database, and using the schema to persist the mapping of the at least one property of the first model to the second model. The method may further include revising the first model by changing the schema, by changing one or more properties in the relational database, and/or by changing the mapping. The first model and the second model may be metadata models.

In another aspect, a system for persisting a model may include: a mapping of at least one property of a first model to a second model; and a repository for registering the first model, the repository configured to persist the mapping of the at least one property of the first model to the second model.

At least one other property of the first model may not be mapped to the second model, and the repository may be configured to persist the at least one other property of the first model. The first model and/or the second model may each include a plurality of classes. The system may further include a storage mechanism for persisting the mapping of the at least one property of the first model to the second model, the storage mechanism including a reflective storage mechanism. The system may further include a schema for representing metadata models in a relational database, the schema persisting the mapping of the at least one property of the first model to the second model. The first model may be revised by changing the schema, by changing one or more properties in the relational database, and/or by changing the mapping. The first model and the second model may be metadata models.
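By way of non-limiting illustration, persisting a property mapping through a generic relational schema, and identifying properties of the first model that are not mapped, might be sketched as follows. The table names (`model_property`, `property_mapping`), column names, and sample models are hypothetical, and an in-memory SQLite database stands in for whatever relational database an embodiment would use.

```python
import sqlite3

# Hypothetical generic schema: models are described as rows rather than as
# model-specific tables, so new models need no schema change.
SCHEMA = """
CREATE TABLE model_property (
    model TEXT, class_name TEXT, prop_name TEXT,
    PRIMARY KEY (model, class_name, prop_name)
);
CREATE TABLE property_mapping (
    src_model TEXT, src_class TEXT, src_prop TEXT,
    dst_model TEXT, dst_class TEXT, dst_prop TEXT
);
"""


def register_model(db, model, classes):
    """Register a model given as {class name: [property names]}."""
    rows = [(model, cls, prop) for cls, props in classes.items() for prop in props]
    db.executemany("INSERT INTO model_property VALUES (?, ?, ?)", rows)


def persist_mapping(db, src, dst):
    """Persist a mapping from one (model, class, property) triple to another."""
    db.execute("INSERT INTO property_mapping VALUES (?, ?, ?, ?, ?, ?)", (*src, *dst))


def unmapped_properties(db, model):
    """Return (class, property) pairs of the model that have no mapping."""
    cur = db.execute(
        """
        SELECT p.class_name, p.prop_name
        FROM model_property p
        LEFT JOIN property_mapping m
          ON m.src_model = p.model
         AND m.src_class = p.class_name
         AND m.src_prop = p.prop_name
        WHERE p.model = ? AND m.src_model IS NULL
        ORDER BY p.class_name, p.prop_name
        """,
        (model,),
    )
    return cur.fetchall()


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
register_model(db, "view", {"Report": ["title", "owner"]})
register_model(db, "hub", {"Asset": ["displayName"]})
persist_mapping(db, ("view", "Report", "title"), ("hub", "Asset", "displayName"))
remaining = unmapped_properties(db, "view")
```

Here `remaining` lists the properties of the first model that would need to be persisted separately because no mapping to the second model exists for them.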

In another aspect, a model driven metadata transformation architecture may include: a plurality of translation engines that use one or more model-to-model mappings to translate between one or more models; and a translation registry for dynamically selecting one of the plurality of translation engines. The translation engines may include one or more of a compiled language engine, an interpreted language engine, or an interpreted mapping engine. The model-to-model mappings may be between a hub and one or more views in a hub-and-spoke architecture. The one or more model-to-model mappings may be user configurable. One of the model-to-model mappings may be configured after the corresponding models have been deployed. One of the model-to-model mappings may be repeated in a plurality of translation engines for translation between a hub and a plurality of identical views. Different model-to-model mappings may be realized in a plurality of translation engines for translation between a hub and a plurality of different views.

In another aspect, a method for transforming metadata between models includes: receiving a request to translate metadata between a first model and a second model; retrieving a model-to-model mapping characterizing a translation between the first model and the second model; and translating the metadata from the first model to the second model using the model-to-model mapping.

The model-to-model mapping may include one or more of a compiled language, an interpreted language, or a mapping adapted for translation by a translation engine. The model-to-model mapping may be between a hub and a view in a hub-and-spoke architecture. The method may further include providing a user interface for configuring the model-to-model mapping. The method may further include storing the model-to-model mapping in a registry for dynamic access. The method may further include configuring the model-to-model mapping after at least one of the first model and the second model has been deployed. The model-to-model mapping may be used concurrently by a plurality of translation engines for translation between a hub and a plurality of identical views. The method may further include registering a plurality of different model-to-model mappings wherein the different model-to-model mappings are used concurrently by a plurality of translation engines for translation between a hub and a plurality of different views.
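By way of non-limiting illustration, a translation registry that dynamically selects a translation engine for a requested pair of models might be sketched as follows. The class names, the model names `"hub"` and `"report_view"`, and the attribute names are hypothetical; the `CompiledEngine` class merely wraps an ordinary function as a stand-in for a compiled-language engine.

```python
class InterpretedMappingEngine:
    """Applies a declarative attribute-rename mapping at run time."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def translate(self, metadata):
        # Attributes without a mapping entry pass through unchanged.
        return {self.mapping.get(k, k): v for k, v in metadata.items()}


class CompiledEngine:
    """Stand-in for a compiled-language engine: wraps an ordinary function."""

    def __init__(self, func):
        self.func = func

    def translate(self, metadata):
        return self.func(metadata)


class TranslationRegistry:
    def __init__(self):
        self._engines = {}

    def register(self, source, target, engine):
        # Registering again for the same pair replaces the mapping, which
        # models reconfiguration after the models have been deployed.
        self._engines[(source, target)] = engine

    def translate(self, source, target, metadata):
        try:
            engine = self._engines[(source, target)]
        except KeyError:
            raise LookupError(f"no mapping registered from {source} to {target}") from None
        return engine.translate(metadata)


registry = TranslationRegistry()
registry.register("hub", "report_view", InterpretedMappingEngine({"displayName": "title"}))
registry.register("report_view", "hub", CompiledEngine(lambda m: {"displayName": m["title"]}))

result = registry.translate("hub", "report_view", {"displayName": "Orders", "id": 7})
```

Callers in this sketch name only the source and target models; which kind of engine performs the translation is decided by the registry at the time of the request.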

In one aspect, a method of managing metadata disclosed herein includes: organizing an object-oriented metadata model into an operational model that includes operational properties and a design model that includes design properties; storing the operational model in an operational repository; and storing the design model in a common repository. The method may further include time-stamping at least one item of metadata for the operational model.

The common repository may support more than one version of the design model. The method may further include providing a metadata environment for user interaction with the model. The user environment may include a workspace for editing the model. The workspace may be exclusive to a user and/or shared. The metadata environment may include a team space. The team space may support versioning of metadata instances. The metadata environment may reside locally on a user computer or on a remote server accessible to a user computer. The method may include dynamically comparing one or more different versions of the design model in the common repository. The common repository may support branching of versions of the design model. The method may include reconciling a plurality of versions of the design model and/or dynamically reconciling a plurality of versions of the design model. The method may include using the metadata model in a metadata service by asynchronously calling the metadata model through a message-oriented service, and/or using the metadata model in a metadata service by synchronously calling the metadata model through an application programming interface.
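By way of non-limiting illustration, a common repository that stores multiple versions of a design model, supports branching, and compares versions on request might be sketched as follows. The `CommonRepository` class, its method names, the branch names, and the property names are hypothetical, and whole-model snapshots are assumed purely for brevity.

```python
class CommonRepository:
    """Stores design-model versions as snapshots, one list per branch."""

    def __init__(self):
        self.branches = {"main": [{}]}  # version 0 of "main" is the empty model

    def commit(self, branch, changes):
        """Record a new version on the branch and return its version number."""
        snapshot = {**self.branches[branch][-1], **changes}
        self.branches[branch].append(snapshot)
        return len(self.branches[branch]) - 1

    def branch(self, source, new):
        """Start a new branch from the latest version of the source branch."""
        self.branches[new] = [dict(self.branches[source][-1])]

    def compare(self, left, right):
        """Compare two (branch, version) pairs; return the differing properties."""
        a = self.branches[left[0]][left[1]]
        b = self.branches[right[0]][right[1]]
        return {
            key: (a.get(key), b.get(key))
            for key in sorted(a.keys() | b.keys())
            if a.get(key) != b.get(key)
        }


repo = CommonRepository()
v_main = repo.commit("main", {"Customer.name": "string"})
repo.branch("main", "redesign")
v_redesign = repo.commit("redesign", {"Customer.name": "varchar(80)", "Customer.tier": "int"})
diff = repo.compare(("main", v_main), ("redesign", v_redesign))
```

The resulting `diff` is the kind of input a reconciliation step could use when merging the branch back, although merging itself is not shown here.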

The method may include concurrently executing a service that uses the metadata model, and/or using parallelism to execute a service that uses the model.

A system for managing metadata as described herein may include: an object-oriented metadata model including an operational model having one or more operational properties of the metadata model and a design model having one or more design properties of the metadata model; an operational repository that stores the operational model; and a common repository that stores the design model.

At least one item of metadata from the operational model may be time-stamped. The common repository may support more than one version of the design model. The system may include a metadata environment for user interaction with the model. The user environment may include a workspace for editing the model. The workspace may be exclusive to a user, or shared. The metadata environment may include a team space. The team space may support versioning of metadata instances. The metadata environment may reside locally on a user computer, or on a remote server. The common repository may support dynamic comparison of one or more different versions of the design model. The common repository may support branching of versions of the design model. The common repository may support reconciliation of a plurality of versions of the design model. The common repository may support dynamic reconciliation of a plurality of versions of the design model. The system may include a metadata service that uses the metadata model by asynchronously calling the metadata model through a message-oriented metadata service, and/or a metadata service that uses the metadata model by synchronously calling the metadata model through an application programming interface. The metadata model may be used in a service that executes at least one of concurrently or in parallel.

A method for reconciling metadata as disclosed herein may include: associating a reconciliation zone property with a metadata object, the reconciliation zone property identifying a reconciliation zone characterized by a common set of reconciliation rules; and reconciling a plurality of instances of the metadata object according to the common set of reconciliation rules to provide a reconciled instance of the metadata object within the reconciliation zone.

The method may include defining a second reconciliation zone for reconciling the reconciled instance of the metadata object with one or more additional instances of the metadata object. The reconciliation zone may include instances of a plurality of metadata objects. The method may further include associating a match type with the reconciliation zone property, the match type defining a treatment of the instance of the metadata object. The method may further include associating an identity with the instance of the metadata object that uniquely identifies the instance of the metadata object within the reconciliation zone. The method may further include providing a reconciliation lineage for a metadata object. The reconciliation lineage may describe a path through one or more reconciliation zones, identify one or more data sources, identify one or more reconciliation rules, and/or include a history of instances of the metadata object.

In another aspect, a system for reconciling metadata as described herein may include: a reconciliation zone characterized by a common set of reconciliation rules; a plurality of instances of a metadata object including a reconciliation zone property that associates each one of the plurality of instances of the metadata object with the reconciliation zone; and a reconciliation engine that generates a reconciled instance of the metadata object within the reconciliation zone by reconciling the plurality of instances of the metadata object according to the common set of reconciliation rules for the reconciliation zone with which the plurality of instances of the metadata object are associated. The system may include a second reconciliation zone for reconciling the reconciled instance of the metadata object with one or more additional instances of the metadata object. The reconciliation zone may include instances of a plurality of metadata objects. A match type may define a treatment of the instances of the metadata object within the reconciliation zone. An identity associated with each instance of the metadata object may uniquely identify that instance of the metadata object within the reconciliation zone. A reconciliation lineage may be provided for a metadata object. The reconciliation lineage may describe a path through one or more reconciliation zones, identify one or more data sources, identify one or more reconciliation rules, and/or include a history of instances of the metadata object.
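By way of non-limiting illustration, reconciling several instances of a metadata object within a single reconciliation zone, while recording a reconciliation lineage, might be sketched as follows. The `Instance` class, the zone name `"dept"`, the source names, and the single rule used here (the higher-priority source wins for each attribute) are hypothetical; an actual zone could apply any common set of reconciliation rules.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    identity: str  # uniquely identifies the object within its zone
    zone: str      # the reconciliation zone property
    source: str
    attributes: dict
    lineage: list = field(default_factory=list)


def reconcile(instances, zone, source_priority):
    """Merge same-identity instances in one zone.

    Assumed rule: source_priority lists sources from highest to lowest
    priority, and the higher-priority source wins for each attribute.
    """
    in_zone = [i for i in instances if i.zone == zone]
    # Apply lowest-priority sources first so later updates overwrite them.
    in_zone.sort(key=lambda i: source_priority.index(i.source), reverse=True)
    reconciled = {}
    for inst in in_zone:
        target = reconciled.setdefault(
            inst.identity, Instance(inst.identity, zone, "reconciled", {})
        )
        target.attributes.update(inst.attributes)
        # The lineage records which zone and source contributed, in order.
        target.lineage.extend(inst.lineage + [(zone, inst.source)])
    return list(reconciled.values())


a = Instance("CUST_TABLE", "dept", "modeling_tool", {"owner": "finance", "columns": 12})
b = Instance("CUST_TABLE", "dept", "db_catalog", {"columns": 14})
other_zone = Instance("CUST_TABLE", "enterprise", "db_catalog", {"columns": 99})

result = reconcile([a, b, other_zone], "dept", ["db_catalog", "modeling_tool"])
```

A second reconciliation zone could be modeled in this sketch by re-labelling the reconciled instance with the next zone and calling `reconcile` again, with the accumulated lineage carried forward.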

In another aspect, a method for providing concurrency for metadata services for a data integration system may include: dividing a metadata service into a stream of objects; identifying a cluster of the objects having primarily internal references based upon metadata for the objects; executing the cluster of objects on a single one of a plurality of processors; identifying at least one object outside the cluster of objects; and executing the at least one object on another one of the plurality of processors.

The objects may include at least one metadata model. The processors may be on physically separate hardware.

The service may include a reconciliation process that resolves metadata conflicts. The objects may include a metadata import. The primarily internal references may be identified using graphs of data dependencies. The service may be organized as a pipeline for concurrency. The pipeline may include at least an identify objects phase, a fetch candidates phase, a reconcile phase, a merge phase, and a store phase.
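By way of non-limiting illustration, grouping objects into clusters by their references and dispatching each cluster to a separate worker might be sketched as follows. The function names and sample object names are hypothetical. Two simplifications are assumed and should be noted: clusters are computed as connected components of the reference graph (so references are entirely, rather than primarily, internal), and threads in a pool stand in for separate processors.

```python
from concurrent.futures import ThreadPoolExecutor


def clusters(references):
    """Group objects into connected components of the reference graph.

    references maps each object name to the names of objects it refers to.
    """
    neighbours = {obj: set(refs) for obj, refs in references.items()}
    for obj, refs in references.items():
        for ref in refs:
            neighbours.setdefault(ref, set()).add(obj)  # treat edges as undirected
    seen, result = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        result.append(sorted(component))
    return result


def process_cluster(cluster):
    """Placeholder for the per-cluster work, such as reconcile, merge, store."""
    return [f"reconciled:{obj}" for obj in cluster]


def run(references, workers=2):
    """Process each cluster on its own worker; results keep cluster order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_cluster, clusters(references)))


references = {
    "table_a": ["col_a1", "col_a2"],
    "col_a1": [],
    "col_a2": [],
    "job_x": ["stage_x"],
    "stage_x": [],
}
grouped = clusters(references)
output = run(references)
```

Keeping each cluster on a single worker means objects that refer to one another are processed together, so no cross-worker coordination is needed for those references in this sketch.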

In other aspects, a computer program product may include a computer useable medium including computer readable program code, wherein the computer readable program code when executed on one or more computers causes the one or more computers to perform any one or more of the methods above.

"International Business Machines" or "IBM" as used herein shall refer to International Business Machines Corporation of Armonk, New York.

As used herein, "data source" or "data target" are intended to have the broadest possible meaning consistent with these terms, and shall include a database, a plurality of databases, a repository information manager, a queue, a message service, a repository, a data facility, a data storage facility, a data provider, a website, a server, a computer, a computer storage facility, a CD, a DVD, a mobile storage facility, a central storage facility, a hard disk, multiple coordinating data storage facilities, RAM, ROM, flash memory, a memory card, a temporary memory facility, a permanent memory facility, magnetic tape, a locally connected computing facility, a remotely connected computing facility, a wireless facility, a wired facility, a mobile facility, a central facility, a web browser, a client, a laptop, a personal digital assistant ("PDA"), a telephone, a cellular phone, a mobile phone, an information platform, an analysis facility, a processing facility, a business enterprise system or other facility where data is handled or other facility provided to store data or other information, as well as any files or file types for maintaining structured or unstructured data used in any of the above systems, or any streaming, messaged, event driven, or otherwise sourced data, and any combinations of the foregoing, unless a specific meaning is otherwise indicated or the context of the phrase requires otherwise. A storage mechanism is any physical or logical device, resource, or facility capable of serving as a data source or a data target, or otherwise storing data in a retrievable form.

"Enterprise Java Bean (EJB)" shall include the server-side component architecture for the J2EE platform. EJBs support rapid and simplified development of distributed, transactional, secure and portable Java applications. EJBs support a container architecture that allows concurrent consumption of messages and provide support for distributed transactions, so that database updates, message processing, and connections to enterprise systems using the J2EE architecture can participate in the same transaction context.

"JMS" shall mean the Java Message Service, which is an enterprise message service for the Java-based J2EE enterprise architecture. "JCA" shall mean the J2EE Connector Architecture of the J2EE platform described more particularly below. It should be appreciated that, while EJB, JMS, and JCA are commonly used software tools in contemporary distributed transaction environments, any platform, system, or architecture providing similar functionality may be employed with the data integration systems described herein.

"Real time" as used herein, shall include periods of time that approximate the duration of a business transaction or business and shall include processes or services that occur during a business operation or business process, as opposed to occurring off-line, such as in a nightly batch processing operation. Depending on the duration of the business process, real time might include seconds, fractions of seconds, minutes, hours, or even days.

"Business process," "business logic" and "business transaction" as used herein, shall include any methods, service, operations, processes or transactions that can be performed by a business, including, without limitation, sales, marketing, fulfillment, inventory management, pricing, product design, professional services, financial services, administration, finance, underwriting, analysis, contracting, information technology services, data storage, data mining, delivery of information, routing of goods, scheduling, communications, investments, transactions, offerings, promotions, advertisements, offers, engineering, manufacturing, supply chain management, human resources management, data processing, data integration, work flow administration, software production, hardware production, development of new products, research, development, strategy functions, quality control and assurance, packaging, logistics, customer relationship management, handling rebates and returns, customer support, product maintenance, telemarketing, corporate communications, investor relations, and many others.

"Service oriented architecture (SOA)", as used herein, shall include services that form part of the infrastructure of a business enterprise. In the SOA, services can become building blocks for application development and deployment, allowing rapid application development and avoiding redundant code. Each service may embody a set of business logic or business rules that can be bound to the surrounding environment, such as the source of the data inputs for the service or the targets for the data outputs of the service. Various instances of SOA are provided in the following description.

"Metadata," as used herein, shall include data that brings context to the data being processed, data about the data, information pertaining to the context of related information, information pertaining to the origin of data, information pertaining to the location of data, information pertaining to the meaning of data, information pertaining to the age of data, information pertaining to the heading of data, information pertaining to the units of data, information pertaining to the field of data, and/or information pertaining to any other information relating to the context of the data.

"WSDL" or "Web Services Description Language" as used herein, includes an XML format for describing network services (often web services) as a set of endpoints operating on messages containing either document-oriented or procedure-oriented information. The operations and messages are described abstractly, and then bound to a concrete network protocol and message format to define an endpoint. Related concrete endpoints are combined into abstract endpoints (services). WSDL is extensible to allow description of endpoints and their messages regardless of what message formats or network protocols are used to communicate.

BRIEF DESCRIPTION OF THE FIGURES

Fig. 1 is a schematic diagram of a business enterprise with a plurality of business processes, each of which may include a plurality of different computer applications and data sources.

Fig. 2 is a schematic diagram showing data integration across a plurality of business processes of a business enterprise.

Fig. 3 is a schematic diagram showing an architecture for providing data integration for a plurality of data sources for a business enterprise.

Fig. 4 shows an architecture for a metadata management system.

Fig. 5 shows communication through a view model and data model to query a database.

Fig. 6 shows a translation engine being accessed to translate a query result for a view model.

Fig. 7 shows a translation engine being accessed to translate a query result for an external service.

Fig. 8 shows a static model mapping.

Fig. 9 shows an extensible model mapping.

Fig. 10 shows a combination of model mappings.

Fig. 11 depicts an architecture that exposes a plurality of internal services to external metadata.

Fig. 12 depicts a mapped-model driven transformation of metadata.

Fig. 13 shows interaction with a metadata environment.

Fig. 14 shows a common repository storing a plurality of versions of metadata.

Fig. 15 depicts a client dynamically comparing metadata versions in a versioned repository.

Fig. 16A shows a process of metadata reconciliation.

Fig. 16B depicts phased reconciliation across reconciliation zones.

Fig. 17 depicts reconciliation of versioned metadata objects.

Fig. 18 shows an example of the use of concurrency in a metadata process.

Fig. 19 is a diagram of entities involved in a query process from a user interface 6702 to a metadata database 6712.

Fig. 20 shows the entities involved in a process of extending a metadata database from a metadata model.

Fig. 21 shows the entities involved in a process for accessing a repository from a tool.

Fig. 22 shows the entities involved in a process by which a tool accesses versioned and unversioned metadata models.

Fig. 23 shows the entities involved in a process by which a user interface accesses multiple versions of metadata in a common repository.

Fig. 24 shows the entities involved in a reconciliation process for versions of metadata.

Fig. 25 shows the entities involved in a reconciliation process using concurrency.

DETAILED DESCRIPTION

Throughout the following discussion, like element numerals are intended to refer to like elements, unless specifically indicated otherwise.

The invention(s) disclosed herein can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. Furthermore, the invention(s) can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk - read only memory (CD-ROM), compact disk - read/write (CD-R/W) and DVD.

A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.

Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.

Fig. 1 represents a platform 100 for facilitating integration of various data of a business enterprise. The platform includes a plurality of business processes, each of which may include a plurality of different computer applications and data sources. The platform may include several data sources 102, which may be data sources such as those described above. These data sources may include a wide variety of data types from a wide variety of physical locations. For example, the data source may include systems from providers such as Sybase, Microsoft, Informix, Oracle, Infomover, EMC, Trillium, First Logic, Siebel, PeopleSoft, IBM, Apache, or Netscape. The data sources 102 may include systems using database products or standards such as IMS, DB2, ADABAS, VSAM, MD Series, UDB, XML, complex flat files, or FTP files. The data sources 102 may include files created or used by applications such as Microsoft Outlook, Microsoft Word, Microsoft Excel, Microsoft Access, as well as files in standard formats such as ASCII, CSV, GIF, TIF, PNG, PDF, and so forth. The data sources 102 may come from various locations or they may be centrally located. The data supplied from the data sources 102 may come in various forms and have different formats that may or may not be compatible with one another.

Data targets are discussed later in this description. In general, these data targets may be any of the data sources 102 noted above. This difference in nomenclature typically denotes whether a data system provides data or receives data in a data integration process. However, it should be appreciated that this distinction is not intended to convey any difference in capability between data sources and data targets (unless specifically stated otherwise), since in a conventional data integration system, data sources may receive data and data targets may provide data.

The platform illustrated in Fig. 1 also includes a data integration system 104. The data integration system may, for example, facilitate the collection of data from the data sources 102 as the result of a query or retrieval command the data integration system 104 receives. The data integration system 104 may send commands to one or more of the data sources 102 such that the data source(s) provides data to the data integration system 104. Since the data received may be in multiple formats including varying metadata, the data integration system may reconfigure the received data such that it can be later combined for integrated processing. The functions that may be performed by the data integration system 104 are described in more detail below.

The platform 100 also includes several retrieval systems 108. The retrieval systems 108 may include databases or processing platforms used to further manipulate the data communicated from the data integration system 104. For example, the data integration system 104 may cleanse, combine, transform or otherwise manipulate the data it receives from the data sources 102 such that a retrieval system 108 can use the processed data to produce reports 110 useful to the business. The reports 110 may be used to report data associations, answer complex queries, answer simple queries, or form other reports useful to the business or user, and may include raw data, tables, charts, graphs, and any other representations of data from the retrieval systems 108.

The platform 100 may also include a database or database management system 112. The database 112 may be used to store information temporally, temporarily, or for permanent or long-term storage. For example, the data integration system 104 may collect data from one or more data sources 102 and transform the data into forms that are compatible with one another or compatible to be combined with one another. Once the data is transformed, the data integration system 104 may store the data in the database 112 in a decomposed form, combined form or other form for later retrieval.

Fig. 2 is a schematic diagram showing data integration across a plurality of entities and business processes of a business enterprise. In the illustrated embodiment, the data integration system 104 facilitates the information flowing between user interface systems 202 and data sources 102. The data integration system 104 may receive queries from the interface systems 202, where the queries necessitate the extraction and possibly transformation of data residing in one or more of the data sources 102. The interface systems 202 may include any device or program for communicating with the data integration system 104, such as a web browser operating on a laptop or desktop computer, a cell phone, a personal digital assistant ("PDA"), a networked platform and devices attached thereto, or any other device or system that might interface with the data integration system 104.

For example, a user may be operating a PDA and make a request for information to the data integration system 104 over a WiFi or Wireless Application Protocol/Wireless Markup Language ("WAP/WML") interface. The data integration system 104 may receive the request and generate any required queries to access information from a website or other data source 102 such as an FTP file site. The data from the data sources 102 may be extracted and transformed into a format compatible with the requesting interface system 202 (a PDA in this example) and then communicated to the interface system 202 for user viewing and manipulation. In another embodiment, the data may have previously been extracted from the data sources and stored in a separate database 112, which may be a data warehouse or other data facility used by the data integration system 104. The data may have been stored in the database 112 in a transformed condition or in its original state. For example, the data may be stored in a transformed condition such that the data from a number of data sources 102 can be combined in another transformation process. In that case, a query from the PDA may be transmitted to the data integration system 104 and the data integration system 104 may extract the information from the database 112. Following the extraction, the data integration system 104 may transform the data into a combined format compatible with the PDA before transmission to the PDA.

Fig. 3 is a schematic diagram showing an architecture for providing data integration for a plurality of data sources 102 for a business enterprise. An embodiment of a data integration system 104 may include a discover data stage 302 to perform, possibly among other processes, extraction of data from a data source and analysis of column values and table structures for source data. A discover data stage 302 may also generate recommendations about table structure, relationships, and keys for a data target. More sophisticated profiling and auditing functions may include date range validation, accuracy of computations, accuracy of if-then evaluations, and so forth. The discover data stage 302 may normalize data, such as by eliminating redundant dependencies and other anomalies in the source data. The discover data stage 302 may provide additional functions, such as drill down to exceptions within a data source 102 for further analysis, or enabling direct profiling of mainframe data. A non-limiting example of a commercial embodiment of a discover data stage 302 may be found in IBM's Websphere ProfileStage product.

The data integration system 104 may also include a data preparation stage 304 where the data is prepared, standardized, matched, or otherwise manipulated to produce quality data to be later transformed. The data preparation stage 304 may perform generic data quality functions, such as reconciling inconsistencies or checking for correct matches (including one-to-one matches, one-to-many matches, and deduplication) within data. The data preparation stage 304 may also provide specific data enhancement functions. For example, the data preparation stage 304 may ensure that addresses conform to multinational postal references for improved international communication. The data preparation stage 304 may conform location data to multinational geocoding standards for spatial information management. The data preparation stage may modify or add to addresses to ensure that address information qualifies for U.S. Postal Service mail rate discounts under Government Certified U.S. Address Correction. Similar analysis and data revision may be provided for Canadian and Australian postal systems, which provide discount rates for properly addressed mail. A non-limiting example of a commercial embodiment of a data preparation stage 304 may be found in IBM's Websphere QualityStage product.

The data integration system may also include a data transformation stage 308 to transform, enrich and deliver transformed data. The data transformation stage 308 may perform transitional services such as reorganization and reformatting of data, and perform calculations based on business rules and algorithms of the system user. The data transformation stage 308 may also organize target data into subsets known as datamarts or cubes for more highly tuned processing of data in certain analytical contexts. The data transformation stage 308 may employ bridges, translators, or other interfaces (as discussed generally below) to span various software and hardware architectures of various data sources and data targets used by the data integration system 104. The data transformation stage 308 may include a graphical user interface, a command line interface, or some combination of these, to design data integration jobs across the platform 100. A non-limiting example of a commercial embodiment of a data transformation stage 308 may be found in IBM's Websphere DataStage product. The stages 302, 304, 308 of the data integration system 104 may be executed using a parallel execution system 310 or in a serial or combination manner to optimize the performance of the system 104.

The data integration system 104 may also include a metadata management system 312 for managing metadata associated with data sources 102. In general, the metadata management system 312 may provide for interchange, integration, management, and analysis of metadata across all of the tools in a data integration environment. For example, a metadata management system 312 may provide common, universally accessible views of data in disparate sources, such as IBM's Websphere ODBC MetaBroker, CA ERwin, IBM Websphere ProfileStage, IBM Websphere DataStage, IBM Websphere QualityStage, IBM DB2 Cube Views, and Cognos Impromptu. The metadata management system 312 may also provide analysis tools for data lineage and impact analysis for changes to data structures. The metadata management system 312 may further be used to prepare a business data glossary of data definitions, algorithms, and business contexts for data within the data integration system 104, which glossary may be published for use throughout an enterprise. A non-limiting example of a commercial embodiment of a metadata management system 312 may be found in IBM's Websphere MetaStage product.

Unless otherwise specified or required by the context, the term "mapping" refers to a design time activity of relating metadata and meta-metadata between views, models, or model instances, while "transformation" refers to the corresponding runtime activity. It should also be noted that the following description relates to a metadata management system where the atomic data items are actually metadata for data sources being modeled. Similarly, metadata within the metadata management system is actually metadata describing this metadata, also known as meta-metadata. It is also possible and appropriate to further abstract meta-metadata into meta-meta-metadata. For avoidance of confusion, the nomenclature below generally follows a data, metadata, meta-metadata hierarchy, where data represents the underlying data for one or more data sources/targets. However, it should be appreciated that occasionally metadata may be referred to simply as data (being the data managed by the metadata management system) and that meta-metadata is sometimes correspondingly referred to simply as metadata, i.e., metadata from the perspective of the models within the metadata management system. More generally, the usage should be clear from the context. However, where the usage is ambiguous, it should be interpreted in the broadest possible sense.

Figure 4 shows an architecture for a metadata management system 5202, which may be, for example, any of the metadata management systems or metadata facilities 312 described above. The metadata management system 5202 may include a plurality of external users 5204 such as tools or clients, communicating with a hub 5206 through a plurality of views 5208, and a repository 5210 including at least one model 5212 that includes at least one operational class 5214 relating to operational metadata for the model 5212 and/or at least one design class 5216 relating to design metadata for the model 5212. Metadata services 5218 may be provided for interacting with the model 5212 in the repository 5210.

The users 5204 may be any of the interface systems 202 described above, or any other client device, tool or other program or software interface, through which a user may run queries or otherwise investigate data in a database. The users 5204 may run a query using a view 5208 adapted for communication between a data model employed by the user 5204 and a data model employed by the hub 5206. The view 5208 may, for example, include fields, data types, data hierarchy, data relationships, temporal information, source information, or any other information relevant to the manner in which data is displayed or used by the users 5204, as well as any appropriate mappings between the data model in the view 5208 provided to an external user 5204 and the data model employed internally by the hub 5206. While only two views 5208 are illustrated in Fig. 4, it will be appreciated that any number of views 5208 may be used, and that the views 5208 may be the same views 5208, as where there is more than one of the same type of external user 5204, or different views 5208 where there are different external users 5204, or any number and combination of these consistent with the processing capabilities of the metadata management system. It should also be appreciated that an external user 5204 may use data or metadata unique to the user 5204, with no corresponding elements within the hub 5206. For example, ERwin design tools employ object "coordinates" that are unique to ERwin, and describe where an object appears in a graphical "canvas." The hub 5206 may be designed to handle special cases by supporting extensions to the hub model in a manner transparent to the user 5204. Optionally, the view 5208 may also, or instead, provide direct mapping to appropriate external data in addition to a connection to the hub 5206.

The hub 5206 may generally employ a data model 5212 defined by the subject matter of the data or its business context. Thus it is generally expected that the hub model for data would not change frequently within a single application. Where changes are made to the hub model, corresponding updates may be required for one or more views 5208. The hub 5206 may interact with underlying data (e.g., metadata for enterprise data) using one or more models 5212 stored within the repository 5210. Although the use of a hub for design classes 5216 of a repository 5210 is one useful architecture with broad applicability, it should be appreciated that the operational classes 5214 typically would not require such a hub 5206. More generally, the metadata management system 5202 described herein may be designed without any hub 5206. This architecture may be useful, for example, where there is little or no commonality between the design models of various views. In such cases, other techniques may be employed for communication between various views in a metadata management system 5202, such as dynamically generating a non-persistent, logical hub as a central connector. Other principles of the systems described herein may be applicable whether or not a central hub 5206 is employed in a metadata management system 5202.

The models 5212 may be stored and manipulated using object-oriented techniques in a platform such as Eclipse and the Eclipse Modeling Framework ("EMF"). A model 5212 may include metadata and mappings to relevant structures in data sources and/or targets, and any other useful, more abstract modeling of metadata. These aspects of the model may be contained in a repository object that is persistently stored within the repository 5210. The repository 5210 may store one or more models 5212 that contain operational classes 5214 and design classes 5216. The models may include metadata, meta-metadata, or any other useful descriptive or functional characteristics of data. As an example, a model 5212 may contain values for weight in units such as ounces. If a system user wishes to implement a new data source or integrate an existing data source that specifies weight in pounds, this information may be included in the model 5212 so that corresponding metadata for these disparate sources can be consistently treated within the hub 5206, and presented to external users 5204 through one or more views 5208 that may provide different perspectives (or the same perspective) on the data. More generally, a model 5212 may contain any information about underlying data and metadata useful for integration and any other uses contemplated by the metadata management system. Models 5212 can usefully capture information about data and how it changes to enable consistent treatment and extensibility of data usage across an enterprise or among enterprises.
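By way of non-limiting illustration, the weight example above might be sketched as follows. This is a minimal sketch only: the `FieldMetadata` class, the `OUNCES_PER_UNIT` table, and the source names are hypothetical and are not part of the described system. It assumes the hub treats ounces as its common unit and that the model records the unit used by each source.

```python
from dataclasses import dataclass

# Hypothetical conversion factors into the hub's assumed common unit (ounces).
OUNCES_PER_UNIT = {"oz": 1.0, "lb": 16.0}


@dataclass(frozen=True)
class FieldMetadata:
    """Metadata a model might record for one field of one data source."""
    source: str
    field: str
    unit: str  # unit in which the source stores this field


def to_hub_ounces(value: float, meta: FieldMetadata) -> float:
    """Convert a source value to ounces using the unit recorded in the model."""
    if meta.unit not in OUNCES_PER_UNIT:
        raise ValueError(f"no conversion registered for unit {meta.unit!r}")
    return value * OUNCES_PER_UNIT[meta.unit]


# An existing source in ounces and a newly integrated source in pounds.
legacy = FieldMetadata(source="legacy_orders", field="weight", unit="oz")
added = FieldMetadata(source="new_shipping", field="weight", unit="lb")

print(to_hub_ounces(32.0, legacy))  # 32.0
print(to_hub_ounces(2.0, added))    # 32.0
```

Because the unit is carried as metadata rather than hard-coded in each consumer, adding a source in this sketch requires only a new metadata entry and, where needed, a new conversion factor.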

When a model 5212 is created in the repository 5210, it may automatically be partitioned into design and operational components which may be independently managed while being collectively and/or uniformly queried. Using object oriented techniques, operational classes 5214 may be stored for the model 5212 and inherit any appropriate properties, methods, and so on, among classes. The operational classes 5214 may, in particular, model operational aspects of external processes, or provide persistent storage of runtime results. The operational classes 5214 may be time-stamped or otherwise labeled for unique reference. It will be appreciated that, while the Eclipse platform is one useful tool for building and maintaining the models described herein, any object-oriented tools or techniques may be similarly employed. In the following description, the term "properties" will be used generally to refer to various characteristics of object-oriented descriptions, or other similar descriptions such as elements of a Unified Modeling Language ("UML") class model, including classes, sub-classes, packages, package structure, properties, attributes, methods, relationships, inheritance, and so on. Thus an operational class, package structure, or the like may be an operational property as that term is used herein.

Design classes 5216 may also be instantiated from the model 5212 and inherit any properties, methods, and so on. Information within these design classes 5216 may also include versioning information so that multiple object instances may be maintained either sequentially or in branches, or combinations thereof. The versioned metadata objects may be manipulated, edited, updated, or otherwise controlled and managed by users according to the demands and design goals for the enterprise computing system. Using version control or similar techniques, metadata objects for a design class 5216 may be shared, or checked out to individual users or teams. In general, different versions may be employed as different designs are tried, or when there are changes to underlying data. It will be appreciated that various designs may be reconciled, and branches merged, prior to creation of a runtime executable. It will also be appreciated that, while EMF may be a useful platform for modeling classes within the repository 5210 as described above, any similar modeling framework may be employed, such as Object Management Group, Inc.'s Meta-Object Facility.

An enterprise computing system may include a data integration system 104. The enterprise computing system may include any combination of computers, mainframes, portable devices, data sources, and other devices, connected locally through one or more local area networks and/or connected remotely through one or more wide area or public networks using, for example, a virtual private network over the Internet. Devices within the enterprise computing system may be interconnected into a single enterprise to share data, resources, communications, and information technology management. Typically, resources within the enterprise computing system are used by a common entity, such as a business, association, governmental body, or university. However, in certain business models, resources of the enterprise computing system may be owned (or leased) and used by a number of different entities, such as where an application service provider offers on-demand access to remotely executing applications.

The enterprise computing system may also include a plurality of tools, which access a common data structure, termed herein a repository information manager ("RIM") (also referred to below as the "hub") through respective translation engines (which, in a bridge-based system, may be bridges). The RIM may include any of the data sources 102 described above. The tools generally comprise, for example, diverse types of database management systems and other application programs that access shared data stored in the RIM. The tools, RIM, and translation engines may be processed and maintained on a single computer system, or they may be processed and maintained on a number of computer systems which may be interconnected by, for example, a network, which transfers data access requests, translated data access requests, and responses between the different components.

While they are executing, the tools may generate data access requests to initiate a data access operation, that is, a retrieval of data from or storage of data in the RIM. Data may be stored in the RIM in an atomic data model and format that will be described below. Typically, the tools will view the data stored in the RIM in a variety of diverse characteristic data models and formats, as will be described below, and each translation engine, upon receiving a data access request, will translate the data between the respective tool's characteristic model and format and the atomic model and format of the RIM as necessary. For example, during an access operation of the retrieval type, in which data items are to be retrieved from the RIM, the translation engine will identify one or more atomic data items in the RIM that jointly comprise the data item to be retrieved in response to the access request, and will enable the RIM to provide the atomic data items to one of the translation engines. The translation engine, in turn, will aggregate the atomic data items that it receives from the RIM into one or more data items as required by the tool's characteristic model and format, or "view" of the data, and provide the aggregated data items to the tool that issued the access request. During data storage, in which data in the RIM is to be updated, the translation engine may receive the data to be stored in a characteristic model and format for one of the tools. The translation engine may translate the data into the atomic model and format for the RIM, and provide the translated data to the RIM for storage. If the data storage access request enables data to be updated, the RIM may substitute the newly-supplied data from the translation engine for the current data. On the other hand, if the data storage access request represents new data, the RIM may add the data, in the atomic format as provided by the translation engine, to the current data in the RIM.
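The retrieval and storage operations described above may be illustrated by the following non-limiting sketch. The `Rim` and `TranslationEngine` classes, the field names, and the keying of atomic items by object identifier and attribute are assumptions made for illustration only; the atomic model and format actually used by the RIM are not specified here.

```python
class Rim:
    """Toy repository of atomic data items keyed by (object id, attribute)."""

    def __init__(self):
        self._items = {}

    def get(self, object_id, attribute):
        return self._items[(object_id, attribute)]

    def put(self, object_id, attribute, value):
        # Substitutes the value on an update, or adds it when the item is new.
        self._items[(object_id, attribute)] = value


class TranslationEngine:
    """Translates between one tool's view and the RIM's atomic items."""

    def __init__(self, rim, view_fields):
        self.rim = rim
        # Maps each field in the tool's view to an atomic attribute name.
        self.view_fields = view_fields

    def retrieve(self, object_id):
        # Identify the atomic items behind the view, then aggregate them.
        return {field: self.rim.get(object_id, attribute)
                for field, attribute in self.view_fields.items()}

    def store(self, object_id, record):
        # Decompose the tool's record into atomic items for the RIM.
        for field, value in record.items():
            self.rim.put(object_id, self.view_fields[field], value)


# Two hypothetical tools share one RIM through differently shaped views.
rim = Rim()
designer = TranslationEngine(rim, {"TableName": "name", "Owner": "owner"})
reporter = TranslationEngine(rim, {"label": "name"})

designer.store("t1", {"TableName": "CUSTOMER", "Owner": "sales"})
print(reporter.retrieve("t1"))  # {'label': 'CUSTOMER'}
```

In this sketch neither tool is aware of the other's view; each sees only its own field names, while the shared atomic items keep the two views consistent.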
The metadata services 5218 may be used to create, edit, delete, or otherwise manipulate objects, classes 5214, 5216 and models 5212 in the repository 5210, or query and investigate the model 5212 and any other data contained therein. The services 5218 may be presented to a user through a user interface, command line interface, programming interface, or other interface. The services 5218 may provide functions such as versioning, branching, merging, and any other operations supported within repository 5210. Some of these operations are described in greater detail below. The metadata services 5218 may also include, for example, data analysis services such as impact analysis (how a change to one model type instance affects other type instances in the model), operational analysis (history of executable objects through event metadata), data lineage (history of data movement in a warehouse or across the enterprise computing system), version drilldown (investigation of version history for metadata objects), object differencing (investigation of differences between metadata objects), and object merge (combining two objects of the same class according to specified rules). The metadata services 5218 may also include import and export services for transforming metadata, for example, as it is moved into and/or out of the repository 5210. The metadata services 5218 may be realized using, for example, a J2EE platform, and provided to users through a service-oriented architecture such as the SOA. Similarly, transactions within the repository 5210 may be managed using, for example, the bean container within a J2EE Application Server. It will be appreciated that the services 5218 may also be provided to an end user as one or more tools in a user interface.
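As a non-limiting illustration of two of the analysis services named above, impact analysis may be sketched as a traversal of "is used by" relationships, and object differencing as an attribute-by-attribute comparison. The function names, the `USED_BY` relationships, and the object attributes below are hypothetical and are chosen only for the example.

```python
from collections import deque


def impact_of(changed, used_by):
    """Return every object that directly or indirectly uses `changed`."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for user in used_by.get(node, ()):
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen - {changed}


def diff_objects(old, new):
    """Map each differing attribute to its (old, new) pair; None if absent."""
    return {key: (old.get(key), new.get(key))
            for key in sorted(old.keys() | new.keys())
            if old.get(key) != new.get(key)}


# Hypothetical relationships: a column feeds a job, a mart, and a report.
USED_BY = {
    "customer.zip": ["job_load_customers"],
    "job_load_customers": ["mart_sales"],
    "mart_sales": ["report_quarterly"],
}

print(sorted(impact_of("customer.zip", USED_BY)))
# ['job_load_customers', 'mart_sales', 'report_quarterly']

print(diff_objects({"type": "CHAR(5)", "nullable": True},
                   {"type": "CHAR(10)", "nullable": True}))
# {'type': ('CHAR(5)', 'CHAR(10)')}
```

The breadth-first traversal tracks visited objects, so cyclic relationships in the metadata do not cause it to loop.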

It should be noted that, while the functionality described above (e.g., versioning, branching, drilldown, etc.) is primarily directed at details within and between metadata objects, or distinct instances of metadata, this general approach to metadata management may be readily abstracted to address metamodel management, that is, meta-metadata management, or managing models of metadata models. Thus, there is described herein a metamodeling tool or tools that provide for defining mappings between metamodels, generate interfaces for metamodels, and facilitate implementation and transformation of metadata models. The metamodeling tools may be provided through a graphical user interface providing access to a number of related functions. For example, the interface may provide tools to define, validate, test and analyze metamodels and mappings, as well as metadata model output. The interface may also provide tools for documentation of metamodels, metamodel mappings, and any instances of metadata models generated through the metamodeling tools. Metamodeling tools could be usefully employed, for example, to deploy new versions of an enterprise model. Diagramming, modeling and mapping may be supported by a service such as IBM Rational XDE.

The metamodeling tools may be deployed, for example, as services in a service-oriented architecture. The metamodeling tools may provide centrally managed mapping specifications for metadata models, with synchronization, versioning, history tracking and other appropriate capabilities consistent with the metadata tools discussed above. Thus while a mapping model may represent object transformations between a hub and a view (or other models), the mapping model from this metamodeling perspective may also, or instead, represent mapping between different metadata models that may ultimately be employed in transformations between the models themselves, such as when upgrading to a newer version of a metadata model. The metamodeling tools may, for example, provide an independent specification language separate from, and loosely coupled to, the model definition, to allow for development control and implementation flexibility. The metamodeling tools may advantageously provide for dynamic browsing of mapping specifications within a development environment, and may provide tools to automatically generate documentation at various levels of detail. With an integrated suite of metamodeling tools, a corresponding test framework may be developed to generate test metadata and dynamically execute mappings so that immediate results can be obtained and incorporated into active development.

In order to maintain a conceptual separation between operational and design attributes of a model, the repository 5210 may be logically and/or physically separated into two or more repositories, such as a common repository (not shown) for persistent storage of design classes 5216 and properties and an operational repository (not shown) for persistent storage of operational classes 5214 and properties. Thus, when a model 5212 is registered with the repository 5210, the operational classes 5214 may be persisted in the operational repository and the design classes 5216 may be persisted in the common repository. Operational and design classes may be distinguished within one physical or logical repository using annotations within classes to define their association. It will be appreciated that other techniques are known, and may be used to separate classes of a model into operational and design aspects, or to provide logically or physically separate operational and design repositories. For example, the common/operational separation may be implicitly designed into the class structure for the model, or a manifest or other list or programming device may accompany the model and list the association of each class or property with its appropriate repository. However implemented, this arrangement may advantageously permit different handling of persistence for design and operational elements of a model 5212. For example, design classes 5216 may be developed and revised by teams of programmers, thus requiring robust versioning capabilities and reconciliation. By contrast, operational classes 5214 may require unique identification for different jobs, such as through use of time stamps or other unique identifiers. Thus appropriate services may be defined for each group of classes, while maintaining a single model that can be queried, transformed, or otherwise manipulated by a user.
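The manifest-based separation described above may be sketched, by way of non-limiting example, as follows. The `MANIFEST` contents, the `PartitionedRepository` class, and the use of a timestamp string as a run identifier are assumptions for illustration; the sketch shows only that design objects can be versioned and operational objects uniquely identified while both remain queryable as a single model.

```python
from collections import defaultdict

# Hypothetical manifest listing which repository each model class belongs to.
MANIFEST = {
    "TableDefinition": "design",
    "JobDesign": "design",
    "JobRun": "operational",
}


class PartitionedRepository:
    """One model, two persistence policies, one query surface."""

    def __init__(self):
        self.common = defaultdict(list)  # (class, name) -> ordered versions
        self.operational = {}            # (class, name, run_id) -> payload

    def persist(self, cls, name, payload, run_id=None):
        if MANIFEST[cls] == "design":
            # Design objects are versioned: each save appends a new version.
            versions = self.common[(cls, name)]
            versions.append(payload)
            return len(versions)
        # Operational objects are identified by a unique run identifier.
        if run_id is None:
            raise ValueError("operational objects need a run identifier")
        key = (cls, name, run_id)
        if key in self.operational:
            raise ValueError(f"duplicate operational object {key}")
        self.operational[key] = payload
        return run_id

    def query(self, name):
        """Return matching objects from both partitions in one result."""
        design = [(cls, f"v{i}", p)
                  for (cls, n), versions in self.common.items() if n == name
                  for i, p in enumerate(versions, start=1)]
        runs = [(cls, run_id, p)
                for (cls, n, run_id), p in self.operational.items()
                if n == name]
        return design + runs


repo = PartitionedRepository()
repo.persist("JobDesign", "load_customers", {"target": "CUSTOMER"})
repo.persist("JobDesign", "load_customers", {"target": "CLIENT"})
repo.persist("JobRun", "load_customers", {"rows": 1200},
             run_id="2005-08-31T02:00")
for row in repo.query("load_customers"):
    print(row)
```

A real repository would likely persist the two partitions in separate stores; in-memory dictionaries are used here only to keep the sketch self-contained.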

Figure 5 shows communication with a database (of metadata) through one or more views or models. A service 5302, a user interface 5303, or any other interface may communicate with a database 5312, which may be any of the data sources 102 described above, such as to submit a query to the database 5312. The communication may be conducted through metadata models, such as a view 5308 and a hub 5310, provided by a repository 5304, such as the repository 5210 described above. These metadata models may include any information about data, such as fields, field names, field attributes, data types, data hierarchy, data relationships, temporal information, source information, or any other information relevant to the structure, location, or use of data, or metadata about such data (i.e., meta-metadata). The service 5302 may generate a query using a view of data native to that service 5302, i.e., having a structure and format defined by the service 5302. This query may be structured by the service 5302 without any information about the structure of data in the database 5312. The view 5308 provided by the repository 5304 to the requesting service 5302 may be mapped to a hub 5310 that provides a model for consistent representation of metadata to a plurality of different views, including the view 5308 receiving the query. The hub 5310 may in turn be mapped to a structure used internally by the database 5312. By exploiting mapping information between the view 5308, the hub 5310, and the database 5312, the query may be advantageously translated into a query using a data model or syntax native to the database 5312. This may result in significant performance advantages because the query can benefit from any optimization or tuning for the database 5312. As a further advantage, mapping information may be queried independently to explore possible optimizations for a particular query 5302.

By contrast, other existing techniques "flatten" metadata models when creating an executable, so that the query must be run against the entire database with results parsed using the view 5308 presented to the service 5302. In effect, all of the potentially relevant objects from the database 5312 must be instantiated in the hub 5310 and transformed into the view 5308 where they can be manipulated in memory to perform a query. This places significant burdens on memory, and loses any performance advantages designed into the database 5312. By translating the query itself into the native syntax of the database 5312 using mapping information for the intervening models, only the query results need to be instantiated and transformed for presentation to the external service 5302.
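By way of non-limiting illustration, the translation of a query through the intervening models might be sketched in Python as follows. The field names, the dictionary-based mapping format, and the `translate` and `to_sql` helpers are assumptions made only for this sketch; the actual form of the mapping information is not limited to this representation.

```python
# Hypothetical mappings; all names are illustrative assumptions.
VIEW_TO_HUB = {"CustomerName": "party.name", "CustomerAge": "party.age"}
HUB_TO_DB = {"party.name": "CUST_NM", "party.age": "CUST_AGE"}


def translate(query, mapping):
    """Rewrite the field names of a query using one model-to-model mapping."""
    return {
        "select": [mapping[f] for f in query["select"]],
        "where": [(mapping[f], op, value) for f, op, value in query["where"]],
    }


def to_sql(query, table):
    """Render a database-level query as parameterized SQL text plus parameters."""
    sql = "SELECT " + ", ".join(query["select"]) + " FROM " + table
    if query["where"]:
        sql += " WHERE " + " AND ".join(
            f"{col} {op} ?" for col, op, _ in query["where"]
        )
    params = [value for _, _, value in query["where"]]
    return sql, params


# The service expresses its query purely in terms of its own view.
view_query = {"select": ["CustomerName"], "where": [("CustomerAge", ">", 30)]}
hub_query = translate(view_query, VIEW_TO_HUB)
db_query = translate(hub_query, HUB_TO_DB)
sql, params = to_sql(db_query, "CUSTOMER")
```

Only the query is transformed in this sketch; no database objects are instantiated in the hub or the view until results are returned, which is the point of contrast with the flattening approach.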

Similarly, a user interface 5303 may communicate with the database 5312 through a number of models provided by the repository 5304. A user may create a query in the user interface 5303 using fields with a structure and format corresponding to the presentation of data in the user interface 5303. The query may be received by the view 5308 and translated into a query for the hub 5310 using any available mapping information, and in turn translated into a query for the database 5312 using any available mapping information to present the entire query in a syntax native to the database 5312. It will be appreciated that, while a single view 5308 is shown for both the user interface 5303 and the service 5302, each may have its own external model by which it views data, and these models may be maintained and provided by the repository 5304. The query may be run against the database 5312 to produce results that may be returned through the hub 5310 and the view 5308 in a form that is readily useable by the user interface 5303. More generally, while a two-tiered structure is depicted in Fig. 5, consistent with a hub-and-spoke architecture of data integration systems, any number of metadata models in any relative relationships to one another may benefit from the techniques described herein for accessing a database, provided mapping information is available concerning relationships among metadata in the various models.

Fig. 6 shows repository services 5304 including a translation engine that provides metadata translation services between the view 5308 and the hub 5310. The translation engine may provide translations of queries, such as those described above, between various native metadata structures used by the different models and the database 5312, as well as transformation of objects between models. The translation engine, or a plurality of translation engines, may be provided as a service within a repository 5304, as generally depicted in Fig. 6, where the translation engines may be registered and/or stored. The repository services 5304 may access a translation engine for translation of the query into a format for the hub 5310. Although not shown, a similar translation may be provided between the hub 5310 and the database 5312. More generally, a translation engine may receive queries in a number of query languages or programming languages from external models, and use mapping information available for the respective models and the database 5312 to translate the queries into queries in a structure optimized for the database 5312. Thus queries may generally be expressed in terms native to a view 5308 (or other model), and presented to the database 5312 in terms native to the database 5312.

Although the translation engine is one conceptual approach to translating queries as described above, it will be appreciated that other approaches may be devised and usefully employed with the systems described herein. In general these approaches will benefit from separately storing mapping information between metadata models used by the system, as well as any mappings to the database 5312. By retaining the mapping information in a form that is accessible to the translation engine, or some other tool or service, at runtime, the metadata management system may achieve significant performance improvements.

Fig. 7 shows a repository service providing a translation engine for a plurality of external services 5302. The services 5302 may be, for example, a data transformation stage 308, data preparation stage 304, RTI service 2704, a user interface, or any other service or external client that might perform a query on metadata in the database 5312. The services 5302 may present a query to the view 5308 in a syntax native to the view 5308. The translation engine may translate the query into a syntax native to the hub 5310, which may in turn be translated into a query using a syntax native to the database 5312. The query results may be returned to the services 5302 by accessing the translation engine to translate the query results back into a syntax native to the services 5302. In this way services 5302 may efficiently communicate with the database 5312 using their own native syntax. It should be appreciated that the term "syntax," as used herein to describe queries, refers to any syntax, structure, format, programming language, and/or interface that might be employed to represent queries either externally, such as to services or a database, or internally, such as among metadata models.

Figures 8-10 depict how a metadata model may be mapped to a schema in a relational database for persistent storage. Generally, a metadata model may be described using object-oriented relationship management tools. When such a metadata model is registered to a repository, such as the common repository and the operational repository, the in-memory model may be mapped to a schema in a relational database using a variety of techniques discussed below. This strategy is particularly amenable to management using tools such as the Apache Object/Relational Bridge ("OJB") to persist Java objects against the relational database. As a significant advantage, this approach allows substantial design flexibility while exploiting the high performance of commercially available relational databases. A number of specific mappings that may be usefully employed to store metadata models are described with reference to the following Figs. 8-10.

Figure 8 depicts the correspondence between a metadata model and a relational database. The metadata model 5602 may include a plurality of object-oriented classes 5604 defining various properties of the model 5602, such as information about metadata including fields, field names, field attributes, data types, data hierarchy, data relationships, temporal information, source information, or any other information relevant to the structure, location, or use of data. The database 5608 may include a plurality of tables 5610 representing a relational schema used to physically store the model 5602. The mapping between the model 5602 and the database 5608, as depicted generally by a vertical arrow in the figure, may be through a one-to-one mapping of classes 5604 in the model 5602 to tables 5610 in the database 5608. In this manner, every aspect of the classes 5604 has a corresponding aspect in one of the tables 5610 so that the model 5602 structure is literally reproduced in the database 5608. Thus a conceptually linear translation between the model 5602 and the database 5608 can be maintained. Such a representation may generally provide higher performance, and may be directly compiled at run time, or readily pre-compiled; however, changes to the model 5602 may require reconstruction of the entire database 5608 and corresponding changes to compiled versions.

Figure 9 shows an alternative mapping of a metadata model to a relational database. The metadata model 5702 may be, for example, the metadata model 5602 described above. The mapping between the model 5702 and the database 5704, as depicted generally by a vertical arrow in the figure, may be from properties of the classes within the model 5702 into entries in tables 5706 within the database 5704. The tables 5706 may be organized to optimize certain uses, such as by organizing version data or runtime artifacts in separate physical tables, regardless of the object-oriented structure employed by the model 5702. This approach advantageously permits an arbitrary model to be fully characterized within a generic table structure. This approach may enhance extensibility because any change to the model 5702 will only require updates to any affected entries in the database 5704, such as one or two row updates, without otherwise affecting the description stored in the tables 5706. In general, this represents a design trade-off between relatively high performance of the database 5704 used for persistence and the relative extensibility of mappings between the model 5702 and its persistent form.

Figure 10 shows a combination of the model mappings described in Figs. 8 and 9 above. The metadata model 5802 may be, for example, the metadata model 5602 described above. The mapping between the model 5802 and the database 5808, as depicted generally by a vertical arrow in the figure, may be partially from classes 5804 within the model 5802 directly to tables 5810 within the database 5808 having a corresponding structure, as described above with reference to Fig. 8. The model 5802 may be modified by a user, such as by adding a property 5806 to the class 5804. A corresponding change may be made to the model stored in the database 5808, such as by recording descriptive entries 5812 in the generic table 5814 as described above with reference to Fig. 9. Thus the static portions of a model may be mapped to a more performant, fixed schema, while the non-static or user configurable portions of the model may be mapped to an extensible, descriptive schema. In this manner, the relational schema for storing the model 5802 may be a hybrid that advantageously combines performance for relatively fixed portions of a model with extensibility for user-configurable portions of a model.

Each registered model may be persistent. When registering a first model such as a view, the model may be passed to a registration process along with a second model such as a hub and a mapping of the first model to the second model. Where properties of the first model can be mapped to the second model, no additional persistence mechanism is required beyond the mapping itself. However, where a property of the first model cannot be mapped to the second model, a mechanism may be provided for persisting the unmapped property. It will be appreciated that any particular model may have no mapping, a partial mapping, or a complete mapping to another model. In those instances where properties require persistence, i.e., they are not mapped to an existing model, any of the techniques for an extensible model described above with reference to Figs. 8-10 may be employed to provide a storage mechanism for model persistence. In particular, the most generic table form may provide a desirable persistence mechanism through a number of design cycles, while a runtime model may be advantageously deployed by replicating the class structure for the unmapped portions of the model.
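By way of non-limiting illustration, the hybrid persistence of Fig. 10 might be sketched in Python using the standard `sqlite3` module as follows. The table names `data_field` and `generic_property`, the column layout, and the `save_field` and `load_field` helpers are assumptions made only for this sketch and are not the schema of any particular embodiment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Fixed schema: one table per model class (the Fig. 8 style).
conn.execute(
    "CREATE TABLE data_field (id INTEGER PRIMARY KEY, name TEXT, data_type TEXT)"
)
# Generic schema: one row per added property value (the Fig. 9 style).
conn.execute(
    "CREATE TABLE generic_property ("
    "class_name TEXT, object_id INTEGER, property_name TEXT, property_value TEXT)"
)

FIXED_COLUMNS = {"name", "data_type"}


def save_field(conn, object_id, properties):
    """Store known properties in the fixed table and any others in the generic table."""
    conn.execute(
        "INSERT INTO data_field (id, name, data_type) VALUES (?, ?, ?)",
        (object_id, properties["name"], properties["data_type"]),
    )
    for key, value in properties.items():
        if key not in FIXED_COLUMNS:
            conn.execute(
                "INSERT INTO generic_property VALUES (?, ?, ?, ?)",
                ("data_field", object_id, key, str(value)),
            )


def load_field(conn, object_id):
    """Reassemble one object from both tables."""
    name, data_type = conn.execute(
        "SELECT name, data_type FROM data_field WHERE id = ?", (object_id,)
    ).fetchone()
    result = {"name": name, "data_type": data_type}
    rows = conn.execute(
        "SELECT property_name, property_value FROM generic_property "
        "WHERE class_name = ? AND object_id = ?",
        ("data_field", object_id),
    )
    result.update(dict(rows.fetchall()))
    return result


# "steward" is a user-added property that has no column in the fixed table.
save_field(conn, 1, {"name": "age", "data_type": "integer", "steward": "hr-team"})
loaded = load_field(conn, 1)
```

In this sketch, adding the user-defined property requires only one new row in the generic table, while the fixed table remains unchanged and directly queryable.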

It will be further appreciated that the generic structure described above may provide a reflective storage mechanism for extensible models. The storage mechanism may "understand" its environment, and may look to the model description to determine related classes, attributes, mappings, and the like for any object. These reflective capabilities may be used to provide a higher-level design environment where a schema such as the generic table format described above can persist model properties in a manner that accommodates extension.

Figure 11 depicts an architecture that exposes a plurality of internal services to external metadata. In certain instances, metadata may reside outside the metadata models managed by the metadata management system described herein, such as where data is shared between separate enterprises or enterprise applications. An architecture for accessing such external metadata may include external metadata 5902 with a first view 5904, a hub 5906, and a second view 5908 to a plurality of internal services 5910.

The metadata management system may provide a first view 5904 of the external metadata 5902, which may in turn be connected to the hub 5906 to provide a common internal model for the external metadata 5902. The internal services 5910 may be similarly mapped to the hub 5906 through their own view of metadata, the second view 5908. Through these interconnected models 5904, 5906, 5908, the internal services 5910 may access the external metadata 5902 in a form native to the internal services 5910. The internal services 5910 may, in turn, be deployed in a services-oriented architecture to provide access to the external metadata 5902 as a service within the metadata management system, or more generally throughout the enterprise.

Figure 12 depicts a mapped-model driven transformation of metadata using an interpreted mapping to translate between metadata models such as a view and a hub. A metadata management system 6000 may include a hub 6002, one or more translation engines 6004, and one or more views 6006, 6008. The translation engines 6004 may include mapping models 6010 characterizing one or more mappings between the hub 6002 and the views 6006, 6008. These models may be interpreted when a request is received to determine, using the mapping model 6010, how an instance of an object should be expressed to the requester. The mapping model 6010 may be expressed in a number of forms, including as a model (e.g., a data structure, such as Java classes or EMF objects or instances), which may provide greater design flexibility, or as compiled code, which may provide greater execution efficiency to the translation engines 6004, or as interpreted code. More generally, a single model-to-model mapping, or mapping model 6010, may be instantiated in any number of different translation engines 6004. At the same time, different translation engines 6004 may instantiate any number of different mapping models 6010 in any number of forms ranging from abstract model to compiled code. Mapping models 6010 may be registered in a translation registry (not shown) for the translation engines 6004 to provide common access and consistency.

In existing systems, a view-to-hub mapping is typically generated as a static mapping that does not change once it is deployed. By treating the view 6006, 6008 and the mapping 6010 from the view to the hub as models, the mapping may be interpreted directly when instances of metadata are moved from a view to a hub, or vice versa. The view may be represented internally as, for example, Java classes, Java code, or some interpretation of the underlying model. Similarly, the mapping can be interpreted in various forms, such as Java code, Jython (Java-based scripting), and the like. When a request is received, the request may be parameterized by the view model, the mapping model, and the hub model. The model-driven translation engine can receive an object expressed in one of the models and return objects expressed in another one of the models.
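By way of non-limiting illustration, an interpreted, model-driven translation might be sketched in Python as follows. The rule format, the attribute names, and the `translate_instance` and `invert` functions are hypothetical and are introduced only for this sketch; the mapping model is held as plain data so that it can be changed without redeploying the engine.

```python
# A mapping model held as plain data; attribute names and rule format are hypothetical.
VIEW_TO_HUB_MAPPING = [
    {"source": "col_name", "target": "name", "convert": "upper"},
    {"source": "col_type", "target": "data_type", "convert": "identity"},
]

CONVERTERS = {"identity": lambda v: v, "upper": lambda v: v.upper()}


def translate_instance(instance, mapping_model):
    """Interpret the mapping model to express an instance in the target model."""
    return {
        rule["target"]: CONVERTERS[rule["convert"]](instance[rule["source"]])
        for rule in mapping_model
    }


def invert(mapping_model):
    """Derive the reverse mapping; value conversions are not inverted in this sketch."""
    return [
        {"source": r["target"], "target": r["source"], "convert": "identity"}
        for r in mapping_model
    ]


view_object = {"col_name": "age", "col_type": "integer"}
hub_object = translate_instance(view_object, VIEW_TO_HUB_MAPPING)
round_trip = translate_instance(hub_object, invert(VIEW_TO_HUB_MAPPING))
```

Because the engine is parameterized by the mapping model at the time of each request, editing the list of rules changes the translation immediately, in contrast to a statically generated mapping.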

For example, the hub may be an object-oriented construct accessed using interpreted Java code. Similarly, the views 6006, 6008 may be interpreted with Java or some other interpreted programming language. The translation engine 6004 may use the metadata model mapping between the hub 6002 and the views 6006, 6008 to move requests and object instances between the hub and the views 6006, 6008. The translation engine 6004 may be dynamically modified by a user in a manual operation, or automatically (or manually) in response to a change in one or more of the metadata models or objects. It should be appreciated that, whether interpreted, compiled or otherwise executed, the software or software engine that interprets/executes the model may be synchronous or asynchronous. In an asynchronous environment, access to the model is through a messaging service or other asynchronous technique. In a synchronous environment, calls may be made directly to the engine through an application programming interface or other synchronous interface to the engine.

Figure 13 shows interaction with a metadata environment. A model 6102 may be represented as unversioned classes 6104 (stored in an operational repository) and versioned classes 6106 (stored in a common repository). A user metadata environment 6108 may be provided for users 6110 to interact with the model. An "environment," as used in the following description, is intended to refer to underlying model data and other contextual information for a model or metadata, which one or more users 6110 may view and manipulate through any suitable graphical user interface, command line interface, or other programmatic interface for viewing, querying, and manipulating models and model data, including stored instances of models and metadata, whether in volatile or nonvolatile memory or both, and including operational properties and design properties thereof, along with any versions of any of the above. While the general term "environment" (or "user environment") is intended to refer generically to any model context through which one or more users might interact with metadata, several environments are specifically contemplated, as described below. The examples that follow do not limit the number and variety of user environments that might be usefully employed with the systems described herein.

The model 6102 may be, for example, any of the views or hubs described above, or any other metadata model. The model may include operational classes and attributes as well as design attributes and classes. As noted above, a model 6102 may be stored in two different repositories according to the purpose of various model classes. Thus an operational repository may be configured for storing metadata results for jobs executed using a model, while a common repository may be configured to support collaboration and iterative design processes. It should be appreciated that the operational and common repositories may be physically and/or logically separate, and that each is defined in part by the subset of model classes that are stored therein, and in part by the services or methods used to access each.

The users 6110 may interact with the metadata environment 6108 in a number of different modes, such as a workspace or a team space. The workspace, also referred to as a sandbox, may provide live editing to models in an unversioned environment where, for example, metadata changes to design properties are either saved as a new model or overwritten to an existing model. The workspace may exist locally on a user's computer, or remotely on a server where the user may interact with metadata. Typically, placing a model in a workspace would lock the model for other potential users. However, the workspace may provide shared use, such that more than one user may edit and save changes to the workspace. The team space may provide versioning, such that multiple versions may be checked out, checked in, branched, and so on.

More generally, the team space may provide a metadata environment for all of the metadata versioning capabilities discussed above. For example, a versioned metadata environment may support versioning of metadata that is created or edited by individual users. Thus a user of the versioned metadata environment may check out a model, and check the model back in as a new version. Thus while the workspace may permit collaborative editing, the team space may enable collaborative and/or sequential editing to metadata with version control.
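By way of non-limiting illustration, the check-out and check-in behavior of a team space might be sketched in Python as follows. The `TeamSpace` class and its methods are hypothetical names introduced only for this sketch, and branching, merging, and locking are omitted.

```python
import copy


class TeamSpace:
    """Toy versioned store: every check-in creates a new version."""

    def __init__(self, initial_model):
        self.versions = [copy.deepcopy(initial_model)]

    def check_out(self, version=None):
        """Return a private, editable copy of a version (latest by default)."""
        index = len(self.versions) - 1 if version is None else version
        return index, copy.deepcopy(self.versions[index])

    def check_in(self, model):
        """Store the edited model as a new version and return its number."""
        self.versions.append(copy.deepcopy(model))
        return len(self.versions) - 1


space = TeamSpace({"Customer": ["name"]})
base, working = space.check_out()
working["Customer"].append("age")  # edit the private copy only
new_version = space.check_in(working)
```

In this sketch the earlier version is never overwritten, which is the property that distinguishes the team space from an unversioned workspace.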

A user interface may also provide access to an event space, which is the metadata environment 6108 associated with operational properties and/or the operational repository described above.

The user environment 6108 may also be, or include, a federated user environment that provides a centralized, global environment for a number of repositories across an enterprise. The federated user environment may provide a common view of different repositories, or may represent each repository separately. The users 6110 may be, for example, human users interacting with the metadata environment 6108 through a graphical or command line interface, or a program or service accessing the metadata models in the repository, such as the discover data stage 302, the data preparation stage 304, or the data transformation stage 308 described above.

Figure 14 depicts a common repository 6202 storing a plurality of versions of metadata 6204. The metadata 6204 may be, for example, metadata for the views and hubs discussed above. The metadata database 6206 may be any of the data sources 102 described above. Each version of the metadata 6204 may provide a different, but related, version of metadata stored in a metadata database 6206. The versions of metadata 6204 may be created, for example, by a team of developers working on a data integration project, and compared using, among other things, the instances stored in the database 6206.

Figure 15 depicts a common repository 6302 containing a plurality of object versions 6304 characterizing metadata stored in a metadata database 6306, all as generally described above. A client 6308 may interact with an object version 6304 either directly or in one of the user environments described above, and may perform any of the design operations described generally above. This may include, for example, dynamic comparison of metadata models, drilldown, editing, testing, or any other appropriate functions. The client may also use the common repository 6302 and object versions 6304 to investigate underlying metadata in the metadata database 6306.

Figure 16A depicts a reconciliation of versioned metadata objects. The common repository 6402 and the versioned objects 6404 may be the common repository 6302 and versioned objects 6304 described above. Reconciliation of the versions may be desired at various points in a design cycle, and is typically required for release of an executable model. A reconciliation of the versioned objects 6404 into a single instance 6408 may be controlled through a reconciliation process 6406. A number of techniques are known, and may be used for automated, semi-automated, and manual reconciliation. In general, any such techniques may be employed with the systems described herein. The reconciliation process 6406 may advantageously retain a full version history and reconciliation lineage for the reconciled single instance 6408 to permit modifications to the reconciliation process 6406, to return to any previous unreconciled state, or to investigate source metadata and the lineage of reconciliation. Where direct conflicts in metadata are resolved during reconciliation, such as in a merge, previous attribute values may be recalled for use with alternate reconciliation of branches and various versions.

Fig. 16B depicts phased reconciliation across reconciliation zones. In order to manage complex reconciliation processes and maintain accurate reconciliation lineage for instances of metadata, reconciliation zones may be provided. Before discussing the reconciliation zones of Fig. 16B, some useful properties of a metadata instance are noted.

Each metadata instance in an enterprise may have an associated reconciliation zone property that defines an association of the instance with a reconciliation zone. The reconciliation zone may be selected by a designer of a reconciliation process to reflect, for example, an institutional separation of data such as human resources, accounting, finance, inventory, manufacturing, payroll, engineering, and so on. The reconciliation zone may be geographic at any degree of granularity suitable to the data and the enterprise, such as country, region, state, town, building, facility, and so on. The reconciliation zone may be historical or architectural so as to separate, for example, legacy systems from new systems, employee desktops from mainframes, consultants from employees, and so on. The reconciliation zone may reflect organization of a business into divisions or other sub-groups, such as consumer products, original equipment manufacturing, products, retail operations, e-commerce operations, and so on, or more generally, manufacturing and retail. Similarly, reconciliation zones may be provided for new business units acquired by a company or spun off from a company.

For each reconciliation zone, any number of reconciliation rules may be defined concerning precedence, exceptions, combinations, and so on. Techniques for reconciliation are well known, and all such techniques may be usefully employed for reconciling metadata instances in a reconciliation zone according to reconciliation rules. The reconciliation zone may further define a match type that defines how reconciliation results are propagated in models referencing the instance, such as no match (duplicates are deleted), view match (versions are retained at the view level), and/or extra-view match (versions are retained at the hub level).
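By way of non-limiting illustration, reconciliation of metadata instances within reconciliation zones according to precedence rules might be sketched in Python as follows. The zone names, the source names, the rule format, and the `reconcile` function are assumptions made only for this sketch; match types and exception rules are omitted.

```python
from collections import defaultdict

# Hypothetical per-zone rules: earlier sources take precedence over later ones.
ZONE_RULES = {
    "accounting": {"precedence": ["ledger", "spreadsheet"]},
    "hr": {"precedence": ["payroll", "desktop"]},
}

instances = [
    {"zone": "accounting", "id": "cost_center", "source": "spreadsheet", "type": "text"},
    {"zone": "accounting", "id": "cost_center", "source": "ledger", "type": "char(8)"},
    {"zone": "hr", "id": "cost_center", "source": "desktop", "type": "string"},
]


def reconcile(instances, zone_rules):
    """Reconcile instances that share an identifier within the same zone."""
    groups = defaultdict(list)
    for inst in instances:
        groups[(inst["zone"], inst["id"])].append(inst)

    reconciled = {}
    for (zone, ident), members in groups.items():
        order = zone_rules[zone]["precedence"]
        ranked = sorted(members, key=lambda m: order.index(m["source"]))
        # Keep the winner plus the full list of contributors as lineage.
        reconciled[(zone, ident)] = {
            "value": ranked[0],
            "lineage": [m["source"] for m in ranked],
        }
    return reconciled


result = reconcile(instances, ZONE_RULES)
```

In this sketch the two accounting instances are reconciled with each other, while the human resources instance bearing the same identifier is left alone because it belongs to a different zone.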

Each instance of an object may also have an identifier that uniquely identifies the object within a reconciliation zone. Each item can be described in terms of various contexts or hierarchies, such as to capture the semantic context of the items. The item may be an object, class, attribute, data item, data model, metadata model, model, definition, identity, structure, language, mapping, relationship, instance or other item or concept, including another semantic identifier. The semantic identifier may identify the item based on the item's attributes, the item's physical location, the relationship of the item with one or more other items, such as in a hierarchy, or the like. In some cases a relationship may be defined as the absence of some particular relationship. A relationship may be based on semantics. A relationship may involve the position of the item in a relational hierarchy. For example, an item may be identified based on its relationship with the other items to which it is related, and may be directly related to another item, indirectly related to another item, and/or indirectly related to another item through one or more other items. Relationships may be concatenated or recursively defined to permit dynamic, in addition to static, identifiers. For example, if a relationship between two items changes, a semantic identifier for another item that incorporates one of the two items would also incorporate the changed relationship between the two items.

As a more concrete example, an item Jim may be identified as Jim, residing at 111 Anyroad, Anytown, Anystate USA, with phone number 555-555-5555 and social security number 012-34-5678. Alternatively, Jim may be identified in terms of his relationships with others. Jim may be identified as the son of Betty, brother of Larry and Jeff, father of Jessica and nephew of Frank.

The semantic identifier may be a unique identifier for an item. In the example above, if there were only one Jim in the world who was the son of Betty, brother of Larry and Jeff, father of Jessica and nephew of Frank, this semantic identifier would be a unique identifier for Jim. It is possible that a unique semantic identifier to an item takes into account fewer than all of the relationships of that item with other items. If there were only one Jim in the world who was the son of Betty, brother of Larry and father of Jessica, the existence of these relationships alone would be enough to create a unique semantic identifier. Jim's relationships with Jeff and Frank would not need to be considered. It may be advantageous to create a semantic identifier that is based on the minimum number of relationships that ensure uniqueness. For example, if the semantic identifier was to be stored in a database 112 or processed by a data integration system 104, a less complex semantic identifier would require less space and would allow for faster processing.
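By way of non-limiting illustration, the selection of a minimum number of relationships that ensures uniqueness might be sketched in Python as follows. The relationship data and the `minimal_identifier` function are hypothetical, and the exhaustive search shown is an assumption chosen for clarity; it grows quickly with the number of relationships and is not suggested as an efficient technique.

```python
from itertools import combinations

# Hypothetical data: each item maps to a set of (relation, other item) pairs.
PEOPLE = {
    "jim_1": {("son_of", "Betty"), ("brother_of", "Larry"), ("father_of", "Jessica")},
    "jim_2": {("son_of", "Betty"), ("brother_of", "Larry"), ("father_of", "Anna")},
    "jim_3": {("son_of", "Carol"), ("father_of", "Jessica")},
}


def minimal_identifier(item, items):
    """Return a smallest set of relationships held by `item` and by no other item."""
    own = sorted(items[item])
    others = [rels for name, rels in items.items() if name != item]
    for size in range(1, len(own) + 1):
        for candidate in combinations(own, size):
            if not any(set(candidate) <= rels for rels in others):
                return candidate
    return None  # the item cannot be distinguished by its relationships alone


identifier = minimal_identifier("jim_1", PEOPLE)
```

In this sketch no single relationship distinguishes the first Jim, but the pair consisting of his brother and his daughter does, so the relationship to Betty need not be stored in the identifier.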

The number of relationships required to create a unique semantic identifier for an item may vary based on context. For example, a first item, item 1, may be distinguished from a second item, item 2, within a context, context A, by item 1's relationship with two additional items, item 3 and item 4. That is, in context A, the unique semantic identifier for item 1 may be that it is directly related to items 3 and 4, and indirectly related to any number of other items through items 3 and 4. In a different context, context B, item 1 may be uniquely identified by its relationship to item 3 (but perhaps not item 4), as well as its relationship to another item, item 5 and the absence of a relationship with item 6. Thus, in embodiments of the data integration methods and systems described herein, a semantic identifier for an item, such as an item related to a data integration job or a data integration platform, may be provided with a context-dependent identifier for the item. In embodiments such a context-dependent identifier may be stored in an atomic format, such as in a data repository. Contexts A and B may be two different imports, mappings, run versions, models, metabroker models, instances, tools, views, objects, classes, items, relationships, attributes, or any combination of any of the foregoing. A reconciliation or comparison facility may compare the value and/or syntax of the identity of an item in different imports, run versions, models, metabroker models, instances, tools and/or items and determine or assist with the determination of what action to take or refrain from taking based on the comparison. For example, a reconciliation engine may compare the model used by import instance A to the model used by metabroker B. Based on this comparison it may be decided that metabroker B can access the data and metadata of import instance A without transformation or modification, and the comparison facility may direct the metabroker B to proceed.

In another example, a tool A may be compared to a tool B, and it may be determined to perform a cross-tool object merge, wherein each tool can access the objects of the other tool. In embodiments the reconciliation facility may trigger a translation facility to assist the cross-tool object merge, such as establishing a bridge, metabroker, hub or the like for translating any objects that require translation, such as translation that is based on the different syntax for the handling of the identity of particular items in each respective tool, or based on other differences between the tools as determined by the comparison.

In embodiments a semantic identifier may be stored, maintained, recorded, processed and/or interpreted in a syntax that may be stored, maintained, recorded, processed and/or interpreted in a string structure or format. For example, the syntax may be column name::table name::database name. This syntax may be related, for example, to a semantic identifier that identifies a column of a table in a database. A string composed in this syntax may be age::employee::employee database. This string may be related, for example, to a semantic identifier that identifies the age of an employee in a particular employee database. The string corresponding to the semantic identifier for item 1 in context A (the example above) may be: direct relation to item 3::direct relation to item 4. The semantic identifier and corresponding string may also incorporate the lack of a direct relationship between item 1 and item 6, such as occurs in context B above.
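By way of non-limiting illustration, the string syntax described above might be composed and parsed as in the following minimal Python sketch. The function names `compose_identifier` and `parse_identifier`, and the decision to reject elements that contain the delimiter, are assumptions made for this example only and are not part of the disclosure.

```python
SEPARATOR = "::"  # delimiter used in the example syntax above


def compose_identifier(*elements):
    """Join relationship elements (most specific first) into one string."""
    for element in elements:
        if SEPARATOR in element:
            raise ValueError(f"element may not contain {SEPARATOR!r}: {element!r}")
    return SEPARATOR.join(elements)


def parse_identifier(identifier):
    """Split an identifier string back into its ordered elements."""
    return [part.strip() for part in identifier.split(SEPARATOR)]


# The column/table/database example from the text.
column_id = compose_identifier("age", "employee", "employee database")
print(column_id)
print(parse_identifier(column_id))
```

Keeping the elements in a fixed order (column, then table, then database) is what allows the string to be parsed back into the same relationships.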

A syntax string may be parsed. A syntax and/or string may be truncated, modified and/or the elements of a syntax and/or string may be re-ordered. A translation engine may perform the truncation, modification and/or re-ordering. It may be useful to truncate a syntax and/or string when not all of the relationships included in the syntax and/or string are required for the uniqueness of the semantic identifier. Suppose that in a given context for a syntax string all items were directly related to item 3; for example, item 3 was a database in which all the items were stored. The syntax string could be truncated, such as to create a string omitting a relationship involving item 3, and still remain a unique semantic identifier. Truncating a syntax and/or string may reduce storage requirements and increase processing efficiency. It may also be useful to change the order of the relationships in a syntax and/or string, for example, to reduce processing time for data integration processes. If the less common relationships are processed first, a system will likely need to access and process fewer relationships associated with an item in order to identify the item. For example, if very few items were related to item 3, even fewer related to item 4 and many items related to item 2, depending on the context, one syntax string may allow for the identification of item 9 in a shorter time than another syntax string. It could be that only certain elements of a syntax string are needed to uniquely identify an item in one context, while all elements of a syntax string are required in another context. A reconciliation engine may perform reconciliations on instances of metadata using the identity of metadata instances, as well as a reconciliation zone that defines rules for reconciliation and any match type specifications.
The reconciliation operation may employ semantic identifiers to uniquely identify instances within a reconciliation zone, and may translate or otherwise modify the format, language and/or data model of a semantic identifier for a reconciled instance in another reconciliation zone. A reconciliation operation may involve a reconciliation or mapping to or from one or more data tools, languages, formats and/or data models to or from at least one other data tool, language, format and/or data model. For example, a reconciliation operation may involve a reconciliation or mapping to, from or between known data integration tools, such as WebSphere DataStage 7 from IBM, WebSphere QualityStage from IBM, Business Objects tools, IBM DB2 Cube Views, UML 1.1, UML 1.3, ERStudio, IBM's WebSphere ProfileStage, PowerDesigner (with added support for Packages and Extended Attributes) and/or MicroStrategy tools. A reconciliation engine and/or reconciliation operation may optionally be embodied in a metabroker. A reconciliation operation may be performed, executed and/or conducted in batch, real-time and/or on a continuous basis. A reconciliation operation may be provided or made available as a service, for example, as part of a service-oriented architecture. Once a reconciliation operation exists for a semantic identifier, database 112, database 112 including one or more semantic identifiers, system of information, system of information including one or more semantic identifiers or other item, it can be reconciled to or from, mapped to, linked to, used with or associated with any other semantic identifier, database 112, database 112 including one or more semantic identifiers, system of information, system of information including one or more semantic identifiers or other item sharing at least one reconciliation operation. In embodiments, such as using an atomic data repository as a hub for a translation operation, the mapping of a reconciliation operation can, among other things, trace reconciliation from the execution of the operation backward and forward between an original semantic context and a translated semantic context. Depending on the context, the appropriate identifier for the data item may vary, such as by varying or truncating a syntax and/or string to enable more efficient storage or faster processing, or by varying the relationships used to form a unique identifier where the semantic context varies. Thus, a dynamic identifier may combine the benefits of retraceable reconciliation with the benefits of rapid processing, efficient data processing and effective operation in various contexts in which a data item is used.
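By way of non-limiting illustration, the truncation and re-ordering of identifier elements described above might be sketched in Python as follows. The representation of each identifier as a list of relationship strings, the function names `truncate_shared` and `order_by_rarity`, and the sample context are all assumptions made for this example.

```python
from collections import Counter


def truncate_shared(identifiers):
    """Drop elements that every identifier in the context shares."""
    if not identifiers:
        return identifiers
    shared = set.intersection(*(set(parts) for parts in identifiers.values()))
    truncated = {
        item: [p for p in parts if p not in shared]
        for item, parts in identifiers.items()
    }
    # Accept the truncation only if every identifier is still unique.
    keys = [tuple(parts) for parts in truncated.values()]
    if len(set(keys)) != len(keys):
        return identifiers
    return truncated


def order_by_rarity(identifiers):
    """Re-order each identifier so the least common elements come first."""
    frequency = Counter(p for parts in identifiers.values() for p in parts)
    return {
        item: sorted(parts, key=lambda p: (frequency[p], p))
        for item, parts in identifiers.items()
    }


# Hypothetical context in which every item is related to item 3.
context = {
    "item 1": ["rel:item 3", "rel:item 4"],
    "item 2": ["rel:item 3", "rel:item 5"],
    "item 7": ["rel:item 3", "rel:item 4", "rel:item 5"],
}
print(truncate_shared(context))
print(order_by_rarity(context))
```

In this sample the relationship to item 3 is shared by every item, so it contributes nothing to uniqueness and can be omitted; placing rarer relationships first means a lookup can rule out most candidates after examining fewer elements.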

Figure 16B depicts reconciliation zones. In general, a metadata object or item is uniquely identified within its own data constellation; however, a reconciliation process must also manage identity through a reconciliation process that may combine different instances of an object from different sources. A number of reconciliation zones 6450-6458 may be defined for metadata from a number of sources. For example, the reconciliation zones 6450-6454 on the left side of Fig. 16B may be source data from various elements of an enterprise, such as departments within a corporation or discrete databases. Using the techniques described above, reconciliation zones, rules, match types, and identifiers may be defined for each metadata instance in each of these source reconciliation zones 6450-6454. According to the reconciliation rules, a reconciliation engine may reconcile data from two reconciliation zones (e.g., zones 6450 and 6452) into a new reconciliation zone 6456 in which each item is uniquely identified, and represents a reconciled version of metadata instances from the source reconciliation zones. This new reconciliation zone 6456 may in turn be reconciled with one or more other reconciliation zones (e.g., zone 6454) to provide another reconciliation zone 6458 representing a full reconciliation of metadata instances within the enterprise. At the same time, any reconciliation zone may have one or more reconciliation zones between itself and a source of data to more finely reconcile metadata from one or more sources before introducing it to a particular reconciliation zone. Thus, the pattern of Fig. 16B may be repeated, altered, and/or expanded in any manner to achieve any arbitrary pattern or flow for reconciliation of metadata instances.

As a concrete example, the first reconciliation zone 6450 may represent metadata for human resources that may include starting salaries for all new hires. The second reconciliation zone 6452 may represent payroll data that includes weekly pay information for all employees. These reconciliation zones may be reconciled into a new reconciliation zone 6456 by a user such as someone in a company accounting department to track salary information. The metadata within this reconciliation zone 6456 may be analyzed for accuracy and consistency, and may be modified until a satisfactory reconciliation is obtained. Another reconciliation zone 6454 may represent metadata for a corporate financial database. The financial database may include full financial data for the corporation, including metadata for salary costs of the corporation. This data may be characterized as having high quality, and may be audited or otherwise used in other areas of the corporation. The reconciliation rules may be designed with deference to any information about data quality, such as where one data source represents a compilation prepared by an outside contractor known to have low quality assurance standards, while another data source represents data entry from well-trained and supervised employees within the company. The metadata from this reconciliation zone 6454 may be reconciled with salary metadata from another reconciliation zone 6456 in another reconciliation zone 6458 that contains metadata representing a fully integrated view of employee salaries within the corporation. To further expand the example, all of the reconciliation zones 6450-6458 of Fig. 16B may be specific to a corporate division, and may be further integrated with integrated reconciliation zones from other corporate divisions, or from corporate acquisitions.
Similarly, data from different corporate departments, geographic locations, subsidiaries, functional business units, and so on, may be progressively integrated using the phased reconciliation described above.
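By way of non-limiting illustration, the phased reconciliation of zones 6450-6458 might be sketched in Python as follows. The `Instance` and `Zone` classes, the integer quality scale, the rule that the higher-quality instance prevails, and the sample salary figures are all assumptions made for this example; an actual reconciliation engine may apply any rules and match types.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    identity: str   # semantic identifier within the zone
    value: object
    quality: int    # higher means more trusted (assumed scale)
    lineage: list = field(default_factory=list)  # zones traversed, oldest first


@dataclass
class Zone:
    name: str
    instances: dict = field(default_factory=dict)  # identity -> Instance

    def add(self, identity, value, quality):
        self.instances[identity] = Instance(identity, value, quality, [self.name])


def reconcile(name, *sources):
    """Reconcile source zones into a new zone, preferring higher quality."""
    target = Zone(name)
    for zone in sources:
        for identity, incoming in zone.instances.items():
            current = target.instances.get(identity)
            if current is None:
                value, quality = incoming.value, incoming.quality
                lineage = list(incoming.lineage)
            else:
                best = incoming if incoming.quality > current.quality else current
                value, quality = best.value, best.quality
                # Keep every contributing zone so each step can be audited.
                lineage = current.lineage[:-1] + [
                    z for z in incoming.lineage if z not in current.lineage
                ]
            target.instances[identity] = Instance(
                identity, value, quality, lineage + [name]
            )
    return target


hr = Zone("zone 6450")
hr.add("emp-1::salary", 52000, quality=1)
payroll = Zone("zone 6452")
payroll.add("emp-1::salary", 52500, quality=2)
finance = Zone("zone 6454")
finance.add("emp-1::salary", 53000, quality=3)

salary = reconcile("zone 6456", hr, payroll)     # first phase
full = reconcile("zone 6458", salary, finance)   # second phase
print(full.instances["emp-1::salary"])
```

Because each reconciled instance carries the list of zones it passed through, a user can walk from the fully reconciled zone back to each original source.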

It will be appreciated that a number of significant advantages flow from the phased reconciliation process described above. One advantage is the retention of reconciliation lineage. In a complex data integration environment, there are likely to be multiple sources of metadata including hierarchical files, flat files, federated data sources, and so on. In this environment, it may be important both to track the process of integration and reconciliation, and to maintain the ability to reverse reconciliation steps along the way. The reconciliation lineage provided by the techniques described above permits full auditing, inspection, and modification of reconciliation lineage and provides, for each metadata instance in the fully reconciled model, a defined path to an original data source. As another advantage, phased reconciliation provides visibility into sources of data in an integrated model.

For example, the fully integrated reconciliation zone 6458 of Fig. 16B may be used by analysts or managers as a metadata model for business analytical tools. Prior to forming a business decision based upon the analytical tools, it may be helpful, or even essential, to examine the sources of data and quality thereof. As another example, a business decision may require a particular view of data. The street name of an address may be critical to an in-person marketing campaign, while the zip code may be important for a mailing campaign. Different data sources may carry the relevant information at different levels of detail, and with different levels of accuracy. The reconciliation process may be inspected, and modified as appropriate, to express the best view of the desired metadata in an analytical tool for designing a marketing campaign. Continuing with this example, one data source may define addresses with very fine detail and good accuracy, but be updated only infrequently, for example, bi-annually, or intermittently as information is received. Another data source may contain very up-to-date information (such as phone listings) that includes street addresses but no zip codes. By examining reconciliation lineage for an integrated enterprise data model, a manager may realize that the only instances of zip codes are likely out of date, and redesign the reconciliation process (or optionally, the integrated metadata model itself) to synthesize up-to-date zip codes from street addresses.

As still another advantage, the phased reconciliation provides an ability to propagate reconciliations and modifications upstream from integrated views toward data sources. This may ultimately improve the data structure and quality of metadata and data from original data sources within an enterprise.

The general approach above may have particular utility in highly heterogeneous data environments. For example, in a complex corporate environment, a number of discrete groups such as manufacturing, accounting, human resources, and engineering, may each maintain a separate data silo with a broad array of databases specific to that group. In this environment, data integration may be usefully employed to integrate separate databases in a manner that permits improved business intelligence. Integration may be vertical within a group, such as by integrating databases into a comprehensive metadata model for the group, or the integration may be horizontal across groups, such as by integrating payroll from each group into a comprehensive payroll metadata model. Full, corporate-wide data integration may include alternating steps of integrating within a group and integrating across groups.

Figure 17 depicts reconciliation of versioned metadata objects. The common repository 6502, versioned objects 6504, reconciliation process 6506, and reconciled single object instance 6508 may all be as described with reference to the figures above. In addition, each object version 6504 and the single instance 6508 refer to metadata stored in a metadata database 6510. It will be appreciated that the metadata in the metadata database 6510 may change, due to either changes independent of the models (e.g., where a company wishes to track a new, additional characteristic of inventory, or under the influence of some data integration job), or changes to the metadata (e.g., a five day moving average of some number is added to the model for business analytic purposes). Thus it may be desirable to partition the metadata database 6510 along with versions of the metadata objects, or otherwise retain a history of changes to the metadata. With this additional information, a user has complete flexibility in moving backward and forward through revisions to metadata.

Figure 18 shows an example of the use of concurrency in a metadata process. In this example, a plurality of metadata instances 6602 are reconciled in a reconciliation process 6604. During reconciliation, a significant amount of metadata may need to be merged or overwritten to create a reconciled version of the metadata model. The process may be improved by structuring the reconciliation process 6604 as independent process objects that may be streamed to individual processors 6606 for independent or pipelined execution. The independent process objects may be streamed to a single hardware device 6608 that contains the plurality of processors 6606, or may be streamed to different hardware devices 6610, 6608, or may be streamed to any other processors or groups of processors available through a network.
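By way of non-limiting illustration, the streaming of independent process objects to a plurality of processors might be sketched in Python as follows. The sketch uses the standard-library `concurrent.futures.ThreadPoolExecutor`; the function names, the last-value-wins rule, and the assumption that no identity appears in more than one chunk are made for this example only. A process pool, or processors reached over a network, could be substituted for the thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def reconcile_chunk(chunk):
    """Reconcile one self-contained group of (identity, value) pairs.

    Later values overwrite earlier ones, standing in for real rules.
    """
    merged = {}
    for identity, value in chunk:
        merged[identity] = value
    return merged


def reconcile_concurrently(chunks, workers=4):
    """Send each independent chunk to a worker and combine the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order the chunks were given.
        results = list(pool.map(reconcile_chunk, chunks))
    combined = {}
    for partial in results:
        combined.update(partial)
    return combined


chunks = [
    [("customer::name", "v1"), ("customer::name", "v2")],
    [("order::total", "v1")],
]
print(reconcile_concurrently(chunks))
```

The approach depends on the chunks being independent: because no chunk reads or writes an identity owned by another chunk, the partial results can be combined without further conflict handling.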

Concurrency and the related concept of parallel processing are well-known in the art, and need not be described in detail here. In general, concurrency and parallelism are appropriate where a process can be broken into "chunks" of primarily self-referential clusters of objects, also known as sub-graphs (referring to a directed graph of dependencies for objects), that can be processed independently or in a pipeline. A reconciliation process may be readily modeled as a pipeline for concurrent execution. For example, the process may include a task for assigning an identity to a stream of objects from a new metadata source, a task for fetching potential conflict candidates from a previous metadata source, a task for reconciling, a task for merging the reconciliation results into an output set of metadata objects, and a task for storing the merged metadata objects. Other metadata processes may also be suitable for concurrency, such as a metadata import.
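By way of non-limiting illustration, the five pipeline tasks listed above (assigning identity, fetching conflict candidates, reconciling, merging, and storing) might be sketched in Python as stages connected by queues, each stage running on its own thread. The dictionary layout of the objects, the rule that attributes from the new source prevail, and all function names are assumptions made for this example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Start a thread that applies func to each item from inbox."""
    def run():
        while True:
            item = inbox.get()
            if item is DONE:
                outbox.put(DONE)  # pass the end marker downstream
                return
            outbox.put(func(item))
    thread = threading.Thread(target=run)
    thread.start()
    return thread


def run_pipeline(new_objects, previous):
    store = {}  # written only by the final stage

    def assign_identity(obj):
        return {"identity": f"{obj['name']}::{obj['table']}", "new": obj}

    def fetch_candidate(item):
        item["old"] = previous.get(item["identity"])
        return item

    def reconcile(item):
        # Assumed rule: attributes from the new source win.
        item["result"] = {**(item["old"] or {}), **item["new"]}
        return item

    def merge(item):
        return item["identity"], item["result"]

    def save(pair):
        store[pair[0]] = pair[1]
        return pair

    funcs = [assign_identity, fetch_candidate, reconcile, merge, save]
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [stage(f, queues[i], queues[i + 1]) for i, f in enumerate(funcs)]
    for obj in new_objects:
        queues[0].put(obj)
    queues[0].put(DONE)
    for thread in threads:
        thread.join()
    return store


previous = {"age::employee": {"name": "age", "table": "employee", "type": "int"}}
incoming = [
    {"name": "age", "table": "employee", "nullable": False},
    {"name": "salary", "table": "employee"},
]
result = run_pipeline(incoming, previous)
print(result)
```

While one object is being reconciled, the next can already be having its identity assigned, which is the benefit of the pipelined arrangement.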

The following figures describe several methods associated with metadata management. It will be appreciated that these processes may be realized in hardware, software, or some combination of these. The processes may be realized in one or more microprocessors, microcontrollers, embedded microcontrollers, programmable digital signal processors or other programmable devices, along with internal and/or external memory for storing program instructions, program data, and program output or other intermediate or final results. It will further be appreciated that these processes may be realized as computer executable code created using a structured programming language such as C, an object oriented programming language such as C++ or Java, or any other high-level or low-level programming language that may be compiled or interpreted to run on any uniform or heterogeneous group of hardware and software platforms including a computer or computers, networks of computers, and combinations thereof. The processes may also employ a wide variety of tools, platforms and architectures to achieve a scalable enterprise metadata management system. While specific examples of software platforms are provided above, other platforms and technologies exist and may be usefully employed with the systems described herein.

Figure 19 is a diagram of entities involved in a query process from a user interface 6702 to a metadata database 6712. The query may begin at a user interface 6702, where a user prepares a query in a native syntax of the user interface. The query may be passed to a metadata model 6704, such as a view. The query may in turn be translated by a translation engine 6708 or by application of mapping information describing the mapping between the first metadata model 6704, such as a view, and a second metadata model 6710, such as a hub. The hub 6710 may pass the translated query to the database 6712 using an additional translation or mapping step to convert the hub-based query into a query in a native syntax of the database 6712. The results may be passed through the various entities and any appropriate translations to the user interface 6702 that originally issued the query.
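By way of non-limiting illustration, the two translation steps of Fig. 19 might be sketched in Python as follows. The mapping tables `VIEW_TO_HUB` and `HUB_TO_DB`, the property names they contain, the dictionary form of a query, and the restriction to a single table are all hypothetical and serve only to show a query being rewritten through mapping information.

```python
# Hypothetical mapping information: view terms -> hub terms -> database columns.
VIEW_TO_HUB = {
    "EmployeeName": "Person.name",
    "EmployeeAge": "Person.age",
}
HUB_TO_DB = {
    "Person.name": ("employee", "full_name"),
    "Person.age": ("employee", "age"),
}


def view_to_hub(query):
    """Translate a query expressed in view terms into hub terms."""
    return {"select": [VIEW_TO_HUB[name] for name in query["select"]]}


def hub_to_sql(query):
    """Translate a hub query into the native syntax of the database."""
    tables = {HUB_TO_DB[name][0] for name in query["select"]}
    if len(tables) != 1:
        raise ValueError("this sketch handles single-table queries only")
    columns = ", ".join(HUB_TO_DB[name][1] for name in query["select"])
    return f"SELECT {columns} FROM {tables.pop()}"


view_query = {"select": ["EmployeeName", "EmployeeAge"]}
hub_query = view_to_hub(view_query)
print(hub_query)
print(hub_to_sql(hub_query))
```

Because each step consults only its own mapping table, a view can be added or changed by editing `VIEW_TO_HUB` without touching the hub-to-database translation.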

Figure 20 shows the entities involved in a process of extending a metadata database from a metadata model. A user may add an attribute or the like to a view 6802 using an appropriate editing interface. A translation engine may be updated for translating metadata between the view model 6802 and a hub 6804. Although the hub 6804 of a hub-and-spoke model is generally maintained in a consistent form, the hub 6804 may also be updated, depending on the nature of the change to the view 6802 and the reasons therefor. Additionally, a translation engine may be updated for translation between the hub and the database 6808. The data model 6804 and/or translation engine may also add appropriate rows, columns, or tables to the database 6808 using appropriate, database-specific commands, as appropriate to reflect the new view 6802 within the database 6808. If changes are made to the database 6808, these changes may be pushed back through the model chain up to the view 6802.
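By way of non-limiting illustration, the propagation of a new view attribute through the hub to the database might be sketched in Python using the standard-library `sqlite3` module. The `ModelChain` class, the naming rule that derives hub and column names by lower-casing the view name, the permitted column types, and the `inventory` table are assumptions made for this example; SQLite's `ALTER TABLE ... ADD COLUMN` stands in for whatever database-specific command applies.

```python
import sqlite3

ALLOWED_TYPES = {"INTEGER", "REAL", "TEXT"}


class ModelChain:
    """View -> hub -> database, with translation data for each hop."""

    def __init__(self, conn, table):
        if not table.isidentifier():
            raise ValueError(f"unsafe table name: {table!r}")
        self.conn = conn
        self.table = table
        self.view = []            # attribute names visible in the view
        self.hub = []             # attribute names in the hub model
        self.view_to_hub = {}     # translation data: view name -> hub name
        self.hub_to_column = {}   # translation data: hub name -> column name

    def add_view_attribute(self, name, sql_type):
        hub_name = name.lower()   # assumed naming rule for this sketch
        column = hub_name
        # Identifiers cannot be bound as SQL parameters, so validate them.
        if not column.isidentifier() or sql_type not in ALLOWED_TYPES:
            raise ValueError(f"unsafe column definition: {name!r} {sql_type!r}")
        self.view.append(name)
        self.hub.append(hub_name)
        self.view_to_hub[name] = hub_name
        self.hub_to_column[hub_name] = column
        self.conn.execute(
            f"ALTER TABLE {self.table} ADD COLUMN {column} {sql_type}"
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT)")
chain = ModelChain(conn, "inventory")
chain.add_view_attribute("ShelfLife", "INTEGER")
columns = [row[1] for row in conn.execute("PRAGMA table_info(inventory)")]
print(columns)
```

The sketch updates the view, the hub, and both sets of translation data before issuing the database command, so that the new attribute is reachable through the whole model chain.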

Figure 21 shows the entities involved in a process for accessing a repository 6910 from a tool 6902. The tool 6902 may be a third-party tool communicating in terms of a view 6904. The tool 6902 may request mapped metadata through the view 6904, which may be translated by a translation engine into a form for the hub 6908. The hub may further translate the request through another translation engine for physical access to the mapped metadata in the repository 6910. Thus the request may reach the repository through a series of query transformations. The result, one or more metadata objects, may in turn be passed through a number of translation or transformation engines as it moves from the repository 6910 to the hub 6908, to the view 6904, and finally to the requesting tool 6902. Thus there is in one aspect a method for accessing a repository from an external tool that may include transforming a query through one or more models to a repository, and providing one or more objects, such as mapped metadata, through one or more object transformations from the repository 6910 to the tool 6902. Advantageously, this method presents a query to the repository 6910 in a native syntax for the repository, while presenting the results to an external tool 6902 in a native syntax for the tool 6902.

Figure 22 shows the entities involved in a process by which a tool accesses versioned and unversioned metadata models. The tool 7002 may communicate with a user environment 7004, which may be, for example, an event user environment, team user environment, or work user environment as described above. The user environment 7004 may be implemented as a Java space, or any other framework or platform suitable for use with metadata tools. Depending on the type of user environment 7004 and the nature of the tool 7002 and the operations being performed in the user environment 7004, the user environment may communicate with either an unversioned model 7008, i.e., the operational classes and attributes in the operational repository, or a versioned model 7010, i.e., the design classes and attributes in the common repository. The metadata model visible in the user environment may be edited and written back to the versioned model 7010 as either a replacement to an existing version or as a new version of the metadata model. It will be appreciated that the tool 7002 may be prevented from checking out versioned metadata 7010 in the common repository if that metadata 7010 is already checked out to another tool or user.
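By way of non-limiting illustration, the check-out behavior of the versioned model described above might be sketched in Python as follows. The `CommonRepository` class, its method names, and the use of `PermissionError` to signal a blocked check-out are assumptions made for this example.

```python
class CommonRepository:
    """Holds versioned design models and tracks who has each checked out."""

    def __init__(self):
        self.versions = {}      # model name -> list of versions, oldest first
        self.checked_out = {}   # model name -> current holder

    def register(self, name, model):
        self.versions[name] = [dict(model)]

    def check_out(self, name, holder):
        owner = self.checked_out.get(name)
        if owner is not None and owner != holder:
            raise PermissionError(f"{name} is checked out to {owner}")
        self.checked_out[name] = holder
        return dict(self.versions[name][-1])  # editable copy of latest version

    def check_in(self, name, holder, model, new_version=True):
        if self.checked_out.get(name) != holder:
            raise PermissionError(f"{name} is not checked out to {holder}")
        if new_version:
            self.versions[name].append(dict(model))   # write as a new version
        else:
            self.versions[name][-1] = dict(model)     # replace existing version
        del self.checked_out[name]


repo = CommonRepository()
repo.register("customer", {"name": "string"})

draft = repo.check_out("customer", "tool A")
blocked = False
try:
    repo.check_out("customer", "tool B")   # already held by tool A
except PermissionError:
    blocked = True

draft["age"] = "integer"
repo.check_in("customer", "tool A", draft)
print(blocked, repo.versions["customer"])
```

The `new_version` flag mirrors the two write-back options in the text: the edited model either replaces an existing version or is stored as a new one.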

Figure 23 shows the entities involved in a process by which a user interface accesses multiple versions of metadata in a common repository. The user interface 7102 may issue a request to a common repository 7104 to access one or more versions 7108 of metadata, and may further query the common repository 7104 concerning other versions of the metadata and the nature and chronology of changes between the various versions.

Figure 24 shows the entities involved in a reconciliation process for versions of metadata. A version 7202 may be reconciled with another version 7204 through a reconciliation process 7212. A similar reconciliation may be performed on two or more additional versions 7208, 7210 with additional reconciliation processes 7214, 7218. After each reconciliation or after all of the reconciliations, the reconciled versions may be merged into a new version of the metadata reflecting changes from previous versions. This reconciliation may be performed in phases, or all at once, and user control may optionally be exercised over reconciliation of conflicts, order of reconciliation, and so on.
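By way of non-limiting illustration, the pairwise reconciliation of versions shown in Fig. 24 might be sketched in Python as follows. Each version is represented as a dictionary of properties; the `resolve` callback stands in for user control over conflicts. The function names, the sample properties, and the rule that the later version prevails are assumptions made for this example.

```python
def reconcile_versions(left, right, resolve):
    """Merge two versions; resolve() decides any conflicting property."""
    merged = dict(left)
    for key, value in right.items():
        if key in merged and merged[key] != value:
            merged[key] = resolve(key, merged[key], value)
        else:
            merged[key] = value
    return merged


def prefer_later(key, earlier, later):
    """Assumed conflict rule: the later version wins."""
    return later


# Hypothetical contents for the versions named in Fig. 24.
v7202 = {"customer.name": "string", "customer.age": "integer"}
v7204 = {"customer.name": "string", "customer.age": "smallint"}
v7208 = {"order.total": "decimal"}
v7210 = {"order.total": "decimal", "order.currency": "string"}

first = reconcile_versions(v7202, v7204, prefer_later)    # process 7212
second = reconcile_versions(v7208, v7210, prefer_later)   # a further process
new_version = reconcile_versions(first, second, prefer_later)
print(new_version)
```

Supplying a different `resolve` function, for example one that prompts a user, changes how conflicts are settled without changing the merge itself.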

Figure 25 shows the entities involved in a reconciliation process using concurrency. The reconciliation process may be the reconciliation process as described above, except that each discrete reconciliation may be independently passed to a plurality of processors 7304, which may be in a cluster 7302 or physically remote from one another, and executed in a pipelined or parallel fashion, depending on the nature of dependencies between each reconciliation phase.

While the invention has been described in connection with certain preferred embodiments, it should be understood that other embodiments would be recognized by one of ordinary skill in the art, and are intended to fall within the scope of this disclosure.

Claims

CLAIMS

What is claimed is:

1. A method comprising: registering a metadata model with a repository; associating a first storage mechanism with one or more design properties of the metadata model; and associating a second storage mechanism with one or more operational properties of the metadata model, wherein the second storage mechanism stores a time stamp for at least one of the one or more operational properties of the metadata model.

2. The method of claim 1 wherein the first storage mechanism is a versioned storage mechanism that stores one or more versions of at least one of the one or more design properties of the metadata model.

3. The method of claim 1 further comprising annotating the one or more design properties and the one or more operational properties of the metadata model to associate them with either the first storage mechanism or second storage mechanism.

4. The method of claim 1 further comprising providing a package structure to allocate the one or more design properties and the one or more operational properties of the metadata model between the first storage mechanism and the second storage mechanism.

5. The method of claim 1 further comprising providing a manifest associated with the metadata model to allocate the one or more design properties and the one or more operational properties of the metadata model between the first storage mechanism and the second storage mechanism.

6. The method of claim 1 further comprising registering the operational properties as a first model and registering the design properties as a second model.

7. The method of claim 1 wherein the metadata model can be queried across the one or more operational properties and the one or more design properties.

8. The method of claim 1 further comprising registering one or more mappings with the metadata model, the one or more mappings describing a relationship of the metadata model to one or more other metadata models.

9. A system comprising: a repository including a registered metadata model; a first storage mechanism within the repository, the first storage mechanism associated with one or more design properties of the metadata model; and a second storage mechanism within the repository, the second storage mechanism associated with one or more operational properties of the metadata model, the second storage mechanism adapted to store a time stamp for at least one of the one or more operational properties of the metadata model.

10. The system of claim 9 wherein the first storage mechanism is a versioned storage mechanism that stores one or more versions of at least one of the one or more design properties of the metadata model.

11. The system of claim 9 further comprising annotations to associate the one or more design properties of the metadata model and the one or more operational properties of the metadata model with either the first storage mechanism or second storage mechanism.

12. The system of claim 9 further comprising using a package structure to allocate the one or more design properties and the one or more operational properties of the metadata model between the first storage mechanism and the second storage mechanism.

13. The system of claim 9 further comprising a manifest associated with the metadata model to allocate the one or more design properties and the one or more operational properties of the metadata model between the first storage mechanism and the second storage mechanism.

14. The system of claim 9 wherein the operational properties are registered as a first model and the design properties are registered as a second model.

15. The system of claim 9 wherein the metadata model can be queried across the one or more operational properties and the one or more design properties.

16. The system of claim 9 further comprising one or more mappings registered with the metadata model, the one or more mappings describing a relationship of the metadata model to one or more other metadata models.

17. A computer program product comprising a computer useable medium including computer readable program code, wherein the computer readable program code when executed on one or more computers causes the one or more computers to: register a metadata model with a repository; associate a first storage mechanism with one or more design properties of the metadata model; and associate a second storage mechanism with one or more operational properties of the metadata model, wherein the second storage mechanism stores a time stamp for at least one of the one or more operational properties of the metadata model.

18. A method of managing metadata comprising: organizing an object-oriented metadata model into an operational model that includes operational properties and a design model that includes design properties; storing the operational model in an operational repository; and storing the design model in a common repository.

19. The method of claim 18 further comprising time-stamping at least one item of metadata for the operational model.

20. The method of claim 18 wherein the common repository supports more than one version of the design model.

21. The method of claim 18 further comprising providing a user environment for user interaction with the metadata model.

22. The method of claim 21 wherein the user environment includes a workspace for editing the model.

23. The method of claim 22 wherein the workspace is exclusive to a user.

24. The method of claim 21 wherein the workspace supports versioning of metadata instances.

25. The method of claim 18 further comprising dynamically comparing one or more different versions of the design model in the common repository.

26. The method of claim 18 wherein the common repository supports branching of versions of the design model.

27. The method of claim 18 further comprising reconciling a plurality of versions of the design model.

28. The method of claim 18 further comprising using the metadata model in a metadata service by asynchronously calling the metadata model through a message-oriented service.

29. The method of claim 18 further comprising concurrently executing a service that uses the metadata model.

30. A system for managing metadata comprising: an object-oriented metadata model including an operational model having one or more operational properties of the metadata model and a design model having one or more design properties of the metadata model; an operational repository that stores the operational model; and a common repository that stores the design model.

31. A method comprising: expressing a query in terms native to a first model; translating the query into terms native to a second model using mapping information that describes one or more relationships between the first model and the second model; and translating the query into a native data source format.

32. The method of claim 31 wherein the mapping information can be queried.

33. The method of claim 31 wherein the first model is a view and the second model is a hub.

34. The method of claim 31 wherein the method is performed in an enterprise computing system.

35. The method of claim 31 wherein the method is performed in a data integration system.

36. A computer program product comprising a computer useable medium including computer readable program code, wherein the computer readable program code when executed on one or more computers causes the one or more computers to: register a first model; identify a second model and a mapping of at least one property of the first model to the second model; and persist the mapping of the at least one property of the first model to the second model.

37. The computer program product of claim 36 further comprising: identifying at least one other property of the first model not mapped to the second model; and persisting the at least one other property of the first model.

38. The computer program product of claim 36 wherein the first model comprises a plurality of classes.

39. The computer program product of claim 36 wherein the second model comprises a plurality of classes.

40. The computer program product of claim 36 wherein the computer readable program code when executed on one or more computers causes the one or more computers to provide a storage mechanism for persisting the mapping of the at least one property of the first model to the second model that is a reflective storage mechanism.

41. The computer program product of claim 36 wherein the computer readable program code when executed on one or more computers causes the one or more computers to define a schema for representing metadata models in a relational database, and using the schema to persist the mapping of the at least one property of the first model to the second model.

42. The computer program product of claim 41 wherein the computer readable program code when executed on one or more computers causes the one or more computers to revise the first model by changing the schema.

43. The computer program product of claim 41 wherein the computer readable program code when executed on one or more computers causes the one or more computers to revise the first model by changing one or more properties in the relational database.

44. The computer program product of claim 36 wherein the computer readable program code when executed on one or more computers causes the one or more computers to revise the first model by changing the mapping.

45. The computer program product of claim 36 wherein the first model and the second model are metadata models.